Re: [PR] feat(planner): Allowing setting sort order of parquet files without specifying the schema [datafusion]

via GitHub Mon, 16 Sep 2024 11:09:34 -0700


alamb commented on code in PR #12466:
URL: https://github.com/apache/datafusion/pull/12466#discussion_r1761626074



##########
datafusion/sql/src/statement.rs:
##########
@@ -1028,8 +1030,26 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
             .into_iter()
             .collect();
 
-        let schema = self.build_schema(columns)?;
-        let df_schema = schema.to_dfschema_ref()?;
+        let df_schema = match file_type.as_str() {

Review Comment:
   I am sorry for the delayed feedback @devanbenz -- I swear I typed this 
feedback but I must not have clicked "submit"
   
   Basically my concerns about this approach are twofold:
   1. This code assumes the parquet file is on the local filesystem (whereas 
for many systems it may be on remote object storage)
   2. It also makes SQL parsing depend on the parquet format. Since 
`parquet` has quite a few dependencies, this new dependency is likely not ideal 
for systems that use DataFusion only for SQL parsing (dask-sql, for 
example)
   
   
   Perhaps you could delay the creation of the ORDER BY until the table 
provider is resolved? 
   
   The table provider: 
https://github.com/apache/datafusion/blob/2521043ddcb3895a2010b8e328f3fa10f77fc094/datafusion/expr/src/planner.rs#L35-L34
   
   Once the table provider is resolved, the table's schema can be known
   
   Another benefit of this approach is that it would work for all formats, not 
just parquet
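
   A minimal sketch of the idea, not DataFusion's actual API: the SQL layer 
keeps the ORDER BY columns as plain names, and they are only checked against 
the schema once something that knows the schema (the table provider) is 
available. Every name below (`UnresolvedSortKey`, `ResolvedSortKey`, 
`SchemaSource`, `InMemorySource`, `resolve_sort_keys`) is hypothetical and 
exists only for illustration.

```rust
/// Hypothetical: a sort key exactly as written in SQL. It carries only a
/// column name and has not been checked against any schema yet.
#[derive(Debug, Clone, PartialEq)]
pub struct UnresolvedSortKey {
    pub column: String,
    pub ascending: bool,
}

/// Hypothetical: a sort key after the schema is known.
#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedSortKey {
    pub column_index: usize,
    pub ascending: bool,
}

/// Hypothetical stand-in for "whatever knows the schema", for example a
/// resolved table provider. It is format-agnostic: parquet, CSV, and so on
/// would all implement it the same way.
pub trait SchemaSource {
    fn column_names(&self) -> Vec<String>;
}

/// Hypothetical source used only to exercise the sketch.
pub struct InMemorySource {
    pub columns: Vec<String>,
}

impl SchemaSource for InMemorySource {
    fn column_names(&self) -> Vec<String> {
        self.columns.clone()
    }
}

/// Resolve the deferred ORDER BY columns once the schema is available.
/// Returns an error naming the first column that is not in the schema.
pub fn resolve_sort_keys(
    keys: &[UnresolvedSortKey],
    source: &dyn SchemaSource,
) -> Result<Vec<ResolvedSortKey>, String> {
    let names = source.column_names();
    keys.iter()
        .map(|key| {
            names
                .iter()
                .position(|name| name == &key.column)
                .map(|column_index| ResolvedSortKey {
                    column_index,
                    ascending: key.ascending,
                })
                .ok_or_else(|| {
                    format!(
                        "ORDER BY column '{}' not found in table schema",
                        key.column
                    )
                })
        })
        .collect()
}
```

   With this split, the parsing step never opens a file or links against a 
file format; only the resolution step needs the schema, wherever it comes 
from.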




