Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-16 Thread via GitHub


alamb commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2976163501

   > Regarding the location of the code, if it is in datafusion proper rather 
than the CLI, it would be available in datafusion python, and any other 
projects that want to offer functionality backed by datafusion. I think it 
increases the utility of datafusion as a library and will get used.
   
   That is an interesting idea -- I agree that having `CREATE EXTERNAL TABLE` 
support this kind of URL / multiple files would be useful.
   
   I am not sure I fully understand the ramifications either -- if we simply 
update the SQL planner (SqlToRel) to split the URL list on `' '` or `', '` that 
certainly seems straightforward to me (and would be backwards compatible...)
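
   As a rough illustration of the splitting idea, here is a minimal sketch in 
plain Rust with no DataFusion dependency. The helper name `split_locations` is 
hypothetical, and how `SqlToRel` would call it is not shown. It assumes that an 
individual location never contains a space or a comma; a path that does would 
be split as well.

```rust
/// Split a `LOCATION` string into individual locations.
///
/// Hypothetical helper, not part of DataFusion. Commas and whitespace are
/// both treated as separators, so `"a, b"` and `"a b"` give the same result.
/// Assumption: a single location never contains a space or a comma.
fn split_locations(location: &str) -> Vec<String> {
    location
        .split(|c: char| c == ',' || c.is_whitespace())
        // Adjacent separators (e.g. ", ") produce empty pieces; drop them.
        .filter(|part| !part.is_empty())
        .map(String::from)
        .collect()
}
```

   With this sketch a string holding a single location comes back as a 
one-element list, so existing single-location statements would keep their 
current meaning under the stated assumption.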


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-13 Thread via GitHub


robtandy commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2970581452

   Thanks for creating this issue @alamb !!   
   
   Regarding the location of the code, if it is in datafusion proper rather 
than the CLI, it would be available in datafusion python, and any other 
projects that want to offer functionality backed by datafusion.  I think it 
increases the utility of datafusion as a library and will get used.  
   
   Could this be a configuration option that controls whether it is enabled, 
like how `datafusion.catalog.information_schema` enables the info schema in 
the `SessionState`? I do understand that it will be more code to maintain, 
but my intuition is that this is generally useful enough to offer within the 
core, as I think it will provide value. It's possible, though, that I don't 
fully appreciate the ramifications of this choice.
   
   Curious what people think about this.





Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-13 Thread via GitHub


alamb commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2969936296

   > Hi [@comphead](https://github.com/comphead) and 
[@alamb](https://github.com/alamb) I thought it might be a good idea to split 
this issue to several PRs 1 - add the support to use `CREATE TABLE` syntax with 
glob patterns and remote URL schemes just as with local ones (The new PR above 
tried to handle this). 2 - add table functions (`read_parquet`, `read_csv`, 
etc) to support glob reading (Im working on your comments regarding this one).
   > 
   > Hope this makes sense, feel free to comment also if not...
   
   I agree that a few smaller, focused PRs will make sense.





Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-12 Thread via GitHub


a-agmon commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-293348

   Hi @comphead  and @alamb 
   I thought it might be a good idea to split this issue into several PRs:
   1 - add support for using the `CREATE TABLE` syntax with glob patterns and 
remote URL schemes, just as with local ones (the new PR above tries to handle 
this).  
   2 - add table functions (`read_parquet`, `read_csv`, etc.) to support glob 
reading (I'm working on your comments regarding this one).
   
   Hope this makes sense, feel free to comment also if not... 
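
   To make concrete what a glob pattern selects, here is a toy matcher in plain 
Rust. It is only an illustration: the function name `glob_match` is 
hypothetical, it is not how DataFusion resolves globs, and it supports just `*` 
and `?`. Unlike most real glob implementations, `*` here also matches `/`.

```rust
/// Toy glob matcher supporting `*` (any run of bytes, including none) and
/// `?` (exactly one byte). Hypothetical helper for illustration only;
/// simplification: `*` is allowed to match across `/` separators.
fn glob_match(pattern: &str, name: &str) -> bool {
    fn inner(p: &[u8], n: &[u8]) -> bool {
        match p.split_first() {
            // Pattern exhausted: match only if the name is exhausted too.
            None => n.is_empty(),
            // Try every possible length for the run that `*` absorbs.
            Some((b'*', rest)) => (0..=n.len()).any(|skip| inner(rest, &n[skip..])),
            // `?` needs exactly one byte to consume.
            Some((b'?', rest)) => !n.is_empty() && inner(rest, &n[1..]),
            // Any other byte must match literally.
            Some((c, rest)) => n.first() == Some(c) && inner(rest, &n[1..]),
        }
    }
    inner(pattern.as_bytes(), name.as_bytes())
}
```

   For example, under this sketch `data/*.parquet` selects `data/a.parquet` but 
not `data/a.csv`.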





Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub


a-agmon commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2954755281

   Hi @alamb, 
   @comphead raises a couple of good questions about the PR, so I'm linking it 
here to hear your thoughts. 
   https://github.com/apache/datafusion/pull/16332#discussion_r2134795185





Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub


a-agmon commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2954199489

   I have added a draft PR for this. I would be happy to get your comments. 





Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub


alamb commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953904462

   > I can see that ListingTableUrl::parse supports glob strings, so does it 
make sense to simply implement this as a listing table?
   
   
   
   Yes, this is what I would expect -- that the result of calling 
`read_parquet` is / uses the `ListingTable` implementation.





Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub


alamb commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953903299

   Thanks @a-agmon -- maybe this example would help: 
https://docs.rs/datafusion/latest/datafusion/catalog/trait.AsyncSchemaProvider.html
   
   I agree the trick will be figuring out how to make the async calls. 





Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub


a-agmon commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953899593

   I gave it a shot but it ended up being somewhat messy. That's mostly because, 
on the one hand, `TableFunctionImpl::call()` is synchronous, yet, on the other 
hand, it also has to get hold of the schema of the data, which, in the case of 
remote blobs (like S3), requires IO and async to be done right. 
   I was trying to work around this by using the `call()` method to create a 
`TableProvider` that initially reports an empty schema. This satisfies the 
planner's synchronous API. The actual schema discovery is deferred until the 
`scan()` method is called during the asynchronous execution phase. But this 
creates an issue with projections that need to validate the schema, e.g. 
`select X from read_csv(some-glob-pattern)`, though 
`select * from read_csv(some-glob-pattern)` will work.
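
   The projection problem can be shown with a toy model in plain Rust. This is 
a sketch under stated assumptions, not DataFusion's planner or its 
`TableProvider` trait: `DeferredTable`, `Projection` and `plan_projection` are 
hypothetical names, and the only rule modelled is that a named column has to 
exist in the schema known at planning time.

```rust
/// Toy stand-in for a table whose real schema is only known after async IO.
/// Hypothetical type, not a DataFusion API.
struct DeferredTable {
    /// Columns visible at planning time; empty until schema discovery runs.
    planning_schema: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum Projection {
    /// `SELECT *`: there are no column names to validate while planning.
    Wildcard,
    /// `SELECT a, b, ...`: every name must exist in the planning-time schema.
    Columns(Vec<String>),
}

/// Validate a projection against whatever schema is known while planning.
fn plan_projection(table: &DeferredTable, projection: Projection) -> Result<Projection, String> {
    if let Projection::Columns(names) = &projection {
        for name in names {
            if !table.planning_schema.contains(name) {
                return Err(format!("column '{}' not found in schema", name));
            }
        }
    }
    Ok(projection)
}
```

   In this model a table with an empty planning-time schema accepts 
`Projection::Wildcard` but rejects any `Projection::Columns(..)`, which 
mirrors the difference between the two queries above.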
   





Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-07 Thread via GitHub


a-agmon commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2952859549

   @alamb  - I'm less familiar with this area in datafusion but might be able 
to give this a shot. 
   The idea is to add this as a table function, right?
   I can see that `ListingTableUrl::parse` supports glob strings, so does it 
make sense to simply implement this as a listing table?  
   





Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-06 Thread via GitHub


alamb commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2950888923

   Maybe @robtandy  could help





Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-06 Thread via GitHub


comphead commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2949738093

   I'll try to take this on in 2 weeks if no one else beats me to it.





Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-06 Thread via GitHub


comphead commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2949616516

   This has actually been on my backlog for a couple of months. It would be 
nice to support an array of files or globs. 

