Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
alamb commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2976163501

> Regarding the location of the code, if it is in datafusion proper rather than the CLI, it would be available in datafusion python, and any other projects that want to offer functionality backed by datafusion. I think it increases the utility of datafusion as a library and will get used.

That is an interesting idea -- I agree that having `CREATE EXTERNAL TABLE` support this kind of URL / multiple files would be useful.

I am not sure I fully understand the ramifications either -- if we simply update the SQL planner (SqlToRel) to split the URL list on `' '` or `', '`, that certainly seems straightforward to me (and would be backwards compatible...)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
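A minimal sketch of the splitting idea above, using only the Rust standard library. The helper name `split_locations` is hypothetical and is not part of DataFusion or `SqlToRel`; it only illustrates why splitting the location string on spaces and commas keeps a single-URL location working unchanged. Note that it splits on a bare `,` as well as `', '`, which is a slightly broader rule than the one proposed.

```rust
/// Hypothetical helper (not a DataFusion API): split a location string
/// into individual URLs on spaces and commas.
fn split_locations(location: &str) -> Vec<&str> {
    location
        // Treat both ' ' and ',' as separators, so "a, b" and "a b" both work.
        .split(|c: char| c == ',' || c == ' ')
        // Strip any other surrounding whitespace such as tabs.
        .map(str::trim)
        // Drop the empty pieces produced by ", " or by repeated spaces.
        .filter(|s| !s.is_empty())
        .collect()
}
```

A location with no separators comes back as a one-element list, which is the backwards-compatible case. One caveat under this assumption: a path that itself contains a space or a comma would be split, so a real implementation would need a quoting or escaping rule.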
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
robtandy commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2970581452

Thanks for creating this issue @alamb!!

Regarding the location of the code, if it is in datafusion proper rather than the CLI, it would be available in datafusion python, and any other projects that want to offer functionality backed by datafusion. I think it increases the utility of datafusion as a library and will get used.

Could it be a configuration option that controls whether the feature is enabled? Like how `datafusion.catalog.information_schema` enables the info schema in the `SessionState`?

I do understand that it will be more code to maintain, but my intuition is that this is generally useful enough to offer within the core, as I think it will provide value. It's possible, though, that I don't fully appreciate the ramifications of this choice.

Curious what people think about this.
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
alamb commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2969936296

> Hi [@comphead](https://github.com/comphead) and [@alamb](https://github.com/alamb)
> I thought it might be a good idea to split this issue into several PRs:
> 1 - add support for using `CREATE TABLE` syntax with glob patterns and remote URL schemes just as with local ones (the new PR above tries to handle this).
> 2 - add table functions (`read_parquet`, `read_csv`, etc.) to support glob reading (I'm working on your comments regarding this one).
>
> Hope this makes sense, feel free to comment also if not...

I agree a few smaller focused PRs will make sense.
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
a-agmon commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-293348

Hi @comphead and @alamb

I thought it might be a good idea to split this issue into several PRs:
1 - add support for using `CREATE TABLE` syntax with glob patterns and remote URL schemes just as with local ones (the new PR above tries to handle this).
2 - add table functions (`read_parquet`, `read_csv`, etc.) to support glob reading (I'm working on your comments regarding this one).

Hope this makes sense, feel free to comment also if not...
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
a-agmon commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2954755281

Hi @alamb, @comphead raises a couple of good questions about the PR, so I'm linking it here to hear your thoughts.
https://github.com/apache/datafusion/pull/16332#discussion_r2134795185
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
a-agmon commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2954199489

I have added a draft for this PR. Would be happy for your comments.
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
alamb commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953904462

> I can see that ListingTableUrl::parse supports glob strings, so does it make sense to simply implement this as a listing table?

Yes, this is what I would expect -- that the result of calling `read_parquet` is / uses the `ListingTable` implementation.
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
alamb commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953903299

Thanks @a-agmon -- maybe this example would help: https://docs.rs/datafusion/latest/datafusion/catalog/trait.AsyncSchemaProvider.html

I agree the trick will be figuring out how to make the async calls.
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
a-agmon commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953899593

I gave it a shot, but it ended up being somewhat messy. That's mostly because `TableFunctionImpl::call()` is synchronous, yet it also has to get hold of the schema of the data, which in the case of remote blobs (like S3) requires I/O and async to be done right.

I tried to work around this by having `call()` create a `TableProvider` that initially reports an empty schema. This satisfies the planner's synchronous API, and the actual schema discovery is deferred until `scan()` is called during the asynchronous execution phase. But this creates an issue with projections that need to validate the schema: `select X from read_csv(some-glob-pattern)` runs into it, though `select * from read_csv(some-glob-pattern)` will work.
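The projection problem described above can be illustrated with a toy model, written with only the Rust standard library. `DeferredTable` and `plan_projection` are hypothetical names and do not correspond to DataFusion's `TableProvider` trait or its planner; the sketch only assumes that a named column is checked against the schema known at planning time, while `select *` simply takes whatever columns are known.

```rust
/// Toy stand-in for a table whose schema is discovered lazily.
/// This is NOT DataFusion's `TableProvider`; it only models the idea.
struct DeferredTable {
    /// Column names known at planning time (empty until a scan discovers them).
    planning_schema: Vec<String>,
}

/// Toy planner step: `None` models `select *`, `Some(cols)` models
/// `select col1, col2, ...`. Named columns must exist in the planning schema.
fn plan_projection(
    table: &DeferredTable,
    projection: Option<&[&str]>,
) -> Result<Vec<String>, String> {
    match projection {
        // `select *` takes whatever is known, even if that is nothing yet.
        None => Ok(table.planning_schema.clone()),
        // Each named column is validated against the planning-time schema.
        Some(cols) => cols
            .iter()
            .map(|c| {
                if table.planning_schema.iter().any(|f| f.as_str() == *c) {
                    Ok(c.to_string())
                } else {
                    Err(format!("column '{}' not found in schema", c))
                }
            })
            .collect(),
    }
}
```

With an empty planning schema, the `None` case succeeds while any named column is rejected before a scan could ever supply the real schema, which mirrors the behavior reported above.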
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
a-agmon commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2952859549

@alamb - I'm less familiar with this area of datafusion but might be able to give this a shot. The idea is to add this as a table function, right?

I can see that `ListingTableUrl::parse` supports glob strings, so does it make sense to simply implement this as a listing table?
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
alamb commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2950888923

Maybe @robtandy could help
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
comphead commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2949738093

I'll try to take this in 2 weeks if no one else beats me to it.
Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]
comphead commented on issue #16303:
URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2949616516

That has actually been on my backlog for a couple of months. It would be nice to support an array of files or globs.
