GitHub user collimarco closed a discussion: Is it possible to query multiple Parquet files at once? (running one SQL query on many files in a folder)
I am getting started with Arrow DataFusion and looking at the examples: https://arrow.apache.org/datafusion/user-guide/example-usage.html

I don't see any way to execute a SQL query on multiple files at the same time. Is that possible?

Let's say that you have thousands of Parquet files already stored in a folder. The schema is similar, but it is not identical for all the files. For example:

- some files may have additional columns or fewer columns
- rarely, a column may have a different type (for example, a status column is usually an integer but sometimes a string).

**Is it possible to use DataFusion to query all the files in a directory?**

**Or is it possible to give DataFusion a long list of files to query dynamically?**

Ideally each query uses a different set of files (they are grouped in partitions), so it would be better to be able to execute the queries directly on a list of files, without having to perform too many intermediate steps.

Is this possible with DataFusion?

GitHub link: https://github.com/apache/datafusion/discussions/6728
