Hi John,

> Is DataFusion a good solution for validating and converting large csv files (+20M, ~400 columns) existing in S3 buckets to parquet?

In my opinion yes -- that is well within the use case of DataFusion. There is an example [1] of how to convert CSV files into Parquet files that might be helpful if you go in that direction.

Note that DataFusion doesn't (yet) have built-in integration for S3, so you would likely need to use another crate to write the files there. Here is an example of how we did it in IOx [2].

I personally think the part of DataFusion that shines the most is its higher-level query features, either SQL or the DataFrame API. If your use case is only converting CSV -> Parquet, your development may go faster with the pyarrow library (which wraps the C++ implementation), given Python's developer productivity and its broad ecosystem support.

Andrew

[1] https://github.com/apache/arrow-datafusion/blob/0125451e5fc194b1b1e4828bae5350bcd8ac24f9/benchmarks/src/bin/tpch.rs#L379-L448
[2] https://github.com/influxdata/influxdb_iox/blob/d41b44d3121c8a4b061ee13150d1d9f97a77ad88/parquet_file/src/storage.rs#L175-L193

On Tue, Aug 10, 2021 at 3:12 PM John E. Conlon wrote:
> Is DataFusion a good solution for validating and converting large csv
> files (+20M, ~400 columns) existing in S3 buckets to parquet?
>
> Have ran some of the Arrow,ArrowFlight/Java and Arrow/JS examples and
> now are looking at DataFusion because I see that it can work directly
> with CSV files. Not a Rust programmer but I like some of the features
> described in the DataFusion docs.
>
> I will deploying the transformed parquet files to S3 that will then be
> processed further by Dremio into virtual datasets. For the end users I
> will be offering a browser based visualizer for adhoc data analysis and
> since ArrowJS does not implement ArrowFlight I plan to create a gateway
> between the ArrowJS and Dremio/ArrowFlight.
>
> So... Arrow is in my dev plans but should I bite the bullet to learn
> Rust and DataFusion??
>
> thanks for any directions,
>
> John
