Hi John,

> Is DataFusion a good solution for validating and converting large csv
> files (+20M, ~400 columns) existing in S3 buckets to parquet?

In my opinion, yes -- that is well within the use case of DataFusion. There
is an example [1] of how to convert csv files into parquet files that might
be helpful if you go in that direction.

Note that DataFusion doesn't (yet) have built-in integration for S3, so you
would likely need to use another crate to write the files there. Here is an
example of how we did it in IOx [2].

I personally think the part of DataFusion that shines the most is its
higher-level query features, either SQL or the DataFrame API. If your
use case is only converting csv -> parquet, your development may go faster
with the pyarrow library (which wraps the C++ implementation), given
Python's developer productivity and its broad ecosystem support.

Andrew

[1]
https://github.com/apache/arrow-datafusion/blob/0125451e5fc194b1b1e4828bae5350bcd8ac24f9/benchmarks/src/bin/tpch.rs#L379-L448
[2]
https://github.com/influxdata/influxdb_iox/blob/d41b44d3121c8a4b061ee13150d1d9f97a77ad88/parquet_file/src/storage.rs#L175-L193

On Tue, Aug 10, 2021 at 3:12 PM John E. Conlon <[email protected]> wrote:

> Is DataFusion a good solution for validating and converting large csv
> files (+20M, ~400 columns) existing in S3 buckets to parquet?
>
> Have run some of the Arrow, ArrowFlight/Java and Arrow/JS examples and
> now am looking at DataFusion because I see that it can work directly
> with CSV files. Not a Rust programmer but I like some of the features
> described in the DataFusion docs.
>
> I will be deploying the transformed parquet files to S3 that will then be
> processed further by Dremio into virtual datasets.  For the end users I
> will be offering a browser based visualizer for ad hoc data analysis and
> since ArrowJS does not implement ArrowFlight I plan to create a gateway
> between the ArrowJS and Dremio/ArrowFlight.
>
> So... Arrow is in my dev plans but should I bite the bullet to learn
> Rust and DataFusion??
>
> thanks for any directions,
>
> John
>
>
>
