RE: CSV to parquet preserving partitioning

2016-11-23 Thread benoitdr
Best solution I've found so far (no shuffling and as many threads as input dirs):
1. Create an RDD of input dirs, with as many partitions as input dirs
2. Transform it to an RDD of input files (preserving the partitioning by dir)
3. Flat-map it with a custom CSV parser
4. Convert the RDD to a dataframe
5. Write to parquet
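A minimal Scala sketch of those five steps, assuming Spark 2.x, a known list of input dirs on a filesystem visible to every executor, and a toy two-column schema (Record, inputDirs and the output path are all illustrative names, not from the original post):

import java.io.File
import scala.io.Source
import org.apache.spark.sql.SparkSession

object CsvToParquet {
  // assumed toy schema: every input csv has the same two columns
  case class Record(dir: String, colA: String, colB: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CsvToParquet").getOrCreate()
    import spark.implicits._

    val inputDirs = Seq("/path/dir1", "/path/dir2") // illustrative

    // step 1: one RDD partition per input directory
    val dirsRdd = spark.sparkContext.parallelize(inputDirs, inputDirs.length)

    // step 2: expand each directory into its csv files; flatMap is a narrow
    // transformation, so each dir's files stay in that dir's partition
    val filesRdd = dirsRdd.flatMap { dir =>
      new File(dir).listFiles()
        .filter(_.getName.endsWith(".csv"))
        .map(f => (new File(dir).getName, f.getPath))
    }

    // step 3: parse every line of every file with a (trivial) custom parser
    val recordsRdd = filesRdd.flatMap { case (dirName, path) =>
      Source.fromFile(path).getLines().map { line =>
        val cols = line.split(",", -1)
        Record(dirName, cols(0), cols(1))
      }
    }

    // steps 4 and 5: to dataframe, then a partitioned write; rows are
    // already collocated per dir, so no shuffle is needed
    recordsRdd.toDF().write.partitionBy("dir").parquet("/path/out")

    spark.stop()
  }
}

Each task then owns exactly one input directory end to end, which is where the "as many threads as input dirs" parallelism comes from.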

RE: CSV to parquet preserving partitioning

2016-11-18 Thread benoitdr
This is more or less how I'm doing it now. The problem is that it creates shuffling in the cluster, because the input data are not collocated according to the partition scheme. If I reload the output parquet files as a new dataframe, then everything is fine, but I'd like to avoid the shuffle during the initial conversion as well.
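For contrast, a hedged reconstruction of what that plain dataframe route probably looks like (the paths, the header option and the regex are assumptions): the repartition on the derived column is exactly where the shuffle comes from, since Spark has to colocate each dir's rows before the partitioned write.

import org.apache.spark.sql.functions.{col, input_file_name, regexp_extract}

// read every csv in one go and rebuild the partition column from the path
val df = spark.read.option("header", "true").csv("/path/*/*.csv")
val withDir = df.withColumn(
  "dir", regexp_extract(input_file_name(), "/([^/]+)/[^/]+\\.csv$", 1))

// colocating rows by dir before the partitioned write is a full shuffle
withDir.repartition(col("dir")).write.partitionBy("dir").parquet("/path/out")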

RE: CSV to parquet preserving partitioning

2016-11-16 Thread benoitdr
Yes, by parsing the file content, it's possible to recover which directory each record belongs in.
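A one-line sketch of that idea, under the loud assumption that some hypothetical data column (site here, not named in the thread) maps one-to-one onto the source directory:

import org.apache.spark.sql.functions.{col, concat, lit}

// hypothetical: a column in the data determines the source directory, so the
// partition key can be rebuilt from the rows themselves, without file paths
val withDir = df.withColumn("dir", concat(lit("dir"), col("site")))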

CSV to parquet preserving partitioning

2016-11-15 Thread benoitdr
Hello, I'm trying to convert a bunch of CSV files to parquet, with the interesting twist that the input CSV files are already "partitioned" by directory. All the input files have the same set of columns. The input file structure looks like:

/path/dir1/file1.csv
/path/dir1/file2.csv
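One simple starting point, sketched under the assumption that the directory list is known up front and the csvs carry a header row: run one read/write per directory, targeting Hive-style dir=... output paths, so nothing ever needs to shuffle (at the cost of one sequential Spark job per directory).

val dirs = Seq("dir1", "dir2") // assumed known up front

for (d <- dirs) {
  spark.read
    .option("header", "true")     // assumption: the csvs have a header row
    .csv(s"/path/$d")
    .write
    .parquet(s"/path/out/dir=$d") // Hive-style partition directory
}

Reading /path/out back with spark.read.parquet("/path/out") then exposes dir as a regular partition column.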