schemaRDD.saveAsParquetFile creates large number of small parquet files ...

2015-01-29 Thread Manoj Samel
Spark 1.2 on Hadoop 2.3. I read one big CSV file, create a SchemaRDD on it, and call saveAsParquetFile. It creates a large number of small (~1 MB) parquet part-x- files. Is there any way to control this so that a smaller number of larger files is created? Thanks,
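For context, here is a minimal sketch of the pipeline being described, using the Spark 1.2 Scala API; the input path, field names, and schema below are hypothetical. On save, each partition of the SchemaRDD becomes one part-* file, so a big CSV that Spark splits into many partitions (roughly one per HDFS block) yields many small parquet files.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // One big CSV file; Spark splits it into many partitions.
    // Path and column layout are made up for illustration.
    val lines = sc.textFile("hdfs:///data/big.csv")
    val rows = lines.map(_.split(",")).map(a => Row(a(0), a(1)))

    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("value", StringType, nullable = true)))

    val schemaRDD = sqlContext.applySchema(rows, schema)

    // Each partition is written as its own part-* file,
    // hence the large number of small files.
    schemaRDD.saveAsParquetFile("hdfs:///data/out.parquet")
  }
}
```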

Re: schemaRDD.saveAsParquetFile creates large number of small parquet files ...

2015-01-29 Thread Michael Armbrust
You can use coalesce or repartition to control the number of files output by any Spark operation.
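Continuing the sketch above (the target partition count and output path are made up), reducing the partition count before the save reduces the number of part files:

```scala
// coalesce(n) merges existing partitions without a shuffle,
// so fewer, larger part files are written.
schemaRDD.coalesce(8).saveAsParquetFile("hdfs:///data/out.parquet")

// repartition(n) does the same with a full shuffle, which also lets you
// increase the partition count or rebalance skewed partitions:
// schemaRDD.repartition(8).saveAsParquetFile("hdfs:///data/out.parquet")
```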

RE: schemaRDD.saveAsParquetFile creates large number of small parquet files ...

2015-01-29 Thread Felix C