Spark 1.2 on Hadoop 2.3
I read one big CSV file, create a SchemaRDD on it, and call saveAsParquetFile.
This creates a large number of small (~1 MB) parquet part-x- files.
Is there any way to control this so that fewer, larger files are created?
Thanks,
You can use coalesce or repartition to control the number of files output by
any Spark operation.
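For example (a minimal sketch against the Spark 1.2 Scala API; it assumes sc is an existing SparkContext, and the paths, column layout, and Record case class are illustrative, not from your job):

  import org.apache.spark.sql.SQLContext

  // illustrative schema for the CSV rows
  case class Record(name: String, age: Int)

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD  // implicit RDD[Record] -> SchemaRDD

  val records = sc.textFile("hdfs:///data/input.csv")
    .map(_.split(","))
    .map(p => Record(p(0), p(1).trim.toInt))

  // coalesce(8) merges partitions without a shuffle, so Parquet writes
  // ~8 part files; repartition(8) would shuffle and balance them evenly
  records.coalesce(8).saveAsParquetFile("hdfs:///data/output.parquet")

coalesce is cheaper since it avoids a shuffle, but repartition can give more evenly sized output files if the input partitions are skewed.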
On Thu, Jan 29, 2015 at 9:27 AM, Manoj Samel manojsamelt...@gmail.com wrote:
Spark 1.2 on Hadoop 2.3
Read one big csv file, create a schemaRDD on it and saveAsParquetFile.
It creates a large number of small parquet files ...