RE: schemaRDD.saveAsParquetFile creates large number of small parquet files ...
Try rdd.coalesce(1).saveAsParquetFile(...)

http://spark.apache.org/docs/1.2.0/programming-guide.html#transformations

--- Original Message ---
From: "Manoj Samel"
Sent: January 29, 2015 9:28 AM
To: user@spark.apache.org
Subject: schemaRDD.saveAsParquetFile creates large number of small parquet files ...

Spark 1.2 on Hadoop 2.3

Read one big csv file, create a schemaRDD on it and saveAsParquetFile.

It creates a large number of small (~1 MB) parquet part-x- files.

Any way to control this so that a smaller number of large files is created?

Thanks,
Re: schemaRDD.saveAsParquetFile creates large number of small parquet files ...
You can use coalesce or repartition to control the number of files output by any Spark operation.

On Thu, Jan 29, 2015 at 9:27 AM, Manoj Samel wrote:
> Spark 1.2 on Hadoop 2.3
>
> Read one big csv file, create a schemaRDD on it and saveAsParquetFile.
>
> It creates a large number of small (~1 MB) parquet part-x- files.
>
> Any way to control this so that a smaller number of large files is created?
>
> Thanks,
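To make the suggestion in both replies concrete, here is a minimal sketch against the Spark 1.2 SchemaRDD API. The input path, output path, and target partition count are placeholders, not values from the original thread; the number of output part files equals the number of partitions at write time, so reducing partitions first reduces the file count.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object WriteFewerParquetFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fewer-parquet-files"))
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    // Hypothetical case class and CSV parsing; adapt to the real schema.
    case class Record(id: Int, value: String)
    val schemaRDD = sc.textFile("hdfs:///data/big.csv")
      .map(_.split(","))
      .map(r => Record(r(0).toInt, r(1)))

    // coalesce(n) shrinks the partition count without a full shuffle,
    // so the write produces n part files instead of one per input split.
    // Use repartition(n) instead if the data should be reshuffled evenly
    // (it costs a shuffle but avoids skewed output files).
    schemaRDD.coalesce(8).saveAsParquetFile("hdfs:///data/big.parquet")
  }
}
```

Note that coalesce(1) writes a single file but forces all data through one task, which can be slow or run out of memory for a genuinely big input; a modest partition count sized to the desired output file size is usually the safer choice.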