You can use coalesce or repartition to control the number of files output by any Spark operation. coalesce merges existing partitions without a shuffle, while repartition does a full shuffle; each resulting partition becomes one output part file.
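A minimal sketch against the Spark 1.2 Scala API, assuming a simple two-column CSV; the input/output paths, partition count, and the Record case class are placeholders, not from your setup:

// Sketch for Spark 1.2 (Scala). Paths and schema below are assumptions.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object ParquetCoalesceExample {
  case class Record(id: Int, name: String)  // assumed CSV schema

  def main(args: Array[String]) {
    val sc = new SparkContext("local[*]", "ParquetCoalesceExample")
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit: RDD[Record] -> SchemaRDD

    val rows = sc.textFile("hdfs:///data/big.csv")  // assumed input path
      .map(_.split(","))
      .map(p => Record(p(0).trim.toInt, p(1)))

    // coalesce(16) merges partitions without a shuffle, so the write
    // produces 16 parquet part files instead of hundreds of ~1MB ones.
    rows.coalesce(16).saveAsParquetFile("hdfs:///data/out.parquet")

    // repartition(16) would do a full shuffle instead; use it when you
    // need to increase the partition count or rebalance skewed data:
    // rows.repartition(16).saveAsParquetFile("hdfs:///data/out2.parquet")

    sc.stop()
  }
}

Pick the partition count so each output file lands near your HDFS block size; with coalesce, avoid going so low that the final write runs on too few cores.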
On Thu, Jan 29, 2015 at 9:27 AM, Manoj Samel <manojsamelt...@gmail.com> wrote:
> Spark 1.2 on Hadoop 2.3
>
> Read one big csv file, create a schemaRDD on it and saveAsParquetFile.
>
> It creates a large number of small (~1MB) parquet part-x- files.
>
> Any way to control so that smaller number of large files are created?
>
> Thanks,