Both repartition and coalesce should let you achieve what you describe. Could you share the code that is not working?
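In the meantime, two things worth checking: coalesce/repartition return a new RDD rather than modifying the one they are called on, so the returned RDD is the one that must be saved; and the number of output files equals the number of partitions at save time, since dfs.blocksize / parquet.block.size only affect the layout inside each file, not the file count. A rough sketch of the pattern (assuming the Spark 1.x Java API; the paths and the jsonFile input are placeholders, not your actual code):

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.api.java.JavaSQLContext;
    import org.apache.spark.sql.api.java.JavaSchemaRDD;

    public class CoalesceParquet {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "coalesce-parquet");
        JavaSQLContext sqlContext = new JavaSQLContext(sc);

        // Placeholder input; any JavaSchemaRDD behaves the same way.
        JavaSchemaRDD input = sqlContext.jsonFile("hdfs:///path/to/input");

        // coalesce returns a NEW JavaSchemaRDD and leaves 'input' untouched,
        // so calling input.coalesce(n) without using the result does nothing.
        JavaSchemaRDD merged = input.coalesce(10, false); // 10 output part files

        merged.saveAsParquetFile("hdfs:///path/to/output");
        sc.stop();
      }
    }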
On Mon, Nov 24, 2014 at 8:24 PM, tridib <tridib.sama...@live.com> wrote:
> Hello,
> I am reading around 1000 input files from disk into an RDD and generating
> parquet. It always produces the same number of parquet files as input
> files. I tried to merge them using
>
> rdd.coalesce(n) and/or rdd.repartition(n).
>
> I also tried:
>
> int MB_128 = 128*1024*1024;
> sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
> sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);
>
> No luck.
> Is there a way to control the size/number of parquet files generated?
>
> Thanks
> Tridib