The concurrency of any saveAs...() operation is determined mainly by
(a) the number of partitions in your RDD, and (b) how many cores your
cluster has available.

How big is the RDD in question? How many partitions does it have?
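
A minimal sketch of the idea: since each partition is written by a separate task, repartitioning to match your available cores can raise write concurrency. This assumes a `JavaSparkContext` named `sc`, `Text` keys/values, and a placeholder `MyOutputFormat` standing in for the custom OutputFormat mentioned in the thread; the path and core count are illustrative.

```java
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;

// Each partition becomes one write task, so bumping the partition count
// (up to the number of cores the cluster can run concurrently) increases
// the parallelism of the save.
int numCores = 32;  // hypothetical: total cores available to the job
JavaPairRDD<Text, Text> repartitioned = rdd.repartition(numCores);

repartitioned.saveAsNewAPIHadoopFile(
    "/tmp/output",            // hypothetical output path
    Text.class,               // key class
    Text.class,               // value class
    MyOutputFormat.class,     // placeholder for the custom OutputFormat
    sc.hadoopConfiguration());
```

Note that `repartition()` triggers a shuffle; if the RDD already has enough partitions, that cost may outweigh the extra write parallelism, so check the current partition count first (`rdd.getNumPartitions()`).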


On Thu, Jun 19, 2014 at 3:38 PM, Sandeep Parikh <sand...@clusterbeep.org>
wrote:

> I'm trying to write a JavaPairRDD to a downstream database using
> saveAsNewAPIHadoopFile with a custom OutputFormat and the process is pretty
> slow.
>
> Is there a way to boost the concurrency of the save process? For example,
> something like splitting the RDD into multiple smaller RDDs and using Java
> threads to write the data out? That seems foreign to the way Spark works,
> so I'm not sure if there's a better way.
>
>
