On Wed, Sep 17, 2014 at 5:21 AM, Luis Guerra <luispelay...@gmail.com> wrote:
> Hi everyone,
>
> Is it possible to fix the number of tasks related to a saveAsTextFile in
> Pyspark?
>
> I am loading several files from HDFS, fixing the number of partitions to X
> (let's say 40 for instance). Then some transformations, like joins and
> filters are carried out. The weird thing here is that the number of tasks
> involved in these transformations is 80, i.e. double the fixed
> number of partitions. However, when the saveAsTextFile action is carried
> out, there are only 4 tasks to do this (and I have not been able to increase
> that number). My problem here is that those 4 tasks rapidly increase
> memory usage and take too long to finish.

> I am launching my process from Windows to a cluster in Ubuntu, with 13
> computers (4 cores each) with 32 GB of memory, and using PySpark 1.0.2.

saveAsTextFile() runs as a map-side job, so its number of tasks is
determined by the number of partitions of the previous RDD.
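
For example, repartitioning right before the save should raise the number
of save tasks (just a rough sketch; the RDD name and output path below are
placeholders):

    # repartition() shuffles the data into 40 partitions,
    # so the following save runs as 40 tasks
    result.repartition(40).saveAsTextFile("hdfs:///path/to/output")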

In Spark 1.0.2, groupByKey() and reduceByKey() use the number of CPUs on
the driver (locally) as the default number of partitions, so it's 4 here.
You need to set it explicitly to 40 or 80 in this case.
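
Both take an explicit numPartitions argument, for example (a sketch; the
RDD name and the lambda are placeholders):

    # shuffle into 40 partitions instead of the default 4
    counts = pairs.reduceByKey(lambda a, b: a + b, 40)
    groups = pairs.groupByKey(40)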

BTW, in Spark 1.1, groupByKey() and reduceByKey() will use the number of
partitions of the previous RDD as the default value.

Davies

> Any clue with this?
>
> Thanks in advance
