Sure, I'll post to the mailing list.
groupByKey(self, numPartitions=None)
<http://spark.apache.org/docs/1.0.2/api/python/pyspark.rdd-pysrc.html#RDD.groupByKey>


Group the values for each key in the RDD into a single sequence.
Hash-partitions the resulting RDD into numPartitions partitions.


So instead of using the default I'll provide numPartitions, but what is the
best practice for calculating the number of partitions? And how is the number
of partitions related to my original problem?
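
A minimal sketch (not from this thread) of one common approach, assuming the
--num-executors 12 / --executor-cores 4 settings from the spark-submit command
quoted further down: the Spark tuning guide suggests roughly 2-3 tasks per CPU
core, so numPartitions can be derived from the cluster size. The names lines,
doSplit and my_custom_function are taken from the snippet quoted below.

    # Hypothetical sizing: 12 executors * 4 cores = 48 cores, ~3 tasks per core.
    total_cores = 12 * 4
    num_partitions = total_cores * 3

    result = (lines.map(doSplit)
                   .groupByKey(numPartitions=num_partitions)
                   .mapValues(lambda vc: my_custom_function(vc)))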


Thanks

Oleg.




On Wed, Sep 17, 2014 at 9:25 PM, Eric Friedman <eric.d.fried...@gmail.com>
wrote:

> Look at the API for textFile and groupByKey. Please don't take threads
> off list. Other people have the same questions.
>
> ----
> Eric Friedman
>
> On Sep 17, 2014, at 6:19 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote:
>
> Can you please explain how to configure partitions?
> Thanks
> Oleg
>
> On Wednesday, September 17, 2014, Eric Friedman <eric.d.fried...@gmail.com>
> wrote:
>
>> Yeah, you need to increase partitions. You only have one partition on your
>> textFile. On groupByKey you're getting the PySpark default, which is too low.
>>
>> ----
>> Eric Friedman
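
A minimal sketch of both knobs mentioned here, assuming the paths from the
snippet quoted below and an arbitrary target of 100 partitions: the second
argument to sc.textFile is a minimum partition count for the input, and an
existing RDD can also be reshuffled explicitly with repartition().

    import sys
    from pyspark import SparkContext

    sc = SparkContext(appName="CAD")

    # Ask for more input splits up front; the second argument is only a
    # lower bound on the number of partitions (100 is an assumed value).
    lines = sc.textFile(sys.argv[1], 100)

    # Or reshuffle an existing RDD into more partitions (involves a shuffle).
    lines = lines.repartition(100)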
>>
>> On Sep 17, 2014, at 5:29 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote:
>>
>> This is a very good question :-).
>>
>> Here is my code:
>>
>> import sys
>> from pyspark import SparkContext
>>
>> # doSplit and my_custom_function are defined elsewhere in the job.
>> sc = SparkContext(appName="CAD")
>> lines = sc.textFile(sys.argv[1], 1)
>> result = lines.map(doSplit).groupByKey().mapValues(lambda vc: my_custom_function(vc))
>> result.saveAsTextFile(sys.argv[2])
>>
>> Should I configure partitioning manually? Where should I configure it?
>> Where can I read about partitioning best practices?
>>
>> Thanks
>> Oleg.
>>
>> On Wed, Sep 17, 2014 at 8:22 PM, Eric Friedman <eric.d.fried...@gmail.com
>> > wrote:
>>
>>> How many partitions do you have in your input rdd?  Are you specifying
>>> numPartitions in subsequent calls to groupByKey/reduceByKey?
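
A hedged way to answer the first question from the driver, assuming a PySpark
release that provides RDD.getNumPartitions() (added after 1.0); on 1.0.x,
glom().count() gives the same number at the cost of running a job.

    lines = sc.textFile(sys.argv[1], 1)

    # Newer PySpark releases expose the partition count directly.
    print(lines.getNumPartitions())

    # Older releases: glom() turns each partition into one list, so counting
    # the resulting elements yields the number of partitions (runs a job).
    print(lines.glom().count())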
>>>
>>> On Sep 17, 2014, at 4:38 AM, Oleg Ruchovets <oruchov...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>   I am executing PySpark on YARN.
>>> I successfully processed the initial dataset, but now I have grown it to
>>> 10 times the size.
>>>
>>> During execution I keep getting this error:
>>>   14/09/17 19:28:50 ERROR cluster.YarnClientClusterScheduler: Lost
>>> executor 68 on UCS-NODE1.sms1.local: remote Akka client disassociated
>>>
>>> Tasks fail and are resubmitted again:
>>>
>>> 14/09/17 18:40:42 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD
>>> at PythonRDD.scala:252) because some of its tasks had failed: 21, 23, 26,
>>> 29, 32, 33, 48, 75, 86, 91, 93, 94
>>> 14/09/17 18:44:18 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD
>>> at PythonRDD.scala:252) because some of its tasks had failed: 31, 52, 60, 93
>>> 14/09/17 18:46:33 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD
>>> at PythonRDD.scala:252) because some of its tasks had failed: 19, 20, 23,
>>> 27, 39, 51, 64
>>> 14/09/17 18:48:27 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD
>>> at PythonRDD.scala:252) because some of its tasks had failed: 51, 68, 80
>>> 14/09/17 18:50:47 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD
>>> at PythonRDD.scala:252) because some of its tasks had failed: 1, 20, 34,
>>> 42, 61, 67, 77, 81, 91
>>> 14/09/17 18:58:50 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD
>>> at PythonRDD.scala:252) because some of its tasks had failed: 8, 21, 23,
>>> 29, 34, 40, 46, 67, 69, 86
>>> 14/09/17 19:00:44 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD
>>> at PythonRDD.scala:252) because some of its tasks had failed: 6, 13, 15,
>>> 17, 18, 19, 23, 32, 38, 39, 44, 49, 53, 54, 55, 56, 57, 59, 68, 74, 81, 85,
>>> 89
>>> 14/09/17 19:06:24 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD
>>> at PythonRDD.scala:252) because some of its tasks had failed: 20, 43, 59,
>>> 79, 92
>>> 14/09/17 19:16:13 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD
>>> at PythonRDD.scala:252) because some of its tasks had failed: 0, 2, 3, 11,
>>> 24, 31, 43, 65, 73
>>> 14/09/17 19:27:40 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD
>>> at PythonRDD.scala:252) because some of its tasks had failed: 3, 7, 41, 72,
>>> 75, 84
>>>
>>>
>>>
>>> *QUESTION:*
>>>    How do I debug / tune this problem?
>>> What can cause such behavior?
>>> I have a 5-machine cluster with 32 GB RAM.
>>> Dataset: 3 GB.
>>>
>>> command for execution:
>>>
>>>
>>>  /usr/lib/spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563/bin/spark-submit
>>> --master yarn  --num-executors 12  --driver-memory 4g --executor-memory 2g
>>> --py-files tad.zip --executor-cores 4   /usr/lib/cad/PrepareDataSetYarn.py
>>>  /input/tad/inpuut.csv  /output/cad_model_500_2
>>>
>>>
>>> Where can I find a description of these parameters?
>>> --num-executors 12
>>> --driver-memory 4g
>>> --executor-memory 2g
>>>
>>> What parameters should be used for tuning?
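
The flags are described in the spark-submit and running-on-YARN documentation.
As a hedged sketch, they roughly correspond to the configuration properties
below; note that memory settings generally have to be supplied before the JVMs
start (via spark-submit flags or spark-defaults.conf), so setting them in code
is mostly useful for reference.

    from pyspark import SparkConf, SparkContext

    # Values mirror the spark-submit command quoted above.
    conf = (SparkConf()
            .setAppName("CAD")
            .set("spark.executor.memory", "2g")      # --executor-memory 2g
            .set("spark.driver.memory", "4g")        # --driver-memory 4g
            .set("spark.executor.cores", "4")        # --executor-cores 4
            .set("spark.executor.instances", "12"))  # --num-executors 12 (YARN)

    sc = SparkContext(conf=conf)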
>>>
>>> Thanks
>>> Oleg.
>>>
>>>
>>>
>>>
>>
