Hi Oleg,

Those parameters control the number and size of Spark's daemons on the cluster. If you're interested in how these daemons relate to each other and interact with YARN, I wrote a post on this a little while ago: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
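As a rough illustration of what the three flags in your command add up to, here is a small sketch in plain Python. It is only arithmetic on the values you passed, not how Spark or YARN size anything internally. The helper names and the overhead_mb parameter are made up for this example; YARN containers are sized somewhat larger than the executor heap, but the exact overhead depends on version and configuration, so it defaults to 0 here.

```python
def parse_mem_mb(size):
    """Convert a spark-submit style size such as '2g' or '512m' to MiB."""
    units = {"m": 1, "g": 1024}
    return int(size[:-1]) * units[size[-1].lower()]


def requested_resources(num_executors, executor_memory, executor_cores,
                        driver_memory, overhead_mb=0):
    """Summarise what a spark-submit invocation asks the cluster for.

    overhead_mb is a hypothetical per-executor allowance for memory that
    the container needs beyond the JVM heap; set it to whatever your
    cluster actually uses.
    """
    per_executor_mb = parse_mem_mb(executor_memory) + overhead_mb
    return {
        # Memory requested across all executor containers.
        "executor_mb_total": num_executors * per_executor_mb,
        # Memory requested for the driver process.
        "driver_mb": parse_mem_mb(driver_memory),
        # Upper bound on tasks that can run at the same time.
        "task_slots": num_executors * executor_cores,
    }


if __name__ == "__main__":
    # Values from the command in this thread:
    # --num-executors 12 --executor-memory 2g --executor-cores 4 --driver-memory 4g
    print(requested_resources(12, "2g", 4, "4g"))
```

With the values from this thread, that is 12 executors with a 2g heap each and up to 48 tasks running concurrently, which is worth comparing against the number of partitions in the job.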
In general, typing "spark-submit --help" will list the available options and what they control. To fetch the executor logs for an application, you can use "yarn logs -applicationId <the application ID>".

-Sandy

On Thu, Sep 18, 2014 at 5:47 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote:

> Great.
> Upgrade helped.
>
> Still need some inputs:
> 1) Are there any log files of Spark job execution?
> 2) Where can I read about tuning / parameter configuration?
>
> For example:
> --num-executors 12
> --driver-memory 4g
> --executor-memory 2g
>
> What is the meaning of those parameters?
>
> Thanks
> Oleg.
>
> On Thu, Sep 18, 2014 at 12:15 AM, Davies Liu <dav...@databricks.com> wrote:
>
>> Maybe the Python worker uses too much memory during groupByKey();
>> groupByKey() with a larger numPartitions can help.
>>
>> Also, can you upgrade your cluster to 1.1? It can spill the data
>> to disk if memory cannot hold all the data during groupByKey().
>>
>> Also, if there is a hot key with dozens of millions of values, the PR [1]
>> can help; it actually helped someone with large datasets (3T).
>>
>> Davies
>>
>> [1] https://github.com/apache/spark/pull/1977
>>
>> On Wed, Sep 17, 2014 at 7:31 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote:
>> >
>> > Sure, I'll post to the mail list.
>> >
>> > groupByKey(self, numPartitions=None)
>> >
>> > source code
>> >
>> > Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD into numPartitions partitions.
>> >
>> > So instead of using the default I'll provide numPartitions, but what is the best practice for calculating the number of partitions? And how is the number of partitions related to my original problem?
>> >
>> > Thanks
>> >
>> > Oleg.
>> >
>> > http://spark.apache.org/docs/1.0.2/api/python/frames.html
>> >
>> > On Wed, Sep 17, 2014 at 9:25 PM, Eric Friedman <eric.d.fried...@gmail.com> wrote:
>> >>
>> >> Look at the API for text file and groupByKey.
>> >> Please don't take threads off list. Other people have the same questions.
>> >>
>> >> ----
>> >> Eric Friedman
>> >>
>> >> On Sep 17, 2014, at 6:19 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote:
>> >>
>> >> Can you please explain how to configure partitions?
>> >> Thanks
>> >> Oleg
>> >>
>> >> On Wednesday, September 17, 2014, Eric Friedman <eric.d.fried...@gmail.com> wrote:
>> >>>
>> >>> Yeah, you need to increase partitions. You only have one on your text file. On groupByKey you're getting the pyspark default, which is too low.
>> >>>
>> >>> ----
>> >>> Eric Friedman
>> >>>
>> >>> On Sep 17, 2014, at 5:29 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote:
>> >>>
>> >>> This is a very good question :-).
>> >>>
>> >>> Here is my code:
>> >>>
>> >>> sc = SparkContext(appName="CAD")
>> >>> lines = sc.textFile(sys.argv[1], 1)
>> >>> result = lines.map(doSplit).groupByKey().mapValues(lambda vc: my_custom_function(vc))
>> >>> result.saveAsTextFile(sys.argv[2])
>> >>>
>> >>> Should I configure partitioning manually? Where should I configure it? Where can I read about partitioning best practices?
>> >>>
>> >>> Thanks
>> >>> Oleg.
>> >>>
>> >>> On Wed, Sep 17, 2014 at 8:22 PM, Eric Friedman <eric.d.fried...@gmail.com> wrote:
>> >>>>
>> >>>> How many partitions do you have in your input rdd? Are you specifying numPartitions in subsequent calls to groupByKey/reduceByKey?
>> >>>>
>> >>>> On Sep 17, 2014, at 4:38 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote:
>> >>>>
>> >>>> Hi,
>> >>>> I am executing pyspark on YARN.
>> >>>> I have successfully processed the initial dataset, but now I have grown it to 10 times the size.
>> >>>>
>> >>>> During execution I keep getting this error:
>> >>>> 14/09/17 19:28:50 ERROR cluster.YarnClientClusterScheduler: Lost executor 68 on UCS-NODE1.sms1.local: remote Akka client disassociated
>> >>>>
>> >>>> Tasks fail and are resubmitted again:
>> >>>>
>> >>>> 14/09/17 18:40:42 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD at PythonRDD.scala:252) because some of its tasks had failed: 21, 23, 26, 29, 32, 33, 48, 75, 86, 91, 93, 94
>> >>>> 14/09/17 18:44:18 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD at PythonRDD.scala:252) because some of its tasks had failed: 31, 52, 60, 93
>> >>>> 14/09/17 18:46:33 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD at PythonRDD.scala:252) because some of its tasks had failed: 19, 20, 23, 27, 39, 51, 64
>> >>>> 14/09/17 18:48:27 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD at PythonRDD.scala:252) because some of its tasks had failed: 51, 68, 80
>> >>>> 14/09/17 18:50:47 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD at PythonRDD.scala:252) because some of its tasks had failed: 1, 20, 34, 42, 61, 67, 77, 81, 91
>> >>>> 14/09/17 18:58:50 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD at PythonRDD.scala:252) because some of its tasks had failed: 8, 21, 23, 29, 34, 40, 46, 67, 69, 86
>> >>>> 14/09/17 19:00:44 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD at PythonRDD.scala:252) because some of its tasks had failed: 6, 13, 15, 17, 18, 19, 23, 32, 38, 39, 44, 49, 53, 54, 55, 56, 57, 59, 68, 74, 81, 85, 89
>> >>>> 14/09/17 19:06:24 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD at PythonRDD.scala:252) because some of its tasks had failed: 20, 43, 59, 79, 92
>> >>>> 14/09/17 19:16:13 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (RDD at PythonRDD.scala:252) because some of its tasks had failed: 0, 2, 3, 11, 24, 31, 43, 65, 73
>> >>>> 14/09/17 19:27:40 INFO scheduler.DAGScheduler: Resubmitting
Stage 1 (RDD at PythonRDD.scala:252) because some of its tasks had failed: 3, 7, 41, 72, 75, 84
>> >>>>
>> >>>> QUESTION:
>> >>>> How can I debug / tune the problem?
>> >>>> What can cause such behavior?
>> >>>> I have a 5-machine cluster with 32 GB RAM.
>> >>>> Dataset - 3G.
>> >>>>
>> >>>> Command for execution:
>> >>>>
>> >>>> /usr/lib/spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563/bin/spark-submit --master yarn --num-executors 12 --driver-memory 4g --executor-memory 2g --py-files tad.zip --executor-cores 4 /usr/lib/cad/PrepareDataSetYarn.py /input/tad/inpuut.csv /output/cad_model_500_2
>> >>>>
>> >>>> Where can I find a description of the parameters?
>> >>>> --num-executors 12
>> >>>> --driver-memory 4g
>> >>>> --executor-memory 2g
>> >>>>
>> >>>> What parameters should be used for tuning?
>> >>>>
>> >>>> Thanks
>> >>>> Oleg.