Re: Always two tasks slower than others, and then job fails

2015-08-14 Thread Jeff Zhang
Data skew? Maybe your partition key has some special value, like null or an
empty string.
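One way to check for this (a hedged sketch, not from the thread): count records per key before the shuffle, e.g. with rdd.countByKey() in Spark. The idea, illustrated with plain Python and made-up sample records:

```python
from collections import Counter

# Hypothetical sample of (key, value) records with a skewed null key.
records = [("a", 1), ("b", 2), (None, 3), (None, 4), (None, 5), ("", 6)]

# Equivalent of Spark's rdd.countByKey(): records per partition key.
key_counts = Counter(k for k, _ in records)

# Flag suspicious keys (null/empty) -- they all hash to one partition,
# so every such record lands in the same task.
suspicious = {k: c for k, c in key_counts.items() if k in (None, "")}
print(suspicious)  # {None: 3, '': 1}
```

If a null or empty key dominates the counts, that single partition (and its task) will be far larger than the rest.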

On Fri, Aug 14, 2015 at 11:01 AM, randylu randyl...@gmail.com wrote:

   It is strange that there are always two tasks slower than the others, and the
 corresponding partitions' data are larger, no matter how many partitions there are.


 Executor ID  Address                  Task Time  Tasks  Shuffle Read Size / Records
 1            slave129.vsvs.com:56691  16 s       1      99.5 MB / 18865432
 *10          slave317.vsvs.com:59281  0 ms       0      413.5 MB / 311001318*
 100          slave290.vsvs.com:60241  19 s       1      110.8 MB / 27075926
 101          slave323.vsvs.com:36246  14 s       1      126.1 MB / 25052808

   The task time and record count for Executor 10 seem strange, and the CPUs on
 that node are all 100% busy.

   Has anyone met the same problem? Thanks in advance for any answer!




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Always-two-tasks-slower-than-others-and-then-job-fails-tp24257.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Best Regards

Jeff Zhang


Re: Always two tasks slower than others, and then job fails

2015-08-14 Thread Zoltán Zvara
Data skew is still a problem with Spark.

- If you use groupByKey, try to express your logic without groupByKey.
- If you must use groupByKey, all you can do is scale vertically.
- If you can, repartition with a finer-grained HashPartitioner. You will have
many tasks per stage, but tasks are lightweight in Spark, so this should not
introduce heavy overhead. If you have your own domain partitioner, try to
rewrite it to introduce a secondary key.
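A hedged sketch of the first bullet's usual rewrite, with plain Python dicts standing in for Spark's shuffle (the sample records are made up): a combining operation like reduceByKey pre-aggregates each key on the map side, so no single task has to buffer every value of a hot key, which is what groupByKey forces.

```python
# Sum per key with map-side combining, as reduceByKey would do.
# groupByKey would instead ship and buffer every value of a key
# in one task, which is what makes a skewed key so painful.
records = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

combined = {}
for k, v in records:
    combined[k] = combined.get(k, 0) + v

print(combined)  # {'a': 8, 'b': 2}
```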

I hope this gives some insight and helps.
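The secondary-key idea in the last bullet is often called key salting. A minimal sketch, assuming a fan-out of 4 salts (the names and numbers here are illustrative, not from the thread):

```python
import random

NUM_SALTS = 4  # illustrative: how many partitions one hot key spreads over

def salt(key):
    # Composite key: the original key plus a random secondary key, so a
    # hot key's records hash to several shuffle partitions instead of one.
    return (key, random.randrange(NUM_SALTS))

# Hypothetical skewed stream where one key dominates.
records = [("hot", i) for i in range(1000)] + [("cold", 0)]
salted = [(salt(k), v) for k, v in records]

# Aggregation now runs per (key, salt); a second, much smaller pass
# combines the NUM_SALTS partial results per original key.
hot_buckets = {sk for sk, _ in salted if sk[0] == "hot"}
```

The trade-off is the extra combining pass, but each salted bucket is roughly 1/NUM_SALTS the size of the original hot partition.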

On Fri, Aug 14, 2015 at 9:37 AM Jeff Zhang zjf...@gmail.com wrote:

 Data skew? Maybe your partition key has some special value, like null or an
 empty string.

 --
 Best Regards

 Jeff Zhang



Always two tasks slower than others, and then job fails

2015-08-13 Thread randylu
  It is strange that there are always two tasks slower than the others, and the
corresponding partitions' data are larger, no matter how many partitions there are.


Executor ID  Address                  Task Time  Tasks  Shuffle Read Size / Records
1            slave129.vsvs.com:56691  16 s       1      99.5 MB / 18865432
*10          slave317.vsvs.com:59281  0 ms       0      413.5 MB / 311001318*
100          slave290.vsvs.com:60241  19 s       1      110.8 MB / 27075926
101          slave323.vsvs.com:36246  14 s       1      126.1 MB / 25052808

  The task time and record count for Executor 10 seem strange, and the CPUs on
that node are all 100% busy.

  Has anyone met the same problem? Thanks in advance for any answer!



