Re: Partitions with zero records & variable task times

2015-09-09 Thread Akhil Das
This post has a bit of information: http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
Thanks
Best Regards
On Wed, Sep 9, 2015 at 6:44 AM, mark wrote:
> As I understand things (maybe naively), my

Re: Partitions with zero records & variable task times

2015-09-09 Thread mark
The article is interesting but doesn't really help. It has only one sentence about data distribution in partitions. How can I diagnose skewed data distribution? How could evenly sized blocks in HDFS lead to skewed data anyway?
On 9 Sep 2015 2:29 pm, "Akhil Das"
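One way to approach the "how can I diagnose skew" question is to count the records in each partition and compare the largest partition with the mean. The following is a minimal sketch in plain Python, not something taken from the thread: the helper name `partition_skew_report` and the sample counts are hypothetical. In PySpark the per-partition counts could be collected with something like `rdd.glom().map(len).collect()`, which materializes each partition as a list and so is only sensible for modest data sizes.

```python
from statistics import mean


def partition_skew_report(counts):
    """Summarize per-partition record counts (hypothetical helper).

    `counts` is a list with one record count per partition.
    """
    if not counts:
        raise ValueError("counts must contain at least one partition")
    avg = mean(counts)
    return {
        "partitions": len(counts),
        "empty": sum(1 for c in counts if c == 0),
        "min": min(counts),
        "max": max(counts),
        "mean": avg,
        # Ratio of the largest partition to the mean; about 1.0 means balanced.
        "max_over_mean": max(counts) / avg if avg else float("inf"),
    }


# Made-up counts for illustration, not taken from the job in this thread.
sample_counts = [0, 0, 1200, 50, 0, 4800, 30, 0]
report = partition_skew_report(sample_counts)
print(report)
```

A large `max_over_mean` together with a non-zero `empty` count would match the symptom described in this thread: a few slow tasks and several tasks that finish almost immediately.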

Re: Partitions with zero records & variable task times

2015-09-08 Thread Akhil Das
Try using a custom partitioner for the keys so that they will get evenly distributed across tasks.
Thanks
Best Regards
On Fri, Sep 4, 2015 at 7:19 PM, mark wrote:
> I am trying to tune a Spark job and have noticed some strange behavior -
> tasks in a stage vary in
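To illustrate the custom partitioner suggestion, here is a small sketch in plain Python, under the assumption that the keys are integers that happen to share a common factor with the partition count. The function names, the keys, and the partition count are all hypothetical. It compares a plain modulo assignment with one that mixes the key through CRC32 first. In PySpark a one-argument function of this kind could be passed as the `partitionFunc` argument of `rdd.partitionBy(numPartitions, partitionFunc)`; in the Scala/Java API the counterpart would be a subclass of `org.apache.spark.Partitioner`.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # assumed value for illustration


def naive_partition(key, num_partitions=NUM_PARTITIONS):
    """Plain modulo of an integer key; clusters when keys share a factor."""
    return key % num_partitions


def crc_partition(key, num_partitions=NUM_PARTITIONS):
    """Hypothetical custom partition function: mix the key with CRC32 first."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_sizes(keys, partition_func, num_partitions=NUM_PARTITIONS):
    """Return a list with the number of keys assigned to each partition."""
    assigned = Counter(partition_func(k, num_partitions) for k in keys)
    return [assigned.get(p, 0) for p in range(num_partitions)]


# Made-up keys: every key is a multiple of 4, so modulo 4 always yields 0.
keys = list(range(0, 400, 4))
naive_sizes = partition_sizes(keys, naive_partition)
custom_sizes = partition_sizes(keys, crc_partition)
print("naive :", naive_sizes)
print("custom:", custom_sizes)
```

With these keys the plain modulo puts every record in partition 0 and leaves the other three empty, while the mixed version spreads keys over more partitions. Whether a custom partitioner helps a real job depends on the actual key distribution; a single very frequent key will still land in one partition whatever function is used.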

Re: Partitions with zero records & variable task times

2015-09-08 Thread mark
As I understand things (maybe naively), my input data are stored in equal-sized blocks in HDFS, and each block becomes a partition within Spark when read from HDFS, so each block should hold roughly the same number of records. So something is missing in my understanding - what can

Partitions with zero records & variable task times

2015-09-04 Thread mark
I am trying to tune a Spark job and have noticed some strange behavior - tasks in a stage vary in execution time, ranging from 2 seconds to 20 seconds. I assume tasks should all run in roughly the same amount of time in a well-tuned job. So I did some investigation - the fast tasks appear to have