Re: Shuffle Problems in 1.2.0

2015-01-12 Thread Sven Krasser
I've filed a ticket for this issue here: https://issues.apache.org/jira/browse/SPARK-5209. (This reproduces the problem on a smaller cluster size.) -Sven

Re: Shuffle Problems in 1.2.0

2015-01-07 Thread Sven Krasser
Could you try it on AWS using EMR? That'd give you an exact replica of the environment that causes the error.

Re: Shuffle Problems in 1.2.0

2015-01-07 Thread Davies Liu
Hey Sven, I tried with all of your configurations, 2 nodes with 2 executors each, but in standalone mode it worked fine. Could you try to narrow down which configuration change makes the difference? Davies

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Sven Krasser
Hey Davies, Here are some more details on a configuration that causes this error for me. Launch an AWS Spark EMR cluster as follows:

aws emr create-cluster --region us-west-1 --no-auto-terminate \
  --ec2-attributes KeyName=your-key-here,SubnetId=your-subnet-here \
  --bootstrap-actions Path=

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Davies Liu
I still cannot reproduce it with 2 nodes (4 CPUs). Your repro.py could be made faster (10 min instead of 22 min):

inpdata.map(lambda (pc, x): (x, pc=='p' and 2 or 1)).reduceByKey(lambda x, y: x|y).filter(lambda (x, pc): pc==3).collect()

(Also, no cache is needed anymore.) Davies
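For readers less familiar with this pattern, here is a minimal pure-Python sketch of the flag-OR logic in Davies's one-liner, using hypothetical sample data (the real inpdata comes from the Gist's data generator): each key is tagged 2 when pc == 'p' and 1 otherwise, the tags are OR-ed per key, and a final value of 3 means the key was seen with both tags.

```python
# Pure-Python sketch of the flag-OR pipeline (sample data is hypothetical).
from collections import defaultdict

inpdata = [('p', 'a'), ('q', 'a'), ('p', 'b'), ('q', 'c')]

# map: (pc, x) -> (x, 2) if pc == 'p' else (x, 1)
# reduceByKey: OR the flags per key
flags = defaultdict(int)
for pc, x in inpdata:
    flags[x] |= 2 if pc == 'p' else 1

# filter: keep keys whose combined flag is 3, i.e. seen with both tags
both = sorted(x for x, f in flags.items() if f == 3)
print(both)  # -> ['a']
```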

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Sven Krasser
The issue has been sensitive to the number of executors and input data size. I'm using 2 executors with 4 cores each, 25GB of memory, and 3800MB of YARN memory overhead. This fits onto Amazon r3 instance types. -Sven
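As a sketch, the executor layout described above might be passed to spark-submit roughly like this (the script name and deploy mode are placeholders, not taken from the thread; spark.yarn.executor.memoryOverhead is the Spark 1.x name for the YARN overhead setting):

```shell
# Sketch of the described configuration: 2 executors, 4 cores each,
# 25 GB executor heap, 3800 MB YARN memory overhead.
# "repro.py" refers to the repro script from the Gist; master/deploy
# mode are assumptions.
spark-submit \
  --master yarn-client \
  --num-executors 2 \
  --executor-cores 4 \
  --executor-memory 25g \
  --conf spark.yarn.executor.memoryOverhead=3800 \
  repro.py
```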

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Davies Liu
I ran your scripts on a 5-node (2 CPUs, 8G mem) cluster and cannot reproduce your failure. Should I test it with a big-memory node? Davies

Re: Shuffle Problems in 1.2.0

2015-01-05 Thread Sven Krasser
Thanks for the input! I've managed to come up with a repro of the error with test data only (and without any of the custom code in the original script); please see here: https://gist.github.com/skrasser/4bd7b41550988c8f6071#file-gistfile1-md The Gist contains a data generator and the script reproducing the error.

Re: Shuffle Problems in 1.2.0

2015-01-04 Thread Josh Rosen
It doesn't seem like there are a whole lot of clues to go on here without seeing the job code. The original "org.apache.spark.SparkException: PairwiseRDD: unexpected value: List([B@130dc7ad)" error suggests that maybe there's an issue with PySpark's serialization / tracking of types, but it's hard to say more without a reproduction.

Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Sven Krasser
Hey Josh, I am still trying to prune this to a minimal example, but it has been tricky since scale seems to be a factor. The job runs over ~720GB of data (the cluster's total RAM is around ~900GB, split across 32 executors). I've managed to run it over a vastly smaller data set without issues.

Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Josh Rosen
Hi Sven, Do you have a small example program that you can share which will allow me to reproduce this issue? If you have a workload that runs into this, you should be able to keep iteratively simplifying the job and reducing the data set size until you hit a fairly minimal reproduction.

Shuffle Problems in 1.2.0

2014-12-30 Thread Sven Krasser
Hey all, Since upgrading to 1.2.0, a PySpark job that worked fine in 1.1.1 fails during shuffle. I've tried reverting from the sort-based shuffle back to the hash one, and that fails as well. Does anyone see similar problems or have an idea on where to look next? For the sort-based shuffle I get an error like: org.apache.spark.SparkException: PairwiseRDD: unexpected value: List([B@130dc7ad)
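For reference, switching between the two shuffle implementations in Spark 1.2 is controlled by spark.shuffle.manager (the sort-based shuffle became the default in 1.2.0). A sketch, with a placeholder script name:

```shell
# Sketch: revert from the 1.2.0 default sort-based shuffle to the
# hash-based shuffle used in 1.1.x. "your_job.py" is a placeholder,
# not a script from the thread.
spark-submit \
  --conf spark.shuffle.manager=hash \
  your_job.py
```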