Re: Shuffle Problems in 1.2.0

2015-01-07 Thread Sven Krasser
Could you try it on AWS using EMR? That'd give you an exact replica of the environment that causes the error.

Sent from my iPhone

On Jan 7, 2015, at 10:53 AM, Davies Liu dav...@databricks.com wrote:
Hey Sven, I tried with all of your configurations, 2 nodes with 2 executors each, but in

Re: Shuffle Problems in 1.2.0

2015-01-07 Thread Davies Liu
Hey Sven, I tried with all of your configurations, 2 nodes with 2 executors each, but in standalone mode, and it worked fine. Could you try to narrow down which configuration change matters? Davies

On Tue, Jan 6, 2015 at 8:03 PM, Sven Krasser kras...@gmail.com wrote:
Hey Davies, Here are some

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Davies Liu
I ran your scripts on a 5-node (2 CPUs, 8G mem) cluster and cannot reproduce your failure. Should I test it with a big-memory node?

On Mon, Jan 5, 2015 at 4:00 PM, Sven Krasser kras...@gmail.com wrote:
Thanks for the input! I've managed to come up with a repro of the error with test data only

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Davies Liu
I still cannot reproduce it with 2 nodes (4 CPUs). Your repro.py could be faster (10 min) than before (22 min):

inpdata.map(lambda (pc, x): (x, pc=='p' and 2 or 1)).reduceByKey(lambda x, y: x|y).filter(lambda (x, pc): pc==3).collect()

(Also, no cache is needed anymore.) Davies

On Tue, Jan 6,
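The one-liner above tags each key with a bit flag (2 for 'p' records, 1 for everything else), ORs the flags per key, and keeps keys whose combined mask is 3, i.e. keys that appeared with both record types. It uses Python 2 tuple-parameter lambdas, which Python 3 removed; a plain-Python sketch of the same flag logic (inpdata here is toy sample data, not the thread's real input):

```python
from collections import defaultdict

# Toy stand-in for the RDD of (pc, x) pairs; pc is 'p' or something else.
inpdata = [('p', 'a'), ('q', 'a'), ('p', 'b'), ('q', 'c')]

# map step: (pc, x) -> (x, 2 if pc == 'p' else 1)
mapped = [(x, 2 if pc == 'p' else 1) for pc, x in inpdata]

# reduceByKey step with OR: combine the bit flags per key
masks = defaultdict(int)
for x, flag in mapped:
    masks[x] |= flag

# filter step: keep keys seen with both flags (mask == 2 | 1 == 3)
both = sorted(k for k, m in masks.items() if m == 3)
# 'a' is the only key that appears with both a 'p' and a non-'p' record
```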

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Sven Krasser
Hey Davies, Here are some more details on a configuration that causes this error for me. Launch an AWS Spark EMR cluster as follows:

aws emr create-cluster --region us-west-1 --no-auto-terminate \
  --ec2-attributes KeyName=your-key-here,SubnetId=your-subnet-here \
  --bootstrap-actions

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Sven Krasser
The issue has been sensitive to the number of executors and input data size. I'm using 2 executors with 4 cores each, 25GB of memory, and 3800MB of memory overhead for YARN. This will fit onto Amazon r3 instance types. -Sven

On Tue, Jan 6, 2015 at 12:46 AM, Davies Liu dav...@databricks.com wrote:
I
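A sketch of how those numbers map onto spark-submit flags on YARN (flag and property names as in Spark 1.2; the application path is a placeholder, not from the thread):

```shell
# Hypothetical invocation matching the described layout:
# 2 executors x 4 cores, 25GB heap each, 3800MB YARN overhead
spark-submit \
  --master yarn-client \
  --num-executors 2 \
  --executor-cores 4 \
  --executor-memory 25g \
  --conf spark.yarn.executor.memoryOverhead=3800 \
  your_job.py
```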

Re: Shuffle Problems in 1.2.0

2015-01-05 Thread Sven Krasser
Thanks for the input! I've managed to come up with a repro of the error with test data only (and without any of the custom code in the original script); please see here: https://gist.github.com/skrasser/4bd7b41550988c8f6071#file-gistfile1-md The Gist contains a data generator and the script

Re: Shuffle Problems in 1.2.0

2015-01-04 Thread Josh Rosen
It doesn't seem like there are a whole lot of clues to go on here without seeing the job code. The original "org.apache.spark.SparkException: PairwiseRDD: unexpected value: List([B@130dc7ad)" error suggests that maybe there's an issue with PySpark's serialization / tracking of types, but it's

Shuffle Problems in 1.2.0

2014-12-30 Thread Sven Krasser
Hey all, Since upgrading to 1.2.0, a pyspark job that worked fine in 1.1.1 fails during shuffle. I've tried reverting from the sort-based shuffle back to the hash one, and that fails as well. Does anyone see similar problems or have an idea on where to look next? For the sort-based shuffle I get a
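For context, Spark 1.2.0 changed the default shuffle implementation from hash-based to sort-based; reverting, as described above, is a single setting, e.g. in spark-defaults.conf or via --conf on spark-submit:

```shell
# "sort" is the Spark 1.2 default; "hash" reverts to the pre-1.2 shuffle
spark.shuffle.manager  hash
```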

Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Josh Rosen
Hi Sven, Do you have a small example program that you can share which will allow me to reproduce this issue? If you have a workload that runs into this, you should be able to keep iteratively simplifying the job and reducing the data set size until you hit a fairly minimal reproduction (assuming
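The shrink-until-minimal advice above can be automated delta-debugging style. A toy sketch, where the hypothetical run_job stands in for the real pipeline (here it "fails" whenever the input contains the value 7, purely so the loop has something to find):

```python
def run_job(data):
    # Hypothetical stand-in for the failing job: returns True on success.
    # It "fails" whenever the input contains 7, for demonstration only.
    return 7 not in data

def shrink(data):
    # Repeatedly try dropping half the input; keep any half that still fails.
    while len(data) > 1:
        half = data[:len(data) // 2]
        other = data[len(data) // 2:]
        if not run_job(half):
            data = half
        elif not run_job(other):
            data = other
        else:
            break  # neither half alone reproduces the failure
    return data

minimal = shrink(list(range(20)))  # shrinks toward the failure-triggering input
```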

Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Sven Krasser
Hey Josh, I am still trying to prune this to a minimal example, but it has been tricky since scale seems to be a factor. The job runs over ~720GB of data (the cluster's total RAM is around ~900GB, split across 32 executors). I've managed to run it over a vastly smaller data set without issues.