Re: lost executor due to large shuffle spill memory
Shuffle will always spill the local dataset to disk. Changing memory settings does nothing to alter this, so you need to set spark.local.dir to point at a fast disk with enough capacity.

> On Apr 6, 2016, at 12:32 PM, Lishu Liu wrote:
> [...]
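For reference, a minimal sketch of pointing spark.local.dir at a dedicated fast volume (the mount path /mnt/spark-local is a placeholder; the directory must exist on every node, and a cluster manager's own local-dir setting or SPARK_LOCAL_DIRS takes precedence over this property if set):

from pyspark import SparkConf, SparkContext

# Placeholder mount point for a fast local disk with room for the spill.
# spark.local.dir also accepts a comma-separated list of directories.
conf = (SparkConf()
        .setAppName("als-uuid-remap")
        .set("spark.local.dir", "/mnt/spark-local"))
sc = SparkContext(conf=conf)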
Re: lost executor due to large shuffle spill memory
Thanks Michael. I use 5 m3.2xlarge nodes. Should I increase spark.storage.memoryFraction? Also I'm thinking maybe I should repartition all_pairs so that each partition will be small enough to be handled.

> On Tue, Apr 5, 2016 at 8:03 PM, Michael Slavitch wrote:
> [...]
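If repartitioning all_pairs is the route tried, a rough sketch (the partition count 2000 is a placeholder to tune so each partition stays comfortably small):

all_pairs = (user.map(lambda x: x[1])
             .cartesian(product.map(lambda x: x[1]))
             .repartition(2000))  # placeholder count; more, smaller partitions mean less spill per task
all_prediction = model.predictAll(all_pairs)

Note that repartition adds a shuffle of its own, so it trades some extra I/O for a smaller per-task working set.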
Re: lost executor due to large shuffle spill memory
Do you have enough disk space for the spill? It seems it has lots of memory reserved but not enough for the spill. You will need a disk that can handle the entire data partition for each host. Compression of the spilled data saves about 50% in most if not all cases.

Given the large data set I would consider a 1TB SATA flash drive, formatted as EXT4 or XFS, and give it exclusive access as spark.local.dir. It will slow things down, but it won’t stop. There are alternatives if you want to discuss offline.

> On Apr 5, 2016, at 6:37 PM, l wrote:
> [...]
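Shuffle output and spill compression are already on by default, but they can be set explicitly alongside the dedicated drive; a sketch (the mount path is a placeholder for wherever that drive is mounted):

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.local.dir", "/mnt/spill-ssd")       # placeholder mount for the dedicated drive
        .set("spark.shuffle.compress", "true")          # compress shuffle output files
        .set("spark.shuffle.spill.compress", "true"))   # compress data spilled during shuffles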
lost executor due to large shuffle spill memory
I have a task to remap the index to the actual uuid in ALS prediction results, but it consistently fails due to lost executors. I noticed there's a large shuffle spill memory, but I don't know how to improve it.

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26683/24.png>

I've tried to reduce the number of executors while assigning each one more memory:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26683/31.png>

But it still doesn't seem to be enough, and I don't know what else to do.

Below is my code:

user = load_user()
product = load_product()
user.cache()
product.cache()
model = load_model(model_path)
all_pairs = user.map(lambda x: x[1]).cartesian(product.map(lambda x: x[1]))
all_prediction = model.predictAll(all_pairs)
user_reverse = user.map(lambda r: (r[1], r[0]))
product_reverse = product.map(lambda r: (r[1], r[0]))
user_reversed = all_prediction.map(lambda u: (u[0], (u[1], u[2]))).join(user_reverse).map(lambda r: (r[1][0][0], (r[1][1], r[1][0][1])))
both_reversed = user_reversed.join(product_reverse).map(lambda r: (r[1][0][0], r[1][1], r[1][0][1]))
both_reversed.map(lambda x: '{}|{}|{}'.format(x[0], x[1], x[2])).saveAsTextFile(recommendation_path)

Both user and product are (uuid, index) tuples.
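If the (uuid, index) tables are small enough to collect on the driver — an assumption — the two joins above can be replaced with broadcast lookups, which removes most of the shuffle. A rough sketch:

# Assumes the user and product (uuid, index) RDDs fit in driver memory as dicts.
user_by_index = sc.broadcast(user.map(lambda r: (r[1], r[0])).collectAsMap())
product_by_index = sc.broadcast(product.map(lambda r: (r[1], r[0])).collectAsMap())

def to_line(p):
    # p is a Rating(user, product, rating) from predictAll
    return '{}|{}|{}'.format(user_by_index.value[p[0]],
                             product_by_index.value[p[1]],
                             p[2])

all_prediction.map(to_line).saveAsTextFile(recommendation_path)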