Re: BlockManager issues
Actually, I hit a similar issue when doing groupByKey and then count, if the shuffle size is big, e.g. 1 TB.

Thanks,
Zhan Zhang
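P.S. Roughly the shape of the job that hits it (a sketch, not my exact code; the input path and tab-separated key format are hypothetical):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._

  object GroupByKeyCount {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("GroupByKeyCount"))

      // Key each record on its first tab-separated field.
      val pairs = sc.textFile("hdfs:///data/large-input")
        .map(line => (line.split("\t")(0), line))

      // groupByKey shuffles every record to the reducer holding its key,
      // so the shuffle is on the order of the input size (here, ~1 TB).
      println("distinct keys: " + pairs.groupByKey().count())

      sc.stop()
    }
  }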
Re: BlockManager issues
Hey all,

We also hit the same problem described by Nishkam, in almost the same big-data setting. We fixed the fetch failures by increasing the timeout for acks in the driver:

  conf.set("spark.core.connection.ack.wait.timeout", "600") // 10-minute timeout for acks between nodes

Cheers,
Christoph
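P.S. For anyone copy-pasting, spelled out against a full SparkConf it looks like this (a minimal sketch; the app name is a placeholder, and the value is in seconds):

  import org.apache.spark.{SparkConf, SparkContext}

  // Raise the ack wait timeout (60 seconds by default in 1.1, if I
  // remember right) to 10 minutes, so slow shuffle fetches under heavy
  // GC or disk load aren't declared failed prematurely.
  val conf = new SparkConf()
    .setAppName("MyJob")
    .set("spark.core.connection.ack.wait.timeout", "600")
  val sc = new SparkContext(conf)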
Re: BlockManager issues
I've run into this with large shuffles - I assumed there was contention for memory between the shuffle output files and the JVM. Whenever we start getting these fetch failures, it corresponds with high load on the machines the blocks are being fetched from, and in some cases complete unresponsiveness (no ssh, etc.). Setting the timeout higher, or the JVM heap lower (as a percentage of total machine memory), seemed to help.
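P.S. Concretely, what we changed was along these lines (a sketch only; the 48 GB node size and the 50% split are illustrative numbers, not measured recommendations):

  import org.apache.spark.SparkConf

  // On a hypothetical 48 GB node: cap the executor heap well below total
  // RAM so the OS keeps memory for page cache and shuffle output files.
  val conf = new SparkConf()
    .setAppName("MyJob")                                  // placeholder name
    .set("spark.executor.memory", "24g")                  // ~50% of machine RAM
    .set("spark.core.connection.ack.wait.timeout", "600") // higher ack timeout, in seconds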
BlockManager issues
Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the workloads. Tried tracing the problem through changeset analysis. The offending commit looks like 4fde28c from Aug 4th, for PR 1707. Please see SPARK-3633 for more details.

Thanks,
Nishkam
Re: BlockManager issues
It seems like you just need to raise the ulimit?
Re: BlockManager issues
Hey, the numbers you mentioned don't quite line up - did you mean PR 2711?
Re: BlockManager issues
Ah, I see - it was SPARK-2711 (and PR 1707). In that case, it's possible that you are just seeing more spilling as a result of the patch, and so the filesystem is opening more files. I would try increasing the ulimit. How much memory do your executors have?

- Patrick
Re: BlockManager issues
Thanks for the quick follow-up, Reynold and Patrick. Tried a run with a significantly higher ulimit; it doesn't seem to help. The executors have 35 GB each.

Btw, with a recent version of the branch, the error message is fetch failures as opposed to too many open files. Not sure if they are related. Please note that the workload runs fine with HEAD set to 066765d.

In case you want to reproduce the problem: I'm running a slightly modified ScalaPageRank (with KryoSerializer and persistence level MEMORY_AND_DISK_SER) on a 30 GB input dataset and a 6-node cluster.

Thanks,
Nishkam
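P.S. The modified PageRank is essentially the bundled example with Kryo and serialized persistence switched on; roughly like this (a sketch, not the exact code - the iteration count and "srcURL neighborURL" input format follow the stock example):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._
  import org.apache.spark.storage.StorageLevel

  object ModifiedPageRank {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("ModifiedPageRank")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      val sc = new SparkContext(conf)

      // Adjacency list, persisted serialized so it can spill to disk.
      val links = sc.textFile(args(0))
        .map { line =>
          val parts = line.split("\\s+")
          (parts(0), parts(1))
        }
        .distinct()
        .groupByKey()
        .persist(StorageLevel.MEMORY_AND_DISK_SER)

      var ranks = links.mapValues(_ => 1.0)
      for (i <- 1 to 10) {
        // Each page splits its rank evenly among its outgoing links.
        val contribs = links.join(ranks).values.flatMap {
          case (urls, rank) => urls.map(url => (url, rank / urls.size))
        }
        ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
      }

      ranks.take(10).foreach(println)
      sc.stop()
    }
  }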