Re: IOException and appcache FileNotFoundException in Spark 1.02
Hello all - does anyone else have any suggestions? Even understanding where this error comes from would help a lot.
Re: IOException and appcache FileNotFoundException in Spark 1.02
You could be hitting this issue https://issues.apache.org/jira/browse/SPARK-3633 (or similar). You can try the following workarounds:

    sc.set("spark.core.connection.ack.wait.timeout", "600")
    sc.set("spark.akka.frameSize", "50")

Also reduce the number of partitions; you could be hitting the kernel's ulimit (you can check it with ulimit -n). I faced this issue and it went away when I dropped the partitions from 1600 to 200.

Thanks
Best Regards
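To be exact, those settings belong on the SparkConf before the SparkContext is created (the sc.set lines above are shorthand). A minimal sketch, assuming Scala and Spark 1.x; the app name is just a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-als-job")
      // Give slow shuffle/ack responses more time (seconds) before the
      // connection is declared dead
      .set("spark.core.connection.ack.wait.timeout", "600")
      // Raise the Akka frame size (MB) so larger task results/messages fit
      .set("spark.akka.frameSize", "50")
    val sc = new SparkContext(conf)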
Re: IOException and appcache FileNotFoundException in Spark 1.02
Thank you - I will try this. If I drop the partition count, am I not more likely to hit memory issues, especially since the dataset is rather large?
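Back of the envelope, this is what worries me: at roughly 150 GB of input, 1600 partitions works out to about 100 MB per partition, while 200 partitions is about 750 MB per partition, all of which has to fit in a single task's share of executor memory. (Rough arithmetic on my numbers, not a measurement.)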
Re: IOException and appcache FileNotFoundException in Spark 1.02
Hi Akhil - I tried your suggestions and tried varying my partition sizes. Reducing the number of partitions led to memory errors (presumably - I saw IOExceptions much sooner). With the settings you provided, the program ran longer but ultimately crashed in the same way. I would like to understand what is going on internally that leads to this. Could this be related to garbage collection?
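If it is GC, one thing I can try is enabling GC logging on the executors. A minimal sketch, assuming the standard extraJavaOptions setting; the HotSpot flags below are the usual ones, not something from my actual job:

    val conf = new SparkConf()
      // Print a line for every executor-side GC event, with details and timestamps,
      // into the executor stdout logs
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

Long or frequent full GC pauses in those logs would at least support the GC theory.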
IOException and appcache FileNotFoundException in Spark 1.02
Hi all – I could use some help figuring out a couple of exceptions I’ve been getting regularly.

I have been running on a fairly large dataset (150 GB). With smaller datasets I don't have any issues. My sequence of operations is as follows – unless otherwise specified, I am not caching:

1. Map a 30 million row x 70 column string table to approximately 30 million x 5 strings (for the textFile read I am using 1500 partitions).
2. From that, map to ((a, b), score) and reduceByKey, numPartitions = 180.
3. Extract the distinct values for A and the distinct values for B (I cache the output of distinct), numPartitions = 180.
4. Zip with index for A and for B (to remap the strings to ints).
5. Join the remapped ids with the original table.

This is then fed into MLlib's ALS algorithm. I am running with:

    Spark 1.0.2 with CDH 5.1
    numExecutors = 8, numCores = 14
    Memory = 12g
    MemoryFraction = 0.7
    Kryo serialization

My issue is that the code runs fine for a while but then crashes non-deterministically, either with file IOExceptions or with the following obscure error:

    14/10/08 13:29:59 INFO TaskSetManager: Loss was due to java.io.IOException: Filesystem closed [duplicate 10]
    14/10/08 13:30:08 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
    java.io.FileNotFoundException: /opt/cloudera/hadoop/1/yarn/nm/usercache/zjb238/appcache/application_1412717093951_0024/spark-local-20141008131827-c082/1c/shuffle_3_117_354 (No such file or directory)

Looking through the logs, I see the IOException in other places as well, but there it appears to be non-catastrophic. The FileNotFoundException, however, is. I have found the following Stack Overflow question that at least seems to address the IOException:
http://stackoverflow.com/questions/24038908/spark-fails-on-big-shuffle-jobs-with-java-io-ioexception-filesystem-closed
But I have not found anything useful at all with regard to the appcache error. Any help would be much appreciated.
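For reference, here is a schematic of the pipeline in Scala. This is a sketch, not my actual code: the path, delimiter, column indices, and ALS parameters are all placeholders.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // 1. Read the 30M x 70 table with 1500 partitions, keeping ~5 columns
    val rows = sc.textFile("hdfs:///path/to/table", 1500)
      .map(_.split(','))
      .map(f => (f(0), f(1), f(4)))          // a, b, score columns (illustrative)

    // 2. ((a, b), score), then reduceByKey with 180 partitions
    val scores = rows.map { case (a, b, s) => ((a, b), s.toDouble) }
      .reduceByKey(_ + _, 180)

    // 3/4. Distinct values of A and B (cached), zipWithIndex to remap to ints
    val aIds = scores.keys.map(_._1).distinct(180).cache().zipWithIndex()
    val bIds = scores.keys.map(_._2).distinct(180).cache().zipWithIndex()

    // 5. Join the remapped ids back onto the original table
    val ratings = scores.map { case ((a, b), s) => (a, (b, s)) }
      .join(aIds)
      .map { case (_, ((b, s), ai)) => (b, (ai, s)) }
      .join(bIds)
      .map { case (_, ((ai, s), bi)) => Rating(ai.toInt, bi.toInt, s) }

    // Feed into MLlib ALS (rank/iterations/lambda here are placeholders)
    val model = ALS.train(ratings, 10, 10, 0.01)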