You could be hitting this issue
<https://issues.apache.org/jira/browse/SPARK-3633> (or similar). You can
try the following workarounds (these are SparkConf settings, so set them
before creating the SparkContext):

sc.set("spark.core.connection.ack.wait.timeout","600")
sc.set("spark.akka.frameSize","50")
Also try reducing the number of partitions; a shuffle opens many files at
once, so you could be hitting the kernel's ulimit on open file descriptors.
I faced this issue, and it went away when I dropped the partition count
from 1600 to 200.
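
For example, a minimal sketch (rdd stands in for whichever RDD is being
shuffled):

// coalesce lowers the partition count without triggering a full shuffle,
// which also cuts the number of shuffle files created on disk
val fewer = rdd.coalesce(200)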

Thanks
Best Regards

On Fri, Oct 10, 2014 at 5:58 AM, Ilya Ganelin <ilgan...@gmail.com> wrote:

> Hi all – I could use some help figuring out a couple of exceptions I’ve
> been getting regularly.
>
> I have been running on a fairly large dataset (150 gigs). With smaller
> datasets I don't have any issues.
>
> My sequence of operations is as follows – unless otherwise specified, I am
> not caching:
>
> Map a 30 million row x 70 column string table down to approx 30 million x
> 5 string columns (for the read as textFile I am using 1500 partitions)
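>
> Roughly, with the delimiter and the kept column indices as guesses, and
> inputPath as a placeholder:
>
>   import org.apache.spark.SparkContext._   // pair-RDD implicits on 1.0.x
>
>   val rows = sc.textFile(inputPath, 1500)
>   val slim = rows
>     .map(_.split('\t'))
>     .map(r => Array(r(0), r(1), r(2), r(3), r(4)))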
>
> From that, map to ((a,b), score) and reduceByKey, numPartitions = 180
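>
> Continuing the sketch, assuming the first two kept columns are a and b
> and the last is the score:
>
>   val scored = slim
>     .map(r => ((r(0), r(1)), r(4).toDouble))
>     .reduceByKey(_ + _, 180)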
>
> Then, extract the distinct values for A and the distinct values for B (I
> cache the output of distinct), numPartitions = 180
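>
> Roughly:
>
>   val as = scored.map { case ((a, _), _) => a }.distinct(180).cache()
>   val bs = scored.map { case ((_, b), _) => b }.distinct(180).cache()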
>
> Zip with index for A and for B (to remap strings to int)
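>
> That is:
>
>   val aIds = as.zipWithIndex()   // RDD[(String, Long)]
>   val bIds = bs.zipWithIndex()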
>
> Join remapped ids with original table
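>
> Sketched as a two-step join, one per key side:
>
>   val remapped = scored
>     .map { case ((a, b), s) => (a, (b, s)) }.join(aIds)
>     .map { case (_, ((b, s), ai)) => (b, (ai, s)) }.join(bIds)
>     .map { case (_, ((ai, s), bi)) => (ai, bi, s) }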
>
> This is then fed into MLlib's ALS algorithm.
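>
> For example (the rank and iteration count below are placeholders):
>
>   import org.apache.spark.mllib.recommendation.{ALS, Rating}
>
>   val ratings = remapped.map { case (ai, bi, s) =>
>     Rating(ai.toInt, bi.toInt, s)
>   }
>   val model = ALS.train(ratings, 10, 10)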
>
> I am running with:
>
> Spark version 1.0.2 with CDH 5.1
>
> numExecutors = 8, numCores = 14
>
> Memory = 12g
>
> MemoryFraction = 0.7
>
> KryoSerialization
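>
> Assuming MemoryFraction refers to spark.storage.memoryFraction, in
> SparkConf terms these roughly correspond to:
>
>   val conf = new SparkConf()
>     .set("spark.executor.memory", "12g")
>     .set("spark.storage.memoryFraction", "0.7")
>     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>
> with the executor and core counts passed to spark-submit.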
>
> My issue is that the code runs fine for a while but then will
> non-deterministically crash with either file IOExceptions or the following
> obscure error:
>
> 14/10/08 13:29:59 INFO TaskSetManager: Loss was due to
> java.io.IOException: Filesystem closed [duplicate 10]
>
> 14/10/08 13:30:08 WARN TaskSetManager: Loss was due to
> java.io.FileNotFoundException
>
> java.io.FileNotFoundException:
> /opt/cloudera/hadoop/1/yarn/nm/usercache/zjb238/appcache/application_1412717093951_0024/spark-local-20141008131827-c082/1c/shuffle_3_117_354
> (No such file or directory)
>
> Looking through the logs, I see the IOException in other places, but it
> appears to be non-catastrophic. The FileNotFoundException, however, is. I
> have found the following Stack Overflow question that at least seems to
> address the IOException:
>
>
> http://stackoverflow.com/questions/24038908/spark-fails-on-big-shuffle-jobs-with-java-io-ioexception-filesystem-closed
>
> But I have not found anything useful at all with regard to the appcache
> error.
>
> Any help would be much appreciated.
>
