Hi Akhil - I tried your suggestions and tried varying my partition sizes.
Reducing the number of partitions seemed to cause memory errors: I saw
IOExceptions much sooner.

With the settings you provided the program ran for longer but ultimately
crashed in the same way. I would like to understand what is going on
internally that leads to this.

Could this be related to garbage collection?
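
On the ulimit point in the quoted message below, a rough sanity check is to compare the per-process open-file limit with an estimate of how many shuffle files one executor may hold open at once. The sketch below is plain Python, not Spark code. It assumes hash-based shuffle, where every running map task keeps one file open per reduce partition, and it assumes the 14 cores listed in the original post are per executor. The helper name `estimated_open_shuffle_files` is made up for illustration.

```python
import resource  # Unix-only standard library module


def estimated_open_shuffle_files(concurrent_tasks, reduce_partitions):
    """Rough upper bound on shuffle files one executor holds open at once.

    Assumption: one open file per (running map task, reduce partition),
    i.e. hash-based shuffle without file consolidation.
    """
    return concurrent_tasks * reduce_partitions


# Values taken from the job description in this thread.
# Assumption: 14 is the number of cores (concurrent tasks) per executor.
cores_per_executor = 14
reduce_partitions = 180

needed = estimated_open_shuffle_files(cores_per_executor, reduce_partitions)
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

print("estimated open shuffle files per executor:", needed)
print("soft / hard open-file limit on this machine:", soft_limit, hard_limit)

# RLIM_INFINITY means "no limit", so only compare against a finite limit.
if soft_limit != resource.RLIM_INFINITY and needed > soft_limit:
    print("estimate exceeds the soft limit; a higher ulimit -n or fewer "
          "partitions may be worth trying")
```

The limit that matters is the one on the YARN NodeManager hosts, so the script would need to run there, as the user that launches the containers, rather than on the driver machine.
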
On Oct 10, 2014 3:19 AM, "Akhil Das" <ak...@sigmoidanalytics.com> wrote:

> You could be hitting this issue
> <https://issues.apache.org/jira/browse/SPARK-3633> (or similar). You can
> try the following workarounds:
>
> sparkConf.set("spark.core.connection.ack.wait.timeout", "600")
> sparkConf.set("spark.akka.frameSize", "50")
> Also reduce the number of partitions; you could be hitting the kernel's
> ulimit. I faced this issue and it went away when I dropped the partitions
> from 1600 to 200.
>
> Thanks
> Best Regards
>
> On Fri, Oct 10, 2014 at 5:58 AM, Ilya Ganelin <ilgan...@gmail.com> wrote:
>
>> Hi all – I could use some help figuring out a couple of exceptions I’ve
>> been getting regularly.
>>
>> I have been running on a fairly large dataset (150 gigs). With smaller
>> datasets I don't have any issues.
>>
>> My sequence of operations is as follows – unless otherwise specified, I
>> am not caching:
>>
>> Map a 30 million row x 70 col string table to approx 30 mil x 5 string
>> (I read it with textFile using 1500 partitions)
>>
>> From that, map to ((a,b), score) and reduceByKey, numPartitions = 180
>>
>> Then, extract distinct values for A and distinct values for B (I cache
>> the output of distinct), numPartitions = 180
>>
>> Zip with index for A and for B (to remap strings to int)
>>
>> Join remapped ids with original table
>>
>> This is then fed into MLlib's ALS algorithm.
>>
>> I am running with:
>>
>> Spark version 1.02 with CDH5.1
>>
>> numExecutors = 8, numCores = 14
>>
>> Memory = 12g
>>
>> MemoryFraction = 0.7
>>
>> KryoSerialization
>>
>> My issue is that the code runs fine for a while but then will
>> non-deterministically crash with either file IOExceptions or the following
>> obscure error:
>>
>> 14/10/08 13:29:59 INFO TaskSetManager: Loss was due to
>> java.io.IOException: Filesystem closed [duplicate 10]
>>
>> 14/10/08 13:30:08 WARN TaskSetManager: Loss was due to
>> java.io.FileNotFoundException
>>
>> java.io.FileNotFoundException:
>> /opt/cloudera/hadoop/1/yarn/nm/usercache/zjb238/appcache/application_1412717093951_0024/spark-local-20141008131827-c082/1c/shuffle_3_117_354
>> (No such file or directory)
>>
>> Looking through the logs, I see the IOException in other places but it
>> appears to be non-catastrophic. The FileNotFoundException, however, is. I
>> have found the following Stack Overflow question that at least seems to address the
>> IOException:
>>
>>
>> http://stackoverflow.com/questions/24038908/spark-fails-on-big-shuffle-jobs-with-java-io-ioexception-filesystem-closed
>>
>> But I have not found anything useful at all with regard to the appcache
>> FileNotFoundException.
>>
>> Any help would be much appreciated.
>>
>
>
