Looks like an OOM issue?  Have you tried persisting your RDDs with a
storage level that allows spilling to disk (e.g. MEMORY_AND_DISK)?
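
Something along these lines might help (a minimal PySpark sketch; the
app name and input path are just placeholders):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="persist-example")

    # Hypothetical input; MEMORY_AND_DISK spills partitions that don't fit
    # in memory to local disk instead of recomputing them (or OOMing).
    lines = sc.textFile("hdfs:///data/input")
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    print(lines.count())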

I've seen a lot of similar crashes in a Spark app that reads from HDFS
and does joins: e.g. "java.io.IOException: Filesystem closed,"
"Executor lost," "FetchFailed," etc., all non-deterministic.  I've
tried persisting RDDs, tuning other params, and verifying that the
Executor JVMs don't come close to their max allocated memory during
operation.

Looking through user@ tonight, I see a ton of email threads with
similar crashes and no answers.  It looks like a lot of people are
struggling with OOMs.

Could one of the Spark committers please comment on this thread, or
one of the other unanswered threads with similar crashes?  Is this
simply how Spark behaves if Executors OOM?  What can the user do other
than increase memory or reduce RDD size?  (And how can one deduce how
much of either is needed?)

One general workaround for OOMs could be to programmatically break the
job input (e.g. files from HDFS, or the collection passed to
#parallelize()) into chunks, and only create/process the RDDs for one
chunk at a time.  However, this approach has the same limitations as
Spark Streaming and no formal library support.  What might be nice is
if, when tasks fail, Spark could try to re-partition in order to avoid
OOMs.
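
A rough sketch of what I mean (PySpark; the paths, chunk size, and
per-chunk work are all made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="chunked-job")

    # Hypothetical list of input paths; only one chunk's RDDs are
    # materialized at any given time.
    paths = ["hdfs:///data/part-%05d" % i for i in range(100)]
    chunk_size = 10

    for start in range(0, len(paths), chunk_size):
        # textFile() accepts a comma-separated list of paths
        chunk = sc.textFile(",".join(paths[start:start + chunk_size]))
        # do the per-chunk work and write results out before moving on
        chunk.map(lambda line: line.split(",")) \
             .saveAsTextFile("hdfs:///data/out/chunk-%d" % start)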



On Fri, Oct 3, 2014 at 2:55 AM, jamborta <jambo...@gmail.com> wrote:
> I have two nodes with 96G ram 16 cores, my setup is as follows:
>
>     conf = (SparkConf()
>             .setMaster("yarn-cluster")
>             .set("spark.executor.memory", "30G")
>             .set("spark.cores.max", 32)
>             .set("spark.executor.instances", 2)
>             .set("spark.executor.cores", 8)
>             .set("spark.akka.timeout", 10000)
>             .set("spark.akka.askTimeout", 100)
>             .set("spark.akka.frameSize", 500)
>             .set("spark.cleaner.ttl", 86400)
>             .set("spark.tast.maxFailures", 16)
>             .set("spark.worker.timeout", 150)
>
> thanks a lot,
>
>
>
>
>
