Re: Understanding Spark/MLlib failures

Burak Yavuz Thu, 23 Apr 2015 16:49:12 -0700

Hi Andrew,

I observed similar behavior under high GC pressure when running ALS. In my
case there were very long Full GC pauses (over 600 seconds at times), which
prevented the executors from sending heartbeats to the driver. The driver
would then assume the executor had died and kill it. The scheduler would
look at the outputs and report
`org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 1` or `Fetch Failed`, then reschedule the tasks on a
different executor.


The executors that picked up the rescheduled work would then become even
more overloaded, causing them to GC more often. Because executors kept
being killed by the driver, new jobs with the same name (and fewer tasks)
would be launched. This usually led to a death spiral: executors were
constantly being killed, and the stage never completed but kept being
restarted with a different number of tasks.

Some configuration parameters that helped me through this process were:

spark.executor.memory  // decrease the executor memory so that Full GCs
                       // take less time, although they become more frequent
spark.executor.heartbeatInterval  // I set this to 600000, i.e. 600 seconds
                                  // (a 10-minute GC!!)
spark.core.connection.ack.wait.timeout  // another timeout worth raising
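
As a rough sketch, the three settings above could be turned into `--conf`
flags for spark-submit as shown below. Only the 600000 heartbeat value comes
from my setup; the `4g` executor memory and the 600-second ack timeout are
placeholder assumptions, and the helper names `seconds_to_millis` and
`to_conf_flags` are made up for illustration. The 600000 figure is in
milliseconds, and I believe the ack timeout is given in seconds in Spark
1.x, so check the configuration docs for your version.

```python
# Sketch only: build spark-submit --conf flags for the settings listed above.
# Helper names and the memory / ack-timeout values are illustrative assumptions.


def seconds_to_millis(seconds):
    """Convert seconds to whole milliseconds (600 s becomes 600000 ms)."""
    return int(seconds * 1000)


def to_conf_flags(settings):
    """Turn a dict of Spark properties into a flat list of --conf arguments."""
    flags = []
    for key, value in settings.items():
        flags.extend(["--conf", "{}={}".format(key, value)])
    return flags


settings = {
    # Assumed value: pick something smaller than your current executor heap.
    "spark.executor.memory": "4g",
    # From the message above: 600 seconds, written in milliseconds.
    "spark.executor.heartbeatInterval": seconds_to_millis(600),
    # Assumed value, in seconds: long enough to survive the same GC pause.
    "spark.core.connection.ack.wait.timeout": 600,
}

flags = to_conf_flags(settings)
print(" ".join(flags))
```

The printed flags can be pasted into a spark-submit command line, or the
same key/value pairs can go into spark-defaults.conf instead.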

Hope these parameters help you. I haven't directly answered your questions,
but there are bits and pieces in there that are hopefully helpful.

Best,
Burak


On Thu, Apr 23, 2015 at 4:11 PM, aleverentz <andylevere...@fico.com> wrote:

> [My apologies if this is a re-post.  I wasn't subscribed the first time I
> sent this message, and I'm hoping this second message will get through.]
>
> I’ve been using Spark 1.3.0 and MLlib for some machine learning tasks.  In
> a fit of blind optimism, I decided to try running MLlib’s Principal
> Components Analysis (PCA) on a dataset with approximately 10,000 columns
> and 200,000 rows.
>
> The Spark job has been running for about 5 hours on a small cluster, and it
> has been stuck on a particular job ("treeAggregate at RowMatrix.scala:119")
> for most of that time.  The treeAggregate job is now on "retry 5", and
> after each failure it seems that the next retry uses a smaller number of
> tasks.  (Initially, there were around 80 tasks; later it was down to 50,
> then 42; now it’s down to 16.)  The web UI shows the following error under
> "failed stages":  "org.apache.spark.shuffle.MetadataFetchFailedException:
> Missing an output location for shuffle 1".
>
> This raises a few questions:
>
> 1. What does "missing an output location for shuffle 1" mean?  I’m guessing
> this cryptic error message is indicative of some more fundamental problem
> (out of memory? out of disk space?), but I’m not sure how to diagnose it.
>
> 2. Why do subsequent retries use fewer and fewer tasks?  Does this mean
> that the algorithm is actually making progress?  Or is the scheduler just
> performing some kind of repartitioning and starting over from scratch?
> (Also, if the algorithm is in fact making progress, should I expect it to
> finish eventually?  Or do repeated failures generally indicate that the
> cluster is too small to perform the given task?)
>
> 3. Is it reasonable to expect that I could get PCA to run on this dataset
> using the same cluster simply by changing some configuration parameters?
> Or is a larger cluster with significantly more resources per node the only
> way around this problem?
>
> 4. In general, are there any tips for diagnosing performance issues like
> the one above?  I've spent some time trying to get a few different
> algorithms to scale to larger and larger datasets, and whenever I run into
> a failure, I'd like to be able to identify the bottleneck that is
> preventing further scaling.  Any general advice for doing that kind of
> detective work would be much appreciated.
>
> Thanks,
>
> ~ Andrew
