Re: Understanding Spark/MLlib failures

2015-04-24 Thread Hoai-Thu Vuong
Hi Andrew, according to you, should we balance the time when GC runs against the
batch time in which the RDD is processed?


RE: Understanding Spark/MLlib failures

2015-04-24 Thread Andrew Leverentz
Hi Reza,

I’m trying to identify groups of similar variables, with the ultimate goal of 
reducing the dimensionality of the dataset.  I believe SVD would be sufficient 
for this, although I also tried running RowMatrix.computeSVD and observed the 
same behavior:  frequent task failures, with cryptic error messages along the 
lines of “Missing an output location for shuffle.”  Having some way to diagnose 
what’s really going on here would be helpful.

~ Andrew


RE: Understanding Spark/MLlib failures

2015-04-24 Thread Andrew Leverentz
Hi Burak,

Thanks for this insight.  I’m curious to know: how did you reach the conclusion 
that GC pauses were to blame?  I’d like to gather some more diagnostic 
information to determine whether or not I’m facing a similar scenario.
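
One way to gather that kind of evidence (a sketch on my part, not something prescribed in this thread) would be to watch the per-task GC Time column in the Spark UI and to turn on verbose GC logging on the executors, for example:

  import org.apache.spark.SparkConf

  // Sketch: enable verbose GC logging on the executors so that long full-GC
  // pauses show up in the executor stdout logs (reachable from the Executors
  // tab of the UI). The JVM flags are the standard HotSpot ones for JDK 7/8;
  // spark.executor.extraJavaOptions is a regular Spark setting and could also
  // be passed via spark-submit --conf.
  val conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
      "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")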

~ Andrew


Re: Understanding Spark/MLlib failures

2015-04-23 Thread Burak Yavuz
Hi Andrew,

I observed similar behavior under high GC pressure when running ALS. What
happened in my case was that there would be very long full GC pauses (over 600
seconds at times). These would prevent the executors from sending heartbeats
to the driver, so the driver would assume the executor had died and kill it.
The scheduler would then see
`org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 1` or `Fetch Failed` and reschedule the work on a
different executor.

The remaining executors would then get even more overloaded, causing them to
GC more often, and new jobs would be launched with even smaller tasks. Because
the executors were being killed by the driver, new jobs with the same name
(and fewer tasks) would be launched. It usually led to a spiral of death,
where executors were constantly being killed and the stage was never
completed, only restarted with different numbers of tasks.

Some configuration parameters that helped me through this process were:

spark.executor.memory                  // decrease the executor memory so that full GCs take less time, though they become more frequent
spark.executor.heartbeatInterval       // this I set at 60 for 600 seconds (10-minute GC!!)
spark.core.connection.ack.wait.timeout // another timeout to set
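
A minimal sketch of how these might be wired up when building the SparkContext; the app name, memory size, and timeout values are illustrative placeholders rather than values from this thread, and the units (milliseconds for the heartbeat interval, seconds for the ack timeout) are my reading of the 1.x configuration docs:

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative values only -- tune for the cluster at hand. The same
  // settings can also be passed with spark-submit --conf key=value.
  val conf = new SparkConf()
    .setAppName("pca-experiment")
    .set("spark.executor.memory", "8g")                   // smaller heaps => shorter (but more frequent) full GCs
    .set("spark.executor.heartbeatInterval", "600000")    // tolerate long GC pauses before the driver gives up (~10 min, in ms)
    .set("spark.core.connection.ack.wait.timeout", "600") // widen the ack timeout as well (in seconds)

  val sc = new SparkContext(conf)  // the master URL is expected to come from spark-submit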

Hope these parameters help you. I haven't directly answered your questions,
but there are bits and pieces in there that are hopefully helpful.

Best,
Burak


Re: Understanding Spark/MLlib failures

2015-04-23 Thread Reza Zadeh
Hi Andrew,

The .principalComponents feature of RowMatrix is currently constrained to
tall and skinny matrices. Your matrix is barely above the skinny
requirement (10k columns), though the number of rows is fine.

What are you looking to do with the principal components? If unnormalized
PCA is OK for your application, you can instead run RowMatrix.computeSVD
and use the 'V' matrix in the same way you would the principal components.
The computeSVD method can handle square matrices, so it should be able to
handle your matrix.
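
A minimal sketch of that approach (the data below is a tiny placeholder, `sc` is assumed to be an existing SparkContext as in spark-shell, and k would be chosen much larger for a 10k-column matrix):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // Tiny placeholder data; in the case above each vector would have ~10k entries.
  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 0.0),
    Vectors.dense(2.0, 0.0, 1.0),
    Vectors.dense(0.0, 1.0, 3.0),
    Vectors.dense(4.0, 1.0, 2.0)))

  val mat = new RowMatrix(rows)

  // Keep the top k right singular vectors; computeU = false avoids materializing U.
  val k = 2
  val svd = mat.computeSVD(k, computeU = false)

  // svd.V is a local (numCols x k) matrix whose columns play the role of the
  // (unnormalized) principal components; projecting onto them reduces the
  // dimensionality of the original rows.
  val reduced: RowMatrix = mat.multiply(svd.V)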

Reza

On Thu, Apr 23, 2015 at 4:11 PM, aleverentz andylevere...@fico.com wrote:

 [My apologies if this is a re-post.  I wasn't subscribed the first time I
 sent this message, and I'm hoping this second message will get through.]

 I’ve been using Spark 1.3.0 and MLlib for some machine learning tasks.  In a
 fit of blind optimism, I decided to try running MLlib’s Principal Components
 Analysis (PCA) on a dataset with approximately 10,000 columns and 200,000
 rows.

 The Spark job has been running for about 5 hours on a small cluster, and it
 has been stuck on a particular job (treeAggregate at RowMatrix.scala:119)
 for most of that time.  The treeAggregate job is now on retry 5, and after
 each failure it seems that the next retry uses a smaller number of tasks.
 (Initially, there were around 80 tasks; later it was down to 50, then 42;
 now it’s down to 16.)  The web UI shows the following error under failed
 stages:  org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
 output location for shuffle 1.

 This raises a few questions:

 1. What does missing an output location for shuffle 1 mean?  I’m guessing
 this cryptic error message is indicative of some more fundamental problem
 (out of memory? out of disk space?), but I’m not sure how to diagnose it.

 2. Why do subsequent retries use fewer and fewer tasks?  Does this mean that
 the algorithm is actually making progress?  Or is the scheduler just
 performing some kind of repartitioning and starting over from scratch?
 (Also, if the algorithm is in fact making progress, should I expect it to
 finish eventually?  Or do repeated failures generally indicate that the
 cluster is too small to perform the given task?)

 3. Is it reasonable to expect that I could get PCA to run on this dataset
 using the same cluster simply by changing some configuration parameters?  Or
 is a larger cluster with significantly more resources per node the only way
 around this problem?

 4. In general, are there any tips for diagnosing performance issues like the
 one above?  I've spent some time trying to get a few different algorithms to
 scale to larger and larger datasets, and whenever I run into a failure, I'd
 like to be able to identify the bottleneck that is preventing further
 scaling.  Any general advice for doing that kind of detective work would be
 much appreciated.

 Thanks,

 ~ Andrew






 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-Spark-MLlib-failures-tp22641.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
