Hi Reza,

I’m trying to identify groups of similar variables, with the ultimate goal of reducing the dimensionality of the dataset. I believe SVD would be sufficient for this. However, I also tried running RowMatrix.computeSVD and observed the same behavior: frequent task failures, with cryptic error messages along the lines of “Missing an output location for shuffle.” Having some way to diagnose what’s really going on here would be helpful.

~ Andrew

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Thursday, April 23, 2015 4:58 PM
To: Andrew Leverentz
Cc: user
Subject: Re: Understanding Spark/MLlib failures

Hi Andrew,

The .principalComponents feature of RowMatrix is currently constrained to tall and skinny matrices. Your matrix is barely above the skinny requirement (10k columns), though the number of rows is fine.

What are you looking to do with the principal components? If unnormalized PCA is OK for your application, you can instead run RowMatrix.computeSVD, and use the 'V' matrix, which can be used the same way as the principal components. The computeSVD method can handle square matrices, so it should be able to handle your matrix.

Reza

On Thu, Apr 23, 2015 at 4:11 PM, aleverentz <andylevere...@fico.com> wrote:

[My apologies if this is a re-post. I wasn't subscribed the first time I sent this message, and I'm hoping this second message will get through.]

I’ve been using Spark 1.3.0 and MLlib for some machine learning tasks. In a fit of blind optimism, I decided to try running MLlib’s Principal Components Analysis (PCA) on a dataset with approximately 10,000 columns and 200,000 rows.

The Spark job has been running for about 5 hours on a small cluster, and it has been stuck on a particular job ("treeAggregate at RowMatrix.scala:119") for most of that time. The treeAggregate job is now on "retry 5", and after each failure it seems that the next retry uses a smaller number of tasks. (Initially, there were around 80 tasks; later it was down to 50, then 42; now it’s down to 16.) The web UI shows the following error under "failed stages":

"org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1".

This raises a few questions:

1. What does "missing an output location for shuffle 1" mean? I’m guessing this cryptic error message is indicative of some more fundamental problem (out of memory? out of disk space?), but I’m not sure how to diagnose it.

2. Why do subsequent retries use fewer and fewer tasks? Does this mean that the algorithm is actually making progress? Or is the scheduler just performing some kind of repartitioning and starting over from scratch? (Also, if the algorithm is in fact making progress, should I expect it to finish eventually? Or do repeated failures generally indicate that the cluster is too small to perform the given task?)

3. Is it reasonable to expect that I could get PCA to run on this dataset using the same cluster simply by changing some configuration parameters? Or is a larger cluster with significantly more resources per node the only way around this problem?

4. In general, are there any tips for diagnosing performance issues like the one above? I've spent some time trying to get a few different algorithms to scale to larger and larger datasets, and whenever I run into a failure, I'd like to be able to identify the bottleneck that is preventing further scaling. Any general advice for doing that kind of detective work would be much appreciated.

Thanks,
~ Andrew

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-Spark-MLlib-failures-tp22641.html
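
Reza's suggestion to use the 'V' matrix from an SVD in place of the principal components can be illustrated on a toy matrix. The sketch below is plain Python, not Spark or MLlib code, and it rests on one assumption: that "unnormalized PCA" refers to taking the SVD without first subtracting each column's mean. The helper names and the four-row dataset are hypothetical and exist only for this illustration. It computes the first column of V as the dominant eigenvector of A^T A by power iteration, once on the raw data and once on mean-centered data.

```python
import math

def center_columns(rows):
    """Subtract each column's mean (the step a plain SVD does not perform)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def gramian(rows):
    """Return A^T A for a row-major matrix A."""
    cols = list(zip(*rows))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def top_eigenvector(sym, iters=500):
    """Dominant eigenvector of a symmetric PSD matrix via power iteration."""
    v = [1.0] * len(sym)
    for _ in range(iters):
        w = [sum(a * b for a, b in zip(row, v)) for row in sym]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def first_right_singular_vector(rows):
    """First column of V in A = U S V^T, i.e. the top eigenvector of A^T A."""
    return top_eigenvector(gramian(rows))

# Hypothetical toy data: two correlated columns plus one constant column.
data = [
    [2.0, 1.9, 10.0],
    [4.0, 4.1, 10.0],
    [6.0, 6.0, 10.0],
    [8.0, 8.0, 10.0],
]

v_raw = first_right_singular_vector(data)
v_centered = first_right_singular_vector(center_columns(data))

print("uncentered:", [round(x, 3) for x in v_raw])
print("centered:  ", [round(x, 3) for x in v_centered])
```

On centered data the leading direction ignores the constant third column and weights the two correlated columns about equally, which is what PCA would report. On raw data the large constant column pulls the leading direction toward itself. Under the assumption above, this is the kind of difference to keep in mind when substituting V for the principal components; centering the columns before the SVD removes it.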