Hi Andrew, so according to you, we should balance the time when GC runs against the batch time in which each RDD is processed?
On Fri, Apr 24, 2015 at 6:58 AM Reza Zadeh <r...@databricks.com> wrote:

> Hi Andrew,
>
> The .principalComponents feature of RowMatrix is currently constrained to
> tall and skinny matrices. Your matrix is barely above the skinny
> requirement (10k columns), though the number of rows is fine.
>
> What are you looking to do with the principal components? If unnormalized
> PCA is OK for your application, you can instead run RowMatrix.computeSVD
> and use the 'V' matrix, which can be used the same way as the principal
> components. The computeSVD method can handle square matrices, so it should
> be able to handle your matrix.
>
> Reza
>
> On Thu, Apr 23, 2015 at 4:11 PM, aleverentz <andylevere...@fico.com> wrote:
>
>> [My apologies if this is a re-post. I wasn't subscribed the first time I
>> sent this message, and I'm hoping this second message will get through.]
>>
>> I've been using Spark 1.3.0 and MLlib for some machine learning tasks.
>> In a fit of blind optimism, I decided to try running MLlib's Principal
>> Components Analysis (PCA) on a dataset with approximately 10,000 columns
>> and 200,000 rows.
>>
>> The Spark job has been running for about 5 hours on a small cluster, and
>> it has been stuck on a particular job ("treeAggregate at
>> RowMatrix.scala:119") for most of that time. The treeAggregate job is
>> now on "retry 5", and after each failure it seems that the next retry
>> uses a smaller number of tasks. (Initially, there were around 80 tasks;
>> later it was down to 50, then 42; now it's down to 16.) The web UI shows
>> the following error under "failed stages":
>> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
>> output location for shuffle 1".
>>
>> This raises a few questions:
>>
>> 1. What does "missing an output location for shuffle 1" mean? I'm
>> guessing this cryptic error message is indicative of some more
>> fundamental problem (out of memory? out of disk space?), but I'm not
>> sure how to diagnose it.
>>
>> 2. Why do subsequent retries use fewer and fewer tasks? Does this mean
>> that the algorithm is actually making progress? Or is the scheduler just
>> performing some kind of repartitioning and starting over from scratch?
>> (Also, if the algorithm is in fact making progress, should I expect it
>> to finish eventually? Or do repeated failures generally indicate that
>> the cluster is too small to perform the given task?)
>>
>> 3. Is it reasonable to expect that I could get PCA to run on this
>> dataset using the same cluster simply by changing some configuration
>> parameters? Or is a larger cluster with significantly more resources per
>> node the only way around this problem?
>>
>> 4. In general, are there any tips for diagnosing performance issues like
>> the one above? I've spent some time trying to get a few different
>> algorithms to scale to larger and larger datasets, and whenever I run
>> into a failure, I'd like to be able to identify the bottleneck that is
>> preventing further scaling. Any general advice for doing that kind of
>> detective work would be much appreciated.
>>
>> Thanks,
>>
>> ~ Andrew
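For anyone following along, here is a minimal sketch of the workaround Reza
describes, written against the Spark 1.3 MLlib API. The RDD name `vectors`
and the tiny inline data are placeholders for illustration only; `sc` is the
usual spark-shell SparkContext. Note that, unlike computePrincipalComponents,
computeSVD does not mean-center the columns first, which is why Reza calls
the result unnormalized PCA.

import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Placeholder input: in Andrew's case this would be an RDD[Vector]
// with roughly 200,000 rows and 10,000 columns.
val vectors: RDD[Vector] = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)
))

val mat = new RowMatrix(vectors)

// Take the top-k right singular vectors instead of calling
// mat.computePrincipalComponents(k), which (per Reza) is currently
// constrained to tall and skinny matrices. computeSVD skips the
// mean-centering that true PCA performs, hence "unnormalized" PCA.
val k = 2
val svd = mat.computeSVD(k, computeU = false)
val v: Matrix = svd.V  // n x k local matrix; columns play the role of PCs

// Project the rows onto the top-k subspace, analogous to projecting
// onto the principal components.
val projected: RowMatrix = mat.multiply(v)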