Hi Andrew, if I understand you correctly, are you saying we should balance
the time when GC runs against the batch time in which each RDD is processed?

On Fri, Apr 24, 2015 at 6:58 AM Reza Zadeh <r...@databricks.com> wrote:

> Hi Andrew,
>
> The computePrincipalComponents feature of RowMatrix is currently
> constrained to tall-and-skinny matrices. Your matrix, at 10k columns, is
> just beyond the "skinny" requirement, though the number of rows is fine.
>
> What are you looking to do with the principal components? If unnormalized
> PCA is OK for your application, you can instead run RowMatrix.computeSVD
> and use the resulting 'V' matrix in the same way you would the principal
> components. The computeSVD method can handle square matrices, so it
> should be able to handle yours.
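>
> A minimal sketch of that approach (assuming your data is already an
> RDD[Vector] named rows; the value of k is an arbitrary placeholder):
>
> import org.apache.spark.mllib.linalg.Matrix
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>
> // rows: RDD[Vector] with ~200k rows and ~10k columns (assumed name)
> val mat = new RowMatrix(rows)
>
> // Compute the top k singular values/vectors; computeU = false because
> // only V is needed for the unnormalized-PCA use case.
> val k = 50
> val svd = mat.computeSVD(k, computeU = false)
>
> // The columns of the local matrix V play the role of (unnormalized)
> // principal components.
> val V: Matrix = svd.V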
>
> Reza
> On Thu, Apr 23, 2015 at 4:11 PM, aleverentz <andylevere...@fico.com>
> wrote:
>
>> [My apologies if this is a re-post.  I wasn't subscribed the first time I
>> sent this message, and I'm hoping this second message will get through.]
>>
>> I’ve been using Spark 1.3.0 and MLlib for some machine learning tasks.
>> In a fit of blind optimism, I decided to try running MLlib’s Principal
>> Components Analysis (PCA) on a dataset with approximately 10,000 columns
>> and 200,000 rows.
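>>
>> (For concreteness, the call I'm running is essentially the following
>> sketch, with placeholder variable names:)
>>
>> import org.apache.spark.mllib.linalg.Vector
>> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>> import org.apache.spark.rdd.RDD
>>
>> // features: RDD[Vector], ~200,000 rows by ~10,000 columns
>> val mat = new RowMatrix(features)
>>
>> // Ask for the top 100 components; the exact k here is illustrative.
>> val pc = mat.computePrincipalComponents(100)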
>>
>> The Spark job has been running for about 5 hours on a small cluster, and
>> it has been stuck on a particular job ("treeAggregate at
>> RowMatrix.scala:119") for most of that time.  The treeAggregate job is
>> now on "retry 5", and after each failure it seems that the next retry
>> uses a smaller number of tasks.  (Initially, there were around 80 tasks;
>> later it was down to 50, then 42; now it’s down to 16.)  The web UI
>> shows the following error under "failed stages":
>> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
>> output location for shuffle 1".
>>
>> This raises a few questions:
>>
>> 1. What does "missing an output location for shuffle 1" mean?  I’m
>> guessing this cryptic error message is indicative of some more
>> fundamental problem (out of memory? out of disk space?), but I’m not
>> sure how to diagnose it.
>>
>> 2. Why do subsequent retries use fewer and fewer tasks?  Does this mean
>> that the algorithm is actually making progress?  Or is the scheduler
>> just performing some kind of repartitioning and starting over from
>> scratch?  (Also, if the algorithm is in fact making progress, should I
>> expect it to finish eventually?  Or do repeated failures generally
>> indicate that the cluster is too small to perform the given task?)
>>
>> 3. Is it reasonable to expect that I could get PCA to run on this
>> dataset using the same cluster simply by changing some configuration
>> parameters (see the sketch after this list for the kinds of settings I
>> have in mind)?  Or is a larger cluster with significantly more
>> resources per node the only way around this problem?
>>
>> 4. In general, are there any tips for diagnosing performance issues
>> like the one above?  I've spent some time trying to get a few different
>> algorithms to scale to larger and larger datasets, and whenever I run
>> into a failure, I'd like to be able to identify the bottleneck that is
>> preventing further scaling.  Any general advice for doing that kind of
>> detective work would be much appreciated.
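>>
>> For question 3, here is a sketch of the kinds of settings I have in
>> mind (the values are placeholders, not ones I have actually tried):
>>
>> import org.apache.spark.SparkConf
>>
>> // Placeholder values; appropriate settings depend on the cluster.
>> val conf = new SparkConf()
>>   .set("spark.executor.memory", "8g")          // more heap per executor
>>   .set("spark.default.parallelism", "400")     // more, smaller tasks
>>   .set("spark.shuffle.memoryFraction", "0.4")  // more room for shuffles
>>   .set("spark.rdd.compress", "true")           // trade CPU for memory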
>>
>> Thanks,
>>
>> ~ Andrew
>>
