If a worker is thrashing on something, there's a good chance it will miss a
status report.
As Ted brought up, there may be some very dense areas of your input
causing this problem.
How much memory are you giving to your Hadoop workers?  The default
value is rather small.
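Something like this job-setup sketch, assuming Hadoop 1.x property names
(the 4 GB heap is just a placeholder to tune against your slots per node):

    // Sketch only: raise the per-task child JVM heap; the default
    // (-Xmx200m) is easy to exhaust with large factorization blocks.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class HeapConfigSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.child.java.opts", "-Xmx4096m");  // per-task heap
        Job job = new Job(conf, "parallel-als");
        // ... set mapper/reducer classes and input/output paths, then submit ...
      }
    }

The same property can also be set cluster-wide in mapred-site.xml, or passed
as -Dmapred.child.java.opts=-Xmx4096m on the command line if the driver goes
through ToolRunner.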

-Kate

On Wed, Feb 1, 2012 at 6:23 PM, Nicholas Kolegraff
<nickkolegr...@gmail.com> wrote:
> Thanks for the prompt reply Kate!
>
> The cluster has since been torn down on EC2, but I did monitor it during
> the job execution and all seemed to be OK.  The JobTracker and NameNode
> continued to report status.
>
> I was aware of the configuration setting and was hoping to refrain from
> playing with it :-)  I'm wary of setting it too high, since that extra time
> could get unnecessarily charged to my EC2 account. :S
>
> Do you know if a task should still report status in the middle of a complex
> computation?  It seems odd that it wouldn't at least send a friendly hello?
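> From what I can tell, a mapper has to call context.progress() (or write
> output / bump a counter) periodically, or the JobTracker assumes it has
> hung.  A hypothetical sketch of that, not Mahout's actual
> SolveExplicitFeedback code:
>
>     import java.io.IOException;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Mapper;
>
>     // Hypothetical long-running mapper; the names are made up for illustration.
>     public class LongSolveMapper
>         extends Mapper<LongWritable, Text, LongWritable, Text> {
>       @Override
>       protected void map(LongWritable key, Text value, Context context)
>           throws IOException, InterruptedException {
>         for (int i = 0; i < 1000; i++) {
>           expensiveSolveStep(value);  // stand-in for a long linear solve
>           context.progress();         // tells the framework the task is alive
>         }
>         context.write(key, value);
>       }
>
>       private void expensiveSolveStep(Text value) {
>         // placeholder: the real per-row work would go here
>       }
>     }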
>
> On Wed, Feb 1, 2012 at 4:58 PM, Kate Ericson <eric...@cs.colostate.edu> wrote:
>
>> Hi,
>>
>> This *may* just be a Hadoop issue - it sounds like the JobTracker is
>> upset that it hasn't heard from one of the workers in too long (over
>> 600 seconds).
>> Can you check your Hadoop Administration pages for the cluster?  Does
>> the cluster still seem to be functioning?
>> I haven't used Hadoop with EC2, so I'm not sure how difficult it will
>> be to check the cluster :-/
>> If everything seems to be OK, there's a Hadoop setting to modify how
>> long it's willing to wait before assuming a machine has failed and
>> killing a task.
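>> If memory serves, the property is mapred.task.timeout, and its default of
>> 600000 ms matches the "600 seconds" in your error.  A sketch of raising it
>> from job-setup code (the 30-minute value is just an example):
>>
>>     // Sketch only: wait 30 minutes instead of 10 before killing a silent task.
>>     import org.apache.hadoop.conf.Configuration;
>>
>>     public class TimeoutSketch {
>>       public static Configuration withLongerTimeout() {
>>         Configuration conf = new Configuration();
>>         conf.setLong("mapred.task.timeout", 1800000L);  // default is 600000
>>         return conf;
>>       }
>>     }
>>
>> It should also work to pass -Dmapred.task.timeout=1800000 on the command
>> line, as long as the driver goes through ToolRunner (I believe the Mahout
>> jobs do).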
>>
>>
>> -Kate
>>
>> On Wed, Feb 1, 2012 at 5:48 PM, Nicholas Kolegraff
>> <nickkolegr...@gmail.com> wrote:
>> > Hello,
>> > I am attempting to run parallelALS on a very large matrix on EC2.
>> > The matrix is ~8 million x 1 million and very sparse: only .007% of the
>> > entries have data.
>> > I am attempting to run on 8 nodes with 34.2 GB of memory each (m2.2xlarge).
>> > (I kept getting OutOfMemory exceptions, so I kept upping the ante until I
>> > arrived at the above configuration.)
>> >
>> > It makes it through the following jobs no problem:
>> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0001_hadoop_ParallelALSFactorizationJob-ItemRatingVectorsMappe
>> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0002_hadoop_ParallelALSFactorizationJob-TransposeMapper-Reduce
>> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0003_hadoop_ParallelALSFactorizationJob-AverageRatingMapper-Re
>> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0004_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
>> > ....
>> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0023_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
>> >
>> > Then it crashes with only the following error message:
>> > Task attempt_201201311814_0023_m_000000_0 failed to report status for 600
>> > seconds. Killing!
>> >
>> > Each map attempt in the 23rd 'SolveExplicitFeedback' job fails to report its
>> > status?
>> >
>> > I'm not sure what is causing this -- I am still trying to wrap my head
>> > around the Mahout API.
>> >
>> > Could this still be a memory issue?
>> >
>> > Hopefully I'm not missing something trivial?!?!
>>
