Running on a real cluster significantly increases the amount of work done
compared to a single node: now, data actually has to be transferred on and
off the machine!

Amazon EMR workers, in my experience, are bottlenecked on I/O. I am not sure
what instance type you are using, but I got better mileage when I used larger
instances (and more workers per instance, of course; EMR sets that up for
you too).

You may have trouble extrapolating correctly from the time it takes to hit
1%, as there are setup costs while the instances spin up. Try letting it run
a bit longer to see how fast it really goes.
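
To illustrate why the first 1% can mislead, here is a minimal Python sketch
that compares a naive extrapolation from one progress reading with one based
on the rate between two readings, which cancels out any fixed startup time.
The function names and the second sample (5% at 20 minutes) are hypothetical;
only the "10 minutes for 1%" figure comes from your message.

```python
def naive_stage_minutes(elapsed_min, percent_done):
    """Extrapolate from a single reading; startup time is counted as work."""
    if percent_done <= 0:
        raise ValueError("need some progress to extrapolate")
    return elapsed_min * 100.0 / percent_done


def steady_state_stage_minutes(t1, p1, t2, p2):
    """Extrapolate from the rate between two readings (minutes, percent).

    A fixed startup delay shifts both readings equally, so it drops out.
    """
    if t2 <= t1 or p2 <= p1:
        raise ValueError("second reading must be later and further along")
    return (t2 - t1) * 100.0 / (p2 - p1)


# Reported figure: 1% after 10 minutes.
naive = naive_stage_minutes(10, 1)

# Hypothetical second reading, for illustration only: 5% after 20 minutes.
steady = steady_state_stage_minutes(10, 1, 20, 5)

print(f"naive estimate:        {naive:.0f} minutes per stage")
print(f"steady-state estimate: {steady:.0f} minutes per stage")
```

If the two estimates come out close, startup cost is not the explanation and
the slowness is real.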

Are you saying you extrapolate that it would take 1 EMR machine 1000 minutes
to finish? That sounds quite reasonable compared to 300 minutes locally. If
you mean all 20 machines together are taking 1000 minutes to finish, that
sounds quite bad.
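
The arithmetic behind the two readings, as a rough Python sketch. It assumes
perfectly linear scaling across machines, which a real cluster will not
achieve; the variable names are mine.

```python
LOCAL_MINUTES_PER_STAGE = 300   # 5 hours on a single node
EMR_MINUTES_PER_STAGE = 1000    # extrapolated from 10 minutes per 1%
MACHINES = 20

# Reading 1: one EMR machine alone would need 1000 minutes.
# With ideal scaling, 20 machines would then finish a stage in:
ideal_wall_clock = EMR_MINUTES_PER_STAGE / MACHINES

# Reading 2: the whole cluster needs 1000 minutes of wall-clock time.
# Total machine time spent per stage is then:
cluster_machine_minutes = EMR_MINUTES_PER_STAGE * MACHINES
work_ratio = cluster_machine_minutes / LOCAL_MINUTES_PER_STAGE

print(f"reading 1: about {ideal_wall_clock:.0f} minutes per stage on the cluster")
print(f"reading 2: {cluster_machine_minutes} machine-minutes per stage, "
      f"about {work_ratio:.0f}x the local run")
```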


On Tue, Sep 6, 2011 at 8:35 AM, Chris Lu <c...@atypon.com> wrote:

> Hi,
>
> I am running LDA on 18k documents, each document has 5k terms. total 300k
> terms. Topics is set to 100.
>
> Running LDA on Hadoop single node configuration takes about 5 hours per
> stage. And 20 stages would take 100 hours.
>
> However, given 20 machines, running on Amazon EMR is actually much much
> slower. It takes 1000 minutes per stage. (It takes about 10 minutes for 1%
> mapping progress.) Reducing is much faster is counted in seconds, almost
> neglect-able.
>
> Does anyone has similar experience or my setup is wrong?
>
> Chris
>
>
