In my experience, using many small instances hurts: more data has to be
transferred (less of it is local to any given computation), and the
instances have lower I/O performance.

On the high end, super-big instances become counter-productive because
they are not as cheap on the spot market -- and you should definitely be
using the spot market for everything but your master.

m1.xlarge is a good default. EMR's default config allows 3 reducers per
instance, so set your parallelism to at least 3 times the number of
workers you run.
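As a minimal sketch of that rule of thumb (the cluster size, bucket paths, and flag values here are made-up examples; --numReducers and --maxNGramSize are the seq2sparse options from the original post):

```shell
#!/bin/sh
# Rule of thumb: at least 3 reducers per worker instance.
WORKERS=6                       # example: number of core instances
REDUCERS=$((WORKERS * 3))
echo "$REDUCERS"                # prints 18

# Then pass it to the job (paths below are placeholders):
# mahout seq2sparse -i s3://my-bucket/seqfiles -o s3://my-bucket/vectors \
#   --numReducers $REDUCERS --maxNGramSize 2
```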


If you can get away with computing on one machine, without Hadoop, do
so. Distributing via Hadoop tends to cost 5x as much in computing
resources, or more. And you can rent amazingly huge machines in the
cloud.

There's still a point past which the data won't fit on one machine, or
it's not economical -- the huge EC2 instances are expensive and not on
the spot market. But one machine may be big enough for a lot of problems.




On Thu, Jan 24, 2013 at 2:01 PM, Matti Kokkola <matti.kokk...@iki.fi> wrote:
>
> Hi,
>
> I'm using Mahout to vectorize and cluster data consisting of short
> texts. So far I have done vectorizing on a single multi-core machine
> and been quite happy with the results. However, now we are doing a
> lot of small adjustments to increase the quality of results and thus
> would like to tighten the feedback loop, i.e. get vectors more quickly.
>
> Does anyone have good reference setup for Amazon EMR configuration for such
> a task? I tried with 6 m1.small instances, but terminated the job after 24
> hrs, because I thought there is something wrong with the setup. I pretty
> much followed the guides in Mahout wiki for the basic setup.
>
> In the test case, my seq file size was 50MB and previous seq2sparse runs
> have resulted around 400k vectors from that data.
>
> Rest of the configuration was as follows:
> - mahout v0.7
> - 6 instances, instance type default (m1.small)
> - numReducers 6
> - maxNGramsize 2
>
> Does this sound right (24 hrs and more to come...) for the given data size?
> How much improvement should I expect if I use m1.large instances instead?
> Any other recommendations? :-)
>
> br, Matti
