Hi,
I'm using Mahout to vectorize and cluster data consisting of short
texts. So far I have done vectorizing on a single multi-core machine
and been quite happy with the results. However, now we are doing a
lot of small adjustments to increase the quality of results and thus
would like to
In my experience, using many small instances hurts because more data is
transferred (less data is local to any given computation) and the
instances have lower I/O performance.
On the high end, super-big instances become counter-productive because
they are not as cheap on the spot market -- and