Thanks for the suggestions!

I finally managed to get Hadoop to parallelize the mapping processes.

I changed not only the "mapred.max.split.size" setting but also "dfs.block.size", because of how FileInputFormat.java computes the split size:

// From FileInputFormat.java: the split size is the block size,
// capped at maxSize and floored at minSize.
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
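
In case it helps anyone else, here is a minimal sketch of setting both values in the job's Configuration. The 16 MB figure and the class/method names are illustrative, not taken from my actual code; also note that dfs.block.size only applies when a file is written, so the input may need to be copied into HDFS again for it to take effect:

import org.apache.hadoop.conf.Configuration;

public class SplitSizeConfig {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Illustrative target: 16 MB splits so the input spreads across more mappers.
        long targetSplitBytes = 16L * 1024 * 1024;
        // Upper bound passed into computeSplitSize() as maxSize.
        conf.setLong("mapred.max.split.size", targetSplitBytes);
        // Block size is the other input to computeSplitSize(), so lower it to match.
        conf.setLong("dfs.block.size", targetSplitBytes);
        return conf;
    }
}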

Now it seems all the nodes are running in parallel!

Chris


On 09/06/2011 04:44 PM, Jake Mannix wrote:
On Tue, Sep 6, 2011 at 4:44 PM, Chris Lu <c...@atypon.com> wrote:

I see, thanks!

It seems this should be built into the Mahout LDA algorithms, since the input file is
usually not too large but really needs parallel mapping processes.


If your input is not large, running a multithreaded in-memory algorithm on a
relatively beefy box (16+ cores, enough RAM to fit your data + model + some
spare) will actually be *much* faster than putting the same data on a
cluster.

   -jake

