If the merge phase is what's taking a while, I can suggest two
parameter changes to help speed that up. (This is in addition to what
Sebastian said.)

First, I think it's useful to let it do a 100-way segment merge instead
of the default 10-way, or even more. This is controlled by
"io.sort.factor" in Hadoop.
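
Since Mahout's jobs run through Hadoop's ToolRunner, you can also pass
this as a generic option on the command line rather than changing code.
A sketch, with the job class and remaining arguments as placeholders
(the jar name here is just the Mahout 0.5 job jar):

  hadoop jar mahout-core-0.5-job.jar <job class> -Dio.sort.factor=100 <other args>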

Second, you probably want to let the combiner do more combining, to
reduce the number of records spilled and merged. The map-side sort
buffer size, "io.sort.mb", controls this: the larger the buffer, the
more records the Combiner sees before each spill. Since this job has a
Combiner, raising it is worthwhile; you could set it to up to half of
your worker heap or so.
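
For example, with your mapred.child.java.opts of 2048M, half the heap
would be 1024MB; that's also the cap applied in the code below because
of MAPREDUCE-2308, so io.sort.mb would come out to 1024 either way.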

Here's a section of code in RecommenderJob that configures all of this
automatically on a JobContext; if it works for you, we could include it
in this job too:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.JobContext;

  private static void setIOSort(JobContext job) {
    Configuration conf = job.getConfiguration();
    conf.setInt("io.sort.factor", 100);
    // Assume a 512MB child heap unless mapred.child.java.opts says otherwise
    int assumedHeapSize = 512;
    String javaOpts = conf.get("mapred.child.java.opts");
    if (javaOpts != null) {
      // Parse the -Xmx value, in MB or GB, out of the child JVM options
      Matcher m = Pattern.compile("-Xmx([0-9]+)([mMgG])").matcher(javaOpts);
      if (m.find()) {
        assumedHeapSize = Integer.parseInt(m.group(1));
        String megabyteOrGigabyte = m.group(2);
        if ("g".equalsIgnoreCase(megabyteOrGigabyte)) {
          assumedHeapSize *= 1024;
        }
      }
    }
    // Cap this at 1024MB now; see
    // https://issues.apache.org/jira/browse/MAPREDUCE-2308
    conf.setInt("io.sort.mb", Math.min(assumedHeapSize / 2, 1024));
    // For some reason the Merger doesn't report status for a long time;
    // increase timeout when running these jobs
    conf.setInt("mapred.task.timeout", 60 * 60 * 1000);
  }
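
For reference, a minimal usage sketch (a new-API Job is a JobContext,
so it can be passed straight in; the job name is a placeholder):

  Job job = new Job(new Configuration(), "RowSimilarityJob");
  setIOSort(job);
  // ... set mapper, combiner, reducer, input/output paths as usual ...
  job.waitForCompletion(true);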


2011/10/18 WangRamon <[email protected]>:
> Hi All, I'm running a recommender job on a Hadoop environment with about
> 600000 users and 2000000 items; the total number of user-pref records is
> about 66260000, and the data file is about 1GB in size. I found the
> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow,
> and I get a lot of logs like these in the mapper task output:
>
> 2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73
> 2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64
> 2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55
> 2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46
>
> Actually, I did find a similar question on the mailing list, e.g.
> http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%[email protected]%3E
> where Sebastian said something about using Mahout 0.5 in that thread, and
> yes, I'm using Mahout 0.5. However, there was no further discussion, so it
> would be great if you guys could share some ideas/suggestions here; that
> would be a big help to me. Thanks in advance.
>
> BTW, I have the following parameters already set in Hadoop:
>
> mapred.child.java.opts -> 2048M
> fs.inmemory.size.mb -> 200
> io.file.buffer.size -> 131072
>
> I have two servers, each with 32GB RAM. THANKS!
>
> Cheers,
> Ramon
