[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508748 ]
Hudson commented on NUTCH-498:
------------------------------

Integrated in Nutch-Nightly #131 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/])

> Use Combiner in LinkDb to increase speed of linkdb generation
> -------------------------------------------------------------
>
>                 Key: NUTCH-498
>                 URL: https://issues.apache.org/jira/browse/NUTCH-498
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 0.9.0
>            Reporter: Espen Amble Kolstad
>            Assignee: Doğacan Güney
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch
>
>
> I tried to add the following combiner to LinkDb:
>
>   public static enum Counters {COMBINED}
>
>   public static class LinkDbCombiner extends MapReduceBase implements Reducer {
>     private int _maxInlinks;
>
>     @Override
>     public void configure(JobConf job) {
>       super.configure(job);
>       _maxInlinks = job.getInt("db.max.inlinks", 10000);
>     }
>
>     public void reduce(WritableComparable key, Iterator values,
>         OutputCollector output, Reporter reporter) throws IOException {
>       final Inlinks inlinks = (Inlinks) values.next();
>       int combined = 0;
>       while (values.hasNext()) {
>         Inlinks val = (Inlinks) values.next();
>         for (Iterator it = val.iterator(); it.hasNext();) {
>           if (inlinks.size() >= _maxInlinks) {
>             if (combined > 0) {
>               reporter.incrCounter(Counters.COMBINED, combined);
>             }
>             output.collect(key, inlinks);
>             return;
>           }
>           Inlink in = (Inlink) it.next();
>           inlinks.add(in);
>         }
>         combined++;
>       }
>       if (inlinks.size() == 0) {
>         return;
>       }
>       if (combined > 0) {
>         reporter.incrCounter(Counters.COMBINED, combined);
>       }
>       output.collect(key, inlinks);
>     }
>   }
>
> This greatly reduced the time it took to generate a new linkdb. In my case it
> reduced the time by half.
>
>   Map output records    8717810541
>   Combined              7632541507
>   Resulting output rec  1085269034
>
> That's an 87% reduction of output records from the map phase.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
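For readers unfamiliar with combiners, here is a minimal plain-Java sketch of what the quoted LinkDbCombiner does on the map side: it folds several partial inlink lists for one URL into a single record, stops adding once the cap is reached, and counts how many records were folded away. It uses no Hadoop or Nutch classes. The names CombinerSketch, Result, combine and reductionPercent are hypothetical and exist only for this illustration; inlinks are modelled as plain strings. The last part only re-checks the counter figures reported in the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: plain Java, no Hadoop or Nutch types.
final class CombinerSketch {

    /** Merged inlinks plus the number of extra partial lists folded into them. */
    static final class Result {
        final List<String> inlinks;
        final int combined;

        Result(List<String> inlinks, int combined) {
            this.inlinks = inlinks;
            this.combined = combined;
        }
    }

    /**
     * Mirrors the control flow of the quoted reduce(): the first partial list
     * is taken as the base, later lists are appended until maxInlinks is hit.
     */
    static Result combine(List<List<String>> partials, int maxInlinks) {
        if (partials.isEmpty()) {
            return new Result(new ArrayList<>(), 0);
        }
        List<String> merged = new ArrayList<>(partials.get(0));
        int combined = 0;
        for (List<String> partial : partials.subList(1, partials.size())) {
            for (String inlink : partial) {
                if (merged.size() >= maxInlinks) {
                    return new Result(merged, combined); // cap reached, emit early
                }
                merged.add(inlink);
            }
            combined++; // one whole input record was folded into the base
        }
        return new Result(merged, combined);
    }

    /** Share of map output records removed by combining, in percent. */
    static double reductionPercent(long mapOutputRecords, long combinedRecords) {
        return 100.0 * combinedRecords / mapOutputRecords;
    }

    public static void main(String[] args) {
        List<List<String>> partials = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c"), Arrays.asList("d", "e"));

        Result all = combine(partials, 10);
        System.out.println(all.inlinks + " combined=" + all.combined);

        Result capped = combine(partials, 3);
        System.out.println(capped.inlinks + " combined=" + capped.combined);

        // Figures from the issue: output records = map output - combined.
        long mapOut = 8717810541L;
        long combined = 7632541507L;
        System.out.println(mapOut - combined);
        System.out.printf("%.2f%%%n", reductionPercent(mapOut, combined));
    }
}
```

In a real job the combiner would be registered on the JobConf and would operate on Inlinks values rather than string lists; the sketch only shows why each merged record reduces the number of records handed to the shuffle by one.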