[ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Emmanuel Joke updated NUTCH-530:
--------------------------------

    Attachment: NUTCH-530.patch

Patch provided. It reduced the processing time by 20%.
Output from the task:

Map output records=98317
Map input bytes=10907058
Map output bytes=10021579
Combine input records=98317
Combine output records=42390
Reduce input groups=28601
Reduce input records=43005
Reduce output records=28601

I can see a real improvement.

> Add a combiner to improve performance on updatedb
> -------------------------------------------------
>
>                 Key: NUTCH-530
>                 URL: https://issues.apache.org/jira/browse/NUTCH-530
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: java 1.6
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-530.patch
>
>
> We have a lot of similar links with status "linked" generated at the output of
> the map task when we try to update the crawldb based on the segment fetched.
> We can use a combiner to improve the performance.
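
For readers unfamiliar with the idea, here is a minimal sketch of what a combiner does to the map output described in this issue. It is not the code from NUTCH-530.patch (the patch contents are not shown in this message) and it does not use the Hadoop or Nutch APIs. The class name CombinerSketch, the helper entry(), and the rule "keep at most one linked record per URL" are assumptions made purely for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: this is NOT the code in NUTCH-530.patch.
class CombinerSketch {
    static final String LINKED = "linked";

    // Hypothetical stand-in for one map output record: (URL, status).
    static Map.Entry<String, String> entry(String url, String status) {
        return new AbstractMap.SimpleEntry<>(url, status);
    }

    // Local "combine" step: runs on one map task's output before it is
    // sent to the reducers. Assumed rule: repeated "linked" records for the
    // same URL carry no extra information, so keep only the first one.
    static List<Map.Entry<String, String>> combine(List<Map.Entry<String, String>> mapOutput) {
        Map<String, List<String>> byUrl = new LinkedHashMap<>();
        for (Map.Entry<String, String> record : mapOutput) {
            List<String> statuses = byUrl.get(record.getKey());
            if (statuses == null) {
                statuses = new ArrayList<>();
                byUrl.put(record.getKey(), statuses);
            }
            boolean duplicateLinked =
                LINKED.equals(record.getValue()) && statuses.contains(LINKED);
            if (!duplicateLinked) {
                statuses.add(record.getValue());
            }
        }
        // Flatten the groups back into a list of (URL, status) records.
        List<Map.Entry<String, String>> combined = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : byUrl.entrySet()) {
            for (String status : group.getValue()) {
                combined.add(entry(group.getKey(), status));
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        // Made-up sample records, not data from the task output above.
        List<Map.Entry<String, String>> mapOutput = Arrays.asList(
            entry("http://a.example/", LINKED),
            entry("http://a.example/", LINKED),
            entry("http://b.example/", LINKED),
            entry("http://a.example/", "fetched"),
            entry("http://b.example/", LINKED));
        List<Map.Entry<String, String>> combined = combine(mapOutput);
        System.out.println(mapOutput.size() + " records in, "
            + combined.size() + " records out");
    }
}
```

In a real Hadoop job the framework may run the combiner zero or more times, so a combiner has to be written so that the reducer produces the same result with or without it. The sketch only shows the record-count reduction, which in the counters above corresponds to 98317 combine input records becoming 42390 combine output records.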