Use Combiner in LinkDb to increase speed of linkdb generation
-------------------------------------------------------------
Key: NUTCH-498
URL: https://issues.apache.org/jira/browse/NUTCH-498
Project: Nutch
Issue Type: Improvement
Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Priority: Minor
I tried adding the following combiner to LinkDb:
{code}
// Nested in LinkDb.java; requires java.io.IOException, java.util.Iterator,
// org.apache.hadoop.io.WritableComparable and org.apache.hadoop.mapred.*.
public static class LinkDbCombiner extends MapReduceBase implements Reducer {

  // Counts records folded into another record by this combiner.
  public static enum Counters { COMBINED }

  private int _maxInlinks;

  @Override
  public void configure(JobConf job) {
    super.configure(job);
    _maxInlinks = job.getInt("db.max.inlinks", 10000);
  }

  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    // Accumulate into the first value, merging every later value's inlinks.
    final Inlinks inlinks = (Inlinks) values.next();
    int combined = 0;
    while (values.hasNext()) {
      Inlinks val = (Inlinks) values.next();
      for (Iterator it = val.iterator(); it.hasNext();) {
        if (inlinks.size() >= _maxInlinks) {
          // Cap reached: emit the truncated record and drop the rest.
          output.collect(key, inlinks);
          return;
        }
        inlinks.add((Inlink) it.next());
      }
      combined++;
    }
    if (inlinks.size() == 0) {
      return;
    }
    if (combined > 0) {
      reporter.incrCounter(Counters.COMBINED, combined);
    }
    output.collect(key, inlinks);
  }
}
{code}
This greatly reduced the time it took to generate a new linkdb; in my case it
cut the time roughly in half.
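The merge rule the combiner applies can be sketched in plain Java, independent
of Hadoop. This is an illustrative stand-in, not Nutch code: `String` stands
in for `Inlink`, and `combine` mimics the reduce loop above, concatenating the
per-key inlink lists and truncating once the `db.max.inlinks` cap is reached.

```java
import java.util.ArrayList;
import java.util.List;

// Hadoop-free sketch of the combiner's merge rule (names are illustrative).
public class CombineSketch {

    static List<String> combine(List<List<String>> values, int maxInlinks) {
        List<String> merged = new ArrayList<String>();
        for (List<String> val : values) {
            for (String inlink : val) {
                if (merged.size() >= maxInlinks) {
                    return merged; // cap reached: emit what we have so far
                }
                merged.add(inlink);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> values = new ArrayList<List<String>>();
        values.add(List.of("a", "b"));
        values.add(List.of("c", "d", "e"));
        // Five inlinks spread over two map output records collapse into one.
        System.out.println(combine(values, 10000).size()); // prints 5
        // With a cap of 3, the merged record is truncated at the limit.
        System.out.println(combine(values, 3).size());     // prints 3
    }
}
```

Because many map output records share the same key, collapsing them locally
before the shuffle is what saves the sort and transfer time.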
|Map output records|8717810541|
|Combined|7632541507|
|Resulting output records|1085269034|
That's an 87% reduction of output records from the map phase.
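As a quick check of that figure, using the counts from the table above:

```java
// Verifies the reduction percentage and the resulting record count.
public class ReductionCheck {
    public static void main(String[] args) {
        long mapOutput = 8717810541L; // map output records
        long combined  = 7632541507L; // records folded away by the combiner
        // fraction of map output records eliminated before the shuffle
        double reduction = 100.0 * combined / mapOutput;
        System.out.printf("%.1f%%%n", reduction);    // prints 87.6%
        System.out.println(mapOutput - combined);    // prints 1085269034
    }
}
```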
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers