[
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Espen Amble Kolstad updated NUTCH-498:
--------------------------------------
Description:
I tried to add the following combiner to LinkDb:
public static enum Counters {COMBINED}
public static class LinkDbCombiner extends MapReduceBase implements Reducer {
  private int _maxInlinks;
  @Override
  public void configure(JobConf job) {
    super.configure(job);
    // Upper bound on the number of inlinks kept per URL.
    _maxInlinks = job.getInt("db.max.inlinks", 10000);
  }
  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    // Merge every further value for this key into the first one.
    final Inlinks inlinks = (Inlinks) values.next();
    int combined = 0;
    while (values.hasNext()) {
      Inlinks val = (Inlinks) values.next();
      for (Iterator it = val.iterator(); it.hasNext();) {
        // Stop as soon as the limit is reached and emit what we have.
        if (inlinks.size() >= _maxInlinks) {
          if (combined > 0) {
            reporter.incrCounter(Counters.COMBINED, combined);
          }
          output.collect(key, inlinks);
          return;
        }
        Inlink in = (Inlink) it.next();
        inlinks.add(in);
      }
      // One more map output record folded into the first.
      combined++;
    }
    if (inlinks.size() == 0) {
      return;
    }
    if (combined > 0) {
      reporter.incrCounter(Counters.COMBINED, combined);
    }
    output.collect(key, inlinks);
  }
}
This greatly reduced the time it took to generate a new linkdb. In my case it
reduced the time by half.
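To illustrate why merging on the map side helps, here is a minimal sketch of the same capped merge in plain Java, without any Hadoop or Nutch classes. It is an illustration only: `CombinerSketch` and `mergeCapped` are hypothetical names, inlinks are modeled as plain strings, and unlike the combiner above it copies into a fresh list instead of reusing the first value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CombinerSketch {

    /**
     * Merges all inlink groups for one key into a single list,
     * stopping once maxInlinks entries have been collected.
     * (Hypothetical helper; not part of Nutch.)
     */
    public static List<String> mergeCapped(Iterator<List<String>> values,
                                           int maxInlinks) {
        List<String> merged = new ArrayList<>();
        while (values.hasNext()) {
            for (String inlink : values.next()) {
                if (merged.size() >= maxInlinks) {
                    return merged; // limit reached: emit what we have
                }
                merged.add(inlink);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three map output records for the same key (made-up URLs).
        List<List<String>> mapOutput = Arrays.asList(
            Arrays.asList("http://a.example/", "http://b.example/"),
            Arrays.asList("http://c.example/"),
            Arrays.asList("http://d.example/"));

        // All three records collapse into a single record for the reducer.
        System.out.println(mergeCapped(mapOutput.iterator(), 10000));

        // With a small limit the merge stops early.
        System.out.println(mergeCapped(mapOutput.iterator(), 2));
    }
}
```

In this sketch three records for one key become one, which is the effect the counters in this issue measure: fewer records have to be sorted, spilled and shuffled to the reducers.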
Map output records        8717810541
Combined                  7632541507
Resulting output records  1085269034
That's an 87% reduction of output records from the map phase.
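The arithmetic behind these figures can be checked with a few lines of Java. This is only a sketch; `CombinerStats` and `reductionPercent` are hypothetical names, and the inputs are the counter values reported in this issue.

```java
public class CombinerStats {

    /** Share of map output records removed by the combiner, in percent. */
    public static double reductionPercent(long mapOutputRecords, long combined) {
        return 100.0 * combined / mapOutputRecords;
    }

    public static void main(String[] args) {
        long mapOutputRecords = 8717810541L; // "Map output records"
        long combined = 7632541507L;         // "Combined"

        // Records left after combining: 8717810541 - 7632541507 = 1085269034
        long resulting = mapOutputRecords - combined;
        System.out.println("Resulting output records: " + resulting);

        // The ratio lands between 87% and 88%.
        System.out.printf("Reduction: %.1f%%%n",
            reductionPercent(mapOutputRecords, combined));
    }
}
```

The difference between the first two counters equals the third, so the three figures are consistent with each other.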
was:
I tried to add the following combiner to LinkDb
{code}
public static class LinkDbCombiner extends MapReduceBase implements Reducer {
  private int _maxInlinks;
  @Override
  public void configure(JobConf job) {
    super.configure(job);
    _maxInlinks = job.getInt("db.max.inlinks", 10000);
  }
  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    final Inlinks inlinks = (Inlinks) values.next();
    int combined = 0;
    while (values.hasNext()) {
      Inlinks val = (Inlinks) values.next();
      for (Iterator it = val.iterator(); it.hasNext();) {
        if (inlinks.size() >= _maxInlinks) {
          output.collect(key, inlinks);
          return;
        }
        Inlink in = (Inlink) it.next();
        inlinks.add(in);
      }
      combined++;
    }
    if (inlinks.size() == 0) {
      return;
    }
    if (combined > 0) {
      reporter.incrCounter(Counters.COMBINED, combined);
    }
    output.collect(key, inlinks);
  }
}
{code}
This greatly reduced the time it took to generate a new linkdb. In my case it
reduced the time by half.
|Map output records|8717810541|
|Combined|7632541507|
|Resulting output records|1085269034|
That's an 87% reduction of output records from the map phase.
> Use Combiner in LinkDb to increase speed of linkdb generation
> -------------------------------------------------------------
>
> Key: NUTCH-498
> URL: https://issues.apache.org/jira/browse/NUTCH-498
> Project: Nutch
> Issue Type: Improvement
> Components: linkdb
> Affects Versions: 0.9.0
> Reporter: Espen Amble Kolstad
> Priority: Minor