[jira] [Issue Comment Edited] (NUTCH-1074) topN is ignored with maxNumSegments

Robert Thomson (JIRA) Sun, 18 Sep 2011 00:56:35 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107369#comment-13107369 ]


Robert Thomson edited comment on NUTCH-1074 at 9/18/11 7:55 AM:
----------------------------------------------------------------

When generator.max.count is set, the Generator.Selector reduce function 
partitions records so that each segment contains at most generator.max.count 
entries per host.  The relative sizes of the resulting segments therefore 
depend on the distribution of hosts in the crawldb, and topN only limits the 
mean segment size, not the size of each individual segment.

If generator.max.count is not set, each segment will contain topN records.

Anyway, here's my fix.  When using generator.max.count, each segment will 
contain up to topN records with at most generator.max.count from any single 
host.

{code}
Index: src/java/org/apache/nutch/crawl/Generator.java
===================================================================
--- src/java/org/apache/nutch/crawl/Generator.java      (revision 1172165)
+++ src/java/org/apache/nutch/crawl/Generator.java      (working copy)
@@ -115,6 +115,7 @@
     private long limit;
     private long count;
     private HashMap<String,int[]> hostCounts = new HashMap<String,int[]>();
+    private int segCounts[];
     private int maxCount;
     private boolean byDomain = false;
     private Partitioner<Text,Writable> partitioner = new URLPartitioner();
@@ -155,6 +156,7 @@
       schedule = FetchScheduleFactory.getFetchSchedule(job);
       scoreThreshold = job.getFloat(GENERATOR_MIN_SCORE, Float.NaN);
       maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
+      segCounts = new int[maxNumSegments];
     }
 
     public void close() {}
@@ -269,6 +271,12 @@
           // increment hostCount
           hostCount[1]++;
 
+          // check if topN reached, select next segment if it is
+          while (segCounts[hostCount[0]-1] >= limit && hostCount[0] < maxNumSegments) {
+            hostCount[0]++;
+            hostCount[1] = 0;
+          }
+
           // reached the limit of allowed URLs per host / domain
           // see if we can put it in the next segment?
           if (hostCount[1] > maxCount) {
@@ -285,7 +293,11 @@
             }
           }
           entry.segnum = new IntWritable(hostCount[0]);
-        } else entry.segnum = new IntWritable(currentsegmentnum);
+          segCounts[hostCount[0]-1]++;
+        } else {
+          entry.segnum = new IntWritable(currentsegmentnum);
+          segCounts[currentsegmentnum-1]++;
+        }
 
         output.collect(key, entry);
{code}
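
To try the effect of the topN check without running a Hadoop job, the 
selection logic can be approximated in a small standalone class.  This is only 
a sketch, not Nutch code: the class name, method signature, and the array of 
host names used as input are hypothetical.  Two parts are assumptions rather 
than something shown in the diff: the body of the per-host limit branch (the 
diff only shows its first and last lines) and the overall cutoff after 
limit * maxNumSegments records.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Standalone sketch, NOT Nutch code. Variable names mirror the patch where
// possible; the class, method signature and input format are hypothetical.
class SegmentAssignSketch {

  /**
   * @param hosts host of each candidate record, in selection order
   * @param limit per-segment record limit (stands in for topN)
   * @param maxCount stands in for generator.max.count
   * @param maxNumSegments number of segments to fill
   * @param topNCheck true to apply the patch's "segment is full" loop
   * @return number of records assigned to each segment
   */
  static int[] assign(String[] hosts, long limit, int maxCount,
      int maxNumSegments, boolean topNCheck) {
    int[] segCounts = new int[maxNumSegments];
    // host -> {current segment (1-based), URLs counted for that segment}
    Map<String, int[]> hostCounts = new HashMap<String, int[]>();
    long kept = 0;

    for (String host : hosts) {
      // Assumption: selection stops after limit * maxNumSegments records.
      if (kept >= limit * maxNumSegments) break;

      int[] hostCount = hostCounts.get(host);
      if (hostCount == null) {
        hostCount = new int[] {1, 0};
        hostCounts.put(host, hostCount);
      }
      hostCount[1]++;

      // From the patch: if the host's current segment already holds
      // `limit` records, move the host on to the next segment.
      while (topNCheck && segCounts[hostCount[0] - 1] >= limit
          && hostCount[0] < maxNumSegments) {
        hostCount[0]++;
        hostCount[1] = 0;
      }

      // Assumption: per-host limit handling, elided in the diff context.
      if (hostCount[1] > maxCount) {
        if (hostCount[0] < maxNumSegments) {
          hostCount[0]++;
          hostCount[1] = 0;
        } else {
          continue; // no segment left for this host, skip the record
        }
      }
      segCounts[hostCount[0] - 1]++;
      kept++;
    }
    return segCounts;
  }

  public static void main(String[] args) {
    String[] hosts = {"h1", "h2", "h3", "h4", "h5"}; // five distinct hosts
    // limit = 2, maxCount = 10, maxNumSegments = 2
    System.out.println(Arrays.toString(assign(hosts, 2, 10, 2, false))); // [4, 0]
    System.out.println(Arrays.toString(assign(hosts, 2, 10, 2, true)));  // [2, 2]
  }
}
```

With the check disabled, every record lands in the first segment, which 
matches the topN * maxNumSegments symptom in the issue description.  With the 
check enabled, the same input is split evenly across both segments.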

      was (Author: robthomson):
    As far as I can tell, when generator.max.count is set, the 
Generator.Selector reduce function partitions records so that each segment 
contains up to the set number of entries per host.  The relative size of 
resulting segments will depend on the distribution of hosts in the crawldb.  
topN only limits the mean size of the segments.

If generator.max.count is not set, each segment will contain topN records.
  
> topN is ignored with maxNumSegments
> -----------------------------------
>
>                 Key: NUTCH-1074
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1074
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>             Fix For: 1.4
>
>
> When generating segments with topN and maxNumSegments, topN is not respected. 
> It looks like the first generated segment contains topN * maxNumSegments 
> URLs; at least, the number of map input records roughly matches.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
