[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587786#action_12587786 ]

Otis Gospodnetic commented on NUTCH-570:
----------------------------------------

Ned - are you still using this?  Still happy with it?  Did you make any updates 
to this since October 2007?

The logic of spreading URLs from the same host as far apart as possible 
makes sense to me, and I think it really is correct.  I tried this and nothing 
broke, though I have no easy way to verify that the patch actually does what you 
described.  I suspect my fetchlists of fewer than 10K URLs are too small for 
the improvement from this approach to be visible.
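For what it's worth, here is how I picture the host-spreading idea as a simple round-robin interleave.  This is just my own minimal sketch to illustrate the ordering goal - the class name and single-pass approach are mine, not the patch's; the actual patch does this with two map/reduce passes:

```java
import java.net.URL;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class HostSpread {
  // Round-robin interleave: take one URL per host each pass, so URLs
  // from the same host end up as far apart in the output as possible.
  public static List<String> spread(List<String> urls) throws Exception {
    Map<String, Queue<String>> byHost = new LinkedHashMap<String, Queue<String>>();
    for (String u : urls) {
      String host = new URL(u).getHost();
      Queue<String> q = byHost.get(host);
      if (q == null) {
        q = new ArrayDeque<String>();
        byHost.put(host, q);
      }
      q.add(u);
    }
    List<String> out = new ArrayList<String>(urls.size());
    while (!byHost.isEmpty()) {
      Iterator<Queue<String>> it = byHost.values().iterator();
      while (it.hasNext()) {
        Queue<String> q = it.next();
        out.add(q.poll());
        if (q.isEmpty()) it.remove();   // host exhausted, drop it
      }
    }
    return out;
  }
}
```

So three a.com URLs followed by one each from b.com and c.com would come out as a.com, b.com, c.com, a.com, a.com - which is the spacing the patch is after, as I understand it.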

Has anyone else tried this?

I think these are extra - not used anywhere and thus not needed:

+  public static final String UPDATE_WITHOUT_CHECKING_TIME = "crawl.check.time";
+  public static final String MAX_URLS_PER_HOST = "generate.max.per.host";

And in here we count on the host name in the URL already being 
normalized/lowercased elsewhere, so there is no need to lowercase it here, 
right?

+      try {
+        u = new URL(((Text)key).toString());
+        host = u.getHost();
+      } catch (MalformedURLException e) {
+        host = ((Text)key).toString();
+      }
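If that assumption about upstream normalization ever turns out not to hold, a defensive version could lowercase here - just a sketch, and the hostOf helper is hypothetical, not part of the patch:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

public class HostExtract {
  // Defensive variant of the snippet above: lowercase the host so that
  // grouping by host is case-insensitive even if upstream URL
  // normalization is skipped (hypothetical, not in the patch).
  public static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase(Locale.ROOT);
    } catch (MalformedURLException e) {
      // Fall back to the raw key, as the patch does.
      return url.toLowerCase(Locale.ROOT);
    }
  }
}
```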

And I see two LOG.warn calls that look like leftovers from 
debugging-through-logging:

+      LOG.warn("Current key: " + ((IntWritable)key).toString());
+      LOG.warn("current output: " + currUrl.toString());

Those should be changed to LOG.debug or they should be removed.
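If they are kept as debug output, the usual guarded form avoids the string concatenation cost when debug is off.  Sketch below using java.util.logging as a self-contained stand-in for the commons-logging LOG field - the guard is the point, not the logging framework:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class DebugLogDemo {
  // Stand-in for Nutch's commons-logging LOG; Level.FINE plays the
  // role of debug here.
  private static final Logger LOG = Logger.getLogger(DebugLogDemo.class.getName());

  static void report(Object key, Object currUrl) {
    // Equivalent to: if (LOG.isDebugEnabled()) { LOG.debug(...); }
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine("Current key: " + key);
      LOG.fine("current output: " + currUrl);
    }
  }
}
```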


> Improvement of URL Ordering in Generator.java
> ---------------------------------------------
>
>                 Key: NUTCH-570
>                 URL: https://issues.apache.org/jira/browse/NUTCH-570
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>            Reporter: Ned Rockson
>            Priority: Minor
>         Attachments: GeneratorDiff.out
>
>
> [Copied directly from my email to nutch-dev list]
> Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
> (50-100M at a time).  I found that the URLs generated are not optimal because 
> they are simply randomized by a hash comparator.  In one crawl on 24 machines 
> it took about 3 days to crawl 30M URLs.  In comparison with old benchmarks I 
> had set with regular Fetcher.java this was at least 3 fold more time.
> Anyway, I realized that the best situation for ordering can be approached by 
> randomization, but in order to get optimal ordering, URLs from the same host 
> should be as far apart in the list as possible.  So I wrote a series of 2 
> map/reduces to optimize the ordering and for a list of 25M documents it takes 
> about 10 minutes on our cluster.  Right now I have it in its own class, but I 
> figured it can go in Generator.java and just add a flag in nutch-default.xml 
> determining if the user wants to use it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
