Ned Rockson wrote:
So recently I switched from Fetcher to Fetcher2 for larger whole-web fetches (50-100M URLs at a time). I found that the generated URL lists are not ordered optimally, because they are simply randomized by a hash comparator. In one crawl on 24 machines it took about 3 days to fetch 30M URLs; compared with the old benchmarks I had set with the regular Fetcher.java, that was at least three times as long.

Anyway, I realized that randomization only approximates a good ordering; to get an optimal ordering, URLs from the same host should be as far apart in the list as possible. So I wrote a pair of map/reduce jobs to optimize the ordering, and for a list of 25M documents it takes about 10 minutes on our cluster. Right now it lives in its own class, but I figured it could go into Generator.java, with a flag in nutch-default.xml controlling whether the user wants to use it. So, should I submit the code by email? Is there some way to change Generator.java, or should I just submit the function in its own class?
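
To make the idea concrete, here is a minimal single-process sketch of the ordering principle (not the actual two-job map/reduce code; the class and method names below are purely illustrative): bucket URLs by host, then emit them round-robin, so consecutive URLs from the same host end up separated by roughly the number of other hosts that still have URLs left.

import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only -- the real version is two map/reduce jobs; this just
// shows the same-host-spreading principle on a single node.
public class HostSpreadOrdering {

    public static List<String> spreadByHost(List<String> urls) {
        // Bucket URLs by host, keeping their original order within each host.
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                host = url;  // fall back to the raw URL for odd cases
            }
            byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }

        // Round-robin across hosts: take one URL per host per pass, so
        // consecutive URLs from the same host are separated by roughly the
        // number of distinct hosts that still have URLs remaining.
        List<String> ordered = new ArrayList<>(urls.size());
        while (!byHost.isEmpty()) {
            Iterator<Deque<String>> it = byHost.values().iterator();
            while (it.hasNext()) {
                Deque<String> bucket = it.next();
                ordered.add(bucket.poll());
                if (bucket.isEmpty()) {
                    it.remove();
                }
            }
        }
        return ordered;
    }
}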


This sounds intriguing. Nutch mailing lists strip attachments - you should create a JIRA issue, copy this description, and attach a patch in unified diff format (svn diff will do).

For what it's worth, a year ago I had a series of discussions with people who manage large web sites, w.r.t. acceptable crawl rates.

This was in the context of whether using keep-alive made sense, to optimize crawl performance.

For sites that mostly serve up static content, keep-alive was seen as a good thing, in that it reduced the load on the server from constantly creating/destroying connections.

For sites that build pages from multiple data sources, this was a minor issue - they were more concerned about total # of pages fetched per time interval, given the high cost of assembling the page.

That's the interesting point - nobody talked about the amount of time between requests. They were all interested in monitoring pages per time interval - typically pages per hour for smaller sites, and pages per minute for ones with heavier traffic.

Based on the above, the most efficient approach to fetching would seem to be a pages/minute/domain restriction: use a kept-alive connection (with HTTP 1.1) and fetch with no delay between requests until that limit is reached, then switch to another set of URLs from a different domain.

This assumes a maximum of one thread per domain, and that URLs are segmented so that each domain is handled inside a single task, which lets the one-thread/domain and pages/minute/domain restrictions be enforced properly.
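
To make that concrete, here is a rough, hypothetical sketch of such a fetch loop (none of this is actual Nutch code; the class name and the fetchOne() placeholder are made up): the task owns the queues for its domains, burst-fetches up to the per-minute quota from one domain over a kept-alive connection, moves on to the next domain, and doesn't return to any domain until its one-minute window has elapsed.

import java.util.Deque;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the pages/minute/domain policy -- not actual Nutch
// code. Each task owns the queues for its domains; fetchOne() stands in for
// the real fetch over a persistent HTTP 1.1 connection.
public class PerDomainRateLimitedFetcher {

    private final int pagesPerMinutePerDomain;
    private final Map<String, Deque<String>> queuesByDomain;

    public PerDomainRateLimitedFetcher(int pagesPerMinutePerDomain,
                                       Map<String, Deque<String>> queuesByDomain) {
        this.pagesPerMinutePerDomain = pagesPerMinutePerDomain;
        this.queuesByDomain = queuesByDomain;
    }

    public void run() throws InterruptedException {
        while (!queuesByDomain.isEmpty()) {
            long windowStart = System.currentTimeMillis();
            Iterator<Deque<String>> domains = queuesByDomain.values().iterator();
            while (domains.hasNext()) {
                Deque<String> queue = domains.next();
                // Burst-fetch up to the per-minute quota from this domain over
                // one kept-alive connection, with no delay between requests.
                for (int fetched = 0;
                     fetched < pagesPerMinutePerDomain && !queue.isEmpty();
                     fetched++) {
                    fetchOne(queue.poll());
                }
                if (queue.isEmpty()) {
                    domains.remove();
                }
            }
            // Don't revisit any domain until its one-minute window is over.
            long remaining = TimeUnit.MINUTES.toMillis(1)
                    - (System.currentTimeMillis() - windowStart);
            if (remaining > 0 && !queuesByDomain.isEmpty()) {
                Thread.sleep(remaining);
            }
        }
    }

    // Placeholder for the actual HTTP 1.1 keep-alive fetch.
    private void fetchOne(String url) {
        // issue the request, hand the response to the parser, etc.
    }
}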

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
