Update to URL ordering from Generator.java

2007-10-23 Thread Ned Rockson
So recently I switched to Fetcher2 over Fetcher for larger whole web fetches (50-100M at a time). I found that the URLs generated are not optimal because they are simply randomized by a hash comparator. In one crawl on 24 machines it took about 3 days to crawl 30M URLs. In comparison with ol

Re: Update to URL ordering from Generator.java

2007-10-24 Thread Andrzej Bialecki
Ned Rockson wrote: So recently I switched to Fetcher2 over Fetcher for larger whole web fetches (50-100M at a time). I found that the URLs generated are not optimal because they are simply randomized by a hash comparator. In one crawl on 24 machines it took about 3 days to crawl 30M URLs. In

Re: Update to URL ordering from Generator.java

2007-10-24 Thread Ken Krugler
Ned Rockson wrote: So recently I switched to Fetcher2 over Fetcher for larger whole web fetches (50-100M at a time). I found that the URLs generated are not optimal because they are simply randomized by a hash comparator. In one crawl on 24 machines it took about 3 days to crawl 30M URLs. I

Re: Update to URL ordering from Generator.java

2007-10-24 Thread Ned Rockson
That's really interesting. From the literature I have been reading, I always thought that the politeness policies required a 1s - 5s wait between queries to the same host. However, I always thought that regardless of HTTP/1.0 or HTTP/1.1 this was the case and if not, do we really want to a

Re: Update to URL ordering from Generator.java

2007-10-24 Thread Ned Rockson
Sweet, I'll look into how to go about doing that later today. On Oct 24, 2007, at 6:06 AM, Andrzej Bialecki wrote: Ned Rockson wrote: So recently I switched to Fetcher2 over Fetcher for larger whole web fetches (50-100M at a time). I found that the URLs generated are not optimal because th

Re: Update to URL ordering from Generator.java

2007-10-24 Thread Ken Krugler
That's really interesting. From the literature I have been reading, I always thought that the politeness policies required a 1s - 5s wait between queries to the same host. However, I always thought that regardless of HTTP/1.0 or HTTP/1.1 this was the case and if not, do we really want to add t