Hi Paul,
The most likely problem, it seems to me, is the depth of 15.
If your first page and every subsequent one had 10 links,
your crawler would have to fetch roughly 24414062500
gigabytes from the Internet.
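
To make the growth concrete, here is a minimal Python sketch of the
arithmetic. It assumes that every page contributes 10 links to pages not
seen before (a worst case, since real sites repeat many links) and about
22 KB per page, which is close to the bytes/page figure in the quoted
status lines. The function names are made up for illustration and are
not part of Nutch.

```python
# Back-of-the-envelope estimate of crawl size versus depth.
# Assumption: every fetched page yields `links_per_page` links to pages
# that have not been seen before. Real crawls grow more slowly than this.


def pages_at_depth(links_per_page: int, depth: int) -> int:
    """Pages found exactly `depth` hops away from a single start page."""
    return links_per_page ** depth


def total_pages(links_per_page: int, depth: int) -> int:
    """Pages found at all levels from 0 (the start page) up to `depth`."""
    return sum(links_per_page ** level for level in range(depth + 1))


def estimated_bytes(pages: int, bytes_per_page: float) -> float:
    """Total download volume for `pages` pages of average size."""
    return pages * bytes_per_page


if __name__ == "__main__":
    bytes_per_page = 22_000  # assumed average, roughly the log's bytes/page
    for depth in (3, 5, 10, 15):
        pages = total_pages(10, depth)
        gib = estimated_bytes(pages, bytes_per_page) / 2**30
        print(f"depth {depth:2d}: {pages:>20,d} pages, about {gib:,.1f} GiB")
```

Each extra level multiplies the work by the number of links per page, so
lowering the depth by a few levels cuts the estimate by orders of
magnitude.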

Depending on your data, start with a much smaller depth.

Kind regards,
Olaf



On Mon, 28 Feb 2005 08:20:22 -0800, sub paul <[EMAIL PROTECTED]> wrote:
> Hello All,
> 
> I was running an intranet crawl, and it seems it did not finish properly.
> It is a pretty default setup, but the crawl depth was 15, and I had
> turned on queries by commenting out
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
> 
> Other than a bunch of fetch messages and a bunch of Exceeding max.delays
> messages, I am seeing the following:
> 
> crawl starts normally...
> 050228 064335 status: segment 20050228044354, 6300 pages, 91 errors,
> 140194211 bytes, 7163124 ms
> 050228 064335 status: 0.8795045 pages/s, 152.90356 kb/s, 22253.049 bytes/page
> .......
> 050228 064551 status: segment 20050228044354, 6400 pages, 97 errors,
> 142348797 bytes, 7298549 ms
> 050228 064551 status: 0.87688667 pages/s, 152.37276 kb/s, 22242.0 bytes/page
> .....
> 050228 064759 status: segment 20050228044354, 6500 pages, 102 errors,
> 144522915 bytes, 7427113 ms
> 
> The result of all this was a nutch-searcher.dir that looked like this:
> du -h nutch-searcher.dir/
> 5.3M    nutch-searcher.dir/db/webdb/pagesByURL
> 3.4M    nutch-searcher.dir/db/webdb/pagesByMD5
> 14M     nutch-searcher.dir/db/webdb/linksByMD5
> 14M     nutch-searcher.dir/db/webdb/linksByURL
> 36M     nutch-searcher.dir/db/webdb
> 36M     nutch-searcher.dir/db
> 12K     nutch-searcher.dir/segments/20050228020140/fetchlist
> 12K     nutch-searcher.dir/segments/20050228020140/fetcher
> 20K     nutch-searcher.dir/segments/20050228020140/content
> 12K     nutch-searcher.dir/segments/20050228020140/parse_text
> 16K     nutch-searcher.dir/segments/20050228020140/parse_data
> 76K     nutch-searcher.dir/segments/20050228020140
> 16K     nutch-searcher.dir/segments/20050228020146/fetchlist
> 16K     nutch-searcher.dir/segments/20050228020146/fetcher
> 316K    nutch-searcher.dir/segments/20050228020146/content
> 52K     nutch-searcher.dir/segments/20050228020146/parse_text
> 144K    nutch-searcher.dir/segments/20050228020146/parse_data
> 548K    nutch-searcher.dir/segments/20050228020146
> 56K     nutch-searcher.dir/segments/20050228020257/fetchlist
> 68K     nutch-searcher.dir/segments/20050228020257/fetcher
> 2.2M    nutch-searcher.dir/segments/20050228020257/content
> 260K    nutch-searcher.dir/segments/20050228020257/parse_text
> 912K    nutch-searcher.dir/segments/20050228020257/parse_data
> 3.5M    nutch-searcher.dir/segments/20050228020257
> 232K    nutch-searcher.dir/segments/20050228020931/fetchlist
> 276K    nutch-searcher.dir/segments/20050228020931/fetcher
> 9.4M    nutch-searcher.dir/segments/20050228020931/content
> 1.1M    nutch-searcher.dir/segments/20050228020931/parse_text
> 4.1M    nutch-searcher.dir/segments/20050228020931/parse_data
> 15M     nutch-searcher.dir/segments/20050228020931
> 900K    nutch-searcher.dir/segments/20050228024012/fetchlist
> 1.1M    nutch-searcher.dir/segments/20050228024012/fetcher
> 37M     nutch-searcher.dir/segments/20050228024012/content
> 3.9M    nutch-searcher.dir/segments/20050228024012/parse_text
> 16M     nutch-searcher.dir/segments/20050228024012/parse_data
> 58M     nutch-searcher.dir/segments/20050228024012
> 3.2M    nutch-searcher.dir/segments/20050228044354/fetchlist
> 1.1M    nutch-searcher.dir/segments/20050228044354/fetcher
> 39M     nutch-searcher.dir/segments/20050228044354/content
> 3.6M    nutch-searcher.dir/segments/20050228044354/parse_text
> 16M     nutch-searcher.dir/segments/20050228044354/parse_data
> 62M     nutch-searcher.dir/segments/20050228044354
> 139M    nutch-searcher.dir/segments
> 175M    nutch-searcher.dir
> 
> The crawl ran for about 2 hours and 43 minutes.
> 
> When I search, it looks at the right searcher.dir, but it's not
> returning anything for me:
> 050228 085819 10 query request from 64.171.1.207
> 050228 085819 10 query: bhangra
> 050228 085819 10 searching for 20 raw hits
> 050228 085819 10 total hits: 0
> 
> What am I doing wrong? TIA for the help.
> 
> Regards,
> Paul
> 
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide
> Read honest & candid reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now.
> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
> 


-- 

<SimpleHuman gender="male">
   <Physical name="Olaf Thiele" />
   <Virtual adress="http://www.olafthiele.de" />
</SimpleHuman>

