Re: [Nutch-general] DB_unfetched status

Andrzej Bialecki Thu, 18 Jan 2007 00:10:01 -0800

cesar voulgaris wrote:
> OK, so DB_unfetched holds the pages still to be fetched. Thanks.
> What I still don't get is why, if I increase the number of threads in
> a fetch process (to speed it up), I end up with fewer "DB_fetched"
> pages at the end of the process (some predefined depth) than when I
> fetch with fewer threads (exactly the same crawl!). After all, the
> point of speeding up the process is to have a certain number of
> "fetched" pages within a given time. I started a crawl 22 hrs ago and
> have ~500.000 pages in the db and ~100.000 DB_fetched.
> fetcher.threads.fetch is set to 20. I'll run the next one for 24 hrs
> with 10 threads and report the results. Perhaps there's something
> about the stack or the JVM that I don't understand and that affects
> threading performance.


Most likely you can't fetch pages that fast because of "politeness" 
settings, i.e. maximum number of threads accessing a single host, and 
delay between requests. Look for "Exceeded http.max.delays" errors in 
your log.
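
To get a feel for how often this happens, something like the sketch
below could tally the errors per host. It is only an illustration: the
marker string is the one quoted above, but the rest of the log line
layout (that the URL appears on the same line) and the log file path
passed on the command line are assumptions, so adjust it to whatever
your log actually looks like.

```python
import re
import sys
from collections import Counter

# Marker taken from the message above; everything else about the
# log line layout is an assumption.
MARKER = "Exceeded http.max.delays"

# Assumes the URL being fetched appears somewhere on the same line.
URL_RE = re.compile(r"https?://([^/\s:]+)")


def count_delay_errors(lines):
    """Return (total matching lines, Counter mapping host -> count)."""
    total = 0
    per_host = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        total += 1
        match = URL_RE.search(line)
        if match:
            per_host[match.group(1).lower()] += 1
    return total, per_host


if __name__ == "__main__":
    # Usage: python count_delays.py <path-to-your-fetcher-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        total, per_host = count_delay_errors(log)
    print("lines with '%s': %d" % (MARKER, total))
    for host, count in per_host.most_common(20):
        print("%8d  %s" % (count, host))
```

If a handful of hosts account for most of the errors, extra threads
are mostly queuing up behind those hosts rather than fetching.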

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
