Re: [Nutch-general] DB_unfetched status

cesar voulgaris Wed, 17 Jan 2007 17:03:13 -0800

OK, so DB_unfetched holds the pages that are still to be fetched. Thanks. What I
still don't get is why, when I increase the number of threads in a fetch
process (to speed it up), I end up (at the same predefined depth) with fewer
"DB_fetched" pages than when I fetch with fewer threads (exactly the same
crawl!). After all, the point of speeding up the process is to have a certain
number of "fetched" pages within a given lapse of time. I started a crawl
22 hrs ago with ~500,000 pages in the db and ~100,000 DB_fetched;
fetcher.threads.fetch is set to 20. I'll run the next set for 24 hrs with 10
threads and report the results. Perhaps there is something about the
stack or the JVM that I don't understand and that affects threading
performance.


On 1/17/07, Sean Dean <[EMAIL PROTECTED]> wrote:


The DB_unfetched value will grow after every successful crawl. It
consists of any leftover URLs (that were not fetched) and new URLs that
were found on the pages you crawled previously.

This is not an error, nor is it missing URLs (it's actually finding you new
ones), but if you don't want it to function this way you can always change
the setting "db.max.outlinks.per.page" to 0 instead of 100 (default).

The "fetcher.threads.fetch" setting has nothing to do with "what" is
crawled, but more to do with the overall speed of the crawl. Also, when you
do start that big crawl of yours, make sure it's not set to 1, as this will
perform poorly.
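
Both settings named above are ordinary Nutch properties, normally overridden in
conf/nutch-site.xml. The following is only a sketch of what such an override
file could look like, assuming the usual <configuration>/<property> layout. The
helper name render_overrides is hypothetical, and the values are illustrative
(10 threads is the figure mentioned elsewhere in this thread, 100 is the
default quoted above).

```python
import xml.etree.ElementTree as ET


def render_overrides(overrides):
    """Render {property name: value} as a <configuration> XML string.

    Hypothetical helper for illustration; not part of Nutch.
    """
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # Pretty-print when the running Python supports it (3.9 and later).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; tune them for your own crawl.
xml_text = render_overrides({
    "fetcher.threads.fetch": 10,
    "db.max.outlinks.per.page": 100,
})
print(xml_text)
```

The printed text can be pasted into nutch-site.xml by hand; the script does not
touch any Nutch files itself.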

----- Original Message ----
From: cesar voulgaris <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, January 16, 2007 11:57:21 PM
Subject: DB_unfetched status


Hi, I'm using version 0.8.1. I have a problem with the
fetcher.threads.fetch property.
Setting the value of this property to 100, then crawling (depth 3) and
running readdb, I get:

CrawlDb statistics start: crawltest/crawldb/
Statistics for CrawlDb: crawltest/crawldb/
TOTAL urls:     82
retry 0:        82
min score:      0.0
avg score:      0.026024392
max score:      1.018
status 1 (DB_unfetched ):        68
status 2 (DB_fetched):  10
status 3 (DB_gone):     4
CrawlDb statistics: done

Lowering the value, I get more TOTAL urls and a bigger
DB_fetched/DB_unfetched ratio (always exactly the same crawl).
Does this mean that the crawler is missing URLs? And what does
DB_unfetched mean anyway? Even if I try with fetcher.threads.fetch = 1,
I still get lots of DB_unfetched results.
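
To compare runs with different thread counts, the readdb statistics quoted
above can be reduced to a few numbers. This is a minimal sketch that assumes
the output keeps the line layout shown in this message; the function names
parse_readdb_stats and fetched_fraction are hypothetical, and the sample text
is copied from the listing above.

```python
import re

# Sample copied from the readdb output quoted in this message.
SAMPLE = """\
CrawlDb statistics start: crawltest/crawldb/
Statistics for CrawlDb: crawltest/crawldb/
TOTAL urls:     82
retry 0:        82
min score:      0.0
avg score:      0.026024392
max score:      1.018
status 1 (DB_unfetched ):        68
status 2 (DB_fetched):  10
status 3 (DB_gone):     4
CrawlDb statistics: done
"""

# Assumed line shapes: "TOTAL urls: N" and "status K (NAME): N".
TOTAL_LINE = re.compile(r"^TOTAL urls:\s+(\d+)\s*$")
STATUS_LINE = re.compile(r"^status\s+\d+\s+\((\w+)\s*\):\s+(\d+)\s*$")


def parse_readdb_stats(text):
    """Return (total, {status name: count}) from readdb-style statistics."""
    total = None
    counts = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = TOTAL_LINE.match(line)
        if match:
            total = int(match.group(1))
            continue
        match = STATUS_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return total, counts


def fetched_fraction(counts):
    """Share of all known URLs whose status is DB_fetched."""
    known = sum(counts.values())
    return counts.get("DB_fetched", 0) / known if known else 0.0


total, counts = parse_readdb_stats(SAMPLE)
print(total, counts, round(fetched_fraction(counts), 3))
```

Running the same parser on the statistics from each thread setting gives
directly comparable fetched fractions for otherwise identical crawls.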

The hadoop.log doesn't show any error or distinctive message (in Nutch
0.7.2 I got https exceeded max delay messages with high values).
I'd appreciate any comments; I don't want to start a big crawl without knowing
the performance of the crawler. Thanks.

