While the fetch (or any other operation, for that matter) is in progress, all the
working information is kept in the Hadoop temp directory. This will be
"/tmp/hadoop-<username>" unless you specify something else using the
"hadoop.tmp.dir" property in your hadoop-site.xml file.
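An override in hadoop-site.xml might look like the sketch below (the path is purely illustrative; point it at whatever volume has room):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative override: keep Hadoop's in-progress working files
       somewhere other than the default /tmp/hadoop-<username> -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop-tmp</value>
  </property>
</configuration>
```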
When the fetch is complete and Hadoop finishes its parse-reduce stage, you will
then notice that all the information has been copied to the applicable segment
directory.
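One way to confirm this is to watch where the bytes actually land: in the Hadoop temp directory while the fetch runs, and in the segment directory once it completes. A quick sketch (assuming the default temp location and a crawl directory named "crawl" — both are illustrative):

```shell
# During the fetch: in-progress data accumulates in the Hadoop temp dir
du -sh /tmp/hadoop-$(whoami)

# After the fetch and the parse-reduce stage finish: the data appears
# under the segment directories instead ("crawl" is an example name)
du -sh crawl/segments/*
```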
----- Original Message ----
From: chee wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 3, 2007 11:23:56 AM
Subject: Re: nutch81 pages seems were not kept but no error message found
Thanks Sean, I see. The fetcher process only updates the status in the
segments, but the status reported by readdb comes from the crawldb.
Another question in this thread: why does the size of the crawl dir, which
includes the crawldb and segments, always remain unchanged? The pages already
fetched should be kept in the segments, and the size of the segments directory
should increase accordingly; is that true?
----- Original Message -----
From: "Sean Dean" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, January 03, 2007 11:05 PM
Subject: Re: nutch81 pages seems were not kept but no error message found
The Nutch DB stats (and everything else in there) will not be updated until
you actually issue an "updatedb" command on a fetched segment. Nutch does not
support real-time updates of this information.
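A sketch of that step, assuming a crawl directory named "crawl" (the segment timestamp below is made up; substitute your actual segment name):

```shell
# Merge the fetched segment's status back into the crawldb, then
# re-run the stats to see DB_fetched change. Paths are illustrative.
bin/nutch updatedb crawl/crawldb crawl/segments/20070103120000
bin/nutch readdb crawl/crawldb -stats
```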
----- Original Message ----
From: Chee Wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 3, 2007 7:33:08 AM
Subject: nutch81 pages seems were not kept but no error message found
Hi all,
I am using the crawl tool in Nutch 0.8.1 under Cygwin, trying to retrieve
pages from about two thousand websites, and the crawl process has been
running for nearly 20 hours.
But during the past 10 hours, the fetch status has always remained the
same, as below:
TOTAL urls: 165212
retry 0: 164110
retry 1: 814
retry 2: 288
min score: 0.0
avg score: 0.029228665
max score: 2.333
status 1 (DB_unfetched): 134960
status 2 (DB_fetched): 27812
status 3 (DB_gone): 2440
All the numbers in the status remain the same; the DB_fetched count is always
27812. From the console output and hadoop.log I can see that the
page-fetching process is running without any error.
The size of the crawl db also shows no change; it is always 328M.
I have tried to solve this problem for all of the last week. Any hints
on this problem would be appreciated. Thanks and bow~~~
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general