While the fetch (or any other operation, for that matter) is in progress, all the 
working information will be kept in the Hadoop temp directory. This will be 
"/tmp/hadoop-<username>" unless you specify something else using the 
"hadoop.tmp.dir" property in your hadoop-site.xml file.
 
When the fetch is complete and Hadoop finishes its parse-reduce stage, you will 
notice that all the information has been copied to the applicable segment 
directory.
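
At that point the segment should contain the usual subdirectories, something 
like this (the timestamp is just an example):

    crawl/segments/20070103123456/
        content/         (raw fetched pages)
        crawl_generate/  (the fetch list)
        crawl_fetch/     (fetch status for each URL)
        crawl_parse/     (outlinks, consumed later by updatedb)
        parse_data/      (parsed metadata)
        parse_text/      (extracted text)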

 
----- Original Message ----
From: chee wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 3, 2007 11:23:56 AM
Subject: Re: nutch81 pages seems were not kept but no error message found


Thanks Sean, I see... The fetcher process only updates the status in the 
segments, but the status reported by readdb comes from the crawldb.
Another question in this mail thread: why does the size of the crawl dir, which 
includes the crawldb and segments, always remain unchanged? The pages already 
fetched should be kept in the segments, and the size of the segments directory 
should increase accordingly - is this true?



----- Original Message ----- 
From: "Sean Dean" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, January 03, 2007 11:05 PM
Subject: Re: nutch81 pages seems were not kept but no error message found


The Nutch DB stats (and everything else in there) will not get updated until 
you actually issue an "updatedb" command on a fetched segment. Nutch does not 
support real-time updates of this information.
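
For example (the paths here are only illustrative - substitute your own crawldb 
and segment):

    bin/nutch updatedb crawl/crawldb crawl/segments/20070103123456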


----- Original Message ----
From: Chee Wu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 3, 2007 7:33:08 AM
Subject: nutch81 pages seems were not kept but no error message found


Hi all,
   I am using the crawl tool in Nutch 0.8.1 under Cygwin, trying to retrieve
pages from about 2 thousand websites, and the crawl process has been
running for nearly 20 hours.
    But during the past 10 hours, the fetch status has always remained the
same, as below:
    TOTAL urls: 165212
    retry 0:    164110
    retry 1:    814
    retry 2:    288
    min score:  0.0
    avg score:  0.029228665
    max score:  2.333
    status 1 (DB_unfetched):    134960
    status 2 (DB_fetched):      27812
    status 3 (DB_gone): 2440
All the numbers in the status remain the same; the DB_fetched count is always
27812. From the console output and hadoop.log I can see that the page-fetching
process is running without any error.
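
(The stats above come from reading the crawldb, e.g. something along these 
lines - the path is illustrative:

    bin/nutch readdb crawl/crawldb -stats)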

The size of the crawl db also shows no change; it stays at 328 MB.

I have been trying to solve this problem for the whole of last week. Any hints
are appreciated. Thanks and bow~~~