Hi,

I have installed Nutch 0.9 and crawled a news website, and I got search
hits for it. After that I recrawled the same site, but this time I did
not get any hits for the new pages, although I can see the updated URLs
in the log file.

Example: I crawled on the 17th and recrawled on the 23rd. I can see the
URLs from the 23rd in the log file, like this:
"indexer.Indexer - Indexing [http://......./2007.07.23.html] with analyzer [EMAIL PROTECTED] (null)"

Does the "[EMAIL PROTECTED] (null)" part of that line indicate an error?

Please help me understand how to recrawl a website.

I have used the following script for the recrawl:

bin/nutch generate $1/crawldb $1/segments -adddays 5
segment=`ls -d $1/segments/* | tail -1 | grep "[a-zA-Z0-9/]*"`
bin/nutch fetch $segment
bin/nutch updatedb $1/crawldb $segment

bin/nutch generate $1/crawldb $1/segments -adddays 5
s2=`ls -d $1/segments/2* | tail -1`
bin/nutch fetch $s2
bin/nutch updatedb $1/crawldb $s2

bin/nutch generate $1/crawldb $1/segments -adddays 5
s3=`ls -d $1/segments/2* | tail -1`
bin/nutch fetch $s3
bin/nutch updatedb $1/crawldb $s3

rm -r $1/indexes
bin/nutch invertlinks $1/linkdb $1/segments/*
bin/nutch index $1/indexes $1/crawldb $1/linkdb $1/segments/*
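
The three identical generate/fetch/updatedb rounds above could probably
be folded into a loop. This is only a minimal sketch, assuming a POSIX
shell. NUTCH, DEPTH, ADDDAYS, latest_segment and recrawl are names made
up for illustration and are not part of Nutch. It uses only the
bin/nutch commands already shown above.

```shell
#!/bin/sh
# Sketch only: same commands as the script above, with the three
# generate/fetch/updatedb rounds folded into a loop.
# NUTCH, DEPTH and ADDDAYS are illustrative names, not Nutch settings.
NUTCH=${NUTCH:-bin/nutch}
DEPTH=${DEPTH:-3}
ADDDAYS=${ADDDAYS:-5}

# Print the last segment directory under <crawl dir>/segments.
# Segment names are timestamps, so the last one in sorted order is the newest.
latest_segment() {
    ls -d "$1"/segments/2* | tail -1
}

# Run DEPTH rounds of generate/fetch/updatedb on the crawl directory $1,
# then remove the old indexes and rebuild the linkdb and the index.
recrawl() {
    crawl=$1
    i=0
    while [ "$i" -lt "$DEPTH" ]; do
        "$NUTCH" generate "$crawl/crawldb" "$crawl/segments" -adddays "$ADDDAYS"
        segment=$(latest_segment "$crawl")
        "$NUTCH" fetch "$segment"
        "$NUTCH" updatedb "$crawl/crawldb" "$segment"
        i=$((i + 1))
    done
    rm -r "$crawl/indexes"
    "$NUTCH" invertlinks "$crawl/linkdb" "$crawl"/segments/*
    "$NUTCH" index "$crawl/indexes" "$crawl/crawldb" "$crawl/linkdb" "$crawl"/segments/*
}

# Usage (not run here): recrawl <crawl dir>
```

Calling recrawl with the crawl directory as its argument should run the
same sequence of commands as the script above. It does not change what
gets indexed.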

Thanks in advance.

Regards,
Anuradha.




