Hi,

I am using nutch-0.8.1 and copied a recrawl script from the web.
I did a simple crawl of the URL http://www.saic.com at depth 2 with
-topN 100 and got 18 records. But when I do a recrawl with -topN 100
and -adddays 31 (forcing all pages to be refetched), I get 132
documents. The initial crawl is fast, but the recrawl (done just for
testing purposes) takes a lot longer and fetches many more documents.
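(If there is a better way to check those record counts, let me know; I
assume a crawldb stats dump like this would show the totals, though I am
not sure exactly what readdb -stats reports in 0.8.1:)

  # dump crawldb statistics (total urls, counts per fetch status);
  # assuming readdb -stats behaves this way in 0.8.1
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch readdb \
      /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb -stats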
My initial crawl command was:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch crawl
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic -dir
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl -depth 2 -topN 100
The seed URL file is named nutch and contains a single line:
http://www.saic.com
And the recrawl's generate command:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch generate
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments -topN 100
-adddays 31
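(For context, the rest of the recrawl script then fetches the new
segment and folds the results back into the crawldb. This is my reading
of the script with my paths filled in; the exact steps and options may
differ from what the script really runs:)

  cd /opt/webdevel/pfna/lucene_crawls/nsnyder/saic
  # pick the segment that generate just created (newest directory)
  segment=`ls -d crawl/segments/2* | tail -1`
  # fetch the selected urls into that segment
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch fetch $segment
  # merge the fetch results (new urls, scores, fetch times) into the crawldb
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch updatedb crawl/crawldb $segment
  # rebuild the link database and index the new segment
  # (I believe the script also runs dedup/merge afterwards)
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch index crawl/newindexes crawl/crawldb crawl/linkdb $segment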
I notice that in the initial crawl log, fetching starts with the
original seed URL:
  Generator: starting
  Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806
  Generator: Selecting best-scoring urls due for fetch.
  Generator: Partitioning selected urls by host, for politeness.
  Generator: done.
  Fetcher: starting
  Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806
  Fetcher: threads: 10
  fetching http://www.saic.com/
  Fetcher: done
But the recrawl starts with a different URL:
  Generator: starting
  Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350
  Generator: Selecting best-scoring urls due for fetch.
  Generator: Partitioning selected urls by host, for politeness.
  Generator: done.
  ** segment = '/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350' **
  Fetcher: starting
  Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350
  Fetcher: threads: 10
  fetching http://www.saic.com/employees/).join(
Shouldn't the first fetched URL be the same if the recrawl is to
produce the same results?
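(If it helps diagnose this, I can dump the crawldb entry for that odd
URL, or the whole db, and post it. I am assuming readdb supports -url
and -dump this way in 0.8.1:)

  # show the crawldb record (status, score, fetch time) for one url
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch readdb \
      /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb \
      -url 'http://www.saic.com/employees/).join('
  # or dump the whole crawldb as text and search it
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch readdb \
      /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb \
      -dump /tmp/crawldb_dump
  grep -r 'employees' /tmp/crawldb_dump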
Also, once the crawls finished, the initial crawl had two segment
directories under crawl/segments:
[EMAIL PROTECTED] segments]$ ls
20061211102806/ 20061211102815/
But the recrawl had three:
[EMAIL PROTECTED] segments]$ ls
20061211102806/ 20061211102815/ 20061211105226/
And if I force a recrawl of everything (just to test it out), shouldn't
it fetch the same number of documents and produce the same number of
segment directories?
NANCY