Hi,

I am using nutch-0.8.1 and copied a recrawl script from the web.
I did a simple crawl of the URL http://www.saic.com at depth 2 with
-topN 100 and got 18 records. But when I do a recrawl with -topN 100
and -adddays 31 (forcing all pages to be refetched), I get 132
documents. The initial crawl is fast, but the recrawl (done just for
testing purposes) takes a lot longer and fetches many more documents.
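(If there is a better way to check those record counts, let me know; I
assume a crawldb stats dump like this would show the totals, though I am
not sure exactly what readdb -stats reports in 0.8.1:)

  # dump crawldb statistics (total urls, counts per fetch status);
  # assuming readdb -stats behaves this way in 0.8.1
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch readdb \
      /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb -stats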
My initial crawl command was:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch crawl
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic -dir
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl -depth 2 -topN 100
The seed URL file is named nutch and contains a single line:
http://www.saic.com
And the recrawl's generate command:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch generate
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments -topN 100
-adddays 31
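(For context, the rest of the recrawl script then fetches the new
segment and folds the results back into the crawldb. This is my reading
of the script with my paths filled in; the exact steps and options may
differ from what the script really runs:)

  cd /opt/webdevel/pfna/lucene_crawls/nsnyder/saic
  # pick the segment that generate just created (newest directory)
  segment=`ls -d crawl/segments/2* | tail -1`
  # fetch the selected urls into that segment
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch fetch $segment
  # merge the fetch results (new urls, scores, fetch times) into the crawldb
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch updatedb crawl/crawldb $segment
  # rebuild the link database and index the new segment
  # (I believe the script also runs dedup/merge afterwards)
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch index crawl/newindexes crawl/crawldb crawl/linkdb $segment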
I notice that in the initial crawl log, fetching starts with the
original seed URL:
  Generator: starting
  Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806
  Generator: Selecting best-scoring urls due for fetch.
  Generator: Partitioning selected urls by host, for politeness.
  Generator: done.
  Fetcher: starting
  Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806
  Fetcher: threads: 10
  fetching http://www.saic.com/
  Fetcher: done
But the recrawl starts with a different URL:
  Generator: starting
  Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350
  Generator: Selecting best-scoring urls due for fetch.
  Generator: Partitioning selected urls by host, for politeness.
  Generator: done.
  ** segment = '/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350' **
  Fetcher: starting
  Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350
  Fetcher: threads: 10
  fetching http://www.saic.com/employees/).join(
Shouldn't the first fetched URL be the same if the recrawl is to
produce the same results?
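(If it helps diagnose this, I can dump the crawldb entry for that odd
URL, or the whole db, and post it. I am assuming readdb supports -url
and -dump this way in 0.8.1:)

  # show the crawldb record (status, score, fetch time) for one url
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch readdb \
      /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb \
      -url 'http://www.saic.com/employees/).join('
  # or dump the whole crawldb as text and search it
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch readdb \
      /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb \
      -dump /tmp/crawldb_dump
  grep -r 'employees' /tmp/crawldb_dump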
Also, once the crawls finished, the initial crawl had two segment
directories under crawl/segments:
[EMAIL PROTECTED] segments]$ ls
20061211102806/ 20061211102815/
But the recrawl had three:
[EMAIL PROTECTED] segments]$ ls
20061211102806/ 20061211102815/ 20061211105226/
And if I force a recrawl of everything (just to test it out), shouldn't
it fetch the same number of documents and produce the same number of
segment directories?
NANCY