Hi Nancy,

Instead of recrawling, you are actually continuing the initial crawl.
I suspect that after your initial crawl you have 18 fetched urls in your crawldb. However, there are probably also a lot of unfetched urls: the outlinks discovered at depth 2, which were added to the crawldb but never fetched. You can use 'nutch readdb' to inspect your crawldb. At this moment you have two segments, one for each depth.

With your recrawl command you are telling Nutch to generate a segment with the (at most) 100 best-scoring urls that are due for fetching. Because you used -adddays 31, this may include the 18 urls that were fetched in the initial crawl, but it will also include a lot of the unfetched outlinks from the initial crawl. The scoring determines which url comes first; if you have not installed your own scoring plugin, the OPIC scoring filter is used, and it is quite possible (even likely) that your start url is not at the top. The second run does one more generate/fetch/updatedb cycle and therefore results in one extra segment, three in total (see the sketch below the quoted mail).
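To see what is actually in the crawldb, something like the following should work. This is only a sketch, using the paths from your commands (the dump directory is just an example, any empty output directory will do), and the exact output labels of readdb may differ a little between versions:

  # Print counts per crawldb status (total urls, db_unfetched, db_fetched, ...)
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch readdb \
      /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb -stats

  # Or dump every entry (url, status, score, next fetch time) into a text dump directory
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch readdb \
      /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb -dump /tmp/crawldb-dump

If my suspicion above is right, -stats should show 18 fetched urls and a considerably larger number of unfetched ones after the initial crawl.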
Mathijs

Nancy Snyder wrote:
> Hi
> I am using nutch-0.8.1 and copied the recrawl script from the web.
>
> I did a simple crawl on url http://www.saic.com at depth 2 with -topN 100
> and got 18 records. But when I do a recrawl with -topN 100 and -adddays 31
> (forcing all pages to be refetched), I get 132 documents. The initial crawl
> is fast. And then I do a recrawl (just for testing purposes) and it takes
> a lot longer and I get lots more documents.
>
> My initial crawl command was:
> /opt/webdevel/pfna/nutch-0.8.1/bin/nutch crawl
>   /opt/webdevel/pfna/lucene_crawls/nsnyder/saic -dir
>   /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl -depth 2 -topN 100
>
> and the url file is called nutch with
> http://www.saic.com
> in it.
>
> And the recrawl command:
> /opt/webdevel/pfna/nutch-0.8.1/bin/nutch generate
>   /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb
>   /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments -topN 100 -adddays 31
>
> I notice in the initial crawl log file, the fetching starts with the
> original url for the crawl.
>
> Generator: starting
> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806
> Fetcher: threads: 10
> fetching http://www.saic.com/
> Fetcher: done
>
> But the recrawl starts with a different url.
>
> Generator: starting
> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> ** segment = '/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350' **
> Fetcher: starting
> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350
> Fetcher: threads: 10
> fetching http://www.saic.com/employees/).join(
>
> Shouldn't the first fetched url be the same to get the same results?
>
> Plus when the crawls are done, the initial crawl had two segment
> directories under crawl/segments:
> [EMAIL PROTECTED] segments]$ ls
> 20061211102806/  20061211102815/
>
> But the recrawl had three:
> [EMAIL PROTECTED] segments]$ ls
> 20061211102806/  20061211102815/  20061211105226/
>
> And if I force a recrawl of everything (just to test it out), shouldn't it
> get the same number of documents and segment directories?
>
> NANCY
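PS: as promised above, one recrawl cycle in 0.8.x looks roughly like this. It is a sketch based on the commands Nancy posted, not a drop-in script, so double-check the options against the recrawl script you copied:

  crawl=/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl
  bin=/opt/webdevel/pfna/nutch-0.8.1/bin

  # 1. Generate a new segment with the best-scoring urls that are due for fetching
  $bin/nutch generate $crawl/crawldb $crawl/segments -topN 100 -adddays 31

  # 2. Pick up the segment that was just created (the newest directory)
  segment=`ls -d $crawl/segments/* | tail -1`

  # 3. Fetch (and parse) the urls in that segment
  $bin/nutch fetch $segment

  # 4. Fold the fetch results and the newly discovered outlinks back into the crawldb,
  #    so the next generate run can pick them up
  $bin/nutch updatedb $crawl/crawldb $segment

After that the usual recrawl scripts rebuild the linkdb and the index from all segments. The point is just that every cycle adds exactly one new segment directory, which is why you end up with three after the 'recrawl'.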
