Hi Nancy,

Instead of recrawling, you are actually continuing the initial crawl.
I suspect that after your initial crawl you have 18 fetched urls in your crawldb. However, there are probably also a lot of unfetched urls: the outlinks discovered at depth 2, which were added to the crawldb but never fetched. You can use 'nutch readdb' to inspect your crawldb. At this moment you have two segments, one for each depth.

With your recrawl command you are telling Nutch to generate a segment with the (at most) 100 best-scoring urls that are due for fetching. Because you used -adddays 31, this may include the 18 urls that were fetched in the initial crawl, but it will also include a lot of the unfetched outlinks from the initial crawl. The scoring determines which url comes first; if you have not installed your own scoring plugin, the OPIC scoring filter is used, and it is quite possible (even likely) that your start url is not at the top. The second run does one more generate/fetch/updatedb cycle and therefore results in one extra segment, three in total (see the sketch below the quoted mail).
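To see what is actually in the crawldb, something like the following should work. This is only a sketch, using the paths from your commands (the dump directory is just an example, any empty output directory will do), and the exact output labels of readdb may differ a little between versions:

  # Print counts per crawldb status (total urls, db_unfetched, db_fetched, ...)
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch readdb \
      /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb -stats

  # Or dump every entry (url, status, score, next fetch time) into a text dump directory
  /opt/webdevel/pfna/nutch-0.8.1/bin/nutch readdb \
      /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb -dump /tmp/crawldb-dump

If my suspicion above is right, -stats should show 18 fetched urls and a considerably larger number of unfetched ones after the initial crawl.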
Mathijs

Nancy Snyder wrote:
> Hi
> I am using nutch-0.8.1 and copied the recrawl script from the web.
>
> I did a simple crawl on url http://www.saic.com at depth 2 with -topN 100
> and got 18 records. But when I do a recrawl with -topN 100 and -adddays 31
> (forcing all pages to be refetched), I get 132 documents. The initial crawl
> is fast. And then I do a recrawl (just for testing purposes) and it takes
> a lot longer and I get lots more documents.
>
> My initial crawl command was:
> /opt/webdevel/pfna/nutch-0.8.1/bin/nutch crawl
>   /opt/webdevel/pfna/lucene_crawls/nsnyder/saic -dir
>   /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl -depth 2 -topN 100
>
> and the url file is called nutch with
> http://www.saic.com
> in it.
>
> And the recrawl command:
> /opt/webdevel/pfna/nutch-0.8.1/bin/nutch generate
>   /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb
>   /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments -topN 100 -adddays 31
>
> I notice in the initial crawl log file, the fetching starts with the
> original url for the crawl.
>
> Generator: starting
> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806
> Fetcher: threads: 10
> fetching http://www.saic.com/
> Fetcher: done
>
> But the recrawl starts with a different url.
>
> Generator: starting
> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> ** segment = '/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350' **
> Fetcher: starting
> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350
> Fetcher: threads: 10
> fetching http://www.saic.com/employees/).join(
>
> Shouldn't the first fetched url be the same to get the same results?
>
> Plus when the crawls are done, the initial crawl had two segment
> directories under crawl/segments:
> [EMAIL PROTECTED] segments]$ ls
> 20061211102806/  20061211102815/
>
> But the recrawl had three:
> [EMAIL PROTECTED] segments]$ ls
> 20061211102806/  20061211102815/  20061211105226/
>
> And if I force a recrawl of everything (just to test it out), shouldn't it
> get the same number of documents and segment directories?
>
> NANCY
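PS: as promised above, one recrawl cycle in 0.8.x looks roughly like this. It is a sketch based on the commands Nancy posted, not a drop-in script, so double-check the options against the recrawl script you copied:

  crawl=/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl
  bin=/opt/webdevel/pfna/nutch-0.8.1/bin

  # 1. Generate a new segment with the best-scoring urls that are due for fetching
  $bin/nutch generate $crawl/crawldb $crawl/segments -topN 100 -adddays 31

  # 2. Pick up the segment that was just created (the newest directory)
  segment=`ls -d $crawl/segments/* | tail -1`

  # 3. Fetch (and parse) the urls in that segment
  $bin/nutch fetch $segment

  # 4. Fold the fetch results and the newly discovered outlinks back into the crawldb,
  #    so the next generate run can pick them up
  $bin/nutch updatedb $crawl/crawldb $segment

After that the usual recrawl scripts rebuild the linkdb and the index from all segments. The point is just that every cycle adds exactly one new segment directory, which is why you end up with three after the 'recrawl'.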
