Hello,

I am using Nutch 1.3 to crawl an intranet site. For testing purposes,
I created a local test website consisting of index.html, which links
to three other local HTML pages (page1.html, page2.html, page3.html).

To crawl, I ran:
./bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir ./crawl \
  -depth 3 -topN 10
Afterwards, I checked the status:
./bin/nutch readdb ./crawl/crawldb -stats
CrawlDb statistics start: ./crawl/crawldb
Statistics for CrawlDb: ./crawl/crawldb
TOTAL urls:     4
retry 0:        4
min score:      0.666
avg score:      0.7495
max score:      1.0
status 2 (db_fetched):  1
status 6 (db_notmodified):      3
CrawlDb statistics: done

At this point the site is indexed into Solr. I then removed page3.html,
along with the hyperlink to it on the home page, and reran the crawl:
./bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir ./crawl \
  -depth 3 -topN 10
./bin/nutch readdb ./crawl/crawldb -stats
CrawlDb statistics start: ./crawl/crawldb
Statistics for CrawlDb: ./crawl/crawldb
TOTAL urls:     4
retry 0:        4
min score:      0.666
avg score:      1.7495
max score:      2.666
status 1 (db_unfetched):        1
status 6 (db_notmodified):      3
CrawlDb statistics: done
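
To compare the two runs more easily, the status lines can be reduced to
name/count pairs. This is only a small sketch: status_counts is a
hypothetical helper name, and it assumes the stats lines reach stdout in
exactly the form shown above.

```shell
#!/bin/sh
# Sketch: reduce "readdb -stats" output to "<status name> <count>" pairs.
# status_counts is a hypothetical helper, not part of Nutch.
status_counts() {
  # Matching lines look like: "status 2 (db_fetched):  1"
  awk '/^status [0-9]+ \(/ {
         name = $3                  # e.g. "(db_fetched):"
         gsub(/[():]/, "", name)    # strip the parentheses and the colon
         print name, $NF            # last field is the count
       }'
}
```

Usage would be along the lines of
./bin/nutch readdb ./crawl/crawldb -stats | status_counts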

Checking the removed page in the crawldb yields:
./bin/nutch readdb ./crawl/crawldb -url http://localhost/page3.html
URL: http://localhost/page3.html
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Aug 03 15:29:35 EDT 2011
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 5 seconds (0 days)
Score: 0.6666667
Signature: null
Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page3.html

It looks like page3.html was marked as not found. I then ran solrclean:
./bin/nutch solrclean crawl/crawldb http://localhost:8983/solr/
SolrClean: starting at 2011-08-03 15:40:37
SolrClean: deleted a total of 0 documents
SolrClean: finished at 2011-08-03 15:40:39, elapsed: 00:00:01
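
One way to confirm whether the page is still in the index is to ask Solr
directly. The sketch below assumes the stock Nutch Solr schema, where the
"id" field holds the page URL, and an XML response containing a numFound
attribute; solr_count_by_id is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count Solr documents whose id equals a given URL.
# Assumption: the "id" field holds the page URL (Nutch's Solr schema).
solr_count_by_id() {
  solr_base="$1"   # e.g. http://localhost:8983/solr/
  doc_id="$2"      # e.g. http://localhost/page3.html
  # -G sends the url-encoded parameters as a GET query string;
  # rows=0 asks only for the hit count, wt=xml forces an XML response.
  curl -sG "${solr_base%/}/select" \
    --data-urlencode "q=id:\"${doc_id}\"" \
    --data-urlencode "rows=0" \
    --data-urlencode "wt=xml" |
    sed -n 's/.*numFound="\([0-9][0-9]*\)".*/\1/p'
}
```

For example:
solr_count_by_id http://localhost:8983/solr/ http://localhost/page3.html
A non-zero count would mean the document is still indexed.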

I don’t understand why page3.html was not deleted from Solr. I also
tried running the individual steps:
inject
generate
fetch
parse
updatedb
invertlinks
solrindex

which gave me the same result.
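
For reference, here is a sketch of how those steps are usually chained in
Nutch 1.x. The paths, the -topN value and the Solr URL are assumed from
the crawl command above and are not necessarily the exact invocations;
crawl_round and the NUTCH, SOLR_URL and CRAWL_DIR variables are
hypothetical names.

```shell
#!/bin/sh
# Sketch of one step-by-step crawl round.
# Assumptions: seed list in ./urls, crawl data in ./crawl, local Solr.
NUTCH="${NUTCH:-./bin/nutch}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"
CRAWL_DIR="${CRAWL_DIR:-./crawl}"

crawl_round() {
  $NUTCH inject "$CRAWL_DIR/crawldb" urls || return 1
  $NUTCH generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN 10 || return 1
  # Segments are named by timestamp, so the last entry is the newest one.
  # If generate selected nothing, this picks up an older segment.
  segment=$(ls -d "$CRAWL_DIR"/segments/* | tail -1)
  $NUTCH fetch "$segment" || return 1
  $NUTCH parse "$segment" || return 1
  $NUTCH updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1
  $NUTCH invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments" || return 1
  $NUTCH solrindex "$SOLR_URL" "$CRAWL_DIR/crawldb" "$CRAWL_DIR/linkdb" "$segment"
}
```

Calling crawl_round once performs a single round; setting NUTCH=echo
prints the commands instead of running them.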

Please help.

- Alex
