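Solrclean only removes URLs whose CrawlDb status is db_gone; page3.html is
still db_unfetched, so nothing gets deleted. Also, just rerunning the crawl
command won't help if the URL is not yet eligible for fetching (check its
fetch time in the CrawlDb).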
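
To make that check concrete, here is a minimal sketch that reads a CrawlDb
record and reports whether solrclean would touch it. It works on the readdb
output pasted from the quoted message rather than on a live CrawlDb, and the
variable names (record, status, fetch_time, verdict) are my own, not anything
Nutch defines.

```shell
#!/bin/sh
# Sample `readdb -url` output, copied from the quoted message below.
record='URL: http://localhost/page3.html
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Aug 03 15:29:35 EDT 2011
Retry interval: 5 seconds (0 days)'

# Extract the status name between the parentheses on the "Status:" line.
status=$(printf '%s\n' "$record" | sed -n 's/^Status: [0-9]* (\(.*\))$/\1/p')

# Extract the time after which the URL becomes eligible for fetching again.
fetch_time=$(printf '%s\n' "$record" | sed -n 's/^Fetch time: //p')

# Solrclean only acts on db_gone records; anything else is left in the index.
if [ "$status" = "db_gone" ]; then
  verdict="eligible for solrclean"
else
  verdict="skipped by solrclean"
fi

echo "status=$status ($verdict); next fetch due: $fetch_time"
```

On a real install the same parsing could be fed from
./bin/nutch readdb ./crawl/crawldb -url http://localhost/page3.html instead of
the pasted sample. If the fetch time is still in the future, the generate step
in the 1.x releases I have seen lists an -adddays option that shifts the
reference time forward; run ./bin/nutch generate without arguments to confirm
your version has it.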

> Hello,
> 
> I am using nutch-1.3 to crawl an intranet site. For testing purposes,
> I created a local test website
> with index.html and 3 links to other 3 local html pages (page1.html,
> page2.html, page3.html).
> 
> To crawl I ran:
> ./bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir ./crawl
> -depth 3 -topN 10
> After I check the status:
> ./bin/nutch readdb ./crawl/crawldb -stats
> CrawlDb statistics start: ./crawl/crawldb
> Statistics for CrawlDb: ./crawl/crawldb
> TOTAL urls:     4
> retry 0:        4
> min score:      0.666
> avg score:      0.7495
> max score:      1.0
> status 2 (db_fetched):  1
> status 6 (db_notmodified):      3
> CrawlDb statistics: done
> 
> At this point the site is indexed into Solr. After I remove page3.html
> and a hyperlink to it from the home page and rerun:
> ./bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir ./crawl
> -depth 3 -topN 10
> ./bin/nutch readdb ./crawl/crawldb -stats
> CrawlDb statistics start: ./crawl/crawldb
> Statistics for CrawlDb: ./crawl/crawldb
> TOTAL urls:     4
> retry 0:        4
> min score:      0.666
> avg score:      1.7495
> max score:      2.666
> status 1 (db_unfetched):        1
> status 6 (db_notmodified):      3
> CrawlDb statistics: done
> 
> Checking removed page in crawldb yields:
> ./bin/nutch readdb ./crawl/crawldb -url http://localhost/page3.html
> URL: http://localhost/page3.html
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Aug 03 15:29:35 EDT 2011
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 5 seconds (0 days)
> Score: 0.6666667
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page3.html
> 
> Now, looks like page3.html was marked as Not Found. Then, I run solrclean:
> ./bin/nutch solrclean crawl/crawldb http://localhost:8983/solr/
> SolrClean: starting at 2011-08-03 15:40:37
> SolrClean: deleted a total of 0 documents
> SolrClean: finished at 2011-08-03 15:40:39, elapsed: 00:00:01
> 
> I don’t understand why page3.html was not deleted from solr. I also
> tried running:
> inject
> generate
> fetch
> parse
> updatedb
> invertlinks
> solrindex
> 
> which gave me the same result.
> 
> Please help.
> 
> - Alex
