SolrClean only removes URLs with status db_gone. Your page3.html is still db_unfetched, so it is skipped. Also, just rerunning the crawl command won't help if your URL is not yet eligible for fetching (see its fetch time).
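
A quick way to see whether solrclean has anything to delete is to look for a db_gone line in the readdb -stats output. The helper below is only a sketch: gone_count is a made-up name, and it assumes the statistics are printed in the same "status N (db_gone): COUNT" shape as the other status lines in the quoted output.

```shell
# Hypothetical helper (not part of Nutch): read `readdb -stats` output on
# stdin and print the number of URLs reported as db_gone.
# Prints 0 when no db_gone line is present.
gone_count() {
  awk '/\(db_gone\)/ { n = $NF } END { print n + 0 }'
}

# Possible usage, with the paths from the original message. This assumes the
# statistics can be captured from stdout/stderr:
#   ./bin/nutch readdb ./crawl/crawldb -stats 2>&1 | gone_count
```

If this prints 0, as it would for the statistics quoted below, solrclean will report 0 deleted documents.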

> Hello,
>
> I am using nutch-1.3 to crawl an intranet site. For testing purposes,
> I created a local test website
> with index.html and 3 links to other 3 local html pages (page1.html,
> page2.html, page3.html).
>
> To crawl I ran:
> ./bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir ./crawl -depth 3 -topN 10
> After I check the status:
> ./bin/nutch readdb ./crawl/crawldb -stats
> CrawlDb statistics start: ./crawl/crawldb
> Statistics for CrawlDb: ./crawl/crawldb
> TOTAL urls: 4
> retry 0: 4
> min score: 0.666
> avg score: 0.7495
> max score: 1.0
> status 2 (db_fetched): 1
> status 6 (db_notmodified): 3
> CrawlDb statistics: done
>
> At this point the site is indexed into Solr. After I remove page3.html
> and a hyperlink to it from the home page and rerun:
> ./bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir ./crawl -depth 3 -topN 10
> ./bin/nutch readdb ./crawl/crawldb -stats
> CrawlDb statistics start: ./crawl/crawldb
> Statistics for CrawlDb: ./crawl/crawldb
> TOTAL urls: 4
> retry 0: 4
> min score: 0.666
> avg score: 1.7495
> max score: 2.666
> status 1 (db_unfetched): 1
> status 6 (db_notmodified): 3
> CrawlDb statistics: done
>
> Checking removed page in crawldb yields:
> ./bin/nutch readdb ./crawl/crawldb -url http://localhost/page3.html
> URL: http://localhost/page3.html
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Aug 03 15:29:35 EDT 2011
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 5 seconds (0 days)
> Score: 0.6666667
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page3.html
>
> Now, looks like page3.html was marked as Not Found. Then, I run solrclean:
> ./bin/nutch solrclean crawl/crawldb http://localhost:8983/solr/
> SolrClean: starting at 2011-08-03 15:40:37
> SolrClean: deleted a total of 0 documents
> SolrClean: finished at 2011-08-03 15:40:39, elapsed: 00:00:01
>
> I don’t understand why page3.html was not deleted from solr. I also
> tried running:
> inject
> generate
> fetch
> parse
> updatedb
> invertlinks
> solrindex
>
> which gave me the same result.
>
> Please help.
>
> - Alex

