Multiple delete of the same URL using SolrClean
-----------------------------------------------

Key: NUTCH-1052
URL: https://issues.apache.org/jira/browse/NUTCH-1052
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.3, 1.4
Reporter: Tim Pease
Priority: Minor

The SolrClean class does not keep track of purged URLs; it only checks the URL status for "db_gone". When it is run multiple times, the same list of URLs is deleted from Solr each time. For small, stable crawl databases this is not a problem. For larger crawls this could be an issue: SolrClean will become an expensive operation.

One solution is to add a "purged" flag in the CrawlDatum metadata. SolrClean would then check this flag in addition to the "db_gone" status before adding the URL to the delete list.

Another solution is to add a new state to the status field, "db_gone_and_purged".

Either way, the crawl DB will need to be updated after the Solr delete has successfully occurred.
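
The first proposal (a "purged" flag in the metadata) could look roughly like the sketch below. It is only an illustration under stated assumptions: the crawl DB is modeled as a plain in-memory map, and the names PurgeTrackingSketch, Entry, PURGED_KEY, selectForDeletion and markPurged are hypothetical. None of them are part of Nutch's CrawlDatum or SolrClean API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the crawl DB; this is NOT Nutch's CrawlDb/CrawlDatum API.
class PurgeTrackingSketch {

    // Simplified status field; real Nutch has more states than these two.
    enum Status { DB_FETCHED, DB_GONE }

    // Assumed metadata key for the proposed "purged" flag.
    static final String PURGED_KEY = "purged";

    // Minimal model of one crawl DB record: a status plus free-form metadata.
    static class Entry {
        final Status status;
        final Map<String, String> metadata = new HashMap<>();

        Entry(Status status) {
            this.status = status;
        }

        boolean isPurged() {
            return "true".equals(metadata.get(PURGED_KEY));
        }
    }

    // Collect URLs that are gone AND have not already been purged from the index.
    static List<String> selectForDeletion(Map<String, Entry> crawlDb) {
        List<String> toDelete = new ArrayList<>();
        for (Map.Entry<String, Entry> e : crawlDb.entrySet()) {
            Entry entry = e.getValue();
            if (entry.status == Status.DB_GONE && !entry.isPurged()) {
                toDelete.add(e.getKey());
            }
        }
        return toDelete;
    }

    // Call this only after the index delete has succeeded, so that a failed
    // delete is retried on the next run instead of being silently skipped.
    static void markPurged(Map<String, Entry> crawlDb, List<String> deletedUrls) {
        for (String url : deletedUrls) {
            Entry entry = crawlDb.get(url);
            if (entry != null) {
                entry.metadata.put(PURGED_KEY, "true");
            }
        }
    }
}
```

With this shape, a second run over an unchanged crawl DB selects nothing, so no repeated delete requests are sent. The "db_gone_and_purged" alternative would replace the metadata check with a third enum value and a status transition in markPurged.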