You can check the Hadoop job's counters to see how many documents are being
deleted. If some are, then -deleteGone is enabled in your setup; documents are
only deleted from the index when that option is on.
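For reference, a minimal sketch of the two invocations, assuming the standard
bin/nutch index command; the crawldb/linkdb/segment paths below are
illustrative, so substitute your own and check "bin/nutch index" usage for
your version:

```shell
# Index WITHOUT purging gone pages: simply leave out -deleteGone.
# (Paths are examples only.)
bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/20171002123456

# With -deleteGone, pages that came back 404/gone or as redirects are
# sent to the indexer as deletions:
bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/20171002123456 -deleteGone
```

If you use the bundled crawl script rather than invoking the jobs directly,
check whether it passes -deleteGone to the indexing step.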

-----Original message-----
> From:Michael Coffey <mcof...@yahoo.com.INVALID>
> Sent: Monday 2nd October 2017 21:51
> To: User <user@nutch.apache.org>
> Subject: deletions from index
> 
> With my new news crawl, I would like to keep web pages in the index, even 
> after they have disappeared from the web, so I can continue using them in 
> machine-learning processes. I thought I could achieve this by avoiding 
> running cleaning jobs. However, I still notice increasing numbers of 
> deletions in my solr index.
> When and why does nutch tell the indexer to delete documents, other than 
> during cleaningJob?
> For example, recently, Solr tells me that numDocs is about 189,000 and 
> deletedDocs is about 96,000. Even if I assume that some of the "deleted" docs 
> have just been replaced by newer content, I am not ready to believe that has 
> happened to so many of them.
> Should I use a different indexer, or different settings, or something other 
> than an indexer for this purpose?
> 
