If you don't delete documents, the difference between numDocs and maxDoc is 
just updated documents, whose older versions are eligible for deletion.
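
To make the arithmetic concrete, here is a rough sketch using the figures 
quoted below. Solr reports deletedDocs as maxDoc minus numDocs, and Lucene 
drops deleted documents for good when segments are merged, which is why 
maxDoc can shrink between crawls. The sketch assumes that no explicit 
deletes were sent and that all 13,423 add/update operations reached the 
index; the function and variable names are only illustrative.

```python
# Index statistics as reported by Solr before and after the crawl.
before = {"numDocs": 189550, "maxDoc": 285531}
after = {"numDocs": 190785, "maxDoc": 223339}
indexed_add_update = 13423  # "indexed (add/update)" reported by the index job


def deleted_docs(stats):
    """Deleted docs are the slots counted in maxDoc that are no longer live."""
    return stats["maxDoc"] - stats["numDocs"]


# Documents that are genuinely new: the growth in live documents.
net_new = after["numDocs"] - before["numDocs"]

# Assumption: everything else in the add/update count replaced an existing
# document, leaving the older version behind as a "deleted" doc.
updates = indexed_add_update - net_new

# Deleted docs that segment merges must have purged to reach the new total.
purged = deleted_docs(before) + updates - deleted_docs(after)

print("deleted before:", deleted_docs(before))
print("deleted after: ", deleted_docs(after))
print("net new docs:  ", net_new)
print("updates:       ", updates)
print("purged by merges (estimate):", purged)
```

Under those assumptions the deleted counts match what Solr showed (95981 
and 32554), and the drop between them is explained by merges reclaiming 
old versions rather than by Nutch deleting pages.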

 
 
-----Original message-----
> From:Michael Coffey <mcof...@yahoo.com.INVALID>
> Sent: Monday 2nd October 2017 23:29
> To: user@nutch.apache.org
> Subject: Re: deletions from index
> 
> So, I had these numbers in my index:
> Num Docs: 189550
> Max Docs: 285531
> Deleted Docs: 95981
> 
> Then I did a crawl and index, which told me
> indexed (add/update): 13,423
> And now I have these numbers in my index:
> 
> Num Docs: 190785
> Max Docs: 223339
> Deleted Docs: 32554
> 
> So, I am completely confused. I don't use "-deleteGone" but I get massive 
> numbers of deletions.
> 
> Is it your theory that Solr's report of deleted docs really just means that 
> docs were replaced by newer content?
> 
> 
>       From: Markus Jelsma <markus.jel...@openindex.io>
>  To: "user@nutch.apache.org" <user@nutch.apache.org>; User <user@nutch.apache.org> 
>  Sent: Monday, October 2, 2017 1:19 PM
>  Subject: RE: deletions from index
>    
> You can check the Hadoop job's counters to see how many are being deleted. If 
> some are, then -deleteGone is on in your case. Documents are deleted only 
> with that setting.
> 
>  
>  
> -----Original message-----
> > From:Michael Coffey <mcof...@yahoo.com.INVALID>
> > Sent: Monday 2nd October 2017 21:51
> > To: User <user@nutch.apache.org>
> > Subject: deletions from index
> > 
> > With my new news crawl, I would like to keep web pages in the index, even 
> > after they have disappeared from the web, so I can continue using them in 
> > machine-learning processes. I thought I could achieve this by avoiding 
> > running cleaning jobs. However, I still notice increasing numbers of 
> > deletions in my solr index.
> > When and why does nutch tell the indexer to delete documents, other than 
> > during cleaningJob?
> > For example, recently, Solr tells me that numDocs is about 189,000 and 
> > deletedDocs is about 96,000. Even if I assume that some of the "deleted" 
> > docs have just been replaced by newer content, I am not ready to believe 
> > that has happened to so many of them.
> > Should I use a different indexer, or different settings, or something other 
> > than an indexer for this purpose?
> > 
> 
>    
