On 4/12/2017 5:11 AM, Markus Jelsma wrote:
> One of our 2 shard collections is rather small and gets all its entries
> reindexed every 20 minutes orso. Now i just noticed maxDoc is ten times
> greater than numDoc, the merger is never scheduled but settings are default.
> We just overwrite the existing entries, all of them.
>
> Here are the stats:
>
> Last Modified: 12 minutes ago
> Num Docs: 8336
> Max Doc: 82362
> Heap Memory Usage: -1
> Deleted Docs: 74026
> Version: 3125
> Segment Count: 10

This discrepancy would typically mean that when you reindex, you're indexing MOST of the documents, but not ALL of them, so at least one document is still not deleted in each older segment. When segments have all their documents deleted, they are automatically removed by Lucene, but if there's even one document NOT deleted, the segment will remain until it is merged.

There's no information here about how large this core is, but unless the documents are REALLY enormous, I'm betting that an optimize would happen quickly. With a document count this low and an indexing pattern that results in such a large maxDoc, this might be a good time to go against general advice and perform an optimize at least once a day.

An alternate idea that would not require optimizes: If the intent is to completely rebuild the index, you might want to consider issuing a "delete all docs by query" before beginning the indexing process. This would ensure that none of the previous documents remain. As long as you don't do a commit that opens a new searcher before the indexing is complete, clients won't ever know that everything was deleted.

> This is the config:
>
> <luceneMatchVersion>6.5.0</luceneMatchVersion>
> <dataDir>${solr.data.dir:}</dataDir>
> <directoryFactory name="DirectoryFactory"
>     class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
> <codecFactory class="solr.SchemaCodecFactory"/>
> <schemaFactory class="ClassicIndexSchemaFactory"/>
>
> <indexConfig>
>   <lockType>${solr.lock.type:native}</lockType>
>   <infoStream>false</infoStream>
> </indexConfig>
>
> <jmx />
>
> <updateHandler class="solr.DirectUpdateHandler2">
>   <updateLog>
>     <str name="dir">${solr.ulog.dir:}</str>
>   </updateLog>
> </updateHandler>

Side issue: This config is missing autoCommit. You really should have autoCommit with openSearcher set to false and a maxTime in the neighborhood of 60000. It goes inside the updateHandler section.
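
As a rough sketch, assuming the rest of the updateHandler stays exactly as quoted above, the section with autoCommit added might look something like this (60000 is just the ballpark maxTime mentioned here, in milliseconds; adjust to taste):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

  <!-- Hard commit roughly every 60 seconds. openSearcher=false means the
       commit flushes data to disk but does not make changes visible. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```
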
This won't change the maxDoc issue, but because of the other problems it can prevent, it is strongly recommended. It can be omitted if you are confident that your indexing code is correctly managing hard commits.

Thanks,
Shawn