It's also important to call IndexWriter.commit (as well as open new NRT readers) periodically, or after doing a large set of updates, as that lets Lucene remove any old segments referenced by the prior commit point.
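
One way to make "periodically or after a large set of updates" concrete is a small counter that fires a commit and a reader refresh every N updates. This is only a sketch: `CommitBatcher`, `IoAction`, `onUpdate`, and `flush` are hypothetical names invented for illustration and are not part of Lucene, and the threshold is something you would have to tune for your own workload.

```java
import java.io.IOException;

/**
 * Hypothetical helper (not a Lucene class): runs a commit action and a
 * refresh action once every N updates, or on demand via flush().
 */
class CommitBatcher {

    /** An action that may throw IOException, such as a commit or a reader refresh. */
    @FunctionalInterface
    interface IoAction {
        void run() throws IOException;
    }

    private final int updatesPerCommit;
    private final IoAction commit;
    private final IoAction refresh;
    private int pending = 0;

    CommitBatcher(int updatesPerCommit, IoAction commit, IoAction refresh) {
        if (updatesPerCommit < 1) {
            throw new IllegalArgumentException("updatesPerCommit must be >= 1");
        }
        this.updatesPerCommit = updatesPerCommit;
        this.commit = commit;
        this.refresh = refresh;
    }

    /** Call after each add, update, or delete; triggers flush() at the threshold. */
    synchronized void onUpdate() throws IOException {
        pending++;
        if (pending >= updatesPerCommit) {
            flush();
        }
    }

    /** Commits, then refreshes, if any updates are pending; a no-op otherwise. */
    synchronized void flush() throws IOException {
        if (pending == 0) {
            return;
        }
        commit.run();
        refresh.run();
        pending = 0; // reset only after both actions succeeded
    }
}
```

Assuming `writer` is your long-lived IndexWriter and `searcherManager` is a SearcherManager opened on it, this could be wired up as `new CommitBatcher(50_000, writer::commit, searcherManager::maybeRefresh)`, calling `onUpdate()` after each indexing operation and `flush()` at the end of the nightly delete. The 50,000 figure is an arbitrary placeholder, not a recommendation.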

Mike McCandless

http://blog.mikemccandless.com

On Fri, Nov 6, 2015 at 2:59 AM, Rob Audenaerde <rob.audenae...@gmail.com> wrote:
> Hi will, others
>
> Thanks for your reply,
>
> As far as I understand it, deleting a document just sets the deleted
> bit, and the documents are removed when segments are merged. (I'm not
> really sure what this means exactly; I guess the document gets removed from
> the store and the terms will no longer refer to that document. I'm not sure
> whether terms get removed if no longer needed, etc.) If there are resources
> to read to improve my understanding, I have not found them (yet); if you
> could point me to some, that would be great!
>
> I use the default IndexWriterConfig, which I see uses TieredMergePolicy. I
> never close my IndexWriter; as I use NRT searching, I just always keep it
> open.
>
> My two guesses are that: a) old segments are not removed from disk or b)
> deletes are not cleaned up as well as I thought they would be.
>
> I have made a test case which indexes 5 million rows (five iterations, five
> indexing threads, indexing and then deleting all such documents after each
> iteration with deleteByQuery), the rows randomly generated. I see the
> Taxonomy ever growing (which is logical, because facet-ordinals are never
> removed as far as I understand); the index grows, but also shrinks when
> deleting. So I cannot reproduce my problem easily :(
>
> I will start diving into the Lucene source code, but I was hoping I just
> did something wrong.
>
> Any hints are appreciated!
>
> -Rob
>
>
> On Thu, Nov 5, 2015 at 2:52 PM, will <wmartin...@gmail.com> wrote:
>
>> Hi Rob:
>>
>> Do you understand how deletes work and how an index is compacted?
>>
>> There are some configuration/runtime activities you don't mention... And
>> you make the testing process sound like a mirror of production? (Including
>> configuration?)
>>
>> -will
>>
>>
>> On 11/5/15 7:33 AM, Rob Audenaerde wrote:
>>
>>> Hi all,
>>>
>>> I'm currently investigating an issue we have with our index. It keeps
>>> getting bigger, and I don't get why.
>>>
>>> Here is our use case:
>>>
>>> We index a database of about 4 million records, spread over a few hundred
>>> tables. The data consists of a mix of text, dates, numbers etc. We also
>>> add all these fields as facets.
>>> Each night we delete about 90% of the data, which in testing reduces the
>>> index size significantly.
>>> We store the data as StoredFields as well, to prevent having to access the
>>> database at all.
>>> We use FloatAssociatedFacet fields for the facets.
>>>
>>> In production, however, it seems the index is only growing, up to 71 GB for
>>> these records after a month of running.
>>>
>>> It seems that Lucene's index is just getting bigger there.
>>>
>>> We use Lucene 5.3 on CentOS, Java 8 64-bit.
>>>
>>> The taxonomy index does not grow significantly.
>>>
>>> How should I go about checking what is wrong?
>>>
>>> Thanks!