Hi,

I did the following on the existing index:
- expunge deletes
- optimize(5)
- check index
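For reference, the three maintenance steps above could look like this in Lucene 3.4. This is only a sketch: the index path and the choice of StandardAnalyzer are placeholders, not taken from the original setup.

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexMaintenance {
    public static void main(String[] args) throws Exception {
        // Placeholder path; substitute the real index directory.
        Directory dir = FSDirectory.open(new File("/path/to/index"));

        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
        IndexWriter writer = new IndexWriter(dir, cfg);

        writer.expungeDeletes(); // merge segments so deleted docs are dropped
        writer.optimize(5);      // merge down to at most 5 segments
        writer.close();

        // CheckIndex reports per-segment stats and verifies correctness.
        CheckIndex checker = new CheckIndex(dir);
        CheckIndex.Status status = checker.checkIndex();
        System.out.println(status.clean ? "index is clean" : "index has problems");
    }
}
```

Both expungeDeletes() and optimize(int maxNumSegments) block until the corresponding merges are scheduled, so the on-disk size measured right after close() reflects the merged state.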
Then I exported all docs from the existing index into a new one, and on the new index I did:
- optimize(5)
- check index

The entire log is at http://dl.dropbox.com/u/47469698/lucene/index.txt. During the export I also monitored the size on disk after each chunk of 100000 docs added to the new index: http://dl.dropbox.com/u/47469698/lucene/index.xls

What I found was that the index took around 2400 MB per million docs almost all the time, and from time to time it would take a little more (< 3500) for a short period. This stayed true until around 28 million docs, where the size on disk increased a lot (4500 MB per million docs = 135 GB on disk) until the end of the export (my index contains 32 million docs). At the end, the optimize brought the space on disk from 134 GB down to 91 GB. But even at 91 GB for 32 million docs, that is still about 3000 MB per million docs, far more than the 2400 I was seeing most of the time.

I understand that merges happen. What surprised me was that the behavior between 28 and 32 million docs was much bigger in scale than the earlier merges, and even an optimize would not solve it entirely. Did I reach a limit? Should I keep the index at 25 million docs to avoid this behavior? I am using Lucene 3.4 with the tiered merge policy, and all the fields are stored.

Thanks,
Vincent Sevel


Ian Lea <ian....@gmail.com> wrote on 27.10.2011 15:28 to java-user@lucene.apache.org
Subject: Re: index bigger than it should be?

There's org.apache.lucene.index.CheckIndex, which will report assorted stats about the index as well as checking it for correctness. It can fix it too, but you don't need that. I hope. It will take quite a while to run on a large index.

What version of Lucene? Does a before/after (or large/small) directory listing give any clues?

--
Ian.
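A before/after directory listing, or the per-chunk monitoring described above, boils down to summing file sizes and normalizing by doc count. A minimal stdlib-only helper (class and method names are mine, for illustration only):

```java
import java.io.File;

public class IndexSizeMonitor {

    // Recursively sum the size in bytes of all files under a directory,
    // e.g. a Lucene index directory.
    static long sizeOnDisk(File dir) {
        File[] children = dir.listFiles();
        if (children == null) return dir.length(); // plain file
        long total = 0;
        for (File f : children) {
            total += f.isDirectory() ? sizeOnDisk(f) : f.length();
        }
        return total;
    }

    // MB taken per million docs, the metric used in the thread.
    static double mbPerMillionDocs(long bytes, long docCount) {
        double mb = bytes / (1024.0 * 1024.0);
        return mb / (docCount / 1000000.0);
    }

    public static void main(String[] args) {
        // The 91 GB / 32 million docs figure from the thread:
        long bytes = 91L * 1024 * 1024 * 1024;
        System.out.printf("%.0f MB/million docs%n",
                mbPerMillionDocs(bytes, 32000000L)); // prints 2912 MB/million docs
    }
}
```

Note 91 GiB over 32 million docs works out to about 2912 MB per million, consistent with the rough "3000 MB/million" figure quoted above.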
On Thu, Oct 27, 2011 at 12:44 PM, <v.se...@lombardodier.com> wrote:
> Hi,
>
> I have an application that has an index with 30 million docs in it. Every
> day I add around 1 million docs and remove the oldest 1 million, to keep
> it stable at 30 million. For the most part, doc fields are indexed and
> stored. Each doc weighs from a few KB to 1 MB (a few MB in some cases).
>
> I used to be able to maintain the index at around 60 GB on disk, but
> recently the index has tended to keep growing (90 GB). I can see that the
> expunge is doing what it should, because after it executes the size on
> disk does go down, but never as low as the previous day. From the outside
> it looks like a leak, but since I do not remove the docs I added during
> the day, it might be that the new docs are just bigger than the old ones.
> Still, I am surprised by the increase.
>
> Are there any tools to dig into the index structure and help justify the
> space taken on disk? I was thinking of something that would help identify
> the terms that take up the most space, or some sort of dump that I could
> compare from one day to the next.
>
> Any help appreciated,
>
> thanks,
>
> vince
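The rolling window in the quoted message (add ~1 million docs, remove the oldest ~1 million) could be sketched as follows in Lucene 3.4. This assumes each doc carries a numeric "timestamp" field; the field name, path, and analyzer are my own placeholders, not details from the original application.

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class RollingWindow {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")), // placeholder path
                new IndexWriterConfig(Version.LUCENE_34,
                        new StandardAnalyzer(Version.LUCENE_34)));

        // Delete everything older than 30 days, assuming docs were indexed
        // with a NumericField named "timestamp" holding epoch millis.
        long cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;
        writer.deleteDocuments(NumericRangeQuery.newLongRange(
                "timestamp", null, cutoff, true, false));

        writer.expungeDeletes(); // reclaim the space held by the deleted docs
        writer.close();
    }
}
```

Note that deleteDocuments only marks docs as deleted; the disk space comes back only once merges (or an explicit expungeDeletes/optimize) rewrite the affected segments, which matches the day-over-day size behavior described above.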