Hmm... are you using IndexReader.numDeletedDocs to check? Did you commit from the writer and then reopen the IndexReader before calling .numDeletedDocs? Otherwise the reader is still a point-in-time view of the old commit and won't see the change.
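To make the ordering concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: ToyWriter, ToyReader and the count of 5 deleted docs are made-up names and numbers for illustration only. It just mimics the point-in-time behaviour, where a reader opened before the commit keeps reporting the old deleted-doc count until a new reader is opened.

```java
// Toy model only: ToyWriter and ToyReader are hypothetical stand-ins,
// not Lucene classes. They mimic a reader that is a point-in-time view.
class ReaderSnapshotDemo {

    /** Stand-in for the writer: changes become visible only after commit(). */
    static class ToyWriter {
        private int pendingDeleted;    // deleted docs in the writer's working state
        private int committedDeleted;  // deleted docs in the last commit

        ToyWriter(int deletedDocs) {
            this.pendingDeleted = deletedDocs;
            this.committedDeleted = deletedDocs;
        }

        /** Pretend every segment was rewritten without its deleted docs. */
        void expungeDeletes() {
            pendingDeleted = 0;
        }

        /** Publish the working state as the new commit point. */
        void commit() {
            committedDeleted = pendingDeleted;
        }

        /** Open a reader on the last commit; it never changes afterwards. */
        ToyReader openReader() {
            return new ToyReader(committedDeleted);
        }
    }

    /** Stand-in for the reader: a frozen snapshot of one commit. */
    static class ToyReader {
        private final int deleted;

        ToyReader(int deleted) {
            this.deleted = deleted;
        }

        int numDeletedDocs() {
            return deleted;
        }
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter(5);      // 5 is an arbitrary example count
        ToyReader reader = writer.openReader();   // opened before the expunge

        writer.expungeDeletes();
        writer.commit();

        // The old reader still reports the pre-expunge count.
        System.out.println("old reader: " + reader.numDeletedDocs());

        // Only a reader opened after the commit sees the change.
        reader = writer.openReader();
        System.out.println("new reader: " + reader.numDeletedDocs());
    }
}
```

With the real API the same order applies: expunge, commit from the writer, reopen the IndexReader, and only then call numDeletedDocs.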
Mike McCandless

http://blog.mikemccandless.com

On Sat, Sep 10, 2011 at 11:58 PM, <v.se...@lombardodier.com> wrote:
> Hi, even with setExpungeDeletesPctAllowed(0.0), I could not get docs to
> get removed from disk.
> after the expunge+commit I print again the numDeletedDocs, and it stays
> the same.
> regards,
> vincent
>
> Michael McCandless <luc...@mikemccandless.com>
> 09.09.2011 20:53
> Please respond to java-user@lucene.apache.org
>
> To: java-user@lucene.apache.org
> cc:
> Subject: Re: optimize with num segments > 1 index keeps growing
>
> TieredMergePolicy by default will only merge a segment if it has > 10%
> deletions.
>
> Can you try calling .setExpungeDeletesPctAllowed(0.0) and then expunge
> again?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Fri, Sep 9, 2011 at 1:41 PM, <v.se...@lombardodier.com> wrote:
>> Hi,
>>
>> this post is quite old, but I would like to share some recent
>> developments.
>>
>> I applied the recommendation. my process became: expunge deletes and
>> optimize 2 segments.
>>
>> at the time I was with lucene 3.1 and that solved my issue. recently I
>> moved to lucene 3.3, and I tried playing with the new tiered merge
>> policy. what I found was that after an expunge, the number of deleted
>> docs would stay the same, and space would not be reclaimed on the disk.
>> I switched back to the default merge policy (LogByteSizeMergePolicy:
>> minMergeSize=1677721, mergeFactor=10, maxMergeSize=2147483648,
>> maxMergeSizeForOptimize=9223372036854775807,
>> calibrateSizeByDeletes=true, maxMergeDocs=2147483647,
>> useCompoundFile=true, noCFSRatio=0.1) and got this time the right
>> behavior: size was reclaimed on disk. I even tried with the
>> BalancedSegmentMergePolicy and got again the right behavior.
>>
>> so this issue seems to affect only the tiered merge policy.
>>
>> to illustrate this, I took an index with many deleted docs then
>> expunged/optimized while using the tiered policy, then did the same
>> thing with a default merge policy. here is for each step the content of
>> the directory:
>>
>> before:
>>
>> 09.09.2011  17:38              20  segments.gen
>> 09.09.2011  17:38           5'335  segments_4bf1u
>> 06.09.2011  15:27               0  write.lock
>> 06.09.2011  00:49  31'681'157'794  _jhwld.fdt
>> 06.09.2011  00:49     115'562'268  _jhwld.fdx
>> 06.09.2011  00:37           5'347  _jhwld.fnm
>> 06.09.2011  01:13   7'147'947'472  _jhwld.frq
>> 06.09.2011  01:13   3'927'649'164  _jhwld.prx
>> 06.09.2011  01:13      41'992'760  _jhwld.tii
>> 06.09.2011  01:13   3'745'729'056  _jhwld.tis
>> 09.09.2011  00:27       1'805'669  _jhwld_3.del
>> 09.09.2011  00:31  11'397'619'448  _jtrwg.fdt
>> 09.09.2011  00:31      98'393'316  _jtrwg.fdx
>> 09.09.2011  00:27           5'347  _jtrwg.fnm
>> 09.09.2011  00:47   5'146'273'732  _jtrwg.frq
>> 09.09.2011  00:47   1'661'436'146  _jtrwg.prx
>> 09.09.2011  00:47      23'950'194  _jtrwg.tii
>> 09.09.2011  00:47   2'139'903'139  _jtrwg.tis
>> 09.09.2011  07:39      94'471'867  _jugaa.cfs
>> 09.09.2011  10:14     252'716'611  _juok2.cfs
>> 09.09.2011  15:45       7'986'102  _jwuaq.cfs
>> 09.09.2011  16:00       5'780'703  _jx45g.cfs
>> 09.09.2011  16:00     333'981'384  _jx46a.cfs
>> 09.09.2011  16:23      20'955'761  _jxge0.cfs
>> 09.09.2011  16:46      19'258'025  _jxmas.cfs
>> 09.09.2011  16:55      16'622'800  _jxpv4.cfs
>> 09.09.2011  17:10      14'605'028  _jxvd6.cfs
>> 09.09.2011  17:34      12'456'476  _jy28o.cfs
>> 09.09.2011  17:38       2'584'950  _jy91y.cfs
>> 09.09.2011  17:38       2'595'049  _jy92i.cfs
>> 09.09.2011  17:38       2'600'991  _jy932.cfs
>> 09.09.2011  17:38       2'610'278  _jy93m.cfs
>> 09.09.2011  17:38          46'664  _jy93x.cfs
>> 09.09.2011  17:38           9'765  _jy93y.cfs
>> 09.09.2011  17:38          10'691  _jy93z.cfs
>> 09.09.2011  17:38           9'533  _jy940.cfs
>> 09.09.2011  17:38          11'684  _jy941.cfs
>> 09.09.2011  17:38           8'996  _jy942.cfs
>>         38 File(s)  67'918'759'565 bytes
>>
>> after expunge/optimize (tiered merge policy):
>>
>> 09.09.2011  18:02              20  segments.gen
>> 09.09.2011  18:02           3'171  segments_4bf3g
>> 06.09.2011  15:27               0  write.lock
>> 06.09.2011  00:49  31'681'157'794  _jhwld.fdt
>> 06.09.2011  00:49     115'562'268  _jhwld.fdx
>> 06.09.2011  00:37           5'347  _jhwld.fnm
>> 06.09.2011  01:13   7'147'947'472  _jhwld.frq
>> 06.09.2011  01:13   3'927'649'164  _jhwld.prx
>> 06.09.2011  01:13      41'992'760  _jhwld.tii
>> 06.09.2011  01:13   3'745'729'056  _jhwld.tis
>> 09.09.2011  17:39       1'805'669  _jhwld_4.del
>> 09.09.2011  17:45  11'814'367'373  _jy9iy.fdt
>> 09.09.2011  17:45     101'565'036  _jy9iy.fdx
>> 09.09.2011  17:39           5'347  _jy9iy.fnm
>> 09.09.2011  18:01   5'328'530'169  _jy9iy.frq
>> 09.09.2011  18:01   1'733'490'572  _jy9iy.prx
>> 09.09.2011  18:01      25'072'713  _jy9iy.tii
>> 09.09.2011  18:01   2'239'702'399  _jy9iy.tis
>> 09.09.2011  18:02         185'962  _jy9mv.cfs
>> 09.09.2011  18:02           9'955  _jy9mw.cfs
>> 09.09.2011  18:02          10'380  _jy9mx.cfs
>> 09.09.2011  18:02           9'341  _jy9my.cfs
>> 09.09.2011  18:02           9'228  _jy9mz.cfs
>> 09.09.2011  18:02          10'382  _jy9n0.cfs
>> 09.09.2011  18:02           9'345  _jy9n1.cfs
>> 09.09.2011  18:02           9'231  _jy9n2.cfs
>> 09.09.2011  18:02           8'961  _jy9n3.cfs
>> 09.09.2011  18:02          10'381  _jy9n4.cfs
>> 09.09.2011  18:02         199'651  _jy9n5.cfs
>> 09.09.2011  18:02           9'345  _jy9n6.cfs
>> 09.09.2011  18:02           9'230  _jy9n7.cfs
>>         31 File(s)  67'905'077'722 bytes
>>
>> after expungeDeletes/optimize with default merge policy:
>>
>> 09.09.2011  19:31              20  segments.gen
>> 09.09.2011  19:31           2'081  segments_4bfpe
>> 09.09.2011  18:13               0  write.lock
>> 09.09.2011  18:42  30'133'772'814  _jyb4c.fdt
>> 09.09.2011  18:42     103'164'812  _jyb4c.fdx
>> 09.09.2011  18:27           5'347  _jyb4c.fnm
>> 09.09.2011  19:03   6'474'023'590  _jyb4c.frq
>> 09.09.2011  19:03   3'699'406'141  _jyb4c.prx
>> 09.09.2011  19:03      37'900'657  _jyb4c.tii
>> 09.09.2011  19:03   3'380'266'875  _jyb4c.tis
>> 09.09.2011  19:15  11'820'477'088  _jyb4e.fdt
>> 09.09.2011  19:15     101'659'700  _jyb4e.fdx
>> 09.09.2011  19:03           5'347  _jyb4e.fnm
>> 09.09.2011  19:29   5'333'219'797  _jyb4e.frq
>> 09.09.2011  19:29   1'734'633'179  _jyb4e.prx
>> 09.09.2011  19:29      25'105'023  _jyb4e.tii
>> 09.09.2011  19:29   2'242'558'333  _jyb4e.tis
>> 09.09.2011  19:31         223'600  _jyb5t.cfs
>> 09.09.2011  19:31           9'545  _jyb5u.cfs
>> 09.09.2011  19:31           8'963  _jyb5v.cfs
>> 09.09.2011  19:31           9'250  _jyb5w.cfs
>> 09.09.2011  19:31           9'047  _jyb5x.cfs
>> 09.09.2011  19:31          11'253  _jyb5y.cfs
>> 09.09.2011  19:31          11'239  _jyb5z.cfs
>>         24 File(s)  65'086'483'701 bytes
>>
>> any clue to what is happening?
>>
>> thanks,
>>
>> Vincent
>>
>> "Uwe Schindler" <u...@thetaphi.de>
>> 21.07.2011 22:46
>> Please respond to java-user@lucene.apache.org
>>
>> To: <java-user@lucene.apache.org>
>> cc:
>> Subject: RE: optimize with num segments > 1 index keeps growing
>>
>> There is also expungeDeletes()...
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>> -----Original Message-----
>>> From: v.se...@lombardodier.com [mailto:v.se...@lombardodier.com]
>>> Sent: Thursday, July 21, 2011 8:39 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: optimize with num segments > 1 index keeps growing
>>>
>>> Hi, thanks for this explanation.
>>> so what is the best solution: merge the large segment (how can I do
>>> that) or work with many segments (10?) so that I will avoid having
>>> this "large segment" issue?
>>> thanks,
>>> vince
>>>
>>> Vincent Sevel
>>> Lombard Odier Darier Hentsch & Cie
>>> 11, rue de la Corraterie - 1204 Genève - Suisse
>>> T +41 22 709 3376 - F +41 22 709 3782
>>> www.lombardodier.com
>>>
>>> Simon Willnauer <simon.willna...@googlemail.com>
>>> 21.07.2011 20:06
>>> Please respond to java-user@lucene.apache.org
>>>
>>> To: java-user@lucene.apache.org
>>> cc:
>>> Subject: Re: optimize with num segments > 1 index keeps growing
>>>
>>> so the problem here is that you have one really big segment _52aho.*
>>> and several smaller ones _7e0wz.*, _7e0xu.*, _7e1x5.* ....
>>> if you optimize to 2 segments all the smaller segments are merged into
>>> one but the large segment remains untouched. This means that all
>>> deleted documents in the large segment are not removed / freed, while
>>> if you optimized to one segment they are removed. In the single-segment
>>> index there is no *.del file present, meaning no deletes. Unless you
>>> merge the large segment, all your deleted documents are only marked as
>>> deleted but not yet removed.
>>>
>>> simon
>>>
>>> On Thu, Jul 21, 2011 at 5:50 PM, <v.se...@lombardodier.com> wrote:
>>> > hi,
>>> > closing after the 2 segments optimize does not change it.
>>> > also I am running with lucene 3.1.0.
>>> > cheers,
>>> > vince
>>> >
>>> > Ian Lea <ian....@gmail.com>
>>> > 21.07.2011 17:30
>>> > Please respond to java-user@lucene.apache.org
>>> >
>>> > To: java-user@lucene.apache.org
>>> > cc:
>>> > Subject: Re: optimize with num segments > 1 index keeps growing
>>> >
>>> > A write.lock file with timestamp of 13:58 is in all the listings. The
>>> > first thing I'd try is to add some IndexWriter.close() calls.
>>> >
>>> > --
>>> > Ian.
>>> >
>>> > On Thu, Jul 21, 2011 at 4:05 PM, <v.se...@lombardodier.com> wrote:
>>> >> Hi,
>>> >>
>>> >> here is a concrete example.
>>> >>
>>> >> I am starting with an index that has 19017236 docs, which takes
>>> >> 58989 Mb on disk:
>>> >>
>>> >> 21.07.2011  15:21              20  segments.gen
>>> >> 21.07.2011  15:21           2'974  segments_2acy4
>>> >> 21.07.2011  13:58               0  write.lock
>>> >> 16.07.2011  02:21  33'445'798'886  _52aho.fdt
>>> >> 16.07.2011  02:21     178'723'932  _52aho.fdx
>>> >> 16.07.2011  01:58           5'002  _52aho.fnm
>>> >> 16.07.2011  03:10   9'857'410'889  _52aho.frq
>>> >> 16.07.2011  03:10   4'538'234'846  _52aho.prx
>>> >> 16.07.2011  03:10      61'581'767  _52aho.tii
>>> >> 16.07.2011  03:10   5'505'039'790  _52aho.tis
>>> >> 21.07.2011  01:01       1'899'536  _52aho_5.del
>>> >> 21.07.2011  01:05   4'222'206'034  _6t61z.fdt
>>> >> 21.07.2011  01:05      21'424'556  _6t61z.fdx
>>> >> 21.07.2011  01:01           5'002  _6t61z.fnm
>>> >> 21.07.2011  01:12   1'170'370'187  _6t61z.frq
>>> >> 21.07.2011  01:12     598'373'388  _6t61z.prx
>>> >> 21.07.2011  01:12       7'574'912  _6t61z.tii
>>> >> 21.07.2011  01:12     678'766'206  _6t61z.tis
>>> >> 21.07.2011  13:46   1'458'592'058  _7d6me.cfs
>>> >> 21.07.2011  13:48      15'702'654  _7dhgz.cfs
>>> >> 21.07.2011  13:52      16'800'942  _7dphm.cfs
>>> >> 21.07.2011  13:55      16'714'431  _7dxht.cfs
>>> >> 21.07.2011  14:24      17'505'435  _7e0wz.cfs
>>> >> 21.07.2011  14:24       5'875'852  _7e0xu.cfs
>>> >> 21.07.2011  14:48      18'340'470  _7e1x5.cfs
>>> >> 21.07.2011  15:19      16'978'564  _7e3ck.cfs
>>> >> 21.07.2011  15:21       1'208'656  _7e3hv.cfs
>>> >> 21.07.2011  15:21          19'361  _7e3hw.cfs
>>> >>         28 File(s)  61'855'156'350 bytes
>>> >>
>>> >> I am doing a delete of some of the older documents. after the
>>> >> delete, I commit then I optimize down to 2 segments. at the end of
>>> >> the optimize the index contains 18702510 docs (314727 were deleted)
>>> >> and it takes now 58975 Mb on disk:
>>> >>
>>> >> 21.07.2011  15:37              20  segments.gen
>>> >> 21.07.2011  15:37             524  segments_2acy6
>>> >> 21.07.2011  13:58               0  write.lock
>>> >> 16.07.2011  02:21  33'445'798'886  _52aho.fdt
>>> >> 16.07.2011  02:21     178'723'932  _52aho.fdx
>>> >> 16.07.2011  01:58           5'002  _52aho.fnm
>>> >> 16.07.2011  03:10   9'857'410'889  _52aho.frq
>>> >> 16.07.2011  03:10   4'538'234'846  _52aho.prx
>>> >> 16.07.2011  03:10      61'581'767  _52aho.tii
>>> >> 16.07.2011  03:10   5'505'039'790  _52aho.tis
>>> >> 21.07.2011  15:23       1'999'945  _52aho_6.del
>>> >> 21.07.2011  15:31   5'194'848'138  _7e3hy.fdt
>>> >> 21.07.2011  15:31      28'613'668  _7e3hy.fdx
>>> >> 21.07.2011  15:25           5'002  _7e3hy.fnm
>>> >> 21.07.2011  15:37   1'529'771'296  _7e3hy.frq
>>> >> 21.07.2011  15:37     726'582'244  _7e3hy.prx
>>> >> 21.07.2011  15:37       8'518'198  _7e3hy.tii
>>> >> 21.07.2011  15:37     763'213'144  _7e3hy.tis
>>> >>         18 File(s)  61'840'347'291 bytes
>>> >>
>>> >> as you can see, size on disk did not really change. at this point I
>>> >> optimize down to 1 segment and at the end the index takes 48273 Mb
>>> >> on disk:
>>> >>
>>> >> 21.07.2011  16:46              20  segments.gen
>>> >> 21.07.2011  16:46             278  segments_2acy8
>>> >> 21.07.2011  13:58               0  write.lock
>>> >> 21.07.2011  16:06  32'901'423'750  _7e3hz.fdt
>>> >> 21.07.2011  16:06     149'582'052  _7e3hz.fdx
>>> >> 21.07.2011  15:42           5'002  _7e3hz.fnm
>>> >> 21.07.2011  16:46   8'608'541'177  _7e3hz.frq
>>> >> 21.07.2011  16:46   4'392'616'115  _7e3hz.prx
>>> >> 21.07.2011  16:46      50'571'856  _7e3hz.tii
>>> >> 21.07.2011  16:46   4'515'914'658  _7e3hz.tis
>>> >>         10 File(s)  50'618'654'908 bytes
>>> >>
>>> >> this means that with the 1 segment optimize I was able to reclaim
>>> >> 10 Gb on disk that the 2 segments optimize could not achieve.
>>> >>
>>> >> how can this be explained? is that a normal behavior?
>>> >>
>>> >> thanks,
>>> >>
>>> >> vince
>>> >>
>>> >> Simon Willnauer <simon.willna...@googlemail.com>
>>> >> 20.07.2011 23:11
>>> >> Please respond to java-user@lucene.apache.org
>>> >>
>>> >> To: java-user@lucene.apache.org
>>> >> cc:
>>> >> Subject: Re: optimize with num segments > 1 index keeps growing
>>> >>
>>> >> On Wed, Jul 20, 2011 at 2:00 PM, <v.se...@lombardodier.com> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> I index several million small documents per day. each day, I
>>> >>> remove some of the older documents to keep the index at a stable
>>> >>> number of documents. after each purge, I commit then I optimize
>>> >>> the index. what I found is that if I keep optimizing with max num
>>> >>> segments = 2, then the index keeps growing on the disk. but as
>>> >>> soon as I optimize with just 1 segment, the space gets reclaimed
>>> >>> on the disk. so, I have currently adopted the following strategy:
>>> >>> every night I optimize with 2 segments, except once per week
>>> >>> where I optimize with just 1 segment.
>>> >>
>>> >> what do you mean by keeps growing? you have n segments and you
>>> >> optimize down to 2 and the index is bigger than the one with n
>>> >> segments?
>>> >>
>>> >> simon
>>> >>>
>>> >>> is that an expected behavior?
>>> >>> I guess I am doing something special because I was not able to
>>> >>> reproduce this behavior in a unit test. what could it be?
>>> >>>
>>> >>> it would be nice to get some explanatory services within the
>>> >>> product to help get some understanding of its behavior. something
>>> >>> that tells you some information about your index for instance
>>> >>> (number of docs in the different states, how the space is being
>>> >>> used, ...).
>>> >>> lucene is a wonderful product, but to me this is almost like black
>>> >>> magic, and when there is a specific behavior, I have few clues to
>>> >>> figure out something by myself. some user-oriented logging would be
>>> >>> nice as well (the index writer info stream is really verbose and
>>> >>> very low level).
>>> >>>
>>> >>> thanks for your help,
>>> >>>
>>> >>> Vince
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> >
>>> > ************************ DISCLAIMER ************************
>>> > This message is intended only for use by the person to whom it is
>>> > addressed. It may contain information that is privileged and
>>> > confidential. Its content does not constitute a formal commitment by
>>> > Lombard Odier Darier Hentsch & Cie or any of its branches or
>>> > affiliates.
>>> > If you are not the intended recipient of this message, kindly notify
>>> > the sender immediately and destroy this message. Thank You.
>>> > *****************************************************************