You can! (At least at the Lucene level) You just need to implement a custom MergePolicy.
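To make the idea concrete, here is a minimal, self-contained sketch of the selection step such a policy might use: rank segments by their fraction of deleted docs and take the worst ones first. `SegmentStats` and `MostDeletedFirst` are hypothetical names for illustration, not Lucene classes; a real implementation would have to be wired into MergePolicy and read these counts from the index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for per-segment statistics; not a Lucene class.
final class SegmentStats {
    final String name;
    final int docCount; // total docs in the segment, including deleted ones
    final int delCount; // docs marked as deleted

    SegmentStats(String name, int docCount, int delCount) {
        this.name = name;
        this.docCount = docCount;
        this.delCount = delCount;
    }

    // Fraction of docs that are deleted, in [0, 1]; an empty segment counts as 0.
    double deletedRatio() {
        return docCount <= 0 ? 0.0 : (double) delCount / docCount;
    }
}

// Hypothetical selection helper: picks merge candidates by deleted ratio.
final class MostDeletedFirst {
    // Returns up to maxSegments segments, highest deleted ratio first.
    static List<SegmentStats> pick(List<SegmentStats> segments, int maxSegments) {
        List<SegmentStats> sorted = new ArrayList<>(segments);
        sorted.sort((a, b) -> Double.compare(b.deletedRatio(), a.deletedRatio()));
        int n = Math.min(maxSegments, sorted.size());
        return new ArrayList<>(sorted.subList(0, n));
    }
}
```

In a FIFO index like the one described in the quoted message, the oldest segments would also tend to be the ones with the highest deleted ratio, so this ordering should cover both criteria.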
Or, the default (LogByteSizeMergePolicy) already has a
calibrateSizeByDeletes setting, which discounts each segment's size in
proportion to its percentage of deleted docs. Big segments that are
mostly deleted then look small, so the merge policy targets them more
"aggressively".

Maybe try enabling that first? Please report back if you get
interesting results!

I think the merge policy should be doing this by default; in fact we
had intended to as of 3.0 (there is a TODO in the code), but we missed
it. I'll make sure we change this default for 4.0.

Mike

On Fri, Aug 13, 2010 at 2:32 PM, Ron Mayer <r...@0ape.com> wrote:
> Short summary:
>
> * If I could make Solr merge oldest segments (or the one
>   with the most deleted docs) rather than smallest
>   segments; I think I'd almost never need "optimize".
>
> * Can I tell Solr to do this? Or if not, can someone
>   point me in the right direction regarding where I might
>   patch it to try this myself?
>
>
> I have a system where documents are refreshed and/or expired
> pretty much in a FIFO manner. In particular, no document
> in the system can live for over 1 month.
>
> Without frequent optimizes, ISTM my indexes tend to get
> bloated with mostly deleted content. I attached a ls-l
> below - showing the largest segments in my index are all
> from July. A query of
> timestamp:([1999-01-01T00:00:00Z TO 2010-08-01T23:59:59Z])
> returns no documents so it appears to me the first 2 segments
> are entirely filled with deleted documents.
>
> I imagine this is not too uncommon a situation -- for example
> a web-crawler that periodically updates web pages that contain
> some dynamic content.
>
> Perhaps a different good criteria would be selecting to merge
> the segments with the largest number of deleted documents.
> In my case it'd be the same; but I could imagine non-FIFO
> update-heavy systems where that would work better.
>
>
> $ ls -lrt *.fdt
> -rw-rw-r-- 1 ramayer ramayer 291490823897 Jul 20 21:34 _u63.fdt
> -rw-rw-r-- 1 ramayer ramayer 78251326159 Jul 29 18:15 _xkh.fdt
> -rw-rw-r-- 1 ramayer ramayer 69295141685 Aug 8 01:29 _10f5.fdt
> -rw-rw-r-- 1 ramayer ramayer 5406369697 Aug 10 21:14 _13fv.fdt
> -rw-rw-r-- 1 ramayer ramayer 66210508029 Aug 10 21:44 _13g1.fdt
> -rw-rw-r-- 1 ramayer ramayer 2001873014 Aug 10 23:05 _13io.fdt
> -rw-rw-r-- 1 ramayer ramayer 1578531820 Aug 11 14:10 _13m8.fdt
> -rw-rw-r-- 1 ramayer ramayer 2254917604 Aug 12 03:49 _13p3.fdt
> -rw-rw-r-- 1 ramayer ramayer 2890967852 Aug 12 06:49 _13s6.fdt
> -rw-rw-r-- 1 ramayer ramayer 2820285238 Aug 12 09:49 _13v9.fdt
> -rw-rw-r-- 1 ramayer ramayer 2905550377 Aug 12 12:52 _13yc.fdt
> -rw-rw-r-- 1 ramayer ramayer 2776837514 Aug 12 15:54 _141f.fdt
> -rw-rw-r-- 1 ramayer ramayer 259698816 Aug 12 16:15 _141p.fdt
> -rw-rw-r-- 1 ramayer ramayer 290083173 Aug 12 16:34 _1420.fdt
> -rw-rw-r-- 1 ramayer ramayer 279500106 Aug 12 16:54 _142b.fdt
> -rw-rw-r-- 1 ramayer ramayer 277156197 Aug 12 17:17 _142m.fdt
> -rw-rw-r-- 1 ramayer ramayer 91360010 Aug 13 00:27 _142x.fdt
> -rw-rw-r-- 1 ramayer ramayer 7351514 Aug 13 00:37 _142y.fdt
> -rw-rw-r-- 1 ramayer ramayer 7286 Aug 13 00:38 _142z.fdt
> -rw-rw-r-- 1 ramayer ramayer 21 Aug 13 01:07 _1430.fdt
> -rw-rw-r-- 1 ramayer ramayer 21 Aug 13 02:07 _1431.fdt
> -rw-rw-r-- 1 ramayer ramayer 21 Aug 13 03:07 _1432.fdt
> -rw-rw-r-- 1 ramayer ramayer 21 Aug 13 04:07 _1433.fdt
> -rw-rw-r-- 1 ramayer ramayer 2388369 Aug 13 04:35 _1434.fdt
> -rw-rw-r-- 1 ramayer ramayer 21 Aug 13 05:07 _1435.fdt
> -rw-rw-r-- 1 ramayer ramayer 21 Aug 13 06:07 _1436.fdt
> -rw-rw-r-- 1 ramayer ramayer 21 Aug 13 07:07 _1437.fdt
> -rw-rw-r-- 1 ramayer ramayer 21 Aug 13 08:07 _1438.fdt
> -rw-rw-r-- 1 ramayer ramayer 21 Aug 13 09:07 _1439.fdt
> -rw-rw-r-- 1 ramayer ramayer 21 Aug 13 10:07 _143a.fdt
> -rw-rw-r-- 1 ramayer ramayer 198581 Aug 13 11:04 _143b.fdt
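
For reference, the proportional discount that calibrateSizeByDeletes applies can be sketched as follows. This is only an illustration of the idea (size scaled by the fraction of live docs), not Lucene's actual code; the class name `CalibratedSize` and its method are hypothetical.

```java
// Hypothetical helper illustrating a deletes-calibrated segment size.
final class CalibratedSize {
    // Scales the on-disk size by the fraction of docs that are still live,
    // so a mostly deleted segment is treated as a small one.
    static long calibrated(long byteSize, int docCount, int delCount) {
        if (docCount <= 0) {
            return byteSize; // no docs to compute a ratio from; leave size as-is
        }
        double delRatio = (double) delCount / docCount;
        return (long) (byteSize * (1.0 - delRatio));
    }
}
```

Under this scheme a large segment whose docs are all deleted would be sized at zero, which would put it among the smallest segments and therefore among the first candidates for merging.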