You can!  (At least at the Lucene level)

You just need to implement a custom MergePolicy.
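
To give a rough idea of the selection logic such a policy might use,
here is a minimal sketch that picks the segments with the highest
fraction of deleted docs. It does not use the Lucene API: the Seg class,
pickForMerge, minDelRatio and maxSegments are all hypothetical names
made up for illustration, and a real MergePolicy would have to read the
doc and delete counts from Lucene's own segment metadata.

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteHeavyFirst {
    /** Hypothetical stand-in for per-segment metadata (not a Lucene class). */
    public static final class Seg {
        public final String name;
        public final int maxDoc;    // total docs in the segment, including deleted
        public final int delCount;  // how many of those docs are deleted

        public Seg(String name, int maxDoc, int delCount) {
            this.name = name;
            this.maxDoc = maxDoc;
            this.delCount = delCount;
        }

        /** Fraction of the segment's docs that are deleted, in [0, 1]. */
        public double delRatio() {
            return maxDoc <= 0 ? 0.0 : (double) delCount / maxDoc;
        }
    }

    /**
     * Returns at most maxSegments segments whose deleted fraction is at
     * least minDelRatio, ordered with the most-deleted segment first.
     */
    public static List<Seg> pickForMerge(List<Seg> segs, double minDelRatio, int maxSegments) {
        List<Seg> candidates = new ArrayList<>();
        for (Seg s : segs) {
            if (s.delRatio() >= minDelRatio) {
                candidates.add(s);
            }
        }
        // Sort descending by deleted fraction.
        candidates.sort((a, b) -> Double.compare(b.delRatio(), a.delRatio()));
        int n = Math.min(maxSegments, candidates.size());
        return new ArrayList<>(candidates.subList(0, n));
    }
}
```

For a FIFO index like yours, the oldest segments should also be the
ones with the highest deleted fraction, so this one criterion would
cover both of the orderings you describe.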

Or, the default (LogByteSizeMergePolicy) already has a
calibrateSizeByDeletes setting which proportionally discounts each
segment's size according to its percentage of deleted docs. That makes
big segments that are mostly deleted look small, which in turn causes
the merge policy to target them more "aggressively".
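
Roughly, the discount works like the sketch below. This is a simplified
illustration of the arithmetic, not the actual Lucene source: the class
and method names are made up, and the real policy takes the byte size
and delete counts from the segment's own metadata.

```java
public class CalibratedSize {
    /**
     * Size the merge policy would "see" for a segment when sizes are
     * discounted by deletes: the on-disk size scaled by the fraction of
     * docs that are still live.
     */
    public static long effectiveBytes(long byteSize, int maxDoc, int delCount) {
        if (maxDoc <= 0) {
            return byteSize; // empty segment: nothing to discount
        }
        double delRatio = (double) delCount / maxDoc;
        return (long) (byteSize * (1.0 - delRatio));
    }

    public static void main(String[] args) {
        // A 1000-byte segment with 75 of its 100 docs deleted looks like 250 bytes.
        System.out.println(effectiveBytes(1000L, 100, 75));
        // A fully deleted segment looks like 0 bytes, however big it is on disk.
        System.out.println(effectiveBytes(1000L, 100, 100));
    }
}
```

So a huge segment that is entirely deleted ends up looking like the
smallest segment in the index, and a size-based policy that merges
small segments first should pick it up.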

Maybe try enabling that first?  Please report back if you get
interesting results!

I think the merge policy should be doing this by default. In fact we
had intended to enable it as of 3.0 (there is a TODO in the code), but
we missed it.  I'll make sure we change this default for 4.0.

Mike

On Fri, Aug 13, 2010 at 2:32 PM, Ron Mayer <r...@0ape.com> wrote:
> Short summary:
>
>  * If I could make Solr merge oldest segments (or the one
>   with the most deleted docs) rather than smallest
>   segments; I think I'd almost never need "optimize".
>
>  * Can I tell Solr to do this?  Or if not, can someone
>   point me in the right direction regarding where I might
>   patch it to try this myself?
>
>
> I have a system where documents are refreshed and/or expired
> pretty much in a FIFO manner.  In particular, no document
> in the system can live for over 1 month.
>
> Without frequent optimizes, ISTM my indexes tend to get
> bloated with mostly deleted content.   I attached a ls-l
> below - showing the largest segments in my index are all
> from July.   A query of
>   timestamp:([1999-01-01T00:00:00Z TO 2010-08-01T23:59:59Z])
> returns no documents so it appears to me the first 2 segments
> are entirely filled with deleted documents.
>
> I imagine this is not too uncommon a situation -- for example
> a web-crawler that periodically updates web pages that contain
> some dynamic content.
>
> Perhaps a different good criteria would be selecting to merge
> the segments with the largest number of deleted documents.
> In my case it'd be the same; but I could imagine non-FIFO
> update-heavy systems where that would work better.
>
>
>
>
> $ ls -lrt *.fdt
> -rw-rw-r-- 1 ramayer ramayer 291490823897 Jul 20 21:34 _u63.fdt
> -rw-rw-r-- 1 ramayer ramayer  78251326159 Jul 29 18:15 _xkh.fdt
> -rw-rw-r-- 1 ramayer ramayer  69295141685 Aug  8 01:29 _10f5.fdt
> -rw-rw-r-- 1 ramayer ramayer   5406369697 Aug 10 21:14 _13fv.fdt
> -rw-rw-r-- 1 ramayer ramayer  66210508029 Aug 10 21:44 _13g1.fdt
> -rw-rw-r-- 1 ramayer ramayer   2001873014 Aug 10 23:05 _13io.fdt
> -rw-rw-r-- 1 ramayer ramayer   1578531820 Aug 11 14:10 _13m8.fdt
> -rw-rw-r-- 1 ramayer ramayer   2254917604 Aug 12 03:49 _13p3.fdt
> -rw-rw-r-- 1 ramayer ramayer   2890967852 Aug 12 06:49 _13s6.fdt
> -rw-rw-r-- 1 ramayer ramayer   2820285238 Aug 12 09:49 _13v9.fdt
> -rw-rw-r-- 1 ramayer ramayer   2905550377 Aug 12 12:52 _13yc.fdt
> -rw-rw-r-- 1 ramayer ramayer   2776837514 Aug 12 15:54 _141f.fdt
> -rw-rw-r-- 1 ramayer ramayer    259698816 Aug 12 16:15 _141p.fdt
> -rw-rw-r-- 1 ramayer ramayer    290083173 Aug 12 16:34 _1420.fdt
> -rw-rw-r-- 1 ramayer ramayer    279500106 Aug 12 16:54 _142b.fdt
> -rw-rw-r-- 1 ramayer ramayer    277156197 Aug 12 17:17 _142m.fdt
> -rw-rw-r-- 1 ramayer ramayer     91360010 Aug 13 00:27 _142x.fdt
> -rw-rw-r-- 1 ramayer ramayer      7351514 Aug 13 00:37 _142y.fdt
> -rw-rw-r-- 1 ramayer ramayer         7286 Aug 13 00:38 _142z.fdt
> -rw-rw-r-- 1 ramayer ramayer           21 Aug 13 01:07 _1430.fdt
> -rw-rw-r-- 1 ramayer ramayer           21 Aug 13 02:07 _1431.fdt
> -rw-rw-r-- 1 ramayer ramayer           21 Aug 13 03:07 _1432.fdt
> -rw-rw-r-- 1 ramayer ramayer           21 Aug 13 04:07 _1433.fdt
> -rw-rw-r-- 1 ramayer ramayer      2388369 Aug 13 04:35 _1434.fdt
> -rw-rw-r-- 1 ramayer ramayer           21 Aug 13 05:07 _1435.fdt
> -rw-rw-r-- 1 ramayer ramayer           21 Aug 13 06:07 _1436.fdt
> -rw-rw-r-- 1 ramayer ramayer           21 Aug 13 07:07 _1437.fdt
> -rw-rw-r-- 1 ramayer ramayer           21 Aug 13 08:07 _1438.fdt
> -rw-rw-r-- 1 ramayer ramayer           21 Aug 13 09:07 _1439.fdt
> -rw-rw-r-- 1 ramayer ramayer           21 Aug 13 10:07 _143a.fdt
> -rw-rw-r-- 1 ramayer ramayer       198581 Aug 13 11:04 _143b.fdt
>
>
