> Right I'm calling a newly created segment (ie flushed from RAM) level
> 0 and then a level 1 segment is created when you merge 10 level 0
> segments, level 2 is created when merge 10 level 1 segments, etc.
This isn't the way the current code treats things. I'm not saying it's the only way to look at things, just making sure I/we are clear on the current state. The current code effectively computes level as a function of the size of the segment, independent of how the segment was created and of the size of its neighbors. Hmmm ... that's not completely true. Let's say it's mostly true. Okay, kinda true.

> Backing up a bit ... I think the lowest amortized cost merge policy
> should always try to merge roughly equal sized segments

Well, the logarithmic policy does do this; it just clumps sizes logarithmically, with the boundary condition (the size of the lowest level) defined, in this case, by maxBufferedDocs. Do you consider 99999 and 10001 to be roughly equal sized? Of course, without deletes, and not worrying about edge cases caused by flushes/closes, everything will be of the form maxBufferedDocs*mergeFactor^n. I'm curious whether they stay that way: in particular, what happens to the sizes of big old segments? Can everybody afford an optimize?

I think it'd be interesting to play with merge policies and see what we can do. One thing I'm interested in is tweaking the behavior on the big end of the scale. Exponentials get big fast, and I think that in some cases treating the big segments differently from the smaller segments makes sense, for example, self-merging an old segment as deletes build up, or merging small subsequences of segments for a similar reason. This is mostly what motivated me to work on factoring out the merge policy from the rest of IndexWriter. I've got something reasonably stable (though by no means perfect) that I'll post, hopefully later today.

Anyway, back to some of your comments:

It's really interesting to think about this. I'm not an algorithms expert, but I still find it interesting. It's appealing to think of a "best" policy, but is it possible to have a best policy without making assumptions about the sequence of operations (adds, deletes)? Also, "best" has to trade off multiple constraints. You mentioned copying bytes as rarely as possible. But the number of segments has both hard and soft impacts on lookups, too, right? The hard limit is file descriptors: every reader opens every segment, so readers are the limiting case there (mergers only need to open mergeFactor+1 segments). The soft impact is doing the join when walking the posting lists from each segment. My impression is that the logarithmic policy (in size) was chosen as a nice closed form that does an "okay" job in all these areas.

> Merging is costly because you read all data in then write all data
> out, so, you want to minimize, for each byte of data in the index, how
> many times it will be "serviced" (read in, written out) as part of a
> merge.

I've thought about this as I've looked through the existing merge code. I think you can get cases where you hit a perfect storm of bad merges, copying a large segment multiple times just to fold in small segments. Ick. But is it an edge condition? I guess the question is the expected value of the cost, i.e., the absolute cost weighted by how unlikely it is to occur. Interesting. (There's a tiny throwaway cost sketch at the end of this message that tries to make this concrete.)

> I think instead of calling segments "level N" we should just measure
> their net sizes and merge on that basis?

I think this is a great candidate, provided you also factor in the number of segments and sometimes merge just to keep the total number down.
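To make concrete the kind of selection I have in mind, here's a toy sketch. To be clear, this is not the existing IndexWriter code nor the factored-out policy I mentioned above; the class, the method names, and the floor size are all invented for illustration:

    // Toy sketch: choose a contiguous run of segments to merge based purely
    // on size, ignoring how each segment was created.  Not the real
    // IndexWriter / merge policy API; every name here is made up.
    import java.util.List;

    class SizeBasedMergeSketch {
      long floorSizeBytes = 1000000L; // anything below this counts as level 0
      int mergeFactor = 10;

      // Size-derived level: floor(log_mergeFactor(size / floorSizeBytes)),
      // clamped so tiny segments land on level 0.
      int level(long sizeInBytes) {
        double ratio = Math.max(1.0, (double) sizeInBytes / floorSizeBytes);
        return (int) (Math.log(ratio) / Math.log(mergeFactor));
      }

      // Return the start index of the first run of mergeFactor adjacent
      // segments that all fall on the same size-derived level, or -1 if none.
      int findMerge(List<Long> segmentSizesInBytes) {
        for (int start = 0; start + mergeFactor <= segmentSizesInBytes.size(); start++) {
          int lev = level(segmentSizesInBytes.get(start));
          boolean allSameLevel = true;
          for (int i = start + 1; i < start + mergeFactor; i++) {
            if (level(segmentSizesInBytes.get(i)) != lev) {
              allSameLevel = false;
              break;
            }
          }
          if (allSameLevel) {
            return start;
          }
        }
        return -1;
      }
    }

The interesting knobs are the floor size (what counts as level 0) and whether you also merge simply because the total segment count has gotten too high, which this sketch doesn't do.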
> Yes, at no time should you merge more than mergeFactor segments at once.

Why is this? File descriptors? Like I said before, readers are more of an issue there, right? I don't see a huge difference between merging mergeFactor segments and mergeFactor+1 segments if, for other reasons, mergeFactor+1 looks better for a particular merge. Hmmm ... I guess I think of mergeFactor as setting the size boundaries of segments more than the n-way-ness of a merge, though the word itself lends itself to the latter interpretation.
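P.S. Here's the back-of-the-envelope behind the "perfect storm" comment above, written as a throwaway simulation. The sizes and counts are made up; the only point is the relative number of bytes copied when a big segment keeps getting dragged into merges of small ones versus letting the small segments merge among themselves:

    // Throwaway simulation: bytes copied when one big segment is repeatedly
    // folded into merges of small segments, versus letting the small segments
    // merge among themselves and leaving the big one alone.  All numbers are
    // invented; only the ratio matters.
    class MergeCostSketch {
      public static void main(String[] args) {
        long big = 1000000L;   // one big, old segment
        long small = 1000L;    // size of each newly flushed segment
        int flushes = 100;     // how many small segments arrive
        int mergeFactor = 10;

        // Pathological: every mergeFactor flushes, merge the new small
        // segments *and* the big segment together, so the big segment's
        // bytes are re-copied every time.
        long pathological = 0;
        long bigNow = big;
        for (int i = 0; i < flushes / mergeFactor; i++) {
          long merged = bigNow + mergeFactor * small;
          pathological += merged; // every byte of the result is read and written
          bigNow = merged;
        }

        // Logarithmic-ish: merge the small segments among themselves
        // (ten 1k merges into 10k, then the ten 10k segments into 100k)
        // and never touch the big segment.
        long logarithmic = 0;
        logarithmic += (flushes / mergeFactor) * (mergeFactor * small); // level 0 -> 1
        logarithmic += (long) flushes * small;                          // level 1 -> 2

        System.out.println("pathological bytes copied: " + pathological);
        System.out.println("logarithmic bytes copied:  " + logarithmic);
      }
    }

With these made-up numbers the pathological pattern copies roughly fifty times as many bytes (the end states differ slightly, one segment versus two, but not enough to change the picture).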
