> Right I'm calling a newly created segment (ie flushed from RAM) level
> 0 and then a level 1 segment is created when you merge 10 level 0
> segments, level 2 is created when merge 10 level 1 segments, etc.
This isn't the way the current code treats things. I'm not saying it's the only way to look at things, just making sure I/we are clear on the current state. The current code effectively computes level as a function of the size of the segment, independent of how the segment was created and of the size of its neighbors. Hmmm ... that's not completely true. Let's say it's mostly true. Okay, kinda true.

> Backing up a bit ... I think the lowest amortized cost merge policy
> should always try to merge roughly equal sized segments

Well, the logarithmic policy does do this; it just clumps sizes logarithmically, with the boundary condition (the size of the lowest level) defined, in this case, by maxBufferedDocs. Do you consider 99999 and 10001 to be roughly equal sized? Of course, without deletes, and not worrying about edge cases caused by flushes/closes, everything will be of the form maxBufferedDocs*mergeFactor^n. I'm curious whether they stay that way: in particular, what happens to the sizes of big old segments? Can everybody afford an optimize?

I think it'd be interesting to play with merge policies and see what we can do. One thing I'm interested in is tweaking the behavior on the big end of the scale. Exponentials get big fast, and I think that in some cases treating the big segments differently from the smaller segments makes sense, for example, self-merging an old segment as deletes build up, or merging small subsequences of segments for a similar reason. This is mostly what motivated me to work on factoring out the merge policy from the rest of IndexWriter. I've got something reasonably stable (though by no means perfect) that I'll post, hopefully later today.

Anyway, back to some of your comments:

It's really interesting to think about this. I'm not an algorithms expert, but I still find it interesting. It's appealing to think of a "best" policy, but is it possible to have a best policy without making assumptions about the sequence of operations (adds, deletes)? Also, "best" has to trade off multiple constraints. You mentioned copying bytes as rarely as possible. But the number of segments has both hard and soft impacts on lookups, too, right? The hard limit is file descriptors: every reader opens every segment, so readers are the limiting case there (mergers only need to open mergeFactor+1 segments). The soft impact is doing the join when walking the posting lists from each segment. My impression is that the logarithmic policy (in size) was chosen as a nice closed form that does an "okay" job in all these areas.

> Merging is costly because you read all data in then write all data
> out, so, you want to minimize, for each byte of data in the index, how
> many times it will be "serviced" (read in, written out) as part of a
> merge.

I've thought about this as I've looked through the existing merge code. I think you can get cases where you hit a perfect storm of bad merges, copying a large segment multiple times just to fold in small segments. Ick. But is it an edge condition? I guess the question is the expected value of the cost, i.e., the absolute cost weighted by how unlikely it is to occur. Interesting. (There's a tiny throwaway cost sketch at the end of this message that tries to make this concrete.)

> I think instead of calling segments "level N" we should just measure
> their net sizes and merge on that basis?

I think this is a great candidate, provided you also factor in the number of segments and sometimes merge just to keep the total number down.
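To make concrete the kind of selection I have in mind, here's a toy sketch. To be clear, this is not the existing IndexWriter code nor the factored-out policy I mentioned above; the class, the method names, and the floor size are all invented for illustration:

    // Toy sketch: choose a contiguous run of segments to merge based purely
    // on size, ignoring how each segment was created.  Not the real
    // IndexWriter / merge policy API; every name here is made up.
    import java.util.List;

    class SizeBasedMergeSketch {
      long floorSizeBytes = 1000000L; // anything below this counts as level 0
      int mergeFactor = 10;

      // Size-derived level: floor(log_mergeFactor(size / floorSizeBytes)),
      // clamped so tiny segments land on level 0.
      int level(long sizeInBytes) {
        double ratio = Math.max(1.0, (double) sizeInBytes / floorSizeBytes);
        return (int) (Math.log(ratio) / Math.log(mergeFactor));
      }

      // Return the start index of the first run of mergeFactor adjacent
      // segments that all fall on the same size-derived level, or -1 if none.
      int findMerge(List<Long> segmentSizesInBytes) {
        for (int start = 0; start + mergeFactor <= segmentSizesInBytes.size(); start++) {
          int lev = level(segmentSizesInBytes.get(start));
          boolean allSameLevel = true;
          for (int i = start + 1; i < start + mergeFactor; i++) {
            if (level(segmentSizesInBytes.get(i)) != lev) {
              allSameLevel = false;
              break;
            }
          }
          if (allSameLevel) {
            return start;
          }
        }
        return -1;
      }
    }

The interesting knobs are the floor size (what counts as level 0) and whether you also merge simply because the total segment count has gotten too high, which this sketch doesn't do.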
> Yes, at no time should you merge more than mergeFactor segments at once.

Why is this? File descriptors? Like I said before, readers are more of an issue there, right? I don't see a huge difference between merging mergeFactor segments and mergeFactor+1 segments if, for other reasons, mergeFactor+1 looks better for a particular merge. Hmmm ... I guess I think of mergeFactor as setting the size boundaries of segments more than the n-way-ness of a merge, though the word itself lends itself to the latter interpretation.
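P.S. Here's the back-of-the-envelope behind the "perfect storm" comment above, written as a throwaway simulation. The sizes and counts are made up; the only point is the relative number of bytes copied when a big segment keeps getting dragged into merges of small ones versus letting the small segments merge among themselves:

    // Throwaway simulation: bytes copied when one big segment is repeatedly
    // folded into merges of small segments, versus letting the small segments
    // merge among themselves and leaving the big one alone.  All numbers are
    // invented; only the ratio matters.
    class MergeCostSketch {
      public static void main(String[] args) {
        long big = 1000000L;   // one big, old segment
        long small = 1000L;    // size of each newly flushed segment
        int flushes = 100;     // how many small segments arrive
        int mergeFactor = 10;

        // Pathological: every mergeFactor flushes, merge the new small
        // segments *and* the big segment together, so the big segment's
        // bytes are re-copied every time.
        long pathological = 0;
        long bigNow = big;
        for (int i = 0; i < flushes / mergeFactor; i++) {
          long merged = bigNow + mergeFactor * small;
          pathological += merged; // every byte of the result is read and written
          bigNow = merged;
        }

        // Logarithmic-ish: merge the small segments among themselves
        // (ten 1k merges into 10k, then the ten 10k segments into 100k)
        // and never touch the big segment.
        long logarithmic = 0;
        logarithmic += (flushes / mergeFactor) * (mergeFactor * small); // level 0 -> 1
        logarithmic += (long) flushes * small;                          // level 1 -> 2

        System.out.println("pathological bytes copied: " + pathological);
        System.out.println("logarithmic bytes copied:  " + logarithmic);
      }
    }

With these made-up numbers the pathological pattern copies roughly fifty times as many bytes (the end states differ slightly, one segment versus two, but not enough to change the picture).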
