On Tue, Feb 20, 2007, Ning Li wrote about "Concurrent merge":
> I think it's possible for another version of IndexWriter to have a
> concurrent merge thread so that disk segments could be merged while
> documents are being added or deleted. This would be beneficial not only
> because it will improve indexing performance when there are enough
> system resources, but more importantly, disk segment merges will no
> longer block document additions or deletions.
Hi Ning,
Reading what you wrote here, I understand that the most important benefit (as you see it) of this idea is to make the document addition time more predictable: the current situation, where addDocument() usually takes a second but sometimes can take an hour, will be gone.

I am wondering, however, whether this problem will actually be solved by the single-background-merge-thread solution you propose. I'll explain the possible problem I see below.
> ...
> Two limits on the total number of disk segments are used to detect merge's
> lag and to slow down document additions/deletions: a soft limit and a hard
> limit. When the number of disk segments reaches the soft limit, a document
> addition/deletion will be slowed down for time T. As the number of disk
> segments continues to grow, the time for which an addition/deletion is
> slowed down will increase. When the number of disk segments reaches the
> hard limit, document additions/deletions will be blocked until the number
> falls under the hard limit. The hard limit is almost never reached with
> proper slow down.
> ...
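Just to make sure I'm reading this scheme correctly, here is roughly how I picture the throttling, as a small Java sketch. The names and numbers (SOFT_LIMIT, HARD_LIMIT, T, segmentCount()) are of course mine, made up for illustration - nothing here is existing Lucene API:

  // Sketch of the soft/hard limit throttling as I understand it; all names
  // and numbers below are invented for illustration, not real Lucene code.
  class SegmentThrottle {
      static final int SOFT_LIMIT = 20;    // assumed soft limit on disk segments
      static final int HARD_LIMIT = 100;   // assumed hard limit on disk segments
      static final long T = 1000;          // assumed base slow-down, in milliseconds

      // how many disk segments the index currently has (stubbed out here)
      int segmentCount() { return 0; }

      // called before every document addition/deletion
      void maybeThrottle() throws InterruptedException {
          int n = segmentCount();
          if (n >= HARD_LIMIT) {
              // hard limit reached: block until the merge thread catches up
              while (segmentCount() >= HARD_LIMIT) {
                  Thread.sleep(100);
              }
          } else if (n >= SOFT_LIMIT) {
              // soft limit reached: sleep, and sleep longer the further past it we are
              Thread.sleep(T * (n - SOFT_LIMIT + 1));
          }
      }
  }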
How many merges can go on concurrently in your scheme? You mentioned "a merge thread", which leads me to think that only one merge can go on at once. You also said, at the end of your email, that you prefer a single merge thread.
However, I think that allowing only one merge to proceed at a time can prevent you from achieving your intended improvement (of never having to wait an hour to add a single document). Consider the following scenario:
Imagine that just after you have added the millionth document, a huge merge starts in the background, which is going to take an hour. At the same time, new documents continue to come in and be added to the index.
As you add more documents, you start getting small segments, which need to be merged into larger segments, and so on. But if you don't allow multiple merges to be active concurrently, the huge merge in progress will force these small merges to be postponed. In just a few minutes you'll accumulate many small segments, and very soon your "slowdown" mechanism will start slowing down the additions, to the point where additions have to come to a grinding halt to avoid accumulating thousands of new small segments.
This means that if you allow only one concurrent merge, your original intention - allowing additions to continue at (almost) full steam while a huge merge is in progress - may not be realized.
A counter-argument may be that if you slow down the additions enough - say, wait 10 seconds between every addition - you won't accumulate too many new segments during this hour, and at the same time you never wait more than 10 seconds for an addition, so everything is perfect. Well, this view is only true if additions really do come only once every 10 seconds. If a new file is queued to be added, say, once every 2 seconds, then your deliberate slowdown will cause new additions to be significantly delayed - up to almost the full hour!
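To put some (admittedly made-up) numbers on this: with one addition arriving every 2 seconds but the slowdown letting only one through every 10 seconds, the backlog grows by one document roughly every 2.5 seconds, i.e. by about 1,440 documents over the hour-long merge. Draining those 1,440 documents at the throttled rate would take about four more hours, so in practice they only get written once the merge completes, and an unlucky document queued during that hour can end up waiting a large fraction of it.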
Moreover, the 10 seconds you spend sleeping (doing nothing) could have been used for something more productive on a multiple-CPU machine, such as continuing to do additions and doing more merging.
One possible way to solve this problem is to allow multiple background merge threads, not just one.
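To make this concrete, here is a minimal sketch of what I have in mind: a small fixed pool of merge threads, so that a huge merge occupies one thread while small merges can still run on the others. The pool size, the MergePool class and submitMerge() are my own invention, not existing Lucene code:

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  // Sketch only: a tiny pool of merge threads, so that one huge merge does
  // not prevent the small merges from running. Nothing here is real Lucene API.
  class MergePool {
      // e.g. three threads: the huge merge takes one, small merges use the others
      private final ExecutorService pool = Executors.newFixedThreadPool(3);

      void submitMerge(Runnable mergeJob) {
          pool.execute(mergeJob);   // returns immediately, so additions are not blocked
      }

      void shutdown() {
          pool.shutdown();
      }
  }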
Another possible solution I can think of is to use the background thread for merging when it's free, but to do merges in the foreground (blocking the adds) when it is busy. This means slowing down the adds a bit, but with real work rather than with sleep() calls that don't utilize the available CPU, so adds will continue to proceed at the fastest possible speed (depending on the machine). Presumably, the really huge merges (which I guess are the ones that bother you) will happen in the background thread.
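Again only as a sketch of what I mean (all the names below are made up, not real Lucene code): a single background merge thread, and when it is busy the calling thread - the one doing the additions - performs the merge itself instead of sleeping:

  // Sketch of the "background thread when free, foreground otherwise" idea.
  // mergeJob stands for whatever code actually merges the segments.
  class HybridMerger {
      private final Object lock = new Object();
      private boolean backgroundBusy = false;

      void merge(final Runnable mergeJob) {
          synchronized (lock) {
              if (!backgroundBusy) {
                  backgroundBusy = true;
                  new Thread(new Runnable() {
                      public void run() {
                          try {
                              mergeJob.run();       // typically the huge merge
                          } finally {
                              synchronized (lock) { backgroundBusy = false; }
                          }
                      }
                  }).start();
                  return;
              }
          }
          // The background thread is busy: do this merge in the adding thread,
          // so the CPU does useful work instead of sleeping.
          mergeJob.run();
      }
  }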
Other ideas are most welcome!
An idea I once had, but whose benefit I have never measured, is this: it is conceivable (but I didn't actually test this) that a significant (?) percentage of indexing time is IO time - disk write, read and seek. It might be possible, while we're doing a merge of a few segments on one disk drive, to write new segments, and do new merges, on a second disk drive. This means that all the writing we do to the second disk drive doesn't disturb the work being done on the first drive (fewer head seeks, and more concurrent reading and writing on both disks).

But as I said, I never tested how much this idea can improve performance on systems with multiple separate disks and multiple CPUs.
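If it helps, here is roughly how I imagine one could try it with the existing API. I'm assuming FSDirectory.getDirectory() and IndexWriter.addIndexes() behave as I remember them; the paths and the split of work between the two writers are just for illustration:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  // Sketch of the two-disk idea: the main index (and its long-running merge)
  // lives on disk 1, while new segments are written to a side index on disk 2,
  // so the writes don't compete with the merge's reads and seeks on disk 1.
  public class TwoDiskIndexing {
      public static void main(String[] args) throws Exception {
          Directory mainDir = FSDirectory.getDirectory("/disk1/index", true);
          Directory sideDir = FSDirectory.getDirectory("/disk2/index", true);

          IndexWriter mainWriter = new IndexWriter(mainDir, new StandardAnalyzer(), true);
          IndexWriter sideWriter = new IndexWriter(sideDir, new StandardAnalyzer(), true);

          // While the big merge is busy on disk 1, additions go to disk 2:
          //     sideWriter.addDocument(doc);

          // Once the big merge has finished, fold the side index back in:
          sideWriter.close();
          mainWriter.addIndexes(new Directory[] { sideDir });
          mainWriter.close();
      }
  }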
> ... size (buffered documents) is not too big. And multiple disk merge
> threads require significant system resources to add benefit.
See my comments above on why multiple concurrent merges might be necessary, depending on what "benefit" you were aiming for.
Thanks,
Nadav.
--
Nadav Har'El                    | Wednesday, Feb 21 2007, 3 Adar 5767
IBM Haifa Research Lab          |-----------------------------------------
http://nadav.harel.org.il       | It's fortunate I have bad luck - without
                                | it I would have no luck at all!