I agree; that is why I suggested using the queue. You would add new segments to be merged to a queue, and the merge thread pulls segments from the queue when it is free to do so. addDocument() creates a new segment and adds it to the queue. If the queue is full, the add to the queue will block until the queue has room.
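Roughly, I'm picturing something like the sketch below. It is only meant to illustrate the hand-off; "Segment" and the merge() body are placeholders, not actual Lucene classes:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Illustrative only: a bounded queue between the adding thread and a
    // single background merge thread.
    public class QueuedMerger {

        static class Segment { /* stand-in for a newly flushed disk segment */ }

        // Bounded queue: when it is full, put() blocks the adder until the
        // merge thread has caught up.
        private final BlockingQueue<Segment> pending = new ArrayBlockingQueue<>(16);

        private final Thread mergeThread = new Thread(() -> {
            try {
                while (true) {
                    merge(pending.take());          // wait for work, then merge it
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shut down quietly
            }
        }, "merge-thread");

        public QueuedMerger() {
            mergeThread.setDaemon(true);
            mergeThread.start();
        }

        // Called from addDocument() after it has flushed a new segment.
        public void addSegment(Segment s) throws InterruptedException {
            pending.put(s);  // blocks only if the merge thread has fallen far behind
        }

        private static void merge(Segment s) {
            // ... fold the segment into larger on-disk segments ...
        }
    }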

As for using multiple disks, this is a great idea. We use a balanced merge sort in our application, and by splitting the merge across disks and using multiple threads we were able to double the I/O throughput. It probably doesn't make much difference for sites with advanced striped disk controllers/configurations, but for rather dumb/cheap installs it makes a huge difference, since a balanced merge involves massive amounts of sequential reads and writes.


On Feb 21, 2007, at 9:56 AM, Nadav Har'El wrote:

On Tue, Feb 20, 2007, Ning Li wrote about "Concurrent merge":
I think it's possible for another version of IndexWriter to have
a concurrent merge thread so that disk segments could be merged
while documents are being added or deleted.
This would be beneficial not only because it will improve indexing
performance when there are enough system resources, but more
importantly, disk segment merges will no longer block document
additions or deletions.

Hi Ning,

Reading what you wrote here, I understand that the most important benefit (as you see it) of this idea is to make the document addition time more
predictable: the current situation where addDocument() usually takes a
second, but sometimes can take an hour, will be gone.

I am wondering, however, whether this problem will actually be solved by the single-background-merge-thread solution you propose. I'll explain the
possible problem I see below.

...
Two limits on the total number of disk segments are used to detect
merge's lag and to slow down document additions/deletions: a soft
limit and a hard limit. When the number of disk segments reaches
the soft limit, a document addition/deletion will be slowed down
for time T. As the number of disk segments continues to grow, the
time for which an addition/deletion is slowed down will increase.
When the number of disk segments reaches the hard limit, document
additions/deletions will be blocked until the number falls under
the hard limit. The hard limit is almost never reached with proper
slow down.
..
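Just to check that I'm reading this policy correctly, here is how I imagine it could be implemented. This is only my own rough sketch; the names and the linear back-off formula are guesses, not something you proposed:

    // Rough sketch of the soft/hard limit policy as I understand it.
    class SegmentThrottle {
        private final int softLimit;     // start slowing additions here
        private final int hardLimit;     // block additions entirely here
        private final long baseDelayMs;  // the "time T" above
        private int segmentCount;        // current number of disk segments

        SegmentThrottle(int softLimit, int hardLimit, long baseDelayMs) {
            this.softLimit = softLimit;
            this.hardLimit = hardLimit;
            this.baseDelayMs = baseDelayMs;
        }

        // Called by the adding/deleting thread after it flushes a new segment.
        void segmentAdded() throws InterruptedException {
            long delay;
            synchronized (this) {
                segmentCount++;
                while (segmentCount >= hardLimit) {
                    wait();              // hard limit: block until merges catch up
                }
                delay = segmentCount >= softLimit
                        ? baseDelayMs * (segmentCount - softLimit + 1)
                        : 0;             // soft limit: delay grows with the lag
            }
            if (delay > 0) {
                Thread.sleep(delay);     // slow down, but outside the lock
            }
        }

        // Called by the merge thread after a merge reduces the segment count.
        synchronized void segmentsMerged(int removed) {
            segmentCount -= removed;
            notifyAll();
        }
    }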

How many merges can go on concurrently in your scheme? You mentioned "a merge thread", which leads me to think that you mean that only one merge can go on at once. You also said, at the end of your email, that you prefer a single merge thread.

However, I think that allowing only one merge to proceed at a time can prevent you from achieving your intended improvement (of never having to wait an hour for adding a single document). Consider the following scenario:

Imagine that after having just added the millionth document, a huge merge
starts in the background, which is going to take an hour.
At the same time, new documents continue to come in and be added to the index. As you add more documents, you get small segments, which need to be merged into larger segments, and so on. But if you don't allow multiple merges to be active concurrently, the huge merge in progress will cause these small merges to be postponed, so within minutes you'll accumulate many small segments. Very soon your "slowdown" mechanism will start slowing down the additions, until additions have to come to a grinding halt to avoid accumulating thousands of new small segments.

This means that if you allow only one concurrent merge, your original
intention - of allowing additions to continue at (almost) full steam
while a huge merge is in progress - may not be realized.

A counter-argument may be that if you slow down the additions enough - say, wait 10 seconds between every addition - you'll not accumulate too many new segments in this hour, and at the same time never wait more than 10 seconds for an addition, so everything is perfect. Well, this view is only true if additions really do come only once every 10 seconds. If a new file is queued to be added, say, once every 2 seconds, then your deliberate slowdown will cause new additions to be significantly delayed - by up to almost the full hour! Moreover, the 10 seconds you spend sleeping (doing nothing) could have been used for something more productive on a multiple-CPU machine - such as continuing to do additions and more merging.
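(To put rough numbers on that last point - this is just my own back-of-the-envelope illustration, assuming the merge really takes an hour: if a document arrives every 2 seconds, 1800 documents show up during the merge, but at one addition per 10 seconds only 360 of them get in. So a document that arrives a few minutes into the merge can easily find hundreds of documents queued ahead of it, and wait tens of minutes - close to the full merge time - before it is actually added.)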

One possibility of solving this problem is to allow multiple background
merge threads, not just one.
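For example (only a sketch of the idea, not an existing Lucene API - I'm using a plain thread pool just to show that a small merge need not wait behind a huge one):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch: several background merge threads instead of one.
    class MergePool {
        // Two merge threads: while a huge merge occupies one of them,
        // the small merges produced by ongoing additions can still run
        // on the other, so small segments don't pile up.
        private final ExecutorService mergers = Executors.newFixedThreadPool(2);

        void submitMerge(Runnable mergeJob) {
            mergers.submit(mergeJob);
        }

        void shutdown() {
            mergers.shutdown();
        }
    }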

Another possible solution I can think of is to use the background thread for merging when it is free, but do merges in the foreground (blocking the adds) when it is busy. This slows down the adds a bit, but with real work rather than sleep() calls that don't utilize the available CPU, so adds will continue to proceed at the fastest possible speed (depending on the machine). Presumably, the really huge merges (which I guess are the ones that bother you) will happen in the background thread.
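A sketch of what I mean (again, the names are made up; the key point is the non-blocking hand-off - if the background thread is already busy, the adding thread does the merge itself instead of sleeping):

    import java.util.concurrent.SynchronousQueue;

    // Sketch: one background merge thread, with foreground fallback.
    class FallbackMerger {
        // SynchronousQueue has no buffer, so offer() succeeds only if the
        // background thread is currently idle and waiting in take().
        private final SynchronousQueue<Runnable> handOff = new SynchronousQueue<>();

        private final Thread background = new Thread(() -> {
            try {
                while (true) {
                    handOff.take().run();  // run whatever merge it was handed while idle
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "background-merge");

        FallbackMerger() {
            background.setDaemon(true);
            background.start();
        }

        // Called from addDocument() when a merge is due.
        void merge(Runnable mergeJob) {
            if (!handOff.offer(mergeJob)) {
                // Background thread is busy (perhaps with a huge merge):
                // do this merge in the foreground - real work, not sleep().
                mergeJob.run();
            }
        }
    }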

Other ideas are most welcome!

An idea I once had, but I have no idea how much it can help, is this:
It is conceivable (but I didn't actually test this) that a significant (?)
percentage of indexing time is IO time - disk write, read and seek.
It might be possible, while we're doing a merge of a few segments on one disk drive, to write new segments, and do new merges, on a second disk drive. This way, all the writing we do to the second disk drive doesn't disturb the work being done on the first drive (fewer head seeks, and more concurrent reading and writing on both disks).

But as I said, I never tested how much this idea can improve performance
on systems with multiple separate disks and multiple CPUs.
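If someone does want to experiment with this, the core of it could be as simple as directing new segments to whichever drive is not currently busy with the big merge - something like the following (the directory handling and names are purely illustrative, and as untested as the idea itself):

    import java.nio.file.Path;

    // Illustrative sketch of the two-drive idea.
    class TwoDriveLayout {
        private final Path driveA;
        private final Path driveB;
        private volatile Path busyWithMerge;  // drive the big merge is running on, if any

        TwoDriveLayout(Path driveA, Path driveB) {
            this.driveA = driveA;
            this.driveB = driveB;
        }

        // Called by the merge thread when it starts/finishes a big merge on a drive.
        void mergeStarted(Path drive) { busyWithMerge = drive; }
        void mergeFinished()          { busyWithMerge = null; }

        // New segments (and small merges) go to the drive that is not busy,
        // so they don't compete for the same disk heads as the big merge.
        Path dirForNewSegment(String segmentName) {
            Path target = driveA.equals(busyWithMerge) ? driveB : driveA;
            return target.resolve(segmentName);
        }
    }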

size (buffered documents) is not too big. And multiple disk merge
threads require significant system resources to add benefit.

See my comments above on why multiple concurrent merges might be necessary, depending on what "benefit" you were aiming at.

Thanks,
Nadav.

--
Nadav Har'El                        |   Wednesday, Feb 21 2007, 3 Adar 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |It's fortunate I have bad luck - without
http://nadav.harel.org.il           |it I would have no luck at all!



