On Tue, Feb 20, 2007, Ning Li wrote about "Concurrent merge":
> I think it's possible for another version of IndexWriter to have a
> concurrent merge thread so that disk segments could be merged while
> documents are being added or deleted. This would be beneficial not only
> because it will improve indexing performance when there are enough
> system resources, but more importantly, disk segment merges will no
> longer block document additions or deletions.
Hi Ning,
Reading what you wrote here, I understand that the most important benefit (as you see it) of this idea is to make the document addition time more predictable: the current situation, where addDocument() usually takes a second but sometimes can take an hour, will be gone.

I am wondering, however, whether this problem will actually be solved by the single-background-merge-thread solution you propose. I'll explain the possible problem I see below.
> ...
> Two limits on the total number of disk segments are used to detect merge's
> lag and to slow down document additions/deletions: a soft limit and a hard
> limit. When the number of disk segments reaches the soft limit, a document
> addition/deletion will be slowed down for time T. As the number of disk
> segments continues to grow, the time for which an addition/deletion is
> slowed down will increase. When the number of disk segments reaches the
> hard limit, document additions/deletions will be blocked until the number
> falls under the hard limit. The hard limit is almost never reached with
> proper slow down.
> ...
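Just to make sure I'm reading this scheme correctly, here is roughly how I picture the throttling, as a small Java sketch. The names and numbers (SOFT_LIMIT, HARD_LIMIT, T, segmentCount()) are of course mine, made up for illustration - nothing here is existing Lucene API:

  // Sketch of the soft/hard limit throttling as I understand it; all names
  // and numbers below are invented for illustration, not real Lucene code.
  class SegmentThrottle {
      static final int SOFT_LIMIT = 20;    // assumed soft limit on disk segments
      static final int HARD_LIMIT = 100;   // assumed hard limit on disk segments
      static final long T = 1000;          // assumed base slow-down, in milliseconds

      // how many disk segments the index currently has (stubbed out here)
      int segmentCount() { return 0; }

      // called before every document addition/deletion
      void maybeThrottle() throws InterruptedException {
          int n = segmentCount();
          if (n >= HARD_LIMIT) {
              // hard limit reached: block until the merge thread catches up
              while (segmentCount() >= HARD_LIMIT) {
                  Thread.sleep(100);
              }
          } else if (n >= SOFT_LIMIT) {
              // soft limit reached: sleep, and sleep longer the further past it we are
              Thread.sleep(T * (n - SOFT_LIMIT + 1));
          }
      }
  }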
How many merges can go on concurrently in your scheme? You mentioned "a merge thread", which leads me to think that only one merge can go on at once. You also said, at the end of your email, that you prefer a single merge thread.
However, I think that allowing only one merge to proceed at a time can prevent you from achieving your intended improvement (of never having to wait an hour to add a single document). Consider the following scenario:
Imagine that just after you have added the millionth document, a huge merge starts in the background, which is going to take an hour. At the same time, new documents continue to come in and be added to the index.
As you add more documents, you start getting small segments, which need to be merged into larger segments, and so on. But if you don't allow multiple merges to be active concurrently, the huge merge in progress will force these small merges to be postponed. In just a few minutes you'll accumulate many small segments, and very soon your "slowdown" mechanism will start slowing down the additions, to the point where additions have to come to a grinding halt to avoid accumulating thousands of new small segments.
This means that if you allow only one concurrent merge, your original intention - allowing additions to continue at (almost) full steam while a huge merge is in progress - may not be realized.
A counter-argument may be that if you slow down the additions enough - say, wait 10 seconds between every addition - you won't accumulate too many new segments during this hour, and at the same time you never wait more than 10 seconds for an addition, so everything is perfect. Well, this view is only true if additions really do come only once every 10 seconds. If a new file is queued to be added, say, once every 2 seconds, then your deliberate slowdown will cause new additions to be significantly delayed - up to almost the full hour!
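To put some (admittedly made-up) numbers on this: with one addition arriving every 2 seconds but the slowdown letting only one through every 10 seconds, the backlog grows by one document roughly every 2.5 seconds, i.e. by about 1,440 documents over the hour-long merge. Draining those 1,440 documents at the throttled rate would take about four more hours, so in practice they only get written once the merge completes, and an unlucky document queued during that hour can end up waiting a large fraction of it.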
Moreover, the 10 seconds you spend sleeping (doing nothing) could have been used for something more productive on a multiple-CPU machine, such as continuing to do additions and doing more merging.
One possible way to solve this problem is to allow multiple background merge threads, not just one.
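To make this concrete, here is a minimal sketch of what I have in mind: a small fixed pool of merge threads, so that a huge merge occupies one thread while small merges can still run on the others. The pool size, the MergePool class and submitMerge() are my own invention, not existing Lucene code:

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  // Sketch only: a tiny pool of merge threads, so that one huge merge does
  // not prevent the small merges from running. Nothing here is real Lucene API.
  class MergePool {
      // e.g. three threads: the huge merge takes one, small merges use the others
      private final ExecutorService pool = Executors.newFixedThreadPool(3);

      void submitMerge(Runnable mergeJob) {
          pool.execute(mergeJob);   // returns immediately, so additions are not blocked
      }

      void shutdown() {
          pool.shutdown();
      }
  }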
Another possible solution I can think of is to use the background thread for merging when it's free, but to do merges in the foreground (blocking the adds) when it is busy. This means slowing down the adds a bit, but with real work rather than with sleep() calls that don't utilize the available CPU, so adds will continue to proceed at the fastest possible speed (depending on the machine). Presumably, the really huge merges (which I guess are the ones that bother you) will happen in the background thread.
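Again only as a sketch of what I mean (all the names below are made up, not real Lucene code): a single background merge thread, and when it is busy the calling thread - the one doing the additions - performs the merge itself instead of sleeping:

  // Sketch of the "background thread when free, foreground otherwise" idea.
  // mergeJob stands for whatever code actually merges the segments.
  class HybridMerger {
      private final Object lock = new Object();
      private boolean backgroundBusy = false;

      void merge(final Runnable mergeJob) {
          synchronized (lock) {
              if (!backgroundBusy) {
                  backgroundBusy = true;
                  new Thread(new Runnable() {
                      public void run() {
                          try {
                              mergeJob.run();       // typically the huge merge
                          } finally {
                              synchronized (lock) { backgroundBusy = false; }
                          }
                      }
                  }).start();
                  return;
              }
          }
          // The background thread is busy: do this merge in the adding thread,
          // so the CPU does useful work instead of sleeping.
          mergeJob.run();
      }
  }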
Other ideas are most welcome!
An idea I once had, but whose benefit I have never measured, is this: it is conceivable (but I didn't actually test this) that a significant (?) percentage of indexing time is IO time - disk write, read and seek. It might be possible, while we're doing a merge of a few segments on one disk drive, to write new segments, and do new merges, on a second disk drive. This means that all the writing we do to the second disk drive doesn't disturb the work being done on the first drive (fewer head seeks, and more concurrent reading and writing on both disks).

But as I said, I never tested how much this idea can improve performance on systems with multiple separate disks and multiple CPUs.
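If it helps, here is roughly how I imagine one could try it with the existing API. I'm assuming FSDirectory.getDirectory() and IndexWriter.addIndexes() behave as I remember them; the paths and the split of work between the two writers are just for illustration:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  // Sketch of the two-disk idea: the main index (and its long-running merge)
  // lives on disk 1, while new segments are written to a side index on disk 2,
  // so the writes don't compete with the merge's reads and seeks on disk 1.
  public class TwoDiskIndexing {
      public static void main(String[] args) throws Exception {
          Directory mainDir = FSDirectory.getDirectory("/disk1/index", true);
          Directory sideDir = FSDirectory.getDirectory("/disk2/index", true);

          IndexWriter mainWriter = new IndexWriter(mainDir, new StandardAnalyzer(), true);
          IndexWriter sideWriter = new IndexWriter(sideDir, new StandardAnalyzer(), true);

          // While the big merge is busy on disk 1, additions go to disk 2:
          //     sideWriter.addDocument(doc);

          // Once the big merge has finished, fold the side index back in:
          sideWriter.close();
          mainWriter.addIndexes(new Directory[] { sideDir });
          mainWriter.close();
      }
  }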
> ... size (buffered documents) is not too big. And multiple disk merge
> threads require significant system resources to add benefit.
See my comments above on why multiple concurrent merges might be necessary, depending on what "benefit" you were aiming for.
Thanks,
Nadav.
--
Nadav Har'El                    | Wednesday, Feb 21 2007, 3 Adar 5767
IBM Haifa Research Lab          |-----------------------------------------
http://nadav.harel.org.il       | It's fortunate I have bad luck - without
                                | it I would have no luck at all!