I think IO throttling would be a useful built-in feature.

I imagine many people are unknowingly affected by this when they make changes to their index on the same machine that is also serving searches. It's not only optimize() that will cause this, but also normal flushing of segments, addIndex*, expungeDeletes, normal segment merging, etc.

It'd be nice if the throttling could somehow be conditional on whether there is "contention", i.e., whether searches are currently reading from the index.
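
As a rough illustration of what "conditional on contention" could mean, here is a minimal Java sketch. It assumes searching and indexing run in the same JVM, so that in-flight searches can simply be counted; it cannot see IO from other processes. ContentionGate and its method names are hypothetical and are not part of Lucene.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not a Lucene class): tracks in-flight searches. */
class ContentionGate {
    private final AtomicInteger activeSearches = new AtomicInteger();

    /** Search code calls this just before it starts reading the index. */
    void searchStarted() {
        activeSearches.incrementAndGet();
    }

    /** Search code calls this (ideally in a finally block) when done. */
    void searchFinished() {
        activeSearches.decrementAndGet();
    }

    /** True while at least one search is in flight in this JVM. */
    boolean contended() {
        return activeSearches.get() > 0;
    }

    /**
     * Indexing code calls this between chunks of IO. It sleeps only when
     * a search is in flight, and reports whether it actually paused.
     */
    boolean pauseIfContended(long pauseMillis) throws InterruptedException {
        if (!contended()) {
            return false; // no searches running, write at full speed
        }
        Thread.sleep(pauseMillis);
        return true;
    }
}
```

The writer side would call pauseIfContended(...) between buffer flushes, so a merge or optimize runs unthrottled on an idle box and backs off only while searches are active.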

Really the OS should provide this facility to us, but it doesn't (at least not up through Java's APIs). Linux does let you pick the IO Scheduler to use, and at least one of these IO Schedulers lets you prioritize whole processes wrt IO. It's not an easy problem to solve!

Mike

Halsey, Stephen wrote:

Hi,

We are using Lucene to index a large number of documents (millions) and
we currently optimize half the index in the background every 2 days, to
stop it becoming too fragmented.  This takes about an hour and we are
finding that during this time searches are slowed down dramatically on
that machine.  This is not due to CPU, as it is a dual-CPU box, so I'm
thinking it must be the large amounts of IO being used to optimize the
index.

I was wondering if anyone has any ideas for alleviating this problem?

One option I've come up with is to slowly copy the index to a second,
offline box, optimize it there and then slowly copy the newly optimized
index back onto the search box.  To slow down the copy so that bandwidth
and IO are not maxed out, I thought I could use something like the Linux
Traffic Control (tc) program
http://tldp.org/HOWTO/Traffic-Control-HOWTO/elements.html#e-shaping
(see also http://gentoo-wiki.com/HOWTO_Apache_2_bandwidth_limiting ), or
tar, nfs and http://www.ivarch.com/programs/quickref/pv.shtml with its
rate-limit option, to limit how quickly the index directory is copied to
and from the remote machine.  This option doesn't seem ideal, as it would
involve other programs, servers and scripts.

The other option is to do it all within the existing Java program, by
rate-limiting/throttling the IO of the Lucene Directory being used to do
the optimize.  I've done this by extending the FSDirectory and
FSIndexOutput classes and putting a small sleep in
FSIndexOutput.flushBuffer, and it seems to work OK.  I'm not that keen on
copying and modifying Lucene code though, because I'll have to check and
possibly modify it every time I upgrade Lucene, so if there is a
reasonable alternative I'd be interested in hearing anyone's ideas.  If
people think an IO-throttling FSDirectory may be a good idea and useful
for them, I could develop it more and possibly contact lucene-dev to look
into getting it added to the Lucene trunk.
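
To make the "small sleep on flush" idea concrete, here is a minimal sketch written against plain java.io rather than Lucene's FSDirectory/FSIndexOutput, whose internals vary between versions. ThrottledOutputStream and its bytesPerSecond parameter are hypothetical names for illustration; this is not the actual patch. Instead of a fixed sleep, it sleeps just long enough after each write to keep the average rate under a cap.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InterruptedIOException;
import java.io.OutputStream;

/** Hypothetical wrapper (not a Lucene class) that caps the average write rate. */
class ThrottledOutputStream extends FilterOutputStream {
    private final long bytesPerSecond;
    private final long startNanos = System.nanoTime();
    private long bytesWritten = 0;

    ThrottledOutputStream(OutputStream out, long bytesPerSecond) {
        super(out);
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("bytesPerSecond must be > 0");
        }
        this.bytesPerSecond = bytesPerSecond;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        throttle(1);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        out.write(buf, off, len); // pass the whole chunk through in one call
        throttle(len);
    }

    /** Sleeps until the average rate since construction is back under the cap. */
    private void throttle(int justWritten) throws IOException {
        bytesWritten += justWritten;
        // Time that should have elapsed for this many bytes at the capped rate.
        long targetNanos = (long) (bytesWritten * 1e9 / bytesPerSecond);
        long sleepNanos = targetNanos - (System.nanoTime() - startNanos);
        if (sleepNanos > 0) {
            try {
                Thread.sleep(sleepNanos / 1_000_000L, (int) (sleepNanos % 1_000_000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                throw new InterruptedIOException("interrupted while throttling");
            }
        }
    }
}
```

Because the rate is averaged from the moment the stream is created, an idle period earns a later burst; a real implementation would probably want a sliding window or token bucket instead.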

Cheers



steve

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


