Mike, you're right: all Lucene files are written sequentially
(whether flushing or merging).

It's just a matter of how many files are open at once, and whether we
are also reading from source files at the same time, which hurts IO
throughput far less than truly random-access writes would.

Plus, as of LUCENE-843, bytes are written to tvx/tvd/tvf and fdx/fdt
"as we go", which is better because we get the bytes to the OS earlier
so it can properly schedule their arrival to stable storage.  So by
the time we flush a segment, the OS should have committed most of
those bytes.
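
(To make "as we go" concrete, here's a rough sketch of the pattern --
my own illustration, not the actual LUCENE-843 code, and the
per-document payload is simplified; the point is just that each
document's bytes go straight to the append-only outputs instead of
being buffered until flush:)

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexOutput;

class AsWeGoVectorsWriter {
  private final IndexOutput tvx, tvd;   // opened once, append-only

  AsWeGoVectorsWriter(Directory dir, String segment) throws IOException {
    tvx = dir.createOutput(segment + ".tvx");
    tvd = dir.createOutput(segment + ".tvd");
  }

  // Called per document: record where this doc's vectors start in tvd,
  // then append the vectors themselves.  Nothing waits for flush.
  void addDocument(byte[] encodedVectors) throws IOException {
    tvx.writeLong(tvd.getFilePointer());
    tvd.writeBytes(encodedVectors, encodedVectors.length);
  }

  void close() throws IOException {
    tvd.close();
    tvx.close();
  }
}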

When writing a segment, we write fnm, then open tii/tis/frq/prx at
once and write (sequentially) to them, then write to nrm.
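
(Roughly, in code form -- a sketch of the open/close ordering only,
with the actual payload writes omitted:)

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexOutput;

class SegmentWriteOrder {
  static void writeSegment(Directory dir, String segment) throws IOException {
    // 1. field infos first, written and closed on their own
    IndexOutput fnm = dir.createOutput(segment + ".fnm");
    fnm.close();

    // 2. the four postings files are open at the same time, but each
    //    one individually is only ever appended to
    IndexOutput tii = dir.createOutput(segment + ".tii");
    IndexOutput tis = dir.createOutput(segment + ".tis");
    IndexOutput frq = dir.createOutput(segment + ".frq");
    IndexOutput prx = dir.createOutput(segment + ".prx");
    tii.close(); tis.close(); frq.close(); prx.close();

    // 3. norms last
    IndexOutput nrm = dir.createOutput(segment + ".nrm");
    nrm.close();
  }
}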

Merging is far more IO intensive.  With mergeFactor=10, we read from
40 input streams (10 segments x 4 postings files each) and write to 4
output streams (one each) when merging the tii/tis/frq/prx files.

Mike

Mike Klaas wrote:

Oh, it certainly causes some random access--I don't deny that. I just want to emphasize that this isn't at all the same as purely "random writes", which would be expected to perform an order of magnitude slower.

Just did a test where I wrote out a 1 GB file in 1K chunks, then wrote it out to 2 files in alternating 512-byte chunks, then to 4 files in 256-byte chunks. Some speed is lost--perhaps 10% at each doubling--but the speed is still essentially "sequential" speed. You can get back the original performance by using consistently-sized chunks (1K to each file, round-robin).
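
(Something along these lines -- a reconstruction of the test, not the
exact script, so names and details are approximate:)

import java.io.FileOutputStream;
import java.io.IOException;

public class InterleavedWriteTest {
  public static void main(String[] args) throws IOException {
    int numFiles = Integer.parseInt(args[0]);   // 1, 2, 4, ...
    int chunkSize = 1024 / numFiles;            // 1K, 512, 256 bytes
    long totalBytes = 1L << 30;                 // ~1 GB in total
    byte[] chunk = new byte[chunkSize];

    FileOutputStream[] outs = new FileOutputStream[numFiles];
    for (int i = 0; i < numFiles; i++)
      outs[i] = new FileOutputStream("test." + i);

    long start = System.currentTimeMillis();
    long written = 0;
    for (int next = 0; written < totalBytes; next = (next + 1) % numFiles) {
      outs[next].write(chunk);                  // round-robin append
      written += chunkSize;
    }
    for (FileOutputStream out : outs)
      out.close();

    System.out.println(numFiles + " file(s): "
        + (System.currentTimeMillis() - start) + " ms");
  }
}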

HDD controllers are actually quite good at batching writes into sequential order. Why else do you think sync() takes so long? :)

-Mike

On 7-Feb-08, at 3:35 PM, robert engels wrote:

I don't think that is true - though I'm probably wrong :).

My understanding is that several files are written in parallel (during the merge), causing random access. After the files are written, they are all reread and written out as a CFS file (essentially sequential - although the interleaved reads and writes will still cause head movement).

The code:

private IndexOutput tvx, tvf, tvd; // To write term vectors
private FieldsWriter fieldsWriter; // To write stored fields (fdx/fdt)

is my clue that several files are written at once.

On Feb 7, 2008, at 5:19 PM, Mike Klaas wrote:


On 7-Feb-08, at 2:00 PM, robert engels wrote:

My point is that commit needs to be used in most applications, and the commit in Lucene is very slow.

You don't have 2x the IO cost, mainly because only the log file needs to be sync'd. The index only has to be sync'd eventually, in order to prune the log file - this can be done in the background, improving the performance of the update/commit cycle.

Also, writing the log file is very efficient because it is an append/sequential operation. Writing the segment files means writing to multiple files - essentially causing random-access writes.
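
(A rough sketch of the commit path I mean -- my own illustration, not
Lucene code; only the log is fsync'd on commit, and the segment files
reach stable storage later via a background sync, after which the log
can be pruned:)

import java.io.FileOutputStream;
import java.io.IOException;

class LoggedCommit {
  private final FileOutputStream log;

  LoggedCommit(String logPath) throws IOException {
    log = new FileOutputStream(logPath, true);  // append-only log
  }

  void commit(byte[] serializedOp) throws IOException {
    log.write(serializedOp);                    // sequential append
    log.getFD().sync();                         // sync only the log
    // the index segments are sync'd eventually, in the background;
    // once they are durable the corresponding log entries can go
  }
}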

For large segments, multiple sequentially-written large files should perform similarly to one large sequentially-written file. It only approaches random access for the smallest segments (which a sufficiently large flush-by-RAM buffer shouldn't produce).

-Mike

