It seems that the following are needed:
FieldInfos.hashCode(); // to allow for fast equals failure
FieldInfos.equals();
For the most efficient buffer reuse during merge (to avoid GC), add
int FieldsReader.doclength(int doc);
int FieldsReader.binarydoc(int doc, byte[] buffer);
This will allow the caller to reuse the existing buffer if it is large
enough, or create a new one if it is not.
Also add
FieldsWriter.addBinaryDocument(byte[] buffer, int len);
All of the above methods are trivial.
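To illustrate the first two, here is a rough sketch of hashCode()/equals() on a simplified stand-in class. SimpleFieldInfos, its add() method, and the sameFields() helper are hypothetical names for this example and are not Lucene's actual FieldInfos; the real class tracks more per-field flags, which would all have to take part in the comparison.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for FieldInfos (not the Lucene class).
final class SimpleFieldInfos {
    // Position in the list is the field number, so order matters for equality.
    private final List<String> names = new ArrayList<>();
    private final List<Boolean> indexed = new ArrayList<>();

    void add(String name, boolean isIndexed) {
        names.add(name);
        indexed.add(isIndexed);
    }

    @Override
    public int hashCode() {
        // Combines the same state that equals() compares.
        return 31 * names.hashCode() + indexed.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SimpleFieldInfos)) return false;
        SimpleFieldInfos other = (SimpleFieldInfos) o;
        return names.equals(other.names) && indexed.equals(other.indexed);
    }

    // Cheap hash comparison first, so mismatches usually fail fast.
    static boolean sameFields(SimpleFieldInfos a, SimpleFieldInfos b) {
        return a.hashCode() == b.hashCode() && a.equals(b);
    }
}
```

Two segments whose fields were added in a different order get different field numbers, so they compare as unequal here and would fall back to the normal merge path.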
SegmentMerger just needs to be changed to compare the FieldInfos of the
readers to be merged and, if they are all equal, use a short-circuit
copy similar to:
byte[] buffer = new byte[1024];
for each reader {
    for doc in reader {
        if doc not deleted {
            int len = reader.doclength(doc);
            if (len > buffer.length) {
                buffer = new byte[len * 2]; // allow for growth
            }
            reader.binarydoc(doc, buffer);
            newsegment.addBinaryDocument(buffer, len);
        }
    }
}
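For anyone who wants to try the loop out, here is a self-contained Java sketch of it using in-memory stand-ins. RawReader, RawWriter, and RawMerge are hypothetical classes written only for this example; the method names doclength, binarydoc, and addBinaryDocument follow the proposal above and are not existing Lucene APIs. It assumes all readers have already been found to have equal FieldInfos.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory reader holding the raw stored-field bytes per document.
final class RawReader {
    private final byte[][] docs;
    private final boolean[] deleted;

    RawReader(byte[][] docs, boolean[] deleted) {
        this.docs = docs;
        this.deleted = deleted;
    }

    int maxDoc() { return docs.length; }
    boolean isDeleted(int doc) { return deleted[doc]; }
    int doclength(int doc) { return docs[doc].length; }

    // Copies the raw bytes into the caller's buffer and returns the byte count.
    int binarydoc(int doc, byte[] buffer) {
        System.arraycopy(docs[doc], 0, buffer, 0, docs[doc].length);
        return docs[doc].length;
    }
}

// Hypothetical writer that records each copied document.
final class RawWriter {
    final List<byte[]> docs = new ArrayList<>();

    void addBinaryDocument(byte[] buffer, int len) {
        // Copy only the valid prefix; the buffer is reused by the caller.
        docs.add(Arrays.copyOf(buffer, len));
    }
}

final class RawMerge {
    // Short-circuit copy: no per-field parsing, one buffer shared across all docs.
    static void copyAll(List<RawReader> readers, RawWriter out) {
        byte[] buffer = new byte[1024];
        for (RawReader reader : readers) {
            for (int doc = 0; doc < reader.maxDoc(); doc++) {
                if (reader.isDeleted(doc)) continue;
                int len = reader.doclength(doc);
                if (len > buffer.length) {
                    buffer = new byte[len * 2]; // allow for growth
                }
                reader.binarydoc(doc, buffer);
                out.addBinaryDocument(buffer, len);
            }
        }
    }
}
```

The buffer only grows when a document is larger than anything seen so far, so a merge of many similarly sized documents allocates it once or twice rather than once per document.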
On Nov 1, 2007, at 12:30 AM, jian chen wrote:
Hi, Robert,
That's a brilliant idea! Thanks so much for suggesting that.
Cheers,
Jian
On 10/31/07, robert engels <[EMAIL PROTECTED]> wrote:
Currently, when merging segments, every document is parsed and then
rewritten, since the field numbers may differ between the segments
(compressed data is not uncompressed in the latest versions).
It would seem that in many (if not most) Lucene uses, the fields
stored within each document in an index are relatively static,
probably changing for all documents added after point X, if at all.
Why not check the fields dictionary for the segments being merged,
and if the same, just copy the binary data directly?
In the common case this should be a vast improvement.
Has anyone worked on anything like this? Am I missing something?
Robert Engels
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]