Actually, slightly better signatures would use method overloading:

int FieldsReader.length(int doc); // length of document in bytes
int FieldsReader.doc(int doc,byte[] buffer); // read a formatted document into a buffer

void FieldsWriter.addDocument(byte[] buffer, int len); // write an already formatted document from a buffer


On Nov 1, 2007, at 1:06 AM, robert engels wrote:

It seems that the following are needed:

FieldInfos.hashCode(); // to allow for fast equals failure
FieldInfos.equals();
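As a rough illustration of the "fast equals failure" idea, here is a minimal sketch. FieldDictionary and FieldEntry are hypothetical stand-ins, not Lucene's actual FieldInfos and FieldInfo classes, and the single "indexed" flag is an assumed simplification of the real per-field flags. The hash is computed once, so two dictionaries with different hashes are rejected without walking their field lists:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in for one field's entry (not Lucene's FieldInfo). */
final class FieldEntry {
    final String name;
    final boolean indexed; // assumed simplification of the per-field flags

    FieldEntry(String name, boolean indexed) {
        this.name = Objects.requireNonNull(name);
        this.indexed = indexed;
    }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FieldEntry)) return false;
        FieldEntry other = (FieldEntry) o;
        return indexed == other.indexed && name.equals(other.name);
    }

    @Override public int hashCode() {
        return Objects.hash(name, indexed);
    }
}

/** Hypothetical stand-in for a segment's field dictionary (not Lucene's FieldInfos). */
final class FieldDictionary {
    private final List<FieldEntry> entries; // list position = field number
    private final int hash;                 // computed once; entries never change

    FieldDictionary(List<FieldEntry> entries) {
        this.entries = new ArrayList<>(entries); // defensive copy
        this.hash = this.entries.hashCode();
    }

    @Override public int hashCode() { return hash; }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FieldDictionary)) return false;
        FieldDictionary other = (FieldDictionary) o;
        // Different hashes prove inequality without comparing the lists.
        return hash == other.hash && entries.equals(other.entries);
    }
}
```

Because the list position stands for the field number, two dictionaries holding the same fields in a different order compare as unequal, which is the case where a raw byte copy would not be safe.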

For the most efficient buffer reuse during merge (to avoid GC), add

int FieldsReader.doclength(int doc);
int FieldsReader.binarydoc(int doc,byte[] buffer);

This lets the caller reuse its existing buffer if it is large enough, or allocate a new one otherwise,

and

FieldsWriter.addBinaryDocument(byte[] buffer,int len);

All of the above methods are trivial.

SegmentMerger just needs to be changed to compare the readers to be merged and, if all have equal FieldInfos, use a short-circuit copy similar to:

byte[] buffer = new byte[1024];

for each reader {
    for doc in reader {
        if doc not deleted {
            int len = reader.doclength(doc);
            if (len > buffer.length) {
                buffer = new byte[len * 2]; // allow for growth
            }
            reader.binarydoc(doc, buffer);
            newsegment.addBinaryDocument(buffer, len);
        }
    }
}
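For anyone who wants to try the loop above, here is a minimal runnable sketch of it. RawDocReader, RawDocWriter, the in-memory implementations, and the maxDoc()/isDeleted() methods are all assumptions made for illustration; they are not existing Lucene classes, and only the doclength/binarydoc/addBinaryDocument names come from the proposal:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical read side: raw stored-field bytes per document. */
interface RawDocReader {
    int maxDoc();                          // document slots, deleted ones included
    boolean isDeleted(int doc);
    int doclength(int doc);                // length of the formatted document in bytes
    int binarydoc(int doc, byte[] buffer); // copy into buffer, return bytes written
}

/** Hypothetical write side. */
interface RawDocWriter {
    void addBinaryDocument(byte[] buffer, int len);
}

/** In-memory reader, only here to exercise the loop; a null entry is a deleted doc. */
final class InMemoryReader implements RawDocReader {
    private final byte[][] docs;
    InMemoryReader(byte[]... docs) { this.docs = docs; }
    public int maxDoc() { return docs.length; }
    public boolean isDeleted(int doc) { return docs[doc] == null; }
    public int doclength(int doc) { return docs[doc].length; }
    public int binarydoc(int doc, byte[] buffer) {
        System.arraycopy(docs[doc], 0, buffer, 0, docs[doc].length);
        return docs[doc].length;
    }
}

/** In-memory writer that keeps its own copy, since the caller reuses the buffer. */
final class InMemoryWriter implements RawDocWriter {
    final List<byte[]> docs = new ArrayList<>();
    public void addBinaryDocument(byte[] buffer, int len) {
        byte[] copy = new byte[len];
        System.arraycopy(buffer, 0, copy, 0, len);
        docs.add(copy);
    }
}

final class RawMerge {
    private RawMerge() {}

    /** Copies every live document byte-for-byte; returns the number copied. */
    static int copyAll(List<RawDocReader> readers, RawDocWriter writer) {
        byte[] buffer = new byte[1024];
        int copied = 0;
        for (RawDocReader reader : readers) {
            for (int doc = 0; doc < reader.maxDoc(); doc++) {
                if (reader.isDeleted(doc)) continue;
                int len = reader.doclength(doc);
                if (len > buffer.length) {
                    buffer = new byte[len * 2]; // allow for growth
                }
                reader.binarydoc(doc, buffer);
                writer.addBinaryDocument(buffer, len);
                copied++;
            }
        }
        return copied;
    }
}
```

The writer has to consume or copy the bytes before returning, because the same buffer is overwritten by the next document. The sketch also assumes the caller has already verified that all readers share equal FieldInfos.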



On Nov 1, 2007, at 12:30 AM, jian chen wrote:

Hi, Robert,

That's a brilliant idea! Thanks so much for suggesting that.

Cheers,

Jian

On 10/31/07, robert engels <[EMAIL PROTECTED]> wrote:

Currently, when merging segments, every document is parsed and then
rewritten, since the field numbers may differ between the segments
(compressed data is not uncompressed in the latest versions).

It would seem that in many (if not most) Lucene uses, the fields
stored within each document in an index are relatively static,
probably changing for all documents added after point X, if at all.

Why not check the fields dictionary for the segments being merged,
and if the same, just copy the binary data directly?

In the common case this should be a vast improvement.

Anyone worked on anything like this? Am I missing something?

Robert Engels



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



