[ 
http://issues.apache.org/jira/browse/LUCENE-695?page=comments#action_12444316 ] 
            
Yonik Seeley commented on LUCENE-695:
-------------------------------------

> I wonder why this happened.

readBytes on less than a buffer size probably only happens with binary (or 
compressed) fields, relatively new additions to Lucene, so it probably didn't 
have much of a real-world impact.   I think it is important to fix though, as 
more things may be byte-oriented in the future.

After applying the patch, at least one unit test fails:

    [junit] Testcase: testReadPastEOF(org.apache.lucene.index.TestCompoundFile):
        FAILED
    [junit] Block read past end of file
    [junit] junit.framework.AssertionFailedError: Block read past end of file
    [junit]     at org.apache.lucene.index.TestCompoundFile.testReadPastEOF(Test
CompoundFile.java:616)


> Improve BufferedIndexInput.readBytes() performance
> --------------------------------------------------
>
>                 Key: LUCENE-695
>                 URL: http://issues.apache.org/jira/browse/LUCENE-695
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.0.0
>            Reporter: Nadav Har'El
>            Priority: Minor
>         Attachments: readbytes.patch
>
>
> During a profiling session, I discovered that BufferedIndexInput.readBytes(),
> the function which reads a bunch of bytes from an index, is very inefficient
> in many cases. It is efficient for one or two bytes, and also efficient
> for a very large number of bytes (e.g., when the norms are read all at once);
> But for anything in between (e.g., 100 bytes), it is a performance disaster.
> It can easily be improved, though, and below I include a patch to do that.
> The basic problem in the existing code was that if you ask it to read 100
> bytes, readBytes() simply calls readByte() 100 times in a loop, which means
> we check byte after byte if the buffer has another character, instead of just
> checking once how many bytes we have left, and copy them all at once.
> My version, attached below, copies these 100 bytes if they are available at
> bulk (using System.arraycopy), and if less than 100 are available, whatever
> is available gets copied, and then the rest. (as before, when a very large
> number of bytes is requested, it is read directly into the final buffer).
> In my profiling, this fix caused amazing performance
> improvement: previously, BufferedIndexInput.readBytes() took as much as 25%
> of the run time, and after the fix, this was down to 1% of the run time! 
> However, my scenario is *not* the typical Lucene code, but rather a version 
> of Lucene with added payloads, and these payloads average at 100 bytes, where 
> the original readBytes() did worst. I expect that my fix will have less of an 
> impact on "vanilla" Lucene, but it still can have an impact because it is 
> used for things like reading fields. (I am not aware of a standard Lucene 
> benchmark, so I can't provide benchmarks on a more typical case).
> In addition to the change to readBytes(), my attached patch also adds a new
> unit test to BufferedIndexInput (which previously did not have a unit test).
> This test simulates a "file" which contains a predictable series of bytes, and
> then tries to read from it with readByte() and readButes() with various
> sizes (many thousands of combinations are tried) and see that exactly the
> expected bytes are read. This test is independent of my new readBytes()
> inplementation, and can be used to check the old implementation as well.
> By the way, it's interesting that BufferedIndexOutput.writeBytes was already 
> efficient, and wasn't simply a loop of writeByte(). Only the reading code was 
> inefficient. I wonder why this happened.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to