[jira] [Commented] (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
Improve BufferedIndexInput.readBytes() performance

Key: LUCENE-695
URL: https://issues.apache.org/jira/browse/LUCENE-695
Project: Lucene - Java
Issue Type: Improvement
Components: core/store
Affects Versions: 2.0.0
Reporter: Nadav Har'El
Priority: Minor
Attachments: readbytes.patch, readbytes.patch

During a profiling session, I discovered that BufferedIndexInput.readBytes(), the function which reads a run of bytes from an index, is very inefficient in many cases. It is efficient for one or two bytes, and also efficient for a very large number of bytes (e.g., when the norms are read all at once), but for anything in between (e.g., 100 bytes) it is a performance disaster. It can easily be improved, though, and below I include a patch to do that.

The basic problem in the existing code is that if you ask it to read 100 bytes, readBytes() simply calls readByte() 100 times in a loop, which means we check byte after byte whether the buffer has another byte, instead of checking once how many bytes are left and copying them all at once. My version, attached below, copies these 100 bytes in bulk (using System.arraycopy) if they are all available; if fewer than 100 are available, whatever is available gets copied first, and then the rest. (As before, when a very large number of bytes is requested, it is read directly into the final buffer.)

In my profiling, this fix caused an amazing performance improvement: previously, BufferedIndexInput.readBytes() took as much as 25% of the run time, and after the fix this was down to 1% of the run time! However, my scenario is *not* typical Lucene code, but rather a version of Lucene with added payloads, and these payloads average 100 bytes, where the original readBytes() did worst. I expect that my fix will have less of an impact on vanilla Lucene, but it can still have an impact because readBytes() is used for things like reading fields. (I am not aware of a standard Lucene benchmark, so I can't provide benchmarks on a more typical case.)

In addition to the change to readBytes(), my attached patch also adds a new unit test for BufferedIndexInput (which previously did not have a unit test). This test simulates a file which contains a predictable series of bytes, and then tries to read from it with readByte() and readBytes() with various sizes (many thousands of combinations are tried), checking that exactly the expected bytes are read. This test is independent of my new readBytes() implementation, and can be used to check the old implementation as well.

By the way, it's interesting that BufferedIndexOutput.writeBytes() was already efficient, and wasn't simply a loop of writeByte(). Only the reading code was inefficient. I wonder why this happened.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
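The bulk-copy approach described above can be sketched as follows. This is an illustrative reconstruction, not the attached patch: the field and method names (buffer, bufferPosition, bufferLength, readInternal, refill) follow BufferedIndexInput's general style but are assumptions here, the backing array stands in for the underlying file, and end-of-file handling is omitted for brevity.

```java
// Sketch of a bulk readBytes(): copy whatever the internal buffer already
// holds in one System.arraycopy, refill when empty, and bypass the buffer
// entirely for requests larger than one buffer-full.
public class BulkReadSketch {
    private final byte[] buffer = new byte[1024];
    private int bufferPosition = 0;   // next byte to return from buffer
    private int bufferLength = 0;     // number of valid bytes in buffer
    private final byte[] backing;     // stands in for the underlying file
    private int filePointer = 0;      // read position in "file"

    public BulkReadSketch(byte[] data) {
        this.backing = data;
    }

    public void readBytes(byte[] b, int offset, int len) {
        while (len > 0) {
            int available = bufferLength - bufferPosition;
            if (available > 0) {
                // Copy as much as the buffer holds in one arraycopy,
                // instead of one readByte() call per byte.
                int toCopy = Math.min(available, len);
                System.arraycopy(buffer, bufferPosition, b, offset, toCopy);
                bufferPosition += toCopy;
                offset += toCopy;
                len -= toCopy;
            } else if (len >= buffer.length) {
                // Large request: read directly into the caller's array.
                readInternal(b, offset, len);
                return;
            } else {
                refill();
            }
        }
    }

    private void refill() {
        bufferLength = Math.min(buffer.length, backing.length - filePointer);
        readInternal(buffer, 0, bufferLength);
        bufferPosition = 0;
    }

    private void readInternal(byte[] b, int offset, int len) {
        System.arraycopy(backing, filePointer, b, offset, len);
        filePointer += len;
    }
}
```

Note that the loop still checks availability, but once per buffer-full rather than once per byte, which is what removes the per-byte overhead for mid-sized reads.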
[jira] Commented: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ http://issues.apache.org/jira/browse/LUCENE-695?page=comments#action_12444903 ] Nadav Har'El commented on LUCENE-695:

"If given a null array? Is this ever done in Lucene? Which should be fixed, the testcase or the code?"

I don't know. readBytes()'s documentation doesn't explicitly say what should happen if it is asked to read zero bytes: is it simply supposed to do nothing (in which case it doesn't matter which array you give it - it could even be null), or should it still expect the array to be non-null? I don't know if any code in Lucene itself assumes that it can work when given a null array and a 0 count - I doubt it. But one test does assume this, so I simply added an extra check for the 0 count, and when that happens, avoid calling System.arraycopy() (which, even when the count is 0, expects the array to be non-null, for some reason).
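The corner case can be demonstrated directly: System.arraycopy() throws a NullPointerException for a null array even when the count is 0, so a zero-length read has to be short-circuited before the copy. A minimal sketch (copyBytes is a hypothetical helper for illustration, not Lucene code):

```java
// Guard against the zero-count case before calling System.arraycopy,
// which rejects a null array even when the requested length is 0.
public class ZeroCountGuard {
    public static void copyBytes(byte[] src, byte[] dst, int len) {
        if (len == 0) {
            return;  // nothing to do; dst may legitimately be null here
        }
        System.arraycopy(src, 0, dst, 0, len);
    }
}
```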
[jira] Commented: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ http://issues.apache.org/jira/browse/LUCENE-695?page=comments#action_12444316 ] Yonik Seeley commented on LUCENE-695:

"I wonder why this happened."

readBytes() on less than a buffer-full probably only happens with binary (or compressed) fields, relatively new additions to Lucene, so it probably didn't have much of a real-world impact. I think it is important to fix, though, as more things may be byte-oriented in the future.

After applying the patch, at least one unit test fails:

[junit] Testcase: testReadPastEOF(org.apache.lucene.index.TestCompoundFile): FAILED
[junit] Block read past end of file
[junit] junit.framework.AssertionFailedError: Block read past end of file
[junit] at org.apache.lucene.index.TestCompoundFile.testReadPastEOF(TestCompoundFile.java:616)
[jira] Commented: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ http://issues.apache.org/jira/browse/LUCENE-695?page=comments#action_12444322 ] Nadav Har'El commented on LUCENE-695:

Sorry, I didn't notice that my fix broke this unit test. Thanks for catching that. What is happening is interesting: this test (TestCompoundFile.testReadPastEOF()) checks what happens when you read 40 bytes beyond the end of the file, and expects the appropriate exception to be thrown. The old code actually did this check for 40 bytes, so it passed the test; but the interesting thing is that when you asked for more than a buffer-full of bytes, say 10K, the length() check was not there! So the old code was broken in this respect, just not for the 40 bytes that were tested. I'll fix my patch to add this beyond-end-of-file check, and will post the new patch ASAP.
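The gap described here can be sketched as a single up-front check against length() that covers both read paths; in the old code the check effectively existed only on the buffered (small-read) path, so a direct read larger than one buffer slipped past it. Names below are assumptions for illustration, not the actual patch:

```java
import java.io.EOFException;
import java.io.IOException;

// Sketch of a past-EOF check that guards every request size up front,
// so the large direct-read path is covered as well as the buffered path.
public class EofCheckSketch {
    private final byte[] backing;   // stands in for the underlying file
    private int filePointer = 0;

    public EofCheckSketch(byte[] data) {
        this.backing = data;
    }

    public long length() {
        return backing.length;
    }

    public void readBytes(byte[] b, int offset, int len) throws IOException {
        // One check for any len: reads of 40 bytes and of 10K past EOF
        // both throw, instead of only the small buffered reads.
        if (filePointer + len > length()) {
            throw new EOFException("read past EOF");
        }
        System.arraycopy(backing, filePointer, b, offset, len);
        filePointer += len;
    }
}
```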