[
https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580592#action_12580592
]
Michael McCandless commented on LUCENE-510:
-------------------------------------------
{quote}
I'm wondering why the patch doesn't utilize java.nio.charset.CharsetEncoder,
CharsetDecoder....?
{quote}
I think there are two reasons for rolling our own instead of using
CharsetEncoder/Decoder. The first is performance. If I use
CharsetEncoder, like this:
  CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
  CharBuffer cb = CharBuffer.allocate(5000);
  ByteBuffer bb = ByteBuffer.allocate(5000);
  byte[] bbArray = bb.array();
  UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();
  t0 = System.currentTimeMillis();
  for(int i=0;i<count;i++) {
    cb.clear();
    cb.put(strings[i]);
    cb.flip();
    bb.clear();
    encoder.reset();
    encoder.encode(cb, bb, true);
  }
Then it takes 676 msec to convert the ~3.3 million term strings
produced by indexing the first 200K Wikipedia docs. If I replace the
body of the for loop with:
  UnicodeUtil.UTF16toUTF8(strings[i], 0, strings[i].length(), utf8Result);
it takes 441 msec.
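For readers without the patch at hand, here is a minimal sketch of what a hand-rolled UTF-16 to UTF-8 conversion into a reusable buffer can look like. It is not the code from the patch: the class Utf8Sketch, the Result holder, and the choice to write U+FFFD for an unpaired surrogate are all assumptions made for illustration. The point is only that the loop writes straight into a byte[] that is reused across calls, with no CharBuffer/ByteBuffer handling per string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a hand-rolled encoder; not Lucene's UnicodeUtil.
final class Utf8Sketch {

    /** Reusable output buffer (hypothetical; plays a role similar to a result holder). */
    static final class Result {
        byte[] bytes = new byte[16];
        int length;
    }

    /** Encodes s[offset, offset+len) as UTF-8 into out, growing out.bytes if needed. */
    static void utf16ToUtf8(String s, int offset, int len, Result out) {
        // Worst case is 3 bytes per UTF-16 code unit (a surrogate pair is 4 bytes for 2 units).
        if (out.bytes.length < len * 3) {
            out.bytes = new byte[len * 3];
        }
        byte[] b = out.bytes;
        int upto = 0;
        int end = offset + len;
        for (int i = offset; i < end; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                b[upto++] = (byte) c;
            } else if (c < 0x800) {
                b[upto++] = (byte) (0xC0 | (c >> 6));
                b[upto++] = (byte) (0x80 | (c & 0x3F));
            } else if (Character.isHighSurrogate(c) && i + 1 < end
                       && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Valid surrogate pair: combine into one code point, emit 4 bytes.
                int cp = Character.toCodePoint(c, s.charAt(++i));
                b[upto++] = (byte) (0xF0 | (cp >> 18));
                b[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                b[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                b[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else if (Character.isSurrogate(c)) {
                // Assumption: an unpaired surrogate is replaced with U+FFFD (EF BF BD).
                b[upto++] = (byte) 0xEF;
                b[upto++] = (byte) 0xBF;
                b[upto++] = (byte) 0xBD;
            } else {
                b[upto++] = (byte) (0xE0 | (c >> 12));
                b[upto++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                b[upto++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        out.length = upto;
    }

    public static void main(String[] args) {
        Result r = new Result();
        String[] strings = {"lucene", "caf\u00E9", "\u20AC uro"};
        for (String s : strings) {
            utf16ToUtf8(s, 0, s.length(), r); // same Result reused for every string
            int expected = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(s.length() + " chars -> " + r.length
                               + " bytes (JDK says " + expected + ")");
        }
    }
}
```

For well-formed input the bytes should match String.getBytes(StandardCharsets.UTF_8); for unpaired surrogates the JDK substitutes a different replacement, so the two will differ there.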
The second reason is an API mismatch. E.g., we need to convert char[]
arrays that end in the 0xffff character. We also need to do
incremental conversion (only converting the bytes that changed), which
is used by TermEnum. CharsetEncoder/Decoder doesn't support this.
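To make the incremental idea concrete, here is a rough sketch in the encoding direction: when consecutive terms share a prefix, the bytes already produced for that prefix are kept and only the tail is re-encoded. This is an illustration under assumptions, not the patch's code. The class IncrementalUtf8, its fields, and the byteStart offsets table are hypothetical names, and TermEnum's actual use may run in the opposite (bytes to chars) direction.

```java
import java.util.Arrays;

// Hypothetical sketch of incremental UTF-16 -> UTF-8 conversion; not Lucene's API.
final class IncrementalUtf8 {
    char[] chars = new char[16];     // previous input, kept for prefix comparison
    int charLen;
    byte[] bytes = new byte[48];     // UTF-8 output, reused across calls
    int byteLen;
    int[] byteStart = new int[17];   // byteStart[i] = offset in bytes where char i begins

    /** Encodes s, reusing bytes for the prefix shared with the previous call.
     *  Returns the number of leading chars whose bytes were reused. */
    int set(String s) {
        int n = s.length();
        int limit = Math.min(n, charLen);
        int shared = 0;
        while (shared < limit && chars[shared] == s.charAt(shared)) {
            shared++;
        }
        // A high surrogate's encoding depends on the next char, so never resume right after one.
        if (shared > 0 && Character.isHighSurrogate(s.charAt(shared - 1))) {
            shared--;
        }
        // Grow buffers while preserving the reusable prefix.
        if (chars.length < n) chars = Arrays.copyOf(chars, n);
        if (bytes.length < n * 3) bytes = Arrays.copyOf(bytes, n * 3);
        if (byteStart.length < n + 1) byteStart = Arrays.copyOf(byteStart, n + 1);

        int upto = byteStart[shared];   // resume writing where the shared prefix ends
        for (int i = shared; i < n; i++) {
            char c = s.charAt(i);
            chars[i] = c;
            byteStart[i] = upto;
            if (c < 0x80) {
                bytes[upto++] = (byte) c;
            } else if (c < 0x800) {
                bytes[upto++] = (byte) (0xC0 | (c >> 6));
                bytes[upto++] = (byte) (0x80 | (c & 0x3F));
            } else if (Character.isHighSurrogate(c) && i + 1 < n
                       && Character.isLowSurrogate(s.charAt(i + 1))) {
                int cp = Character.toCodePoint(c, s.charAt(i + 1));
                bytes[upto++] = (byte) (0xF0 | (cp >> 18));
                bytes[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                bytes[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                bytes[upto++] = (byte) (0x80 | (cp & 0x3F));
                i++;
                chars[i] = s.charAt(i);
                byteStart[i] = upto;    // low surrogate is never used as a resume point
            } else if (Character.isSurrogate(c)) {
                // Assumption: unpaired surrogate becomes U+FFFD (EF BF BD).
                bytes[upto++] = (byte) 0xEF;
                bytes[upto++] = (byte) 0xBF;
                bytes[upto++] = (byte) 0xBD;
            } else {
                bytes[upto++] = (byte) (0xE0 | (c >> 12));
                bytes[upto++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                bytes[upto++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        byteStart[n] = upto;
        charLen = n;
        byteLen = upto;
        return shared;
    }
}
```

With sorted terms such as "apple" followed by "applesauce", the second call would reuse the five bytes for "apple" and encode only "sauce". A stateless one-shot encoder has no place to keep the previous input or the per-char offsets, which is the kind of mismatch described above.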
> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
> Key: LUCENE-510
> URL: https://issues.apache.org/jira/browse/LUCENE-510
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Store
> Affects Versions: 2.1
> Reporter: Doug Cutting
> Assignee: Michael McCandless
> Attachments: LUCENE-510.patch, LUCENE-510.take2.patch,
> SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length
> of the string is in bytes, not Java characters. This issue has been
> discussed at:
> http://www.mail-archive.com/[email protected]/msg01970.html
> We must increment the file format number to indicate this change. At least
> the format number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until
> after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0
> (other than removal of deprecated features).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.