[ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580592#action_12580592 ]
Michael McCandless commented on LUCENE-510:
-------------------------------------------

{quote}
I'm wondering why the patch doesn't utilize java.nio.charset.CharsetEncoder, CharsetDecoder....?
{quote}

I think there are two reasons for rolling our own instead of using
CharsetEncoder/Decoder.

First is performance. If I use CharsetEncoder, like this:

    // needs java.nio.ByteBuffer, java.nio.CharBuffer,
    // java.nio.charset.Charset and java.nio.charset.CharsetEncoder
    CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
    CharBuffer cb = CharBuffer.allocate(5000);
    ByteBuffer bb = ByteBuffer.allocate(5000);
    byte[] bbArray = bb.array();
    UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();
    long t0 = System.currentTimeMillis();
    for(int i=0;i<count;i++) {
      // refill the char buffer with the next term
      cb.clear();
      cb.put(strings[i]);
      cb.flip();
      // reuse the byte buffer and encoder for each term
      bb.clear();
      encoder.reset();
      encoder.encode(cb, bb, true);
    }

then it takes 676 msec to convert ~3.3 million strings, taken from the terms
produced by indexing the first 200K Wikipedia docs. If I replace the body of
the for loop with:

    UnicodeUtil.UTF16toUTF8(strings[i], 0, strings[i].length(), utf8Result);

it takes 441 msec.

Second reason is some API mismatch. E.g., we need to convert char[] arrays
that end in the 0xffff character. Also, we need to do incremental conversion
(only convert changed bytes), which is used by TermEnum.
CharsetEncoder/Decoder doesn't do this.

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length
> of the string is in bytes, not Java characters.  This issue has been
> discussed at:
> http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html
> We must increment the file format number to indicate this change.
> At least the format number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until
> after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0
> (other than removal of deprecated features).
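
To make the distinction in the issue summary concrete, here is a minimal sketch of a string record whose length prefix counts UTF-8 bytes rather than Java chars. It uses only the JDK and is not Lucene's IndexOutput code: the class name ByteLengthPrefixDemo and the helpers writeString and readString are hypothetical, and a fixed 4-byte int stands in for whatever length encoding the index format actually uses.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteLengthPrefixDemo {

    // Hypothetical helper: emits a 4-byte big-endian length (in UTF-8 bytes),
    // followed by the UTF-8 encoding of the string.
    static byte[] writeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + utf8.length)
                .putInt(utf8.length)
                .put(utf8)
                .array();
    }

    // Hypothetical helper: because the prefix is a byte count, the reader
    // knows exactly how many bytes to consume before decoding anything.
    static String readString(byte[] record) {
        ByteBuffer in = ByteBuffer.wrap(record);
        int byteLength = in.getInt();
        return new String(record, 4, byteLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "na\u00efve \u20ac"; // contains a 2-byte and a 3-byte character
        byte[] record = writeString(s);

        System.out.println("chars: " + s.length());              // 7
        System.out.println("bytes: " + (record.length - 4));     // 10
        System.out.println("round trip ok: " + s.equals(readString(record)));
    }
}
```

With a char-count prefix, a reader would have to decode the bytes one character at a time to find the end of the string. With a byte-count prefix it can skip or copy the string without decoding it.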