[ 
https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462122
 ] 

Chuck Williams commented on LUCENE-510:
---------------------------------------

Has an improvement been made to eliminate the reported 20% indexing hit?  That 
would be a big price to pay.

To me the performance benefits in algorithms that scan for selected fields 
(e.g., FieldsReader.doc() with a FieldSelector) are much more important than 
standard UTF-8 compliance.

A 20% hit seems surprising.  The pre-scan over the string to be written 
shouldn't cost much compared to the cost of tokenizing and indexing that string 
(assuming it is in an indexed field).
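
For illustration only, here is a minimal sketch of the kind of pre-scan being 
discussed: one pass over the Java chars to count the bytes the string will need, 
so the byte length can be written before the string data.  The class and method 
names are hypothetical, and the sketch assumes standard UTF-8 (4 bytes for a 
surrogate pair); it is not the code from the attached patch.

```java
final class Utf8PreScan {
    // Hypothetical helper: returns the number of bytes standard UTF-8
    // needs for s, without allocating or encoding anything.
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // two-byte sequence
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // supplementary code point
                i++;                              // skip the low surrogate
            } else {
                // Rest of the BMP.  An unpaired surrogate is also counted
                // as 3 bytes here (an assumption of this sketch), which
                // differs from String.getBytes, which substitutes '?'.
                bytes += 3;
            }
        }
        return bytes;
    }
}
```

The loop does a few comparisons per char and no allocation, which is why the 
pre-scan would be expected to be cheap next to tokenizing the same string.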

In case it is relevant, I had a related issue in my bulk updater, a case where 
a VInt required at the beginning of a record by the Lucene index format was not 
known until after the end of the record.  I solved this with a fixed-length 
VInt record that was estimated up front and revised if necessary after the 
whole record was processed.  The VInt representation still decodes correctly 
if more bytes than necessary are written.
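
A rough sketch of that idea, under my own assumptions rather than the bulk 
updater's actual code: reserve a fixed number of bytes, and later patch in the 
real value as a VInt padded with continuation bytes.  PaddedVInt, writePadded 
and readVInt are hypothetical names; readVInt follows the usual VInt decoding 
loop (7 payload bits per byte, low-order group first, high bit set when more 
bytes follow).

```java
final class PaddedVInt {
    // Writes value into buf[offset .. offset+width) as a VInt that always
    // occupies exactly width bytes, padding with zero-payload continuation
    // bytes when the value would fit in fewer.
    static void writePadded(byte[] buf, int offset, int value, int width) {
        if (width < 1 || width > 5) {
            throw new IllegalArgumentException("width must be 1..5");
        }
        if (value < 0 || (width < 5 && (value >>> (7 * width)) != 0)) {
            throw new IllegalArgumentException("value does not fit in width");
        }
        for (int i = 0; i < width - 1; i++) {
            buf[offset + i] = (byte) ((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        buf[offset + width - 1] = (byte) (value & 0x7F);      // last byte, high bit clear
    }

    // Ordinary VInt decoder; it needs no knowledge of the padding.
    static int readVInt(byte[] buf, int offset) {
        byte b = buf[offset++];
        int result = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[offset++];
            result |= (b & 0x7F) << shift;
        }
        return result;
    }
}
```

For example, writePadded(buf, 0, 5, 3) stores the bytes 0x85 0x80 0x00, and an 
unmodified reader still decodes them as 5.  If the final value turns out not 
to fit in the reserved width, the record has to be rewritten with a wider 
field, which is the "revised if necessary" case.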


> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>         Assigned To: Grant Ingersoll
>             Fix For: 2.1
>
>         Attachments: SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length 
> of the string is in bytes, not Java characters.  This issue has been 
> discussed at:
> http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html
> We must increment the file format number to indicate this change.  At least 
> the format number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until 
> after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 
> (other than removal of deprecated features).


        
