[ 
https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462285
 ] 

Marvin Humphrey commented on LUCENE-510:
----------------------------------------

Grant... At the moment I am completely consumed by the task of getting a devel 
release of KinoSearch version 0.20 out the door.  Once that is taken care of, I 
will be glad to update this patch, and to explore how to compensate for the 
performance hit it causes.

Chuck... If bytecount-based strings are adopted, standard UTF-8 probably comes 
along for the ride.  There's actually a 1-2% performance gain to be had using 
standard over modified because of simplified conditionals.  What holds us back 
is backwards compatibility -- but we'll have wrecked backwards compat with the 
bytecounts.  However, I no longer have a strong objection to using Modified 
UTF-8 (for Lucene, that is -- Modified UTF-8 would be a deal-breaker for Lucy), 
so if somewhere along the way we find a compelling reason to stick with 
modified UTF-8, so be it.
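
To make the standard-versus-modified difference concrete, here is a minimal
sketch. It uses the JDK's DataOutputStream.writeUTF as a stand-in for a
modified UTF-8 writer (an assumption: it is not Lucene's own code path), and
the class and method names are hypothetical. In modified UTF-8, U+0000 takes
two bytes and a supplementary character is written as two three-byte
surrogates, so the byte counts diverge from standard UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    /** Number of bytes the string occupies in standard UTF-8. */
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Number of bytes the string occupies in Java's modified UTF-8,
     * not counting the 2-byte length prefix that writeUTF emits.
     */
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2;
    }

    public static void main(String[] args) {
        // 'a', U+0000, and U+1D50A (a supplementary character, two Java chars)
        String s = "a\0\uD835\uDD0A";
        System.out.println("Java chars:     " + s.length());
        System.out.println("standard UTF-8: " + standardUtf8Length(s));
        System.out.println("modified UTF-8: " + modifiedUtf8Length(s));
    }
}
```

For this input the three measures all differ: 4 Java chars, 6 bytes of
standard UTF-8, and 9 bytes of modified UTF-8.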

If bytecount-based strings get adopted, it will be because they hold up on 
their own merits.  They're required for KinoSearch's merge model; once KS 0.20 
is out, I'll port the new benchmarking stuff, we can study the numbers, and 
assess whether the significant effort needed to pry that algo into Lucene 
would be worthwhile.

Yonik... yes, I agree.  Even better for indexing time, leave postings in 
serialized form for the entire indexing session.  :)

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>         Assigned To: Grant Ingersoll
>             Fix For: 2.1
>
>         Attachments: SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length 
> of the string is in bytes, not Java characters.  This issue has been 
> discussed at:
> http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html
> We must increment the file format number to indicate this change.  At least 
> the format number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until 
> after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 
> (other than removal of deprecated features).
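
As an illustration of the proposed format, the sketch below writes a string
as a variable-length byte count followed by standard UTF-8 bytes. It is not
the attached patch: the class and method names are hypothetical, and the
choice of a VInt-style prefix (7 payload bits per byte, low-order group
first, high bit set when more bytes follow) is an assumption made for the
example. With a byte count, a reader can skip or copy a string without
decoding its characters.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ByteCountedString {
    /** Writes value as a variable-length int: 7 bits per byte, low-order first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit marks "more bytes follow"
            value >>>= 7;
        }
        out.write(value);
    }

    /** Encodes the string as [byte count][standard UTF-8 bytes]. */
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length); // length in bytes, not Java chars
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    /** Decodes a buffer produced by encode(). */
    static String decode(byte[] data) {
        int pos = 0, len = 0, shift = 0;
        byte b;
        do {
            b = data[pos++];
            len |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        // The payload is exactly len bytes long, so no character scanning is needed.
        return new String(data, pos, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo"; // 5 Java chars, 6 UTF-8 bytes
        byte[] encoded = encode(s);
        System.out.println("prefix = " + encoded[0] + ", total bytes = " + encoded.length);
        System.out.println("round trip ok: " + decode(encoded).equals(s));
    }
}
```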
