[jira] [Comment Edited] (LUCENE-7052) BytesRefHash.sort should always sort in unicode code point order

Uwe Schindler (JIRA) Sun, 28 Feb 2016 02:34:50 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-7052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170991#comment-15170991 ]


Uwe Schindler edited comment on LUCENE-7052 at 2/28/16 10:34 AM:
-----------------------------------------------------------------

Hi Mike,
I know we originally added the different comparators so that the index term
dict could be sorted in a different order. This never proved to be useful, as
many Lucene queries rely on the default order. The only codec that used
another byte order internally was the Lucene 3 one (but it used the unicode
spaghetti algorithm to reorder its term enums at runtime). As this is now all
gone, I'd suggest also removing the utf8AsUtf16 comparator. Maybe remove the
comparators altogether and just implement BytesRef.compareTo() and use that
for sorting?
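
To make the compareTo() idea concrete, here is a minimal sketch of what a
natural order could look like. It assumes that comparing the bytes as unsigned
values is all that is needed, since for valid UTF-8 that is the same as Unicode
code point order. ByteSlice is a hypothetical stand-in invented for this
example; it is not the real BytesRef class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a byte-slice class; NOT Lucene's actual BytesRef.
final class ByteSlice implements Comparable<ByteSlice> {
  final byte[] bytes;

  ByteSlice(String text) {
    this.bytes = text.getBytes(StandardCharsets.UTF_8);
  }

  // Compare bytes as unsigned values; for valid UTF-8 this is code point order.
  @Override
  public int compareTo(ByteSlice other) {
    int limit = Math.min(bytes.length, other.bytes.length);
    for (int i = 0; i < limit; i++) {
      int a = bytes[i] & 0xff;
      int b = other.bytes[i] & 0xff;
      if (a != b) {
        return a - b;
      }
    }
    // One slice is a prefix of the other: the shorter one sorts first.
    return bytes.length - other.bytes.length;
  }

  String utf8ToString() {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    ByteSlice[] terms = {
      new ByteSlice("\u00e9"), new ByteSlice("b"), new ByteSlice("a")
    };
    Arrays.sort(terms); // natural order, no Comparator argument needed
    for (ByteSlice t : terms) {
      System.out.println(t.utf8ToString());
    }
  }
}
```

With a natural order like this, callers can sort without passing a Comparator
around at all.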

I checked the code: utf8SortedAsUTF16SortOrder is only used in TSTLookup and
nowhere else anymore (except in some tests that check alternative sort orders;
those can be removed).

As a first step I changed the BytesRef code to no longer use inner classes and
instead use lambdas to define the comparators. But I'd suggest removing at
least the UTF-16 one completely from BytesRef and moving it to TSTLookup as a
private impl detail (as it is only used there).
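
As a rough illustration of keeping the UTF-16 ordering private to its single
caller, the sketch below hides such a comparator inside one class. The class
name Utf16OrderSketch and the method compareAsUtf16 are made up for this
example, and the byte fix-up shown is one known way to get UTF-16 order from
UTF-8 bytes; it is not claimed to be the exact code in the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

// Hypothetical holder class standing in for the single caller of this ordering.
final class Utf16OrderSketch {
  // Private: nothing outside this class can depend on the UTF-16 ordering.
  private static final Comparator<byte[]> UTF8_AS_UTF16 = (a, b) -> {
    int limit = Math.min(a.length, b.length);
    for (int i = 0; i < limit; i++) {
      int aByte = a[i] & 0xff;
      int bByte = b[i] & 0xff;
      if (aByte != bByte) {
        // Lead bytes 0xee/0xef (U+E000..U+FFFF) must sort after the 4-byte
        // lead bytes (0xf0..0xf4): UTF-16 encodes supplementary characters
        // with surrogates (0xd800..0xdfff), which are below 0xe000.
        if (aByte >= 0xee && bByte >= 0xee) {
          if ((aByte & 0xfe) == 0xee) {
            aByte += 0xe;
          }
          if ((bByte & 0xfe) == 0xee) {
            bByte += 0xe;
          }
        }
        return aByte - bByte;
      }
    }
    return a.length - b.length;
  };

  // Compares the UTF-8 encodings of x and y in UTF-16 code unit order.
  static int compareAsUtf16(String x, String y) {
    return UTF8_AS_UTF16.compare(
        x.getBytes(StandardCharsets.UTF_8), y.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    String bmp = "\uff5e";              // U+FF5E, 3 bytes in UTF-8
    String supplementary = "\ud83d\ude00"; // U+1F600, 4 bytes in UTF-8
    System.out.println(Integer.signum(compareAsUtf16(bmp, supplementary)));
    System.out.println(Integer.signum(bmp.compareTo(supplementary)));
  }
}
```

The two orders only disagree when U+E000..U+FFFF is compared against a
supplementary character, which is why the fix-up touches just those lead bytes.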

_FYI: The lambda has no speed impact because it is evaluated only once and is
internally turned into a class that implements Comparator. It just looks
nicer than the horrible comparator classes._
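
For comparison, a small sketch of the two styles side by side. The class and
field names (ComparatorStyles, AS_INNER_CLASS, AS_LAMBDA) are invented for this
example and do not appear in the patch. Both fields are initialized once when
the class is loaded, so every later compare() call goes through an ordinary
Comparator instance either way.

```java
import java.util.Comparator;

// Hypothetical class contrasting the two ways of defining a comparator.
final class ComparatorStyles {
  // Old style: an anonymous inner class.
  static final Comparator<byte[]> AS_INNER_CLASS = new Comparator<byte[]>() {
    @Override
    public int compare(byte[] a, byte[] b) {
      return compareUnsigned(a, b);
    }
  };

  // New style: a lambda, evaluated once during class initialization.
  static final Comparator<byte[]> AS_LAMBDA = (a, b) -> compareUnsigned(a, b);

  // Shared logic: lexicographic comparison of bytes as unsigned values.
  static int compareUnsigned(byte[] a, byte[] b) {
    int limit = Math.min(a.length, b.length);
    for (int i = 0; i < limit; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  public static void main(String[] args) {
    byte[] low = {0x01};
    byte[] high = {(byte) 0x80};
    System.out.println(AS_INNER_CLASS.compare(low, high) < 0);
    System.out.println(AS_LAMBDA.compare(low, high) < 0);
  }
}
```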



> BytesRefHash.sort should always sort in unicode code point order
> ----------------------------------------------------------------
>
>                 Key: LUCENE-7052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7052
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master, 6.0
>
>         Attachments: LUCENE-7052-cleanup1.patch, LUCENE-7052.patch
>
>
> Today {{BytesRefHash.sort}} takes a custom {{Comparator}} but we always pass 
> it {{BytesRef.getUTF8SortedAsUnicodeComparator()}}.



