[jira] Commented: (LUCENE-2798) Randomize indexed collation key testing

Steven Rowe (JIRA) Wed, 08 Dec 2010 09:54:27 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969395#action_12969395
 ]


Steven Rowe commented on LUCENE-2798:
-------------------------------------

{quote}
Are we sure we shouldn't deprecate the jdk collation functionality (remove in 
trunk) and only offer ICU?

I was just thinking that the JDK Collator integration is basically a RAM trap 
due to its awful keysize, etc:
http://site.icu-project.org/charts/collation-icu4j-sun
{quote}

I don't like this idea, because it removes the choice.

If there were some way to perform deprecation without eventual removal, I'd be 
okay with it.  The issue, as I see it, is documentation.  Here is an excerpt 
from the current class-level javadoc for {{CollationKeyFilter}}:

{quote}
The <code>ICUCollationKeyFilter</code> in the icu package of Lucene's contrib 
area uses ICU4J's Collator, which makes its version available, thus allowing 
collation to be versioned independently from the JVM.  ICUCollationKeyFilter is 
also significantly faster and generates significantly shorter keys than 
CollationKeyFilter.  See http://site.icu-project.org/charts/collation-icu4j-sun 
for key generation timing and key length comparisons between ICU4J and 
java.text.Collator over several languages.
{quote}

So an attempt is already being made to inform potential victims of the choice 
they're making - it even links to the same web page you mentioned.

Maybe if we move the JDK variant out of core and into a module, rather than on 
trunk, it would at least send a message that it's on par with the ICU variant.


> Randomize indexed collation key testing
> ---------------------------------------
>
>                 Key: LUCENE-2798
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>
> Robert Muir noted on #lucene IRC channel today that Lucene's indexed 
> collation key testing is currently fragile (for example, the tests had to be 
> revisited when Robert upgraded the ICU dependency in LUCENE-2797 because of 
> Unicode 6.0 collation changes) and coverage is trivial (only 5 locales 
> tested, and no collator options are exercised).  This affects both the JDK 
> implementation in {{modules/analysis/common/}} and the ICU implementation 
> under {{modules/icu/}}.
> The key thing to test is that the order of the indexed terms is the same as 
> that provided by the Collator itself.  Instead of the current set of static 
> tests, this could be achieved via indexing randomly generated terms' 
> collation keys (and collator options) and then comparing the index terms' 
> order to the order provided by the Collator over the original terms.
> Since different terms may produce the same collation key, however, the order 
> of indexed terms is inherently unstable.  When performing runtime collation, 
> the Collator addresses the sort stability issue by adding a secondary sort 
> over the normalized original terms.  In order to directly compare Collator's 
> sort with Lucene's collation key sort, a secondary sort will need to be 
> applied to Lucene's indexed terms as well. Robert has suggested indexing the 
> original terms in addition to their collation keys, then using a Sort over 
> the original terms as the secondary sort.
> Another complication: Lucene 3.X uses Java's UTF-16 term comparison, and 
> trunk uses UTF-8 order, so the implemented secondary sort will need to 
> respect that.
> From #lucene:
> {quote}
> rmuir__: so i think we have to on 3.x, sort the 'expected list' with 
> Collator.compare, if thats equal, then as a tiebreak use String.compareTo
> rmuir__: and in the index sort on the collated field, followed by the 
> original term
> rmuir__: in 4.x we do the same thing, but dont use String.compareTo as the 
> tiebreak for the expected list
> rmuir__: instead compare codepoints (iterating character.codepointAt, or 
> comparing .getBytes("UTF-8"))
> {quote}
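
The tie-break scheme from the IRC excerpt above could be sketched roughly as follows for the "expected list" side. This is only an illustration, not the test code attached to this issue: the class and method names are hypothetical, and it uses java.text.Collator (the ICU Collator would be handled analogously). The first comparator falls back to String.compareTo (UTF-16 code unit order, as on 3.x); the second falls back to code point order, which is the same as UTF-8 byte order (as on trunk).

```java
import java.text.Collator;
import java.util.Comparator;

// Hypothetical helper; names are illustrative and not part of Lucene.
class ExpectedOrderSketch {

    // 3.x style: Collator order first, then UTF-16 code unit order as tie-break.
    static Comparator<String> withUtf16TieBreak(Collator collator) {
        return (a, b) -> {
            int c = collator.compare(a, b);
            return c != 0 ? c : a.compareTo(b);
        };
    }

    // Trunk style: Collator order first, then code point order as tie-break.
    // Code point order is identical to the byte order of the UTF-8 encodings.
    static Comparator<String> withCodePointTieBreak(Collator collator) {
        return (a, b) -> {
            int c = collator.compare(a, b);
            return c != 0 ? c : compareCodePoints(a, b);
        };
    }

    // Compares two strings code point by code point; a proper prefix sorts first.
    static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return ca < cb ? -1 : 1;
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // At most one of the two strings has characters left at this point.
        return (a.length() - i) - (b.length() - j);
    }
}
```

The two tie-breaks only disagree when supplementary characters are involved: U+FF5E sorts after U+10000 under String.compareTo (because the surrogate 0xD800 is less than 0xFF5E), but before it in code point order. Sorting the original terms with one of these comparators would give the expected list to compare against the index order (collated field, then original term).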

