[ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018383#comment-13018383 ]
Robert Muir commented on LUCENE-2798:
-------------------------------------

Also, I don't see any check that the preflex codec isn't in use for this test?

> Randomize indexed collation key testing
> ---------------------------------------
>
>                 Key: LUCENE-2798
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2798.patch
>
>
> Robert Muir noted on #lucene IRC channel today that Lucene's indexed
> collation key testing is currently fragile (for example, they had to be
> revisited when Robert upgraded the ICU dependency in LUCENE-2797 because of
> Unicode 6.0 collation changes) and coverage is trivial (only 5 locales
> tested, and no collator options are exercised). This affects both the JDK
> implementation in {{modules/analysis/common/}} and the ICU implementation
> under {{modules/icu/}}.
> The key thing to test is that the order of the indexed terms is the same as
> that provided by the Collator itself. Instead of the current set of static
> tests, this could be achieved via indexing randomly generated terms'
> collation keys (and collator options) and then comparing the index terms'
> order to the order provided by the Collator over the original terms.
> Since different terms may produce the same collation key, however, the order
> of indexed terms is inherently unstable. When performing runtime collation,
> the Collator addresses the sort stability issue by adding a secondary sort
> over the normalized original terms. In order to directly compare Collator's
> sort with Lucene's collation key sort, a secondary sort will need to be
> applied to Lucene's indexed terms as well. Robert has suggested indexing the
> original terms in addition to their collation keys, then using a Sort over
> the original terms as the secondary sort.
> Another complication: Lucene 3.X uses Java's UTF-16 term comparison, and
> trunk uses UTF-8 order, so the implemented secondary sort will need to
> respect that.
> From #lucene:
> {quote}
> rmuir__: so i think we have to on 3.x, sort the 'expected list' with
> Collator.compare, if thats equal, then as a tiebreak use String.compareTo
> rmuir__: and in the index sort on the collated field, followed by the
> original term
> rmuir__: in 4.x we do the same thing, but dont use String.compareTo as the
> tiebreak for the expected list
> rmuir__: instead compare codepoints (iterating character.codepointAt, or
> comparing .getBytes("UTF-8"))
> {quote}
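
For illustration only, here is a minimal sketch of the two "expected list" orderings described in the IRC excerpt above. It is not taken from the attached patch. The class and method names (ExpectedOrderSketch, withTiebreak, expectedOrder3x, expectedOrder4x, compareCodePoints) are hypothetical. It assumes a java.text.Collator and well-formed strings with no unpaired surrogates. The 3.x variant breaks ties with String.compareTo (UTF-16 code unit order). The 4.x variant breaks ties by code point, which for well-formed strings gives the same order as comparing the UTF-8 bytes.

```java
import java.text.Collator;
import java.util.Comparator;

// Hypothetical helper for illustration; not part of Lucene or of the attached patch.
final class ExpectedOrderSketch {

    // Compare by Unicode code point. For well-formed strings this gives the
    // same order as comparing their UTF-8 encodings as unsigned bytes.
    static int compareCodePoints(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(i);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            // Equal code points occupy the same number of chars in both strings.
            i += Character.charCount(cpA);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length(), b.length());
    }

    // Primary order from the Collator; the tiebreak is consulted only when
    // the Collator reports the two terms as equal.
    static Comparator<String> withTiebreak(Collator collator, Comparator<String> tiebreak) {
        return (a, b) -> {
            int c = collator.compare(a, b);
            return c != 0 ? c : tiebreak.compare(a, b);
        };
    }

    // 3.x: tiebreak in UTF-16 code unit order (String.compareTo).
    static Comparator<String> expectedOrder3x(Collator collator) {
        return withTiebreak(collator, Comparator.naturalOrder());
    }

    // 4.x (trunk): tiebreak in code point order.
    static Comparator<String> expectedOrder4x(Collator collator) {
        return withTiebreak(collator, ExpectedOrderSketch::compareCodePoints);
    }
}
```

The two tiebreaks disagree only when a supplementary character is compared against a BMP character above the surrogate range. For example, U+FF5E sorts after U+1D400 under String.compareTo but before it in code point order. Randomly generated test terms would need to include supplementary characters to exercise that difference.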