RE: Sort differences between .NET and Java in Lucene.Net 2.0

George Aroush Wed, 13 Dec 2006 18:07:50 -0800

Hi Joe and all,

I don't think we can use CompareOrdinal() as it doesn't take locale into
consideration.

The issue is with the following function in
Lucene.Net.Search.FieldSortedHitQueue.cs:

    public int Compare(ScoreDoc i, ScoreDoc j)
    {
        return collator.Compare(index[i.doc].ToString(), index[j.doc].ToString());
    }

To demonstrate how Java and C# differ in the way they compare strings, here
is some sample code:

    // C# code: you get back -1 for 'res'
    string s1 = "H\u00D8T";
    string s2 = "HUT";
    System.Globalization.CultureInfo locale = new System.Globalization.CultureInfo("en-US");
    System.Globalization.CompareInfo collator = locale.CompareInfo;
    int res = collator.Compare(s1, s2);

    // Java code: you get back 1 for 'diff'
    String s1 = "H\u00D8T";
    String s2 = "HUT";
    Collator collator = Collator.getInstance (Locale.US);
    int diff = collator.compare(s1, s2);

Who is doing the right thing?  Or am I missing additional calls before I can
compare?

My goal is to understand why the difference exists so that we can judge how
serious this is and either fix it or accept it as a language difference.

Btw, I am going to post this question on the Java Lucene mailing list to see
what folks in Java land have to say.

Regards,

-- George Aroush

-----Original Message-----
From: Joe Shaw [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 13, 2006 1:35 PM
To: [email protected]
Cc: [email protected]
Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0

Hi,

On Wed, 2006-12-13 at 11:35 -0500, George Aroush wrote:
> This is why those two tests are failing and I wonder if this is a 
> defect in .NET or in the way the culture info is used in those two 
> languages or if there is more culture setting I have to do in .NET.
> 
> My thinking is, in .NET during compare, "\u00D8", is being treated as 
> ASCII "O" and not the Unicode character that it really is.

This isn't the case, because if so "HOT" would be equal to "H\u00D8T".  

I think that the sort order is just different between .NET and Java -- i.e.,
the order is "O", "\u00D8", "U" in .NET but "O", "U", "\u00D8" in Java -- at
least in the culture you're using.  
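
One way to check the ordering on the Java side is to sort the three strings
with a locale-aware Collator and look at where "H\u00D8T" lands. This is only
a sketch: the class name CollatorOrderSketch and the helper sortWithCollator
are made up for illustration, and the position of "H\u00D8T" depends on the
collation rules shipped with the runtime, so no output is assumed here.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of Lucene or Lucene.Net.
class CollatorOrderSketch {

    // Returns a sorted copy of 'values' using the collation rules for 'locale'.
    static List<String> sortWithCollator(List<String> values, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(collator); // Collator implements Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("HUT", "H\u00D8T", "HOT");
        // Print the strings in the order the en-US collator puts them.
        for (String w : sortWithCollator(words, Locale.US)) {
            System.out.println(w);
        }
    }
}
```

Running the same three strings through a sort in .NET with the en-US
CompareInfo should make the difference in ordering visible side by side.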

If you're looking for the actual numerical values of the characters for
comparison (in which "\u00D8" would be quite a bit higher than both "O"
and "U"), you probably want to use String.CompareOrdinal().
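
For comparison, the rough Java counterpart of an ordinal compare is
String.compareTo(), which compares UTF-16 code unit values and ignores the
locale. The sketch below assumes that equivalence; the class name
OrdinalCompareSketch and the helper compareOrdinal are hypothetical names
used only for illustration.

```java
// Hypothetical helper class for illustration only.
class OrdinalCompareSketch {

    // Compares by UTF-16 code unit values, ignoring any locale.
    static int compareOrdinal(String a, String b) {
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        // U+00D8 (216) is numerically greater than 'O' (79) and 'U' (85),
        // so "H\u00D8T" sorts after both "HOT" and "HUT" in ordinal order.
        System.out.println(compareOrdinal("H\u00D8T", "HOT") > 0); // true
        System.out.println(compareOrdinal("H\u00D8T", "HUT") > 0); // true
    }
}
```

An ordinal compare gives the same answer on both platforms, at the cost of
ignoring the locale-specific ordering that the collator provides.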

BTW, doing culture-insensitive string comparisons might be a good thing to
do anyway.  From the MSDN docs for String.Compare(string, string):

        The comparison uses the current culture to obtain
        culture-specific information such as casing rules and the
        alphabetic order of individual characters. For example, a
        culture could specify that certain combinations of characters be
        treated as a single character, or uppercase and lowercase
        characters be compared in a particular way, or that the sorting
        order of a character depends on the characters that precede or
        follow it.

For more info, see the String.Compare() docs:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemStringclassComparetopic.asp

Joe
