Hi Joe and all,
I don't think we can use CompareOrdinal() as it doesn't take locale into
consideration.
The issue is with the following function in
Lucene.Net.Search.FieldSortedHitQueue.cs:
public int Compare(ScoreDoc i, ScoreDoc j)
{
    return collator.Compare(index[i.doc].ToString(),
                            index[j.doc].ToString());
}
To demonstrate how Java and C# differ in the way they compare strings, here
is some sample code:
// C# code: you get back -1 for 'res'
string s1 = "H\u00D8T";
string s2 = "HUT";
System.Globalization.CultureInfo locale =
    new System.Globalization.CultureInfo("en-US");
System.Globalization.CompareInfo collator = locale.CompareInfo;
int res = collator.Compare(s1, s2);
// Java code: you get back 1 for 'res'
// (needs java.text.Collator and java.util.Locale)
String s1 = "H\u00D8T";
String s2 = "HUT";
Collator collator = Collator.getInstance(Locale.US);
int res = collator.compare(s1, s2);
Who is doing the right thing? Or am I missing additional calls before I can
compare?
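One setting that may be worth ruling out on the Java side is the collator
strength. Below is a minimal sketch, assuming the JDK's built-in rules for
Locale.US. The class name CollatorStrengthDemo and the helper compareSign
are hypothetical names made up for this illustration, and I am not claiming
what each strength prints; the point is only to see whether the sign changes
with the strength.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of Lucene or Lucene.Net.
class CollatorStrengthDemo {
    // Returns the sign (-1, 0, or 1) of a locale-aware comparison
    // performed at the given collator strength.
    static int compareSign(String a, String b, Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        String s1 = "H\u00D8T";
        String s2 = "HUT";
        int[] strengths = {Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY};
        String[] names = {"PRIMARY", "SECONDARY", "TERTIARY"};
        // Print the comparison sign for each strength so they can be compared.
        for (int i = 0; i < strengths.length; i++) {
            System.out.println(names[i] + ": "
                + compareSign(s1, s2, Locale.US, strengths[i]));
        }
    }
}
```

If the sign is the same at every strength, the difference from .NET is
probably in the collation rules themselves rather than in a missing call.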
My goal is to understand why the difference exists, so that we can judge how
serious it is and either fix it or accept it as a language difference.
Btw, I am going to post this question on the Java Lucene mailing list to see
what folks in Java land have to say.
Regards,
-- George Aroush
-----Original Message-----
From: Joe Shaw [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 13, 2006 1:35 PM
To: [email protected]
Cc: [email protected]
Subject: RE: Sort differences between .NET and Java in Lucene.Net 2.0
Hi,
On Wed, 2006-12-13 at 11:35 -0500, George Aroush wrote:
> This is why those two tests are failing, and I wonder if this is a
> defect in .NET, or in the way the culture info is used in those two
> languages, or if there is more culture setting I have to do in .NET.
>
> My thinking is, in .NET during compare, "\u00D8", is being treated as
> ASCII "O" and not the Unicode character that it really is.
This isn't the case, because if so "HOT" would be equal to "H\u00D8T".
I think that the sort order is just different between .NET and Java -- ie,
the order is "O", "\u00D8", "U" in .NET but "O", "U", "\u00D8" in Java -- at
least in the culture you're using.
If you're looking for the actual numerical values of the characters for
comparison (in which "\u00D8" would be quite a bit higher than both "O"
and "U"), you probably want to use String.CompareOrdinal().
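For reference, the rough Java counterpart of an ordinal comparison is
String.compareTo(), which compares UTF-16 code unit values and ignores
locale. A small sketch follows; the class name OrdinalCompareDemo and the
helper ordinalSign are hypothetical names for this illustration.

```java
// Hypothetical demo class; not part of Lucene or Lucene.Net.
class OrdinalCompareDemo {
    // Returns the sign (-1, 0, or 1) of a code-unit-by-code-unit comparison.
    static int ordinalSign(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    public static void main(String[] args) {
        // U+00D8 (216) is greater than 'U' (85), so this prints 1.
        System.out.println(ordinalSign("H\u00D8T", "HUT"));
        // 'O' (79) is less than U+00D8 (216), so this prints -1.
        System.out.println(ordinalSign("HOT", "H\u00D8T"));
    }
}
```

With this kind of comparison the order is "O", "U", "\u00D8" whatever the
culture, but locale-specific ordering is lost.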
BTW, doing culture-insensitive string comparisons might be a good thing to
do anyway. From the MSDN docs for String.Compare(string, string):
The comparison uses the current culture to obtain
culture-specific information such as casing rules and the
alphabetic order of individual characters. For example, a
culture could specify that certain combinations of characters be
treated as a single character, or uppercase and lowercase
characters be compared in a particular way, or that the sorting
order of a character depends on the characters that precede or
follow it.
For more info, see the String.Compare() docs:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemStringclassComparetopic.asp
Joe