Author: atsushi
Date: 2005-05-18 03:01:26 -0400 (Wed, 18 May 2005)
New Revision: 44665
Modified:
branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt
branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collator.cs
Log:
2005-05-18 Atsushi Enomoto <[EMAIL PROTECTED]>
* Collation-notes.txt :
Added task list.
Revised comparison methods; backward iteration is possible.
More on char-by-char comparison.
Level 4 comparison is actually a bit more complex.
Misc corrections.
* Collator.cs : some conceptual updates wrt above.
Modified:
branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt
===================================================================
---
branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt
2005-05-18 06:29:08 UTC (rev 44664)
+++
branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt
2005-05-18 07:01:26 UTC (rev 44665)
@@ -5,6 +5,26 @@
We are going to implement Windows-like collation, apart from ICU (that
implements UCA, Unicode Collation Algorithm).
+
+** Tasks
+
+ * create collation element table(s)
+ - infer how Windows collation element table is composed
+ - infer how Windows sortkey is structured
+ - design data structures
+ - write table generator source(s)
+ * create culture-specific sortkey modifiers
+ - data structures
+ - data
+ * implement collation methods
+ - sortkey collector methods
+ - character iterator models and comparison methods
+ - string search optimizations
+
+ TODO: add "progress" information on these list items; currently many
+ of them are half-baked.
+
+
** How to implement CompareInfo members
GetSortKey()
@@ -17,21 +37,28 @@
the head of the searchee and immediately return false if it
didn't match.
IsSuffix()
- (TODO: revising. See Collator.cs for current idea.)
+ For each character in the target, examine if it matches
+ form the tail of the searchee and immediately return false if
+ it didn't match. It uses reversal element query.
IndexOf()
Try IsPrefix() to the end of the string to find.
LastIndexOf()
- Try IsPrefix() from the end to the beginning of the string.
- But start with the same length as the target and if IndexOf()
- was not -1, then find the last one until IndexOf() returns -1.
+ Try IsSuffix() from the end of the string to find.
+ It uses reversal element query.
For IndexOf() and LastIndexOf(), overloads of char argument and string
argument will be unified by providng common CharacterIterator base.
- We are not likely to provide reversed character iterator (it is
- complex; handle context-aware composites and expansions).
- http://blogs.msdn.com/michkap/archive/2005/03/16/396704.aspx
+ For GetSortKey(), Compare(), IsPrefix() and IndexOf(), it uses forward
+ iteration, which moves forward and don't stop until either it finds
+ next "primary" character or it reached the end of the string, checking
+ HasContractHead(char) for composite.
+ For IsSuffix() and LastIndexOf(), it uses backward iteration, which
+ moves backward and don't stop until either it finds "primary"
+ characters or it reached the beginning of the string, checking
+ HasContractTail(char) for composite.
+
** How to support CompareOptions
There are two kind of "ignorance" : ignorance which acts as stripper,
@@ -151,6 +178,10 @@
in InvariantCulture)
- level 5: variable weighting (apostrophe, hyphens etc.)
+ Note that these levels does not digitally match IgnoreXXX flags. Thus
+ it is not OK that we omit some levels of sortkey values in reflection
+ to CompareOptions.
+
String comparison is done from level 1 to 5. The comparison won't
stop until either it found the primary difference, or it reached to
the end (thus upper level differences are returned).
@@ -208,10 +239,16 @@
Those sortkey data is collected only for Japanese (category 22)
characters.
- There are 3 sections separated by FF. Each of them represents the
- values for character by character:
+ There are 3 sections each of them ends by FF. Each of them represents
+ the values for character by character:
- small letter type (kogaki moji); C4 (small) or E4 (normal)
- - kana type; E4 (hiragana) or C4 (katakana)
+ - category middle section:
+ + dash mark 4 (\u309D,\u309E,\u30FD,\u30FE), 5 (\u30FC)
+ LAMESPEC: those characters of value '4' differs in level 2
+ wrt voice marks, but does not differetiate kana types (bug).
+ It is ignored when IgnoreNonSpace applies.
+ + 0x02.
+ + kana type: E4 (hiragana) or C4 (katakana)
- width; C5 (full) or C4 (half)
**** level 5
@@ -292,11 +329,18 @@
http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C2.asp
(Note that composite is different from expansion.)
+ <del>
According to "Developing International Software" book, in Win32
lstrcmpi(), "sc" in hu-HU is treated as a single character, so if it is
compared against "Sc" (in IgnoreCase), "sc" won't match it. However
.NET (LCMapString) behaves differently.
+ </del>
+ It is wrong. It is "cs" that is treated specially, and also applies to
+ LCMapString collation.
+ Note that composite characters is likely to not have equivalent
+ codepoint.
+
**** Expanded character processing
Some characters are expanded to two or more characters:
@@ -698,9 +742,10 @@
Provides several character information:
- ignorable, ignorable nonspace, normalize width, normalize kanatype
- - (has) nonspace normalization; for Japanese Kana collation
+ - level 4 sortkey provision method(s)
- (has) variable weighting
- - (has) contraction/expansion
+ - (has) contraction, both from head and tail
+ - (has) expansion
- default sortkey table
- culture dependent sortkey table
- culture dependent sortkey adjustment table
@@ -718,7 +763,8 @@
bunch of string instances. The simple set of below is enough:
- MoveNext(),
- - Index,
+ - MoveBack(),
+ - Current,
- ElementLength, and
- Reset()
@@ -729,6 +775,37 @@
one, and it will use culture-dependent character combination/expansion
tables.
+**** character comparison
+
+ Since composite character is likely to *not have* equivalent
+ codepoint, character comparison could not just be done by expecting
+ "resulting char" value.
+ In contrast, since composite character is likely to *do have*
+ equivalent codepoint, character comparison could not also just be done
+ by comparing "source char" value.
+
+ Collator.CompareChar() compares character elements in two character
+ iterators and returns the difference "level". The returned value is
+ used either to check equality or save difference level (see Compare()
+ description for details).
+
+ Character iterator comparison is first done by codepoints i.e. it
+ first detects where the codepoints differ.
+
+ From where those codepoints differ, for each iterators it adjusts the
+ position so that it represents exactly one character element. That is,
+ find primary character as the start of the range and the last
+ nonprimary character as the end of the range.
+
+ Once the comparer adjusted the character iterator to be valid
+ comparison position, further comparison is done by following steps:
+
+ Then, return comparison results of sortkeys of n(s, l, f), where
+ function n returns keys for such s, l, f that:
+ - s : the source character (or string) of character iterator,
+ - l : the comparison level
+ - f : CompareOption (note that it is different from level)
+
*** sort key element table
We will create *our own* collation element table which is closer
@@ -766,7 +843,8 @@
**** Level 3: Case properties
Case properties will be halfly computed and halfly stored as a byte
- array (mostly for Latin characters), with limited areas of codepoint.
+ array (mostly for Latin characters), with limited areas of codepoint
+ (cp < 3120 || FE00 < cp).
For Hangul characters, it will be computed by codepoint areas.
Modified:
branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collator.cs
===================================================================
--- branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collator.cs
2005-05-18 06:29:08 UTC (rev 44664)
+++ branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collator.cs
2005-05-18 07:01:26 UTC (rev 44665)
@@ -56,7 +56,7 @@
l1b = l2b = l3b = l4b = l5b = null;
}
- internal void AdjustBufferSize (string s, int kanaWeight)
+ internal void AdjustBufferSize (string s)
{
if (l1b == null || l1b.Length < s.Length)
l1b = new byte [s.Length * 2 + 10];
@@ -64,10 +64,12 @@
l2b = new byte [s.Length + 10];
if (l3b == null || l3b.Length < s.Length)
l3b = new byte [s.Length + 10];
- // For level 4 in Japanese it might spend large key
- // data (happens only in Japanese)
- if (l4b == null || l4b.Length < s.Length)
- l4b = new byte [s.Length * kanaWeight + 10];
+ // This weight is used only in Japanese text.
+ // We could expand the initial length as well as
+ // primary length (actually x3), but even in ja-JP
+ // most of the compared strings won't be Japanese.
+ if (l4b == null)
+ l5b = new byte [10];
if (l5b == null)
l5b = new byte [10];
}
@@ -82,11 +84,14 @@
AppendBufferPrimitive (table [idx2++], ref l2b,
ref l2);
while (table [idx3] != 0)
AppendBufferPrimitive (table [idx3++], ref l3b,
ref l3);
- while (table [idx4] != 0)
- AppendBufferPrimitive (table [idx4++], ref l4b,
ref l4);
+ // level 4 weight is added only for Japanese kana.
+ if (idx4 > 0)
+ while (table [idx4] != 0)
+ AppendBufferPrimitive (table [idx4++],
+ ref l4b, ref l4);
}
- // Append variable character.
+ // Append variable-weight character.
internal void AppendLevel5 (byte [] table, int idx, int
currentIndex)
{
// offset
@@ -179,9 +184,16 @@
protected readonly CompareOptions Options;
protected readonly Collator Collator;
+ // MinValue: does not have a valid value
protected char Value;
- protected char Next;
+ // extensions are collected to char[] in both default table
+ // and culture-dependent table.
+ protected char [] ExtensionTable;
+ protected int ExtensionStart; // -1 when no extension
+ protected int ExtensionEnd;
+ protected int ExtensionCurrent;
+
public abstract bool MoveNext ();
public abstract void Reset ();
}
@@ -214,19 +226,37 @@
public override bool MoveNext ()
{
- if (Next != '\0') {
- Value = Next;
- Next = '\0';
- return true;
+ if (ExtensionStart > 0) {
+ if (ExtensionCurrent++ < ExtensionEnd)
+ return true;
+ ExtensionStart = -1;
}
Current += Length;
- if (end <= Current)
+ if (end <= Current) // actually '<' must not happen.
return false;
+ // mhm, it might be better to have it inside this type.
Collator.MoveIteratorNext (this);
return true;
}
+ public override bool MoveBack ()
+ {
+ if (ExtensionStart > 0) {
+ if (ExtensionCurrent++ >= 0)
+ return true;
+ ExtensionStart = -1;
+ }
+
+ // we cannot adjust Current here but should do it
+ // inside MoveIteratorBack().
+ if (start >= Current) // actually '>' must not happen
+ return false;
+ // mhm, it might be better to have it inside this type.
+ Collator.MoveIteratorBack (this);
+ return true;
+ }
+
public void MoveTo (int i)
{
Reset ();
@@ -262,14 +292,28 @@
{
if (done)
return false;
- if (Next != '\0') {
- Value = Next;
- Next = '\0';
- return true;
+ if (ExtensionStart > 0) {
+ if (ExtensionCurrent++ < ExtensionEnd)
+ return true;
+ ExtensionStart = -1;
}
- Collator.MoveIteratorNext (this);
- return true;
+ return Collator.MoveIteratorNext (this);
}
+
+ public override bool MoveBack ()
+ {
+ if (done)
+ return false;
+ if (ExtensionStart > 0) {
+ if (ExtensionCurrent-- >= 0)
+ return true;
+ ExtensionStart = -1;
+ }
+ if (ExtensionStart > 0) {
+ if (Extension
+
+ return Collator.MoveIteratorBack (this);
+ }
}
internal class Collator
@@ -668,7 +712,7 @@
public byte [] GetSortKey (string s, CompareOptions opt)
{
StringIterator iter = new StringIterator (s, 0,
s.Length, opt, this);
- buf.AdjustBufferSize (s, culture.Name == "ja-JP" ? 4 :
1);
+ buf.AdjustBufferSize (s);
while (iter.MoveNext ())
GetSortKeyForChar (iter, buf);
return buf.GetResultAndReset ();
_______________________________________________
Mono-patches maillist - [email protected]
http://lists.ximian.com/mailman/listinfo/mono-patches