> I don't think UTF-32 will save you much. The Unicode case map is variable
> length; combining characters, canonical equivalence, and many other things
> will require variable-length mapping. For example, if I only want to

This is true.
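
To make it concrete: a single code point can case-map to several, and
canonically equivalent strings can differ in length, so fixed-width code
points do not buy you fixed-width mappings.  A quick illustration (Python
here only because it is handy for poking at Unicode):

    import unicodedata

    print("ß".upper())        # 'SS': one code point case-maps to two
    print(len("e\u0301"), len(unicodedata.normalize("NFC", "e\u0301")))
    # 2 1: two canonically equivalent spellings of 'é', different lengths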

> parse /[0-9]+/, why would you want to convert everything to UTF-32? Most of
> the time, regcomp() can find out whether this regexp will need complicated
> preprocessing. Another example: if I want to search for /resume/e
> (equivalence matching), the regex engine can normalize the case, fully
> decompose the input string, strip off any combining characters, and do 8-bit

Hmmm.  The above sounds more complicated than what I had in mind
for equivalence matching: I would have just said "both the pattern
and the target need to be normalized, as defined by Unicode".  Then
the comparison and searching reduce to the trivial cases of byte
equivalence and byte searching (of which B-M is the most popular example).
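
Roughly what I mean, as a sketch (Python only for illustration; fold() is a
made-up name, and a real engine would do this once over the string data in
C): decompose, drop the combining marks for the disregard-diacritics flavour
of equivalence, case-fold both sides, and the search becomes a plain
byte-string search.

    import unicodedata

    def fold(s):
        # Fully decompose (NFD), drop combining marks, then case-fold.
        nfd = unicodedata.normalize("NFD", s)
        stripped = "".join(c for c in nfd if not unicodedata.combining(c))
        return stripped.casefold()

    pattern = fold("resume")
    text = fold("Attachez votre re\u0301sume\u0301")
    print(pattern in text)   # True; 'in' here stands in for the byte search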

> Boyer-Moore search. I bet it will be simpler and faster than using UTF-32.
> (BTW, equivalence matching here means matching English spelling against
> French spelling, disregarding diacritics.)
> 
> I think we should explore more choices and do some experiments.
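
(For what it is worth, the 8-bit search half of that bet is not much code;
a rough Boyer-Moore-Horspool sketch, again in Python purely for illustration,
working on already-folded bytes:

    def bmh_search(haystack: bytes, needle: bytes) -> int:
        # Bad-character-only Boyer-Moore variant; first offset, or -1.
        m, n = len(needle), len(haystack)
        if m == 0:
            return 0
        table = [m] * 256              # full shift for bytes not in the pattern
        for i, b in enumerate(needle[:-1]):
            table[b] = m - 1 - i
        skip = 0
        while n - skip >= m:
            i = m - 1
            while haystack[skip + i] == needle[i]:
                if i == 0:
                    return skip
                i -= 1
            skip += table[haystack[skip + m - 1]]
        return -1

    print(bmh_search(b"attachez votre resume", b"resume"))   # 15

Whether that actually beats a UTF-32 approach is exactly the kind of thing
the experiments should measure.)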

What do you mean by *we*? :-) I am not a p6-internals regular, nor do
I intend to become one; there are only so many hours in a day.  But yes,
the sooner we get into exploration/experiment mode, the better.  The
Unicode mindset *must* be adopted sooner rather than later; "unwriting"
8-bit-byteism out of the code later is hell.  Hopefully my little
treatise will kick Parrot more or less in the right direction.

> Hong

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
