> I don't think UTF-32 will save you much. The unicode case map is variable
> length; combining characters, canonical equivalence, and many other things
> will require variable-length mapping.

This is true.

> For example, if I only want to parse /[0-9]+/, why would you want to
> convert everything to UTF-32? Most of the time, regcomp() can find out
> whether this regexp will need complicated preprocessing. Another example:
> if I want to search for /resume/e (equivalence matching), the regex
> engine can normalize the case, fully decompose the input string, strip
> off any combining characters, and do an 8-bit Boyer-Moore search. I bet
> it will be simpler and faster than using UTF-32. (BTW, equivalence
> matching means matching English spelling against French spelling,
> disregarding diacritics.)

Hmmm. The above sounds complicated, and not quite what I had in mind for
equivalence matching: I would have just said "both the pattern and the
target need to be normalized, as defined by Unicode". Then the comparison
and searching reduce to the trivial cases of byte equivalence and byte
searching (of which Boyer-Moore is the most popular example).

> I think we should explore more choices and do some experiments.

What do you mean by *we*? :-) I am not a p6-internals regular, nor do I
intend to be; there are only so many hours in a day. But yes, the sooner
we get into exploration/experiment mode, the better. The Unicode mindset
*must* be adopted sooner rather than later; "unwriting" 8-bit-byteism out
of the code later is hell. Hopefully my little treatise will kick Parrot
more or less in the right direction.

> Hong

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen
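
P.S. A minimal sketch of the normalize-then-search idea discussed above, in
Python using the stdlib unicodedata module. This is only an approximation:
stripping category-Mn combining marks after NFD decomposition handles Latin
diacritics but not full Unicode equivalence, and Python's plain substring
search stands in for Boyer-Moore. The function names are mine, not from any
existing engine.

```python
import unicodedata

def equiv_fold(s: str) -> str:
    # Fully decompose (NFD), drop combining marks (category Mn),
    # and case-fold, so equivalence comparison reduces to plain
    # codepoint equality.
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed
                       if unicodedata.category(c) != "Mn")
    return stripped.casefold()

def equiv_search(pattern: str, text: str) -> bool:
    # Fold both the pattern and the target the same way; the search
    # itself is then an ordinary substring search (for which
    # Boyer-Moore would be the natural choice in a real engine).
    return equiv_fold(pattern) in equiv_fold(text)

print(equiv_search("resume", "mon r\u00e9sum\u00e9"))  # True: é matches e
```

The point is that all the Unicode-specific work happens once, in the fold,
after which the matching loop never has to think about combining characters
at all.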