On Fri, Jan 18, 2002 at 11:40:17PM +0000, Nicholas Clark wrote:
> On Fri, Jan 18, 2002 at 05:24:00PM +0200, Jarkko Hietaniemi wrote:
>
> > > As for character encodings, we're forcing everything to UTF-32 in
> > > regular expressions. No exceptions. If you use a string in a regex,
> > > it'll be transcoded. I honestly can't think of a better way to
> > > guarantee efficient string indexing.
> >
> > I'm fine with that. The bloat is of course a shame, but as long as
> > that's not a real problem for someone, let's not worry about it too
> > much.
>
> Forcing everything to UTF-32 in the API?

I think Brent meant UTF-32 internally for the regexen. When you say
/a/, Parrot sees 0x00 0x00 0x00 0x61.

> To me it seems that making UTF-32 do everything correctly which the real
> world can use while encoding optimised versions are written is better than
> having a snazzy 4 encoding autoswitcher that is wrong and therefore not
> releasable to the world.

Now, now. But yes, maybe selecting *one* first (and getting its
implementation right) would be good, and in that case it's either
UTF-16 (which is reasonably compact, but variable length), or UTF-32
(which is a bit wasteful, but fixed length, and therefore easy to think
in). So I guess UTF-32 wins.

> But I don't know about how the internals of all these things work, so I
> may well be wrong on any technical detail.

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
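
To make the fixed-length versus variable-length trade-off concrete, here is a
minimal Python sketch. It is an illustration only, not Parrot code: the helper
names code_units and utf32_char_at are made up for this example, and
big-endian byte order is assumed throughout.

```python
# Illustration only: compare how UTF-32 and UTF-16 lay out the same text.
# The helper names below are hypothetical, not part of Parrot.

def code_units(text, encoding, unit_size):
    """Split the encoded bytes of text into fixed-size code units."""
    raw = text.encode(encoding)
    return [raw[i:i + unit_size] for i in range(0, len(raw), unit_size)]

def utf32_char_at(raw, index):
    """Fetch character number `index` from UTF-32BE bytes by plain arithmetic."""
    return raw[4 * index:4 * index + 4].decode("utf-32-be")

# 'a', then MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), then 'b'.
sample = "a\U0001D11Eb"

utf32_units = code_units(sample, "utf-32-be", 4)
utf16_units = code_units(sample, "utf-16-be", 2)

# UTF-32: exactly one 4-byte unit per character.
print([u.hex() for u in utf32_units])
# UTF-16: the clef needs a surrogate pair, so unit count != character count.
print([u.hex() for u in utf16_units])
# With UTF-32, character 2 sits at byte offset 8, no scanning required.
print(utf32_char_at(sample.encode("utf-32-be"), 2))
```

In UTF-32 the n-th character always starts at byte 4*n, which is the
"efficient string indexing" argument above. In UTF-16 the same lookup has to
watch for surrogate pairs, at the gain of two bytes per character for most
text.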