On Fri, Jan 18, 2002 at 11:40:17PM +0000, Nicholas Clark wrote:
> On Fri, Jan 18, 2002 at 05:24:00PM +0200, Jarkko Hietaniemi wrote:
> 
> > > As for character encodings, we're forcing everything to UTF-32 in
> > > regular expressions.  No exceptions.  If you use a string in a regex,
> > > it'll be transcoded.  I honestly can't think of a better way to
> > > guarantee efficient string indexing.
> > 
> > I'm fine with that.  The bloat is of course a shame, but as long as
> > that's not a real problem for someone, let's not worry about it too
> > much.
> 
> Forcing everything to UTF-32 in the API?

I think Brent meant UTF-32 internally for the regexen.  When you say
/a/, Parrot sees 0x00 0x00 0x00 0x41.

> To me it seems that making UTF-32 do everything correctly which the real
> world can use while encoding optimised versions are written is better than
> having a snazzy 4 encoding autoswitcher that is wrong and therefore not
> releasable to the world.

Now, now.

But yes, maybe selecting *one* first (and getting its implementation
right) would be good, and in that case it's either UTF-16 (which is
reasonably compact, but variable length), or UTF-32 (which is a bit
wasteful, but fixed length, and therefore easy to think in).
So I guess UTF-32 wins.
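
To make the trade-off concrete, here is a rough sketch (Python, with
hypothetical helper names utf32_index and utf16_index, and big-endian
buffers assumed) of fetching the i-th code point from each encoding.
It is only an illustration of fixed versus variable length, not a
proposal for how Parrot should do it.

```python
def utf32_index(buf: bytes, i: int) -> str:
    # Fixed width: code point i lives at bytes [4*i, 4*i + 4).
    return buf[4 * i : 4 * i + 4].decode("utf-32-be")

def utf16_index(buf: bytes, i: int) -> str:
    # Variable width: walk from the start, stepping 2 bytes for a BMP
    # character and 4 bytes for a surrogate pair.
    def width_at(pos: int) -> int:
        unit = int.from_bytes(buf[pos : pos + 2], "big")
        return 4 if 0xD800 <= unit <= 0xDBFF else 2

    pos = 0
    for _ in range(i):
        pos += width_at(pos)
    return buf[pos : pos + width_at(pos)].decode("utf-16-be")

# 'a', MUSICAL SYMBOL G CLEF (outside the BMP), 'b'
text = "a\U0001D11Eb"
u32 = text.encode("utf-32-be")   # 12 bytes, 4 per code point
u16 = text.encode("utf-16-be")   # 8 bytes: 2 + 4 + 2

assert utf32_index(u32, 2) == "b"   # one slice, no scanning
assert utf16_index(u16, 2) == "b"   # had to step over the surrogate pair
```

The UTF-16 buffer is smaller, but indexing it costs a scan from the
start (or some side table), whereas the UTF-32 lookup is plain
arithmetic.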

> But I don't know about how the internals of all these things work, so I
> may well be wrong on any technical detail.

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
