On Sat, Jan 05, 2008 at 12:19:14PM -0600, Patrick R. Michaud wrote:
> On Sat, Jan 05, 2008 at 11:09:57AM +0000, Nicholas Clark wrote:
> > On Sat, Jan 05, 2008 at 02:11:35AM -0800, chromatic wrote:

> > Jarkko's view was that if he were doing Perl 5 Unicode again he would opt
> > for fixed width 32 bit rather than UTF-8, because a lot of algorithms,
> > particularly in regexps, assume linear random access.
> 
> Based on what I now realize about working with utf8 strings
> (and how that affects PGE), I would wholeheartedly agree.
> 
> > Space wise, a better compromise, at only slightly more complexity
> > (vtables for accessors feel natural for this) is to go for fixed width,
> > smallest that will hold the largest Unicode code point in the string,
> > 7 bit, 8 bit, 16 bit and 32 bit.
> 
> I think we could probably omit the 7 bit version.  Sometimes
> detecting the largest Unicode point in a string is a bit tricky --
> promoting a string to a larger representation is no problem, but
> figuring out when it's safe to demote a string to a smaller
> representation may be a bit tricky.

I believe that the reason for supporting the 7 bit version is that it makes
processing "binary" data easier. You default to all data coming in being
7 bit, and throw exceptions if anyone tries to upper case/lower case/title
case anything that is a code point in the range 128-255.

Whereas if you have only 8/16/32 to choose from, you either have to actually
scan your data on input to verify that it is true 7 bit US-ASCII, or insist
that every input stream is flagged with a character set that defines
behaviour on code points 128-255.
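
As a rough illustration of the 7 bit default (a sketch only, not how Parrot
or Perl 5 does it): the hypothetical upper_octets below treats unflagged
data as 7 bit, defers the check until a case operation is actually
requested, and raises on any octet in the range 128-255. The names
upper_octets and HighBitCaseError, and the charset parameter, are made up
for this example.

```python
from typing import Optional


class HighBitCaseError(Exception):
    """Raised when case mapping is requested on unflagged octets >= 128."""


def upper_octets(octets: bytes, charset: Optional[str] = None) -> bytes:
    """Upper-case octets, refusing to guess what 128-255 mean."""
    if charset is not None:
        # The stream was flagged, so behaviour on 128-255 is defined by
        # the named character set (here via Python's codec machinery).
        return octets.decode(charset).upper().encode(charset)
    for offset, octet in enumerate(octets):
        if octet >= 128:
            raise HighBitCaseError(
                "octet 0x%02X at offset %d has no defined case" % (octet, offset)
            )
    # Pure 7 bit data: bytes.upper() only touches ASCII a-z.
    return octets.upper()
```

With this, upper_octets(b"abc") gives b"ABC", upper_octets(b"caf\xe9")
raises, and upper_octets(b"caf\xe9", "latin-1") gives b"CAF\xc9". Binary
data that is never case-mapped is never scanned.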

> The other tricky part to this may be that even though we may use a
> fixed-width encoding internally, input and output will still often
> want to use or assume utf8 encoding.  So, we'd need to decide where
> the translations belong -- in Parrot, in the tools, or in the HLL.

It's not insane to have the boundary near the outside of parrot.
One of the bugs of Perl 5 currently is that if you tell it you're passing it
UTF-8, it believes you. It doesn't check for well-formedness, and malformed
UTF-8 is a security risk. So having the internals entirely fixed width,
and validating with conversion as a side effect on the way in, might not
be as inefficient as it first seems, particularly given the hassle of
dealing with CERT advisories.
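
To make "validating with conversion as a side effect" concrete, here is a
minimal sketch in Python. It leans on Python's own strict UTF-8 decoder for
the well-formedness check (it rejects overlong forms and encoded
surrogates), then stores every code point at the smallest of 1, 2 or 4
bytes that fits the largest one. The function names and the little-endian
layout are assumptions for the example, not anything Parrot defines.

```python
def utf8_to_fixed_width(octets: bytes):
    """Validate UTF-8 and return (bytes_per_char, fixed_width_buffer)."""
    # errors="strict" raises UnicodeDecodeError on malformed input, so
    # nothing unvalidated ever reaches the internal representation.
    text = octets.decode("utf-8", errors="strict")
    largest = max(map(ord, text), default=0)
    if largest < 0x100:
        width = 1
    elif largest < 0x10000:
        width = 2
    else:
        width = 4
    buffer = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, buffer


def code_point_at(width: int, buffer: bytes, index: int) -> int:
    """Constant-time access to the index-th character."""
    start = index * width
    return int.from_bytes(buffer[start:start + width], "little")
```

The input is walked once for validation, and the fixed width copy falls out
of the same pass; after that, indexing is simple arithmetic.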

You could also treat your UTF-8 data as "7 bit" until proven otherwise.
That is, the first time it needs anything regexp-like done to it, it's
validated and transformed to fixed width 8 or 16 or 32.
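
The lazy variant might look something like the following sketch. LazyString
and its methods are hypothetical names, and a Python list stands in for the
real fixed width storage. The point is only that construction is free, and
that validation and width selection happen on the first character-level
access.

```python
class LazyString:
    """Hypothetical string that stays as unvalidated octets until needed."""

    def __init__(self, octets: bytes):
        self._octets = octets      # unvalidated input, assumed to be UTF-8
        self._code_points = None   # filled in on first character access
        self.width_bits = None     # None means "not yet proven"

    def _promote(self):
        if self._code_points is not None:
            return
        # Strict decode: malformed UTF-8 raises UnicodeDecodeError here.
        text = self._octets.decode("utf-8", errors="strict")
        self._code_points = [ord(ch) for ch in text]
        largest = max(self._code_points, default=0)
        if largest < 0x100:
            self.width_bits = 8
        elif largest < 0x10000:
            self.width_bits = 16
        else:
            self.width_bits = 32

    def byte_length(self) -> int:
        # Octet-level operations never trigger validation.
        return len(self._octets)

    def code_point_at(self, index: int) -> int:
        # Anything regexp-like needs characters, so promote first.
        self._promote()
        return self._code_points[index]
```

The cost is that a malformed string is only noticed when something first
looks inside it, rather than at the point of input.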

> (Right now I suspect the tools or HLL will have to be responsible
> for these choices, leaving Parrot to "pure" internal representations
> of whatever encoding is being advertised.)
> 
> Thanks for the information, it's a big help in guiding us forward.

At one time Dan was considering making the internals of parrot character
set agnostic (i.e. not just Unicode in its various encodings) but also fixed
width. To do this, he was intending to store variable width encodings
(such as shift-JIS) with each character's variable length octet sequence
serialised in its own 16 or 32 bit integer (the smallest that worked).
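
A sketch of that packing idea, as I understand it from the description
above and not from any actual Parrot code: each character's octets are
packed big-endian into one 16 bit unit, with no conversion to Unicode. The
lead byte ranges used (0x81-0x9F and 0xE0-0xFC) are an assumption about the
Shift-JIS variant in use, trail bytes are not validated, and pack_units,
unpack_units and is_lead_byte are names invented for the example.

```python
from array import array


def is_lead_byte(octet: int) -> bool:
    # Assumed lead-byte ranges for two-byte Shift-JIS characters.
    return 0x81 <= octet <= 0x9F or 0xE0 <= octet <= 0xFC


def pack_units(octets: bytes) -> array:
    """Store each character's octet sequence in its own 16 bit integer."""
    units = array("H")
    i = 0
    while i < len(octets):
        if is_lead_byte(octets[i]):
            if i + 1 >= len(octets):
                raise ValueError("truncated two-byte sequence at offset %d" % i)
            units.append((octets[i] << 8) | octets[i + 1])
            i += 2
        else:
            units.append(octets[i])
            i += 1
    return units


def unpack_units(units) -> bytes:
    """Recover the original octet stream from the packed units."""
    out = bytearray()
    for unit in units:
        if unit > 0xFF:
            out += unit.to_bytes(2, "big")  # two-byte character
        else:
            out.append(unit)                # single-byte character
    return bytes(out)
```

For example, pack_units(b"a\x93\xfa\x96\x7b") yields three units (0x61,
0x93FA, 0x967B), so the n-th character is the n-th array element, and
unpack_units gives back the original five octets unchanged.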

Sorry if the description is unclear, but the full plan as was isn't as crazy
as it sounds, if one doesn't want to "just" convert everything to Unicode
on the way in.

Nicholas Clark
