At 11:43 AM 9/20/2001 -0700, Paul Prescod wrote:
>Dan Sugalski wrote:
> >
> >...
> >
> > Make sense? Parrot's set up such that the libraries to handle a particular
> > kind of data (EBCDIC, Unicode, Shift-JIS, Big5/traditional, Finnish ASCII)
> > will be dynamically loadable so we can add them after the fact and you
> > don't have to pay the memory price.
>
>I'd suggest that you document character set and encoding totally
>separately. Character set is something that is visible to the Perl
>programmer. Encoding is *only* an implementation issue that should be
>invisible to the programmer doing ordinary string manipulations.
The only time you need to deal with the actual encoding of a string is when
doing I/O (generally specified as an attribute on the filehandle) or when
someone feels the need to hit the bits directly. The latter should be
rather less common than the former, I hope, but still doable.
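
To make that concrete, here is a rough sketch in Python (not Parrot's actual API; the sample text and variable names are made up for illustration). The encoding is declared once on the handle, ordinary string operations never mention it, and getting at the raw bytes is a separate, explicit step.

```python
import io

# Raw bytes as they might sit on disk: "café" encoded as Latin-1 (4 bytes).
raw = "café".encode("latin-1")

# The encoding is an attribute of the handle; reads yield ordinary strings.
handle = io.TextIOWrapper(io.BytesIO(raw), encoding="latin-1")
text = handle.read()

# Ordinary string manipulation never refers to the encoding.
shouted = text.upper()

# Hitting the bits directly is still doable, but it is an explicit request.
utf8_bytes = shouted.encode("utf-8")
```
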
>I think that the extra complexity of dealing with multiple character
>sets has more cost than benefit. What will chr(10203) return?
The default character set's chr(10203). In which case it's no different
than chr(65), which isn't an A on EBCDIC platforms... :)
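
A small illustration of the chr(65) point, sketched in Python and assuming code page 037 as a representative EBCDIC variant (the thread does not name a specific one). The same number maps to different characters depending on the character set in force.

```python
code = 65

# Under ASCII, code 65 is the letter A.
ascii_char = bytes([code]).decode("ascii")

# Under EBCDIC (code page 037 assumed here), code 65 is not A.
ebcdic_char = bytes([code]).decode("cp037")

# In that EBCDIC variant, A lives at a different code entirely.
ebcdic_code_for_A = "A".encode("cp037")[0]

print(repr(ascii_char), repr(ebcdic_char), hex(ebcdic_code_for_A))
```
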
> If I do a
>grep for chr(10203) am I looking for the 10203'th character in the
>character set of the data or the character that is logically the same as
>the 10203'th character of the default character set on my platform? Or
>the 10203'th character in Unicode?
Depends. At the lowest levels (like for people writing Parrot assembly),
it'll be looking for the character whose code-point is 10203 regardless of
the character set. At higher levels you probably won't be looking for 10203
but instead doing a regex against chr(10203) or whatever character the set
in question uses there. (It's a bogus Unicode character today, FWIW, but
there's no telling about tomorrow...)
> > We will, FWIW, transcode to Unicode in those cases where we have to deal
> > with data in multiple encodings and shouldn't just throw an error. While
> > LCDs are bad, they're better than nothing...
>
>In the paragraph above I would not use the word transcode. I would say
>"strings conforming to multiple character sets are combined according to
>Unicode semantics."
Fair enough, and we probably will for user-level docs. We're a step or two
below that at the moment.
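
As a sketch of what falling back to Unicode looks like when data in two different encodings has to be combined (Python used for illustration; the sample strings and encodings are assumptions, not anything Parrot prescribes):

```python
# Two pieces of data arriving in different encodings.
sjis_bytes = "日本語".encode("shift_jis")
latin1_bytes = " café".encode("latin-1")

# Neither encoding can represent the other's characters, so both sides are
# brought to Unicode first and only then combined.
combined = sjis_bytes.decode("shift_jis") + latin1_bytes.decode("latin-1")

print(combined)
```

Unicode acts as the lowest common denominator here: the result is a single string whose characters can be compared and concatenated under one set of rules.
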
>if you use variable width encodings ... then I think you're in
>for a world of ... pain.
I snipped all the irrelevant bits of that paragraph. :)
We're only going to do variable width for I/O, and only if the source or
destination are in a variable width format. The internal bits that need to
care will work on fixed-width representations.
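
A minimal sketch of that split, in Python and with hypothetical helper names (read_fixed_width and write_variable_width are invented for this example, not Parrot functions). Variable-width UTF-8 is handled only at the I/O boundary; everything in between works on a fixed-width buffer of code points.

```python
from array import array


def read_fixed_width(utf8_bytes: bytes) -> array:
    """Decode variable-width UTF-8 input into one fixed-size slot per character."""
    return array("L", (ord(ch) for ch in utf8_bytes.decode("utf-8")))


def write_variable_width(code_points: array) -> bytes:
    """Encode a fixed-width code point buffer back to UTF-8 for output."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


# On the wire these 6 characters take 9 bytes: 1 + 1 + 2 + 1 + 1 + 3.
wire = "naïve→".encode("utf-8")

buf = read_fixed_width(wire)

# Internally, the Nth character is simply the Nth slot; no scanning needed.
third = buf[2]

# The variable-width form only reappears when the data leaves again.
out = write_variable_width(buf)
```
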
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk