At 11:43 AM 9/20/2001 -0700, Paul Prescod wrote:
>Dan Sugalski wrote:
> >
> >...
> >
> > Make sense? Parrot's set up such that the libraries to handle a particular
> > kind of data (EBCDIC, Unicode, Shift-JIS, Big5/traditional, Finnish ASCII)
> > will be dynamically loadable so we can add them after the fact and you
> > don't have to pay the memory price.
>
>I'd suggest that you document character set and encoding totally
>separately. Character set is something that is visible to the Perl
>programmer. Encoding is *only* an implementation issue that should be
>invisible to the programmer doing ordinary string manipulations.

The only time you need to deal with the actual encoding of a string is when 
doing I/O (generally specified as an attribute on the filehandle) or when 
someone feels the need to hit the bits directly. The latter should be 
rather less common than the former, I hope, but still doable.
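
A minimal sketch of the "encoding is an attribute on the filehandle" idea, written in Python purely for illustration (this is not Parrot's I/O layer). The helper name read_text and the in-memory stream standing in for a file are assumptions made for the example:

```python
import io

def read_text(raw: bytes, encoding: str) -> str:
    """Read bytes through a filehandle that carries the encoding."""
    # The encoding is named once, on the handle. Everything read from it
    # afterwards is ordinary characters with no encoding attached.
    fh = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding)
    return fh.read()

# Hypothetical on-disk data: two kanji stored as Shift-JIS bytes.
on_disk = "\u65e5\u672c".encode("shift_jis")
text = read_text(on_disk, "shift_jis")

# String manipulation after this point never mentions Shift-JIS.
print(len(on_disk), len(text))
```

Only the code that opens the handle needs to know the encoding. Anyone who wants to hit the bits directly can still work with on_disk.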

>I think that the extra complexity of dealing with multiple character
>sets has more cost than benefit. What will chr(10203) return?

The default character set's chr(10203). In which case it's no different 
than chr(65), which isn't an A on EBCDIC platforms... :)
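
To make the chr(65) point concrete, here is a small Python sketch. It assumes Python's bundled "cp037" codec as a stand-in for an EBCDIC platform, and the helper char_at is hypothetical, not anything Parrot provides:

```python
def char_at(code: int, charset: str) -> str:
    """Return the character a single-byte character set assigns to `code`."""
    return bytes([code]).decode(charset)

ascii_char = char_at(65, "ascii")     # 'A' in ASCII
ebcdic_char = char_at(65, "cp037")    # some other character in this EBCDIC code page
ebcdic_code_for_A = "A".encode("cp037")[0]  # where 'A' actually lives in cp037

print(repr(ascii_char), repr(ebcdic_char), ebcdic_code_for_A)
```

The same number names different characters depending on which character set is the default, which is all the chr(10203) answer is saying.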

>  If I do a
>grep for chr(10203) am I looking for the 10203'th character in the
>character set of the data or the character that is logically the same as
>the 10203'th character of the default character set on my platform? Or
>the 10203'th character in Unicode?

Depends. At the lowest levels (like for people writing Parrot assembly), 
it'll be looking for the character whose code point is 10203 regardless of
the character set. At higher levels you probably won't be looking for 10203 
but instead doing a regex against chr(10203) or whatever character the set 
in question uses there. (It's a bogus Unicode character today, FWIW, but 
there's no telling about tomorrow...)

> > We will, FWIW, transcode to Unicode in those cases where we have to deal
> > with data in multiple encodings and shouldn't just throw an error. While
> > LCDs are bad, they're better than nothing...
>
>In the paragraph above I would not use the word transcode. I would say
>"strings conforming to multiple character sets are combined according to
>Unicode semantics."

Fair enough, and we probably will for user-level docs. We're a step or two 
below that at the moment.

>if you use variable width encodings ... then I think you're in
>for a world of ... pain.

I snipped all the irrelevant bits of that paragraph. :)

We're only going to do variable width for I/O, and only if the source or
destination is in a variable width format. The internal bits that need to
care will work on fixed-width representations.
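
A rough sketch of that split, again in Python and only as an illustration: variable-width bytes exist at the I/O boundary, and the internal buffer holds one fixed-size slot per code point. The function names and the use of array("L") as the fixed-width store are assumptions for the example, not Parrot's actual string representation:

```python
from array import array

def read_fixed_width(raw: bytes, encoding: str = "utf-8") -> array:
    """Decode variable-width input into one fixed-size slot per code point."""
    return array("L", (ord(ch) for ch in raw.decode(encoding)))

def write_variable_width(codepoints: array, encoding: str = "utf-8") -> bytes:
    """Encode the fixed-width buffer back to a variable-width format for output."""
    return "".join(chr(cp) for cp in codepoints).encode(encoding)

# Input mixes 1-, 2- and 3-byte UTF-8 sequences.
raw = "na\u00efve \u65e5\u672c".encode("utf-8")
buf = read_fixed_width(raw)

# Indexing the internal buffer is a plain array lookup, with no scanning
# for sequence boundaries.
third = buf[2]
round_trip = write_variable_width(buf)
print(len(raw), len(buf), hex(third))
```

Once the data is in the fixed-width buffer, finding the Nth character costs the same no matter what came before it, which is the pain the variable-width warning is about.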


                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
