At 11:18 AM 12/15/00 -0600, Jarkko Hietaniemi wrote:
>On Fri, Dec 15, 2000 at 12:13:01PM +0000, Simon Cozens wrote:
> > IMHO, the first thing we need to design and code is the API and runtime
> > library, since everything else builds on top of that, and we can design
> > other stuff in parallel with coding it. (A lot of it will be grunt work.)
> >
> > So, before we start even thinking about what we need, it's time to look
> > at the vexed question of string representation. How do we do Unicode
> > without getting into the horrendous non-Latin1 cockups we're seeing on
> > p5p right now? Larry
>
>As painful as it may sound (codingwise) I would urge to spare some
>thought to using (internally) UTF-32 for those encodings for which
>UTF-8 would be *longer* than the UTF-32 (mainly the Asian scripts).

If we can manage it, I'd prefer not to have a preferred internal
representation and Do The Right Thing in a general way. (Though I know that
we may have to go more specific for speed.)

I can see us having good reason to handle at least:

Binary
UTF-8 (and yes, I know ASCII is a proper subset of UTF-8; latin-1 is not,
since its high-bit characters take two bytes in UTF-8)
EBCDIC
UTF-32
Shift-JIS

as text. How to generalize the regex engine (which strikes me as the most 
likely piece of perl to care deeply about representation) to handle all the 
types is an interesting question. I'm currently trying to figure out a way 
to generalize things, and it's mostly there, but I'm really worried about 
speed issues because of it.
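
To make the "generalize it" idea concrete, here is a minimal sketch in Python (not Perl, and not the design I'm actually working on). It assumes every representation can be turned into a stream of abstract code points, and that matching then works on those. The names CODECS, code_points, and find are hypothetical, and cp037 is just one stand-in for the EBCDIC family of code pages.

```python
from typing import Iterator

# Hypothetical mapping from the representations listed above to Python
# codec names. "ebcdic" -> cp037 is an assumption: EBCDIC is a family.
CODECS = {
    "utf-8": "utf-8",
    "ebcdic": "cp037",
    "utf-32": "utf-32-be",
    "shift-jis": "shift_jis",
}


def code_points(raw: bytes, representation: str) -> Iterator[int]:
    """Yield abstract code points no matter how raw is stored."""
    if representation == "binary":
        # Binary data has no character semantics: one byte, one unit.
        yield from raw
        return
    for ch in raw.decode(CODECS[representation]):
        yield ord(ch)


def find(haystack: bytes, hay_rep: str, needle: bytes, needle_rep: str) -> int:
    """Return the code-point index of needle in haystack, or -1."""
    h = list(code_points(haystack, hay_rep))
    n = list(code_points(needle, needle_rep))
    for i in range(len(h) - len(n) + 1):
        if h[i:i + len(n)] == n:
            return i
    return -1


if __name__ == "__main__":
    hay = "日本語 perl".encode("shift_jis")
    needle = "perl".encode("cp037")
    print(find(hay, "shift-jis", needle, "ebcdic"))
```

The matcher never looks at the storage format, which is the general part. The cost is also visible: every search decodes both operands in full first, which is exactly the kind of overhead behind the speed worry.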

Worst case, handling bytes and UTF-32 should get us by (variable-length
encodings are a *pain*...), though we'd be well-served to handle more natively.
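
A rough Python sketch of why variable-length encodings hurt, assuming well-formed input. The helper names (encoded_sizes, utf8_len, utf8_char_at, utf32_char_at) are made up for illustration. As far as I know, UTF-8 as currently defined never takes more than four bytes per code point, so the argument for UTF-32 here is constant-time indexing rather than size.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the same text in a few of the candidate encodings."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-32-be", "shift_jis")}


def utf8_len(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_char_at(raw: bytes, index: int) -> str:
    """Variable-length: walk every earlier character to find this one."""
    pos = 0
    for _ in range(index):
        pos += utf8_len(raw[pos])
    return raw[pos:pos + utf8_len(raw[pos])].decode("utf-8")


def utf32_char_at(raw: bytes, index: int) -> str:
    """Fixed-width: the offset is plain arithmetic."""
    return raw[4 * index:4 * index + 4].decode("utf-32-be")


if __name__ == "__main__":
    text = "a日本語"
    print(encoded_sizes(text))
    print(utf8_char_at(text.encode("utf-8"), 2))
    print(utf32_char_at(text.encode("utf-32-be"), 2))
```

Indexing into the UTF-8 buffer is a scan from the front, while the UTF-32 lookup is a multiply and a slice. The price of UTF-32 is memory: every character costs four bytes, ASCII included.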

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
