At 11:18 AM 12/15/00 -0600, Jarkko Hietaniemi wrote:
>On Fri, Dec 15, 2000 at 12:13:01PM +0000, Simon Cozens wrote:
> > IMHO, the first thing we need to design and code is the API and runtime
> > library, since everything else builds on top of that, and we can design
> > other stuff in parallel with coding it. (A lot of it will be grunt work.)
> >
> > So, before we start even thinking about what we need, it's time to look
> > at the vexed question of string representation. How do we do Unicode
> > without getting into the horrendous non-Latin1 cockups we're seeing on
> > p5p right now? Larry
>
>As painful as it may sound (codingwise) I would urge to spare some
>thought to using (internally) UTF-32 for those encodings for which
>UTF-8 would be *longer* than the UTF-32 (mainly the Asian scripts).
If we can manage it, I'd prefer not to have a preferred internal
representation and to Do The Right Thing in a general way. (Though I know
that we may have to get more specific for speed.)
I can see us having good reason to handle at least:

   Binary
   UTF-8 (and yes, I know ASCII is a proper subset of UTF-8)
   EBCDIC
   UTF-32
   Shift-JIS

as text. How to generalize the regex engine (which strikes me as the most
likely piece of perl to care deeply about representation) to handle all the
types is an interesting question. I'm currently trying to figure out a way
to generalize things, and it's mostly there, but I'm really worried about
speed issues because of it.
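To make the "no preferred representation" idea a bit more concrete, here is a rough sketch in Python rather than anything resembling real perl internals. The TaggedString class, its field names, and the "binary" tag are all made up for illustration; the only real things used are Python's own codec names (utf-8, utf-32-le, shift_jis, cp037 for one flavor of EBCDIC). Each string carries its encoding along with its bytes, and character-level operations go through that tag instead of assuming one internal format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedString:
    """Hypothetical string value: raw bytes plus a tag naming their encoding."""

    data: bytes
    encoding: str  # a Python codec name, or "binary" for uninterpreted bytes

    def char_length(self) -> int:
        # Binary data has no characters, so its length is its byte count.
        if self.encoding == "binary":
            return len(self.data)
        # Everything else is measured in characters, whatever the byte width.
        return len(self.data.decode(self.encoding))

    def transcode(self, target: str) -> "TaggedString":
        # Converting to or from raw bytes is refused rather than guessed at.
        if self.encoding == "binary" or target == "binary":
            raise ValueError("binary data has no character semantics")
        text = self.data.decode(self.encoding)
        return TaggedString(text.encode(target), target)


sample = "日本語 abc"  # 3 kanji, a space, 3 ASCII letters
as_utf8 = TaggedString(sample.encode("utf-8"), "utf-8")
as_sjis = as_utf8.transcode("shift_jis")
as_utf32 = as_utf8.transcode("utf-32-le")
as_ebcdic = TaggedString("abc".encode("cp037"), "cp037")
as_binary = TaggedString(b"\x00\xff\x10", "binary")

for s in (as_utf8, as_sjis, as_utf32, as_ebcdic, as_binary):
    print(s.encoding, "bytes:", len(s.data), "chars:", s.char_length())
```

The decode-on-every-call approach here is deliberately naive, and it is exactly the kind of thing that raises the speed worry: every character-level operation pays for the generality.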
Worst case, handling bytes and UTF-32 should get us by (variable-length
encodings are a *pain*...), though we'd be well served to handle more natively.
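For anyone wondering where the pain comes from, a small Python sketch (function names are mine, purely illustrative, and this assumes well-formed input with no BOM): fetching character N from UTF-32 is plain offset arithmetic, while in UTF-8 you have to walk the buffer from the start, reading each lead byte to learn how wide that character is.

```python
def utf32_char_at(buf: bytes, index: int) -> str:
    """Fixed width: character `index` lives at byte offset index * 4."""
    start = index * 4
    if index < 0 or start + 4 > len(buf):
        raise IndexError(index)
    return buf[start:start + 4].decode("utf-32-le")


def utf8_char_at(buf: bytes, index: int) -> str:
    """Variable width: scan from the start, sizing each character by its lead byte."""
    if index < 0:
        raise IndexError(index)
    pos = 0
    seen = 0
    while pos < len(buf):
        lead = buf[pos]
        if lead < 0x80:
            width = 1          # 0xxxxxxx: ASCII
        elif lead >> 5 == 0b110:
            width = 2          # 110xxxxx
        elif lead >> 4 == 0b1110:
            width = 3          # 1110xxxx
        elif lead >> 3 == 0b11110:
            width = 4          # 11110xxx
        else:
            raise ValueError("unexpected continuation or invalid byte")
        if seen == index:
            return buf[pos:pos + width].decode("utf-8")
        pos += width
        seen += 1
    raise IndexError(index)


sample = "aé日😀z"  # characters of 1, 2, 3, 4 and 1 bytes in UTF-8
utf8_buf = sample.encode("utf-8")
utf32_buf = sample.encode("utf-32-le")

for i, expected in enumerate(sample):
    assert utf8_char_at(utf8_buf, i) == expected
    assert utf32_char_at(utf32_buf, i) == expected
```

A regex engine that backtracks or jumps around by character position hits the scanning cost constantly unless it caches offsets or works in bytes.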
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk