At 11:13 AM 12/16/00 -0600, Jarkko Hietaniemi wrote:
>On Fri, Dec 15, 2000 at 03:10:16PM -0500, Dan Sugalski wrote:
> > At 11:18 AM 12/15/00 -0600, Jarkko Hietaniemi wrote:
> > >On Fri, Dec 15, 2000 at 12:13:01PM +0000, Simon Cozens wrote:
> > > > IMHO, the first thing we need to design and code is the API and runtime
> > > > library, since everything else builds on top of that, and we can design
> > > > other stuff in parallel with coding it. (A lot of it will be grunt work.)
> > > >
> > > > So, before we start even thinking about what we need, it's time to look
> > > > at the vexed question of string representation. How do we do Unicode
> > > > without getting into the horrendous non-Latin1 cockups we're seeing on
> > > > p5p right now? Larry
> > >
> > >As painful as it may sound (codingwise) I would urge to spare some
> > >thought to using (internally) UTF-32 for those encodings for which
> > >UTF-8 would be *longer* than the UTF-32 (mainly the Asian scripts).
> >
> > If we can manage it, I'd prefer to not have a preferred internal
>
>I didn't mean 'preferred', I meant that if UTF-8 would be longer for
>some encodings, both for space *and* speed using straight honest UTF-32
>would make much more sense.

I'm thinking that, for speed, binary and UTF-32 should be our internal 
representations, at least for the data that gets handed to the regex 
engine. Or at least we use constant-width characters of 8 or 32 bits, 
if I'm misusing the term UTF-32. (UTF-8 is variable-width--is UTF-32?)
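
To make the width question concrete, here is a rough sketch (in Python, purely
for illustration; the helper names are made up for this example and are not
part of any proposed perl API). It assumes UTF-32 stored little-endian without
a byte-order mark, where every code point takes exactly four bytes, so
character N sits at byte offset 4*N. With UTF-8 the same lookup has to walk
the buffer from the start.

```python
# Hypothetical helpers for illustration only; the only library calls used
# are str.encode, bytes.decode, int.from_bytes and chr.

def encoded_sizes(text):
    """Return (utf8_bytes, utf32_bytes) for text, with no byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))

def utf32_char_at(buf, index):
    """Fixed width: character `index` starts at byte offset 4 * index."""
    unit = buf[4 * index:4 * index + 4]
    return chr(int.from_bytes(unit, "little"))

def utf8_char_at(buf, index):
    """Variable width: scan for lead bytes (anything not of the form 10xxxxxx)."""
    starts = [i for i, b in enumerate(buf) if (b & 0xC0) != 0x80]
    start = starts[index]
    end = starts[index + 1] if index + 1 < len(starts) else len(buf)
    return buf[start:end].decode("utf-8")

if __name__ == "__main__":
    for sample in ("abc", "日本語"):
        print(sample, encoded_sizes(sample))
```

The constant-time offset arithmetic in utf32_char_at is the property a regex
engine would be leaning on; the price is four bytes per character even for
plain ASCII text.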

> > representation and Do The Right Thing in a general way. (Though I know
> > that we may have to go more specific for speed)
> >
> > I can see us having good reason to handle at least:
> >
> > Binary
> > UTF-8 (and yes, I know latin-1, or ASCII, or something of the sort is a
> > proper subset of UTF-8)
> > EBCDIC
> > UTF-32
> > Shift-JIS
> >
> > as text. How to generalize the regex engine (which strikes me as the most
> > likely piece of perl to care deeply about representation) to handle all
> > the types is an interesting question. I'm currently trying to figure out a
> > way to generalize things, and it's mostly there, but I'm really worried
> > about speed issues because of it.
> >
> > Worst case, handling bytes and UTF-32 should get us by (variable-length
> > encodings are a *pain*...), though we'd be well-served to handle more
> > natively.
>
>EMPHATIC YES (after glaring for weeks at the regex/utf8 code).

What, after all that time you can still see? I'm impressed. :)

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
