At 11:13 AM 12/16/00 -0600, Jarkko Hietaniemi wrote:
>On Fri, Dec 15, 2000 at 03:10:16PM -0500, Dan Sugalski wrote:
> > At 11:18 AM 12/15/00 -0600, Jarkko Hietaniemi wrote:
> > >On Fri, Dec 15, 2000 at 12:13:01PM +0000, Simon Cozens wrote:
> > > > IMHO, the first thing we need to design and code is the API and runtime
> > > > library, since everything else builds on top of that, and we can design
> > > > other stuff in parallel with coding it. (A lot of it will be grunt work.)
> > > >
> > > > So, before we start even thinking about what we need, it's time to look
> > > > at the vexed question of string representation. How do we do Unicode
> > > > without getting into the horrendous non-Latin1 cockups we're seeing on
> > > > p5p right now? Larry
> > >
> > >As painful as it may sound (codingwise) I would urge to spare some
> > >thought to using (internally) UTF-32 for those encodings for which
> > >UTF-8 would be *longer* than the UTF-32 (mainly the Asian scripts).
> >
> > If we can manage it, I'd prefer to not have a preferred internal
>
>I didn't mean 'preferred', I meant that if UTF-8 would be longer for
>some encodings, both for space *and* speed using straight honest UTF-32
>would make much more sense.

I'm thinking for speed that binary and UTF-32 should be our internal
representations, at least for the data that gets handed to the regex
engine. Or at least we use a constant-width character that's 8 and 32 bits,
if I'm misusing UTF-32. (UTF-8 is variable-width--is UTF-32?)

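On the width question: UTF-32 is fixed-width, with every code point taking exactly four bytes, while UTF-8 takes between one and four. A minimal Python sketch of the difference follows. It is not from this thread, and the helper names encoded_sizes and utf32_char_at are made up for illustration.

```python
# Illustrative sketch only: compare UTF-8 and UTF-32 storage for a few strings.
# The helper names here are hypothetical, not part of any perl API.

samples = {
    "ASCII": "perl",
    "Latin-1 range": "caf\u00e9",          # 'e' with acute accent, one code point
    "Japanese": "\u65e5\u672c\u8a9e",      # three CJK code points
}


def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-32 bytes) for text."""
    utf8 = text.encode("utf-8")
    utf32 = text.encode("utf-32-be")  # explicit byte order, so no BOM is added
    return len(text), len(utf8), len(utf32)


def utf32_char_at(buf, index):
    """Fetch the index-th character from UTF-32BE bytes by direct offset."""
    return buf[4 * index:4 * index + 4].decode("utf-32-be")


for label, text in samples.items():
    n, u8, u32 = encoded_sizes(text)
    print(f"{label}: {n} code points, {u8} UTF-8 bytes, {u32} UTF-32 bytes")
```

With a fixed width, the Nth character sits at byte offset 4*N, which is the property a regex engine could exploit. In UTF-8 the same lookup means scanning from the start of the string, since earlier characters may occupy anywhere from one to four bytes.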
> > representation and Do The Right Thing in a general way. (Though I know
> > that we may have to go more specific for speed)
> >
> > I can see us having good reason to handle at least:
> >
> > Binary
> > UTF-8 (and yes, I know latin-1, or ASCII, or something of the sort is a
> > proper subset of UTF-8)
> > EBCDIC
> > UTF-32
> > Shift-JIS
> >
> > as text. How to generalize the regex engine (which strikes me as the most
> > likely piece of perl to care deeply about representation) to handle all
> > the types is an interesting question. I'm currently trying to figure out a
> > way to generalize things, and it's mostly there, but I'm really worried
> > about speed issues because of it.
> >
> > Worst case, handling bytes and UTF-32 should get us by (variable-length
> > encodings are a *pain*...), though we'd be well-served to handle more
> > natively.
>
>EMPHATIC YES (after glaring for weeks at the regex/utf8 code).
What, after all that time you can still see? I'm impressed. :)
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk