On Thu, 21 Aug 2003, Elizabeth Mattijsen wrote:
> At 14:15 +0100 8/21/03, Nicholas Clark wrote:
> >On Wed, Aug 20, 2003 at 07:19:42PM -0400, Benjamin Goldberg wrote:
> > > Leopold Toetsch wrote:
> > > > But these could be converted to utf32 as soon as they are seen.
> > > For a long string, that could be quite a bit of bloat.
> >Jarkko's view is that the combined hit of the size of the extra code to skip
> >along the variable length encoding, the time taken to execute that code,
> >(and I guess the cache misses it creates) is greater than the gain from
> >saving space.
>
> Indeed. I think available memory has increased more than fourfold
> since the first regexp engine that could only do 1-byte ASCII. So
> relatively, I don't think that bloat is an issue. Just don't do
> regexps on 256 MByte strings when your machine has less than 1 GByte
> RAM ;-)
FWIW, we're not going to do string ops on UTF-8 stuff. We'll understand
it, and know how to translate it to more useful forms, but it's just a
static storage format for us. (Mainly because, while working with UTF-8
strings is a massive pain, it's foolish to transform it to UTF-16 or UTF-32
if we don't need to.) Our Unicode operations will be done either on UTF-16
(if we get ICU going, since that's what it uses) or UTF-32. UTF-8 is a
legacy/storage format only as far as we're concerned.
The same thing goes for other variable-width encodings such as Shift-JIS,
FWIW.
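
To make the trade-off concrete, here is a minimal Python sketch (not Parrot
code; the helper names utf8_len, utf8_index and utf32_index are made up for
illustration). It assumes well-formed input and shows why indexing into
UTF-8 means walking the string, while UTF-32 is plain offset arithmetic:

```python
def utf8_len(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_index(buf: bytes, n: int) -> str:
    """n-th code point of UTF-8 data: must skip over n earlier sequences."""
    pos = 0
    for _ in range(n):
        pos += utf8_len(buf[pos])
    return buf[pos:pos + utf8_len(buf[pos])].decode("utf-8")


def utf32_index(buf: bytes, n: int) -> str:
    """n-th code point of UTF-32 (little-endian) data: a fixed 4-byte slot."""
    return buf[4 * n:4 * n + 4].decode("utf-32-le")


text = "naïve 日本語"
u8 = text.encode("utf-8")       # compact, variable width
u32 = text.encode("utf-32-le")  # larger, fixed width
```

The UTF-32 buffer is bigger, but every lookup costs the same; the UTF-8
lookup gets slower the further into the string you go.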
Dan