Dan Sugalski writes:
: > iii) Never assume bytes.
:
: What, never? Not even in vectors and bitmaps? :)
:
: I agree, though. Character and byte are separate constructs and need to be
: dealt with separately.

Not sure what you guys mean. A string is a sequence of integers.
A sequence of integers can have many useful representations, where
usefulness can be defined in any of several different ways.

Just to tweak everyone's brain again, we wouldn't necessarily have to
support *any* variable length encoding in the core. That would throw
utf8 and utf16 right out the window. Under this model, you just store
strings as integer arrays, and use the smallest size integers that will
hold all the characters of the string. You hit a character that won't
fit into your current representation, you just change the
representation on the fly from 8 to 16 to 32 bits.

That's one definition of useful. It combines the representation of
integer arrays and strings. It's optimized for substr(), but
deoptimized for I/O conversions. Choose your poison...
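
For concreteness, here is a minimal sketch of that widening scheme in Python. It is only an illustration, not a proposal for the core: the class name WideningString and its methods are made up for this example, and it assumes the array typecodes "B", "H" and "I" are 1, 2 and 4 bytes wide, which is typical but platform dependent.

```python
from array import array

# Unsigned typecodes, narrowest first (assumed 1, 2 and 4 bytes per item).
_TYPECODES = ("B", "H", "I")


def _fits(typecode, value):
    """True if value fits in one unsigned item of the given typecode."""
    return 0 <= value < (1 << (8 * array(typecode).itemsize))


class WideningString:
    """Hypothetical string stored as a fixed-width array of integers."""

    def __init__(self, text=""):
        self.units = array(_TYPECODES[0])
        for ch in text:
            self.append(ord(ch))

    def append(self, codepoint):
        # Widen the whole array only when the new character does not fit.
        if not _fits(self.units.typecode, codepoint):
            for typecode in _TYPECODES:
                if _fits(typecode, codepoint):
                    self.units = array(typecode, self.units.tolist())
                    break
            else:
                raise ValueError("code point too large for any representation")
        self.units.append(codepoint)

    def substr(self, start, length):
        # Fixed width means plain index arithmetic, no scanning.
        return "".join(chr(u) for u in self.units[start:start + length])

    @property
    def width_bits(self):
        return 8 * self.units.itemsize
```

A string built from "abc" stays in the narrowest representation; appending 0x263A forces one widening of the whole array, and appending 0x1F600 forces another. The substr() method never has to scan, which is the optimization described above, while anything headed for I/O would still need converting to an external encoding.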
: > > Perhaps the regex engine should always force UTF8 form ?
: >
: > I think we really want to store data internally in a common, Unicode format.
:
: Maybe we should just abstract it, though the more abstract it gets the
: slower the regex engine's likely to be, as it does prefer to rip through
: raw data buffers.

Again, a small perl wants one regex engine that follows vtbl pointers
character by character. (Or it might want to force a single representation
on all regex strings.) A large perl may want three or four different
regex engines tuned for each representation. Memory is getting cheaper,
after all.

The point being that we write an abstract generic regex engine that
can be instantiated either once for small perls or multiple times for
large perls.
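
A rough sketch of the "instantiate once or several times" idea, again in Python and again purely illustrative. The matcher here is a toy (literal characters plus "." as a wildcard), not a real regex engine, and every name in it (make_engine, PerlString, find8, find16) is hypothetical. The only thing that changes between instantiations is how one character is read.

```python
def make_engine(read_unit):
    """Instantiate the generic matcher for one way of reading characters."""
    def find(pattern, buf, length):
        # Naive search; '.' in the pattern matches any single character.
        for start in range(length - len(pattern) + 1):
            if all(p == "." or ord(p) == read_unit(buf, start + i)
                   for i, p in enumerate(pattern)):
                return start
        return -1
    return find


class PerlString:
    """Hypothetical string object carrying its own character accessor."""

    def __init__(self, buf, width):
        self.buf, self.width = buf, width
        self.length = len(buf) // width

    def unit(self, i):
        # Stand-in for a call through a vtbl pointer.
        chunk = self.buf[i * self.width:(i + 1) * self.width]
        return int.from_bytes(chunk, "little")


# Small perl: one engine, every character read goes through the object.
generic_find = make_engine(lambda s, i: s.unit(i))

# Large perl: one engine per representation, reading raw buffers directly.
find8 = make_engine(lambda buf, i: buf[i])
find16 = make_engine(lambda buf, i: buf[2 * i] | (buf[2 * i + 1] << 8))
```

With s = PerlString("hello".encode("utf-16-le"), 2), both generic_find("l.o", s, s.length) and find16("l.o", s.buf, s.length) locate the same match; the first pays for an indirect call per character, the second costs an extra copy of the engine.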
Larry