At 06:59 PM 6/5/2001 -0700, Larry Wall wrote:
>Dan Sugalski writes:
>: At 04:44 PM 6/5/2001 -0700, Larry Wall wrote:
>: >(Perl 5 extends it all the way to 64-bit values, represented in 13 bytes!)
>:
>: I know we can, but is it really a good idea? 32 bits is really stretching
>: it for character encoding, and 64 seems rather excessive.
>
>Such large values would not typically be used for standard characters, but
>as a means of embedding an inline chunk of non-character data, such as a
>pointer, or a set of metadata bits.

Ah. In that case, perhaps extended utf-8 processing isn't really the most 
appropriate way to go. If the intent is to do embedded binary bits in a 
text stream, maybe we should build input and output filters to do that instead.

>: And I
>: really, *really* want to do as little as possible internally with
>: variable-width encodings. Yech.
>
>Mmm, the difficulty of that is overrated.  Very seldom do you want to
>do anything other than find the next character, or the previous
>character, and those are pretty easy to do in utf8.

As Hong pointed out to me on more than one occasion. I'm not sure I buy 
that, and I have serious reservations about the speed of dealing with 
variable-length characters instead of fixed-length ones. (Though I still 
need to build a test suite to benchmark that.)
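
For reference, here is a rough sketch of what "find the next character" and "find the previous character" look like on a UTF-8 buffer. It is written in Python purely for illustration; the helper names (is_continuation, next_char, prev_char) are made up for this example, and it assumes the buffer is well-formed UTF-8 and that pos already sits on a character boundary.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which all look like 0b10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_char(buf: bytes, pos: int) -> int:
    """Return the offset of the character after the one starting at pos."""
    pos += 1
    while pos < len(buf) and is_continuation(buf[pos]):
        pos += 1
    return pos


def prev_char(buf: bytes, pos: int) -> int:
    """Return the offset of the character before the boundary at pos."""
    pos -= 1
    while pos > 0 and is_continuation(buf[pos]):
        pos -= 1
    return pos


# One 1-byte, one 2-byte, and one 3-byte character.
text = "a\u00e9\u20ac".encode("utf-8")

# Walk forward, collecting the byte offset where each character starts.
starts = []
pos = 0
while pos < len(text):
    starts.append(pos)
    pos = next_char(text, pos)
```

Stepping in either direction only needs a test on the top two bits of each byte. What this does not give you is cheap access to the Nth character, which means walking from one end of the buffer; that is the case a benchmark against fixed-width storage would need to cover.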

>: >They also arbitrarily define UTF-32 to not use higher values than
>: >0x10ffff, but that doesn't mean we're gonna send in the high-bit Nazis
>: >if people want higher values for their own purposes.
>:
>: Well, that'd be inappropriate since a good chunk of the rest of the set's
>: been dedicated to future expansion. I think it might be a reasonable idea
>: for -w to grumble if someone's used a character in the unassigned range,
>: though. (IIRC there's a piece set aside for folks to do whatever they want
>: with)
>
>Certainly, but it's easy to come up with reasons to want to stuff more
>bits inline than the private use areas will support.

Maybe. That trips my "way too clever" reflex, though, and makes me think 
that perhaps it's not the best way to go about that sort of thing. Rather 
than making non-text things look like text, maybe we'd be better off coming 
up with a better way to intermingle text and non-text things. It'd be more 
space-efficient as well, since utf-8 encoding random binary things will 
tend to expand them more than would seem necessary.
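
To put a rough number on the expansion, here is a small sketch in Python. It assumes a deliberately naive scheme in which each binary byte becomes one code point in the range 0..255 before UTF-8 encoding; that scheme and the utf8_embed name are only for illustration, not something proposed in this thread.

```python
def utf8_embed(data: bytes) -> bytes:
    """Map each byte to the code point of the same value, then UTF-8 encode."""
    return data.decode("latin-1").encode("utf-8")


def utf8_extract(encoded: bytes) -> bytes:
    """Inverse of utf8_embed."""
    return encoded.decode("utf-8").encode("latin-1")


# Every possible byte value once, standing in for evenly distributed binary data.
all_bytes = bytes(range(256))
encoded = utf8_embed(all_bytes)

# Bytes 0x00-0x7F stay one byte each; bytes 0x80-0xFF take two bytes each.
ratio = len(encoded) / len(all_bytes)
```

With evenly distributed input the ratio comes out to 1.5. Packing the bits into one large character does not obviously help: a 64-bit value in the 13-byte form mentioned earlier costs 13 bytes for 8 bytes of payload.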

>Rather than have
>-w grumble about such characters, I'd rather see an optional output
>discipline that enforces strict Unicode output.

Fair enough.

>On the other hand, maybe there's some use for a data structure that is
>a sequence of integers of various sizes, where the representation of
>different chunks of the array/string might be different sizes.  Would
>make some aspects of copy-on-write more efficient to be able to chunk
>strings and integer arrays.  And of course this would all be transparent
>at the language level, in the absence of explicit syntax to treat an
>array as a string or a string as an array.

I think that'd be a better solution than fibbing about what a piece of a 
data stream is.
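
A minimal sketch of what such a chunked structure might look like, in Python for illustration only. The ChunkedInts class and narrowest_typecode helper are hypothetical names, the values are assumed to be non-negative and to fit in 64 bits, and nothing here reflects an actual Perl design.

```python
from array import array

# Unsigned typecodes from narrowest to widest. Actual widths are
# platform-dependent, so they are read from itemsize rather than assumed.
TYPECODES = ("B", "H", "L", "Q")


def narrowest_typecode(values):
    """Pick the narrowest unsigned typecode that can hold every value."""
    top = max(values, default=0)
    for code in TYPECODES:
        if top < 1 << (8 * array(code).itemsize):
            return code
    raise OverflowError("value does not fit in 64 bits")


class ChunkedInts:
    """A sequence of integers stored as chunks of differing element width."""

    def __init__(self):
        self.chunks = []

    def append_chunk(self, values):
        # Each chunk gets its own representation, sized to its contents.
        values = list(values)
        self.chunks.append(array(narrowest_typecode(values), values))

    def __len__(self):
        return sum(len(chunk) for chunk in self.chunks)

    def __getitem__(self, index):
        if index < 0:
            raise IndexError("negative indexes are not supported in this sketch")
        # Callers see one flat sequence; the chunking stays hidden.
        for chunk in self.chunks:
            if index < len(chunk):
                return chunk[index]
            index -= len(chunk)
        raise IndexError("index out of range")


seq = ChunkedInts()
seq.append_chunk(b"plain ascii")        # fits in one byte per element
seq.append_chunk([0x263A, 0x10FFFF])    # needs more than 16 bits
seq.append_chunk([2**40])               # inline non-character data
```

The ASCII run stays at one byte per element while the wide values sit in their own chunk, and indexing still presents a single flat sequence. A copy-on-write scheme could then copy only the chunk being modified, though that part is not shown here.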

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
