Re: Usage of \[oxdb] (was Re: String Literals, take 2)

Larry Wall Thu, 05 Dec 2002 10:35:35 -0800

On Thu, Dec 05, 2002 at 09:18:21AM -0800, Michael Lazzaro wrote:
: 
: On Thursday, December 5, 2002, at 02:11  AM, James Mastros wrote:
: 
: >On 12/04/2002 3:21 PM, Larry Wall wrote:
: >>\x and \o are then just shortcuts.
: >Can we please also have \0 as a shortcut for \0x0?
: 
: \0 in addition to \x, meaning the same thing?  I think that would get 
: us back to where we were with octal, wouldn't it?  I'm not real keen on 
: leading zero meaning anything, personally...  :-P


\0 still means chr(0).  I don't think there's much conflict with
the new \0x, \0o, \0b, and \0d, since \0 almost always occurs at the
end of a string, if anywhere.

: >>There ain't no such thing as a "wide" character.  \xff is exactly
: >>the same character as \x[ff].
: >Which means that the only way to get a string with a literal 0xFF byte 
: >in it is with qq:u1[\xFF]? (Larry, I don't know that this has been 
: >mentioned before: is that right?)  chr:u1(0xFF) might do it too, but 
: >we're getting ahead of ourselves.
: 
: Hmm... does this matter?  I'm a bit rusty on my Unicode these days, but 
: I was assuming that \xFF and \x00FF always pointed to the same 
: character, and that you in fact _don't_ have the ability to put 
: individual bytes in a string, because Perl is deciding how to place the 
: characters for you (how long they should be, etc.)  So if you wanted 
: more explicit control, you'd use C<pack>.

A "byte" string is any string whose characters are all under 256.  It's
up to an interface to coerce this to actual bytes if it needs them.

We'll presumably have something like "use bytes" that turns off all
multi-byte processing, in which case you have to deal with any UTF that
comes in by hand.  But in general it'll be better if the interface coerces
to types like "str8", which is presumably pronouced "straight".

Don't ask me how str16 and str32 are pronounced.  (But generally you should
be using utf16 instead of str16 in any event, unless your interface truly
doesn't know how to deal with surrogates.)  In other words, str16 is
the name of the obsolescent UCS-2, and str32 is the name for UCS-4, which
is more or less the same as UTF-32, except that UTF-32 is not allowed to
use the bits above 0x10ffff.

So anyway, we've got all these types:

    str8        utf8
    str16       utf16
    str32       utf32

where the "str" version is essentially just a compact integer array.  One could
alias str8 to "latin1" since the default coercion from Unicode to str8 would
have those semantics.

It's not clear exactly what the bare "str" type is.  "Str" is obviously
the abstract string type, but "str" probably means the default C string
type for the current architecture/OS/locale/whatever.  In other words,
it might be str8, or it might be utf8.  Let's hope it's utf8, because
that will work forever, give or take an eon.

: >Also, an annoying corner case: is "\0x1ff" eq "\0x[1f]f", or is it eq 
: >"\0x[1ff]"?  What about other bases?  Is "\0x1x" eq "\0x[1]", or is it 
: >eq "\0x[1x]" (IE illegal).  (Now that I put those three questions 
: >together, the only reasonable answer seems to be that the number ends 
: >in the last place it's valid to end if you don't use explicit 
: >brackets.)
: 
: Yeah, my guess is that it's as you say... it goes till it can't goes no 
: more, but never gives an error (well, maybe for "\0xz", where there are 
: zero valid digits?)  But I would suspect that the bracketed form is 
: *strongly* recommended.  At least, that's what I plan on telling 
: people.  :-)

Sounds good to me.  Dwimming is wonderful, but so is dwissing.

Larry

Re: Usage of \[oxdb] (was Re: String Literals, take 2)

Reply via email to