Re: Character (or byte?) escapes under utf8 pragma

Michael Ludwig Wed, 10 Mar 2010 01:38:19 -0800

Hi Juerd,

On 08.03.2010 at 16:15, Juerd Waalboer wrote:


> Michael Ludwig wrote 2010-03-08 15:55 (+0100):
>> Okay. But unless I'm completely misled, you can tell whether a
>> string is supposed to contain characters (<- Encode::decode) or
>> bytes (<- Encode::encode)
> 
> The result of decode is a character string.
> 
> The result of encode is a byte string.

Thanks for confirming.
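
To make the decode/encode direction concrete, here is a minimal sketch
using the core Encode module. The variable names and the sample word
are only illustrative; the bytes are written as escapes so the result
does not depend on how this text is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Five bytes: the UTF-8 encoding of "Käse" (C3 A4 is the a-umlaut).
my $bytes = "K\xc3\xa4se";
my $n_in  = length $bytes;

# decode: bytes in, character string out (the a-umlaut is one character).
my $chars = decode('UTF-8', $bytes);

# encode: characters in, byte string out (the a-umlaut is two bytes again).
my $again = encode('UTF-8', $chars);

printf "bytes in: %d, characters: %d, bytes out: %d\n",
    $n_in, length $chars, length $again;
# prints "bytes in: 5, characters: 4, bytes out: 5"
```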

> However, apart from looking at the source code and deducing the
> intentions of the programmer, there is no way to tell whether a given
> string is meant as a character or byte string, simply because there is
> no technical representation of this intent in the string or its
> metadata.
> 
> Note that "characters" are the general case: a string is made of
> characters. When every character value fits in a single byte, the string
> can be used as a byte string.

And thanks for clarifying further.

> This bug forces us to look at the internal encoding and flags to come to
> the conclusion that it is indeed a bug. Don't mistake this as a sign
> that looking at the internal encoding or flags should ever happen in
> actual code. Even if you work around the bug, make sure that you don't
> make anything conditional on the current formatting of the string.
> 
> Instead, coerce it to whatever you need by using utf8::downgrade or
> utf8::upgrade. In your specific case, concatenation of two separate
> parts is probably the most sane thing to do.

Good.
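
A small sketch of what utf8::upgrade and utf8::downgrade do, assuming
a string whose characters all fit in one byte. The call to
utf8::is_utf8 is there purely to illustrate that only the internal
representation changes; as the quoted advice says, real code should
not branch on it.

```perl
use strict;
use warnings;

my $plain = "so\xa0ein";         # six characters, all below U+0100

my $up = $plain;
utf8::upgrade($up);              # switch to the internal UTF-8 representation

my $down = $up;
utf8::downgrade($down);          # switch back to one byte per character

# The string value is the same in all three variables.
print "equal\n" if $plain eq $up and $up eq $down;
printf "lengths: %d %d %d\n", length $plain, length $up, length $down;
# prints "equal" and then "lengths: 6 6 6"

# Illustration only: the internal flag differs, the string does not.
printf "flag after upgrade: %s, after downgrade: %s\n",
    (utf8::is_utf8($up) ? "on" : "off"), (utf8::is_utf8($down) ? "on" : "off");
# prints "flag after upgrade: on, after downgrade: off"
```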

>>>> Am I mistaken in my expectation that while "\xa0" should be
>>>> a byte, "\x{a0}" and "\x{00a0}" should be characters?
> 
> Yes. These three escapes are supposed to be exactly the same. They
> create a U+00A0 character, which happens to be perfectly usable as the
> A0 byte when used as such, in a string that doesn't contain any
> character greater than U+00FF.
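
A quick sketch to check that claim: the three spellings should compare
equal and each yield one character with the value 0xA0. The second
half appends U+263A (an arbitrary character above U+00FF, chosen only
for illustration) to show that the A0 character keeps its value there
as well.

```perl
use strict;
use warnings;

my @forms = ("\xa0", "\x{a0}", "\x{00a0}");

for my $s (@forms) {
    printf "length %d, ord 0x%02X\n", length $s, ord $s;
}
# prints "length 1, ord 0xA0" three times

print "all equal\n" if $forms[0] eq $forms[1] and $forms[1] eq $forms[2];
# prints "all equal"

# Joined with a character above U+00FF, it is still U+00A0.
my $mixed = $forms[0] . "\x{263A}";
printf "length %d, first ord 0x%02X\n", length $mixed, ord $mixed;
# prints "length 2, first ord 0xA0"
```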

Okay. Let me see if I have understood correctly. Without the utf8
pragma in scope, "so\xa0ein\xa0Käse", with the a-umlaut stored as a
sequence of two bytes in my source code, ends up as a string of 12
integers, because each of those two bytes becomes a character of its
own. With the utf8 pragma in scope, the two bytes are decoded into one
character, so the string has only 11 integers.
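
The count can be checked with a short sketch. It assumes the script
itself is saved as UTF-8, so that the a-umlaut is two bytes on disk,
and it keeps the \xa0 escapes and the literal in separate pieces that
are concatenated, as suggested above, so the result should not depend
on the bug under discussion.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.

# Without the pragma, each source byte of the literal is one character.
my $without = do { no utf8;  "so\xa0ein\xa0" . "Käse" };

# With the pragma, the two source bytes of the a-umlaut are one character.
my $with    = do { use utf8; "so\xa0ein\xa0" . "Käse" };

printf "without utf8: %d\n", length $without;   # prints "without utf8: 12"
printf "with utf8:    %d\n", length $with;      # prints "with utf8:    11"
```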

I know I shouldn't care about the internals, but sometimes grokking the
internals is helpful as an aide-mémoire, because it puts things into
perspective that otherwise seem more arbitrary.

-- 
Michael.Ludwig (#) XING.com
