Nicholas Clark wrote:
> On Sun, Apr 16, 2006 at 11:22:40AM -0700, Patrick R. Michaud wrote:
>>     $S1 = "He said, \xabHello\xbb"
>>     $S2 = "3 \u2212 4 = \u207b 1"
>>
>> are treated as ASCII strings even though they obviously contain
>> codepoints outside of the ASCII range.  (The first results in a 
>> 'malformed string' error when compiled, the second chops off the
>> high-order bits of the \u sequence.)
> 
> IIRC having ASCII as the default was a deliberate design choice to avoid 
> the confusion of "is it iso-8859-1 or is it utf-8" when encountering a
> string literal with bytes outside the range 0-127.

Aye, it used to auto-promote to latin1; Leo and I changed it to
ascii-by-default a while ago.

> If so, then I assume that the behaviour of your second example is wrong - it
> should also be a malformed string.
> 
> If PGE is always outputting UTF-8 literals, what stops it from always
> prefixing every literal "unicode:", even if it only uses Unicode characters
> 0 to 127?

Indeed, it would be much easier if unicode:"" on an ascii-only string
could automatically fall back to ascii for its representation, and use
utf8 (or better, latin1/ucs2) only if the string contains high-bit
characters.
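
To make that fallback concrete, here is a rough sketch in Python (not
Parrot code) of picking the narrowest representation for a literal. The
helper name narrowest_charset and the labels it returns are made up for
illustration and only mirror the names used in this thread; nothing here
is an existing Parrot API.

```python
def narrowest_charset(text):
    """Return a label for the narrowest charset that can hold `text`.

    Hypothetical helper: the labels are illustrative, not Parrot names.
    """
    # Highest code point in the literal; an empty literal counts as ASCII.
    highest = max(map(ord, text), default=0)
    if highest < 0x80:
        return "ascii"    # every code point fits in 7 bits
    if highest < 0x100:
        return "latin1"   # every code point fits in one byte
    if highest < 0x10000:
        return "ucs2"     # every code point is in the Basic Multilingual Plane
    return "utf8"         # assumed fallback for anything beyond the BMP


print(narrowest_charset("plain text"))
print(narrowest_charset("He said, \xabHello\xbb"))
print(narrowest_charset("3 \u2212 4 = \u207b 1"))
```

With a rule like this, a generator could prefix every literal with
unicode: and leave the choice of storage to the compiler.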

A Perlish way to solve this is to introduce another pragma, similar to
"n_operators", that controls the encoding of all string literals of the
PIR program:

    .pragma encoding utf8

Once written that way, you can simply use literal « » in the program,
which reads better than \xab and \xbb anyway... :-)
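
The encoding has to be declared because a literal « is a different byte
sequence depending on how the source file was saved, so the compiler
cannot reliably guess. A small Python illustration, used here only to
show the bytes (nothing in it is PIR-specific):

```python
left_quote = "\u00ab"  # the « character

# The same character serialised under the two candidate source encodings.
latin1_bytes = left_quote.encode("latin-1")  # one byte: 0xAB
utf8_bytes = left_quote.encode("utf-8")      # two bytes: 0xC2 0xAB

# Reading the UTF-8 bytes as latin-1 succeeds but yields two characters.
misread = utf8_bytes.decode("latin-1")

# Reading the latin-1 byte as UTF-8 fails: a lone 0xAB is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(latin1_bytes, utf8_bytes, len(misread), latin1_is_valid_utf8)
```

One direction of the mix-up corrupts the string silently and the other
raises an error, which is the confusion an explicit pragma would avoid.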

Audrey
