Hi Will,

On Wed, Feb 20, 2013 at 3:52 AM, Will Coleda <w...@coleda.com> wrote:

> The code points giving you trouble are 0xFDD0..0xFDEF:
>
>
> http://stackoverflow.com/questions/5188679/whats-the-purpose-of-the-noncharacters-ufdd0-to-ufdef
>
> You can split this into two ranges to avoid the problematic points (and
> could use this to combine the distinct ranges you have above.)
>
> $ perl6 -e 'say ?("\c[0xFDCF]" ~~ /<[\c[0xE000]..\c[0xFDCF]\c[0xFDF0]..\c[0xFFFD]]>/)'
> True
>

Thanks for clarifying that.

>
> Note that if you have invalid UTF-8 input, though, you'll still get the
> invalid character error, so you'll need to deal with that before trying to
> use the rule.
>
> $ perl6 -e 'say ?("\c[0xFDD0]" ~~ /<[\c[0xE000]..\c[0xFDCF]\c[0xFDF0]..\c[0xFFFD]]>/)'
> ===SORRY!===
> Invalid character for UTF-8 encoding
>
> Hope this helps.
>
>
Yes, I think so. The rule seems to be trying to step around invalid
code-points, and everything that it rejects blows up when you try to
encode it anyway.

So as long as I'm only parsing encoded strings, the following much simpler
rule should suffice:

    token nonascii       {<- [\x0..\x7F]>}

(also using \x7F rather than \c[0x7F] - as suggested)
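
As a sanity check on that claim, here is a minimal sketch in Python (not
Perl 6; the helper names in_css_nonascii and is_not_ascii are made up for
illustration). It lists the code points that the simpler "anything outside
ASCII" rule accepts but the three CSS ranges quoted below reject. It assumes
the ranges are inclusive Unicode code points, as guessed in the original
question.

```python
# CSS 3 nonascii ranges as quoted in this thread (assumed inclusive code points).
CSS_NONASCII = [(0x80, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]


def in_css_nonascii(cp: int) -> bool:
    """True if cp falls inside one of the three CSS ranges."""
    return any(lo <= cp <= hi for lo, hi in CSS_NONASCII)


def is_not_ascii(cp: int) -> bool:
    """The simplified rule: anything outside 0x00..0x7F."""
    return cp > 0x7F


# Code points accepted by the simplified rule but rejected by the CSS rule.
extra = [cp for cp in range(0x110000)
         if is_not_ascii(cp) and not in_css_nonascii(cp)]

# The difference is exactly the surrogate block plus U+FFFE and U+FFFF.
assert extra == list(range(0xD800, 0xE000)) + [0xFFFE, 0xFFFF]
print(len(extra))  # 2050

# Surrogates cannot be encoded as UTF-8 at all, so they never reach the parser.
try:
    chr(0xD800).encode("utf-8")
except UnicodeEncodeError:
    print("U+D800 is not encodable as UTF-8")
```

So the two rules differ only on the surrogates (never valid in UTF-8) and on
the noncharacters U+FFFE and U+FFFF. Whether those last two are rejected
depends on the decoder: Python accepts them, so the equivalence only holds
where the decoder refuses noncharacters, as the build discussed here appears
to do for 0xFDD0.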

Thanks,
- David


> On Mon, Feb 18, 2013 at 11:29 PM, David Warring
> <david.warr...@gmail.com> wrote:
>
>> Hi Guys,
>> A quick question.
>>
>> I'm trying to interpret unicode code-point ranges from the CSS 3 spec -
>> http://www.w3.org/TR/css3-syntax/#CHARSETS
>>
>> The rule in question is
>>
>> nonascii :== #x80-#xD7FF #xE000-#xFFFD #x10000-#x10FFFF
>>
>> Where (I think) these are unicode code-point ranges.
>>
>> The latest rakudo build is fine with:
>>
>>
>> % perl6 -e '/<[\c[0x80]..\c[0xD7FF]]>/'
>>
>>
>> ...but doesn't like the second (or third) range:
>>
>>
>> % perl6 -e '/<[\c[0xE000]..\c[0xFFFD]]>/'
>> ===SORRY!===
>> Invalid character for UTF-8 encoding
>>
>>
>> ...the individual code points are ok:
>>
>>
>> % perl6 -e '/<[\c[0xE000]]>/'
>> % perl6 -e '/<[\c[0xFFFD]]>/'
>>
>>
>> I think I'm getting the above error because not all unicode code-points
>> are defined for the range xE000 to xFFFD - see
>> http://www.utf8-chartable.de/unicode-utf8-table.pl  .
>>
>> I'm just having a problem implementing a concise regex/grammar rule for
>> the above. Looking for advice.
>>
>> Cheers,
>> David Warring
>>
>
>
>
> --
> Will "Coke" Coleda
>
