Re: Character error not reported

2019-07-03 Thread Hans Åberg


> On 3 Jul 2019, at 07:24, Akim Demaille wrote:
> 
>> On 2 Jul 2019, at 14:15, Hans Åberg wrote:
>> 
>>> On 2 Jul 2019, at 07:08, Akim Demaille wrote:
>>> 
>>>> On 18 Jun 2019, at 18:09, Hans Åberg wrote:
>>>> 
>>>> As 8-bit character tokens are not useful with UTF-8, I have replaced it 
>>>> with:
>>>> %token token_error "token error"
>>>> 
>>>> . { return my_parser::token::token_error; }
>>>> 
>>>> Please let me know if there is a better way to generate a parser error.
>>> 
>>> I personally prefer to throw an exception.
>>> 
>>> .   throw parser::syntax_error(loc, "invalid character: "s + yytext);
>> 
>> I changed to that too, wording the message so that it looks as though it 
>> was thrown by the parser:
>> . { throw my_parser::syntax_error(yylloc, "syntax error, unexpected my_parser token error."); }
>> 
>> When the match is only part of a UTF-8 byte sequence, it is not useful to 
>> report what it is.
> 
> You have a point.  I would still report the culprit, but improve the pattern.

As for Bison, I thought of it as a possible suggestion for better diagnostics.

> /* UTF-8 Encoded Unicode Code Point, from Flex's documentation. */
> mbchar [\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})
> 
> %%
> 
> {mbchar}  throw parser::syntax_error(loc, "invalid character: "s + yytext);
> . throw parser::syntax_error(loc, "invalid byte: "s + yytext);

Thanks for the suggestion. I made a Haskell program generating such regex 
patterns for UTF-8 and UTF-32 character classes, and also a C++ version.
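
As an illustration only (this is not the program mentioned above), here is a 
minimal C++ sketch of the basic building block such a generator needs: 
turning one Unicode code point into the \xHH escapes that Flex accepts. The 
names append_escape and utf8_escapes are hypothetical, and a real generator 
would additionally have to split code point ranges into byte ranges.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from the thread): append one byte as a Flex
// hexadecimal escape such as \xC3.
void append_escape(std::string& out, unsigned byte) {
  static const char digits[] = "0123456789ABCDEF";
  out += "\\x";
  out += digits[(byte >> 4) & 0xF];
  out += digits[byte & 0xF];
}

// Hypothetical helper: the UTF-8 encoding of one code point, written as a
// sequence of \xHH escapes.  Returns an empty string for surrogates and for
// values above U+10FFFF, which have no valid UTF-8 encoding.
std::string utf8_escapes(std::uint32_t cp) {
  std::string out;
  if (cp < 0x80) {
    append_escape(out, cp);
  } else if (cp < 0x800) {
    append_escape(out, 0xC0 | (cp >> 6));
    append_escape(out, 0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    if (0xD800 <= cp && cp <= 0xDFFF) return out;  // surrogate range
    append_escape(out, 0xE0 | (cp >> 12));
    append_escape(out, 0x80 | ((cp >> 6) & 0x3F));
    append_escape(out, 0x80 | (cp & 0x3F));
  } else if (cp <= 0x10FFFF) {
    append_escape(out, 0xF0 | (cp >> 18));
    append_escape(out, 0x80 | ((cp >> 12) & 0x3F));
    append_escape(out, 0x80 | ((cp >> 6) & 0x3F));
    append_escape(out, 0x80 | (cp & 0x3F));
  }
  return out;
}
```

For example, utf8_escapes(0xC5) (the letter Å) yields \xC3\x85, which can be 
pasted into a Flex pattern as is.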

I am thinking, though, of testing my own software, which I mentioned before, 
as a replacement for Flex.





Re: Character error not reported

2019-07-02 Thread Akim Demaille



> On 2 Jul 2019, at 14:15, Hans Åberg wrote:
> 
>> On 2 Jul 2019, at 07:08, Akim Demaille wrote:
>> 
>> Hi Hans,
> 
> Hello,
> 
>>> On 18 Jun 2019, at 18:09, Hans Åberg wrote:
>>> 
>>> As 8-bit character tokens are not useful with UTF-8, I have replaced it 
>>> with:
>>> %token token_error "token error"
>>> 
>>> . { return my_parser::token::token_error; }
>>> 
>>> Please let me know if there is a better way to generate a parser error.
>> 
>> I personally prefer to throw an exception.
>> 
>> .   throw parser::syntax_error(loc, "invalid character: "s + yytext);
> 
> I changed to that too, wording the message so that it looks as though it 
> was thrown by the parser:
> . { throw my_parser::syntax_error(yylloc, "syntax error, unexpected my_parser token error."); }
> 
> When the match is only part of a UTF-8 byte sequence, it is not useful to 
> report what it is.

You have a point.  I would still report the culprit, but improve the pattern.

 /* UTF-8 Encoded Unicode Code Point, from Flex's documentation. */
mbchar [\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})

%%

{mbchar}  throw parser::syntax_error(loc, "invalid character: "s + yytext);
. throw parser::syntax_error(loc, "invalid byte: "s + yytext);
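
For readers who want to check the byte ranges outside of Flex, here is a 
minimal C++ sketch that mirrors the mbchar pattern alternative by 
alternative. The names in_range, is_cont, mbchar_length and describe are 
hypothetical. Printing a stray byte in hexadecimal instead of raw is an 
assumption of this sketch, not something the rules above do.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

namespace {
// True if byte b lies in the inclusive range [lo, hi].
bool in_range(unsigned char b, unsigned lo, unsigned hi) {
  return lo <= b && b <= hi;
}
// True if s has a continuation byte (\x80-\xBF) at index i.
bool is_cont(std::string_view s, std::size_t i) {
  return i < s.size() && in_range(s[i], 0x80, 0xBF);
}
}  // namespace

// Hypothetical helper: length in bytes of the sequence at the start of s
// that the mbchar pattern would match, or 0 if there is none.
std::size_t mbchar_length(std::string_view s) {
  if (s.empty()) return 0;
  const unsigned char b0 = s[0];
  const unsigned char b1 = s.size() > 1 ? s[1] : 0;
  if (b0 == 0x09 || b0 == 0x0A || b0 == 0x0D || in_range(b0, 0x20, 0x7E))
    return 1;
  if (in_range(b0, 0xC2, 0xDF))
    return is_cont(s, 1) ? 2 : 0;
  if (b0 == 0xE0)
    return (in_range(b1, 0xA0, 0xBF) && is_cont(s, 2)) ? 3 : 0;
  if (in_range(b0, 0xE1, 0xEC) || b0 == 0xEE || b0 == 0xEF)
    return (is_cont(s, 1) && is_cont(s, 2)) ? 3 : 0;
  if (b0 == 0xED)  // second byte limited to exclude surrogates
    return (in_range(b1, 0x80, 0x9F) && is_cont(s, 2)) ? 3 : 0;
  if (b0 == 0xF0)
    return (in_range(b1, 0x90, 0xBF) && is_cont(s, 2) && is_cont(s, 3)) ? 4 : 0;
  if (in_range(b0, 0xF1, 0xF3))
    return (is_cont(s, 1) && is_cont(s, 2) && is_cont(s, 3)) ? 4 : 0;
  if (b0 == 0xF4)  // second byte limited to stay at or below U+10FFFF
    return (in_range(b1, 0x80, 0x8F) && is_cont(s, 2) && is_cont(s, 3)) ? 4 : 0;
  return 0;
}

// Hypothetical helper: message text for a match of either catch-all rule.
// A whole code point is reported as text, a stray byte as hexadecimal.
std::string describe(std::string_view match) {
  if (!match.empty() && mbchar_length(match) == match.size())
    return "invalid character: " + std::string(match);
  static const char digits[] = "0123456789ABCDEF";
  const unsigned char b = match.empty() ? 0 : match[0];
  return std::string("invalid byte: 0x") + digits[b >> 4] + digits[b & 0xF];
}
```

With this sketch, a lone \xC3 is described as "invalid byte: 0xC3", while 
the complete sequence \xC3\x85 is described as an invalid character.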




Re: Character error not reported

2019-07-02 Thread Hans Åberg


> On 2 Jul 2019, at 07:08, Akim Demaille wrote:
> 
> Hi Hans,

Hello,

>> On 18 Jun 2019, at 18:09, Hans Åberg wrote:
>> 
>> As 8-bit character tokens are not useful with UTF-8, I have replaced it with:
>> %token token_error "token error"
>> 
>> . { return my_parser::token::token_error; }
>> 
>> Please let me know if there is a better way to generate a parser error.
> 
> I personally prefer to throw an exception.
> 
>  .   throw parser::syntax_error(loc, "invalid character: "s + yytext);

I changed to that too, wording the message so that it looks as though it was 
thrown by the parser:
. { throw my_parser::syntax_error(yylloc, "syntax error, unexpected my_parser token error."); }

When the match is only part of a UTF-8 byte sequence, it is not useful to 
report what it is.

The token-error token may still be needed, though, as I store token values in 
the symbol table.





Re: Character error not reported

2019-07-01 Thread Akim Demaille
Hi Hans,

> On 18 Jun 2019, at 18:09, Hans Åberg wrote:
> 
> As 8-bit character tokens are not useful with UTF-8, I have replaced it with:
>  %token token_error "token error"
> 
> . { return my_parser::token::token_error; }
> 
> Please let me know if there is a better way to generate a parser error.

I personally prefer to throw an exception.

  .   throw parser::syntax_error(loc, "invalid character: "s + yytext);




Re: Character error not reported

2019-06-18 Thread Hans Åberg


> On 17 Jun 2019, at 18:06, Akim Demaille wrote:
> 
> Hi Hans,

Hi,

>> On 17 Jun 2019, at 15:12, Hans Åberg wrote:
>> 
>> When the lexer returns a byte with the high bit set that is not used in the 
>> grammar, the parser generated by Bison 3.4.1 does not report an error; it 
>> reports one only if the high bit is not set.
> 
> This is hard to believe.  I suspect your problem is elsewhere.
> 
>> This occurs if one sets a Flex default rule
>> . { return yytext[0]; }
>> and the lexer finds a stray UTF-8 byte.
> 
> I would say that here, you return a char (yytext[0]) with "a high bit set", 
> on an architecture where char is signed, so you are actually returning a 
> negative int (when the 8th bit is set).  And for Bison, any negative token 
> number stands for end-of-file.

Indeed, likely the case.

> You should actually write:
> 
> . { return (unsigned char) yytext[0]; }

As 8-bit character tokens are not useful with UTF-8, I have replaced it with:
  %token token_error "token error"

. { return my_parser::token::token_error; }

Please let me know if there is a better way to generate a parser error.






Re: Character error not reported

2019-06-18 Thread Akim Demaille
[Resent, thanks Uxio]

Hi Hans,

> On 17 Jun 2019, at 15:12, Hans Åberg wrote:
> 
> When the lexer returns a byte with the high bit set that is not used in the 
> grammar, the parser generated by Bison 3.4.1 does not report an error; it 
> reports one only if the high bit is not set.

This is hard to believe.  I suspect your problem is elsewhere.

> This occurs if one sets a Flex default rule
> . { return yytext[0]; }
> and the lexer finds a stray UTF-8 byte.

I would say that here, you return a char (yytext[0]) with "a high bit set", on 
an architecture where char is signed, so you are actually returning a negative 
int (when the 8th bit is set).  And for Bison, any negative token number stands 
for end-of-file.

You should actually write:

. { return (unsigned char) yytext[0]; }
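
To make the difference concrete, here is a minimal C++ sketch, assuming a 
two's-complement platform. It uses signed char explicitly to model a 
platform where plain char is signed. The function names are hypothetical 
stand-ins for the two scanner actions.

```cpp
// Models `return yytext[0];` where plain char is signed: the byte is
// sign-extended when it is converted to the int return value.
int token_without_cast(const signed char* text) {
  return text[0];
}

// Models `return (unsigned char) yytext[0];`: the byte is first converted
// to the range 0..255, so the resulting int is never negative.
int token_with_cast(const signed char* text) {
  return static_cast<unsigned char>(text[0]);
}
```

For the stray byte 0xC3, token_without_cast returns -61, which the parser 
takes for end-of-file, whereas token_with_cast returns 195.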

Cheers!


Re: Character error not reported

2019-06-17 Thread uxio prego
It’s really empty, your mail client and server haven’t failed:
https://lists.gnu.org/archive/html/bug-bison/2019-06/msg5.html
🤷

> On 17 Jun 2019, at 18:06, Akim Demaille wrote:
> 




Re: Character error not reported

2019-06-17 Thread Akim Demaille


Character error not reported

2019-06-17 Thread Hans Åberg
When the lexer returns a byte with the high bit set that is not used in the 
grammar, the parser generated by Bison 3.4.1 does not report an error; it 
reports one only if the high bit is not set. This occurs if one sets a Flex 
default rule
  . { return yytext[0]; }
and the lexer finds a stray UTF-8 byte.