Re: Character error not reported
> On 3 Jul 2019, at 07:24, Akim Demaille wrote:
>
>> On 2 Jul 2019, at 14:15, Hans Åberg wrote:
>>
>>> On 2 Jul 2019, at 07:08, Akim Demaille wrote:
>>>
>>>> On 18 Jun 2019, at 18:09, Hans Åberg wrote:
>>>>
>>>> As 8-bit character tokens are not useful with UTF-8, I have replaced the rule with:
>>>> %token token_error "token error"
>>>>
>>>> . { return my_parser::token::token_error; }
>>>>
>>>> Please let me know if there is a better way to generate a parser error.
>>>
>>> I personally prefer to throw an exception.
>>>
>>> . throw parser::syntax_error(loc, "invalid character: "s + yytext);
>>
>> I changed to that too, worded to look as though it were thrown by the parser:
>> . { throw my_parser::syntax_error(yylloc, "syntax error, unexpected my_parser token error."); }
>>
>> When the match is only part of a UTF-8 byte sequence, it is not useful to
>> report what it is.
>
> You have a point. I would still report the culprit, but improve the pattern.

As for Bison, I thought of this as maybe a suggestion for better diagnostics.

> /* UTF-8 Encoded Unicode Code Point, from Flex's documentation. */
> mbchar [\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})
>
> %%
>
> {mbchar} throw parser::syntax_error(loc, "invalid character: "s + yytext);
> . throw parser::syntax_error(loc, "invalid byte: "s + yytext);

Thanks for the suggestion. I made a Haskell program that generates such regex patterns for UTF-8 and UTF-32 character classes, and also a C++ version. I am thinking, though, of testing my own software, which I mentioned before, as a replacement for Flex.
Re: Character error not reported
> On 2 Jul 2019, at 14:15, Hans Åberg wrote:
>
>> On 2 Jul 2019, at 07:08, Akim Demaille wrote:
>>
>> Hi Hans,
>
> Hello,
>
>>> On 18 Jun 2019, at 18:09, Hans Åberg wrote:
>>>
>>> As 8-bit character tokens are not useful with UTF-8, I have replaced the rule
>>> with:
>>> %token token_error "token error"
>>>
>>> . { return my_parser::token::token_error; }
>>>
>>> Please let me know if there is a better way to generate a parser error.
>>
>> I personally prefer to throw an exception.
>>
>> . throw parser::syntax_error(loc, "invalid character: "s + yytext);
>
> I changed to that too, worded to look as though it were thrown by the parser:
> . { throw my_parser::syntax_error(yylloc, "syntax error, unexpected my_parser token error."); }
>
> When the match is only part of a UTF-8 byte sequence, it is not useful to
> report what it is.

You have a point. I would still report the culprit, but improve the pattern.

/* UTF-8 Encoded Unicode Code Point, from Flex's documentation. */
mbchar [\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})

%%

{mbchar} throw parser::syntax_error(loc, "invalid character: "s + yytext);
. throw parser::syntax_error(loc, "invalid byte: "s + yytext);
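To make the alternatives of the mbchar pattern easier to check, here is a minimal C++ sketch that applies the same byte ranges by hand. The names mbchar_length and in_range are hypothetical and do not come from Flex or Bison; the function returns the length of the code point at the start of the string, or 0 where the scanner would fall through to the "." rule.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True when b lies in the inclusive range [lo, hi].
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
  return lo <= b && b <= hi;
}

// Hypothetical helper: length in bytes of the code point at the start of s,
// using the same alternatives as the mbchar pattern, or 0 if none matches.
std::size_t mbchar_length(const std::string& s) {
  // Bytes past the end read as 0, which is outside every range tested below.
  auto at = [&](std::size_t i) -> unsigned char {
    return i < s.size() ? static_cast<unsigned char>(s[i]) : 0;
  };
  // Continuation byte: [\x80-\xBF]
  auto cont = [&](std::size_t i) { return in_range(at(i), 0x80, 0xBF); };

  const unsigned char b0 = at(0);
  if (b0 == 0x09 || b0 == 0x0A || b0 == 0x0D || in_range(b0, 0x20, 0x7E))
    return 1;                                              // ASCII subset
  if (in_range(b0, 0xC2, 0xDF))
    return cont(1) ? 2 : 0;                                // two bytes
  if (b0 == 0xE0)
    return (in_range(at(1), 0xA0, 0xBF) && cont(2)) ? 3 : 0;  // no overlongs
  if (in_range(b0, 0xE1, 0xEC) || b0 == 0xEE || b0 == 0xEF)
    return (cont(1) && cont(2)) ? 3 : 0;
  if (b0 == 0xED)
    return (in_range(at(1), 0x80, 0x9F) && cont(2)) ? 3 : 0;  // no surrogates
  if (b0 == 0xF0)
    return (in_range(at(1), 0x90, 0xBF) && cont(2) && cont(3)) ? 4 : 0;
  if (in_range(b0, 0xF1, 0xF3))
    return (cont(1) && cont(2) && cont(3)) ? 4 : 0;
  if (b0 == 0xF4)
    return (in_range(at(1), 0x80, 0x8F) && cont(2) && cont(3)) ? 4 : 0;
  return 0;  // stray continuation byte or invalid lead byte
}
```

A result of 0 corresponds to the "invalid byte" case, and any other result to a complete character that the grammar simply does not use.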
Re: Character error not reported
> On 2 Jul 2019, at 07:08, Akim Demaille wrote:
>
> Hi Hans,

Hello,

>> On 18 Jun 2019, at 18:09, Hans Åberg wrote:
>>
>> As 8-bit character tokens are not useful with UTF-8, I have replaced the rule with:
>> %token token_error "token error"
>>
>> . { return my_parser::token::token_error; }
>>
>> Please let me know if there is a better way to generate a parser error.
>
> I personally prefer to throw an exception.
>
> . throw parser::syntax_error(loc, "invalid character: "s + yytext);

I changed to that too, worded to look as though it were thrown by the parser:
. { throw my_parser::syntax_error(yylloc, "syntax error, unexpected my_parser token error."); }

When the match is only part of a UTF-8 byte sequence, it is not useful to report what it is.

The token-error token may still be needed, though, as I store token values in the symbol table.
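One possible middle ground, which nobody in the thread proposed, is to report a stray byte by its hexadecimal value rather than as raw text, so the message stays readable even when the byte is not a complete character. The sketch below assumes a hypothetical helper named describe_byte; it is not part of Flex or Bison.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: renders one byte as "0xNN" for use in a diagnostic.
std::string describe_byte(char c) {
  char buf[5];  // "0x", two hex digits, terminating NUL
  // Convert through unsigned char first so a signed char is not sign-extended.
  std::snprintf(buf, sizeof buf, "0x%02X",
                static_cast<unsigned>(static_cast<unsigned char>(c)));
  return buf;
}
```

In a scanner action this could be used as "invalid byte: " + describe_byte(yytext[0]) when building the exception message.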
Re: Character error not reported
Hi Hans,

> On 18 Jun 2019, at 18:09, Hans Åberg wrote:
>
> As 8-bit character tokens are not useful with UTF-8, I have replaced the rule with:
> %token token_error "token error"
>
> . { return my_parser::token::token_error; }
>
> Please let me know if there is a better way to generate a parser error.

I personally prefer to throw an exception.

. throw parser::syntax_error(loc, "invalid character: "s + yytext);
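The "..."s suffix in that action is the C++14 std::string literal, so it only compiles where std::string_literals is in scope. The following sketch shows the idiom in isolation. The syntax_error struct here is a hypothetical stand-in for the class generated by Bison, with the location argument left out so that the example compiles on its own; reject and message_for are likewise illustrative names.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

using namespace std::string_literals;  // makes the "..."s suffix available

// Hypothetical stand-in for the generated parser::syntax_error.
// The location argument is omitted to keep the sketch self-contained.
struct syntax_error : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Plays the role of the scanner action: always throws.
void reject(const char* yytext) {
  // std::string + const char* yields a std::string holding the full message.
  throw syntax_error("invalid character: "s + yytext);
}

// Calls reject() and returns the message carried by the exception.
std::string message_for(const char* yytext) {
  try {
    reject(yytext);
  } catch (const syntax_error& e) {
    return e.what();
  }
  return std::string();  // not reached, reject() always throws
}
```

Without the suffix, "invalid character: " + yytext would add two pointers and fail to compile, which is why the literal has to be a std::string.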
Re: Character error not reported
> On 17 Jun 2019, at 18:06, Akim Demaille wrote:
>
> Hi Hans,

Hi,

>> On 17 Jun 2019, at 15:12, Hans Åberg wrote:
>>
>> When the input contains a byte with the high bit set that is not used in the
>> grammar, the parser generated by Bison 3.4.1 does not report an error; it
>> reports one only if the high bit is not set.
>
> This is hard to believe. I suspect your problem is elsewhere.
>
>> This occurs if one sets a Flex default rule
>> . { return yytext[0]; }
>> and the lexer finds a stray UTF-8 byte.
>
> I would say that here, you return a char (yytext[0]) with "a high bit set",
> on an architecture where char is signed, so you are actually returning a
> negative int (when the 8th bit is set). And for Bison, any negative token
> number stands for end-of-file.

Indeed, likely the case.

> You should actually write:
>
> . { return (unsigned char) yytext[0]; }

As 8-bit character tokens are not useful with UTF-8, I have replaced the rule with:
%token token_error "token error"

. { return my_parser::token::token_error; }

Please let me know if there is a better way to generate a parser error.
Re: Character error not reported
[Resent, thanks Uxio]

Hi Hans,

> On 17 Jun 2019, at 15:12, Hans Åberg wrote:
>
> When the input contains a byte with the high bit set that is not used in the
> grammar, the parser generated by Bison 3.4.1 does not report an error; it
> reports one only if the high bit is not set.

This is hard to believe. I suspect your problem is elsewhere.

> This occurs if one sets a Flex default rule
> . { return yytext[0]; }
> and the lexer finds a stray UTF-8 byte.

I would say that here, you return a char (yytext[0]) with "a high bit set", on an architecture where char is signed, so you are actually returning a negative int (when the 8th bit is set). And for Bison, any negative token number stands for end-of-file.

You should actually write:

. { return (unsigned char) yytext[0]; }

Cheers!
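The sign extension described above can be reproduced outside a scanner. This sketch uses signed char explicitly so that it behaves the same on every platform, whereas plain char is signed on some architectures and unsigned on others. Both function names are hypothetical.

```cpp
#include <cassert>

// What "return yytext[0];" does where char is signed: the byte is
// sign-extended to int, so bytes 0x80..0xFF come out negative.
int token_from_signed_char(signed char c) { return c; }

// The suggested fix: convert through unsigned char first, so the
// result is always in the range 0..255.
int token_from_byte(signed char c) { return static_cast<unsigned char>(c); }
```

For the byte 0xC3, which is -61 as a signed char, the first function returns -61 and the second returns 195. Plain ASCII bytes are unaffected, which matches the report that only bytes with the high bit set went unreported.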
Re: Character error not reported
It’s really empty; your mail client and server haven’t failed:

https://lists.gnu.org/archive/html/bug-bison/2019-06/msg5.html 🤷

> On 17 Jun 2019, at 18:06, Akim Demaille wrote:
>
Re: Character error not reported
Character error not reported
When the input contains a byte with the high bit set that is not used in the grammar, the parser generated by Bison 3.4.1 does not report an error; it reports one only if the high bit is not set.

This occurs if one sets a Flex default rule
. { return yytext[0]; }
and the lexer finds a stray UTF-8 byte.