> On 3 Jul 2019, at 07:24, Akim Demaille <a...@lrde.epita.fr> wrote: > >> Le 2 juil. 2019 à 14:15, Hans Åberg <haber...@telia.com> a écrit : >> >>> On 2 Jul 2019, at 07:08, Akim Demaille <a...@lrde.epita.fr> wrote: >>> >>>> Le 18 juin 2019 à 18:09, Hans Åberg <haber...@telia.com> a écrit : >>>> >>>> As 8-bit character tokens are not useful with UTF-8, I have replaced it >>>> with: >>>> %token token_error "token error" >>>> >>>> . { return my_parser::token::token_error; } >>>> >>>> Please let me know if there is a better way to generate a parser error. >>> >>> I personally prefer to throw an exception. >>> >>> . throw parser::syntax_error(loc, "invalid character: "s + yytext); >> >> I changed to that too, writing to make it look as though thrown by the >> parser: >> . { throw my_parser::syntax_error(yylloc, "syntax error, unexpected >> my_parser token error."); >> >> When the match is a part of an UTF-8 byte, it is not useful to report what >> it is. > > You have a point. I would still report the culprit, but improve the pattern.
As for Bison, I thought maybe a suggestion for better diagnostics. > /* UTF-8 Encoded Unicode Code Point, from Flex's documentation. */ > mbchar > [\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x\90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2}) > > %% > > {mbchar} throw parser::syntax_error(loc, "invalid character: "s + yytext); > . throw parser::syntax_error(loc, "invalid byte: "s + yytext); Thanks for the suggestion. I made a Haskell program generating such regex patterns for UTF-8 and UTF-32 character classes, and also a C++ version. I think though of testing my own software I mentioned before as a replacement for Flex.