Re: lexer.ll: Warn about non-UTF-8 characters (issue 5505090)

dak Sun, 01 Jan 2012 01:56:14 -0800


http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll
File lily/lexer.ll (right):


http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode134
lily/lexer.ll:134: A            [a-zA-Z\200-\377]
On 2012/01/01 02:01:11, Keith wrote:

non-ASCII characters are used internally as-read, tested below only to

warn the

user if the input is invalid as UTF-8


You mean, you want a comment here?  I actually don't know how they look
in this section of the Flex source.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode220
lily/lexer.ll:220: (void) YYText_utf8 ();
On 2012/01/01 02:01:11, Keith wrote:

Complaining about comments doesn't help anybody.


There are editors and tools that decide about which encoding to use
based on an analysis of the whole file.  Exceptions inside of comments
can make them pick a "safe choice" like displaying every byte in binary.
 So I don't agree with "doesn't help anybody".  Also if you start with a
file that contains non-ASCII non-UTF-8 characters in comments, the
editor will not assume that you want to save in UTF-8.  All additionally
entered characters will be encoded compatibly with the previously
detected file encoding.

So I don't agree with your statement that basically says that files that
contain non-UTF-8 characters in comments can't cause trouble.  I won't
insist on throwing a warning here, but I won't remove it without being
sure that this is based on an informed decision.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1009
lily/lexer.ll:1009: case 0xc2:
On 2011/12/31 18:28:00, lemzwerg wrote:

Wouldn't it be more effective to create an array of 128 bytes (for

0x80-0xFF)

which maps `p[i]' to the value of `more' (or 0 if there is more to

do)?

How and when to break switch statements down into conditionals and
lookup tables depending on the individual processor's performance
characteristics is a pet peeve of compiler writers.  I see little point
meddling with that.  Have you looked at the generated code? I'd probably
have exploded the case 0x00..0x7f into the switch as well, but did not
like the resulting number of lines, and I don't know if clang++ or
whatever other compilers we are using can deal with case ranges (like
gcc can).

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1083
lily/lexer.ll:1083: LexerWarning (_ ("non-UTF-8 characters").c_str ());
On 2012/01/01 02:01:11, Keith wrote:

I suggest "unsupported file encoding, expected UTF-8"


I think that error message would be appropriate once per run, either
because of making it a fatal error, or as a summary at the end.  But
when flagging the actual place where there is a problematic character?

You could warn once per token, rather than once per character, by

following this

with  break;


I think once per character is ok (after all, users don't appreciate
fixing all errors of one run, just to get fresh errors on the next).  It
does not help, however, if the error message points merely to the token
(more accurately, the lexing unit which does not need to result in a
token).

So the final version to commit should either point straight to the bad
character, or give just one message per string.  I'll change this to the
latter before the final commit if nobody gives a good suggestion of how
to make it do the former.

http://codereview.appspot.com/5505090/

_______________________________________________
lilypond-devel mailing list
lilypond-devel@gnu.org
https://lists.gnu.org/mailman/listinfo/lilypond-devel

Re: lexer.ll: Warn about non-UTF-8 characters (issue 5505090)

Reply via email to