[ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085149#comment-13085149 ]
Sebb commented on CODEC-127:
----------------------------

It's not that one cannot edit UTF-8; the problem is that it is easy to mangle non-ASCII characters by mistake.

The safest approach is to use only ASCII, i.e. Unicode escapes, which are valid in UTF-8, ISO-8859-1 and all likely default encodings. However, they are difficult to read, hence the comments on the lines.

If the comments get mangled, it will be obvious, because they won't look right; and it's relatively easy to fix them from the Unicode escapes.

I don't think it's an option to use native characters in the non-comment code, because we already know they can get corrupted, and the corruption won't necessarily cause errors.

I don't see the harm in "translating" the code into comments; after all, the translation can be done again.

> Non-ascii characters in source files
> ------------------------------------
>
>                 Key: CODEC-127
>                 URL: https://issues.apache.org/jira/browse/CODEC-127
>             Project: Commons Codec
>          Issue Type: Bug
>            Reporter: Sebb
>
> Some of the test cases include characters in a native encoding (possibly UTF-8), rather than using Unicode escapes.
> This can cause a problem for IDEs if they don't know the encoding (e.g. cause compilation errors, which is how I found the issue), and possibly some transformations may corrupt the contents, e.g. fixing EOL.
> I think we should have a rule of using Unicode escapes for all such non-ascii characters.
> It's particularly important for non-ISO-8859-1 characters.
> Some example classes with non-ascii characters:
> {code}
> binary\Base64Test.java:96 byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
> language\ColognePhoneticTest.java:110 {"m├Ânchengladbach", "664645214"},
> language\ColognePhoneticTest.java:130 String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
> language\ColognePhoneticTest.java:137 {"Meyer", "M├╝ller"},
> language\ColognePhoneticTest.java:143 {"ganz", "G├ñnse"},
> language\DoubleMetaphoneTest.java:1222 this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
> language\DoubleMetaphoneTest.java:1227 this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
> language\SoundexTest.java:367 if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:369 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:375 Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:387 if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:389 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:395 Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> {code}
> The characters are probably not correct above, because I used a crude perl script to find them:
> {code}
> perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
> {code}
> language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be a single character.
> Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:
> if (Character.isLetter('\ufffd'))
> which is an "unknown" character.
> Similarly for binary\Base64Test.java:96.
> It's not all that clear what the Unicode escapes should be in these cases, but probably not the unknown character.
> [Possibly the characters got mangled at some point, or maybe they have always been wrong]
> The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1 (accented German), but given that the rest of the file uses Unicode escapes, I think they should be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut)
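
To illustrate the convention proposed in this issue (ASCII-only source with Unicode escapes, plus a plain-text comment naming each character), here is a minimal Java sketch. It is not part of Commons Codec: the class name NonAsciiCheck and the methods hasNonAscii, nonAsciiLines and toUnicodeEscapes are hypothetical names chosen for this example. The sketch flags lines containing non-ASCII characters, roughly as the perl one-liner does, and rewrites a string with one escape per UTF-16 code unit, which is assumed to be close to what native2ascii produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Commons Codec.
class NonAsciiCheck {

    /** Returns true if the text contains any char outside 7-bit ASCII. */
    static boolean hasNonAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return true;
            }
        }
        return false;
    }

    /** Returns the 1-based numbers of the lines that contain non-ASCII chars. */
    static List<Integer> nonAsciiLines(List<String> lines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (hasNonAscii(lines.get(i))) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    /** Replaces each non-ASCII UTF-16 code unit with a four-hex-digit escape. */
    static String toUnicodeEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                // The doubled backslash yields a literal backslash in the output.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Written with an escape so this file itself stays pure ASCII.
        String town = "m\u00f6nchengladbach"; // o-umlaut
        System.out.println(nonAsciiLines(Arrays.asList("Meyer", town))); // prints [2]
        System.out.println(toUnicodeEscapes(town)); // prints the ASCII-only form
    }
}
```

Note that a check like this only detects non-ASCII characters; as with the native2ascii result quoted above, it cannot recover a character that has already been replaced by U+FFFD.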