[ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebb updated CODEC-127:
-----------------------
    Comment: was deleted

(was: If I run the command as is, I get:
{quote}
Can't open perl script "ne": No such file or directory
{quote})

> Non-ascii characters in source files
> ------------------------------------
>
>                 Key: CODEC-127
>                 URL: https://issues.apache.org/jira/browse/CODEC-127
>             Project: Commons Codec
>          Issue Type: Bug
>            Reporter: Sebb
>
> Some of the test cases include characters in a native encoding (possibly
> UTF-8), rather than using Unicode escapes.
> This can cause a problem for IDEs if they don't know the encoding (e.g. cause
> compilation errors, which is how I found the issue), and some
> transformations, e.g. fixing EOL, may corrupt the contents.
> I think we should have a rule of using Unicode escapes for all such non-ASCII
> characters.
> It's particularly important for non-ISO-8859-1 characters.
> Some example classes with non-ASCII characters:
> {code}
> binary\Base64Test.java:96 byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
> language\ColognePhoneticTest.java:110 {"m├Ânchengladbach", "664645214"},
> language\ColognePhoneticTest.java:130 String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
> language\ColognePhoneticTest.java:137 {"Meyer", "M├╝ller"},
> language\ColognePhoneticTest.java:143 {"ganz", "G├ñnse"},
> language\DoubleMetaphoneTest.java:1222 this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
> language\DoubleMetaphoneTest.java:1227 this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
> language\SoundexTest.java:367 if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:369 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:375 Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:387 if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:389 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:395 Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
> {code}
> The characters are probably not correct above, because I used a crude perl
> script to find them:
> {code}
> perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" xxxx.java
> {code}
> language\SoundexTest.java:367 in particular is incorrect, because it's
> supposed to be a single character.
> Now one might think that native2ascii -encoding UTF-8 would fix that, but it
> gives:
> if (Character.isLetter('\ufffd'))
> which is an "unknown" character.
> Similarly for binary\Base64Test.java:96.
> It's not all that clear what the Unicode escapes should be in these cases,
> but probably not the unknown character.
> [Possibly the characters got mangled at some point, or maybe they have always
> been wrong]
> The ColognePhoneticTest.java cases are less serious, as the characters are
> valid ISO-8859-1 (accented German), but given that the rest of the file uses
> Unicode escapes, I think they should be changed too (but add comments to say
> what they are, e.g. o-umlaut, u-umlaut).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
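For illustration only, the proposed rule (Unicode escapes for every non-ASCII character) could be sketched in Java roughly as follows. This is not code from Commons Codec: the class name UnicodeEscapeSketch and the method escapeNonAscii are hypothetical, and the sketch only approximates what native2ascii does for already-decoded text. The inputs are written with escapes so that the sketch itself contains no non-ASCII characters.

```java
class UnicodeEscapeSketch {

    // Hypothetical helper: copies 7-bit ASCII chars through unchanged and
    // rewrites every other char as a backslash-u escape with four hex digits.
    // Supplementary characters come out as two escapes (one per surrogate),
    // which is also how they are written in Java source.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The German test words from ColognePhoneticTest, written with escapes.
        System.out.println(escapeNonAscii("m\u00f6nchengladbach"));      // o-umlaut
        System.out.println(escapeNonAscii("M\u00fcller-L\u00fcdenscheidt")); // u-umlaut
        System.out.println(escapeNonAscii("G\u00e4nse"));                // a-umlaut
    }
}
```

A conversion like this only helps when the file was decoded with the right encoding in the first place. If the decoded text already contains the replacement character U+FFFD, the sketch simply emits its escape, which matches the native2ascii result described in the issue; the original character cannot be recovered this way.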