Non-ascii characters in test source files
-----------------------------------------
Key: CODEC-127
URL: https://issues.apache.org/jira/browse/CODEC-127
Project: Commons Codec
Issue Type: Bug
Reporter: Sebb
Some of the test cases include characters in a native encoding (possibly
UTF-8), rather than using Unicode escapes.
This can cause problems for IDEs that don't know the file encoding (e.g.
compilation errors, which is how I found the issue), and some transformations,
such as fixing EOLs, may corrupt the contents.
I think we should have a rule of using Unicode escapes for all such non-ASCII
characters.
It's particularly important for non-ISO-8859-1 characters.
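As a rough illustration of what such a rule would mean in practice, here is a minimal sketch that rewrites a string so every non-ASCII character becomes a Unicode escape. The class and method names (UnicodeEscaper, escapeNonAscii) are made up for this example and are not part of Commons Codec.

```java
// Hypothetical helper, for illustration only; not part of Commons Codec.
final class UnicodeEscaper {

    /** Replaces every char outside 7-bit ASCII with a backslash-u escape. */
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);                                  // plain ASCII is kept as-is
            } else {
                sb.append(String.format("\\u%04x", (int) c));  // four hex digits per char
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The argument contains two u-umlauts, written here as escapes.
        System.out.println(escapeNonAscii("M\u00fcller-L\u00fcdenscheidt"));
    }
}
```

Working char by char means a supplementary character comes out as two escapes (its surrogate pair), which is the form Java source expects.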
Some example classes with non-ascii characters:
{code}
binary\Base64Test.java:96 byte[] decode =
b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
language\ColognePhoneticTest.java:110 {"m├Ânchengladbach",
"664645214"},
language\ColognePhoneticTest.java:130 String[][] data =
{{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
language\ColognePhoneticTest.java:137 {"Meyer", "M├╝ller"},
language\ColognePhoneticTest.java:143 {"ganz", "Gänse"},
language\DoubleMetaphoneTest.java:1222
this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
language\DoubleMetaphoneTest.java:1227
this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
language\SoundexTest.java:367 if (Character.isLetter('´┐¢')) {
language\SoundexTest.java:369 Assert.assertEquals("´┐¢000",
this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:375 Assert.assertEquals("",
this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:387 if (Character.isLetter('´┐¢')) {
language\SoundexTest.java:389 Assert.assertEquals("´┐¢000",
this.getSoundexEncoder().encode("´┐¢"));
language\SoundexTest.java:395 Assert.assertEquals("",
this.getSoundexEncoder().encode("´┐¢"));
{code}
The characters shown above are probably not displayed correctly, because I used
a crude Perl script to find them:
{code}
perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
{code}
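For anyone without Perl to hand, the same kind of scan can be sketched in Java. This is only an assumption about how one might do it: the class name NonAsciiFinder is invented, and the files are read as ISO-8859-1 purely so that every byte maps to some character and nothing fails to decode. The printed text of UTF-8 files will therefore look mangled, but the line numbers are reliable.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class NonAsciiFinder {

    /** True if the line contains any char outside 7-bit ASCII. */
    static boolean hasNonAscii(String line) {
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) > 0x7F) {
                return true;
            }
        }
        return false;
    }

    /** Returns the 1-based numbers of all lines containing non-ASCII chars. */
    static List<Integer> nonAsciiLineNumbers(List<String> lines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (hasNonAscii(lines.get(i))) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path file = Paths.get(arg);
            // ISO-8859-1 decodes any byte sequence, so bytes >= 0x80 show up as chars > 0x7F.
            List<String> lines = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
            for (int n : nonAsciiLineNumbers(lines)) {
                System.out.println(file + ":" + n + " " + lines.get(n - 1));
            }
        }
    }
}
```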
language\SoundexTest.java:367 in particular is incorrect, because it's supposed
to be a single character.
One might think that native2ascii -encoding UTF-8 would fix that, but it
gives:
if (Character.isLetter('\ufffd'))
which is U+FFFD, the Unicode replacement ("unknown") character.
Similarly for binary\Base64Test.java:96.
It's not all that clear what the Unicode escapes should be in these cases, but
probably not the unknown character.
[Possibly the characters got mangled at some point, or maybe they have always
been wrong]
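One plausible way for a character to end up as U+FFFD (this is a guess at the mechanism, not something confirmed for these files) is that ISO-8859-1 bytes were at some point decoded as UTF-8. A lone byte such as 0xE4 is not valid UTF-8, so the Java decoder substitutes the replacement character. The class name below is made up for the demonstration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class ReplacementCharDemo {

    /** Encodes text as ISO-8859-1, then (wrongly) decodes those bytes as UTF-8. */
    static String decodeLatin1BytesAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // new String(byte[], Charset) replaces malformed input with U+FFFD.
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String mangled = decodeLatin1BytesAsUtf8("G\u00e4nse"); // a-umlaut is byte 0xE4
        for (int i = 0; i < mangled.length(); i++) {
            System.out.printf("U+%04X%n", (int) mangled.charAt(i));
        }
    }
}
```

Once that substitution has happened and the file is saved, the original character is lost, which would explain why native2ascii can only report \ufffd.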
The ColognePhoneticTest.java cases are less serious, as the characters are
valid ISO-8859-1 (accented German), but given that the rest of the file uses
Unicode escapes, I think they should be changed too (but add comments to say
what they are, e.g. o-umlaut, u-umlaut).
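For instance, the ColognePhoneticTest entries might end up looking something like the sketch below. The wrapper class and field name are invented for the example, and only the string values come from the listing above.

```java
// Hypothetical holder class, for illustration only; values taken from the listing above.
final class EscapedTestData {

    static final String[][] DATA = {
        {"m\u00f6nchengladbach", "664645214"},         // o-umlaut
        {"bergisch-gladbach", "174845214"},
        {"M\u00fcller-L\u00fcdenscheidt", "65752682"}, // u-umlaut (twice)
    };

    public static void main(String[] args) {
        for (String[] row : DATA) {
            System.out.println(row[0].length() + " chars -> " + row[1]);
        }
    }
}
```

The source file is then pure ASCII, so it compiles the same way whatever encoding the IDE or build assumes.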