[ https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084743#comment-13084743 ]
Sebb commented on CODEC-127:
----------------------------
Here's the full list of lines containing non-ASCII characters:
{code}
java/org/apache/commons/codec/language/ColognePhonetic.java:264 private
static final char[][] PREPROCESS_MAP = new char[][]{{'\u00C4', 'A'}, // ├âÔÇ×
java/org/apache/commons/codec/language/ColognePhonetic.java:265
{'\u00DC', 'U'}, // Ü
java/org/apache/commons/codec/language/ColognePhonetic.java:266
{'\u00D6', 'O'}, // ├âÔÇô
java/org/apache/commons/codec/language/ColognePhonetic.java:267
{'\u00DF', 'S'} // ├â┼©
java/org/apache/commons/codec/language/ColognePhonetic.java:388 * Converts
the string to upper case and replaces germanic umlauts, and the
├óÔé¼┼ô├â┼©├óÔé¼´┐¢.
test/org/apache/commons/codec/binary/Base64Test.java:96 byte[] decode =
b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
test/org/apache/commons/codec/language/ColognePhoneticTest.java:110
{"mönchengladbach", "664645214"},
test/org/apache/commons/codec/language/ColognePhoneticTest.java:130
String[][] data = {{"bergisch-gladbach", "174845214"},
{"Müller-Lüdenscheidt", "65752682"}};
test/org/apache/commons/codec/language/ColognePhoneticTest.java:137
{"Meyer", "Müller"},
test/org/apache/commons/codec/language/ColognePhoneticTest.java:143
{"ganz", "Gänse"},
test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1222
this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1227
this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
test/org/apache/commons/codec/language/SoundexTest.java:367 if
(Character.isLetter('´┐¢')) {
test/org/apache/commons/codec/language/SoundexTest.java:369
Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:375
Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:387 if
(Character.isLetter('´┐¢')) {
test/org/apache/commons/codec/language/SoundexTest.java:389
Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/SoundexTest.java:395
Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
test/org/apache/commons/codec/language/bm/BeiderMorseEncoderTest.java:93
String[] names = { "ácz", "átz", "Ignácz", "Ignátz", "Ignác" };
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:47
{ "Nuñez", "spanish", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:49
{ "Čapek", "czech", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:52
{ "Küçük", "turkish", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:55
{ "Ceauşescu", "romanian", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:57
{ "Αγγελόπουλος", "greek", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:58
{ "Пушкин", "cyrillic", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:59
{ "כהן", "hebrew", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:60
{ "ácz", "any", EXACT },
test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:61
{ "átz", "any", EXACT } });
{code}
Note the comment at ColognePhonetic.java:388 - this does not seem to make sense
in any encoding, but I could be wrong.
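
For what it's worth, the sequence at ColognePhonetic.java:388, exactly as it is rendered in this message, can be reproduced by two successive mis-decodings rather than one. The sketch below is a guess, not a diagnosis. It assumes the Javadoc originally held a quoted sharp s (U+201C, U+00DF, U+201D), that its UTF-8 bytes were read as windows-1252 and saved again as UTF-8, and that the result was then displayed in code page 850. The class and method names are made up for illustration, and the IBM850 charset is only present on JDKs that ship the extended charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Commons Codec.
class MojibakeSketch {

    /** Reads the UTF-8 bytes of text as if they had been written in the wrong charset. */
    static String misread(String text, String wrongCharset) {
        // Bytes the wrong charset cannot map come back as U+FFFD.
        return new String(text.getBytes(StandardCharsets.UTF_8), Charset.forName(wrongCharset));
    }

    /** Assumed chain: UTF-8 read as windows-1252, saved again as UTF-8, shown as code page 850. */
    static String garble(String text) {
        return misread(misread(text, "windows-1252"), "IBM850");
    }

    public static void main(String[] args) {
        // Left quote, sharp s, right quote, written with escapes so this file stays ASCII.
        String assumedOriginal = "\u201C\u00DF\u201D";
        for (char c : garble(assumedOriginal).toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

If the assumption holds, the code points printed should match, one for one, the characters shown before the full stop in that comment line. No single encoding would then make sense of it, which fits the observation above.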
> Non-ascii characters in test source files
> -----------------------------------------
>
> Key: CODEC-127
> URL: https://issues.apache.org/jira/browse/CODEC-127
> Project: Commons Codec
> Issue Type: Bug
> Reporter: Sebb
>
> Some of the test cases include characters in a native encoding (possibly
> UTF-8), rather than using Unicode escapes.
> This can cause a problem for IDEs if they don't know the encoding (e.g. cause
> compilation errors, which is how I found the issue), and possibly some
> transformations may corrupt the contents, e.g. fixing EOL.
> I think we should have a rule of using Unicode escapes for all such non-ascii
> characters.
> It's particularly important for non-ISO-8859-1 characters.
> Some example classes with non-ascii characters:
> {code}
> binary\Base64Test.java:96 byte[] decode =
> b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
> language\ColognePhoneticTest.java:110 {"mönchengladbach",
> "664645214"},
> language\ColognePhoneticTest.java:130 String[][] data =
> {{"bergisch-gladbach", "174845214"}, {"Müller-Lüdenscheidt", "65752682"}};
> language\ColognePhoneticTest.java:137 {"Meyer", "Müller"},
> language\ColognePhoneticTest.java:143 {"ganz", "Gänse"},
> language\DoubleMetaphoneTest.java:1222
> this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
> language\DoubleMetaphoneTest.java:1227
> this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
> language\SoundexTest.java:367 if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:369 Assert.assertEquals("´┐¢000",
> this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:375 Assert.assertEquals("",
> this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:387 if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:389 Assert.assertEquals("´┐¢000",
> this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:395 Assert.assertEquals("",
> this.getSoundexEncoder().encode("´┐¢"));
> {code}
> The characters are probably not correct above, because I used a crude perl
> script to find them:
> {code}
> perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;"
> */*.java
> {code}
> language\SoundexTest.java:367 in particular is incorrect, because it's
> supposed to be a single character.
> Now one might think that native2ascii -encoding UTF-8 would fix that, but it
> gives:
> if (Character.isLetter('\ufffd'))
> which is an "unknown" character.
> Similarly for binary\Base64Test.java:96.
> It's not all that clear what the Unicode escapes should be in these cases,
> but probably not the unknown character.
> [Possibly the characters got mangled at some point, or maybe they have always
> been wrong]
> The ColognePhoneticTest.java cases are less serious, as the characters are
> valid ISO-8859-1 (accented German), but given that the rest of the file uses
> Unicode escapes, I think they should be changed too (but add comments to say
> what they are, e.g. o-umlaut, u-umlaut).
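
The \ufffd result described in the issue is what one would expect whenever the bytes in the file are not valid UTF-8, for example if the character was saved as ISO-8859-1. The following is a rough sketch of the decode-then-escape step, not the real native2ascii source. The class and method names are made up, and the ISO-8859-1 input is only an assumed example of how the character could have been stored.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for what "native2ascii -encoding X" roughly does.
class Native2AsciiSketch {

    /** Decodes raw file bytes with the given charset, then escapes every non-ASCII char. */
    static String toAscii(byte[] fileBytes, Charset encoding) {
        // This constructor replaces malformed input with U+FFFD instead of failing.
        String decoded = new String(fileBytes, encoding);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                // Emit a backslash, the letter u and four lowercase hex digits.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "M\u00fcller"; // u-umlaut, kept as an escape so this file stays ASCII
        byte[] asUtf8 = name.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = name.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(toAscii(asUtf8, StandardCharsets.UTF_8));
        System.out.println(toAscii(asLatin1, StandardCharsets.UTF_8));
    }
}
```

Run as is, the first line printed is M\u00fcller and the second is M\ufffdller. Once a byte has been decoded to the unknown character, the original value can no longer be recovered from the escape, so the correct escapes for SoundexTest and Base64Test would have to come from the file history or from what the tests are meant to check.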
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira