-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Sylvie,
On 9/22/2009 11:01 AM, Sylvie Perrin wrote:
> The cause was the LC_ALL variable in my script starting tomcat.
> I set it to fr_FR.UTF-8 as you suggest and now, my test is OK !
I wonder if Java uses the file.encoding system property (which is set by
the portion of $LC_ALL after the .) to convert bytes returned from the
filesystem into filenames and vice versa.
Yeah, that appears to be the case:
import java.io.*;

public class FileEncodingTest
{
    public static void main(String[] args)
        throws Exception
    {
        System.out.println("Using file.encoding=" +
                           System.getProperty("file.encoding"));

        // Create a file whose name is a single non-ASCII character
        File file = new File("\u03c0"); // That's a lowercase Greek pi
        Writer out = new FileWriter(file);
        out.write("A test file\n");
        out.close();

        // List the current directory and dump each name as UTF-16BE hex
        file = new File(".");
        File[] files = file.listFiles();
        for(int i=0; i<files.length; ++i)
        {
            file = files[i];
            System.out.print(file.getName());
            System.out.print("\tunicode: ");
            byte[] bytes =
                file.getName().getBytes("UnicodeBigUnmarked"); // Trust me (alias for UTF-16BE)
            for(int j=0; j<bytes.length; ++j)
            {
                // Mask to 0..255 so bytes >= 0x80 don't sign-extend to "ffffffxx"
                String hex = Integer.toHexString(bytes[j] & 0xff);
                if(1 == hex.length())
                    System.out.print("0");
                System.out.print(hex);
                System.out.print(" ");
            }
            System.out.println();
        }
    }
}
Output on my system:
$ java FileEncodingTest
Using file.encoding=ANSI_X3.4-1968
FileEncodingTest.class unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00
6c 00 61 00 73 00 73
FileEncodingTest.java unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00
61 00 76 00 61
? unicode: 00 3f
$ LC_ALL=en_US.UTF-8 java FileEncodingTest
Using file.encoding=UTF-8
FileEncodingTest.class unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00
6c 00 61 00 73 00 73
FileEncodingTest.java unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00
61 00 76 00 61
? unicode: 00 3f
π unicode: 03 c0 (/this correctly emitted the glyph for pi/)
Then, for good measure:
$ java FileEncodingTest
Using file.encoding=ANSI_X3.4-1968
FileEncodingTest.class unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00
6c 00 61 00 73 00 73
FileEncodingTest.java unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00
61 00 76 00 61
? unicode: 00 3f
?? unicode: ff fd ff fd (/this did not/)
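So, when running in ANSI_X3.4-1968 mode, Java takes the code point for pi
(0x03c0) and destroys it (note that the filename comes back as the single
character 0x003f; the leading 00 is just the high byte of the two-byte
UTF-16 dump). I'm not really even sure how it does that... I'd have
expected some broken sign-extension or something, but 0x003f is a plain
ASCII '?', so it looks more like a substitution for a character the
encoding can't represent than like bit mangling.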
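For what it's worth, the 0x03c0 -> 0x003f result is consistent with what
Java's own charset encoders do with a character the target encoding can't
represent: they substitute '?' (0x3f). I can't say this is the exact code
path the JVM takes when it turns a filename into bytes for the OS, so treat
the following as a sketch of the analogous behavior using String.getBytes.
The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class AsciiEncodeDemo
{
    // Encode a string as US-ASCII and return the resulting bytes as
    // space-separated two-digit hex.
    static String toAsciiHex(String s)
    {
        // String.getBytes(Charset) replaces unmappable characters with the
        // charset's default replacement byte, which is '?' for US-ASCII.
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes)
        {
            if (sb.length() > 0)
                sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // Lowercase Greek pi has no ASCII representation
        System.out.println(toAsciiHex("\u03c0")); // prints "3f"
    }
}
```

That would explain why the file created in ANSI_X3.4-1968 mode is literally
named "?".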
When running in UTF-8 mode, the correct code point is used for the
filename and is read back correctly using listFiles.
When running again in ANSI mode, the original (incorrect) filename is
(predictably) read back the same way as before, but the
filename with the correct code point is again garbled (0x03c0 ->
0xfffdfffd).
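The 0xfffdfffd result can be reproduced in isolation, too. In UTF-8, pi is
stored on disk as the two bytes cf 80; neither is a valid ASCII byte, so an
ASCII decoder replaces each one with U+FFFD (the Unicode replacement
character). Again, this is a sketch using new String(bytes, charset), which
I'm assuming behaves like whatever the JVM does internally for filenames;
the names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class AsciiDecodeDemo
{
    // Take the UTF-8 bytes of a string and (mis)decode them as US-ASCII.
    static String decodeUtf8BytesAsAscii(String s)
    {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // Bytes >= 0x80 are malformed for US-ASCII; each one is replaced
        // with U+FFFD by the String constructor.
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    // Return each UTF-16 code unit of the string as four-digit hex.
    static String toCodeUnitHex(String s)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ++i)
        {
            if (i > 0)
                sb.append(' ');
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        String garbled = decodeUtf8BytesAsAscii("\u03c0");
        System.out.println(toCodeUnitHex(garbled)); // prints "fffd fffd"
    }
}
```

Two bytes in, two replacement characters out, which matches the "??"
entry in the listing above.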
Somebody needs to write a virus that just converts everything to UTF-8
so we can be done with it.
- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAkq47lAACgkQ9CaO5/Lv0PCDjwCfWTArE2PRo2XTeBgd3yGD+AyZ
dCUAnAo8aSsYUdgT/eJBvqMjWA0KzXwF
=OEyH
-----END PGP SIGNATURE-----