Sylvie,

On 9/22/2009 11:01 AM, Sylvie Perrin wrote:
> The cause was the LC_ALL variable in my script starting tomcat.
> I set it to fr_FR.UTF-8 as you suggest and now, my test is OK !

I wonder if Java uses the file.encoding system property (which is set
from the portion of $LC_ALL after the ".") to convert bytes returned
from the filesystem into filenames and vice versa.

Yeah, that appears to be the case:

    import java.io.*;

    public class FileEncodingTest
    {
        public static void main(String[] args)
            throws Exception
        {
            System.out.println("Using file.encoding="
                               + System.getProperty("file.encoding"));

            File file = new File("\u03c0"); // That's a lowercase Greek pi

            Writer out = new FileWriter(file);
            out.write("A test file\n");
            out.close();

            file = new File(".");

            File[] files = file.listFiles();
            for(int i=0; i<files.length; ++i)
            {
                file = files[i];

                System.out.print(file.getName());
                System.out.print("\tunicode: ");

                // "UnicodeBigUnmarked" is UTF-16BE without a BOM. Trust me
                byte[] bytes = file.getName().getBytes("UnicodeBigUnmarked");

                for(int j=0; j<bytes.length; ++j)
                {
                    // Mask to 0..255 so bytes >= 0x80 aren't sign-extended
                    String hex = Integer.toHexString(bytes[j] & 0xff);
                    if(1 == hex.length())
                        System.out.print("0");
                    System.out.print(hex);
                    System.out.print(" ");
                }

                System.out.println();
            }
        }
    }

Output on my system:

    $ java FileEncodingTest
    Using file.encoding=ANSI_X3.4-1968
    FileEncodingTest.class  unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00 63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00 6c 00 61 00 73 00 73
    FileEncodingTest.java   unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00 63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00 61 00 76 00 61
    ?       unicode: 00 3f

    $ LC_ALL=en_US.UTF-8 java FileEncodingTest
    Using file.encoding=UTF-8
    FileEncodingTest.class  unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00 63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00 6c 00 61 00 73 00 73
    FileEncodingTest.java   unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00 63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00 61 00 76 00 61
    ?       unicode: 00 3f
    ?       unicode: 03 c0

(/this correctly emitted the glyph for pi/)

Then, for good measure:

    $ java FileEncodingTest
    Using file.encoding=ANSI_X3.4-1968
    FileEncodingTest.class  unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00 63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00 6c 00 61 00 73 00 73
    FileEncodingTest.java   unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00 63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00 61 00 76 00 61
    ?       unicode: 00 3f
    ??      unicode: ff fd ff fd

(/this did not/)

So, when running in ANSI_X3.4-1968 mode, Java takes the code point for
pi (0x03c0) and destroys it (note the filename that comes back: a single
character whose two-byte dump is 00 3f, i.e. a plain '?'). I'm not
really even sure how it does that... I'd have expected some broken
sign-extension or something but I have no idea how 0x03c0 becomes
0x003f.

When running in UTF-8 mode, the correct code point is used for the
filename and is read back correctly using listFiles.

When running again in ANSI mode, the original (incorrect) filename is
(predictably) read back the same way as before, but the filename with
the correct code point is now garbled (0x03c0 -> 0xfffdfffd).

Somebody needs to write a virus that just converts everything to UTF-8
so we can be done with it.

-chris
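
P.S. A likely explanation for 0x03c0 -> 0x003f and 0x03c0 -> 0xfffdfffd,
as far as I can tell: ANSI_X3.4-1968 is just US-ASCII. When Java encodes
a character that ASCII cannot represent, it substitutes the charset's
replacement byte, which is '?' (0x3f). Going the other way, pi is two
bytes in UTF-8 (0xcf 0x80), neither of which is valid ASCII, so each one
decodes to U+FFFD. The sketch below only imitates that with plain String
conversions; it is not what the JVM's filesystem code literally does.
The class and method names are made up for illustration, and
StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class AsciiReplacementDemo
{
    // Hex dump of a string's UTF-16BE bytes, e.g. "03 c0" for pi
    public static String hex(String s)
    {
        StringBuilder sb = new StringBuilder();
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16BE);
        for(int i=0; i<bytes.length; ++i)
        {
            if(i > 0)
                sb.append(' ');
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    // Encode to US-ASCII and decode again: what "writing" the name
    // in an ASCII locale would do to it
    public static String throughAscii(String s)
    {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    // Encode to UTF-8 but decode as US-ASCII: what "reading" a
    // UTF-8 name in an ASCII locale would do to it
    public static String utf8ReadAsAscii(String s)
    {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args)
    {
        String pi = "\u03c0";
        System.out.println("pi itself:           " + hex(pi));
        System.out.println("written as ASCII:    " + hex(throughAscii(pi)));
        System.out.println("UTF-8 read as ASCII: " + hex(utf8ReadAsAscii(pi)));
    }
}
```

This should print 03 c0, then 00 3f, then ff fd ff fd, which matches
the three filenames seen in the listFiles output.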