On Sep 24 18:37, IWAMURO Motonori wrote:
> 2009/9/24 Corinna Vinschen <corinna-cyg...@cygwin.com>:
> > On Sep 24 16:03, IWAMURO Motonori wrote:
> >> 2009/9/22 Andy Koppe <andy.ko...@gmail.com>:
> >> > Let's use the Windows "ANSI" codepage as the character set for the C
> >> > locale, for both the conversion functions and filenames. This means
> >> > CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
> >> > ones, and so on.
> >>
> >> I oppose the approach (the ANSI codepage is used at C locale) because
> >> CP932 (the codepage for Japanese) is hostile to the UNIX-like tools.
> >>
> >> The reason is that the CP932 format contains a lot of meta characters
> >> as follows.
> >>
> >> single character of CP932:
> >> /[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/
> >
> > I don't understand. Are you saying that the single character in CP932
> > consists of 12 bytes? As far as I can see, CP932 is S-JIS, which
> > is a just a simple double byte character set. What am I missing.
>
> - CP932 (Shift_JIS) has 1byte character and 2bytes character.
>
> - The range of 1byte character is 0x00-0x7F and 0xA0-0xDF.
>
> - The range of first byte of 2byte character is 0x80-0x9F and 0xE0-0xFC.
>
> - The range of second byte of 2byte character is 0x40-7E and 0x80-0xFC.
> This includes "[", "\", "]", "^", "`", "{", "|", "}".
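To make the byte ranges quoted above concrete, here is a minimal sketch in C. It is not taken from Cygwin or newlib; the helper names (sjis_is_lead, sjis_is_trail, sjis_count_meta_trail_bytes) are hypothetical. The quoted regex starts the lead-byte range at 0x81 while the list says 0x80; the sketch assumes the regex value. It counts trail bytes that a byte-oriented tool would mistake for one of the listed ASCII metacharacters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Lead byte of a two-byte CP932 character (assumed range, taken from
   the quoted regex: 0x81-0x9f and 0xe0-0xfc). */
static bool sjis_is_lead(unsigned char c)
{
  return (c >= 0x81 && c <= 0x9f) || (c >= 0xe0 && c <= 0xfc);
}

/* Trail byte of a two-byte CP932 character: 0x40-0x7e and 0x80-0xfc.
   The first range overlaps printable ASCII. */
static bool sjis_is_trail(unsigned char c)
{
  return (c >= 0x40 && c <= 0x7e) || (c >= 0x80 && c <= 0xfc);
}

/* Hypothetical helper: walk a CP932 byte string and count the trail
   bytes that are identical to one of the ASCII metacharacters
   [ \ ] ^ ` { | }.  Plain one-byte ASCII occurrences are not counted. */
static size_t sjis_count_meta_trail_bytes(const unsigned char *s, size_t len)
{
  static const char meta[] = "[\\]^`{|}";
  size_t hits = 0;

  assert(s != NULL || len == 0);
  for (size_t i = 0; i < len; i++) {
    if (sjis_is_lead(s[i]) && i + 1 < len && sjis_is_trail(s[i + 1])) {
      if (strchr(meta, s[i + 1]) != NULL)
        hits++;
      i++;  /* skip the trail byte, it belongs to this character */
    }
  }
  return hits;
}
```

For example, the katakana character encoded as 0x83,0x5c in CP932 has a trail byte equal to the ASCII backslash, so the helper reports one hit for that two-byte string, while a plain ASCII string containing a backslash reports none.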

Ok, thanks for your examples, they show neatly where the problem is.

As you might know, the codepage 20932 (EUC-JP) is also not the same as
the UNIX EUC_JP implementation.  The JIS-X-0212 three byte codes are
folded into two-byte sequences as described in a comment in strfuncs.cc:

  /* Unfortunately, the Windows eucJP codepage 20932 is not really 100%
     compatible to eucJP.  It's a cute approximation which makes it a
     doublebyte codepage.
     The JIS-X-0212 three byte codes (0x8f,0xa1-0xfe,0xa1-0xfe) are folded
     into two byte codes as follows:  The 0x8f is stripped, the next byte
     is taken as is, the third byte is mapped into the lower 7-bit area by
     masking it with 0x7f.  So, for instance, the eucJP code 0x8f,0xdd,0xf8
     becomes 0xdd,0x78 in CP 20932.

     To be really eucJP compatible, we have to map the JIS-X-0212
     characters between CP 20932 and eucJP ourselves. */

My question is this:  Is the S-JIS implementation on UNIX systems also
using a different implementation to avoid using characters from the
ASCII range?  If so, can't we change the __sjis_wctomb and __sjis_mbtowc
functions in the same manner as the __eucjp_wctomb and __eucjp_mbtowc
functions to get a safer implementation?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
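
As an illustration of the folding described in the strfuncs.cc comment above, here is a minimal sketch in C. It is not the actual Cygwin code; the function names eucjp_fold_0212 and eucjp_unfold_0212 are hypothetical. The reverse direction is an assumption derived from the masking rule: it treats any pair with a first byte in 0xa1-0xfe and a second byte in 0x21-0x7e as a folded JIS-X-0212 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fold a three-byte eucJP JIS-X-0212 sequence (0x8f,0xa1-0xfe,0xa1-0xfe)
   into the two-byte CP 20932 form: strip 0x8f, keep the second byte,
   mask the third byte with 0x7f.  Returns false for any other input. */
static bool eucjp_fold_0212(const unsigned char in[3], unsigned char out[2])
{
  assert(in != NULL && out != NULL);
  if (in[0] != 0x8f
      || in[1] < 0xa1 || in[1] > 0xfe
      || in[2] < 0xa1 || in[2] > 0xfe)
    return false;
  out[0] = in[1];
  out[1] = in[2] & 0x7f;
  return true;
}

/* Assumed inverse: a CP 20932 pair whose second byte lies in the lower
   7-bit area (0x21-0x7e) is expanded back to the three-byte eucJP form
   by prepending 0x8f and setting the high bit of the second byte. */
static bool eucjp_unfold_0212(const unsigned char in[2], unsigned char out[3])
{
  assert(in != NULL && out != NULL);
  if (in[0] < 0xa1 || in[0] > 0xfe || in[1] < 0x21 || in[1] > 0x7e)
    return false;
  out[0] = 0x8f;
  out[1] = in[0];
  out[2] = in[1] | 0x80;
  return true;
}
```

With the example from the comment, folding 0x8f,0xdd,0xf8 yields 0xdd,0x78, and unfolding 0xdd,0x78 gives back 0x8f,0xdd,0xf8. Note that the folded second byte lands in the ASCII range, which is the same kind of overlap discussed for CP932.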