2009/9/22 Corinna Vinschen:
>> >> Therefore, when converting a UTF-16 Windows filename to the current
>> >> charset, 0xDC?? words should be treated like any other UTF-16 word
>> >> that can't be represented in the current charset: it should be encoded
>> >> as a ^N sequence.

(I started writing this before seeing your patch to the singlebyte codepage tables, which makes plenty of sense. Here goes anyway.)

Having actually looked at strfuncs.cc, my diagnosis was too simplistic, because the U+DC?? codes are used not only for invalid UTF-8 bytes, but for invalid bytes in any charset. This even includes CP1252, which has a few holes in the 0x80..0x9F range.

Therefore, the complete solution would be something like this: when sys_cp_wcstombs comes across a 0xDC?? code, it checks whether the byte it encodes is indeed an invalid byte in the current charset. If it is, it translates it into that invalid byte, because on the way back it would once again be turned into the same 0xDC?? code. If the byte would represent (part of) a valid character, however, it would need to be encoded as a ^N sequence to ensure correct roundtripping.

Now that shouldn't be too difficult to implement for singlebyte charsets, but it gets somewhat hairy for multibyte charsets, including UTF-8 itself. Here's how I think it could be done though:

In sys_cp_wcstombs:

* On encountering a DC?? code, extract the encoded byte and feed it into f_mbtowc. A private mbstate is needed for this, starting in the initial state for each filename. Switch on the result of f_mbtowc:
** case -2 (incomplete sequence): add the byte to a buffer kept for this purpose
** case -1 (invalid sequence): copy anything already in the buffer plus the current byte into the target filename, as we can be sure that they'll turn back into U+DCbb codes on the way back
** case >0 (valid sequence): encode the buffer contents and the current byte as ^N codes that don't represent valid UTF-8
* On encountering a non-DC?? code, copy any bytes left in the buffer into the target filename.

Unfortunately the latter point still leaves a loophole, in case the incomplete sequence from the buffer and the subsequent bytes combine into something valid.

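To make the steps above a bit more concrete, here is a rough, untested-against-Cygwin sketch. Everything named in it is made up for illustration: toy_mbtowc is a simplified byte-at-a-time UTF-8 checker standing in for f_mbtowc (it ignores overlong and surrogate forms), Utf8State stands in for the private mbstate, and escape_byte just emits a visible "^N<XX>" placeholder rather than the real ^N encoding. Non-DC?? words are assumed to be ASCII to keep it short, and flushing the buffer at the end of the name is my assumption, since that case isn't spelled out above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the private mbstate.
struct Utf8State {
  int need = 0;  // continuation bytes still expected
  int have = 0;  // bytes consumed for the current sequence
};

// Hypothetical stand-in for f_mbtowc, fed one byte at a time.
// Returns -2 (incomplete), -1 (invalid) or the sequence length (valid).
int toy_mbtowc(Utf8State &st, unsigned char b) {
  if (st.need == 0) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) st.need = 1;
    else if (b >= 0xE0 && b <= 0xEF) st.need = 2;
    else if (b >= 0xF0 && b <= 0xF4) st.need = 3;
    else return -1;  // stray continuation byte or illegal lead byte
    st.have = 1;
    return -2;
  }
  if ((b & 0xC0) != 0x80) {  // expected a continuation byte
    st = Utf8State();
    return -1;
  }
  ++st.have;
  if (--st.need > 0) return -2;
  int len = st.have;
  st = Utf8State();
  return len;
}

// Placeholder for the ^N escape; NOT the real on-disk encoding.
std::string escape_byte(unsigned char b) {
  static const char hex[] = "0123456789ABCDEF";
  return std::string("^N<") + hex[b >> 4] + hex[b & 15] + ">";
}

std::string convert_name(const std::vector<std::uint16_t> &name) {
  std::string out;
  std::string pending;  // DC?? bytes that form an incomplete sequence so far
  Utf8State st;         // starts in the initial state for each filename

  auto flush_raw = [&] {
    out += pending;
    pending.clear();
    st = Utf8State();
  };

  for (std::uint16_t w : name) {
    if ((w & 0xFF00) != 0xDC00) {
      flush_raw();  // non-DC?? code: leftover bytes are copied as they are
      out += (w < 0x80) ? char(w) : '?';  // real charset conversion omitted
      continue;
    }
    unsigned char b = w & 0xFF;
    int r = toy_mbtowc(st, b);
    pending += char(b);
    if (r == -1) {          // invalid: bytes will turn back into DC?? codes
      out += pending;
      pending.clear();
    } else if (r > 0) {     // valid: must be escaped to roundtrip
      for (unsigned char c : pending) out += escape_byte(c);
      pending.clear();
    }                       // r == -2: keep buffering
    assert(pending.size() <= 3);  // a UTF-8 sequence has at most 4 bytes
  }
  flush_raw();  // assumption: leftovers at the end of the name are invalid
  return out;
}
```

With this, a lone DC80 comes out as the raw byte 0x80, whereas DCC3 DCA9 (which together would spell a valid UTF-8 character) comes out as two escaped bytes. The flush on a non-DC?? code is exactly where the loophole mentioned above sits.
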
Singlebyte charsets aren't affected though, because they don't have continuation bytes. Nor is UTF-8, because it was designed such that continuation bytes are distinct from initial bytes. Which leaves the DBCS charsets.

However, it rather looks like DBCSs are an intractable problem here in any case, because of issues like this:

http://support.microsoft.com/kb/170559: "There are some codes that are not matched one-to-one between Shift-JIS (Japanese character set supported by MS) and Unicode. When an application calls MultiByteToWideChar() and WideCharToMultiByte() to perform code conversion between Shift-JIS and Unicode, the function returns the wrong code value in some cases."

Which leaves me scratching my head regarding the C locale. More later ...

Andy