Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Wed, Sep 23, 2009 at 5:30 PM, Ross Smith wrote:
> Corinna Vinschen wrote:
>>
>> However, if we default to UTF-8 for a subset of languages anyway, it
>> gets even more interesting to ask, why not for all languages?  Isn't it
>> better in the long run to have the same default for all Cygwin
>> installations?
>>
>> I'm really wondering if we shouldn't simply default to UTF-8 as charset
>> throughout, in the application, the console, and for the filename
>> conversion.  Yes, not all applications will work OOTB with chars > 0x7f,
>> but it was always a bug to make any assumptions for non-ASCII chars
>> in the C locale.  Applications can be fixed, right?
>
> In support of this plan, it occurs to me that any command line
> applications that don't speak UTF-8 would presumably be showing the
> same behaviour on Linux (e.g. odd column widths). Since one of Cygwin's
> main goals is providing a Linux-like environment on Windows, I don't
> think Cygwin developers should feel obliged to go out of their way to
> do _better_ than Linux in this regard.
>
> -- Ross Smith

I don't have anything to add on the technical side of things, but I
will note that most Linux distributions have been defaulting to UTF-8
lately.  I think it would be highly appropriate to default to UTF-8 in
Cygwin.

Robert Pendell
shi...@elite-systems.org
"A perfect world is one of chaos."

Thawte Web of Trust Notary
CAcert Assurer
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Corinna Vinschen wrote:
>
> However, if we default to UTF-8 for a subset of languages anyway, it
> gets even more interesting to ask, why not for all languages?  Isn't it
> better in the long run to have the same default for all Cygwin
> installations?
>
> I'm really wondering if we shouldn't simply default to UTF-8 as charset
> throughout, in the application, the console, and for the filename
> conversion.  Yes, not all applications will work OOTB with chars > 0x7f,
> but it was always a bug to make any assumptions for non-ASCII chars
> in the C locale.  Applications can be fixed, right?

In support of this plan, it occurs to me that any command line
applications that don't speak UTF-8 would presumably be showing the
same behaviour on Linux (e.g. odd column widths). Since one of Cygwin's
main goals is providing a Linux-like environment on Windows, I don't
think Cygwin developers should feel obliged to go out of their way to
do _better_ than Linux in this regard.

-- Ross Smith
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 23 14:43, Corinna Vinschen wrote:
> On Sep 23 13:34, Andy Koppe wrote:
> > 2009/9/23 Corinna Vinschen:
> > > I have a local patch ready to use the ANSI codepage by default in the
> > > "C" locale.  It appears to work nicely and has the additional positive
> > > side effect to simplify the code in a few places.
> > >
> > > If I only knew that eastern language users could happily live with
> > > this change as well!
> >
> > Here's an idea to circumvent the DBCS troubles: default to UTF-8 when
> > no charset is specified in the locale and the ANSI charset isn't
> > singlebyte.
> >
> > Based on the following grounds:
> > - Full CJK support (and more) out of the box.
> > - DBCSs can't have worked very well in 1.5 in the first place, because
> >   the shell and most applications weren't aware of double-byte
> >   characters. Hence backward compatibility is less of an issue here.
> > - Applications that don't (yet) work with UTF-8 are also unlikely to
> >   work correctly with DBCSs.
> > - Iwamuro Motonori asked for it.
>
> Yeah, I was tinkering with this idea, too, but it's much more tricky to
> implement.
>
> I'll think about it.

Turns out, it's not complicated at all.

However, if we default to UTF-8 for a subset of languages anyway, it
gets even more interesting to ask, why not for all languages?  Isn't it
better in the long run to have the same default for all Cygwin
installations?

I'm really wondering if we shouldn't simply default to UTF-8 as charset
throughout, in the application, the console, and for the filename
conversion.  Yes, not all applications will work OOTB with chars > 0x7f,
but it was always a bug to make any assumptions for non-ASCII chars
in the C locale.  Applications can be fixed, right?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 23 13:34, Andy Koppe wrote:
> 2009/9/23 Corinna Vinschen:
> > I have a local patch ready to use the ANSI codepage by default in the
> > "C" locale.  It appears to work nicely and has the additional positive
> > side effect to simplify the code in a few places.
> >
> > If I only knew that eastern language users could happily live with
> > this change as well!
>
> Here's an idea to circumvent the DBCS troubles: default to UTF-8 when
> no charset is specified in the locale and the ANSI charset isn't
> singlebyte.
>
> Based on the following grounds:
> - Full CJK support (and more) out of the box.
> - DBCSs can't have worked very well in 1.5 in the first place, because
>   the shell and most applications weren't aware of double-byte
>   characters. Hence backward compatibility is less of an issue here.
> - Applications that don't (yet) work with UTF-8 are also unlikely to
>   work correctly with DBCSs.
> - Iwamuro Motonori asked for it.

Yeah, I was tinkering with this idea, too, but it's much more tricky to
implement.

I'll think about it.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
2009/9/23 Corinna Vinschen:
> I have a local patch ready to use the ANSI codepage by default in the
> "C" locale.  It appears to work nicely and has the additional positive
> side effect to simplify the code in a few places.
>
> If I only knew that eastern language users could happily live with
> this change as well!

Here's an idea to circumvent the DBCS troubles: default to UTF-8 when
no charset is specified in the locale and the ANSI charset isn't
singlebyte.

Based on the following grounds:
- Full CJK support (and more) out of the box.
- DBCSs can't have worked very well in 1.5 in the first place, because
  the shell and most applications weren't aware of double-byte
  characters. Hence backward compatibility is less of an issue here.
- Applications that don't (yet) work with UTF-8 are also unlikely to
  work correctly with DBCSs.
- Iwamuro Motonori asked for it.

Andy
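[Editorial note: a minimal sketch of the fallback Andy proposes, using
the Win32 calls GetACP() and GetCPInfo() to test whether the ANSI
codepage is singlebyte.  The helper name default_charset_for_C_locale
is hypothetical, not Cygwin's actual internal interface.]

#include <stdio.h>
#include <windows.h>

/* Keep the ANSI codepage as default charset if it is singlebyte,
   otherwise (932, 936, 949, 950, ...) fall back to UTF-8.  */
static const char *
default_charset_for_C_locale (void)
{
  static char buf[16];
  CPINFO ci;
  UINT acp = GetACP ();           /* ANSI codepage, e.g. 1252 or 932 */

  if (GetCPInfo (acp, &ci) && ci.MaxCharSize == 1)
    {
      sprintf (buf, "CP%u", acp); /* singlebyte, e.g. "CP1252" */
      return buf;
    }
  return "UTF-8";                 /* double/multibyte ANSI codepage */
}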
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 22 19:07, Corinna Vinschen wrote:
> On Sep 22 17:12, Andy Koppe wrote:
> > True, but that's an implementation issue rather than a design issue,
> > i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
> > than invoke the __utf8 functions. Shall I look into creating a patch?
> [...]
> Hmm... maybe it's not that complicated.  The ^N case checks for a valid
> UTF-8 lead byte right now.  The U+DCxx case could be handled by
> generating (in sys_cp_wcstombs) and recognizing (in sys_cp_mbstowcs) a
> non-valid lead byte, like 0xff.

I applied a patch for that.  It wasn't very tricky, but while doing it,
I found a couple of annoyances in the conversion functions related to
the invalid character handling.  So the patch is somewhat bigger than
anticipated.

> Only singlebyte charsets are off the hook.  So, your proposal to switch
> to the default ANSI codepage for the C locale would be good for most
> western languages, but it would still leave the eastern language users
> with double-byte charsets behind.
>
> Note that I'm not as opposed to your proposal to use the ANSI codepage
> as before this discussion.  But I would like to see that the solution
> works for most eastern language users as well.

I have a local patch ready to use the ANSI codepage by default in the
"C" locale.  It appears to work nicely and has the additional positive
side effect to simplify the code in a few places.

If I only knew that eastern language users could happily live with
this change as well!

*** REQUEST FOR HELP ***

Is anybody here set up to build the Cygwin DLL *and* working with an
eastern language Windows, namely using the codepages 932 (SJIS), 936
(GBK), 949 (EUC-KR), or 950 (Big5)?  If so, please build your own
Cygwin DLL using the latest from CVS plus the attached patch, and test
if this setting works for you.

The change will result in using your default Windows codepage in the
"C" locale in all components, that is, in the application itself, as
well as in the console and the filename conversion.  In contrast to the
current implementation using UTF-8 for filename conversion by default,
there will be no state anymore in which the application, the console
window, and the filename conversion routine have a different idea of
the charset to use(*).

Thanks in advance,
Corinna

(*) Except when the application switches the console to the "alternate
    charset", which usually happens when it's going to print
    semi-graphical frame and block characters.


Index: newlib/libc/locale/locale.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/locale/locale.c,v
retrieving revision 1.25
diff -u -p -r1.25 locale.c
--- newlib/libc/locale/locale.c	25 Aug 2009 18:47:24 -0000	1.25
+++ newlib/libc/locale/locale.c	23 Sep 2009 11:53:02 -0000
@@ -61,6 +61,11 @@ backward compatibility with older implem
 xxx in [437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874,
 1125, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258].
 
+Instead of <<"C-">>, you can specify also <<"C.">>.  Both variations allow
+to specify language neutral locales while using other charsets than ASCII,
+for instance <<"C.UTF-8">>, which keeps all settings as in the C locale,
+but uses the UTF-8 charset.
+
 Even when using POSIX locale strings, the only charsets allowed are
 <<"UTF-8">>, <<"JIS">>, <<"EUCJP">>, <<"SJIS">>, <>, <>,
 <<"ISO-8859-x">> with 1 <= x <= 15, or <<"CPxxx">> with xxx in
@@ -431,9 +436,19 @@ loadlocale(struct _reent *p, int categor
   if (!strcmp (locale, "POSIX"))
     strcpy (locale, "C");
   if (!strcmp (locale, "C"))			/* Default "C" locale */
+#ifdef __CYGWIN__
+    __set_charset_from_codepage (GetACP (), charset);
+#else
     strcpy (charset, "ASCII");
-  else if (locale[0] == 'C' && locale[1] == '-')	/* Old newlib style */
-    strcpy (charset, locale + 2);
+#endif
+  else if (locale[0] == 'C'
+	   && (locale[1] == '-'		/* Old newlib style */
+	       || locale[1] == '.'))	/* Extension for the C locale to allow
+					   specifying different charsets while
+					   sticking to the C locale in terms
+					   of sort order, etc.  Proposed in
+					   the Debian project. */
+    strcpy (charset, locale + 2);
   else						/* POSIX style */
     {
       char *c = locale;

Index: newlib/libc/stdlib/sb_charsets.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/stdlib/sb_charsets.c,v
retrieving revision 1.3
diff -u -p -r1.3 sb_charsets.c
--- newlib/libc/stdlib/sb_charsets.c	25 Aug 2009 18:47:24 -0000	1.3
+++ newlib/libc/stdlib/sb_charsets.c	23 Sep 2009 11:53:02 -0000
@@ -24,17 +24,17 @@ wchar_t __iso_8859_conv[14][0x60] = {
   0x111, 0x144, 0x148, 0xf3, 0xf4, 0x151, 0xf6, 0xf7,
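[Editorial note: the patch above calls a Cygwin-internal
__set_charset_from_codepage().  A hypothetical sketch of such a
mapping, covering only the codepage/charset pairs named in this thread;
the real function is more complete and the charset names it produces
may differ.]

#include <stdio.h>
#include <string.h>

static void
set_charset_from_codepage (unsigned cp, char *charset)
{
  switch (cp)
    {
    case 932: strcpy (charset, "SJIS");   break;  /* Japanese */
    case 936: strcpy (charset, "GBK");    break;  /* Simplified Chinese */
    case 949: strcpy (charset, "EUC-KR"); break;  /* Korean */
    case 950: strcpy (charset, "BIG5");   break;  /* Traditional Chinese */
    default:
      /* Singlebyte ANSI codepages such as 1250..1258 keep a CPxxx name. */
      sprintf (charset, "CP%u", cp);
      break;
    }
}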
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
2009/9/22 Corinna Vinschen:
>> >> Therefore, when converting a UTF-16 Windows filename to the current
>> >> charset, 0xDC?? words should be treated like any other UTF-16 word
>> >> that can't be represented in the current charset: it should be encoded
>> >> as a ^N sequence.

(I started writing this before seeing your patch to the singlebyte
codepage tables, which makes plenty of sense. Here goes anyway.)

Having actually looked at strfuncs.cc, my diagnosis was too simplistic,
because the U+DC?? codes are used not only for invalid UTF-8 bytes, but
for invalid bytes in any charset. This even includes CP1252, which has
a few holes in the 0x80..0x9F range.

Therefore, the complete solution would be something like this: when
sys_cp_wcstombs comes across a 0xDC?? code, it checks whether the byte
it encodes is indeed an invalid byte in the current charset. If it is,
it translates it into that invalid byte, because on the way back it
would once again be turned into the same 0xDC?? code. If the byte would
represent (part of) a valid character, however, it would need to be
encoded as a ^N sequence to ensure correct roundtripping.

Now that shouldn't be too difficult to implement for singlebyte
charsets, but it gets somewhat hairy for multibyte charsets, including
UTF-8 itself. Here's how I think it could be done though:

In sys_cp_wcstombs:
* On encountering a DC?? code, extract the encoded byte, and feed it
  into f_mbtowc. A private mbstate for this is needed, starting in the
  initial state for each filename. Switch on the result of f_mbtowc:
** case -2 (incomplete sequence): add the byte to a buffer for this
   purpose
** case -1 (invalid sequence): copy anything already in the buffer plus
   the current byte into the target filename, as we can be sure that
   they'll turn back into U+DCbb again on the way back.
** case >0 (valid sequence): encode buffer contents and current byte as
   ^N codes that don't represent valid UTF-8
* When encountering a non-DC?? code, copy any bytes left in the buffer
  into the target filename.

Unfortunately the latter point still leaves a loophole, in case the
incomplete sequence from the buffer and the subsequent bytes combine
into something valid. Singlebyte charsets aren't affected though,
because they don't have continuation bytes. Nor is UTF-8, because it
was designed such that continuation bytes are distinct from initial
bytes. Which leaves the DBCS charsets.

However, it rather looks like DBCSs are an intractable problem here in
any case, because of issues like this:

http://support.microsoft.com/kb/170559: "There are some codes that are
not matched one-to-one between Shift-JIS (Japanese character set
supported by MS) and Unicode. When an application calls
MultiByteToWideChar() and WideCharToMultiByte() to perform code
conversion between Shift-JIS and Unicode, the function returns the
wrong code value in some cases."

Which leaves me scratching my head regarding the C locale. More later ...

Andy
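[Editorial note: a sketch of the classification step Andy describes,
with the standard mbrtowc() standing in for Cygwin's internal f_mbtowc;
the enum and function names are hypothetical.  The caller would clear
the pending buffer after the INVALID and COMPLETES_CHAR cases.]

#include <string.h>
#include <wchar.h>

enum byte_class { BYTE_INCOMPLETE, BYTE_INVALID, BYTE_COMPLETES_CHAR };

/* Feed one byte extracted from a U+DCxx code into the current charset's
   decoder, using a private shift state and a buffer of pending bytes.  */
static enum byte_class
classify_escaped_byte (char byte, char *buf, size_t *buflen, mbstate_t *ps)
{
  wchar_t wc;

  buf[(*buflen)++] = byte;
  switch (mbrtowc (&wc, buf + *buflen - 1, 1, ps))
    {
    case (size_t) -2:
      return BYTE_INCOMPLETE;        /* keep the byte buffered */
    case (size_t) -1:
      memset (ps, 0, sizeof *ps);    /* reset shift state after bad input */
      return BYTE_INVALID;           /* emit buffered bytes verbatim: they
                                        round-trip to U+DCxx again */
    default:
      return BYTE_COMPLETES_CHAR;    /* emit the buffer as a ^N sequence */
    }
}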
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 22 17:12, Andy Koppe wrote:
> 2009/9/22 Corinna Vinschen:
> >> Therefore, when converting a UTF-16 Windows filename to the current
> >> charset, 0xDC?? words should be treated like any other UTF-16 word
> >> that can't be represented in the current charset: it should be encoded
> >> as a ^N sequence.
> >
> > How? Just like the incoming multibyte character didn't represent a valid
> > UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
> > Therefore, the ^N conversion will fail since U+DCxx can't be converted
> > to valid UTF-8.
>
> True, but that's an implementation issue rather than a design issue,
> i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
> than invoke the __utf8 functions. Shall I look into creating a patch?

Well, sure I'm interested to see that patch (lazy me), but please note
that we need a snail mailed copyright assignment per
http://cygwin.com/assign.txt from you before we can apply any
significant patches.  Sorry for the hassle.

Hmm... maybe it's not that complicated.  The ^N case checks for a valid
UTF-8 lead byte right now.  The U+DCxx case could be handled by
generating (in sys_cp_wcstombs) and recognizing (in sys_cp_mbstowcs) a
non-valid lead byte, like 0xff.

> >> This won't work correctly, because different POSIX filenames will map
> >> to the same Windows filename. For example, the filenames "\xC3\xA4"
> >> (valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
> >> represents a-umlaut in 8859-1), will both map to Windows filename
> >> "U+00C4", i.e a-umlaut in UTF-16. Furthermore, after creating a file
> >> called "\xC4", a readdir() would show that file as "\xC3\xA4".
> >
> > Right, but using your above suggestion will also lead to another filename
> > in readdir, it would just be \x0e\xsome\xthing.
>
> I don't think the suggestion above is directly relevant to the problem
> I tried to highlight here.
>
> Currently, with UTF-8 filename encodings, "\xC3\xA4" turns into U+00C4
> on disk, while "\xC4" turns into U+DCC4, and converting back yields
> the original separate filenames.

Well, right now it doesn't exactly.

> If I understand your proposal
> correctly, both "\xC3\xA4" and "\xC4" would turn into U+00C4, hence
> converting back would yield "\xC3\xA4" for both. This is wrong. Those
> filenames shouldn't be clobbering each other, and a filename shouldn't
> change between open() and readdir(), certainly not without switching
> charset inbetween.

I see your point.  I was more thinking along the lines of how likely
that clobbering is, apart from pathological testcases.

> Having said that, if you did switch charset from UTF-8 e.g. to
> ISO-8859-1, the on-disk U+DCC4 would indeed turn into
> "\x0E\xsome\xthing". However, that issue applies to any UTF-16

You don't have to switch the charset.  Assume you're using any
non-singlebyte charset in which \xC4 is the start of a double- or
multibyte sequence.

  open ("\xC4"); close; readdir();

will return "\x0E\xsome\xthing" on readdir.  Only singlebyte charsets
are off the hook.  So, your proposal to switch to the default ANSI
codepage for the C locale would be good for most western languages, but
it would still leave the eastern language users with double-byte
charsets behind.

Note that I'm not as opposed to your proposal to use the ANSI codepage
as before this discussion.  But I would like to see that the solution
works for most eastern language users as well.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
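[Editorial note: a hypothetical sketch of the scheme Corinna outlines,
representing U+DCxx on the multibyte side as a ^N sequence introduced by
the never-valid UTF-8 lead byte 0xff.  The exact byte layout and the
function names are illustrative, not what Cygwin actually implements.]

#include <stddef.h>
#include <wchar.h>

#define ESC_CHAR 0x0e  /* ^N */

/* sys_cp_wcstombs side: U+DCxx -> ^N 0xff xx */
static int
encode_dcxx (wchar_t wc, unsigned char *out)
{
  if ((wc & 0xff00) != 0xdc00)
    return 0;                          /* not an escaped invalid byte */
  out[0] = ESC_CHAR;
  out[1] = 0xff;                       /* never a valid UTF-8 lead byte */
  out[2] = (unsigned char) (wc & 0xff);
  return 3;
}

/* sys_cp_mbstowcs side: ^N 0xff xx -> U+DCxx */
static int
decode_dcxx (const unsigned char *in, size_t len, wchar_t *wc)
{
  if (len < 3 || in[0] != ESC_CHAR || in[1] != 0xff)
    return 0;
  *wc = (wchar_t) (0xdc00 | in[2]);
  return 3;
}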
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
2009/9/22 Corinna Vinschen:
>> > As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
>> > transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
>> > The problem now is that readdir() will return the transposed characters
>> > as if they are the original characters.
>>
>> Yep, that's where the bug is. Those 0xDC?? words represent invalid
>> UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.
>>
>> Therefore, when converting a UTF-16 Windows filename to the current
>> charset, 0xDC?? words should be treated like any other UTF-16 word
>> that can't be represented in the current charset: it should be encoded
>> as a ^N sequence.
>
> How? Just like the incoming multibyte character didn't represent a valid
> UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
> Therefore, the ^N conversion will fail since U+DCxx can't be converted
> to valid UTF-8.

True, but that's an implementation issue rather than a design issue,
i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
than invoke the __utf8 functions. Shall I look into creating a patch?

>> > So it looks like the current mechanism to handle invalid multibyte
>> > sequences is too complicated for us.  As far as I can see, it would be
>> > much simpler and less error prone to translate the invalid bytes simply
>> > to the equivalent UTF-16 value.  That creates filenames with UTF-16
>> > values from the ISO-8859-1 range.
>>
>> This won't work correctly, because different POSIX filenames will map
>> to the same Windows filename. For example, the filenames "\xC3\xA4"
>> (valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
>> represents a-umlaut in 8859-1), will both map to Windows filename
>> "U+00C4", i.e a-umlaut in UTF-16. Furthermore, after creating a file
>> called "\xC4", a readdir() would show that file as "\xC3\xA4".
>
> Right, but using your above suggestion will also lead to another filename
> in readdir, it would just be \x0e\xsome\xthing.

I don't think the suggestion above is directly relevant to the problem
I tried to highlight here.

Currently, with UTF-8 filename encodings, "\xC3\xA4" turns into U+00C4
on disk, while "\xC4" turns into U+DCC4, and converting back yields the
original separate filenames. If I understand your proposal correctly,
both "\xC3\xA4" and "\xC4" would turn into U+00C4, hence converting
back would yield "\xC3\xA4" for both. This is wrong. Those filenames
shouldn't be clobbering each other, and a filename shouldn't change
between open() and readdir(), certainly not without switching charset
inbetween.

Having said that, if you did switch charset from UTF-8 e.g. to
ISO-8859-1, the on-disk U+DCC4 would indeed turn into
"\x0E\xsome\xthing". However, that issue applies to any UTF-16
character not in the target charset, not just those funny U+DC?? codes
for representing invalid UTF-8 bytes.

The only way to avoid the POSIX filenames changing depending on locale
would be to assume UTF-8 for filenames no matter the locale charset.
That's an entirely different can of worms though, extending the
compatibility problems discussed on the "The C locale" thread to all
non-UTF-8 locales, and putting the onus for converting filenames on
applications.

Andy
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 21 19:54, Andy Koppe wrote:
> 2009/9/21 Corinna Vinschen:
> > As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
> > transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
> > The problem now is that readdir() will return the transposed characters
> > as if they are the original characters.
>
> Yep, that's where the bug is. Those 0xDC?? words represent invalid
> UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.
>
> Therefore, when converting a UTF-16 Windows filename to the current
> charset, 0xDC?? words should be treated like any other UTF-16 word
> that can't be represented in the current charset: it should be encoded
> as a ^N sequence.

How?  Just like the incoming multibyte character didn't represent a
valid UTF-8 char, a single U+DCxx value does not represent a valid
UTF-16 char.  Therefore, the ^N conversion will fail since U+DCxx can't
be converted to valid UTF-8.

> > So it looks like the current mechanism to handle invalid multibyte
> > sequences is too complicated for us.  As far as I can see, it would be
> > much simpler and less error prone to translate the invalid bytes simply
> > to the equivalent UTF-16 value.  That creates filenames with UTF-16
> > values from the ISO-8859-1 range.
>
> This won't work correctly, because different POSIX filenames will map
> to the same Windows filename. For example, the filenames "\xC3\xA4"
> (valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
> represents a-umlaut in 8859-1), will both map to Windows filename
> "U+00C4", i.e a-umlaut in UTF-16. Furthermore, after creating a file
> called "\xC4", a readdir() would show that file as "\xC3\xA4".

Right, but using your above suggestion will also lead to another
filename in readdir, it would just be \x0e\xsome\xthing.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
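[Editorial note: a minimal sketch of the transposition described in
this exchange: an invalid byte b >= 0x80 travels through UTF-16 as
0xDC00 | b, landing in the 0xDC80..0xDCFF low-surrogate range, and is
recovered by masking.  Illustrative only, not Cygwin's actual
sys_cp_* code.]

#include <wchar.h>

static wchar_t
escape_invalid_byte (unsigned char b)  /* b >= 0x80, invalid in charset */
{
  return (wchar_t) (0xdc00 | b);       /* e.g. 0xC4 -> U+DCC4 */
}

static int
unescape_invalid_byte (wchar_t wc, unsigned char *b)
{
  if (wc >= 0xdc80 && wc <= 0xdcff)    /* escaped invalid byte */
    {
      *b = (unsigned char) (wc & 0xff);
      return 1;
    }
  return 0;                            /* ordinary UTF-16 code unit */
}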
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
2009/9/21 Corinna Vinschen:
>> % cat t.c
>> int main() {
>>     fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
>>     fopen("b-\xF6\xE4\xFC\xDFz", "w");
>>     fopen("c-\xF6\xE4\xFC\xDFzz", "w");
>>     fopen("d-\xF6\xE4\xFC\xDFzzz", "w");
>>     fopen("e-\xF6\xE4\xFC\xDF\xF6\xE4\xFC\xDF", "w");
>>     return 0;
>> }
>
> Ok, I see what happens.  The problem is that the mechanism which is
> supposed to handle invalid multibyte sequences handles the first such
> byte, but misses to reset the multibyte shift state after the byte has
> been handled.  Basically, resetting the shift state after such a
> sequence has been encountered fixes that problem.

Great!

> Unfortunately this is only the first half of a solution.  This is what
> `ls' prints after running t:
>
> $ ls -l --show-control-chars
> total 21
> -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 a-öäüß
> -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 c-öäüßzz
> -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 d-öäüßzzz
> -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 e-öäüßöäüß
>
> But this is what ls prints when setting $LANG to something "non-C":
>
> $ setenv LANG en            (implies codepage 1252)
> $ ls -l --show-control-chars
> ls: cannot access a-öäüß: No such file or directory
> ls: cannot access c-öäüßzz: No such file or directory
> ls: cannot access d-öäüßzzz: No such file or directory
> ls: cannot access e-öäüßöäüß: No such file or directory
> total 21
> -? ? ? ? ? ? a-öäüß
> -? ? ? ? ? ? c-öäüßzz
> -? ? ? ? ? ? d-öäüßzzz
> -? ? ? ? ? ? e-öäüßöäüß

Btw, the same thing will happen with en.C-ISO-8859-1 or C.ASCII too.

> As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
> transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
> The problem now is that readdir() will return the transposed characters
> as if they are the original characters.

Yep, that's where the bug is. Those 0xDC?? words represent invalid
UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.

Therefore, when converting a UTF-16 Windows filename to the current
charset, 0xDC?? words should be treated like any other UTF-16 word
that can't be represented in the current charset: it should be encoded
as a ^N sequence.

> ls uses some mbtowc function
> to create a valid widechar string, and then uses the resulting widechar
> string in some wctomb function to call stat().

It's not 'ls' that does that conversion. On the POSIX side, filenames
are simply sequences of bytes, hence 'ls' would be very wrong to do any
conversion between readdir() and stat(). No, it's stat() itself
converting the CP1252 sequence "a-öäüß" to UTF-16, which yields
L"a-öäüß". This does not contain the 0xDC?? codepoints that the actual
filename contained, hence stat() fails.

> So it looks like the current mechanism to handle invalid multibyte
> sequences is too complicated for us.  As far as I can see, it would be
> much simpler and less error prone to translate the invalid bytes simply
> to the equivalent UTF-16 value.  That creates filenames with UTF-16
> values from the ISO-8859-1 range.

This won't work correctly, because different POSIX filenames will map
to the same Windows filename. For example, the filenames "\xC3\xA4"
(valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
represents a-umlaut in 8859-1), will both map to Windows filename
"U+00C4", i.e a-umlaut in UTF-16. Furthermore, after creating a file
called "\xC4", a readdir() would show that file as "\xC3\xA4".

Note also that invalid UTF-8 sequences would be much less of an issue
if the C locale didn't mix UTF-8 filenames with a ISO-8859-1 console.
They'd still occur e.g. when unpacking a tarball with ISO-encoded
filenames while a UTF-8 locale is active. However, that sort of
situation is not handled well on Linux either.

Regards,
Andy
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 16 00:38, Lapo Luchini wrote:
> Andy Koppe wrote:
> > Hmm, we've lost the \xDF somewhere, and I'd guess it was when the
> > filename got translated to UTF-16 in fopen(), which would explain what
> > you're seeing
>
> More data: it's not simply "the last character", is something more
> complex than that.
>
> % cat t.c
> int main() {
>     fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
>     fopen("b-\xF6\xE4\xFC\xDFz", "w");
>     fopen("c-\xF6\xE4\xFC\xDFzz", "w");
>     fopen("d-\xF6\xE4\xFC\xDFzzz", "w");
>     fopen("e-\xF6\xE4\xFC\xDF\xF6\xE4\xFC\xDF", "w");
>     return 0;
> }

Ok, I see what happens.  The problem is that the mechanism which is
supposed to handle invalid multibyte sequences handles the first such
byte, but misses to reset the multibyte shift state after the byte has
been handled.  Basically, resetting the shift state after such a
sequence has been encountered fixes that problem.

Unfortunately this is only the first half of a solution.  This is what
`ls' prints after running t:

$ ls -l --show-control-chars
total 21
-rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 a-öäüß
-rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 c-öäüßzz
-rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 d-öäüßzzz
-rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 e-öäüßöäüß

But this is what ls prints when setting $LANG to something "non-C":

$ setenv LANG en            (implies codepage 1252)
$ ls -l --show-control-chars
ls: cannot access a-öäüß: No such file or directory
ls: cannot access c-öäüßzz: No such file or directory
ls: cannot access d-öäüßzzz: No such file or directory
ls: cannot access e-öäüßöäüß: No such file or directory
total 21
-? ? ? ? ? ? a-öäüß
-? ? ? ? ? ? c-öäüßzz
-? ? ? ? ? ? d-öäüßzzz
-? ? ? ? ? ? e-öäüßöäüß

As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
The problem now is that readdir() will return the transposed characters
as if they are the original characters.  ls uses some mbtowc function
to create a valid widechar string, and then uses the resulting widechar
string in some wctomb function to call stat().  However, *that* string
will use a valid multibyte sequence to represent the character and the
resulting filename is suddenly different from the actual filename on
disk, and stat returns with errno set to ENOENT.  Since the conversion
from and to is independent of each other, there's no way to detect
whether the incoming string of a wctomb was originally based on a
transposed character or not.  I'm not sure if I could explain this
clearly enough...

So it looks like the current mechanism to handle invalid multibyte
sequences is too complicated for us.  As far as I can see, it would be
much simpler and less error prone to translate the invalid bytes simply
to the equivalent UTF-16 value.  That creates filenames with UTF-16
values from the ISO-8859-1 range.

I tested this with the files created by the above testcase.  While the
filenames appeared to be different dependent on the used charset, ls
always handled the files gracefully.

Any objections?  I can also just check it in and the entire locale
challenged part of the community can test it...


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
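[Editorial note: a sketch of the alternative Corinna proposes above:
translate an invalid byte to the UTF-16 value of the same number (the
ISO-8859-1 range) instead of transposing it to 0xDC00 | b.
Illustrative only; as the replies above point out, this mapping is not
injective, e.g. valid UTF-8 "\xC3\xA4" and invalid "\xC4" both end up
as U+00C4.]

#include <wchar.h>

static wchar_t
invalid_byte_to_utf16 (unsigned char b)
{
  return (wchar_t) b;  /* e.g. 0xC4 -> U+00C4, a-umlaut in ISO-8859-1 */
}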
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Andy Koppe wrote:
> Hmm, we've lost the \xDF somewhere, and I'd guess it was when the
> filename got translated to UTF-16 in fopen(), which would explain what
> you're seeing

More data: it's not simply "the last character", is something more
complex than that.

% cat t.c
int main() {
    fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
    fopen("b-\xF6\xE4\xFC\xDFz", "w");
    fopen("c-\xF6\xE4\xFC\xDFzz", "w");
    fopen("d-\xF6\xE4\xFC\xDFzzz", "w");
    fopen("e-\xF6\xE4\xFC\xDF\xF6\xE4\xFC\xDF", "w");
    return 0;
}
% gcc -o t t.c
% ./t
% find .
.
./a-???
./b-???
./c-???
./d-???
./e-???
./t.c
./t.exe

It seems that once one "high bit set" byte is encountered, everything
past the last of them (itself included) is lost.

Also, I can confirm this works too:

% rm a-$'\366'$'\344'$'\374'$'\337'

but also this, since the last one doesn't count:

% rm a-$'\366'$'\344'$'\374'$'\336'

BTW: I didn't know about that kind of escaping, but zsh auto-completed
that for me (excluding the last character, of course)

-- 
Lapo Luchini - http://lapo.it/
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
2009/9/10 Lapo Luchini:
> But the real problem with that test is not really what shows and how,
> the biggest problem is that it seems that filenames created with a
> "wrong" filename are quite limited in usage and can't seemingly be
> deleted.
>
> % export LANG=en_EN.UTF-8
> % cat t.c
> #include <stdio.h>
> int main() {
>     fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
>     fopen("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F", "w"); //UTF-8
>     return 0;
> }
> % gcc -o t t.c
> % mkdir test ; cd test ; ../t ; cd ..
> % ls -l test
> ls: cannot access test/a-▒▒▒: No such file or directory
> total 0
> -? ? ? ? ? ? a-▒▒▒
> -rw-r--r-- 1 lapo None 0 2009-09-10 21:19 b-öäüß
> % find test
> test
> test/a-???
> test/b-öäüß
> % find test -delete
> find: cannot delete `test/a-\366\344\374': No such file or directory

Hmm, we've lost the \xDF somewhere, and I'd guess it was when the
filename got translated to UTF-16 in fopen(), which would explain what
you're seeing: 'find' reads the filename correctly, invokes remove()
on it, which translates it to UTF-16 again, whereby we lose a second
byte, so we're down to a-\366\344, which can't be deleted because it
doesn't exist.

> remove("a-\xF6\xE4\xFC\xDF");

Now here we start with the full name again, so if we lose the last
byte we get what's actually on disk, hence the call succeeds.

Bytes that don't contribute to valid UTF-8 characters get mapped to a
certain subrange of UTF-16 low surrogates at 0xDC80, which is a clever
trick for encoding such bytes into UTF-16 and getting them out again
after decoding. I stared at the code for this in sys_cp_mbstowcs for a
bit, but haven't spotted where those missing bytes might have gone.

Andy
[1.7] Invalid UTF8 while creating a file -> cannot delete?
After a few problems with monotone's unit tests on Cygwin-1.7, I began
searching and experimenting a bit with the new 1.7 support for wide
chars. I also read the full thread about its last change:
http://www.cygwin.com/ml/cygwin/2009-05/msg00344.html
which really makes some sense to me (when I create a file from the
console I want "ls" to show back that file to me with the same
encoding).

Problem is, that unit test assumes filenames are "raw data" and tries
to create three types of filenames: ISO-8859-1, EUC-JP and UTF-8.
Except on OSX, where it only tries UTF-8 as that's the disk format.

Now we have a UTF-16 disk format, except the library is using the
LANG value from process start to initialize some LANG-to-UTF16
conversion, as far as I understood, so there's not really one "correct"
format: it depends on the LANG env value when the test unit is
launched.

OK, that's a side issue, since I can probably modify the tests to
always be launched with LANG=C instead of using the current value, so
that at least it is consistent. And then maybe remove the creation of
the ISO-8859-1 and EUC-JP tests just like on OSX. Which could be
correct... but a bit less so than on OSX itself, where that is really
"the format" and not "the DEFAULT format which could be overridden with
a correct setlocale".

But the real problem with that test is not really what shows and how,
the biggest problem is that it seems that filenames created with a
"wrong" filename are quite limited in usage and can't seemingly be
deleted.

% export LANG=en_EN.UTF-8
% cat t.c
#include <stdio.h>
int main() {
    fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
    fopen("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F", "w"); //UTF-8
    return 0;
}
% gcc -o t t.c
% mkdir test ; cd test ; ../t ; cd ..
% ls -l test
ls: cannot access test/a-▒▒▒: No such file or directory
total 0
-? ? ? ? ? ? a-▒▒▒
-rw-r--r-- 1 lapo None 0 2009-09-10 21:19 b-öäüß
% find test
test
test/a-???
test/b-öäüß
% find test -delete
find: cannot delete `test/a-\366\344\374': No such file or directory
find: cannot delete `test': Directory not empty
% find test
test
test/a-???

Now... I don't know how exactly `find` works, but it seems strange to
me that it isn't capable of deleting something it is capable of
listing. Also it seems strange that `ls` is not capable of stat-ing
something it's capable of listing.

Yep, I do know that filename is "broken" in the first place, but since
in the Unix world such stuff can happen, as filenames are really raw
data, I think probably an error on file creation would be better than
creating a file that can't subsequently be stat-ed or even unlinked.

% cat u.c
#include <stdio.h>
int main() {
    remove("a-\xF6\xE4\xFC\xDF");
    remove("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F");
    return 0;
}
% gcc -o u u.c

OK, a program using a similarly-broken filename can delete it, but the
fact it can't be deleted with "normal" tools is a bit of an
inconvenience...

-- 
Lapo Luchini - http://lapo.it/

“Premature optimisation is the root of all evil in programming.”
(C. A. R. Hoare)