strftime %b is broken on ja_JP locale
Hi.

strftime %b is broken on the ja_JP locale on cygwin-1.7.5-1.

[monthtest.c]
#include <stdio.h>
#include <time.h>
#include <locale.h>

int main(void)
{
    time_t now;
    struct tm *tm;
    char buffer[4096];

    setlocale(LC_ALL, "ja_JP.UTF-8");
    time(&now);
    tm = localtime(&now);
    /* %B is the full month name, %b the abbreviated one. */
    strftime(buffer, sizeof(buffer), "[%B][%b]\n", tm);
    puts(buffer);
    return 0;
}

- result on Cygwin: [5月][5]
  (the suffix 月 (U+6708) is missing)
- result on Debian lenny: [5月][ 5月]

--
IWAMURO Motnori http://vmi.jp/

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Re: The C locale
2009/9/29 wynfi...@gmail.com:
> Also the following be suitable if possible..
>
> LANG=ja -> iso-2022-jp
> LANG=ja_JP -> iso-2022-jp

Hmmm, I think that is unrealistic.

--
IWAMURO Motnori http://vmi.jp/
Re: The C locale
2009/9/29 Corinna Vinschen corinna-cyg...@cygwin.com:
> The downside is that a user, who needs to work under the default ANSI
> codepage for some reason, has to know the name of the default ANSI
> codepage.

If this is a problem of migrating from 1.5 to 1.7, how about building a wizard into setup.exe that sets the locale environment variable?
Wouldn't that be a proper solution?

--
IWAMURO Motnori http://vmi.jp/
Re: The C locale
2009/9/29 Corinna Vinschen corinna-cyg...@cygwin.com:
> I asked if the default charset for the japanese language should be
> set to EUCJP rather than SJIS. The actual implementation would have
> been like this
>
>   if (lang=xx or lang=xx_XX with x in [a-z] and X in [A-Z]?)
>     set_charset_from_codepage()
>
>   set_charset_from_codepage()
>   {
>     switch (GetANSI ())
>     [...]
>     case 932:
>       charset=EUCJP   <-- Instead of the current charset=SJIS
>     [...]
>   }

I don't think that is good for Japanese users, because EUCJP is not a substitute for SJIS.

--
IWAMURO Motnori http://vmi.jp/
Re: The C locale
2009/9/27 IWAMURO Motonori deenhe...@gmail.com:
>> LANG=ja -> EUCJP
>> LANG=ja_JP -> EUCJP
>
> Hmmm, It is a difficult problem.
> I think selecting UTF-8 is good because eucJP is legacy.
> But, for interoperability with other UNIX-like system(*), I don't
> think selecting UTF-8 is good.
>
> * Solaris: ja, ja_JP -> eucJP
> * Linux (Debian): ja -> Unknown, ja_JP -> eucJP
>
> I need to think more...

After hearing the opinions of other Japanese users, my conclusion is as follows:

LANG=ja -> UTF-8
LANG=ja_JP -> UTF-8

The reason is that we specify eucJP explicitly when we need it.

--
IWAMURO Motnori http://vmi.jp/
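The mapping concluded above can be sketched in C. This is only an illustration of the proposal, not Cygwin's locale code: the function name charset_for_lang and its buffer-based interface are hypothetical. An explicit charset after the dot wins, any "@modifier" is ignored, and a value without a charset falls back to UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Cygwin or newlib): writes the charset
   implied by a LANG-style value into out (capacity n) and returns out,
   or NULL if the charset is empty or does not fit. */
const char *charset_for_lang(const char *lang, char *out, size_t n)
{
    assert(lang != NULL && out != NULL);

    const char *dot = strchr(lang, '.');
    const char *name = "UTF-8";      /* proposed default: no explicit charset */
    size_t len = strlen(name);

    if (dot != NULL) {
        name = dot + 1;
        len = strcspn(name, "@");    /* stop before an "@modifier" */
    }
    if (len == 0 || len + 1 > n)
        return NULL;
    memcpy(out, name, len);
    out[len] = '\0';
    return out;
}
```

With this sketch, "ja" and "ja_JP" both give "UTF-8", while "ja_JP.eucJP" gives "eucJP", which matches the idea that eucJP is requested explicitly when it is needed.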
Re: The C locale
Hi.

> the default ANSI and OEM codepage on Japanese Windows systems is
> 932/SJIS, right?

Yes.

> LANG=C -> UTF-8
(snip)
> LANG=ja_JP.SJIS -> SJIS

That is good.

> LANG=ja -> EUCJP
> LANG=ja_JP -> EUCJP

Hmmm, this is a difficult problem.
I think selecting UTF-8 is good because eucJP is legacy.
But for interoperability with other UNIX-like systems (*), I don't think selecting UTF-8 is good.

* Solaris: ja, ja_JP -> eucJP
* Linux (Debian): ja -> Unknown, ja_JP -> eucJP

I need to think more...
Re: The C locale
2009/9/24 Corinna Vinschen corinna-cyg...@cygwin.com:
> My question is this: Is the S-JIS implementation on UNIX systems also
> using a different implementation to avoid using characters from the
> ASCII range? If so, can't we change the __sjis_wctomb and
> __sjis_mbtowc functions in the same manner as the __eucjp_wctomb and
> __eucjp_mbtowc functions to get a safer implementation?

I don't think we need to consider that.
The eucJP problem does not occur in an SJIS environment, because SJIS does not support JIS X 0212.

--
IWAMURO Motnori http://vmi.jp/
Re: The C locale
2009/9/22 Andy Koppe andy.ko...@gmail.com:
> Let's use the Windows ANSI codepage as the character set for the C
> locale, for both the conversion functions and filenames. This means
> CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
> ones, and so on.

I oppose this approach (using the ANSI codepage for the C locale) because CP932 (the codepage for Japanese) is hostile to UNIX-like tools.

The reason is that the second byte of a two-byte CP932 character can fall in the ASCII range, so the encoding contains many metacharacters, as follows.

single character of CP932:
/[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/

This has a ruinous effect on tools that are not locale-aware.

--
IWAMURO Motnori http://vmi.jp/
Re: The C locale
2009/9/24 Corinna Vinschen corinna-cyg...@cygwin.com:
> On Sep 24 16:03, IWAMURO Motonori wrote:
>> 2009/9/22 Andy Koppe andy.ko...@gmail.com:
>>> Let's use the Windows ANSI codepage as the character set for the C
>>> locale, for both the conversion functions and filenames. This means
>>> CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on
>>> Japanese ones, and so on.
>>
>> I oppose the approach (the ANSI codepage is used at C locale) because
>> CP932 (the codepage for Japanese) is hostile to the UNIX-like tools.
>>
>> The reason is that the CP932 format contains a lot of meta characters
>> as follows.
>>
>> single character of CP932:
>> /[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/
>
> I don't understand. Are you saying that the single character in CP932
> consists of 12 bytes? As far as I can see, CP932 is S-JIS, which is a
> just a simple double byte character set. What am I missing.

- CP932 (Shift_JIS) has 1-byte characters and 2-byte characters.
- The range of a 1-byte character is 0x00-0x7F and 0xA0-0xDF.
- The range of the first byte of a 2-byte character is 0x81-0x9F and 0xE0-0xFC.
- The range of the second byte of a 2-byte character is 0x40-0x7E and 0x80-0xFC.
  This includes [, \, ], ^, `, {, |, }.

The last fact causes many problems in tools that are not locale-aware and that use escaped strings, globbing, or regular expressions:

- Files or directories cannot be opened.
- Filenames are destroyed.
- Files are lost.

For example:

Case 1:
The CP932 byte sequence of 項目表.xls is 8D 80 96 DA 95 *5C* (=='\') 2E 78 6C 73.
When this string is processed as an escaped string without regard to the locale, the 0x5C byte disappears.

Case 2:
When I use the regexp /スポット/, I expect it to match strings that contain スポット.
But tools that are not locale-aware treat it as /ス\x83|ット/, because the byte sequence of スポット is 83 58 83 *7C* (=='|') 83 62 83 67.
As a result, unexpected strings are matched.

Case 3:
When I use the glob データ0[0-9].dat, it is treated as デ\x81[\x83^0[0-9].dat.
As a result, the expected files are not matched.

--
IWAMURO Motnori http://vmi.jp/
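The hazard described in the three cases can be checked mechanically. The following is a minimal sketch, assuming the lead-byte ranges from the regular expression quoted above; the function names are hypothetical. It walks a CP932 byte string and reports the first second byte that is one of the listed ASCII metacharacters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Lead byte of a two-byte CP932 character (ranges taken from the
   regular expression in the message above). */
static int cp932_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: returns the index of the first second byte that
   is one of [ \ ] ^ ` { | }, or -1 if there is none. */
long cp932_first_meta_trail(const unsigned char *s, size_t len)
{
    static const char meta[] = "[\\]^`{|}";
    assert(s != NULL);

    for (size_t i = 0; i < len; i++) {
        if (!cp932_is_lead(s[i]))
            continue;                /* single-byte character */
        if (i + 1 >= len)
            break;                   /* truncated two-byte character */
        i++;                         /* s[i] is now the second byte */
        if (s[i] != 0 && strchr(meta, s[i]) != NULL)
            return (long)i;
    }
    return -1;
}
```

For the bytes of 項目表.xls given in Case 1 the function returns 5 (the 0x5C byte), and for スポット in Case 2 it returns 3 (the 0x7C byte). A tool that is not locale-aware sees exactly those bytes as '\' and '|'.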
Re: The C locale
Hi.

2009/9/2 Andy Koppe andy.ko...@gmail.com:
> I see two good solutions:
> - Use the default Windows codepage for filenames, console, and
>   multibyte functions. This is what happens already if you specify a
>   locale with a language but no charset, e.g. en. Maximum 1.5
>   compatibility.
> - Use UTF-8 throughout. Full Unicode support out-of-the box.

I want to use UTF-8 throughout, because:

- Many UNIX tools that work over the network (e.g. rsync, scp, ...) treat a filename as an array of 8-bit bytes.
- The default locale of modern UNIX-based OSes is *.UTF-8.
- Files whose names include characters outside the codepage (e.g. files in the iTunes folder) can be handled.

--
IWAMURO Motnori http://vmi.jp/
Re: [PATCH] Add @cjknarrow modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
Hi.

2009/6/27 Andy Koppe andy.ko...@gmail.com:
> And then there's the Linux compatibility angle, where ja_JP.UTF-8
> means ambiguous width 1 not 2.

Please do not judge this based on the behavior of current Linux, because:

- I don't think that behavior is correct.
- I am currently creating a patch for the problem.

--
IWAMURO Motnori http://vmi.jp/
Re: [PATCH] Add @cjknarrow modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
2009/6/15 Corinna Vinschen corinna-cyg...@cygwin.com:
>> Yes, but the guideline exists.
>>
>> http://cygwin.com/ml/cygwin/2009-05/msg00444.html
>
> A single mail in a single mailing list of a single project. That's
> rather a suggestion than a guideline...

Sorry, my writing was unclear.
What I quoted is part of Unicode Standard Annex #11, EAST ASIAN WIDTH.
Please see "When processing or displaying data" under "5 Recommendations" at http://www.unicode.org/unicode/reports/tr11/ .

> If everybody agrees to this suggestion, here's the patch.

Is "cjk-" a good name for the modifier prefix?
The modifier affects not CJK characters but some symbols and European characters.
Please refer to Andy's opinion: http://cygwin.com/ml/cygwin/2009-06/msg00240.html
Personally, I would propose "ambinarrow", because the corresponding Vim option is "ambiwidth".

Also, I don't think the current design is symmetrical.
How about the following patch? (I have not changed the modifier prefix.)

--- libc/locale/locale.c.ORIG 2009-06-15 23:05:40.81250 +0900
+++ libc/locale/locale.c 2009-06-15 22:56:35.546875000 +0900
@@ -398,7 +398,8 @@
   int (*l_mbtowc) (struct _reent *, wchar_t *, const char *, size_t,
                    const char *, mbstate_t *);
 #ifdef _MB_CAPABLE
-  int cjknarrow = 0;
+#define CJK_DEFAULT -1
+  int cjk_lang = CJK_DEFAULT;
 #endif

   /* POSIX is translated to C, as on Linux. */
@@ -453,11 +454,14 @@
       if (c[0] == '@')
         {
           /* Modifier */
-          /* Only one modifier is recognized right now.  "cjknarrow" is used
-             to modify the behaviour of wcwidth() for East Asian languages.
-             For details see the comment at the end of this function. */
+          /* Only one modifier is recognized right now.  "cjknarrow" and
+             "cjkwide" are used to modify the behaviour of wcwidth() for
+             East Asian languages.  For details see the comment at the
+             end of this function. */
           if (!strcmp (c + 1, "cjknarrow"))
-            cjknarrow = 1;
+            cjk_lang = 0;
+          else if (!strcmp (c + 1, "cjkwide"))
+            cjk_lang = 1;
         }
 #endif
     }
@@ -627,10 +631,11 @@
          The result is stored in lc_ctype_cjk_lang and tested in wcwidth()
          to figure out the width to return (1 or 2) for the "CJK Ambiguous
          Width" category of characters. */
-      lc_ctype_cjk_lang = !cjknarrow
-                          && ((strncmp (locale, "ja", 2) == 0
-                              || strncmp (locale, "ko", 2) == 0
-                              || strncmp (locale, "zh", 2) == 0));
+      lc_ctype_cjk_lang = cjk_lang != CJK_DEFAULT
+                          ? cjk_lang
+                          : ((strncmp (locale, "ja", 2) == 0
+                              || strncmp (locale, "ko", 2) == 0
+                              || strncmp (locale, "zh", 2) == 0));
 #endif
     }
   else if (category == LC_MESSAGES)

--
IWAMURO Motnori http://vmi.jp/
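The three-state logic of the patch above can be tried outside newlib. The following stand-alone sketch mirrors it under some assumptions: the function name resolve_cjk_lang is hypothetical, and the modifier is located with a simple strchr rather than newlib's actual locale-string parsing.

```c
#include <assert.h>
#include <string.h>

enum { CJK_DEFAULT = -1 };

/* Hypothetical stand-alone version of the patch logic: returns 1 if
   ambiguous-width characters should be treated as wide, 0 if narrow. */
int resolve_cjk_lang(const char *locale)
{
    assert(locale != NULL);

    int cjk_lang = CJK_DEFAULT;
    const char *mod = strchr(locale, '@');

    if (mod != NULL) {
        if (strcmp(mod + 1, "cjknarrow") == 0)
            cjk_lang = 0;            /* explicit narrow */
        else if (strcmp(mod + 1, "cjkwide") == 0)
            cjk_lang = 1;            /* explicit wide */
    }
    if (cjk_lang != CJK_DEFAULT)
        return cjk_lang;

    /* No recognized modifier: fall back to the language default. */
    return strncmp(locale, "ja", 2) == 0
        || strncmp(locale, "ko", 2) == 0
        || strncmp(locale, "zh", 2) == 0;
}
```

The symmetry is visible in the results: "ja_JP.UTF-8" resolves to 1 and "ja_JP.UTF-8@cjknarrow" to 0, while "en_US.UTF-8" resolves to 0 and "en_US.UTF-8@cjkwide" to 1.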
Re: [PATCH] Add @cjknarrow modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
OK. I withdraw my proposal.

2009/6/16 Corinna Vinschen corinna-cyg...@cygwin.com:
> On Jun 15 23:35, IWAMURO Motonori wrote:
>> 2009/6/15 Corinna Vinschen:
>>> If everybody agrees to this suggestion, here's the patch.
>>
>> Is the name of modifier prefix cjk- good?
>> It influences not CJK characters but a part of symbols and European
>> characters.
>> Please refer to Andy's opinion:
>> http://cygwin.com/ml/cygwin/2009-06/msg00240.html
>> It personally proposes ambinarrow because the switch of Vim is
>> ambiwidth.
>
> I think cjk in the name is the right choice. There are no ambiguous
> characters in western languages (well, probably there are, but the
> ambiguity is not on the level of character widths). This is a problem
> which only has a meaning in these so called CJK languages. It makes
> sense to me to use this in the modifier name.
>
>> And, I don't think that it is symmetrical.
>> How about the following patch? (I have not changed the name of
>> modifier prefix)
>
> I'm not convinced that we need symmetry. It looks like a nice idea for
> Cygwin or newlib, given that the setlocale language string is checked
> and picked to pieces hardcoded in the loadlocale function.
>
> However, besides of being unnecessary, other systems like Linux or BSD
> use the language string as directory name relative to the
> /usr/share/locale directory. If this gets ever used on non-Cygwin
> systems, the symmetry (which has no precedent in the locale arena)
> would require these systems to create yet another subdirectory or
> symlink for the same purpose. Even worse, if you propose that @cjkwide
> is a valid modifier for *any* language, you would make the whole
> mechanism on non-newlib based systems more complicated for no apparent
> reason.
>
> Corinna
>
> --
> Corinna Vinschen                  Please, send mails regarding Cygwin to
> Cygwin Project Co-Leader          cygwin AT cygwin DOT com
> Red Hat

--
IWAMURO Motnori http://vmi.jp/
Re: [Fwd: [1.7] wcwidth failing configure tests]
2009/6/13 Thomas Wolff t...@towo.net:
> I have checked source data files in /usr/share/i18n/charmaps on my
> Linux system, e.g. UTF-8.gz.
(snip)
> character widths are the same for all locales with the same charmap.

It was reported as a bug, but it has not been fixed yet... X-(

http://sourceware.org/bugzilla/show_bug.cgi?id=4335
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=471021

> If you think you can get your proposal passed up-stream, go ahead and
> try it, please! If you succeed, everything is fine.

Hmmm, I think you have misunderstood something because my explanation was poor.
By "up-stream" I meant the maintenance team of each OS, library, or application.
I don't think there is a single up-stream.

Japanese users have been trying to fix this problem for many years, but there has been little progress so far.

>> - In NetBSD, the change to which wcwidth of East Asian Ambiguous
>>   Characters returns 2 by CJK locale is planned.
>
> So the same issue (of compliance and portability, especially in the
> remote case) should be discussed in the NetBSD community. (Is there a
> suitable forum or mailing list to check?)

Sorry, I don't know, because I was advised personally by one of the NetBSD maintainers ( http://www.hi-matic.org/ (written in Japanese) ).

>> I think that ALL locale implementations should treat East Asian
>> Ambiguous Character Width as 2 for CJK locale.
>
> Again, I agree that IF you manage to get ALL implementations to follow
> this approach, the solution is fine. Please go ahead.

I will do so, but first of all I want to solve the problem on Cygwin.

>> How to detect it? The application using wcwidth is not necessarily
>> executed with terminal emulator. (e.g. text formatter)
>
> OK, my arguments refer to an interactive application that wants to
> control the precise representation of text on the screen.
> If for example a text formatter formats for paper printing, it would
> need to apply completely different assumptions anyway. The dreadful
> single/double width issue of cell-based terminals isn't relevant at
> all in that case.

I am assuming a text formatter that depends on a fixed-pitch font (like the 'indent' command).

I hope the following two produce the same result:

- an auto-format filter program that uses 'wcwidth'.
- an auto-format command run in an editor (e.g. fill-paragraph, indent-region, etc. in Emacs).

--
IWAMURO Motnori http://vmi.jp/
Re: [Fwd: [1.7] wcwidth failing configure tests]
2009/6/13 Corinna Vinschen vinsc...@redhat.com:
> I'm not sure which standard you are referring to. The problem appears
> to be that there is no standard for the handling of ambiguous
> characters.

Yes, but the guideline exists.

http://cygwin.com/ml/cygwin/2009-05/msg00444.html
| 2) Unicode Standard Annex #11
| http://www.unicode.org/unicode/reports/tr11/
| recommends:
| 5 Recommendations
| (snip)
| When processing or displaying data
| (snip)
| Ambiguous characters behave like wide or narrow characters
| depending on the context (language tag, script identification,
| associated font, source of data, or explicit markup; all can
| provide the context). If the context cannot be established
| reliably, they should be treated as narrow characters by default.

> Define the default for ja, ko, and zh to use width = 2, with a
> @cjknarrow (or whatever) modifier to use width = 1.

I think that is a good idea.

--
IWAMURO Motnori http://vmi.jp/
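The recommendation quoted above amounts to a small decision rule. The sketch below illustrates it for a handful of code points that UAX #11 lists as East Asian Ambiguous. The function names are hypothetical, the table is deliberately tiny, and this is not how wcwidth() is implemented in newlib or glibc.

```c
#include <assert.h>

/* A few code points classified as East Asian Ambiguous; the real
   property table is much larger. */
static int is_ambiguous_sample(unsigned long cp)
{
    return cp == 0x00A7                      /* SECTION SIGN */
        || cp == 0x00B1                      /* PLUS-MINUS SIGN */
        || cp == 0x00D7                      /* MULTIPLICATION SIGN */
        || (cp >= 0x2460 && cp <= 0x24E9);   /* enclosed alphanumerics */
}

/* Hypothetical helper: context_is_cjk is 1 when a CJK context has been
   established (for example by the locale), 0 when it is not CJK or
   cannot be established. Returns the cell width, or -1 for code points
   outside the sample table. */
int ambiguous_cell_width(unsigned long cp, int context_is_cjk)
{
    assert(context_is_cjk == 0 || context_is_cjk == 1);

    if (!is_ambiguous_sample(cp))
        return -1;
    return context_is_cjk ? 2 : 1;           /* narrow by default */
}
```

Under this rule U+00B1 occupies two cells when the locale establishes a CJK context and one cell otherwise, which is the default proposed for ja, ko, and zh.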
Re: [Fwd: [1.7] wcwidth failing configure tests]
I oppose your proposal because I think it is of no use to us.

2009/6/6 Thomas Wolff t...@towo.net:
> the intention is that the codepage information should be the same for
> all locales having the UTF-8 (or any other) charmap.
> So you cannot freely change width information among locales with the
> same charmap.

I don't think there is such a restriction.
The character set standard does not specify the width of a character.

> Also, if ja_JP.UTF-8 would mean CJK width, how would you specify a
> working locale setting for a terminal that does not run a CJK width
> font but should yet use other Japanese settings? E.g. with rxvt which
> does not support CJK width.

Oh, we ALWAYS have a VERY VERY VERY hard time with this problem.

case1: We use only applications that handle character width without relying on the locale.
case2: We make a patch that solves the character width problem and submit it up-stream.
case3: We make the patch and apply it locally.
case4: We tearfully give up on a correct screen display.
case5: We tearfully give up using the application.

I selected case5 for rxvt.

> Thus you could define e.g. ja_jp.ut...@cjk or ja_jp.ut...@cjkwidth to
> indicate CJK width properties.
> I guess this is the most compliant way to go.

I don't think that is a good idea, because:

- It is a cygwin-specific solution (or workaround).
- In NetBSD, a change that makes wcwidth return 2 for East Asian Ambiguous characters in CJK locales is planned.

# to be continued.

--
IWAMURO Motnori http://vmi.jp/
Re: [Fwd: [1.7] wcwidth failing configure tests]
# Continuation of discussion.
#
# I hope that all applications work correctly just by setting LANG=ja_JP.UTF-8.
# I don't want to give up using the binary packages, and I don't want to keep applying many local patches.

I don't think that it is the good idea because:
- It is a cygwin-specific solution (or workaround).
- In NetBSD, the change to which wcwidth of East Asian Ambiguous Characters returns 2 by CJK locale is planned.
- and, I don't think that special cases need to be given priority over general cases.
- I heard that there is an existing implementation that behaves like my proposal.
  (Sorry, I did not hear the system name.)

Even if so, I think the way I described is more compatible with the locale mechanism as used elsewhere.

I think that ALL locale implementations should treat East Asian Ambiguous Character Width as 2 for CJK locale.

It is no problem because we -- most Japanese language users -- need not change the settings of mintty and locale after first setup.
We set LANG=ja_JP.UTF-8 and select a Japanese font for mintty.

In any case, mined running in mintty will detect CJK width itself, regardless of locale setting, with coming versions of both programs even when it gets changed on-the-fly :)

Sorry, I can't understand the above because I am not good at English.

This sounds complicated.

I don't think so.

I think that we should consider the following issues if a new mechanism is introduced.

The existing locale / terminal APIs don't support:

- Unicode BiDi.
- Unicode control characters.
- Unicode combining characters.
- Multilingualization. (*)
- Detecting the font/fontset information selected in the terminal emulator.
  (The no-tty case also needs to be considered.)

* Currently, we can't use Japanese, Chinese, and Korean at the same time even if we use Unicode, because many font glyphs are quite different in each language even when the code point is the same.

With my proposal, an application that wishes to auto-adjust on width properties (maybe even when changing) and which (unlike mined) uses the system wcwidth functions could proceed as follows:
* Detect CJK width by using a simple test string width detection.
* (Optional) When receiving a SIGWINCH signal (future version of MinTTY), repeat this detection.
* If e.g. LC_CTYPE starts with ja_JP.UTF-8, call setlocale with either ja_jp.ut...@cjkwidth or ja_JP.UTF-8.

How to detect it? The application using wcwidth is not necessarily executed with terminal emulator. (e.g. text formatter)

I'm not happy with the idea of a cygwin-specific solution (or workaround).

I think that it is not a cygwin-specific solution.

As I tried to suggest above, using UTF-8 for different width data on one system would be quite specific, using the @ modifier syntax would not.

UTF-8 is only an encoding scheme. It does not specify the character width.

--
IWAMURO Motnori http://vmi.jp/
Re: [Fwd: [1.7] wcwidth failing configure tests]
2009/6/6 Andy Koppe andy.ko...@gmail.com:
> However, to make the locale setting more convenient for CJK users,
> there could be modifiers for both widths. Without modifier, the CJK
> locales would default to Ambiguous Wide, while everything else would
> default to Ambiguous Narrow.

That is acceptable to me.

> Puzzled that this hasn't been solved in glibc years ago ...

I also looked into it, but I was not able to find the reason.

One Debian user is trying to fix it, but there has been no progress...

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=471021
http://sourceware.org/bugzilla/show_bug.cgi?id=4335

--
IWAMURO Motnori http://vmi.jp/
Re: [Fwd: [1.7] wcwidth failing configure tests]
2009/6/6 Corinna Vinschen corinna-cyg...@cygwin.com:
> I vote for @cjkwide, regardless of Andy's objection. People using CJK
> will know the meaning and it has the additional advantage to be a
> rather simple to memorize identifier.

I oppose the @cjkwide approach because I don't think special cases should be given priority over general cases.

I think that Andy's approach is better.

--
IWAMURO Motnori http://vmi.jp/
[1.7][BUG] MOJIBAKE title bar
Hi.

The title bar shows mojibake (garbled text) when the following 'wintitle.sh' is run in the command prompt in a UTF-8 environment (for example: LANG=ja_JP.UTF-8).

http://vmi.jp/tmp/wintitle.sh

http://vmi.jp/tmp/01good-mintty.png is the good result on MinTTY.
http://vmi.jp/tmp/02bad-cmd.png is the bad result in the command prompt.

Thanks.

--
IWAMURO Motnori http://vmi.jp/
Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line
Hi.

How about adding a setting for the locale environment variable (such as LANG) to the Cygwin installer?

2009/6/3 Corinna Vinschen corinna-cyg...@cygwin.com:
> On Jun 3 09:18, Edward Lam wrote:
>> Corinna Vinschen wrote:
>>> The question is, what do you expect? [...]
>>
>> [...]
>> Wikipedia has several suggestions on how to handle invalid UTF-8 byte
>> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor
>> the rule that uses the replacement character.
>
> Chris implemented using the invalid code point solution. The
> discussion in
> http://www.mail-archive.com/linux-u...@nl.linux.org/msg00080.html
> supports this solution.
>
> What's missing so far is the way back, from an invalid single second
> half of a surrogate pair in the 0xDCxx range back to the correct byte
> value. I'm just looking into that.
>
>>> How is anybody supposed to know that the file which consists of the
>>> single byte 0xa9 has *any* meaning at all? Why should it be the
>>> copyright sign, of all things?
>>
>> What I was attempting to do was to have NO conversion. In the real
>> case that I into this, the bug.exe was the one to properly interpret
>> what the byte 0xA9 meant from the command line. Yes, I know there are
>> several workarounds.
>
> The command line is always converted to UTF-16 when calling a native
> Win32 application. If we don't do it (because we call CreateProcessA),
> Windows would do it. As matters stand, we have to convert ourselves,
> because we must call CreateProcessW.
>
> Either way, the problem persists. We just don't know what the correct
> conversion is for the given input. We have to rely on a correct
> setting of $LC_ALL/$LANG/$LC_CTYPE. If we default to the ANSI
> codepage, you will have the same problem, just upside down. In both
> cases you will have even more problems if you start using characters
> not available in your default codepage.
>
>> This is where I disagreed with Alexey. What we're really arguing here
>> is whether which default will run into the least problems for the
>> most common usage. This is subjective of course.
>
> Definitely. The right solution is always only right for a given value
> of right.
>
> What if the user has set LANG to, say, ja_JP.eucJP? That user of
> course expects that the stuff on the command line is converted to
> UTF-16 using the eucJP encoding. Everything else would just be very
> surprising.
>
> What's left as questionable is the LANG=C default case. Due to the
> discussion from the last month we now use UTF-8 as default encoding,
> because it's the only encoding which covers all (valid) characters.
> Sure, we could also convert the command line using the current ANSI
> codepage as Windows does it when calling CreateProcessA in this case.
> Maybe we should do that for testing? Anybody having a strong opinion
> here?
>
> Corinna
>
> --
> Corinna Vinschen                  Please, send mails regarding Cygwin to
> Cygwin Project Co-Leader          cygwin AT cygwin DOT com
> Red Hat

--
IWAMURO Motnori http://vmi.jp/
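The "way back" mentioned in the quoted message can be illustrated with a round-trip sketch. It assumes the scheme from the linked discussion, in which an invalid input byte 0xNN is represented by the lone low surrogate 0xDCNN. The function names are hypothetical, and the message does not confirm that Cygwin's code uses exactly this mapping.

```c
#include <assert.h>

/* Hypothetical helper: map an invalid input byte (0x80..0xFF) to a lone
   low surrogate in the 0xDC80..0xDCFF range. */
unsigned int escape_invalid_byte(unsigned char b)
{
    assert(b >= 0x80);
    return 0xDC00u | b;
}

/* Hypothetical helper: the way back. Returns the original byte if unit
   is one of the escaped code units, or -1 if it is anything else. */
int unescape_invalid_byte(unsigned int unit)
{
    if (unit >= 0xDC80u && unit <= 0xDCFFu)
        return (int)(unit & 0xFFu);
    return -1;
}
```

For the single byte 0xA9 discussed in this thread, escape_invalid_byte gives 0xDCA9, and unescape_invalid_byte(0xDCA9) gives 0xA9 back, so the original byte survives a conversion to UTF-16 and back.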
Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line
I think this problem is caused by the locale environment variable not being set.
Therefore, I think it can be solved by having setup.exe require that setting.

2009/6/4 Corinna Vinschen corinna-cyg...@cygwin.com:
> http://cygwin.com/acronyms/#PCYMTNQREAIYR
> http://cygwin.com/acronyms/#TOFU
>
> On Jun 4 00:03, IWAMURO Motonori wrote:
>> 2009/6/3 Corinna Vinschen
>>> What's left as questionable is the LANG=C default case. Due to the
>>> discussion from the last month we now use UTF-8 as default encoding,
>>> because it's the only encoding which covers all (valid) characters.
>>> Sure, we could also convert the command line using the current ANSI
>>> codepage as Windows does it when calling CreateProcessA in this
>>> case. Maybe we should do that for testing? Anybody having a strong
>>> opinion here?
>>
>> How about the addition of the setting of the locale environment
>> variable (like LANG) to the Cygwin installer?
>
> I'm sorry, but I don't understand how that's connected to the
> behaviour of the Cygwin DLL. Setup.exe is an entirely different beast.
>
> Corinna
>
> --
> Corinna Vinschen                  Please, send mails regarding Cygwin to
> Cygwin Project Co-Leader          cygwin AT cygwin DOT com
> Red Hat

--
IWAMURO Motnori http://vmi.jp/
Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line
Also, I think UTF-8 is the best choice when the LC_CTYPE category is set to C.

2009/6/4 IWAMURO Motonori deenhe...@gmail.com:
> I think that this problem is caused by missing setting the locale
> environment variable.
> Therefore, I think that the problem can be solved by compelling the
> setting with setup.exe.
>
> 2009/6/4 Corinna Vinschen corinna-cyg...@cygwin.com:
>> http://cygwin.com/acronyms/#PCYMTNQREAIYR
>> http://cygwin.com/acronyms/#TOFU
>>
>> On Jun 4 00:03, IWAMURO Motonori wrote:
>>> 2009/6/3 Corinna Vinschen
>>>> What's left as questionable is the LANG=C default case. Due to the
>>>> discussion from the last month we now use UTF-8 as default
>>>> encoding, because it's the only encoding which covers all (valid)
>>>> characters. Sure, we could also convert the command line using the
>>>> current ANSI codepage as Windows does it when calling
>>>> CreateProcessA in this case. Maybe we should do that for testing?
>>>> Anybody having a strong opinion here?
>>>
>>> How about the addition of the setting of the locale environment
>>> variable (like LANG) to the Cygwin installer?
>>
>> I'm sorry, but I don't understand how that's connected to the
>> behaviour of the Cygwin DLL. Setup.exe is an entirely different
>> beast.
>>
>> Corinna
>>
>> --
>> Corinna Vinschen                  Please, send mails regarding Cygwin to
>> Cygwin Project Co-Leader          cygwin AT cygwin DOT com
>> Red Hat
>
> --
> IWAMURO Motnori http://vmi.jp/

--
IWAMURO Motnori http://vmi.jp/
[1.7][BUG] winsup/cygwin/strfuncs.cc
Hi.

I found a trivial bug.

*pmbs is an unsigned char.
'\x80' is -128 because it is a char literal (not unsigned char).
-> "*pmbs > '\x80'" is always true.

# Shouldn't the comparison be ">= 0x80" rather than "> 0x80"?

--- winsup/cygwin/strfuncs.cc 31 May 2009 03:59:38 - 1.30
+++ winsup/cygwin/strfuncs.cc 3 Jun 2009 17:59:23 -
@@ -572,7 +572,7 @@
               --len;
             }
         }
-      else if ((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms, charset, &ps)) < 0 && *pmbs > '\x80')
+      else if ((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms, charset, &ps)) < 0 && *pmbs > 0x80)
         {
           /* This should probably be handled in f_mbtowc which can operate
              on sequences rather than individual characters.

--
IWAMURO Motnori http://vmi.jp/
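The pitfall reported above can be reproduced in isolation. The sketch below assumes that the original comparison was `*pmbs > '\x80'`. The two function names are hypothetical and only contrast the character-literal form with a plain integer constant, using the ">= 0x80" variant that the message asks about.

```c
#include <assert.h>

/* The comparison in its reported form. Where plain char is signed,
   '\x80' has the value -128, so every unsigned char value (0..255)
   compares greater and the test is always true. */
int high_byte_buggy(unsigned char b)
{
    return b > '\x80';
}

/* Comparison against an integer constant: true exactly for bytes
   outside the ASCII range. */
int high_byte_fixed(unsigned char b)
{
    return b >= 0x80;
}
```

On a platform where char is signed, high_byte_buggy(0x41) returns 1 even though 0x41 is plain ASCII, while high_byte_fixed(0x41) returns 0. A compiler may warn that the first comparison is always true, which is a useful hint in its own right.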
Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line
Hi. The encoding of the C locale is ASCII, not ISO-8859-1. I don't think ASCII is the same as ISO-8859-1. Does it work with LANG=en_US.ISO-8859-1?

2009/5/29 Edward Lam edw...@sidefx.com:
> Alexey Borzenkov wrote:
>> On Thu, May 28, 2009 at 7:28 PM, Edward Lam edw...@sidefx.com wrote:
>>> PS. In case you haven't noticed, copyright.txt is not a long file. It consists of a single byte, 0xA9.
>>
>> Did you try utf-8 encoding copyright.txt? Perhaps your locale is utf-8 and the encoder fails.
>
> How is one supposed to determine one's locale in cygwin? I do NOT have LANG, or any of the LC environment variables set. I even tried explicitly setting LANG=C and it still fails. The problem does seem to stem from the new UTF-8 support in cygwin 1.7. However, I think something is going on here that is unexpected because trying something similar on Linux has no problems. To confirm that it was an UTF-8 related problem, let me repeat the steps slightly differently again. Here we assume that I've already got bug.exe compiled which simply prints out its arguments.
>
> $ export LANG=C
> $ ./bug arg1 "before `cat copyright.txt` after" arg3
> 0: E:\cygwin1.7\tmp\bug.exe
> 1: arg1
> 2: before
>
> *Notice that argc is 3 when it should be 4!*
>
> $ piconv -f iso-8859-1 -t utf8 copyright.txt > fubar.txt
> $ ./bug arg1 "before `cat fubar.txt` after" arg3
> 0: E:\cygwin1.7\tmp\bug.exe
> 1: arg1
> 2: before © after
> 3: arg3
>
> *So now everything works because I converted the character into UTF-8.*
>
> I think what this points to is some form of invalid source encoding of the command line argument when spawning NATIVE applications. Here's what happens when I try to compile bug.c using cygwin's gcc:
>
> $ gcc bug.c -o bug-gcc.exe
> $ ./bug-gcc arg1 "before `cat copyright.txt` after" arg3
> 0: ./bug-gcc
> 1: arg1
> 2: before © after
> 3: arg3
>
> So there seems to be some sort of special marshaling of the command line arguments that only works when spawning cygwin apps, but breaks when running under native apps.
>
> Regards, -Edward

-- IWAMURO Motnori http://vmi.jp/
Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line
I think that you should run "export LANG=en_US.ISO-8859-1" instead of "export LANG=LANG=en_US.ISO-8859-1".

2009/5/30 Edward Lam edw...@sidefx.com:
> IWAMURO Motonori wrote:
>> The encoding of C locale is ASCII, and not ISO-8859-1. I don't think ASCII is the same as ISO-8859-1. Does it work on LANG=en_US.ISO-8859-1?
>
> No, it doesn't. Mind you though, I haven't managed to get piconv to recognize any of my LANG settings other than C in cygwin 1.7.
>
> $ export LANG=LANG=en_US.ISO-8859-1
> $ piconv
> perl: warning: Setting locale failed.
> perl: warning: Please check that your locale settings:
> LC_ALL = (unset),
> LANG = "LANG=en_US.ISO-8859-1"
> are supported and installed on your system.
> (... usage omitted...)
> $ ./bug arg1 "before `cat copyright.txt` after" arg3
> 0: E:\cygwin1.7\tmp\bug.exe
> 1: arg1
> 2: before
>
> Regards, -Edward

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7] wprintf is broken?
Sorry, my report was not correct. According to the specification, wide-character I/O and multibyte-character I/O must not be mixed on the same stream. (See the manual of the fwide() function.)

2009/5/17 Corinna Vinschen corinna-cyg...@cygwin.com:
> On May 16 23:56, IWAMURO Motonori wrote:
>> Hi. wprintf is broken? I compiled and ran the following source:
>>
>> #include <stdio.h>
>> #include <locale.h>
>> #include <wchar.h>
>> int main(void)
>> {
>>   setlocale(LC_ALL, "en_US.UTF-8");
>>   wprintf(L"%ls\n", L"Test\n");
>>   printf("Test\n");
>>   return 0;
>> }
>>
>> Result text: http://vmi.jp/tmp/wprintf-is-broken.txt
>
> Works for me:
>
> $ ./wp | od -c
> 0000000 T e s t \n \n T e s t \n
> 0000013
>
> Corinna
>
> -- Corinna Vinschen Please, send mails regarding Cygwin to Cygwin Project Co-Leader cygwin AT cygwin DOT com Red Hat

-- IWAMURO Motnori http://vmi.jp/
Re: [Fwd: [1.7] wcwidth failing configure tests]
I am correcting my proposal.

2009/5/15 IWAMURO Motonori deenhe...@gmail.com:
> I propose to use *_cjk() when the language part of LC_CTYPE is 'ja', 'ko', 'vi' or 'zh'.

This should read: when the language part of LC_CTYPE is 'ja', 'ko', or 'zh'. I have removed 'vi' (on advice from a maintainer of the NetBSD locale code).

-- IWAMURO Motnori http://vmi.jp/
Re: [Fwd: [1.7] wcwidth failing configure tests]
2009/5/21 Thomas Wolff t...@towo.net:
>> Therefore, I propose to use *_cjk() when the language part of LC_CTYPE is 'ja', 'ko', 'vi' or 'zh'.
> The problem with this is
> 1. As you say, there is no standard.

But:
- I think that my proposal doesn't violate any specification.
- I heard that there is an existing implementation that behaves like my proposal. (Sorry, I didn't catch the name of the system.)

> 2. If you wish to handle character widths compliant with the terminal your application is running in, there is no guarantee that your assumption of CJK width (or the actual locale setting if that model would be implemented) does indeed reflect the terminal's width properties.

Yes, I understand that, too. My proposal is purely a workaround. But it is the best available solution because there is no specification or standard for what I want.

> 3. In mintty, you can dynamically change width properties by selecting different fonts; mintty changes CJK width behaviour according to certain font properties. static configuration in your shell using a locale variable would not reflect this change

That is not a problem, because we (most Japanese-language users) do not need to change the mintty and locale settings after the first setup. We set LANG=ja_JP.UTF-8 and select a Japanese font for mintty.

> I see two ways to handle this:
> a) Ask Andy (author of mintty) to not do this switching;

That is not necessary, because the mechanism is based on another proposal of mine. (deenheart is my handle on Google Code.)

> other terminals don't switch either.

If we use other terminals, we need to switch the CJK width option manually. (xterm, mlterm, putty, ...)

> b) Determine the actual CJK width behaviour dynamically. That's what mined does (in addition to other width property detection in general).

That is the best solution. I think that we need to specify the following:
- escape sequences for the language context of the terminal emulator:
  -- setting the language context
  -- getting the language context
  -- getting the capability of the language context (whether the context is fixed, static or dynamic / acceptable languages)
- a new multilingualized string/terminal API for terminal-based applications.
And we would need to rewrite a great many applications to use the new API.

> I'm not happy with the idea of a cygwin-specific solution (or workaround).

I think that it is not a Cygwin-specific solution.

-- IWAMURO Motnori http://vmi.jp/
[1.7] wprintf is broken?
Hi. wprintf is broken? I compiled and ran the following source:

#include <stdio.h>
#include <locale.h>
#include <wchar.h>
int main(void)
{
  setlocale(LC_ALL, "en_US.UTF-8");
  wprintf(L"%ls\n", L"Test\n");
  printf("Test\n");
  return 0;
}

Result text: http://vmi.jp/tmp/wprintf-is-broken.txt

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
2009/5/17 Lenik le...@bodz.net:
> Thanks, but where can I get this patch?

You can check it out from CVS HEAD.

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
2009/5/15 Corinna Vinschen corinna-cyg...@cygwin.com:
> I have just trouble with SJIS, but that's not something I can easily test. Maybe you can look into that in the next couple of days?

Maybe I can. Please explain the details of the trouble.

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
>> Should the following part not be modified?
>> winsup/cygwin/fhandler_console.cc:
>> dev_state->con_mbtowc = __mbtowc;
>> dev_state->con_wctomb = __wctomb;
>
> I'd rather not. It only affects the console and if LANG=C I'd rather see the single bytes which make up the path instead of the corresponding UTF-8 character.
>
> Hm, maybe I misunderstood. In which manner should this be modifed?

I think:

dev_state->con_mbtowc = __mbtowc == __ascii_mbtowc ? __utf8_mbtowc : __mbtowc;
dev_state->con_wctomb = __wctomb == __ascii_wctomb ? __utf8_wctomb : __wctomb;

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
> I see a couple of potential problems.

What problems are those?

> And have some time to discuss whether these are something the user can or even should fix or workaround alone.

I think that an application that takes its locale from the environment variables and an application that uses no locale at all should be able to read and write the same byte sequence. However, I don't request this strongly, because the applications work correctly with UTF-8.

-- IWAMURO Motnori http://vmi.jp/
Re: [Fwd: [1.7] wcwidth failing configure tests]
2009/5/13 Corinna Vinschen vinsc...@redhat.com:
>> http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
> This looks nice.

Will you import Markus Kuhn's wcwidth implementation?

>> Trouble is, there's the thorny issue of the CJK Ambiguous Width category of characters, which consists of things like Greek and Cyrillic letters as well as line drawing symbols. Those have a width of 1 in Western use, yet with CJK fonts they have a width of 2. That's why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.
> We should use the standard variation alone, imho.

I don't think so.

1) It is very, very inconvenient for me :-) (At present I apply a local patch for CJK width support to cygwin1.dll in my environment.)

2) Unicode Standard Annex #11 http://www.unicode.org/unicode/reports/tr11/ recommends:

  5 Recommendations
  (snip)
  When processing or displaying data
  (snip)
  Ambiguous characters behave like wide or narrow characters depending on the context (language tag, script identification, associated font, source of data, or explicit markup; all can provide the context). If the context cannot be established reliably, they should be treated as narrow characters by default.

The recommendation is independent of legacy encodings. I think that a new locale category that specifies the context is necessary, because the context influences only display and text layout. However, there is no such standard now. Therefore, I propose to use *_cjk() when the language part of LC_CTYPE is 'ja', 'ko', 'vi' or 'zh'.

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
2009/5/15 Corinna Vinschen corinna-cyg...@cygwin.com:
> Here's one problem. What if an application uses setenv("LANG", ...)?

Oh. Hmm, I think that nothing should happen in that case.

> Do you want Cygwin to intercept all calls to setenv() to check for setting $LC_ALL/LC_CTYPE/LANG?

No. I think that only setlocale() has to do the check. The reasons:
- setlocale(LC_CTYPE, "C") is called from Cygwin startup.
- The following code becomes valid:
  setenv("LANG", ...);
  setlocale(LC_ALL, "");

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
Hi. My idea is as follows:

1) Separate the mbtowc/wctomb function entries into library usage and system usage (__mbtowc/__wctomb and __sys_mbtowc/__sys_wctomb).
2) If setlocale(LC_CTYPE) is called with a locale != "C", then lib == sys.
3) If setlocale(LC_CTYPE) is called with locale == "C", then sys is set from LC_ALL/LC_CTYPE/LANG. If LC_ALL/LC_CTYPE/LANG are not set, the UTF-8 converter is used.

Cygwin startup calls setlocale(LC_CTYPE, "C") in winsup/cygwin/dcrt0.cc. I think that the result is as follows:

1) LANG=C: lib = ASCII converter, sys = UTF-8 converter.
2) LANG=xx_XX.ENCODING, setlocale not called: lib = ASCII converter, sys = ENCODING converter.
3) LANG=xx_XX.ENCODING, setlocale(LC_ALL, "") called: lib = ENCODING converter, sys = ENCODING converter.

I think that [cat `read_dir_entry_and_print_app`] works correctly in all of the above cases. I am writing this patch and test code now.

> One problem can't be solved this way: If an application fetches and stores a filename, then switches the locale, and then tries to use the filename in another system call, the filename is potentially broken.

If the application switches the encoding while processing, I think that the problem is the responsibility of the application.

2009/5/13 Corinna Vinschen corinna-cyg...@cygwin.com:
> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>>> I propose that the filename encoding in C locale uses UTF-8 instead of SO/UTF-8. There are three reasons:
>>
>> That's an interesting thought. Do you have a patch and, if so, did you try it? Does it, for instance, help for the issue reported in the thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
>
> After examining the issue Lenik reported in the above thread, I'm at a loss how to solve this problem in a generic way. The problem is that the filename changes dependent on the character set used in $LANG. The reason is that every time a multibyte filename has to be generated, it has to be converted from UTF-16 to multibyte.
>
> For instance, taking one of the filename from Lenik's example. It's stored on the filesystem as the UTF-16 sequence
>
>   \u684c \u9762
>
> If I set LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence
>
>   0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
>
> If I set LANG to en_US.GBK, `ls' returns the filename
>
>   0xd7 0xc0 0xc3 0xe6
>
> And in case LANG=C, `ls' returns
>
>   0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
>
> So, dependent on the character set setting in the application, the idea of the filename differs. That's not exactly helpful for interoperability between different applications. I can think of two potential solutions to fix this problem:
>
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8 is the way files are stored on disk. That results in unchangable filenames which are always valid. But what if an application sets LANG=.SJIS and tries to create a file using SJIS character encoding? Should the file be created using the SJIS->UTF-16 conversion or should open fail with EILSEQ? That's not good.
>
> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then Cygwin uses the LC_CTYPE setting which corresponds to the current codepage. If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, Cygwin uses that to convert pathnames. If the application uses setlocale, Cygwin uses that setting to convert pathnames.
>
> One problem can't be solved this way: If an application fetches and stores a filename, then switches the locale, and then tries to use the filename in another system call, the filename is potentially broken.
>
> Any better ideas?
>
> Corinna
>
> -- Corinna Vinschen Please, send mails regarding Cygwin to Cygwin Project Co-Leader cygwin AT cygwin DOT com Red Hat

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
> That's basically how my patch works.

Sorry, I can't parse this sentence because of my poor English parser... Are you writing the patch for this problem?

> Btw., if you plan to write more and bigger patches for Cygwin, it would be necessary to sign a copyright assignment form. That's explained on http://cygwin.com/contrib.html.

Hmm, that seems to take a lot of time...

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
> I already wrote that patch, see http://cygwin.com/ml/cygwin-cvs/2009-q2/msg00066.html It seems to do what you are proposing.

I read it and built cygwin1.dll. It seems to work correctly. Should the following part not be modified?

winsup/cygwin/fhandler_console.cc:
dev_state->con_mbtowc = __mbtowc;
dev_state->con_wctomb = __wctomb;

But I think the patch solves only the UTF-8 case from the thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html. To support encodings other than UTF-8, it is necessary to keep separate copies of the following variables for the library and for the system:
- __mb_cur_max
- lc_ctype_charset
- __mbtowc
- __wctomb
The system copies of these variables would then be set from LC_ALL/LC_CTYPE/LANG when setlocale(LC_CTYPE, "C") is called.

-- IWAMURO Motnori http://vmi.jp/
[1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
Hi. I propose that the filename encoding in the C locale use UTF-8 instead of SO/UTF-8. There are three reasons:

1. Interoperability between Cygwin and various UNIX-like systems (Linux, *BSD, Solaris, and so on). UNIX-like systems treat a filename as an array of 8-bit bytes, and many applications on those systems send or receive filename information without regard to locale (mercurial, git, rsync, and so on).
2. UTF-8 is the only encoding that can handle multiple languages.
3. Today, the default encoding of modern UNIX-like systems is UTF-8.

Please consider it. Thanks.

-- IWAMURO Motnori http://vmi.jp/
[1.7][python] File operation API to multibyte filenames fails.
Hi. File operation APIs fail on multibyte filenames with Python on Cygwin-1.7. Which should be fixed, Python or Cygwin-1.7?

My environment: Windows XP SP3, Cygwin-1.7.0-46, and LANG=ja_JP.UTF-8

The following code fails in a directory that contains multibyte filenames:

>>> import os
>>> os.listdir(".")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 138] Invalid or incomplete multibyte or wide character: '.'

The following code works correctly:

>>> import os
>>> import locale
>>> locale.setlocale(locale.LC_CTYPE, '')
'ja_JP.UTF-8'
>>> os.listdir(".")
[(snip), '\xe3\x82\xb9\xe3\x82\xbf\xe3\x83\xbc\xe3\x83\x88 \xe3\x83\xa1\xe3\x83\x8b\xe3\x83\xa5\xe3\x83\xbc', '\xe3\x83\x87\xe3\x82\xb9\xe3\x82\xaf\xe3\x83\x88\xe3\x83\x83\xe3\x83\x97']

However, it is impossible to fix all Python scripts this way. There are two causes:
- Python intentionally avoids calling setlocale(LC_ALL, "") and/or setlocale(LC_CTYPE, "").
- When the locale is not set appropriately, Cygwin-1.7 converts non-ASCII characters into a special sequence. (See the "Convert chars invalid in the current codepage to a sequence ASCII SO" part of sys_cp_wcstombs in winsup/cygwin/strfuncs.cc.)

Which should be fixed, Python or Cygwin-1.7?

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7][python] File operation API to multibyte filenames fails.
Hi.

2009/5/8 Corinna Vinschen corinna-cyg...@cygwin.com:
> Your scripts. Python correctly doesn't use setlocale because it's the responsibility of the application to set the locale if it uses non-ASCII chars. And Cygwin simply has no chance to convert UTF-8 to UTF-16 if the application doesn't ask for UTF-8.

Oh, that is very, very difficult, because ALL Python utilities that access files or directories fail. For example, Mercurial doesn't work:

$ hg stat
abort: Invalid or incomplete multibyte or wide character: /home/iwa

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7][python] File operation API to multibyte filenames fails.
2009/5/9 Corinna Vinschen corinna-cyg...@cygwin.com:
> can't see a fault in Cygwin. Neither from strace, nor in a GDB session. The readdir calls return the filenames using the SO sequences so that a valid byte-stream is created which also works in the C locale. However, for some reason there's a EILSEQ (138) errno generated, but from what I can tell it's not generated in Cygwin or newlib code.

I think that I found a bug in Cygwin-1.7:

int bytes = f_wctomb (_REENT, buf, pw, charset, &ps);

f_wctomb is __ascii_wctomb when setlocale(LC_CTYPE) has not been called. When __ascii_wctomb returns -1, it also sets errno to EILSEQ. I think that it is necessary to reset errno after wctomb.

--- a/winsup/cygwin/strfuncs.cc Thu May 07 12:29:17 2009 +0900
+++ b/winsup/cygwin/strfuncs.cc Sat May 09 04:01:33 2009 +0900
@@ -432,6 +432,7 @@
 ASCII SO; UTF-8 representation of invalid char. */
 if (bytes == -1 && *charset != 'U'/*TF-8*/)
 {
+ errno = 0;
 buf[0] = 0x0e; /* ASCII SO */
 bytes = __utf8_wctomb (_REENT, buf + 1, pw, charset, &ps);
 if (bytes == -1)

[test code]
#include <stdio.h>
#include <dirent.h>
#include <errno.h>
int main(void)
{
  DIR *dir;
  struct dirent *ent;
  dir = opendir(".");
  while ((ent = readdir(dir)) != NULL)
    printf("%d\n", ent->d_name, errno);
  printf("%d\n", errno);
  closedir(dir);
  return 0;
}

[result 1.7.0-47]
0
0
138
138
138

[result applied above patch]
0
0
0
0
0

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7][python] File operation API to multibyte filenames fails.
Sorry, the test code is bad.

- printf("%d\n", ent->d_name, errno);
+ printf("%d\n", errno);

-- IWAMURO Motnori http://vmi.jp/
Re: [1.7][python] File operation API to multibyte filenames fails.
2009/5/9 Corinna Vinschen corinna-cyg...@cygwin.com:
> Cool. Thanks for the patch. This actually solves the problem. I applied the patch with just a little tweak.

Thanks. The following patch might be better.

--- a/winsup/cygwin/strfuncs.cc Thu May 07 12:29:17 2009 +0900
+++ b/winsup/cygwin/strfuncs.cc Sat May 09 04:39:49 2009 +0900
@@ -427,7 +427,9 @@
 path names) is transform_chars in path.cc. */
 if ((pw & 0xff00) == 0xf000)
 pw &= 0xff;
+ int eno = errno;
 int bytes = f_wctomb (_REENT, buf, pw, charset, &ps);
+ errno = eno;
 /* Convert chars invalid in the current codepage to a sequence
 ASCII SO; UTF-8 representation of invalid char. */
 if (bytes == -1 && *charset != 'U'/*TF-8*/)

> Nevertheless, it looks like python has a problem as well. Why does it check an errno if the functions returned successfully? That doesn't sound right to me.

When the last readdir returns NULL, Python detects an error because errno still holds the value left by an earlier readdir call:

1) ep = readdir(dirp); // ep->d_name == ".", errno == 0. Python checks only ep != NULL. -> OK
2) ep = readdir(dirp); // ep->d_name == "..", errno == 0. Python checks only ep != NULL. -> OK
3) ep = readdir(dirp); // ep->d_name == "\xe3\x82...", errno == 138. Python checks only ep != NULL. -> OK
4) ep = readdir(dirp); // ep->d_name == "\xe3\x83...", errno == 138. Python checks only ep != NULL. -> OK
5) ep = readdir(dirp); // ep == NULL, errno == 138. Python checks ep == NULL and errno != 0. -> Fail!

-- IWAMURO Motnori http://vmi.jp/
[1.7] cygstart with non-ASCII arguments and UTF-8 locale don't work.
Hi. cygstart with non-ASCII arguments and a UTF-8 locale doesn't work on cygwin-1.7.0.

$ ls -l
total 1
-rw-rw-r-- 1 iwa None 7 Apr 28 00:22 αβγ.txt
$ cygstart αβγ.txt
Unable to start 'C:\cygwin-1.7\tmp\αβγ.txt': The specified file was not found.

-- IWAMURO Motnori http://vmi.jp/

Attachment: cygstart.patch (binary data)