strftime %b is broken on ja_JP locale

2010-05-12 Thread IWAMURO Motonori
Hi.

strftime %b is broken on ja_JP locale on cygwin-1.7.5-1.

[monthtest.c]

#include <stdio.h>
#include <time.h>
#include <locale.h>
int main(void) {
  time_t now;
  struct tm *tm;
  char buffer[4096];
  setlocale(LC_ALL, "ja_JP.UTF-8");
  time(&now);
  tm = localtime(&now);
  strftime(buffer, sizeof(buffer), "[%B][%b]\n", tm);
  puts(buffer);
  return 0;
}


- result on Cygwin: [5月][5] - missing suffix 月 (U+6708).
- result on Debian lenny: [5月][ 5月]
-- 
IWAMURO Motnori http://vmi.jp/

--
Problem reports:   http://cygwin.com/problems.html
FAQ:   http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info:  http://cygwin.com/ml/#unsubscribe-simple



Re: The C locale

2009-09-29 Thread IWAMURO Motonori
2009/9/29  wynfi...@gmail.com:
 Also the following be suitable if possible..
        LANG=ja -> iso-2022-jp
     LANG=ja_JP -> iso-2022-jp

Hmm, I think that is unrealistic.
-- 
IWAMURO Motnori http://vmi.jp/




Re: The C locale

2009-09-29 Thread IWAMURO Motonori
2009/9/29 Corinna Vinschen corinna-cyg...@cygwin.com:
 The downside is that a user, who needs to work under the default ANSI
 codepage for some reason, has to know the name of the default ANSI
 codepage.

If this is a 1.5-to-1.7 migration problem, how about building a wizard
into setup.exe that sets the locale environment variable?
Wouldn't that be a proper solution?
-- 
IWAMURO Motnori http://vmi.jp/




Re: The C locale

2009-09-29 Thread IWAMURO Motonori
2009/9/29 Corinna Vinschen corinna-cyg...@cygwin.com:
 I asked if the default charset for the japanese language should be set
 to EUCJP rather than SJIS.  The actual implementation would have been
 like this

  if (lang=xx or lang=xx_XX with x in [a-z] and X in [A-Z]?)
    set_charset_from_codepage()

  set_charset_from_codepage()
  {
    switch (GetANSI ())
    [...]
    case 932:
      charset=EUCJP    -- Instead of the current charset=SJIS
    [...]
  }

I don't think that is good for Japanese users, because EUCJP is not a
substitute for SJIS.
-- 
IWAMURO Motnori http://vmi.jp/




Re: The C locale

2009-09-28 Thread IWAMURO Motonori
2009/9/27 IWAMURO Motonori deenhe...@gmail.com:
 LANG=ja -> EUCJP
 LANG=ja_JP -> EUCJP

 Hmmm, It is a difficult problem.

 I think selecting UTF-8 is good because eucJP is legacy.

 But, for interoperability with other UNIX-like system(*), I don't
 think selecting UTF-8 is good.

 * Solaris: ja, ja_JP -> eucJP
 * Linux (Debian): ja -> Unknown, ja_JP -> eucJP

 I need to think more...

After hearing other Japanese users' opinions, my conclusion is as
follows:

LANG=ja -> UTF-8
LANG=ja_JP -> UTF-8

This is because we specify eucJP explicitly when we need it.
-- 
IWAMURO Motnori http://vmi.jp/




Re: The C locale

2009-09-26 Thread IWAMURO Motonori
Hi.

 the default ANSI and OEM codepage on Japanese Windows systems is
 932/SJIS, right?

Yes.

 LANG=C -> UTF-8
(snip)
 LANG=ja_JP.SJIS -> SJIS

It's good.

 LANG=ja -> EUCJP
 LANG=ja_JP -> EUCJP

Hmm, this is a difficult problem.

I think selecting UTF-8 is good because eucJP is legacy.

But for interoperability with other UNIX-like systems (*), I don't
think selecting UTF-8 is good.

* Solaris: ja, ja_JP -> eucJP
* Linux (Debian): ja -> Unknown, ja_JP -> eucJP

I need to think more...




Re: The C locale

2009-09-26 Thread IWAMURO Motonori
2009/9/24 Corinna Vinschen corinna-cyg...@cygwin.com:
 My question is this:  Is the S-JIS implementation on UNIX systems
 also using a different implementation to avoid using characters
 from the ASCII range?  If so, can't we change the __sjis_wctomb
 and __sjis_mbtowc functions in the same manner as the __eucjp_wctomb
 and __eucjp_mbtowc functions to get a safer implementation?

I don't think we need to worry about that.

The eucJP problem does not occur in an SJIS environment, because
SJIS does not support JIS-X-0212.
-- 
IWAMURO Motnori http://vmi.jp/




Re: The C locale

2009-09-24 Thread IWAMURO Motonori
2009/9/22 Andy Koppe andy.ko...@gmail.com:
 Let's use the Windows ANSI codepage as the character set for the C
 locale, for both the conversion functions and filenames. This means
 CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
 ones, and so on.

I oppose this approach (using the ANSI codepage for the C locale) because
CP932 (the codepage for Japanese) is hostile to UNIX-like tools.

The reason is that CP932 byte sequences contain many metacharacters,
as the following pattern shows.

  single character of CP932:
/[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/

This has a ruinous effect on tools that ignore the locale.
-- 
IWAMURO Motnori http://vmi.jp/




Re: The C locale

2009-09-24 Thread IWAMURO Motonori
2009/9/24 Corinna Vinschen corinna-cyg...@cygwin.com:
 On Sep 24 16:03, IWAMURO Motonori wrote:
 2009/9/22 Andy Koppe andy.ko...@gmail.com:
  Let's use the Windows ANSI codepage as the character set for the C
  locale, for both the conversion functions and filenames. This means
  CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
  ones, and so on.

 I oppose the approach (the ANSI codepage is used at C locale) because
 CP932 (the codepage for Japanese) is hostile to the UNIX-like tools.

 The reason is that the CP932 format contains a lot of meta characters
 as follows.

   single character of CP932:
 /[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/

 I don't understand.  Are you saying that the single character in CP932
 consists of 12 bytes?  As far as I can see, CP932 is S-JIS, which
 is a just a simple double byte character set.  What am I missing.

- CP932 (Shift_JIS) has 1-byte and 2-byte characters.

- A 1-byte character is in the range 0x00-0x7F or 0xA0-0xDF.

- The first byte of a 2-byte character is in the range 0x81-0x9F or
  0xE0-0xFC.

- The second byte of a 2-byte character is in the range 0x40-0x7E or
  0x80-0xFC. This includes [, \, ], ^, `, {, |, }.

The last fact causes many problems in tools that ignore the locale and
use escaped strings, globbing, or regexps:

- Files or directories cannot be opened.
- Filenames are corrupted.
- Files are lost.

For example:

Case 1: The CP932 byte sequence of 項目表.xls is 8D 80 96 DA 95 *5C*
(=='\') 2E 78 6C 73. When a tool that ignores the locale treats this
as a string with backslash escapes, the 0x5C disappears.

Case 2: When I use the regexp /スポット/, I expect it to match strings
that contain スポット. But tools that ignore the locale treat it as
/ス\x83|ット/, because the byte sequence of スポット is 83 58 83
*7C* (=='|') 83 62 83 67. As a result, unexpected strings are matched.

Case 3: When I use the glob データ0[0-9].dat, it is treated as
デ\x81[\x83^0[0-9].dat. As a result, the expected files are not
matched.
-- 
IWAMURO Motnori http://vmi.jp/




Re: The C locale

2009-09-02 Thread IWAMURO Motonori
Hi.

2009/9/2 Andy Koppe andy.ko...@gmail.com:
 I see two good solutions:
 - Use the default Windows codepage for filenames, console, and
 multibyte functions. This is what happens already if you specify a
 locale with a language but no charset, e.g. en. Maximum 1.5
 compatibility.
 - Use UTF-8 throughout. Full Unicode support out-of-the box.

I want to use UTF-8 throughout, because:
- many UNIX tools that work over the network (e.g. rsync, scp, ...)
treat file names as 8-bit byte arrays.
- the default locale of modern UNIX-based OSes is *.UTF-8.
- files whose names include characters outside the codepage (e.g. files
in the iTunes folder) can be handled.
-- 
IWAMURO Motnori http://vmi.jp/




Re: [PATCH] Add @cjknarrow modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])

2009-06-27 Thread IWAMURO Motonori
Hi.

2009/6/27 Andy Koppe andy.ko...@gmail.com:
 And then there's the Linux compatibility angle, where ja_JP.UTF-8
 means ambiguous width 1 not 2.

Please don't judge this based on the current behavior of Linux, because:
- I don't think that behavior is correct.
- I am currently writing a patch for the problem.
-- 
IWAMURO Motnori http://vmi.jp/




Re: [PATCH] Add @cjknarrow modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])

2009-06-15 Thread IWAMURO Motonori
2009/6/15 Corinna Vinschen corinna-cyg...@cygwin.com:
 Yes, but the guideline exists.
 http://cygwin.com/ml/cygwin/2009-05/msg00444.html

 A single mail in a single mailing list of a single project.  That's rather
 a suggestion than a guideline...

Sorry, my wording was unclear. What I quoted is part of Unicode Standard
Annex #11, "East Asian Width".
Please see "When processing or displaying data" in section 5,
"Recommendations", at http://www.unicode.org/unicode/reports/tr11/ .

 If everybody agrees to this suggestion, here's the patch.

Is "cjk" a good prefix for the modifier name? It affects not CJK
characters but some symbols and European characters.
Please refer to Andy's opinion:
http://cygwin.com/ml/cygwin/2009-06/msg00240.html

Personally, I propose "ambinarrow", because the corresponding Vim
option is "ambiwidth".

Also, I don't think the current design is symmetrical. How about the
following patch? (I have not changed the modifier prefix.)

--- libc/locale/locale.c.ORIG   2009-06-15 23:05:40.81250 +0900
+++ libc/locale/locale.c   2009-06-15 22:56:35.546875000 +0900
@@ -398,7 +398,8 @@
   int (*l_mbtowc) (struct _reent *, wchar_t *, const char *, size_t,
   const char *, mbstate_t *);
 #ifdef _MB_CAPABLE
-  int cjknarrow = 0;
+#define CJK_DEFAULT -1
+  int cjk_lang = CJK_DEFAULT;
 #endif

   /* POSIX is translated to C, as on Linux. */
@@ -453,11 +454,14 @@
   if (c[0] == '@')
{
  /* Modifier */
- /* Only one modifier is recognized right now.  cjknarrow is used
-to modify the behaviour of wcwidth() for East Asian languages.
-For details see the comment at the end of this function. */
+ /* Only one modifier is recognized right now.  cjknarrow and
+cjkwide are used to modify the behaviour of wcwidth() for
+East Asian languages. For details see the comment at the
+end of this function. */
  if (!strcmp (c + 1, "cjknarrow"))
-   cjknarrow = 1;
+   cjk_lang = 0;
+ else if (!strcmp (c + 1, "cjkwide"))
+   cjk_lang = 1;
}
 #endif
 }
@@ -627,10 +631,11 @@
The result is stored in lc_ctype_cjk_lang and tested in wcwidth()
to figure out the width to return (1 or 2) for the CJK Ambiguous
Width category of characters. */
-  lc_ctype_cjk_lang = !cjknarrow
- && ((strncmp (locale, "ja", 2) == 0
-|| strncmp (locale, "ko", 2) == 0
-|| strncmp (locale, "zh", 2) == 0));
+  lc_ctype_cjk_lang = cjk_lang != CJK_DEFAULT
+   ? cjk_lang
+   : ((strncmp (locale, "ja", 2) == 0
+  || strncmp (locale, "ko", 2) == 0
+  || strncmp (locale, "zh", 2) == 0));
 #endif
 }
   else if (category == LC_MESSAGES)
-- 
IWAMURO Motnori http://vmi.jp/




Re: [PATCH] Add @cjknarrow modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])

2009-06-15 Thread IWAMURO Motonori
OK. I withdraw my proposal.

2009/6/16 Corinna Vinschen corinna-cyg...@cygwin.com:
 On Jun 15 23:35, IWAMURO Motonori wrote:
 2009/6/15 Corinna Vinschen:
  If everybody agrees to this suggestion, here's the patch.

 Is the name of modifier prefix cjk- good? It influences not CJK
 characters but a part of symbols and European characters.
 Please refer to Andy's opinion:
 http://cygwin.com/ml/cygwin/2009-06/msg00240.html

 It personally proposes ambinarrow because the switch of Vim is ambiwidth.

 I think cjk in the name is the right choice.  There are no ambiguous
 characters in western languages (well, probably there are, but the
 ambiguity is not on the level of character widths).  This is a problem
 which only has a meaning in these so called CJK languages.  It makes
 sense to me to use this in the modifier name.

 And, I don't think that it is symmetrical. How about the following
 patch? (I have not changed the name of modifier prefix)

 I'm not convinced that we need symmetry.  It looks like a nice idea for
 Cygwin or newlib, given that the setlocale language string is checked
 and picked to pieces hardcoded in the loadlocale function.

 However, besides of being unnecessary, other systems like Linux or BSD
 use the language string as directory name relative to the
 /usr/share/locale directory.  If this gets ever used on non-Cygwin
 systems, the symmetry (which has no precedent in the locale arena) would
 require these systems to create yet another subdirectory or symlink for
 the same purpose.  Even worse, if you propose that @cjkwide is a valid
 modifier for *any* language, you would make the whole mechanism on
 non-newlib based systems more complicated for no apparent reason.


 Corinna

 --
 Corinna Vinschen                  Please, send mails regarding Cygwin to
 Cygwin Project Co-Leader          cygwin AT cygwin DOT com
 Red Hat




-- 
IWAMURO Motnori http://vmi.jp/




Re: [Fwd: [1.7] wcwidth failing configure tests]

2009-06-14 Thread IWAMURO Motonori
2009/6/13 Thomas Wolff t...@towo.net:
 I have checked source data files in /usr/share/i18n/charmaps on my Linux 
 system, e.g. UTF-8.gz.
snip
 character widths are the same for all locales with the same charmap.

It was reported as a bug, but it still isn't fixed... X-(

http://sourceware.org/bugzilla/show_bug.cgi?id=4335
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=471021

 If you think you can get your proposal passed up-stream,
 go ahead and try it, please! If you succeed, everything is fine.

 Hmmm, I think that you have misunderstood something because my
explanation is bad.
 I called up-stream as the maintainance team of each OS, library, or
application.
 I don't think that there is something single up-stream.

Japanese-language users have been trying to get this problem fixed for
many years, but there has been little progress.

 - In NetBSD, the change to which wcwidth of East Asian Ambiguous Characters 
 returns 2 by CJK locale is planned.
 So the same issue (of compliance and portability, especially in the
 remote case) should be discussed in the NetBSD community.
 (Is there a suitable forum or mailing list to check?)

Sorry, I don't know; I was told this personally by one of the
NetBSD maintainers ( http://www.hi-matic.org/ (written in Japanese) ).

 I think that ALL locale implementations should treat East Asian
 Ambiguous Character Width as 2 for CJK locale.
 Again, I agree that IF you manage to get ALL implementations to follow
 this approach, the solution is fine. Please go ahead.

I will do so, but I want to solve the problem on Cygwin first of all.

 How to detect it? The application using wcwidth is not necessarily
 executed with terminal emulator. (e.g. text formatter)
 OK, my arguments refer to an interactive application that wants to
 control the precise representation of text on the screen.
 If for example a text formatter formats for paper printing, it would
 need to apply completely different assumptions anyway. The dreadful
 single/double width issue of cell-based terminals isn't relevant at
 all in that case.

I am thinking of applications that assume a fixed-pitch font, such as
text formatters (like the 'indent' command).

I hope the following two produce the same result:
- an auto-format filter program that uses 'wcwidth'.
- an auto-format command run in an editor (e.g. fill-paragraph,
indent-region, etc. in Emacs).
-- 
IWAMURO Motnori http://vmi.jp/




Re: [Fwd: [1.7] wcwidth failing configure tests]

2009-06-14 Thread IWAMURO Motonori
2009/6/13 Corinna Vinschen vinsc...@redhat.com:
 I'm not sure which standard you are referring to.

 The problem appears to be that there is no standard for the handling
 of ambiguous characters.

Yes, but the guideline exists.
http://cygwin.com/ml/cygwin/2009-05/msg00444.html
 2) Unicode Standard Annex #11
 http://www.unicode.org/unicode/reports/tr11/ recommends:
  5 Recommendations
 (snip)
  When processing or displaying data
 (snip)
  Ambiguous characters behave like wide or narrow characters depending
  on the context (language tag, script identification, associated
  font, source of data, or explicit markup; all can provide the
  context). If the context cannot be established reliably, they should
  be treated as narrow characters by default.

 Define the default for ja, ko, and zh to use width = 2, with a
 @cjknarrow (or whatever) modifier to use width = 1.

I think it is a good idea.
-- 
IWAMURO Motnori http://vmi.jp/




Re: [Fwd: [1.7] wcwidth failing configure tests]

2009-06-06 Thread IWAMURO Motonori
I oppose your proposal because I think that it is useless for us.

2009/6/6 Thomas Wolff t...@towo.net:
 the intention is that the codepage information should be the same
 for all locales having the UTF-8 (or any other) charmap.  So you
 cannot freely change width information among locales with the same
 charmap.

I don't think there is any such restriction.
The character set standard does not define the width of characters.

 Also, if ja_JP.UTF-8 would mean CJK width, how would you specify a
 working locale setting for a terminal that does not run a CJK width
 font but should yet use other Japanese settings? E.g. with rxvt
 which does not support CJK width.

Oh, we ALWAYS have a VERY, VERY, VERY hard time with this problem.

case1: We use only applications that handle character width without
relying on the locale.
case2: We write a patch that solves the character width problem and
send it upstream.
case3: We write the patch and apply it locally.
case4: We tearfully give up on correct screen display.
case5: We tearfully give up using the application.

I chose case5 for rxvt.

 Thus you could define e.g.
ja_jp.ut...@cjk
 or
ja_jp.ut...@cjkwidth
 to indicate CJK width properties. I guess this is the most compliant way to 
 go.

I don't think that is a good idea because:

- It is a Cygwin-specific solution (or workaround).
- NetBSD plans a change that makes wcwidth return 2 for East Asian
Ambiguous characters in CJK locales.

# to be continued.
-- 
IWAMURO Motnori http://vmi.jp/




Re: [Fwd: [1.7] wcwidth failing configure tests]

2009-06-06 Thread IWAMURO Motonori
# Continuation of discussion.
#
# I hope that all applications work correctly just by setting
# LANG=ja_JP.UTF-8.
# I don't want to give up using the binary packages, nor to keep
# applying many local patches.


 I don't think that it is the good idea because:

 - It is a cygwin-specific solution (or workaround).
 - In NetBSD, the change to which wcwidth of East Asian Ambiguous Characters 
 returns 2 by CJK locale is planned.

- and I don't think special cases should be given priority over the
general case.

 - I heard that there is an existing implementation that behave like my
 proposal. (Sorry, I didn't hear the system name.)
 Even if so, I think the way I described is more compatible with the locale
 mechanism as used elsewhere.

I think that ALL locale implementations should treat East Asian
Ambiguous Character Width as 2 for CJK locale.

 It is no problem because we -- most Japanese language users -- need
 not change the settings of mintty and locale after first setup.
 We set LANG=ja_JP.UTF-8 and select a Japanese font for mintty.
 In any case, mined running in mintty will detect CJK width itself,
 regardless of locale setting, with coming versions of both programs
 even when it gets changed on-the-fly :)

Sorry, I can't understand the above because I am not good at English.

 This sounds complicated.

I don't think so. I think we should consider the following issues if a
new mechanism is introduced.

The existing locale and terminal APIs don't support:
- Unicode BiDi.
- Unicode control characters.
- Unicode combining characters.
- Multilingualization. (*)
- Detecting the font/fontset selected in the terminal emulator
  (including the case where there is no tty).

* Currently, we can't use Japanese, Chinese, and Korean at the same time
even if we use Unicode, because many glyphs differ considerably between
these languages even when the code point is the same.

 With my proposal, an application that wishes to auto-adjust on width
 properties (maybe even when changing) and which (unlike mined) uses
 the system wcwidth functions could proceed as follows:
 * Detect CJK width by using a simple test string width detection.
 * (Optional) When receiving a SIGWINCH signal (future version of MinTTY),
  repeat this detection.
 * If e.g. LC_CTYPE starts with ja_JP.UTF-8, call setlocale with
  either ja_jp.ut...@cjkwidth or ja_JP.UTF-8.

How would it detect that? An application that uses wcwidth does not
necessarily run in a terminal emulator (e.g. a text formatter).

  I'm not happy with the idea of a cygwin-specific solution (or workaround).
 I think that it is not cygwin-specific solution.
 As I tried to suggest above, using UTF-8 for different width data on one
 system would be quite specific, using the @ modifier syntax would not.

UTF-8 is only an encoding scheme. It does not specify the character width.
-- 
IWAMURO Motnori http://vmi.jp/




Re: [Fwd: [1.7] wcwidth failing configure tests]

2009-06-06 Thread IWAMURO Motonori
2009/6/6 Andy Koppe andy.ko...@gmail.com:
 However, to make the locale setting more convenient for CJK users,
 there could be modifiers for both widths. Without modifier, the CJK
 locales would default to Ambiguous Wide, while everything else would
 default to Ambiguous Narrow.

That is acceptable to me.

 Puzzled that this hasn't been solved in glibc years ago ...

I also looked into it, but I was not able to find out the reason.

One Debian user is trying to fix it, but there has been no progress...

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=471021
http://sourceware.org/bugzilla/show_bug.cgi?id=4335
-- 
IWAMURO Motnori http://vmi.jp/




Re: [Fwd: [1.7] wcwidth failing configure tests]

2009-06-06 Thread IWAMURO Motonori
2009/6/6 Corinna Vinschen corinna-cyg...@cygwin.com:
 I vote for @cjkwide, regardless of Andy's objection.  People using CJK
 will know the meaning and it has the additional advantage to be a rather
 simple to memorize identifier.

I oppose the @cjkwide approach because I don't think special cases
should be given priority over the general case.

I think that Andy's approach is better.
-- 
IWAMURO Motnori http://vmi.jp/




[1.7][BUG] MOJIBAKE title bar

2009-06-04 Thread IWAMURO Motonori
Hi.

The title bar shows mojibake when the following 'wintitle.sh' is run in
the command prompt in a UTF-8 environment (for example:
LANG=ja_JP.UTF-8).

http://vmi.jp/tmp/wintitle.sh

http://vmi.jp/tmp/01good-mintty.png is the good result on MinTTY.
http://vmi.jp/tmp/02bad-cmd.png is the bad result on command prompt.

Thanks.
--
IWAMURO Motnori http://vmi.jp/




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread IWAMURO Motonori
Hi.

How about adding a setting for the locale environment variable
(such as LANG) to the Cygwin installer?

2009/6/3 Corinna Vinschen corinna-cyg...@cygwin.com:
 On Jun  3 09:18, Edward Lam wrote:
 Corinna Vinschen wrote:
 The question is, what do you expect?  [...]
 [...]
 Wikipedia has several suggestions on how to handle invalid UTF-8 byte
 sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the
 rule that uses the replacement character.

 Chris implemented using the invalid code point solution.  The discussion
 in http://www.mail-archive.com/linux-u...@nl.linux.org/msg00080.html
 supports this solution.  What's missing so far is the way back, from
 an invalid single second half of a surrogate pair in the 0xDCxx range
 back to the correct byte value.  I'm just looking into that.

  How is anybody supposed to know that the file which consists
  of the single byte 0xa9 has *any* meaning at all?  Why should it be
  the copyright sign, of all things?

 What I was attempting to do was to have NO conversion. In the
 real case that I into this, the bug.exe was the one to properly
 interpret what the byte 0xA9 meant from the command line. Yes, I know
 there are several workarounds.

 The command line is always converted to UTF-16 when calling a native
 Win32 application.  If we don't do it (because we call CreateProcessA),
 Windows would do it.  As matters stand, we have to convert ourselves,
 because we must call CreateProcessW.  Either way, the problem persists.
 We just don't know what the correct conversion is for the given input.
 We have to rely on a correct setting of $LC_ALL/$LANG/$LC_CTYPE.

 If we default to the ANSI codepage, you will have the same problem,
 just upside down.  In both cases you will have even more problems if
 you start using characters not available in your default codepage.

 This is where I disagreed with Alexey. What we're really arguing here is
 whether which default will run into the least problems for the most
 common usage. This is subjective of course.

 Definitely.  The right solution is always only right for a given value
 of right.  What if the user has set LANG to, say, ja_JP.eucJP?  That
 user of course expects that the stuff on the command line is converted
 to UTF-16 using the eucJP encoding.  Everything else would just be very
 surprising.

 What's left as questionable is the LANG=C default case.  Due to the
 discussion from the last month we now use UTF-8 as default encoding,
 because it's the only encoding which covers all (valid) characters.
 Sure, we could also convert the command line using the current ANSI
 codepage as Windows does it when calling CreateProcessA in this case.

 Maybe we should do that for testing?  Anybody having a strong opinion
 here?


 Corinna

 --
 Corinna Vinschen                  Please, send mails regarding Cygwin to
 Cygwin Project Co-Leader          cygwin AT cygwin DOT com
 Red Hat






-- 
IWAMURO Motnori http://vmi.jp/




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread IWAMURO Motonori
I think that this problem is caused by the locale environment variable
not being set.
Therefore, I think that the problem can be solved by making setup.exe
require that setting.

2009/6/4 Corinna Vinschen corinna-cyg...@cygwin.com:

 http://cygwin.com/acronyms/#PCYMTNQREAIYR
 http://cygwin.com/acronyms/#TOFU

 On Jun  4 00:03, IWAMURO Motonori wrote:
 2009/6/3 Corinna Vinschen
  What's left as questionable is the LANG=C default case.  Due to the
  discussion from the last month we now use UTF-8 as default encoding,
  because it's the only encoding which covers all (valid) characters.
  Sure, we could also convert the command line using the current ANSI
  codepage as Windows does it when calling CreateProcessA in this case.
 
  Maybe we should do that for testing?  Anybody having a strong opinion
  here?

 How about the addition of the setting of the locale environment
 variable (like LANG) to the Cygwin installer?

 I'm sorry, but I don't understand how that's connected to the behaviour
 of the Cygwin DLL.  Setup.exe is an entirely different beast.


 Corinna

 --
 Corinna Vinschen                  Please, send mails regarding Cygwin to
 Cygwin Project Co-Leader          cygwin AT cygwin DOT com
 Red Hat






-- 
IWAMURO Motnori http://vmi.jp/




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread IWAMURO Motonori
Also, I think that UTF-8 is the best solution when the LC_CTYPE
category is set to C.

2009/6/4 IWAMURO Motonori deenhe...@gmail.com:
 I think that this problem is caused by missing setting the locale
 environment variable.
 Therefore, I think that the problem can be solved by compelling the
 setting with setup.exe.

 2009/6/4 Corinna Vinschen corinna-cyg...@cygwin.com:

 http://cygwin.com/acronyms/#PCYMTNQREAIYR
 http://cygwin.com/acronyms/#TOFU

 On Jun  4 00:03, IWAMURO Motonori wrote:
 2009/6/3 Corinna Vinschen
  What's left as questionable is the LANG=C default case.  Due to the
  discussion from the last month we now use UTF-8 as default encoding,
  because it's the only encoding which covers all (valid) characters.
  Sure, we could also convert the command line using the current ANSI
  codepage as Windows does it when calling CreateProcessA in this case.
 
  Maybe we should do that for testing?  Anybody having a strong opinion
  here?

 How about the addition of the setting of the locale environment
 variable (like LANG) to the Cygwin installer?

 I'm sorry, but I don't understand how that's connected to the behaviour
 of the Cygwin DLL.  Setup.exe is an entirely different beast.


 Corinna

 --
 Corinna Vinschen                  Please, send mails regarding Cygwin to
 Cygwin Project Co-Leader          cygwin AT cygwin DOT com
 Red Hat






 --
 IWAMURO Motnori http://vmi.jp/




-- 
IWAMURO Motnori http://vmi.jp/




[1.7][BUG] winsup/cygwin/strfuncs.cc

2009-06-03 Thread IWAMURO Motonori
Hi.

I found a trivial bug.

*pmbs is unsigned char.
'\x80' is -128 because it is a char literal (not unsigned char).
-> *pmbs > '\x80' is always true.

# Shouldn't it be ">= 0x80" rather than "> 0x80"?

--- winsup/cygwin/strfuncs.cc   31 May 2009 03:59:38 -  1.30
+++ winsup/cygwin/strfuncs.cc   3 Jun 2009 17:59:23 -
@@ -572,7 +572,7 @@
  --len;
}
}
-  else if ((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs,
nms, charset, &ps)) < 0 && *pmbs > '\x80')
+  else if ((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs,
nms, charset, &ps)) < 0 && *pmbs > 0x80)
{
  /* This should probably be handled in f_mbtowc which can operate
 on sequences rather than individual characters.
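
The pitfall in this report can be reproduced outside Cygwin. Below is a minimal sketch, assuming a platform where plain char is signed, so that '\x80' has the value -128. The constant and both helper names are hypothetical and are not taken from strfuncs.cc.

```c
/* Value of '\x80' where plain char is signed (assumption: x86-like ABI). */
#define SIGNED_CHAR_X80 (-128)

/* Buggy form: the unsigned byte is promoted to int (0..255), so it is
   always greater than -128 and the test never filters anything. */
int is_high_byte_buggy(unsigned char b)
{
    return b > SIGNED_CHAR_X80;
}

/* Fixed form: compare against the integer constant 0x80 (128). */
int is_high_byte_fixed(unsigned char b)
{
    return b > 0x80;
}
```

With the fixed form the byte 0x80 itself is still rejected, which is why the choice between ">" and ">=" matters.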
-- 
IWAMURO Motnori http://vmi.jp/




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread IWAMURO Motonori
Hi.

The encoding of C locale is ASCII, and not ISO-8859-1.
I don't think ASCII is the same as ISO-8859-1.
Does it work on LANG=en_US.ISO-8859-1?
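
To illustrate the difference: in ISO-8859-1 the single byte 0xA9 is the copyright sign, while UTF-8 encodes the same character as two bytes. The following is only a sketch of that conversion; latin1_to_utf8 is a hypothetical helper, not a Cygwin function.

```c
#include <stddef.h>

/* Convert n ISO-8859-1 bytes to UTF-8. out must hold at least 2*n bytes.
   Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];                                  /* ASCII is unchanged */
        } else {
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));   /* lead byte 0xC2 or 0xC3 */
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F)); /* continuation byte */
        }
    }
    return o;
}
```

For the input byte 0xA9 this writes the two bytes 0xC2 0xA9.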

2009/5/29 Edward Lam edw...@sidefx.com:
 Alexey Borzenkov wrote:

 On Thu, May 28, 2009 at 7:28 PM, Edward Lam edw...@sidefx.com wrote:

 PS. In case you haven't noticed, copyright.txt is not a long file. It
 consists of a single byte, 0xA9.

 Did you try utf-8 encoding copyright.txt? Perhaps your locale is utf-8
 and the encoder fails.

 How is one supposed to determine one's locale in cygwin? I do NOT have LANG,
 or any of the LC environment variables set. I even tried explicitly setting
 LANG=C and it still fails.

 The problem does seem to stem from the new UTF-8 support in cygwin 1.7.
 However, I think something is going on here that is unexpected because
 trying something similar on Linux has no problems. To confirm that it was an
 UTF-8 related problem, let me repeat the steps slightly differently again.
 Here we assume that I've already got bug.exe compiled which simply prints
 out its arguments.

 $ export LANG=C

 $ ./bug arg1 "before `cat copyright.txt` after" arg3
 0: E:\cygwin1.7\tmp\bug.exe
 1: arg1
 2: before

 *Notice that argc is 3 when it should be 4!*

 $ piconv -f iso-8859-1 -t utf8 < copyright.txt > fubar.txt

 $ ./bug arg1 "before `cat fubar.txt` after" arg3
 0: E:\cygwin1.7\tmp\bug.exe
 1: arg1
 2: before © after
 3: arg3

 *So now everything works because I converted the character into UTF-8.*

 I think what this points to is some form of invalid source encoding of the
 command line argument when spawning NATIVE applications.

 Here's what happens when I try to compile bug.c using cygwin's gcc:

 $ gcc bug.c -o bug-gcc.exe

 $ ./bug-gcc arg1 "before `cat copyright.txt` after" arg3
 0: ./bug-gcc
 1: arg1
 2: before © after
 3: arg3

 So there seems to be some sort of special marshaling of the command line
 arguments that only works when spawning cygwin apps, but breaks when running
 under native apps.

 Regards,
 -Edward






-- 
IWAMURO Motnori http://vmi.jp/




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread IWAMURO Motonori
I think that you should set "export LANG=en_US.ISO-8859-1" instead of
"export LANG=LANG=en_US.ISO-8859-1".

2009/5/30 Edward Lam edw...@sidefx.com:
 IWAMURO Motonori wrote:

 The encoding of C locale is ASCII, and not ISO-8859-1.
 I don't think ASCII is the same as ISO-8859-1.
 Does it work on LANG=en_US.ISO-8859-1?

 No, it doesn't. Mind you though, I haven't managed to get piconv to
 recognize any of my LANG settings other than C in cygwin 1.7.

 $ export LANG=LANG=en_US.ISO-8859-1

 $ piconv
 perl: warning: Setting locale failed.
 perl: warning: Please check that your locale settings:
        LC_ALL = (unset),
        LANG = LANG=en_US.ISO-8859-1
    are supported and installed on your system.

 (... usage omitted...)

 $ ./bug arg1 "before `cat copyright.txt` after" arg3
 0: E:\cygwin1.7\tmp\bug.exe
 1: arg1
 2: before

 Regards,
 -Edward






-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7] wprintf is broken?

2009-05-27 Thread IWAMURO Motonori
Sorry, my report was not correct.
According to the specification, wide-character I/O and
multibyte-character I/O must not be mixed on the same stream.
(see the manual of fwide() function)
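
A minimal sketch of how the orientation of a stream can be observed with fwide(). The helper name stream_orientation_after is hypothetical, and it writes to a temporary file rather than stdout. Once the first operation has given a stream wide orientation, byte-oriented calls such as printf() are not supposed to be used on that same stream.

```c
#include <stdio.h>
#include <wchar.h>

/* Returns the orientation of a fresh temporary stream after one output call:
   > 0 wide-oriented, < 0 byte-oriented, 0 if the stream could not be opened.
   Pass wide != 0 to write with fwprintf(), 0 to write with fprintf(). */
int stream_orientation_after(int wide)
{
    FILE *fp = tmpfile();
    int orientation;

    if (fp == NULL)
        return 0;
    if (wide)
        fwprintf(fp, L"Test\n");   /* first operation fixes wide orientation */
    else
        fprintf(fp, "Test\n");     /* first operation fixes byte orientation */
    orientation = fwide(fp, 0);    /* mode 0 only queries the orientation */
    fclose(fp);
    return orientation;
}
```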

2009/5/17 Corinna Vinschen corinna-cyg...@cygwin.com:
 On May 16 23:56, IWAMURO Motonori wrote:
 Hi.

 wprintf is broken?

 I compile & run the following source:

 #include <stdio.h>
 #include <locale.h>
 #include <wchar.h>
 int main(void) {
   setlocale(LC_ALL, "en_US.UTF-8");
   wprintf(L"%ls\n", L"Test\n");
   printf("Test\n");
   return 0;
 }

 Result text:
 http://vmi.jp/tmp/wprintf-is-broken.txt

 Works for me:

  $ ./wp | od -c
  000   T   e   s   t  \n  \n   T   e   s   t  \n
  013


 Corinna

 --
 Corinna Vinschen                  Please, send mails regarding Cygwin to
 Cygwin Project Co-Leader          cygwin AT cygwin DOT com
 Red Hat






-- 
IWAMURO Motnori http://vmi.jp/




Re: [Fwd: [1.7] wcwidth failing configure tests]

2009-05-26 Thread IWAMURO Motonori
I correct my proposal.

2009/5/15 IWAMURO Motonori deenhe...@gmail.com:
 I propose to use *_cjk() when the language part of LC_CTYPE
 is 'ja', 'ko', 'vi' or 'zh'.

Corrected: use *_cjk() when the language part of LC_CTYPE is 'ja',
'ko', or 'zh'. I have removed 'vi' on the advice of a NetBSD locale
maintainer.
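
The corrected rule could be sketched as follows. This only illustrates the proposal; locale_wants_cjk_width is a hypothetical helper name and is not part of Cygwin or of Markus Kuhn's wcwidth.c.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the language part of a locale name such as "ja_JP.UTF-8"
   is "ja", "ko" or "zh", otherwise 0. */
int locale_wants_cjk_width(const char *locale)
{
    static const char *const cjk_langs[] = { "ja", "ko", "zh" };
    size_t len;

    if (locale == NULL)
        return 0;
    len = strcspn(locale, "_.@");   /* length of the language part */
    if (len != 2)
        return 0;
    for (size_t i = 0; i < sizeof cjk_langs / sizeof cjk_langs[0]; i++)
        if (strncmp(locale, cjk_langs[i], 2) == 0)
            return 1;
    return 0;
}
```

A caller would use the result to choose between the standard width function and its *_cjk() variant.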
-- 
IWAMURO Motnori http://vmi.jp/




Re: [Fwd: [1.7] wcwidth failing configure tests]

2009-05-20 Thread IWAMURO Motonori
2009/5/21 Thomas Wolff t...@towo.net:
  Therefore, I propose to use *_cjk() when the language part of LC_CTYPE
  is 'ja', 'ko', 'vi' or 'zh'.
 The problem with this is
 1. As you say, there is no standard.

But,
- I think that my proposal doesn't violate any specification.
- I heard that there is an existing implementation that behaves like my
proposal. (Sorry, I didn't hear the system name.)

 2. If you wish to handle character widths compliant with the terminal
   your application is running in, there is no guarantee that your
   assumption of CJK width (or the actual locale setting if that model
   would be implemented) does indeed reflect the terminal's width properties.

Yes, I understand that, too. My proposal is purely a workaround.
But it is the best available solution, because there is no
specification or standard for what I want.

 3. In mintty, you can dynamically change width properties by selecting
   different fonts; mintty changes CJK width behaviour according to certain
   font properties. static configuration in your shell using a locale
   variable would not reflect this change

It is not a problem, because we (most Japanese-language users) do not
need to change the mintty and locale settings after the first setup.
We set LANG=ja_JP.UTF-8 and select a Japanese font for mintty.

   I see two ways to handle this:
   a) Ask Andy (author of mintty) to not do this switching;

It is not necessary, because the mechanism is based on another
proposal of mine. (deenheart is my handle on Google Code.)

 other terminals don't switch either.

If we use other terminals, we need to switch the CJK width option manually.
(xterm, mlterm, putty, ...)

   b) Determine the actual CJK width behaviour dynamically. That's what
      mined does (in addition to other width property detection in general).

It is the best solution. I think that we need to specify the following:
- escape sequences for the language context of the terminal emulator:
-- setting the language context
-- getting the language context
-- getting the capabilities of the language context
   (context is fixed, static or dynamic / acceptable languages)
- a new multilingual string/terminal API for terminal-based applications.

And we would need to rewrite a great many applications to use the new API.

 I'm not happy with the idea of a cygwin-specific solution (or workaround).

I think that it is not a Cygwin-specific solution.
-- 
IWAMURO Motnori http://vmi.jp/




[1.7] wprintf is broken?

2009-05-16 Thread IWAMURO Motonori
Hi.

wprintf is broken?

I compile & run the following source:

#include <stdio.h>
#include <locale.h>
#include <wchar.h>
int main(void) {
  setlocale(LC_ALL, "en_US.UTF-8");
  wprintf(L"%ls\n", L"Test\n");
  printf("Test\n");
  return 0;
}

Result text:
http://vmi.jp/tmp/wprintf-is-broken.txt
-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-16 Thread IWAMURO Motonori
2009/5/17 Lenik le...@bodz.net:
 Thanks, but where can I get this patch?

You can check it out from CVS HEAD.
-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-15 Thread IWAMURO Motonori
2009/5/15 Corinna Vinschen corinna-cyg...@cygwin.com:
 I have just trouble with SJIS, but that's not something I can easily
 test. Maybe you can look into that in the next couple of days?

Maybe I can. Please explain details of the trouble.
-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread IWAMURO Motonori
2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
  Should the following part not be modified?
 
  winsup/cygwin/fhandler_console.cc:
   dev_state->con_mbtowc = __mbtowc;
   dev_state->con_wctomb = __wctomb;

 I'd rather not.  It only affects the console and if LANG=C I'd rather
 see the single bytes which make up the path instead of the corresponding
 UTF-8 character.

 Hm, maybe I misunderstood.  In which manner should this be modifed?

I think:

dev_state->con_mbtowc = __mbtowc == __ascii_mbtowc ? __utf8_mbtowc : __mbtowc;
dev_state->con_wctomb = __wctomb == __ascii_wctomb ? __utf8_wctomb : __wctomb;
-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread IWAMURO Motonori
2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
 I see a couple of potential problems.

What problems are those?

 And have some time to discuss whether these are something the
 user can or even should fix or workaround alone.

I think that an application that takes its locale from the environment
variables and an application that uses no locale should be able to read
and write the same byte sequence.

However, I don't strongly request it, because the applications work
correctly in UTF-8.
-- 
IWAMURO Motnori http://vmi.jp/




Re: [Fwd: [1.7] wcwidth failing configure tests]

2009-05-14 Thread IWAMURO Motonori
2009/5/13 Corinna Vinschen vinsc...@redhat.com:
 http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

 This looks nice.

Will you import Markus Kuhn's wcwidth implementation?

 Trouble is, there's the thorny issue of the CJK Ambiguous Width
 category of characters, which consists of things like Greek and
 Cyrillic letters as well as line drawing symbols. Those have a width
 of 1 in Western use, yet with CJK fonts they have a width of 2. That's
 why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.

 We should use the standard variation alone, imho.

I don't think so.

1) It is very very inconvenient for me :-)
(Now, I apply the local patch of CJK width support to cygwin1.dll in
my environment.)

2) Unicode Standard Annex #11
http://www.unicode.org/unicode/reports/tr11/ recommends:
 5 Recommendations
(snip)
 When processing or displaying data
(snip)
 Ambiguous characters behave like wide or narrow characters depending
 on the context (language tag, script identification, associated
 font, source of data, or explicit markup; all can provide the
 context). If the context cannot be established reliably, they should
 be treated as narrow characters by default.

The recommendation is independent of legacy encoding.

I think that a new locale category that specifies the context is necessary,
because the context influences only the display or text layout.

However, there is no such standard now.

Therefore, I propose to use *_cjk() when the language part of LC_CTYPE
is 'ja', 'ko', 'vi' or 'zh'.
-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread IWAMURO Motonori
2009/5/15 Corinna Vinschen corinna-cyg...@cygwin.com:
 Here's one problem.  What if an application uses setenv("LANG", ...)?

Oh. Hmmm, I think that nothing needs to happen in that case.

 Do you want Cygwin to intercept all calls to setenv() to check for
 setting $LC_ALL/LC_CTYPE/LANG?

No. I think that only setlocale() has to do the check.
The reasons:
- setlocale(LC_CTYPE, "C") is called at Cygwin startup.
- The following code then becomes valid:
setenv("LANG", ...);
setlocale(LC_ALL, "");
-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread IWAMURO Motonori
Hi.

My idea is as follows:

1) Separate the mbtowc/wctomb function entries into library usage and
system usage (__mbtowc/__wctomb and __sys_mbtowc/__sys_wctomb).

2) If setlocale(LC_CTYPE, ...) is called with a locale other than "C",
then lib == sys.

3) If setlocale(LC_CTYPE, ...) is called with the "C" locale, then sys is
set from LC_ALL/LC_CTYPE/LANG. If none of LC_ALL/LC_CTYPE/LANG is set, the
UTF-8 converter is used.

Cygwin startup calls setlocale(LC_CTYPE, "C") in winsup/cygwin/dcrt0.cc.

I think that the result is as follows:

1) LANG=C
   lib = ascii converter, sys = UTF-8 converter.

2) LANG=xx_XX.ENCODING and setlocale is not called.
   lib = ascii converter, sys = ENCODING converter.

3) LANG=xx_XX.ENCODING and setlocale(LC_ALL, "") is called.
   lib = ENCODING converter, sys = ENCODING converter.

I think that [cat `read_dir_entry_and_print_app`] works correctly in all of the above cases.

I am writing this patch and test code now.
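
The rule for choosing the "sys" converter in the idea above could look roughly like the following. This is only a sketch under simplifying assumptions and is not the patch itself: sys_charset_from_env is a hypothetical name, the values of the three variables are passed in as arguments (NULL when unset), and any "@modifier" suffix is not handled.

```c
#include <stddef.h>
#include <string.h>

/* Pick the charset for the "system" converter from the locale variables in
   the precedence order LC_ALL, LC_CTYPE, LANG. Falls back to "UTF-8" when
   nothing is set or when the value has no ".CHARSET" part (e.g. "C"). */
const char *sys_charset_from_env(const char *lc_all, const char *lc_ctype,
                                 const char *lang)
{
    const char *vars[3];

    vars[0] = lc_all;
    vars[1] = lc_ctype;
    vars[2] = lang;
    for (size_t i = 0; i < 3; i++) {
        if (vars[i] != NULL && vars[i][0] != '\0') {
            const char *dot = strchr(vars[i], '.');
            return (dot != NULL && dot[1] != '\0') ? dot + 1 : "UTF-8";
        }
    }
    return "UTF-8";
}
```

With LANG=C this yields "UTF-8", and with LANG=ja_JP.SJIS it yields "SJIS", matching cases 1 and 2 above.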

 One problem can't be solved this way:  If an application fetches
 and stores a filename, then switches the locale, and then tries
 to use the filename in another system call, the filename is
 potentially broken.

If the application switches the encoding while processing, I think
that the problem is the responsibility of the application.

2009/5/13 Corinna Vinschen corinna-cyg...@cygwin.com:
 On May 12 19:37, Corinna Vinschen wrote:
 On May 13 02:29, IWAMURO Motonori wrote:
  I propose that the filename encoding in C locale uses UTF-8 instead of 
  SO/UTF-8.
 
  There are three reasons:

 That's an interesting thought.  Do you have a patch and, if so, did you
 try it?  Does it, for instance, help for the issue reported in the
 thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?

 After examining the issue Lenik reported in the above thread, I'm at
 a loss how to solve this problem in a generic way.

 The problem is that the filename changes dependent on the character
 set used in $LANG.  The reason is that every time a multibyte filename
 has to be generated, it has to be converted from UTF-16 to multibyte.

 For instance, taking one of the filename from Lenik's example.  It's
 stored on the filesystem as the UTF-16 sequence \u684c \u9762.  If I set
 LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence

  0xe6 0xa1 0x8c 0xe9 0x9d 0xa2

 If I set LANG to en_US.GBK, `ls' returns the filename

  0xd7 0xc0 0xc3 0xe6

 And in case LANG=C, `ls' returns

  0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2

 So, dependent on the character set setting in the application, the idea
 of the filename differs.  That's not exactly helpful for interoperability
 between different applications.

 I can think of two potential solutions to fix this problem:

 (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
    is the way files are stored on disk.  That results in unchangable
    filenames which are always valid.

    But what if an application sets LANG=.SJIS and tries to create
    a file using SJIS character encoding?  Should the file be created
    using the SJIS-UTF-16 conversion or should open fail with EILSEQ?
    That's not good.

 (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
    Cygwin uses the LC_CTYPE setting which corresponds to the current
    codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,
    Cygwin uses that to convert pathnames.  If the application uses
    setlocale, Cygwin uses that setting to convert pathnames.

    One problem can't be solved this way:  If an application fetches
    and stores a filename, then switches the locale, and then tries
    to use the filename in another system call, the filename is
    potentially broken.

 Any better ideas?


 Corinna

 --
 Corinna Vinschen                  Please, send mails regarding Cygwin to
 Cygwin Project Co-Leader          cygwin AT cygwin DOT com
 Red Hat






-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread IWAMURO Motonori
2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
 That's basically how my patch works.

Sorry, I can't parse this sentence because of my poor English parser...
Are you writing the patch for this problem?

 Btw., if you plan to write more and bigger patches for Cygwin, it would
 be necessary to sign a copyright assignment form.  That's explained on
 http://cygwin.com/contrib.html.

Hmm, that seems to take a lot of time...
-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread IWAMURO Motonori
2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
 I already wrote that patch, see
 http://cygwin.com/ml/cygwin-cvs/2009-q2/msg00066.html
 It seems to do what you are proposing.

I read it and built cygwin1.dll. It seems to work correctly.

Should the following part not be modified?

winsup/cygwin/fhandler_console.cc:
 dev_state->con_mbtowc = __mbtowc;
 dev_state->con_wctomb = __wctomb;

But I think the patch solves only the case of UTF-8 in the thread
starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html.

To support encodings other than UTF-8, it is necessary to keep separate
copies of the following variables for the library and for the system.

- __mb_cur_max
- lc_ctype_charset
- __mbtowc
- __wctomb

And these variables are set from LC_ALL/LC_CTYPE/LANG if
setlocale(LC_CTYPE, "C") is called.
-- 
IWAMURO Motnori http://vmi.jp/




[1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread IWAMURO Motonori
Hi.

I propose that the filename encoding in C locale uses UTF-8 instead of SO/UTF-8.

There are three reasons:

1. for the interoperability between Cygwin and various UNIX-like
systems (Linux, *BSD, Solaris, and so on).
   UNIX-like systems treat the filename as 8bit byte array, and many
applications on the systems send or receive filename information
without locale. (mercurial, git, rsync, and so on).

2. UTF-8 is the only encoding that can handle multiple languages.

3. Today, the default encoding of modern UNIX-like systems is UTF-8.

Please examine it.

Thanks.
-- 
IWAMURO Motnori http://vmi.jp/




[1.7][python] File operation API to multibyte filenames fails.

2009-05-08 Thread IWAMURO Motonori
Hi.

The file operation API fails on multibyte filenames with Python on Cygwin-1.7.
Which should be fixed, Python or Cygwin-1.7?

My environment: Windows XP SP3, Cygwin-1.7.0-46, and LANG=ja_JP.UTF-8

The following code fails on the directory which has multibyte filenames:

>>> import os
>>> os.listdir(".")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 138] Invalid or incomplete multibyte or wide character: '.'

The following code works correctly:

>>> import os
>>> import locale
>>> locale.setlocale(locale.LC_CTYPE, '')
'ja_JP.UTF-8'
>>> os.listdir(".")
[(snip), '\xe3\x82\xb9\xe3\x82\xbf\xe3\x83\xbc\xe3\x83\x88
\xe3\x83\xa1\xe3\x83\x8b\xe3\x83\xa5\xe3\x83\xbc',
'\xe3\x83\x87\xe3\x82\xb9\xe3\x82\xaf\xe3\x83\x88\xe3\x83\x83\xe3\x83\x97']

However, it is impossible to fix all the python scripts.

There are two causes.

- Python intentionally avoids calling setlocale(LC_ALL, "") and/or
setlocale(LC_CTYPE, "").
- When the locale is not set appropriately, Cygwin-1.7 converts each
non-ASCII character into a special sequence. (See the "Convert chars
invalid in the current codepage to a sequence ASCII SO" part of
sys_cp_wcstombs in winsup/cygwin/strfuncs.cc.)
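
For illustration, the special sequence is the ASCII SO byte (0x0e) followed by the UTF-8 representation of the character. A minimal sketch of that encoding for a single BMP character follows; encode_so_utf8 is a hypothetical helper written for this note, not the code in strfuncs.cc, and it assumes the caller has already determined that the character cannot be represented in the current charset.

```c
#include <stddef.h>

/* Encode one BMP code point (below 0x10000) as ASCII SO (0x0e) followed by
   its UTF-8 bytes. out must hold at least 4 bytes.
   Returns the number of bytes written. */
size_t encode_so_utf8(unsigned int cp, unsigned char *out)
{
    size_t n = 0;

    out[n++] = 0x0e;                                   /* ASCII SO marker */
    if (cp < 0x80) {
        out[n++] = (unsigned char)cp;
    } else if (cp < 0x800) {
        out[n++] = (unsigned char)(0xC0 | (cp >> 6));
        out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
    } else {
        out[n++] = (unsigned char)(0xE0 | (cp >> 12));
        out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
    }
    return n;
}
```

For U+684C this produces 0x0e 0xe6 0xa1 0x8c.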

Which should be fixed, Python or Cygwin-1.7?
-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7][python] File operation API to multibyte filenames fails.

2009-05-08 Thread IWAMURO Motonori
Hi.

2009/5/8 Corinna Vinschen corinna-cyg...@cygwin.com:
 Your scripts.  Python correctly doesn't use setlocale because it's
 the responsibility of the application to set the locale if it uses
 non-ASCII chars.  And Cygwin simply has no chance to convert UTF-8
 to UTF-16 if the application doesn't ask for UTF-8.

Oh, that is very difficult,
because ALL Python utilities that access files or directories fail.
For example, Mercurial doesn't work.

 hg stat
abort: Invalid or incomplete multibyte or wide character: /home/iwa
-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7][python] File operation API to multibyte filenames fails.

2009-05-08 Thread IWAMURO Motonori
2009/5/9 Corinna Vinschen corinna-cyg...@cygwin.com:
 can't see a fault in Cygwin. Neither from strace, nor in a GDB session.
 The readdir calls return the filenames using the SO sequences so that
 a valid byte-stream is created which also works in the C locale.
 However, for some reason there's a EILSEQ (138) errno generated, but
 from what I can tell it's not generated in Cygwin or newlib code.

I think that I found Cygwin-1.7's bug.

 int bytes = f_wctomb (_REENT, buf, pw, charset, &ps);

f_wctomb is __ascii_wctomb when not using setlocale(LC_CTYPE).
If return value of __ascii_wctomb == -1, errno == EILSEQ.

I think that it is necessary to reset errno after wctomb.

--- a/winsup/cygwin/strfuncs.cc Thu May 07 12:29:17 2009 +0900
+++ b/winsup/cygwin/strfuncs.cc Sat May 09 04:01:33 2009 +0900
@@ -432,6 +432,7 @@
  ASCII SO; UTF-8 representation of invalid char. */
   if (bytes == -1 && *charset != 'U'/*TF-8*/)
 {
+ errno = 0;
  buf[0] = 0x0e; /* ASCII SO */
  bytes = __utf8_wctomb (_REENT, buf + 1, pw, charset, &ps);
  if (bytes == -1)

[test code]
#include <stdio.h>
#include <dirent.h>
#include <errno.h>

int main(void) {
  DIR *dir;
  struct dirent *ent;
  dir = opendir(".");
  while ((ent = readdir(dir)) != NULL)
    printf("%d\n", ent->d_name, errno);
  printf("%d\n", errno);
  closedir(dir);
  return 0;
}

[result 1.7.0-47]
0
0
138
138
138

[result applied above patch]
0
0
0
0
0
-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7][python] File operation API to multibyte filenames fails.

2009-05-08 Thread IWAMURO Motonori
Sorry, test code is bad.

-   printf("%d\n", ent->d_name, errno);
+   printf("%d\n", errno);
-- 
IWAMURO Motnori http://vmi.jp/




Re: [1.7][python] File operation API to multibyte filenames fails.

2009-05-08 Thread IWAMURO Motonori
2009/5/9 Corinna Vinschen corinna-cyg...@cygwin.com:
 Cool.  Thanks for the patch.  This actually solves the problem.
 I applied the patch with just a little tweak.

Thanks.

The following patch might be better.

--- a/winsup/cygwin/strfuncs.cc Thu May 07 12:29:17 2009 +0900
+++ b/winsup/cygwin/strfuncs.cc Sat May 09 04:39:49 2009 +0900
@@ -427,7 +427,9 @@
 path names) is transform_chars in path.cc. */
   if ((pw & 0xff00) == 0xf000)
pw &= 0xff;
+  int eno = errno;
   int bytes = f_wctomb (_REENT, buf, pw, charset, &ps);
+  errno = eno;
   /* Convert chars invalid in the current codepage to a sequence
  ASCII SO; UTF-8 representation of invalid char. */
   if (bytes == -1 && *charset != 'U'/*TF-8*/)
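
The save-and-restore pattern in this patch can be tried in isolation. Here is a minimal sketch using the standard wctomb() in place of Cygwin's internal f_wctomb; wctomb_keep_errno is a hypothetical wrapper name.

```c
#include <errno.h>
#include <stdlib.h>

/* Try to convert one wide character with wctomb() without letting a failed
   attempt leak EILSEQ into the caller's errno. buf must hold at least
   MB_LEN_MAX bytes (see <limits.h>). Returns wctomb()'s result. */
int wctomb_keep_errno(char *buf, wchar_t wc)
{
    int saved = errno;            /* remember the caller's errno */
    int bytes = wctomb(buf, wc);  /* may fail and set errno */

    errno = saved;                /* discard any error from the attempt */
    return bytes;
}
```

The caller still sees the failure through the return value, so a fallback such as the SO sequence can be applied without leaving a stale errno behind.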

 Nevertheless, it looks like python has a problem as well.  Why does it
 check an errno if the functions returned successfully?  That doesn't
 sound right to me.

When the last readdir returns NULL, Python detects an error because
errno still holds the value set during an earlier readdir call.

1) ep = readdir(dirp); // ep->d_name == ".", errno == 0
   Python checks only ep != NULL. -> OK
2) ep = readdir(dirp); // ep->d_name == "..", errno == 0
   Python checks only ep != NULL. -> OK
3) ep = readdir(dirp); // ep->d_name == "\xe3\x82...", errno == 138
   Python checks only ep != NULL. -> OK
4) ep = readdir(dirp); // ep->d_name == "\xe3\x83...", errno == 138
   Python checks only ep != NULL. -> OK
5) ep = readdir(dirp); // ep == NULL, errno == 138
   Python checks ep == NULL and errno != 0. -> Fail!
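
On the caller's side, the usual defence is to clear errno before every readdir() call, so that a NULL result with errno == 0 can only mean end of directory. A minimal sketch of that loop follows; count_dir_entries is a hypothetical helper and does not reproduce Python's actual code.

```c
#include <dirent.h>
#include <errno.h>
#include <stddef.h>

/* Count the entries of a directory. Returns -1 if the directory cannot be
   opened or if readdir() reports an error. errno is cleared before each
   readdir() call so that a stale value is never mistaken for a failure. */
long count_dir_entries(const char *path)
{
    DIR *dir = opendir(path);
    long count = 0;
    int err;

    if (dir == NULL)
        return -1;
    for (;;) {
        errno = 0;
        if (readdir(dir) == NULL)
            break;                /* end of directory or a real error */
        count++;
    }
    err = errno;                  /* value left by the last readdir() call */
    closedir(dir);
    errno = err;
    return err != 0 ? -1 : count;
}
```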
-- 
IWAMURO Motnori http://vmi.jp/




[1.7] cygstart with non-ASCII arguments and UTF-8 locale don't work.

2009-04-27 Thread IWAMURO Motonori
Hi.

cygstart with non-ASCII arguments and a UTF-8 locale doesn't work on cygwin-1.7.0.

 ls -l
total 1
-rw-rw-r-- 1 iwa None 7 Apr 28 00:22 αβγ.txt
 cygstart αβγ.txt
Unable to start 'C:\cygwin-1.7\tmp\αβγ.txt': The specified file was not found.
-- 
IWAMURO Motnori http://vmi.jp/


cygstart.patch
Description: Binary data