Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-18 Thread Christopher Faylor

On Mon, May 18, 2009 at 01:41:28PM +0800, Lenik wrote:
> The expr error is fixed, and I can build cygpath from source now.
> Though I don't have NTDDK in hand, I'm surprised how it could be
> compiled.

The cygwin build is fairly self-contained.  We certainly don't need
anything like a DDK to build.

> I can get the correct result from the new cygpath now, without -C
> option.
>
> Thank you guys.

I think the main person you should be thanking isn't a guy.

cgf

--
Unsubscribe info:  http://cygwin.com/ml/#unsubscribe-simple
Problem reports:   http://cygwin.com/problems.html
Documentation: http://cygwin.com/docs.html
FAQ:   http://cygwin.com/faq/

Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-18 Thread Lenik


On 2009-5-18 14:09, Christopher Faylor wrote:


> I think the main person you should be thanking isn't a guy.


Ok. Thank you gods.

Lenik



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-18 Thread Dave Korn

Lenik wrote:
> On 2009-5-18 14:09, Christopher Faylor wrote:
>
>> I think the main person you should be thanking isn't a guy.
>
> Ok. Thank you gods.

  Hey Corinna?  Congrats!  You just got a promotion!

cheers,
  DaveK



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-18 Thread Mark J. Reed

On Mon, May 18, 2009 at 9:17 AM, Dave Korn  wrote:
> Lenik wrote:
>> On 2009-5-18 14:09, Christopher Faylor wrote:
>>
>>> I think the main person you should be thanking isn't a guy.
>>
>> Ok. Thank you gods.
>
>   Hey Corinna?  Congrats!  You just got a promotion!

All praise to the great Corinna!

-- 
Mark J. Reed markjr...@gmail.com


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-17 Thread Lenik


On 2009-5-17 10:09, IWAMURO Motonori wrote:

> 2009/5/17 Lenik <le...@bodz.net>:
>
>> Thanks, but where can I get this patch?
>
> You can checkout it from CVS HEAD.

Thanks for the information. I wasn't expecting to build from source,
though; that really frustrates me and brings me even more problems.

Is there a mirror site for nightly builds, so I can use rsync to get
it (in case this patch is too minor to bump any of the version numbers)?
I've just looked at the snapshots, but the last update was 2009-05-13.

I can't build from source. Here are some of the errors; I guess there
will be more, so I hope someone will compile cygpath as soon as
possible. 6 weeks until the next release may be too long to wait.


1, cvs update failed:
... (ignored)
cvs update: Updating src/winsup/testsuite/winsup.api/samples
cvs update: Updating src/winsup/utils
cvs update: Updating src/winsup/w32api
cvs update: Updating src/winsup/w32api/include
cvs update: Updating src/winsup/w32api/include/GL
cvs update: Updating src/winsup/w32api/include/ddk
cvs update: Updating src/winsup/w32api/include/directx
cvs update: Updating src/winsup/w32api/lib
cvs update: Updating src/winsup/w32api/lib/ddk
cvs update: Updating src/winsup/w32api/lib/directx
cvs update: closing down connection to cygwin.com: Transport endpoint is not connected


2, configure failed:
bash-3.2$ ./configure
  5 [main] expr 952 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
./configure: line 56:   952 Segmentation fault  (core dumped) expr a : '\(a\)' >/dev/null 2>&1
  4 [main] expr 2808 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 3516 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 3328 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 2648 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 900 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 1840 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 2972 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)


Thanks,
Lenik



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-17 Thread Corinna Vinschen

On May 17 11:09, IWAMURO Motonori wrote:
> 2009/5/17 Lenik <le...@bodz.net>:
> > Thanks, but where can I get this patch?
>
> You can checkout it from CVS HEAD.

It occurred to me that, if you're using a charset which differs from your
current ANSI or OEM codepage, you might run into trouble with native
Windows tools.  Therefore I also added a new -C/--codepage option to
cygpath to specify the codepage used to create a Windows path from a
Cygwin path.  For instance:

  cygpath -C ANSI -aw .

creates the full path of the CWD in the current ANSI codepage.  The
-C/--codepage option takes the following parameters:

- ANSI   to specify the current ANSI codepage (for interaction with GUI tools).

- OEM    to specify the current OEM codepage (for interaction with CLI tools).

- UTF8   just guess... UTF-8.

- n      A decimal codepage number according to the following table:
         http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
         Note that not all installations support all codepages.

I hope that helps.  Please note that the -C option doesn't work yet for
the -p option.  That's something I'll do after my vacation.
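
To make the effect of the codepage choice concrete, here is a minimal Python sketch (an illustration, not cygpath code) that encodes the same path under each -C keyword. The mapping of ANSI to cp1252 and OEM to cp437 is an assumption that only holds for a US-English Windows installation; `CODEPAGES` and `encode_path` are hypothetical names.

```python
# Assumed codepages for a US-English Windows installation; other
# installations use different ANSI and OEM codepages.
CODEPAGES = {
    "ANSI": "cp1252",
    "OEM": "cp437",
    "UTF8": "utf-8",
}


def encode_path(path: str, codepage: str) -> bytes:
    """Encode a path as the chosen codepage would represent it.

    A decimal string such as "850" is mapped to Python's "cp850" codec.
    """
    codec = CODEPAGES.get(codepage.upper())
    if codec is None:
        codec = "cp" + str(int(codepage))
    return path.encode(codec)


path = "C:\\caf\u00e9"  # the last character is U+00E9
for cp in ("ANSI", "OEM", "UTF8", "850"):
    print(f"{cp:5} {encode_path(path, cp).hex(' ')}")
```

The same character ends up as a different byte in the ANSI and OEM codepages, which is why a GUI tool and a CLI tool may need different -C settings.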


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-17 Thread Corinna Vinschen

On May 17 15:52, Lenik wrote:
> On 2009-5-17 10:09, IWAMURO Motonori wrote:
>> 2009/5/17 Lenik <le...@bodz.net>:
>>> Thanks, but where can I get this patch?
>>
>> You can checkout it from CVS HEAD.
> [...]
> 6 weeks to the next release maybe too long to wait.

We have about 2 weeks between the test releases.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-17 Thread Lenik


On 2009-5-17 19:53, Corinna Vinschen wrote:

> On May 17 15:52, Lenik wrote:
>
>> On 2009-5-17 10:09, IWAMURO Motonori wrote:
>>
>>> 2009/5/17 Lenik <le...@bodz.net>:
>>>
>>>> Thanks, but where can I get this patch?
>>>
>>> You can checkout it from CVS HEAD.
>>
>> [...]
>> 6 weeks to the next release maybe too long to wait.
>
> We have about 2 weeks between the test releases.
>
> Corinna

Thank you. I'd be very happy if I could apply your great patch by tomorrow
morning, if not earlier. I'd rather get everything as soon as I read
your reply, and IMHO that should be very easy: all you have to do is
make your working directory public and accessible. Stupid idea, heh? :)


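For now I have worked around it with a simple function:


function _u2w() {
    # Convert a Cygwin path to a Windows path by string manipulation.
    local p
    p=$(cygpath -au "$1")
    if [ "${p:0:5}" = /mnt/ ] || [ "${p:0:10}" = /cygdrive/ ]; then
        p=${p:1}        # drop the leading slash
        p=${p#*/}       # drop the "mnt" or "cygdrive" component
        p=${p/\//:/}    # the slash after the drive letter becomes ":/"
    else
        # /usr/bin and /usr/lib are mounted from /bin and /lib
        if [ "${p:0:9}" = /usr/bin/ ]; then p=${p:4}; fi
        if [ "${p:0:9}" = /usr/lib/ ]; then p=${p:4}; fi
        p=$(cygpath -am /)$p
    fi
    p=${p//\//\\}       # forward slashes to backslashes
    printf '%s\n' "$p"
}

path=$(_u2w "$path")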


Lenik



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-17 Thread Lenik


On 2009-5-17 15:52, Lenik wrote:

> 2, configure failed:
> bash-3.2$ ./configure
> 5 [main] expr 952 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> ./configure: line 56: 952 Segmentation fault (core dumped) expr a : '\(a\)' >/dev/null 2>&1
> 4 [main] expr 2808 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 3516 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 3328 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 2648 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 900 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 1840 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 2972 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)


The expr error is fixed, and I can build cygpath from source now. Though
I don't have NTDDK in hand, I'm surprised how it could be compiled.


I can get the correct result from the new cygpath now, without -C option.

Thank you guys.
Lenik




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-16 Thread Corinna Vinschen

On May 16 13:17, Lenik wrote:
> (This mail is encoded in utf-8)
>
> After testing with 1.7.0-48, many problems have been eliminated.
>
> But cygpath doesn't return good pathnames, see:

Looks like cygpath gets the wcstombs function from ntdll rather than
from cygwin1.dll due to a linking order problem.  Unfortunately ntdll
exports a couple of convenient C functions like wcstombs, or even
sprintf.  I applied a patch so the next version of cygpath should
do the conversion more correctly.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-16 Thread Lenik


On 2009-5-16 23:49, Corinna Vinschen wrote:

> Looks like cygpath gets the wcstombs function from ntdll rather than
> from cygwin1.dll due to a linking order problem.  Unfortunately ntdll
> exports a couple of convenient C functions like wcstombs, or even
> sprintf.  I applied a patch so the next version of cygpath should
> do the conversion more correctly.
>
> Corinna


Thanks, but where can I get this patch?

Lenik



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-16 Thread IWAMURO Motonori

2009/5/17 Lenik <le...@bodz.net>:
> Thanks, but where can I get this patch?

You can checkout it from CVS HEAD.
-- 
IWAMURO Motnori http://vmi.jp/


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-15 Thread IWAMURO Motonori

2009/5/15 Corinna Vinschen <corinna-cyg...@cygwin.com>:
> I just have trouble with SJIS, but that's not something I can easily
> test. Maybe you can look into that in the next couple of days?

Maybe I can. Please explain details of the trouble.
-- 
IWAMURO Motnori http://vmi.jp/


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-15 Thread Corinna Vinschen

On May 15 20:34, IWAMURO Motonori wrote:
> 2009/5/15 Corinna Vinschen <corinna-cyg...@cygwin.com>:
> > I just have trouble with SJIS, but that's not something I can easily
> > test. Maybe you can look into that in the next couple of days?
>
> Maybe I can. Please explain details of the trouble.

Probably I'm only tripping over my own feet.  I was surprised to see the
filenames using Chinese characters (from Lenik's examples) using SO/UTF
sequences.  I didn't expect that, but maybe that was correct.  The whole
problem already starts with me not being able to see non-western chars
in the console window.  The two available console fonts simply don't
provide them, so I only see squares, even if the characters are printed
correctly.

It would be cool if you could simply use SJIS for testing and see if
everything looks basically ok.

For the record:  The internationalization stuff is a heck of a lot
of effort.  Even if my replies might sound mean sometimes, I'm glad
for your input and help coding.


Thanks,
Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-15 Thread Lenik


(This mail is encoded in utf-8)

After testing with 1.7.0-48, many problems have been eliminated.

But cygpath doesn't return good pathnames, see:

1, Get absolute path of current directory:

C:\Profiles\Shecti\桌面> set LANG=zh_CN.GBK& cygpath -am .
C:/Profiles/Shecti/桌面 (good)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.GBK& cygpath -au .
/mnt/c/Profiles/Shecti/桌面/ (good)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.UTF-8& cygpath -am .
C:/Profiles/Shecti/ (bad)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.UTF-8& cygpath -au .
/mnt/c/Profiles/Shecti/桌面/ (good)

C:\Profiles\Shecti\桌面> set LANG=C& cygpath -am .
C:/Profiles/Shecti/ (bad)

C:\Profiles\Shecti\桌面> set LANG=C& cygpath -au .
/mnt/c/Profiles/Shecti/桌面/ (good)

Conclusion:
1.1 only GBK works for `cygpath -am .' (also -aw)
1.2 all work for `cygpath -au .'

2, Get absolute path of specified path

C:\Profiles\Shecti\桌面> set LANG=zh_CN.GBK& cygpath -am C:\Profiles\Shecti\桌面
C:/Profiles/Shecti/妗岄潰 (bad)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.GBK& cygpath -au C:\Profiles\Shecti\桌面
/mnt/c/Profiles/Shecti/妗岄潰 (bad)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.UTF-8& cygpath -am C:\Profiles\Shecti\桌面
C:/Profiles/Shecti/ (bad)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.UTF-8& cygpath -au C:\Profiles\Shecti\桌面
/mnt/c/Profiles/Shecti/桌面 (good)

C:\Profiles\Shecti\桌面> set LANG=C& cygpath -am C:\Profiles\Shecti\桌面
C:/Profiles/Shecti/ (bad)

C:\Profiles\Shecti\桌面> set LANG=C& cygpath -au C:\Profiles\Shecti\桌面
/mnt/c/Profiles/Shecti/桌面 (good)

Conclusion:
2.1 none works for `cygpath -am PathContainsNonascii'
2.2 GBK doesn't work for `cygpath -au PathContainsNonascii'

Now the problem is, I must use GBK for 1.1, and I cannot use GBK for 
2.2. and no more choice.   -_-||...
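
The garbled name 妗岄潰 in the GBK runs of part 2 is what the UTF-8 bytes of 桌面 look like when they are read as GBK. The following Python sketch reproduces that effect; it is an illustration of the symptom and makes no claim about where inside cygpath the mix-up happens.

```python
name = "\u684c\u9762"  # the directory name from the report (U+684C U+9762)

utf8_bytes = name.encode("utf-8")

# Reading UTF-8 bytes as if they were GBK yields three unrelated characters.
garbled = utf8_bytes.decode("gbk")
print(garbled)

# The reverse mistake fails outright here, because these GBK bytes
# do not form valid UTF-8.
try:
    name.encode("gbk").decode("utf-8")
    reverse_ok = True
except UnicodeDecodeError as exc:
    reverse_ok = False
    print("not valid UTF-8:", exc.reason)
```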


Lenik




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 13 23:49, Matthias Andree wrote:
> On 13.05.2009 at 17:17, Corinna Vinschen
> <corinna-cyg...@cygwin.com> wrote:
>
>> I followed the suggestion to use UTF-8 for internal conversions when the
>> locale is set to C.  This will also be used as default conversion when
>> converting the Windows environment from UTF-16 to multibyte, unless the
>> environment contains a valid LC_ALL/LC_CTYPE/LANG setting.  The current
>> working directory was also potentially unusable, if an application
>> switched the locale.  Now the CWD is re-evaluated after a setlocale call.
>
> Is Unicode normalization an issue here?

Not really, I guess.  Either a character is available or it isn't.
We don't have composition or decomposition capabilities for most
multibyte character sets anyway.  If at all, we could only do that
for SJIS, EUCJP, and GBK.  None of them would profit much from that.
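
For readers unfamiliar with the normalization question, a small Python sketch shows what "either a character is available or it isn't" means in practice. Latin-1 is used only as a convenient example charset; it is not one of the charsets discussed in this thread.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "e\u0301")   # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # "e" followed by U+0301

# The precomposed character exists in Latin-1 and converts to one byte.
print(composed.encode("latin-1"))

# The decomposed form does not convert: the combining accent U+0301
# has no Latin-1 representation, and nothing recomposes it on the fly.
try:
    decomposed.encode("latin-1")
    decomposed_ok = True
except UnicodeEncodeError as exc:
    decomposed_ok = False
    print("cannot encode:", exc.reason)
```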


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread IWAMURO Motonori

2009/5/14 Corinna Vinschen <corinna-cyg...@cygwin.com>:
> > Should the following part not be modified?
> >
> > winsup/cygwin/fhandler_console.cc:
> >  dev_state->con_mbtowc = __mbtowc;
> >  dev_state->con_wctomb = __wctomb;
>
> I'd rather not.  It only affects the console and if LANG=C I'd rather
> see the single bytes which make up the path instead of the corresponding
> UTF-8 character.
>
> Hm, maybe I misunderstood.  In which manner should this be modified?

I think:

dev_state->con_mbtowc = __mbtowc == __ascii_mbtowc ? __utf8_mbtowc : __mbtowc;
dev_state->con_wctomb = __wctomb == __ascii_wctomb ? __utf8_wctomb : __wctomb;
-- 
IWAMURO Motnori http://vmi.jp/


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 14 21:39, IWAMURO Motonori wrote:
> 2009/5/14 Corinna Vinschen <corinna-cyg...@cygwin.com>:
> > > Should the following part not be modified?
> > >
> > > winsup/cygwin/fhandler_console.cc:
> > >  dev_state->con_mbtowc = __mbtowc;
> > >  dev_state->con_wctomb = __wctomb;
> >
> > I'd rather not.  It only affects the console and if LANG=C I'd rather
> > see the single bytes which make up the path instead of the corresponding
> > UTF-8 character.
> >
> > Hm, maybe I misunderstood.  In which manner should this be modified?
>
> I think:
>
> dev_state->con_mbtowc = __mbtowc == __ascii_mbtowc ? __utf8_mbtowc : __mbtowc;
> dev_state->con_wctomb = __wctomb == __ascii_wctomb ? __utf8_wctomb : __wctomb;

Oh, ok.  So I understood right.  But that's exactly what I didn't want
to do.  The idea is that, even though UTF-8 is used for the filename
conversion, the console should default to standard ASCII behaviour,
unless you specify another charset before starting the first Cygwin
process in the console.

I'm also wondering if we should perhaps only allow either ASCII or
UTF-8 as console charsets, but for now I don't want to touch this
more than necessary.  I just found that the console I/O doesn't work
well for non-ASCII chars anyway.  The core function which echoes input
to the terminal only handles single-byte chars, which can be easily
reproduced using copy/paste.  Oh well.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread IWAMURO Motonori

2009/5/14 Corinna Vinschen <corinna-cyg...@cygwin.com>:
> I see a couple of potential problems.

What problems are those?

> And have some time to discuss whether these are something the
> user can or even should fix or workaround alone.

I think that an application that takes its locale from the environment
variables and an application that uses no locale should be able to read
and write the same byte sequence.

However, I don't strongly request it because the applications work
correctly in UTF-8.
-- 
IWAMURO Motnori http://vmi.jp/


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 14 23:06, IWAMURO Motonori wrote:
> 2009/5/14 Corinna Vinschen <corinna-cyg...@cygwin.com>:
> > I see a couple of potential problems.
>
> What problems are those?

I have no example off-hand.  When I thought about it I always got sick
thinking about scenarios where the library is using, say, UTF-8, and the
application is using SJIS, and what happens to the filenames in this
case.  In theory the lib should provide what the application thinks is
right.

> > And have some time to discuss whether these are something the
> > user can or even should fix or workaround alone.
>
> I think that an application that takes its locale from the environment
> variables and an application that uses no locale should be able to read
> and write the same byte sequence.

Ok, you've got a point there, assuming we leave out any application
which deliberately uses a non-C locale that differs from the setting
in the environment.  With those we're getting into trouble.

If Cygwin always uses the environment setting internally, we have a
valid, identical byte stream for all applications using
setlocale(LC_ALL, ""), *and* for non locale-aware applications.

But if the application uses a deliberately different setting via
setlocale, ..., hmm.  It should get what it asks for.

Maybe you're right.  I have to test this a bit.


Thanks for your input,
Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 14 16:42, Corinna Vinschen wrote:
> On May 14 23:06, IWAMURO Motonori wrote:
> > 2009/5/14 Corinna Vinschen <corinna-cyg...@cygwin.com>:
> > > I see a couple of potential problems.
> >
> > What problems are those?
>
> I have no example off-hand.  When I thought about it I always got sick
> thinking about scenarios where the library is using, say, UTF-8, and the
> application is using SJIS, and what happens to the filenames in this
> case.  In theory the lib should provide what the application thinks is
> right.

Here's one problem.  What if an application uses setenv("LANG", ...)?
Do you want Cygwin to intercept all calls to setenv() to check for
setting $LC_ALL/LC_CTYPE/LANG?  Right now, the only time these variables
are read by Cygwin is at the start of the first Cygwin process in a
Cygwin process tree.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread IWAMURO Motonori

2009/5/15 Corinna Vinschen <corinna-cyg...@cygwin.com>:
> Here's one problem.  What if an application uses setenv("LANG", ...)?

Oh. Hmmm, I think that nothing should happen.

> Do you want Cygwin to intercept all calls to setenv() to check for
> setting $LC_ALL/LC_CTYPE/LANG?

No. I think that only setlocale() has to do the check.
The reason:
- setlocale(LC_CTYPE, "C") is called from Cygwin startup.
- The following code becomes valid.
setenv("LANG", ...);
setlocale(LC_ALL, "");
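
The same order of operations can be tried out in Python, whose `os.environ` and `locale.setlocale` wrap the C `setenv` and `setlocale` calls. This is only a sketch of the generic C library behaviour, not of Cygwin's internals, and the locale name `en_US.UTF-8` is an example value that may not be installed on a given system.

```python
import locale
import os

# Start from the plain "C" locale, as a C program does before any
# setlocale call.
locale.setlocale(locale.LC_ALL, "C")

# Changing the environment alone does not change the active locale ...
os.environ["LANG"] = "en_US.UTF-8"           # example value only
still_c = locale.setlocale(locale.LC_CTYPE)  # query, no change

# ... the new value is only picked up by setlocale with an empty string.
try:
    active = locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    active = still_c  # the requested locale is not available here

print("after setenv only:          ", still_c)
print("after setlocale(LC_ALL, ''):", active)
```

If LC_ALL or LC_CTYPE is already set in the environment, it takes precedence over LANG, so the second line may show a different locale than the example value.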
-- 
IWAMURO Motnori http://vmi.jp/


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 15 01:34, IWAMURO Motonori wrote:
> 2009/5/15 Corinna Vinschen <corinna-cyg...@cygwin.com>:
> > Here's one problem.  What if an application uses setenv("LANG", ...)?
>
> Oh. Hmmm, I think that nothing should happen.
>
> > Do you want Cygwin to intercept all calls to setenv() to check for
> > setting $LC_ALL/LC_CTYPE/LANG?
>
> No. I think that only setlocale() has to do the check.
> The reason:
> - setlocale(LC_CTYPE, "C") is called from Cygwin startup.
> - The following code becomes valid.
> setenv("LANG", ...);
> setlocale(LC_ALL, "");

Ok, that makes sense.  I'm just testing a patch which uses the
environment settings internally as you proposed (still keeping UTF-8 the
default for the C locale).  So far it works fine, I just have trouble
with SJIS, but that's not something I can easily test.  I'm simply
lacking a real understanding of the SJIS encoding.  Maybe you can look
into that in the next couple of days?  I'll be mostly offline next week
and the week after so there's a lot of time for testing ;-)


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 14 19:23, Corinna Vinschen wrote:
> On May 15 01:34, IWAMURO Motonori wrote:
> > 2009/5/15 Corinna Vinschen <corinna-cyg...@cygwin.com>:
> > > Here's one problem.  What if an application uses setenv("LANG", ...)?
> >
> > Oh. Hmmm, I think that nothing should happen.
> >
> > > Do you want Cygwin to intercept all calls to setenv() to check for
> > > setting $LC_ALL/LC_CTYPE/LANG?
> >
> > No. I think that only setlocale() has to do the check.
> > The reason:
> > - setlocale(LC_CTYPE, "C") is called from Cygwin startup.
> > - The following code becomes valid.
> > setenv("LANG", ...);
> > setlocale(LC_ALL, "");
>
> Ok, that makes sense.  I'm just testing a patch which uses the
> environment settings internally as you proposed (still keeping UTF-8 the
> default for the C locale).  So far it works fine, I just have trouble
> with SJIS, but that's not something I can easily test.  I'm simply
> lacking a real understanding of the SJIS encoding.  Maybe you can look
> into that in the next couple of days?  I'll be mostly offline next week
> and the week after so there's a lot of time for testing ;-)

I applied the patch.  The charset used by Cygwin now only depends on the
last call to setlocale in an application and possibly on the setting
of the internationalization environment variables LC_ALL/LC_CTYPE/LANG.

A side effect of this change is that the console charset depends solely
on the environment setting again.  So you can change the console charset
used in an application on the fly again by changing the LC_ALL/LC_CTYPE/LANG
setting in the environment(*), instead of setting it only once at startup
of the first Cygwin process in the console.

The (in)famous "ssh to a remote machine from a UTF-8 console doesn't
work" problem(**) should be unaffected by this change, so ssh should
still work if, for instance, LANG is set to en_US.UTF-8 in the calling
shell.

Just the documentation needs to be updated again.

I really hope this change makes it finally easier to use Cygwin with
native characters.  I'll create a new Cygwin package tomorrow.


Corinna


(*) Still depends on a call to setlocale, but the Cygwin shells do that
anyway.
(**) http://cygwin.com/ml/cygwin/2009-04/msg00055.html

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Lenik


http://cygwin.com/ml/cygwin/2009-05/msg00245.html?


I found this web page doesn't display utf-8 characters correctly.

BTW, I'm using thunderbird as news reader.

Lenik



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 12 19:37, Corinna Vinschen wrote:
> On May 13 02:29, IWAMURO Motonori wrote:
> > I propose that the filename encoding in C locale uses UTF-8 instead of
> > SO/UTF-8.
> >
> > There are three reasons:
>
> That's an interesting thought.  Do you have a patch and, if so, did you
> try it?  Does it, for instance, help for the issue reported in the
> thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?

After examining the issue Lenik reported in the above thread, I'm at 
a loss how to solve this problem in a generic way.

The problem is that the filename changes dependent on the character
set used in $LANG.  The reason is that every time a multibyte filename
has to be generated, it has to be converted from UTF-16 to multibyte.

For instance, take one of the filenames from Lenik's example.  It's
stored on the filesystem as the UTF-16 sequence \u684c \u9762.  If I set
LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence

 0xe6 0xa1 0x8c 0xe9 0x9d 0xa2

If I set LANG to en_US.GBK, `ls' returns the filename

 0xd7 0xc0 0xc3 0xe6

And in case LANG=C, `ls' returns

 0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2

So, dependent on the character set setting in the application, the idea
of the filename differs.  That's not exactly helpful for interoperability
between different applications.
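
The three byte sequences above can be reproduced with a short Python sketch. The `so_utf8` helper is a hypothetical model of the SO/UTF-8 form, inferred only from the LANG=C example in this message (a 0x0e byte in front of each non-ASCII character's UTF-8 bytes); it is not taken from the Cygwin sources.

```python
NAME = "\u684c\u9762"  # the filename from the example

SO = 0x0E  # ASCII "shift out", used as the escape marker


def so_utf8(text: str) -> bytes:
    """Model of SO/UTF-8: ASCII passes through unchanged, every other
    character becomes SO followed by its UTF-8 bytes."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:  # multi-byte means non-ASCII
            out.append(SO)
        out += encoded
    return bytes(out)


def hexdump(data: bytes) -> str:
    return " ".join(f"0x{b:02x}" for b in data)


for label, data in [
    ("UTF-8", NAME.encode("utf-8")),
    ("GBK", NAME.encode("gbk")),
    ("C (SO/UTF-8)", so_utf8(NAME)),
]:
    print(f"{label:13} {hexdump(data)}")
```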

I can think of two potential solutions to fix this problem:

(1) Always return filenames in UTF-8 encoding and pretend that UTF-8
is the way files are stored on disk.  That results in unchangeable
filenames which are always valid.

But what if an application sets LANG=.SJIS and tries to create
a file using SJIS character encoding?  Should the file be created
using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
That's not good.

(2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
Cygwin uses the LC_CTYPE setting which corresponds to the current
codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,
Cygwin uses that to convert pathnames.  If the application uses
setlocale, Cygwin uses that setting to convert pathnames.

One problem can't be solved this way:  If an application fetches
and stores a filename, then switches the locale, and then tries
to use the filename in another system call, the filename is
potentially broken.
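
Option (2) and its remaining problem can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions: `charset_for` is a hypothetical helper, `cp1252` merely stands in for "the current codepage", and locale names without an explicit charset simply fall back to that default.

```python
from typing import Mapping, Optional


def charset_for(env: Mapping[str, str],
                setlocale_arg: Optional[str] = None,
                codepage_default: str = "cp1252") -> str:
    """Pick the charset for pathname conversion as in option (2).

    setlocale_arg models an explicit setlocale() call by the application.
    """
    if setlocale_arg:
        locale_name = setlocale_arg
    else:
        # POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
        locale_name = next(
            (env[var] for var in ("LC_ALL", "LC_CTYPE", "LANG")
             if env.get(var)),
            "",
        )
    if "." in locale_name:
        return locale_name.split(".", 1)[1].split("@", 1)[0]
    return codepage_default


name = "\u684c\u9762"
env = {"LANG": "zh_CN.GBK"}

# The application fetches and stores the filename under the first charset ...
stored = name.encode(charset_for(env))

# ... then switches the locale and tries to reuse the stored bytes.
switched = charset_for(env, setlocale_arg="en_US.UTF-8")
try:
    stored.decode(switched)
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

print("charset before switch:  ", charset_for(env))
print("charset after switch:   ", switched)
print("stored name still valid:", still_valid)
```

The stored bytes are correct GBK but not valid UTF-8, which is the "potentially broken" filename described above.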

Any better ideas?


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Matthias Andree

On 13.05.2009 at 16:29, Corinna Vinschen
corinna-cyg...@cygwin.com wrote:



On May 12 19:37, Corinna Vinschen wrote:

On May 13 02:29, IWAMURO Motonori wrote:
 I propose that the filename encoding in C locale uses UTF-8 instead  
of SO/UTF-8.


 There are three reasons:

That's an interesting thought.  Do you have a patch and, if so, did you
try it?  Does it, for instance, help for the issue reported in the
thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?


After examining the issue Lenik reported in the above thread, I'm at
a loss how to solve this problem in a generic way.

The problem is that the filename changes dependent on the character
set used in $LANG.  The reason is that every time a multibyte filename
has to be generated, it has to be converted from UTF-16 to multibyte.

For instance, take one of the filenames from Lenik's example.  It's
stored on the filesystem as the UTF-16 sequence \u684c \u9762.  If I set
LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence

 0xe6 0xa1 0x8c 0xe9 0x9d 0xa2

If I set LANG to en_US.GBK, `ls' returns the filename

 0xd7 0xc0 0xc3 0xe6

And in case LANG=C, `ls' returns

 0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2

So, dependent on the character set setting in the application, the idea
of the filename differs.  That's not exactly helpful for interoperability
between different applications.

I can think of two potential solutions to fix this problem:

(1) Always return filenames in UTF-8 encoding and pretend that UTF-8
is the way files are stored on disk.  That results in unchangeable
filenames which are always valid.
   But what if an application sets LANG=.SJIS and tries to create
a file using SJIS character encoding?  Should the file be created
using the SJIS-UTF-16 conversion or should open fail with EILSEQ?
That's not good.


Why would it have to be interpreted at all? Aren't filenames just opaque
strings - with exceptions, say, for / and NUL - to UNIX kernels?




(2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
Cygwin uses the LC_CTYPE setting which corresponds to the current
codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the  
environment,

Cygwin uses that to convert pathnames.  If the application uses
setlocale, Cygwin uses that setting to convert pathnames.

One problem can't be solved this way:  If an application fetches
and stores a filename, then switches the locale, and then tries
to use the filename in another system call, the filename is
potentially broken.

Any better ideas?


Just questions to kindle some brainstorming:

- why do you need to touch the filename at all? I haven't read all of it.  
Is the UTF-16 on disk and we need to work around UTF-16 being intractable  
as C string?


- some applications in the GNOME ballpark, for instance Gnumeric, treat
filenames as Unicode and fall back to an encoding specified by
SOME_ENVIRONMENT_VARIABLE (perhaps as a colon-separated list - not sure)


- adding to my interspersed comment above: isn't the issue more about  
*presentation* of filenames to the user than internal workings? To me the  
main issue appears to be that filenames should look alike in a Cygwin  
application and in a native Windows application. I'd assume that  
applications can get really confused if you change file names behind their  
back.


- if you speak of UTF-8, do you want to normalize file names? (I'd think
you do.) Which normalization form will you choose? NFC (canonical
composition) or NFD (canonical decomposition)?
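
To see why the choice of form matters for filenames: the same visible
name yields different UTF-8 bytes under NFC (composed) and NFD
(decomposed).  A small Python sketch using the standard unicodedata
module; the sample name is made up:

```python
import unicodedata

# "cafe" with a precomposed e-acute (U+00E9) as the last character.
name = "caf\u00e9"

nfc = unicodedata.normalize("NFC", name)  # keeps the single code point
nfd = unicodedata.normalize("NFD", name)  # splits into "e" + U+0301

print(nfc.encode("utf-8"))  # b'caf\xc3\xa9'
print(nfd.encode("utf-8"))  # b'cafe\xcc\x81'
print(nfc == nfd)           # False
```

A byte-wise comparison therefore treats the two forms as different
filenames unless one form is chosen and applied consistently.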


--
Matthias Andree


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Andy Koppe

 - why do you need to touch the filename at all? I haven't read all of it. Is
 the UTF-16 on disk and we need to work around UTF-16 being intractable as C
 string?

Yes. If you simply treated each UTF-16 symbol as two chars, you'd get
unintended NULs and slashes. For starters, the upper halves of all
ISO-8859-1 characters are NUL in UTF-16. And even without that, the
resulting filenames would be completely unusable.
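
A quick Python check of this point, using little-endian UTF-16 as
Windows does.  The sample characters are arbitrary picks for
illustration:

```python
def utf16_bytes(ch: str) -> bytes:
    """Raw little-endian UTF-16 bytes of a string, without a BOM."""
    return ch.encode("utf-16-le")


print(utf16_bytes("A"))       # b'A\x00'    -> embedded NUL
print(utf16_bytes("\u00e9"))  # b'\xe9\x00' -> embedded NUL
print(utf16_bytes("\u012f"))  # b'/\x01'    -> embedded slash (0x2f)
```

Any C string function would stop at the first NUL, and any path parser
would split at the stray slash.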

Andy


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 13 15:54, Andy Koppe wrote:
  - why do you need to touch the filename at all? I haven't read all of it. Is
  the UTF-16 on disk and we need to work around UTF-16 being intractable as C
  string?
 
 Yes. If you simply treated each UTF-16 symbol as two chars, you'd get
 unintended NULs and slashes. For starters, the upper halves of all
 ISO-8859-1 characters are NUL in UTF-16. And even without that, the
 resulting filenames would be completely unusable.

Right.  That's the crux when using UTF-16 filenames but many different
multibyte codepages.  In contrast to a system in which the filename is
just a byte stream, we have to perform widechar to multibyte conversion
and outside of the UTF-8 domain, every other conversion is lossy.
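
The loss is easy to demonstrate with a round trip.  A Python sketch,
assuming ISO-8859-1 as the single-byte target and "?" as the
replacement for unmappable characters (a real conversion may substitute
something else):

```python
NAME = "\u684c\u9762"


def round_trip(name: str, charset: str) -> str:
    """Encode to a charset (unmappable characters become '?') and decode back."""
    return name.encode(charset, errors="replace").decode(charset)


print(round_trip(NAME, "utf-8") == NAME)       # True: nothing is lost
print(round_trip(NAME, "iso-8859-1") == NAME)  # False: the name is gone
print(ascii(round_trip(NAME, "iso-8859-1")))   # '??'
```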

For the time being, I applied a patch to Cygwin which should ease the
pain.

I followed the suggestion to use UTF-8 for internal conversions when the
locale is set to C.  This will also be used as default conversion when
converting the Windows environment from UTF-16 to multibyte, unless the
environment contains a valid LC_ALL/LC_CTYPE/LANG setting.  The current
working directory was also potentially unusable, if an application
switched the locale.  Now the CWD is re-evaluated after a setlocale call.
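
The selection of the default conversion can be sketched as follows.
This is an illustration in Python, not the code of the patch.  The
LC_ALL, LC_CTYPE, LANG precedence is the usual POSIX one; treating a
locale name without a charset part as UTF-8 is an assumption made only
for this sketch, and both function names are hypothetical:

```python
from typing import Mapping, Optional


def effective_ctype(env: Mapping[str, str]) -> Optional[str]:
    """POSIX precedence for LC_CTYPE: LC_ALL, then LC_CTYPE, then LANG.
    Empty values count as unset."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return None


def default_conversion(env: Mapping[str, str]) -> str:
    """Charset used for UTF-16 <-> multibyte conversion in this sketch."""
    name = effective_ctype(env)
    if name is None or name in ("C", "POSIX"):
        return "UTF-8"
    _, dot, charset = name.partition(".")
    charset = charset.partition("@")[0]  # drop a modifier such as @euro
    # Assumption: no charset given means fall back to UTF-8.
    return charset if dot and charset else "UTF-8"


print(default_conversion({}))                      # UTF-8
print(default_conversion({"LANG": "C"}))           # UTF-8
print(default_conversion({"LANG": "ja_JP.SJIS"}))  # SJIS
```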

I'm sure this change doesn't fix all problems, but this worked much better
in my environment when using Japanese and Chinese characters in filenames.

There are a few other changes to the Cygwin DLL in the pipeline, but I will
update Cygwin 1.7 at the end of the week.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


RE: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Jason Pyeron

Corinna Vinschen wrote on Wednesday, May 13, 2009 10:30:
 On May 12 19:37, Corinna Vinschen wrote:
 On May 13 02:29, IWAMURO Motonori wrote:
 I propose that the filename encoding in C locale uses UTF-8 instead
 of SO/UTF-8. 
 
 There are three reasons:
 
 That's an interesting thought.  Do you have a patch and, if so, did
 you try it?  Does it, for instance, help for the issue reported in
 the thread starting at
 http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
 
 After examining the issue Lenik reported in the above thread,
 I'm at a loss how to solve this problem in a generic way.
 

I may be dense, as all of my internationalization experience was from the late
90's. But in my experience the only solution for this is a cognizant effort on
behalf of the user (or admin).

 The problem is that the filename changes dependent on the
 character set used in $LANG.  The reason is that every time a
 multibyte filename has to be generated, it has to be
 converted from UTF-16 to multibyte.
 
 For instance, take one of the filenames from Lenik's
 example.  It's stored on the filesystem as the UTF-16
 sequence \u684c \u9762.  If I set LANG to en_US.UTF-8, a
 readdir(2) call returns the multibyte sequence
 
  0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
 
 If I set LANG to en_US.GBK, `ls' returns the filename
 
  0xd7 0xc0 0xc3 0xe6
 
 And in case LANG=C, `ls' returns
 
  0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
 
 So, dependent on the character set setting in the
 application, the idea of the filename differs.  That's not
 exactly helpful for interoperability between different applications.
 
 I can think of two potential solutions to fix this problem:
 
 (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
 is the way files are stored on disk.  That results in unchangeable
 filenames which are always valid.
 
 But what if an application sets LANG=.SJIS and
 tries to create
 a file using SJIS character encoding?  Should the file be created
 using the SJIS-UTF-16 conversion or should open fail with
 EILSEQ? That's not good. 
 
 (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
 Cygwin uses the LC_CTYPE setting which corresponds to the current
 codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in
 the environment,

If nothing is set use UTF-8 as it will work in existing code.

 Cygwin uses that to convert pathnames.  If the application uses
 setlocale, Cygwin uses that setting to convert pathnames.
 
 One problem can't be solved this way:  If an application fetches
 and stores a filename, then switches the locale, and then tries
 to use the filename in another system call, the filename is
 potentially broken. 

This is the user's problem to resolve.

 
 Any better ideas?
 

Not necessarily better, but here is a chart:

Sys:    App:    function expects/returns
NULL:   NULL:   UTF-8
C/UA:   NULL:   UTF-8
NULL:   C/UA:   UTF-8
C/UA:   C/UA:   UTF-8
SPEC:   NULL:   System Locale
SPEC:   C/UA:   UTF-8
NULL:   SPEC:   Application Locale
C/UA:   SPEC:   Application Locale
SPEC:   SPEC:   Application Locale


Key:

Sys= System's current locale
App= Application's current locale
NULL= No setting
C/UA= C or any Unicode aware locale
SPEC= Some other locale (i.e. SJIS)
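
The chart is small enough to write down as a lookup table.  A Python
sketch that takes the three categories from the key as given;
filename_encoding() is a hypothetical name and nothing here comes from
Cygwin itself:

```python
UTF8 = "UTF-8"
SYSTEM = "System Locale"
APPLICATION = "Application Locale"

# (system locale kind, application locale kind) -> encoding of filenames
CHART = {
    ("NULL", "NULL"): UTF8,
    ("C/UA", "NULL"): UTF8,
    ("NULL", "C/UA"): UTF8,
    ("C/UA", "C/UA"): UTF8,
    ("SPEC", "NULL"): SYSTEM,
    ("SPEC", "C/UA"): UTF8,
    ("NULL", "SPEC"): APPLICATION,
    ("C/UA", "SPEC"): APPLICATION,
    ("SPEC", "SPEC"): APPLICATION,
}


def filename_encoding(sys_kind: str, app_kind: str) -> str:
    """Look up what a filename function expects/returns per the chart."""
    return CHART[(sys_kind, app_kind)]


print(filename_encoding("SPEC", "NULL"))  # System Locale
print(filename_encoding("C/UA", "SPEC"))  # Application Locale
```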

-jason

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 13 11:41, Jason Pyeron wrote:
 Corinna Vinschen wrote on Wednesday, May 13, 2009 10:30:
  On May 12 19:37, Corinna Vinschen wrote:
  On May 13 02:29, IWAMURO Motonori wrote:
  I propose that the filename encoding in C locale uses UTF-8 instead
  of SO/UTF-8. 
  
  There are three reasons:
  
  That's an interesting thought.  Do you have a patch and, if so, did
  you try it?  Does it, for instance, help for the issue reported in
  the thread starting at
  http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
  
  After examining the issue Lenik reported in the above thread,
  I'm at a loss how to solve this problem in a generic way.
  
 
 I may be dense, as all of my internationalization experience was from the late
 90's. But in my experience the only solution for this is a cognizant effort on
 behalf of the user (or admin).
 [...]
  Any better ideas?
 
 Not necessarily better, but here is a chart:
 
 Sys:  App:   function expects/returns
 NULL: NULL:   UTF-8
 C/UA: NULL:   UTF-8
 NULL: C/UA:   UTF-8
 C/UA: C/UA:   UTF-8
 SPEC: NULL:   System Locale
 SPEC: C/UA:   UTF-8
 NULL: SPEC:   Application Locale
 C/UA: SPEC:   Application Locale
 SPEC: SPEC:   Application Locale

What I just implemented basically matches the above, except for

  SPEC: NULL:   System Locale

This will also use UTF-8.
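
Read literally, the chart with that one row changed no longer depends
on the system column at all.  A Python sketch of that reading; it models
only the chart, not the actual patch, and implemented_encoding() is a
hypothetical name:

```python
KINDS = ("NULL", "C/UA", "SPEC")


def implemented_encoding(sys_kind: str, app_kind: str) -> str:
    """Chart with SPEC/NULL changed to UTF-8: only the application
    locale decides; the system locale kind is ignored."""
    if app_kind == "SPEC":
        return "Application Locale"
    return "UTF-8"


for sys_kind in KINDS:
    for app_kind in KINDS:
        print(f"{sys_kind}: {app_kind}: {implemented_encoding(sys_kind, app_kind)}")
```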


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread IWAMURO Motonori

Hi.

My idea is as follows:

1) Separate the mbtowc/wctomb function entries into library usage and
system usage (__mbtowc/__wctomb and __sys_mbtowc/__sys_wctomb).

2) If setlocale(LC_CTYPE) is called with locale != "C", then lib == sys.

3) If setlocale(LC_CTYPE) is called with locale == "C", then sys is set
from LC_ALL/LC_CTYPE/LANG. If LC_ALL/LC_CTYPE/LANG are not set, the UTF-8
converter is used.

Cygwin startup calls setlocale(LC_CTYPE, "C") in winsup/cygwin/dcrt0.cc.

I think that the result is as follows:

1) LANG=C
   lib = ascii converter, sys = UTF-8 converter.

2) LANG=xx_XX.ENCODING and setlocale is not called.
   lib = ascii converter, sys = ENCODING converter.

3) LANG=xx_XX.ENCODING and setlocale(LC_ALL, "") is called.
   lib = ENCODING converter, sys = ENCODING converter.

I think that [cat `read_dir_entry_and_print_app`] works correctly in all
of the above cases.
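
The three cases can be written down as a small decision function.  This
is a Python sketch of the idea only, not the patch.  pick_converters()
and charset_of() are hypothetical names, and treating an unset LANG or a
LANG without a charset part like LANG=C is an assumption:

```python
from typing import Optional, Tuple


def charset_of(lang: Optional[str]) -> Optional[str]:
    """Charset part of a value such as ja_JP.SJIS; None for C, POSIX, unset."""
    if not lang or lang in ("C", "POSIX"):
        return None
    _, dot, charset = lang.partition(".")
    return charset if dot and charset else None


def pick_converters(lang: Optional[str], app_called_setlocale: bool) -> Tuple[str, str]:
    """Return (lib converter, sys converter) for the three cases above."""
    charset = charset_of(lang)
    if charset is None:
        return ("ASCII", "UTF-8")   # case 1: LANG=C
    if app_called_setlocale:
        return (charset, charset)   # case 3: setlocale(LC_ALL, "") was called
    return ("ASCII", charset)       # case 2: the library stays in the C locale


print(pick_converters("C", False))           # ('ASCII', 'UTF-8')
print(pick_converters("ja_JP.SJIS", False))  # ('ASCII', 'SJIS')
print(pick_converters("ja_JP.SJIS", True))   # ('SJIS', 'SJIS')
```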

I am writing this patch and test code now.

 One problem can't be solved this way:  If an application fetches
 and stores a filename, then switches the locale, and then tries
 to use the filename in another system call, the filename is
 potentially broken.

If the application switches the encoding while processing, I think
that the problem is the responsibility of the application.

2009/5/13 Corinna Vinschen corinna-cyg...@cygwin.com:
 On May 12 19:37, Corinna Vinschen wrote:
 On May 13 02:29, IWAMURO Motonori wrote:
  I propose that the filename encoding in C locale uses UTF-8 instead of 
  SO/UTF-8.
 
  There are three reasons:

 That's an interesting thought.  Do you have a patch and, if so, did you
 try it?  Does it, for instance, help for the issue reported in the
 thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?

 After examining the issue Lenik reported in the above thread, I'm at
 a loss how to solve this problem in a generic way.

 The problem is that the filename changes dependent on the character
 set used in $LANG.  The reason is that every time a multibyte filename
 has to be generated, it has to be converted from UTF-16 to multibyte.

 For instance, take one of the filenames from Lenik's example.  It's
 stored on the filesystem as the UTF-16 sequence \u684c \u9762.  If I set
 LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence

  0xe6 0xa1 0x8c 0xe9 0x9d 0xa2

 If I set LANG to en_US.GBK, `ls' returns the filename

  0xd7 0xc0 0xc3 0xe6

 And in case LANG=C, `ls' returns

  0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2

 So, dependent on the character set setting in the application, the idea
 of the filename differs.  That's not exactly helpful for interoperability
 between different applications.

 I can think of two potential solutions to fix this problem:

 (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
    is the way files are stored on disk.  That results in unchangeable
    filenames which are always valid.

    But what if an application sets LANG=.SJIS and tries to create
    a file using SJIS character encoding?  Should the file be created
    using the SJIS-UTF-16 conversion or should open fail with EILSEQ?
    That's not good.

 (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
    Cygwin uses the LC_CTYPE setting which corresponds to the current
    codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,
    Cygwin uses that to convert pathnames.  If the application uses
    setlocale, Cygwin uses that setting to convert pathnames.

    One problem can't be solved this way:  If an application fetches
    and stores a filename, then switches the locale, and then tries
    to use the filename in another system call, the filename is
    potentially broken.

 Any better ideas?


 Corinna

 --
 Corinna Vinschen                  Please, send mails regarding Cygwin to
 Cygwin Project Co-Leader          cygwin AT cygwin DOT com
 Red Hat






-- 
IWAMURO Motnori http://vmi.jp/


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Andy Koppe

 Not necessarily better, but here is a chart:

 Sys:  App:    function expects/returns
 NULL: NULL:   UTF-8
 C/UA: NULL:   UTF-8
 NULL: C/UA:   UTF-8
 C/UA: C/UA:   UTF-8
 SPEC: NULL:   System Locale
 SPEC: C/UA:   UTF-8
 NULL: SPEC:   Application Locale
 C/UA: SPEC:   Application Locale
 SPEC: SPEC:   Application Locale

What is the System Locale in this context? Asking because Windows
doesn't have locales as such, although it does have a default ANSI
codepage.

Andy


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 14 01:03, IWAMURO Motonori wrote:
 Hi.
 
 My idea is as follows:
 
 1) Separate the mbtowc/wctomb function entries into library usage and
 system usage (__mbtowc/__wctomb and __sys_mbtowc/__sys_wctomb).
 
 2) If setlocale(LC_CTYPE) is called with locale != "C", then lib == sys.
 
 3) If setlocale(LC_CTYPE) is called with locale == "C", then sys is set
 from LC_ALL/LC_CTYPE/LANG. If LC_ALL/LC_CTYPE/LANG are not set, the UTF-8
 converter is used.

That's basically how my patch works.

 Cygwin startup calls setlocale(LC_CTYPE, "C") in winsup/cygwin/dcrt0.cc.

Yes, it does already.

 I am writing this patch and test code now.

Btw., if you plan to write more and bigger patches for Cygwin, it would
be necessary to sign a copyright assignment form.  That's explained on
http://cygwin.com/contrib.html.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread IWAMURO Motonori

2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
 That's basically how my patch works.

Sorry, I can't parse this sentence because of my poor English parser...
Are you writing the patch for this problem?

 Btw., if you plan to write more and bigger patches for Cygwin, it would
 be necessary to sign a copyright assignment form.  That's explained on
 http://cygwin.com/contrib.html.

Ummm, that seems to take a lot of time...
-- 
IWAMURO Motnori http://vmi.jp/


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 14 02:25, IWAMURO Motonori wrote:
 2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
  That's basically how my patch works.
 
 Sorry, I can't parse this sentence because of my poor English parser...

No worries.

 Are you writing the patch for this problem?

I already wrote that patch, see
http://cygwin.com/ml/cygwin-cvs/2009-q2/msg00066.html
It seems to do what you are proposing.

  Btw., if you plan to write more and bigger patches for Cygwin, it would
  be necessary to sign a copyright assignment form.  That's explained on
  http://cygwin.com/contrib.html.
 
 Ummm, that seems to take a lot of time...

Yes, unfortunately.  But it's really required for non-trivial patches.
Sorry.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread IWAMURO Motonori

2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
 I already wrote that patch, see
 http://cygwin.com/ml/cygwin-cvs/2009-q2/msg00066.html
 It seems to do what you are proposing.

I read it and built cygwin1.dll. It seems to work correctly.

Should the following part not be modified?

winsup/cygwin/fhandler_console.cc:
 dev_state->con_mbtowc = __mbtowc;
 dev_state->con_wctomb = __wctomb;

But I think the patch solves only the case of UTF-8 in the thread
starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html.

It is necessary to separate the following variables for the library
and for the system to support encoding that is not UTF-8.

- __mb_cur_max
- lc_ctype_charset
- __mbtowc
- __wctomb

And these variables are set from LC_ALL/LC_CTYPE/LANG if
setlocale(LC_CTYPE, "C") is called.
-- 
IWAMURO Motnori http://vmi.jp/


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 14 04:13, IWAMURO Motonori wrote:
 2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
  I already wrote that patch, see
  http://cygwin.com/ml/cygwin-cvs/2009-q2/msg00066.html
  It seems to do what you are proposing.
 
 I read it and built cygwin1.dll. It seems to work correctly.
 
 Should the following part not be modified?
 
 winsup/cygwin/fhandler_console.cc:
  dev_state->con_mbtowc = __mbtowc;
  dev_state->con_wctomb = __wctomb;

I'd rather not.  It only affects the console and if LANG=C I'd rather
see the single bytes which make up the path instead of the corresponding
UTF-8 character.

 But I think the patch solves only the case of UTF-8 in the thread
 starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html.
 
 It is necessary to separate the following variables for the library
 and for the system to support encoding that is not UTF-8.
 
 - __mb_cur_max
 - lc_ctype_charset
 - __mbtowc
 - __wctomb

I understand what you're up to, but right now I'm not really sure that
this is the way to go.  I had this idea as well at one point, but,
thinking about it, I see a couple of potential problems.  I don't want
to decouple the libraries' idea of a string from the application's idea.
I tried various scenarios with the current solution and they all worked
ok, one way or the other.  I'm sure there are still some which don't
work, but before doing what you propose, I'd rather see explicit
failures.  And have some time to discuss whether these are something the
user can or even should fix or work around alone.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 13 21:46, Corinna Vinschen wrote:
 On May 14 04:13, IWAMURO Motonori wrote:
  2009/5/14 Corinna Vinschen corinna-cyg...@cygwin.com:
   I already wrote that patch, see
   http://cygwin.com/ml/cygwin-cvs/2009-q2/msg00066.html
   It seems to do what you are proposing.
  
  I read it and built cygwin1.dll. It seems to work correctly.
  
  Should the following part not be modified?
  
  winsup/cygwin/fhandler_console.cc:
   dev_state->con_mbtowc = __mbtowc;
   dev_state->con_wctomb = __wctomb;
 
 I'd rather not.  It only affects the console and if LANG=C I'd rather
 see the single bytes which make up the path instead of the corresponding
 UTF-8 character.

Hm, maybe I misunderstood.  In which manner should this be modified?


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Matthias Andree

On 13.05.2009 at 17:17, Corinna Vinschen
corinna-cyg...@cygwin.com wrote:



I followed the suggestion to use UTF-8 for internal conversions when the
locale is set to C.  This will also be used as default conversion when
converting the Windows environment from UTF-16 to multibyte, unless the
environment contains a valid LC_ALL/LC_CTYPE/LANG setting.  The current
working directory was also potentially unusable, if an application
switched the locale.  Now the CWD is re-evaluated after a setlocale call.


Is Unicode normalization an issue here?

--
Matthias Andree


[1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread IWAMURO Motonori

Hi.

I propose that the filename encoding in C locale uses UTF-8 instead of SO/UTF-8.

There are three reasons:

1. For interoperability between Cygwin and various UNIX-like
systems (Linux, *BSD, Solaris, and so on).
   UNIX-like systems treat the filename as an 8-bit byte array, and many
applications on these systems (mercurial, git, rsync, and so on) send or
receive filename information without any locale information.

2. UTF-8 is the only encoding that can handle multiple languages.

3. Today, the default encoding of modern UNIX-like systems is UTF-8.
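
The first point can be checked on a Linux-like system.  A Python sketch,
assuming a filesystem that stores names as plain bytes (which is not how
Windows or Cygwin behave); the name bytes are the UTF-8 form of
U+684C U+9762:

```python
import os
import tempfile

raw_name = b"\xe6\xa1\x8c\xe9\x9d\xa2"

with tempfile.TemporaryDirectory() as tmp:
    # Passing bytes paths bypasses any locale-dependent conversion.
    path = os.path.join(os.fsencode(tmp), raw_name)
    open(path, "wb").close()
    listed = os.listdir(os.fsencode(tmp))

# The directory listing hands back exactly the bytes that were stored.
print(listed == [raw_name])
```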

Please examine it.

Thanks.
-- 
IWAMURO Motnori http://vmi.jp/


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread Corinna Vinschen

On May 13 02:29, IWAMURO Motonori wrote:
 Hi.
 
 I propose that the filename encoding in C locale uses UTF-8 instead of 
 SO/UTF-8.
 
 There are three reasons:
 
 1. For interoperability between Cygwin and various UNIX-like
 systems (Linux, *BSD, Solaris, and so on).
    UNIX-like systems treat the filename as an 8-bit byte array, and many
 applications on these systems (mercurial, git, rsync, and so on) send or
 receive filename information without any locale information.
 
 2. UTF-8 is the only encoding that can handle multiple languages.
 
 3. Today, the default encoding of modern UNIX-like systems is UTF-8.

That's an interesting thought.  Do you have a patch and, if so, did you
try it?  Does it, for instance, help for the issue reported in the
thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread Mark J. Reed

 On May 13 02:29, IWAMURO Motonori wrote:
 Hi.

 I propose that the filename encoding in C locale uses UTF-8 instead of 
 SO/UTF-8

What the heck is SO/UTF-8?

-- 
Mark J. Reed markjr...@gmail.com


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread Corinna Vinschen

On May 12 15:13, Mark J. Reed wrote:
  On May 13 02:29, IWAMURO Motonori wrote:
  Hi.
 
  I propose that the filename encoding in C locale uses UTF-8 instead of 
  SO/UTF-8
 
 What the heck is SO/UTF-8?

http://cygwin.com/1.7/cygwin-ug-net/using-specialnames.html#pathnames-unusual


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread Mark J. Reed

On Tue, May 12, 2009 at 3:22 PM, Corinna Vinschen

 http://cygwin.com/1.7/cygwin-ug-net/using-specialnames.html#pathnames-unusual

OK, got it.  So Mr. Iwamuro's proposal is that Cygwin ignore the
locale setting, and just automatically convert the Windows UTF-16
filenames to UTF-8 (and back) no matter what.

That seems rife with possible confusion, though. If I have my codepage
set to ISO-2022 and paste in a filename, I expect it to be interpreted
as ISO-2022, not as UTF-8 (which will probably fail with an invalid
encoding sequence).

OTOH, the SO/UTF-8 hack would seem to bode ill for the portability of,
say, tar archives created under Cygwin.

-- 
Mark J. Reed markjr...@gmail.com


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread Corinna Vinschen

On May 12 15:53, Mark J. Reed wrote:
 On Tue, May 12, 2009 at 3:22 PM, Corinna Vinschen
 
  http://cygwin.com/1.7/cygwin-ug-net/using-specialnames.html#pathnames-unusual
 
 OK, got it.  So Mr. Iwamuro's proposal is that Cygwin ignore the
 locale setting, and just automatically convert the Windows UTF-16
 filenames to UTF-8 (and back) no matter what.

No.  Only if LANG=C.

 That seems rife with possible confusion, though. If I have my codepage
 set to ISO-2022 and paste in a filename, I expect it to be interpreted

Cygwin 1.7 doesn't use the codepage.  It uses what $LANG says.  See
http://cygwin.com/1.7/cygwin-ug-net/setup-locale.html

 as ISO-2022, not as UTF-8 (which will probably fail with an invalid
 encoding sequence).
 
 OTOH, the SO/UTF-8 hack would seem to bode ill for the portability of,
 say, tar archives created under Cygwin.

The filenames potentially look weird, but they are valid filenames.
If anybody has a better idea how to work around the problem of UTF-16
chars which don't translate into the current singlebyte or multibyte
charset, feel free to suggest.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat
