Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-18 Thread Mark J. Reed

On Mon, May 18, 2009 at 9:17 AM, Dave Korn  wrote:
> Lenik wrote:
>> On 2009-5-18 14:09, Christopher Faylor wrote:
>>>
>>> I think the main person you should be thanking isn't a guy.
>>>
>> Ok. Thank you gods.
>>
>
>  Hey Corinna?  Congrats!  You just got a promotion!

All praise to the great Corinna!

-- 
Mark J. Reed 

--
Unsubscribe info:  http://cygwin.com/ml/#unsubscribe-simple
Problem reports:   http://cygwin.com/problems.html
Documentation: http://cygwin.com/docs.html
FAQ:   http://cygwin.com/faq/

Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-18 Thread Dave Korn

Lenik wrote:
> On 2009-5-18 14:09, Christopher Faylor wrote:
>>
>> I think the main person you should be thanking isn't a guy.
>>
> Ok. Thank you gods.
> 

  Hey Corinna?  Congrats!  You just got a promotion!

cheers,
  DaveK



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-18 Thread Lenik


On 2009-5-18 14:09, Christopher Faylor wrote:
> I think the main person you should be thanking isn't a guy.

Ok. Thank you gods.

Lenik



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-17 Thread Christopher Faylor

On Mon, May 18, 2009 at 01:41:28PM +0800, Lenik wrote:
>The expr error is fixed, and I can build cygpath from source now.
>Though I don't have NTDDK in hand, I'm surprised how it could be
>compiled.

The cygwin build is fairly self-contained.  We certainly don't need
anything like a DDK to build.

>I can get the correct result from the new cygpath now, without -C
>option.
>
>Thank you guys.

I think the main person you should be thanking isn't a guy.

cgf


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-17 Thread Lenik


On 2009-5-17 15:52, Lenik wrote:
> 2, configure failed:
> bash-3.2$ ./configure
> 5 [main] expr 952 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> ./configure: line 56: 952 Segmentation fault (core dumped) expr a : '\(a\)' > /dev/null 2>&1
> 4 [main] expr 2808 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 3516 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 3328 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 2648 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 900 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 1840 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
> 5 [main] expr 2972 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)


The expr error is fixed, and I can build cygpath from source now. Though
I don't have NTDDK in hand, I'm surprised how it could be compiled.


I can get the correct result from the new cygpath now, without -C option.

Thank you guys.
Lenik




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-17 Thread Lenik


On 2009-5-17 19:53, Corinna Vinschen wrote:
> On May 17 15:52, Lenik wrote:
>> On 2009-5-17 10:09, IWAMURO Motonori wrote:
>>> 2009/5/17 Lenik:
>>>> Thanks, but where can I get this patch?
>>>
>>> You can checkout it from CVS HEAD.
>> [...]
>> 6 weeks to the next release maybe too long to wait.
>
> We have about 2 weeks between the test releases.
>
> Corinna



Thank you. I'd be very happy if I could apply your great patch by tomorrow
morning, if not earlier. I'd rather hope to get everything immediately
when I read your reply, and IMHO that should be very easy: all you have
to do is make your working directory public and accessible. Stupid
idea, heh? :)


For now I've worked around it with a simple function:


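function _u2w() {
    # Absolute POSIX form of the argument, e.g. /cygdrive/c/foo
    local p="$(cygpath -au "$1")"
    if [ "${p:0:5}" = "/mnt/" ] || [ "${p:0:10}" = "/cygdrive/" ]; then
        # Drive path: drop the mount prefix, then turn "c/foo" into "c:/foo"
        p="${p:1}"
        p="${p#*/}"
        p="${p/\//:/}"
    else
        # Path below the Cygwin root; /usr/bin and /usr/lib are by default
        # mounts of /bin and /lib, so strip the leading "/usr"
        if [ "${p:0:9}" = /usr/bin/ ]; then p="${p:4}"; fi
        if [ "${p:0:9}" = /usr/lib/ ]; then p="${p:4}"; fi
        p="$(cygpath -am /)$p"
    fi
    # Convert every forward slash to a backslash
    p="${p//\//\\}"
    printf '%s\n' "$p"
}

path="$(_u2w "$path")"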


Lenik



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-17 Thread Corinna Vinschen

On May 17 15:52, Lenik wrote:
> On 2009-5-17 10:09, IWAMURO Motonori wrote:
>> 2009/5/17 Lenik:
>>> Thanks, but where can I get this patch?
>>
>> You can checkout it from CVS HEAD.
>[...]
> 6 weeks to the next release maybe too long to wait.

We have about 2 weeks between the test releases.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-17 Thread Corinna Vinschen

On May 17 11:09, IWAMURO Motonori wrote:
> 2009/5/17 Lenik :
> > Thanks, but where can I get this patch?
> 
> You can checkout it from CVS HEAD.

It occurred to me that, if you're using a charset which differs from your
current ANSI or OEM codepage, you might run into trouble with native
Windows tools.  Therefore I also added a new -C/--codepage option to
cygpath to specify the codepage used to create a Windows path from a
Cygwin path.  For instance:

  cygpath -C ANSI -aw .

creates the full path of the CWD in the current ANSI codepage.  The
-C/--codepage option takes the following parameters:

- ANSI   to specify the current ANSI codepage (for interaction with GUI tools).

- OEM    to specify the current OEM codepage (for interaction with CLI tools).

- UTF8   just guess... UTF-8

- n      A decimal codepage number according to the following table:
         http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
         Note that not all installations support all codepages.
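
The keyword-or-number argument described above could be parsed along the
following lines.  This is only a sketch: parse_codepage_arg is a
hypothetical helper, not cygpath's actual code, and both the case-sensitive
keyword match and the 65535 upper bound are assumptions.  The numeric values
are the ones Windows documents for CP_ACP, CP_OEMCP and CP_UTF8.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Windows codepage identifiers (values as documented by Microsoft). */
#define CP_ACP_ID    0u      /* current ANSI codepage */
#define CP_OEMCP_ID  1u      /* current OEM codepage  */
#define CP_UTF8_ID   65001u  /* UTF-8                 */

/*
 * Hypothetical helper, not cygpath's real code: map a -C argument to a
 * codepage identifier.  Returns 0 on success, -1 if the argument is
 * neither a known keyword nor a plain decimal number.
 */
int parse_codepage_arg(const char *arg, unsigned int *codepage)
{
    assert(arg != NULL && codepage != NULL);

    if (strcmp(arg, "ANSI") == 0) { *codepage = CP_ACP_ID;   return 0; }
    if (strcmp(arg, "OEM") == 0)  { *codepage = CP_OEMCP_ID; return 0; }
    if (strcmp(arg, "UTF8") == 0) { *codepage = CP_UTF8_ID;  return 0; }

    /* Otherwise expect decimal digits only (no sign, no whitespace). */
    if (arg[0] < '0' || arg[0] > '9')
        return -1;

    char *end = NULL;
    errno = 0;
    unsigned long n = strtoul(arg, &end, 10);
    if (errno != 0 || *end != '\0' || n > 65535ul)  /* bound is an assumption */
        return -1;

    *codepage = (unsigned int)n;
    return 0;
}
```

For example, parse_codepage_arg("932", &cp) stores 932 (the Windows codepage
for Shift-JIS) in cp, while parse_codepage_arg("bogus", &cp) returns -1.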

I hope that helps.  Please note that the -C option doesn't work yet for
the -p option.  That's something I'll do after my vacation.

Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-17 Thread Lenik


On 2009-5-17 10:09, IWAMURO Motonori wrote:
> 2009/5/17 Lenik:
>> Thanks, but where can I get this patch?
>
> You can checkout it from CVS HEAD.


Thanks for the information. I wasn't expecting to have to build from
source, though; that really frustrates me and brings me even more problems.


Is there a mirror site for nightly builds, so that I can use rsync to get
it (in case this patch is too minor to bump any of the version numbers)?
I've just looked at the snapshots, but the last update time is 2009-05-13.


I can't build from source; here are some of the errors, and I guess there
will be more. So I hope someone can provide a compiled cygpath soon; 6
weeks to the next release may be too long to wait.


1, cvs update failed:
... (ignored)
cvs update: Updating src/winsup/testsuite/winsup.api/samples
cvs update: Updating src/winsup/utils
cvs update: Updating src/winsup/w32api
cvs update: Updating src/winsup/w32api/include
cvs update: Updating src/winsup/w32api/include/GL
cvs update: Updating src/winsup/w32api/include/ddk
cvs update: Updating src/winsup/w32api/include/directx
cvs update: Updating src/winsup/w32api/lib
cvs update: Updating src/winsup/w32api/lib/ddk
cvs update: Updating src/winsup/w32api/lib/directx
cvs update: closing down connection to cygwin.com: Transport endpoint is not connected


2, configure failed:
bash-3.2$ ./configure
  5 [main] expr 952 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
./configure: line 56:   952 Segmentation fault  (core dumped) expr a : '\(a\)' > /dev/null 2>&1
  4 [main] expr 2808 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 3516 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 3328 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 2648 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 900 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 1840 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
  5 [main] expr 2972 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)


Thanks,
Lenik



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-16 Thread IWAMURO Motonori

2009/5/17 Lenik :
> Thanks, but where can I get this patch?

You can checkout it from CVS HEAD.
-- 
IWAMURO Motnori 


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-16 Thread Lenik


On 2009-5-16 23:49, Corinna Vinschen wrote:
> Looks like cygpath gets the wcstombs function from ntdll rather than
> from cygwin1.dll due to a linking order problem.  Unfortunately ntdll
> exports a couple of convenient C functions like wcstombs, or even
> sprintf.  I applied a patch so the next version of cygpath should
> do the conversion more correctly.
>
> Corinna


Thanks, but where can I get this patch?

Lenik



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-16 Thread Corinna Vinschen

On May 16 13:17, Lenik wrote:
> (This mail is encoded in utf-8)
>
> After testing with 1.7.0-48, many problems are eliminated.
>
> But cygpath doesn't return good pathnames, see:

Looks like cygpath gets the wcstombs function from ntdll rather than
from cygwin1.dll due to a linking order problem.  Unfortunately ntdll
exports a couple of convenient C functions like wcstombs, or even
sprintf.  I applied a patch so the next version of cygpath should
do the conversion more correctly.
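
To illustrate why it matters which wcstombs gets linked in: a locale-aware
wcstombs converts according to the LC_CTYPE locale currently in effect, so
the same wide string can come out as different bytes, or fail to convert,
depending on that setting.  The following is a minimal sketch in standard C;
wide_to_multibyte is a hypothetical helper, not cygpath code.  That an
implementation unaware of the Cygwin locale yields different bytes is an
inference from the paragraph above, not something the thread spells out.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Convert a wide string to multibyte in the locale named by locale_name
 * (e.g. "C" or "en_US.UTF-8").  Returns the number of bytes written, not
 * counting the terminating NUL, or (size_t)-1 if the locale is not
 * available or a character cannot be represented in it.
 * Note: this changes the LC_CTYPE locale of the whole process.
 */
size_t wide_to_multibyte(const char *locale_name, const wchar_t *src,
                         char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL && dst_size > 0);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;              /* locale not installed */

    /* Leave room for the terminator; wcstombs may not add one itself. */
    size_t n = wcstombs(dst, src, dst_size - 1);
    if (n == (size_t)-1)
        return n;                       /* unconvertible character */

    dst[n] = '\0';
    return n;
}
```

Plain ASCII such as L"C:/Profiles" converts in any locale; whether a name
like L"桌面" converts, and to which bytes, depends on the locale passed in.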

Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-15 Thread Lenik


(This mail is encoded in utf-8)

After testing with 1.7.0-48, many problems are eliminated.

But cygpath doesn't return good pathnames, see:

1, Get absolute path of current directory:

C:\Profiles\Shecti\桌面> set LANG=zh_CN.GBK& cygpath -am .
C:/Profiles/Shecti/桌面 (good)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.GBK& cygpath -au .
/mnt/c/Profiles/Shecti/桌面/ (good)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.UTF-8& cygpath -am .
C:/Profiles/Shecti/ (bad)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.UTF-8& cygpath -au .
/mnt/c/Profiles/Shecti/桌面/ (good)

C:\Profiles\Shecti\桌面> set LANG=C& cygpath -am .
C:/Profiles/Shecti/ (bad)

C:\Profiles\Shecti\桌面> set LANG=C& cygpath -au .
/mnt/c/Profiles/Shecti/桌面/ (good)

Conclusion:
1.1 only GBK works for `cygpath -am .' (also -aw)
1.2 all work for `cygpath -au .'

2, Get absolute path of specified path

C:\Profiles\Shecti\桌面> set LANG=zh_CN.GBK& cygpath -am C:\Profiles\Shecti\桌面
C:/Profiles/Shecti/妗岄潰 (bad)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.GBK& cygpath -au C:\Profiles\Shecti\桌面
/mnt/c/Profiles/Shecti/妗岄潰 (bad)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.UTF-8& cygpath -am C:\Profiles\Shecti\桌面
C:/Profiles/Shecti/ (bad)

C:\Profiles\Shecti\桌面> set LANG=zh_CN.UTF-8& cygpath -au C:\Profiles\Shecti\桌面
/mnt/c/Profiles/Shecti/桌面 (good)

C:\Profiles\Shecti\桌面> set LANG=C& cygpath -am C:\Profiles\Shecti\桌面
C:/Profiles/Shecti/ (bad)

C:\Profiles\Shecti\桌面> set LANG=C& cygpath -au C:\Profiles\Shecti\桌面
/mnt/c/Profiles/Shecti/桌面 (good)

Conclusion:
2.1 none works for `cygpath -am PathContainsNonascii'
2.2 GBK doesn't work for `cygpath -au PathContainsNonascii'

Now the problem is, I must use GBK for 1.1, and I cannot use GBK for
2.2, and there is no other choice.   -_-||...


Lenik




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-15 Thread Corinna Vinschen

On May 15 20:34, IWAMURO Motonori wrote:
> 2009/5/15 Corinna Vinschen :
> > I have just trouble with SJIS, but that's not something I can easily
> > test. Maybe you can look into that in the next couple of days?
> 
> Maybe I can. Please explain details of the trouble.

Probably I only fall over my own feet.  I was surprised to see the
filenames using Chinese characters (from Lenik's examples) using SO/UTF
sequences.  I didn't expect that, but maybe that was correct.  The whole
problem already starts with me not being able to see non-western chars
in the console window.  The two available console fonts simply don't
provide them, so I only see squares, even if the characters are printed
correctly.

It would be cool if you could simply use SJIS for testing and see if
everything looks basically ok.

For the record:  The internationalization stuff is a heck of a lot
of effort.  Even if my replies might sound mean sometimes, I'm glad
for your input and help coding.

Thanks,
Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-15 Thread IWAMURO Motonori

2009/5/15 Corinna Vinschen :
> I have just trouble with SJIS, but that's not something I can easily
> test. Maybe you can look into that in the next couple of days?

Maybe I can. Please explain details of the trouble.
-- 
IWAMURO Motnori 


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 14 19:23, Corinna Vinschen wrote:
> On May 15 01:34, IWAMURO Motonori wrote:
> > 2009/5/15 Corinna Vinschen :
> > > Here's one problem.  What if an application uses setenv("LANG", ...)?
> > 
> > Oh. Hmmm, I think that anything should not occur.
> > 
> > > Do you want Cygwin to intercept all calls to setenv() to check for
> > > setting $LC_ALL/LC_CTYPE/LANG?
> > 
> > No. I think that only setlocale() has to do the check.
> > The reason:
> > - setlocale(LC_CTYPE, "C") is called from Cygwin startup.
> > - The following code becomes valid.
> > setenv("LANG", "...");
> > setlocale(LC_ALL, "");
> 
> Ok, that makes sense.  I'm just testing a patch which uses the
> environment settings internally as you proposed (still keeping UTF-8 the
> default for the "C" locale).  So far it works fine, I have just trouble
> with SJIS, but that's not something I can easily test.  I'm simply
> lacking a real understanding of the SJIS encoding.  Maybe you can look
> into that in the next couple of days?  I'll be mostly offline next week
> and the week after so there's a lot of time for testing ;-)

I applied the patch.  The charset used by Cygwin now only depends on the
last call to setlocale in an application and eventually on the setting
of the internationalization environment variables LC_ALL/LC_CTYPE/LANG.
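
As a rough model of that rule: POSIX gives LC_ALL precedence over LC_CTYPE,
and LC_CTYPE precedence over LANG, and this thread keeps UTF-8 as the
charset for the "C" locale.  The sketch below only models that lookup; both
function names are hypothetical, this is not Cygwin's code, and treating
"POSIX" like "C" and ignoring "@modifier" suffixes are simplifying
assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* First non-empty value of LC_ALL, LC_CTYPE, LANG (POSIX precedence),
 * or "C" if none of them is set. */
const char *ctype_locale_from_env(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };

    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && value[0] != '\0')
            return value;
    }
    return "C";
}

/* Charset named by a locale string such as "zh_CN.GBK".  Returns "UTF-8"
 * for "C"/"POSIX" (the default discussed in this thread) and NULL when
 * the name carries no explicit charset. */
const char *charset_for_locale(const char *locale)
{
    assert(locale != NULL);

    if (strcmp(locale, "C") == 0 || strcmp(locale, "POSIX") == 0)
        return "UTF-8";

    const char *dot = strchr(locale, '.');
    return dot != NULL ? dot + 1 : NULL;
}
```

With LANG=zh_CN.GBK and neither LC_ALL nor LC_CTYPE set,
charset_for_locale(ctype_locale_from_env()) yields "GBK".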

A side effect of this change is that the console charset depends solely
on the environment setting again.  So you can change the console charset
used in an application on the fly again by changing the LC_ALL/LC_CTYPE/LANG
setting in the environment(*), instead of setting it only once at startup
of the first Cygwin process in the console.

The (in)famous "ssh to a remote machine from a UTF-8 console doesn't
work" problem(**) should be unaffected by this change and still work
now, if, for instance, LANG is set to "en_US.UTF-8" in the calling
shell.

Just the documentation needs to be updated again.

I really hope this change makes it finally easier to use Cygwin with
native characters.  I'll create a new Cygwin package tomorrow.

Corinna

(*) Still depends on a call to setlocale, but the Cygwin shells do that
anyway.
(**) http://cygwin.com/ml/cygwin/2009-04/msg00055.html

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 15 01:34, IWAMURO Motonori wrote:
> 2009/5/15 Corinna Vinschen :
> > Here's one problem.  What if an application uses setenv("LANG", ...)?
> 
> Oh. Hmmm, I think that anything should not occur.
> 
> > Do you want Cygwin to intercept all calls to setenv() to check for
> > setting $LC_ALL/LC_CTYPE/LANG?
> 
> No. I think that only setlocale() has to do the check.
> The reason:
> - setlocale(LC_CTYPE, "C") is called from Cygwin startup.
> - The following code becomes valid.
> setenv("LANG", "...");
> setlocale(LC_ALL, "");

Ok, that makes sense.  I'm just testing a patch which uses the
environment settings internally as you proposed (still keeping UTF-8 the
default for the "C" locale).  So far it works fine, I have just trouble
with SJIS, but that's not something I can easily test.  I'm simply
lacking a real understanding of the SJIS encoding.  Maybe you can look
into that in the next couple of days?  I'll be mostly offline next week
and the week after so there's a lot of time for testing ;-)

Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread IWAMURO Motonori

2009/5/15 Corinna Vinschen :
> Here's one problem.  What if an application uses setenv("LANG", ...)?

Oh. Hmmm, I think that anything should not occur.

> Do you want Cygwin to intercept all calls to setenv() to check for
> setting $LC_ALL/LC_CTYPE/LANG?

No. I think that only setlocale() has to do the check.
The reason:
- setlocale(LC_CTYPE, "C") is called from Cygwin startup.
- The following code becomes valid.
setenv("LANG", "...");
setlocale(LC_ALL, "");
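
Spelled out as a compilable sketch, the pattern looks roughly like this.
switch_lang is a hypothetical wrapper; clearing LC_ALL and LC_CTYPE first is
an added assumption so that LANG actually takes effect, and nl_langinfo is
used only to report which codeset the C library ends up with.

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv(), unsetenv(), nl_langinfo() */
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>

/*
 * Store value in LANG, then let setlocale() re-read the environment.
 * Returns the name of the codeset now in effect, or NULL if the
 * environment could not be changed or the locale was rejected.
 */
const char *switch_lang(const char *value)
{
    assert(value != NULL);

    /* LC_ALL and LC_CTYPE take precedence over LANG, so clear them. */
    if (unsetenv("LC_ALL") != 0 || unsetenv("LC_CTYPE") != 0)
        return NULL;
    if (setenv("LANG", value, 1) != 0)
        return NULL;

    /* The empty string means "take the locale from the environment". */
    if (setlocale(LC_ALL, "") == NULL)
        return NULL;

    return nl_langinfo(CODESET);
}
```

Only the setlocale call makes the new LANG value visible to the C library;
setenv on its own changes nothing about the locale in effect.
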
-- 
IWAMURO Motnori 


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 14 16:42, Corinna Vinschen wrote:
> On May 14 23:06, IWAMURO Motonori wrote:
> > 2009/5/14 Corinna Vinschen :
> > > I see a couple of potential problems.
> > 
> > What problems are those?
> 
> I have no example off-hand.  When I thought about it I always got sick
> thinking about scenarios where the library is using, say, UTF-8, and the
> application is using SJIS, and what happens to the filenames in this
> case.  In theory the lib should provide what the application thinks is
> right.

Here's one problem.  What if an application uses setenv("LANG", ...)?
Do you want Cygwin to intercept all calls to setenv() to check for
setting $LC_ALL/LC_CTYPE/LANG?  Right now, the only time these variables
are read by Cygwin is at the start of the first Cygwin process in a
Cygwin process tree.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 14 23:06, IWAMURO Motonori wrote:
> 2009/5/14 Corinna Vinschen :
> > I see a couple of potential problems.
> 
> What problems are those?

I have no example off-hand.  When I thought about it I always got sick
thinking about scenarios where the library is using, say, UTF-8, and the
application is using SJIS, and what happens to the filenames in this
case.  In theory the lib should provide what the application thinks is
right.

> > And have some time to discuss whether these are something the
> > user can or even should fix or workaround alone.
> 
> I think that an application that takes its locale from the environment
> variables and an application that uses no locale should be able to read
> and write the same byte sequence.

Ok, you've got a point there.  Assuming we leave out any application
which deliberately uses a non-"C" locale which differs from the setting
in the environment.  Then we're getting into trouble.

If Cygwin uses internally always the environment setting, we have a
valid, identical byte stream for all applications using
setlocale(LC_ALL, ""), *and* for non locale-aware applications.

But if the application uses a deliberately different setting via
setlocale, ..., hmm.  It should get what it asks for.

Maybe, you're right.  I have to test this a bit.

Thanks for your input,
Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread IWAMURO Motonori

2009/5/14 Corinna Vinschen :
> I see a couple of potential problems.

What problems are those?

> And have some time to discuss whether these are something the
> user can or even should fix or workaround alone.

I think that an application that takes its locale from the environment
variables and an application that uses no locale should be able to read
and write the same byte sequence.

However, I don't strongly request it because the applications work
correctly in UTF-8.
-- 
IWAMURO Motnori 


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 14 21:39, IWAMURO Motonori wrote:
> 2009/5/14 Corinna Vinschen :
> >> > Should the following part not be modified?
> >> >
> >> > winsup/cygwin/fhandler_console.cc:
> >> > > dev_state->con_mbtowc = __mbtowc;
> >> > > dev_state->con_wctomb = __wctomb;
> >>
> >> I'd rather not.  It only affects the console and if LANG=C I'd rather
> >> see the single bytes which make up the path instead of the corresponding
> >> UTF-8 character.
> >
> > > Hm, maybe I misunderstood.  In which manner should this be modified?
> 
> I think:
> 
> dev_state->con_mbtowc = __mbtowc == __ascii_mbtowc ? __utf8_mbtowc : __mbtowc;
> dev_state->con_wctomb = __wctomb == __ascii_wctomb ? __utf8_wctomb : __wctomb;

Oh, ok.  So I understood right.  But that's exactly what I didn't want
to do.  The idea is that, even though UTF-8 is used for the filename
conversion, the console should default to standard ASCII behaviour,
unless you specify another charset before starting the first Cygwin
process in the console.

I'm also wondering if we should perhaps only allow either ASCII or
UTF-8 as console charsets, but for now I don't want to touch this
more than necessary.  I just found that the console I/O doesn't work
well for non-ASCII chars anyway.  The core function which echoes input
to the terminal only handles singlebyte chars, which can be easily
reproduced using copy/paste.  Oh well.

Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread IWAMURO Motonori

2009/5/14 Corinna Vinschen :
>> > Should the following part not be modified?
>> >
>> > winsup/cygwin/fhandler_console.cc:
>> > > dev_state->con_mbtowc = __mbtowc;
>> > > dev_state->con_wctomb = __wctomb;
>>
>> I'd rather not.  It only affects the console and if LANG=C I'd rather
>> see the single bytes which make up the path instead of the corresponding
>> UTF-8 character.
>
> Hm, maybe I misunderstood.  In which manner should this be modified?

I think:

dev_state->con_mbtowc = __mbtowc == __ascii_mbtowc ? __utf8_mbtowc : __mbtowc;
dev_state->con_wctomb = __wctomb == __ascii_wctomb ? __utf8_wctomb : __wctomb;
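
The fallback proposed in these two lines can be modelled in isolation as
follows.  Everything here is a stand-in: the three-argument signature is
hypothetical (it is not the signature of newlib's internal __mbtowc), and
ascii_mbtowc and utf8_mbtowc are simplified converters written only for
this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical converter signature, used only in this sketch. */
typedef int (*mbtowc_fn)(wchar_t *dst, const char *src, size_t n);

/* Stand-in ASCII converter: rejects any byte >= 0x80. */
int ascii_mbtowc(wchar_t *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n == 0)
        return -1;
    unsigned char c = (unsigned char)src[0];
    if (c >= 0x80)
        return -1;
    *dst = (wchar_t)c;
    return c != 0;                      /* 0 for NUL, 1 otherwise */
}

/* Stand-in UTF-8 converter: 1- to 3-byte sequences only, and without
 * checks for overlong encodings. */
int utf8_mbtowc(wchar_t *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    const unsigned char *s = (const unsigned char *)src;
    if (n == 0)
        return -1;
    if (s[0] < 0x80) {                  /* single byte: plain ASCII */
        *dst = (wchar_t)s[0];
        return s[0] != 0;
    }
    if ((s[0] & 0xE0) == 0xC0 && n >= 2 && (s[1] & 0xC0) == 0x80) {
        *dst = (wchar_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && n >= 3 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        *dst = (wchar_t)(((s[0] & 0x0F) << 12) |
                         ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
        return 3;
    }
    return -1;                          /* longer or malformed sequence */
}

/* The proposed rule: if the current converter is the ASCII one, the
 * console uses UTF-8 instead; any other converter is kept as it is. */
mbtowc_fn console_mbtowc(mbtowc_fn current)
{
    return current == ascii_mbtowc ? utf8_mbtowc : current;
}
```

With this rule a console running under LANG=C would decode the bytes
E6 A1 8C as the single character U+684C instead of rejecting them.
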
-- 
IWAMURO Motnori 


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-14 Thread Corinna Vinschen

On May 13 23:49, Matthias Andree wrote:
> On 13.05.2009 at 17:17, Corinna Vinschen wrote:
>
>> I followed the suggestion to use UTF-8 for internal conversions when the
>> locale is set to "C".  This will also be used as default conversion when
>> converting the Windows environment from UTF-16 to multibyte, unless the
>> environment contains a valid LC_ALL/LC_CTYPE/LANG setting.  The current
>> working directory was also potentially unusable, if an application
>> switched the locale.  Now the CWD is re-evaluated after a setlocale call.
>
> Is Unicode normalization an issue here?

Not really, I guess.  Either a character is available or it isn't.
We don't have composition or decomposition capabilities for most
multibyte character sets anyway.  If at all, we could only do that
for SJIS, EUCJP, and GBK.  None of them would profit much from that.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Matthias Andree

On 13.05.2009 at 17:17, Corinna Vinschen wrote:

> I followed the suggestion to use UTF-8 for internal conversions when the
> locale is set to "C".  This will also be used as default conversion when
> converting the Windows environment from UTF-16 to multibyte, unless the
> environment contains a valid LC_ALL/LC_CTYPE/LANG setting.  The current
> working directory was also potentially unusable, if an application
> switched the locale.  Now the CWD is re-evaluated after a setlocale call.


Is Unicode normalization an issue here?

--
Matthias Andree


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 13 21:46, Corinna Vinschen wrote:
> On May 14 04:13, IWAMURO Motonori wrote:
> > 2009/5/14 Corinna Vinschen :
> > > I already wrote that patch, see
> > > http://cygwin.com/ml/cygwin-cvs/2009-q2/msg00066.html
> > > It seems to do what you are proposing.
> > 
> > I read it and built cygwin1.dll. It seems to work correctly.
> > 
> > Should the following part not be modified?
> > 
> > winsup/cygwin/fhandler_console.cc:
> > > dev_state->con_mbtowc = __mbtowc;
> > > dev_state->con_wctomb = __wctomb;
> 
> I'd rather not.  It only affects the console and if LANG=C I'd rather
> see the single bytes which make up the path instead of the corresponding
> UTF-8 character.

Hm, maybe I misunderstood.  In which manner should this be modified?


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 14 04:13, IWAMURO Motonori wrote:
> 2009/5/14 Corinna Vinschen :
> > I already wrote that patch, see
> > http://cygwin.com/ml/cygwin-cvs/2009-q2/msg00066.html
> > It seems to do what you are proposing.
> 
> I read it and built cygwin1.dll. It seems to work correctly.
> 
> Should the following part not be modified?
> 
> winsup/cygwin/fhandler_console.cc:
> > dev_state->con_mbtowc = __mbtowc;
> > dev_state->con_wctomb = __wctomb;

I'd rather not.  It only affects the console and if LANG=C I'd rather
see the single bytes which make up the path instead of the corresponding
UTF-8 character.

> But I think the patch solves only the case of UTF-8 in the thread
> starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html.
> 
> It is necessary to separate the following variables for the library
> and for the system to support encoding that is not UTF-8.
> 
> - __mb_cur_max
> - lc_ctype_charset
> - __mbtowc
> - __wctomb

I understand what you're up to, but right now I'm not really sure that
this is the way to go.  I had this idea as well at one point, but,
thinking about it, I see a couple of potential problems.  I don't want
to decouple the libraries' idea of a string from the application's idea.
I tried various scenarios with the current solution and they all worked
ok, one way or the other.  I'm sure there are still some which don't
work, but before doing what you propose, I'd rather see explicit
failures.  And have some time to discuss whether these are something the
user can or even should fix or workaround alone.

Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread IWAMURO Motonori

2009/5/14 Corinna Vinschen :
> I already wrote that patch, see
> http://cygwin.com/ml/cygwin-cvs/2009-q2/msg00066.html
> It seems to do what you are proposing.

I read it and built cygwin1.dll. It seems to work correctly.

Should the following part not be modified?

winsup/cygwin/fhandler_console.cc:
> dev_state->con_mbtowc = __mbtowc;
> dev_state->con_wctomb = __wctomb;

But I think the patch solves only the case of UTF-8 in the thread
starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html.

To support encodings other than UTF-8, it is necessary to keep the
following variables separately for the library and for the system.

- __mb_cur_max
- lc_ctype_charset
- __mbtowc
- __wctomb

And these variables are set from LC_ALL/LC_CTYPE/LANG when
setlocale(LC_CTYPE, "C") is called.
-- 
IWAMURO Motonori


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 14 02:25, IWAMURO Motonori wrote:
> 2009/5/14 Corinna Vinschen :
> > That's basically how my patch works.
> 
> Sorry, I can't parse this sentence because of my poor English parser...

No worries.

> Are you writing a patch for this problem?

I already wrote that patch, see
http://cygwin.com/ml/cygwin-cvs/2009-q2/msg00066.html
It seems to do what you are proposing.

> > Btw., if you plan to write more and bigger patches for Cygwin, it would
> > be necessary to sign a copyright assignment form.  That's explained on
> > http://cygwin.com/contrib.html.
> 
> Hmm, that seems to take a lot of time...

Yes, unfortunately.  But it's really required for non-trivial patches.
Sorry.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread IWAMURO Motonori

2009/5/14 Corinna Vinschen :
> That's basically how my patch works.

Sorry, I can't parse this sentence because of my poor English parser...
Are you writing a patch for this problem?

> Btw., if you plan to write more and bigger patches for Cygwin, it would
> be necessary to sign a copyright assignment form.  That's explained on
> http://cygwin.com/contrib.html.

Hmm, that seems to take a lot of time...
-- 
IWAMURO Motonori


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 14 01:03, IWAMURO Motonori wrote:
> Hi.
> 
> My idea is as follows:
> 
> 1) Separate the mbtowc/wctomb function entries into library usage and
> system usage (__mbtowc/__wctomb and __sys_mbtowc/__sys_wctomb).
> 
> 2) If setlocale(LC_CTYPE) is called with locale != "C", then lib == sys.
> 
> 3) If setlocale(LC_CTYPE) is called with locale == "C", then sys is set
> from LC_ALL/LC_CTYPE/LANG. If none of LC_ALL/LC_CTYPE/LANG is set, use
> the UTF-8 converter.

That's basically how my patch works.

> Cygwin startup calls setlocale(LC_CTYPE, "C") in winsup/cygwin/dcrt0.cc.

Yes, it does already.

> I am writing this patch and test code now.

Btw., if you plan to write more and bigger patches for Cygwin, it would
be necessary to sign a copyright assignment form.  That's explained on
http://cygwin.com/contrib.html.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Andy Koppe

> Not necessarily better, but here is a chart:
>
> Sys:  App:    function expects/returns
> NULL: NULL:   UTF-8
> C/UA: NULL:   UTF-8
> NULL: C/UA:   UTF-8
> C/UA: C/UA:   UTF-8
> SPEC: NULL:   System Locale
> SPEC: C/UA:   UTF-8
> NULL: SPEC:   Application Locale
> C/UA: SPEC:   Application Locale
> SPEC: SPEC:   Application Locale

What is the "System Locale" in this context? Asking because Windows
doesn't have locales as such, although it does have a default "ANSI
codepage".

Andy


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread IWAMURO Motonori

Hi.

My idea is as follows:

1) Separate the mbtowc/wctomb function entries into library usage and
system usage (__mbtowc/__wctomb and __sys_mbtowc/__sys_wctomb).

2) If setlocale(LC_CTYPE) is called with locale != "C", then lib == sys.

3) If setlocale(LC_CTYPE) is called with locale == "C", then sys is set
from LC_ALL/LC_CTYPE/LANG. If none of LC_ALL/LC_CTYPE/LANG is set, use
the UTF-8 converter.

Cygwin startup calls setlocale(LC_CTYPE, "C") in winsup/cygwin/dcrt0.cc.

I think that the result is as follows:

1) LANG=C
   lib = ASCII converter, sys = UTF-8 converter.

2) LANG=xx_XX.ENCODING and setlocale is not called.
   lib = ASCII converter, sys = ENCODING converter.

3) LANG=xx_XX.ENCODING and setlocale(LC_ALL, "") is called.
   lib = ENCODING converter, sys = ENCODING converter.

I think that [cat `read_dir_entry_and_print_app`] works correctly in all
of the above cases.
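
The three resulting cases can be sketched as a small selection function. This is only an illustration under stated assumptions: `charset_from_env` and `pick_converters` are hypothetical names, the converters are represented by charset strings rather than real `__mbtowc`/`__wctomb` function pointers, and the LC_ALL, LC_CTYPE, LANG precedence is the usual POSIX order rather than something taken from a patch.

```python
def charset_from_env(env):
    """Return the charset named by LC_ALL, LC_CTYPE or LANG, or None.

    The first variable that is set and non-empty wins (POSIX order).
    "C" and "POSIX" carry no charset, so they yield None.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            # "ja_JP.SJIS" -> "SJIS"; drop a modifier such as "@euro".
            _, dot, charset = value.partition(".")
            charset = charset.split("@")[0]
            return charset if dot and charset else None
    return None


def pick_converters(env, app_called_setlocale):
    """Return (lib, sys) charset names for the three cases listed above.

    Hypothetical model: `lib` is what the C library functions would use,
    `sys` is what pathname conversion would use.
    """
    env_charset = charset_from_env(env)
    sys_conv = env_charset or "UTF-8"      # nothing usable set: UTF-8
    if app_called_setlocale and env_charset:
        lib_conv = env_charset             # case 3: lib == sys
    else:
        lib_conv = "ASCII"                 # cases 1 and 2: library stays in "C"
    return lib_conv, sys_conv


print(pick_converters({"LANG": "C"}, False))           # ('ASCII', 'UTF-8')
print(pick_converters({"LANG": "ja_JP.SJIS"}, False))  # ('ASCII', 'SJIS')
print(pick_converters({"LANG": "ja_JP.SJIS"}, True))   # ('SJIS', 'SJIS')
```

Keeping the environment lookup separate from the lib/sys decision mirrors the split between library usage and system usage in the idea itself.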

I am writing this patch and test code now.

> One problem can't be solved this way:  If an application fetches
> and stores a filename, then switches the locale, and then tries
> to use the filename in another system call, the filename is
> potentially broken.

If the application switches the encoding while processing, I think
that problem is the responsibility of the application.

2009/5/13 Corinna Vinschen :
> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>> > I propose that the filename encoding in C locale uses UTF-8 instead of 
>> > SO/UTF-8.
>> >
>> > There are three reasons:
>>
>> That's an interesting thought.  Do you have a patch and, if so, did you
>> try it?  Does it, for instance, help for the issue reported in the
>> thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
>
> After examining the issue Lenik reported in the above thread, I'm at
> a loss how to solve this problem in a generic way.
>
> The problem is that the filename changes dependent on the character
> set used in $LANG.  The reason is that every time a multibyte filename
> has to be generated, it has to be converted from UTF-16 to multibyte.
>
> For instance, taking one of the filenames from Lenik's example.  It's
> stored on the filesystem as the UTF-16 sequence \u684c \u9762.  If I set
> LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence
>
>  0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
>
> If I set LANG to en_US.GBK, `ls' returns the filename
>
>  0xd7 0xc0 0xc3 0xe6
>
> And in case LANG=C, `ls' returns
>
>  0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
>
> So, dependent on the character set setting in the application, the idea
> of the filename differs.  That's not exactly helpful for interoperability
> between different applications.
>
> I can think of two potential solutions to fix this problem:
>
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
>    is the way files are stored on disk.  That results in unchangeable
>    filenames which are always valid.
>
>    But what if an application sets LANG=".SJIS" and tries to create
>    a file using SJIS character encoding?  Should the file be created
>    using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
>    That's not good.
>
> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
>    Cygwin uses the LC_CTYPE setting which corresponds to the current
>    codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,
>    Cygwin uses that to convert pathnames.  If the application uses
>    setlocale, Cygwin uses that setting to convert pathnames.
>
>    One problem can't be solved this way:  If an application fetches
>    and stores a filename, then switches the locale, and then tries
>    to use the filename in another system call, the filename is
>    potentially broken.
>
> Any better ideas?
>
>
> Corinna
>



-- 
IWAMURO Motonori


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 13 11:41, Jason Pyeron wrote:
> Corinna Vinschen wrote on Wednesday, May 13, 2009 10:30:
> > On May 12 19:37, Corinna Vinschen wrote:
> >> On May 13 02:29, IWAMURO Motonori wrote:
> >>> I propose that the filename encoding in C locale uses UTF-8 instead
> >>> of SO/UTF-8. 
> >>> 
> >>> There are three reasons:
> >> 
> >> That's an interesting thought.  Do you have a patch and, if so, did
> >> you try it?  Does it, for instance, help for the issue reported in
> >> the thread starting at
> > http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
> > 
> > After examining the issue Lenik reported in the above thread,
> > I'm at a loss how to solve this problem in a generic way.
> > 
> 
> I may be dense, as all of my internationalization experience was from the late
> 90's. But in my experience the only solution for this is a cognizant effort on
> behalf of the user (or admin).
> [...]
> > Any better ideas?
> 
> Not necessarily better, but here is a chart:
> 
> Sys:  App:    function expects/returns
> NULL: NULL:   UTF-8
> C/UA: NULL:   UTF-8
> NULL: C/UA:   UTF-8
> C/UA: C/UA:   UTF-8
> SPEC: NULL:   System Locale
> SPEC: C/UA:   UTF-8
> NULL: SPEC:   Application Locale
> C/UA: SPEC:   Application Locale
> SPEC: SPEC:   Application Locale

What I just implemented basically matches the above, except for

  SPEC: NULL:   System Locale

This will also use UTF-8.
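
The chart and the one exception can be written down as a lookup. This is a minimal sketch, not the Cygwin code: `chart_encoding` and its `as_implemented` flag are hypothetical names, and the locale classes are the chart's own labels (NULL, C/UA, SPEC) passed as strings.

```python
def chart_encoding(sys_locale, app_locale, as_implemented=False):
    """Return which encoding a filename function expects/returns.

    sys_locale / app_locale are "NULL", "C/UA" or "SPEC", as in the chart.
    With as_implemented=True the SPEC/NULL row yields UTF-8 instead of
    the system locale, which is the exception described above.
    """
    if app_locale == "SPEC":
        return "Application Locale"
    if sys_locale == "SPEC" and app_locale == "NULL" and not as_implemented:
        return "System Locale"
    return "UTF-8"


for sys_locale in ("NULL", "C/UA", "SPEC"):
    for app_locale in ("NULL", "C/UA", "SPEC"):
        print(sys_locale, app_locale,
              chart_encoding(sys_locale, app_locale),
              chart_encoding(sys_locale, app_locale, as_implemented=True),
              sep="\t")
```

Only the SPEC/NULL row differs between the two result columns; every other combination gives the same answer in both.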


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


RE: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Jason Pyeron

Corinna Vinschen wrote on Wednesday, May 13, 2009 10:30:
> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>>> I propose that the filename encoding in C locale uses UTF-8 instead
>>> of SO/UTF-8. 
>>> 
>>> There are three reasons:
>> 
>> That's an interesting thought.  Do you have a patch and, if so, did
>> you try it?  Does it, for instance, help for the issue reported in
>> the thread starting at
> http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
> 
> After examining the issue Lenik reported in the above thread,
> I'm at a loss how to solve this problem in a generic way.
> 

I may be dense, as all of my internationalization experience was from the late
90's. But in my experience the only solution for this is a cognizant effort on
behalf of the user (or admin).

> The problem is that the filename changes dependent on the
> character set used in $LANG.  The reason is that every time a
> multibyte filename has to be generated, it has to be
> converted from UTF-16 to multibyte.
> 
> For instance, taking one of the filenames from Lenik's
> example.  It's stored on the filesystem as the UTF-16
> sequence \u684c \u9762.  If I set LANG to en_US.UTF-8, a
> readdir(2) call returns the multibyte sequence
> 
>  0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
> 
> If I set LANG to en_US.GBK, `ls' returns the filename
> 
>  0xd7 0xc0 0xc3 0xe6
> 
> And in case LANG=C, `ls' returns
> 
>  0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
> 
> So, dependent on the character set setting in the
> application, the idea of the filename differs.  That's not
> exactly helpful for interoperability between different applications.
> 
> I can think of two potential solutions to fix this problem:
> 
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
> is the way files are stored on disk.  That results in unchangeable
> filenames which are always valid.
> 
> But what if an application sets LANG=".SJIS" and tries to create
> a file using SJIS character encoding?  Should the file be created
> using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
> That's not good.
> 
> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
> Cygwin uses the LC_CTYPE setting which corresponds to the current
> codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,

If nothing is set, use UTF-8, as it will work with existing code.

> Cygwin uses that to convert pathnames.  If the application uses
> setlocale, Cygwin uses that setting to convert pathnames.
> 
> One problem can't be solved this way:  If an application fetches
> and stores a filename, then switches the locale, and then tries
> to use the filename in another system call, the filename is
> potentially broken. 

This is the user's problem to resolve.

> 
> Any better ideas?
> 

Not necessarily better, but here is a chart:

Sys:    App:    function expects/returns
NULL:   NULL:   UTF-8
C/UA:   NULL:   UTF-8
NULL:   C/UA:   UTF-8
C/UA:   C/UA:   UTF-8
SPEC:   NULL:   System Locale
SPEC:   C/UA:   UTF-8
NULL:   SPEC:   Application Locale
C/UA:   SPEC:   Application Locale
SPEC:   SPEC:   Application Locale


Key:

Sys  = System's current locale
App  = Application's current locale
NULL = No setting
C/UA = C or any Unicode-aware locale
SPEC = Some other locale (e.g. SJIS)

-jason

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.




Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 13 15:54, Andy Koppe wrote:
> > - why do you need to touch the filename at all? I haven't read all of it. Is
> > the UTF-16 on disk and we need to work around UTF-16 being intractable as C
> > string?
> 
> Yes. If you simply treated each UTF-16 symbol as two chars, you'd get
> unintended NULs and slashes. For starters, the upper halves of all
> ISO-8859-1 characters are NUL in UTF-16. And even without that, the
> resulting filenames would be completely unusable.

Right.  That's the crux when using UTF-16 filenames but many different
multibyte codepages.  In contrast to a system in which the filename is
just a byte stream, we have to perform widechar to multibyte conversion
and outside of the UTF-8 domain, every other conversion is lossy.
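
A small illustration of that lossiness, assuming Python's built-in codecs as a stand-in for the widechar-to-multibyte conversion (the actual conversion in Cygwin is not shown here): the example filename survives a UTF-8 round trip but not a round trip through a charset that lacks its characters.

```python
name = "\u684c\u9762"  # the two-character filename used as the example in this thread

# UTF-8 can represent every Unicode character, so the round trip is exact.
utf8_round_trip = name.encode("utf-8").decode("utf-8")

# ISO-8859-1 has no mapping for these characters. With errors="replace"
# each one collapses to "?", and the original name cannot be recovered.
latin1_bytes = name.encode("iso-8859-1", errors="replace")
latin1_round_trip = latin1_bytes.decode("iso-8859-1")

print(utf8_round_trip == name)    # True
print(latin1_bytes)               # b'??'
print(latin1_round_trip == name)  # False
```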

For the time being, I applied a patch to Cygwin which should ease the
pain.

I followed the suggestion to use UTF-8 for internal conversions when the
locale is set to "C".  This will also be used as default conversion when
converting the Windows environment from UTF-16 to multibyte, unless the
environment contains a valid LC_ALL/LC_CTYPE/LANG setting.  The current
working directory was also potentially unusable, if an application
switched the locale.  Now the CWD is re-evaluated after a setlocale call.

I'm sure this change doesn't fix all problems, but this worked much better
in my environment when using Japanese and Chinese characters in filenames.

There are a few other changes to the Cygwin DLL in the pipeline, but I will
update Cygwin 1.7 at the end of the week.

Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Andy Koppe

> - why do you need to touch the filename at all? I haven't read all of it. Is
> the UTF-16 on disk and we need to work around UTF-16 being intractable as C
> string?

Yes. If you simply treated each UTF-16 symbol as two chars, you'd get
unintended NULs and slashes. For starters, the upper halves of all
ISO-8859-1 characters are NUL in UTF-16. And even without that, the
resulting filenames would be completely unusable.
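
Both effects are easy to see by looking at the raw bytes. A minimal sketch, assuming little-endian UTF-16 as used on Windows; the two sample characters are chosen for illustration only.

```python
latin1_char = "\u00e9"  # e with acute accent, an ISO-8859-1 character
slash_char = "\u012f"   # i with ogonek; its low byte is 0x2f, the ASCII slash

bytes_1 = latin1_char.encode("utf-16-le")  # b'\xe9\x00'
bytes_2 = slash_char.encode("utf-16-le")   # b'/\x01'

print(b"\x00" in bytes_1)  # True: an embedded NUL would end a C string early
print(b"/" in bytes_2)     # True: a stray path separator inside one character

# What a NUL-terminated C string would keep of the first character:
print(bytes_1.split(b"\x00")[0])  # b'\xe9'
```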

Andy


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Matthias Andree

On 13.05.2009 at 16:29, Corinna Vinschen wrote:

> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>>> I propose that the filename encoding in C locale uses UTF-8 instead
>>> of SO/UTF-8.
>>>
>>> There are three reasons:
>>
>> That's an interesting thought.  Do you have a patch and, if so, did you
>> try it?  Does it, for instance, help for the issue reported in the
>> thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
>
> After examining the issue Lenik reported in the above thread, I'm at
> a loss how to solve this problem in a generic way.
>
> The problem is that the filename changes dependent on the character
> set used in $LANG.  The reason is that every time a multibyte filename
> has to be generated, it has to be converted from UTF-16 to multibyte.
>
> For instance, taking one of the filenames from Lenik's example.  It's
> stored on the filesystem as the UTF-16 sequence \u684c \u9762.  If I set
> LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence
>
>  0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
>
> If I set LANG to en_US.GBK, `ls' returns the filename
>
>  0xd7 0xc0 0xc3 0xe6
>
> And in case LANG=C, `ls' returns
>
>  0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
>
> So, dependent on the character set setting in the application, the idea
> of the filename differs.  That's not exactly helpful for interoperability
> between different applications.
>
> I can think of two potential solutions to fix this problem:
>
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
>     is the way files are stored on disk.  That results in unchangeable
>     filenames which are always valid.
>
>     But what if an application sets LANG=".SJIS" and tries to create
>     a file using SJIS character encoding?  Should the file be created
>     using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
>     That's not good.

Why would it have to be interpreted at all? Aren't filenames just opaque
strings - with exceptions, say, for / and NUL - to UNIX kernels?

> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
>     Cygwin uses the LC_CTYPE setting which corresponds to the current
>     codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,
>     Cygwin uses that to convert pathnames.  If the application uses
>     setlocale, Cygwin uses that setting to convert pathnames.
>
>     One problem can't be solved this way:  If an application fetches
>     and stores a filename, then switches the locale, and then tries
>     to use the filename in another system call, the filename is
>     potentially broken.
>
> Any better ideas?

Just questions to kindle some brainstorming:

- why do you need to touch the filename at all? I haven't read all of it.
Is the UTF-16 on disk and we need to work around UTF-16 being intractable
as a C string?

- some applications in the GNOME ballpark, for instance Gnumeric, do
something like "treat as Unicode" and fall back to an encoding specified
by SOME_ENVIRONMENT_VARIABLE (perhaps as a colon-separated list - not
sure)

- adding to my interspersed comment above: isn't the issue more about
*presentation* of filenames to the user than internal workings? To me the
main issue appears to be that filenames should look alike in a Cygwin
application and in a native Windows application. I'd assume that
applications can get really confused if you change file names behind their
back.

- if you speak of UTF-8, do you want to normalize file names? (I'd think
you do.) Which normalization form will you choose? NFC (composed) or NFD
(decomposed)?
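
To make the normalization question concrete, here is a minimal sketch using Python's standard `unicodedata` module. The sample character is chosen for illustration; nothing here reflects what Cygwin actually does with file names.

```python
import unicodedata

composed = "\u00e9"     # e with acute accent as a single code point
decomposed = "e\u0301"  # plain e followed by a combining acute accent

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(nfc == composed)            # True
print(nfd == decomposed)          # True
print(nfc.encode("utf-8").hex())  # c3a9
print(nfd.encode("utf-8").hex())  # 65cc81
```

The two forms render identically but are different byte strings in UTF-8, so a file created under one form would not be found by a byte-wise lookup using the other.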

--
Matthias Andree


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Corinna Vinschen

On May 12 19:37, Corinna Vinschen wrote:
> On May 13 02:29, IWAMURO Motonori wrote:
> > I propose that the filename encoding in C locale uses UTF-8 instead of 
> > SO/UTF-8.
> > 
> > There are three reasons:
> 
> That's an interesting thought.  Do you have a patch and, if so, did you
> try it?  Does it, for instance, help for the issue reported in the
> thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?

After examining the issue Lenik reported in the above thread, I'm at 
a loss how to solve this problem in a generic way.

The problem is that the filename changes dependent on the character
set used in $LANG.  The reason is that every time a multibyte filename
has to be generated, it has to be converted from UTF-16 to multibyte.

For instance, taking one of the filenames from Lenik's example.  It's
stored on the filesystem as the UTF-16 sequence \u684c \u9762.  If I set
LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence

 0xe6 0xa1 0x8c 0xe9 0x9d 0xa2

If I set LANG to en_US.GBK, `ls' returns the filename

 0xd7 0xc0 0xc3 0xe6

And in case LANG=C, `ls' returns

 0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2

So, dependent on the character set setting in the application, the idea
of the filename differs.  That's not exactly helpful for interoperability
between different applications.
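
The three byte sequences above can be reproduced with a short sketch. Python's built-in codecs stand in for the locale conversions, and `so_utf8` is a hypothetical helper that models the LANG=C result purely from the bytes shown above (an ASCII SO byte, 0x0e, in front of each non-ASCII character's UTF-8 bytes); it is not the Cygwin implementation.

```python
def so_utf8(name):
    """Hypothetical model of the SO/UTF-8 form seen under LANG=C."""
    out = bytearray()
    for ch in name:
        if ord(ch) < 0x80:
            out.append(ord(ch))                  # plain ASCII passes through
        else:
            out += b"\x0e" + ch.encode("utf-8")  # SO prefix, then UTF-8 bytes
    return bytes(out)


def hexdump(raw):
    """Format bytes the way they are written in this thread."""
    return " ".join(f"0x{b:02x}" for b in raw)


name = "\u684c\u9762"  # the UTF-16 sequence stored on the filesystem

print(hexdump(name.encode("utf-8")))  # 0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
print(hexdump(name.encode("gbk")))    # 0xd7 0xc0 0xc3 0xe6
print(hexdump(so_utf8(name)))         # 0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
```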

I can think of two potential solutions to fix this problem:

(1) Always return filenames in UTF-8 encoding and pretend that UTF-8
    is the way files are stored on disk.  That results in unchangeable
    filenames which are always valid.

    But what if an application sets LANG=".SJIS" and tries to create
    a file using SJIS character encoding?  Should the file be created
    using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
    That's not good.

(2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
    Cygwin uses the LC_CTYPE setting which corresponds to the current
    codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,
    Cygwin uses that to convert pathnames.  If the application uses
    setlocale, Cygwin uses that setting to convert pathnames.

    One problem can't be solved this way:  If an application fetches
    and stores a filename, then switches the locale, and then tries
    to use the filename in another system call, the filename is
    potentially broken.
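
The locale-switch problem can be shown with the same example name. A minimal sketch, assuming Python codecs in place of the real conversion; `reinterpret` is a hypothetical helper that returns None where a conversion would fail with something like EILSEQ.

```python
def reinterpret(raw, charset):
    """Decode stored filename bytes under `charset`, or None if invalid."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None


name = "\u684c\u9762"

# The application fetched and stored the name while the charset was GBK.
stored = name.encode("gbk")

# Same charset later: the stored bytes still identify the file.
print(reinterpret(stored, "gbk") == name)  # True

# After a switch to UTF-8 the stored bytes are not even valid input.
print(reinterpret(stored, "utf-8"))        # None
```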

Any better ideas?

Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-13 Thread Lenik


> http://cygwin.com/ml/cygwin/2009-05/msg00245.html?

I found that this web page doesn't display UTF-8 characters correctly.

BTW, I'm using Thunderbird as my news reader.

Lenik



Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread Corinna Vinschen

On May 12 15:53, Mark J. Reed wrote:
> On Tue, May 12, 2009 at 3:22 PM, Corinna Vinschen wrote:
> > http://cygwin.com/1.7/cygwin-ug-net/using-specialnames.html#pathnames-unusual
> 
> OK, got it.  So Mr. Iwamuro's proposal is that Cygwin ignore the
> locale setting, and just automatically convert the Windows UTF-16
> filenames to UTF-8 (and back) no matter what.

No.  Only if LANG=C.

> That seems rife with possible confusion, though. If I have my codepage
> set to ISO-2022 and paste in a filename, I expect it to be interpreted

Cygwin 1.7 doesn't use the codepage.  It uses what $LANG says.  See
http://cygwin.com/1.7/cygwin-ug-net/setup-locale.html

> as ISO-2022, not as UTF-8 (which will probably fail with an invalid
> encoding sequence).
> 
> OTOH, the SO/UTF-8 hack would seem to bode ill for the portability of,
> say, tar archives created under Cygwin.

The filenames potentially look weird, but they are valid filenames.
If anybody has a better idea how to work around the problem of UTF-16
chars which don't translate into the current single-byte or multibyte
charset, feel free to suggest one.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread Mark J. Reed

On Tue, May 12, 2009 at 3:22 PM, Corinna Vinschen wrote:
> http://cygwin.com/1.7/cygwin-ug-net/using-specialnames.html#pathnames-unusual

OK, got it.  So Mr. Iwamuro's proposal is that Cygwin ignore the
locale setting, and just automatically convert the Windows UTF-16
filenames to UTF-8 (and back) no matter what.

That seems rife with possible confusion, though. If I have my codepage
set to ISO-2022 and paste in a filename, I expect it to be interpreted
as ISO-2022, not as UTF-8 (which will probably fail with an invalid
encoding sequence).

OTOH, the SO/UTF-8 hack would seem to bode ill for the portability of,
say, tar archives created under Cygwin.

-- 
Mark J. Reed 


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread Corinna Vinschen

On May 12 15:13, Mark J. Reed wrote:
> > On May 13 02:29, IWAMURO Motonori wrote:
> >> Hi.
> >>
> >> I propose that the filename encoding in C locale uses UTF-8 instead of 
> >> SO/UTF-8
> 
> What the heck is "SO/UTF-8"?

http://cygwin.com/1.7/cygwin-ug-net/using-specialnames.html#pathnames-unusual


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread Mark J. Reed

> On May 13 02:29, IWAMURO Motonori wrote:
>> Hi.
>>
>> I propose that the filename encoding in C locale uses UTF-8 instead of 
>> SO/UTF-8

What the heck is "SO/UTF-8"?

-- 
Mark J. Reed 


Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread Corinna Vinschen

On May 13 02:29, IWAMURO Motonori wrote:
> Hi.
> 
> I propose that the filename encoding in C locale uses UTF-8 instead of 
> SO/UTF-8.
> 
> There are three reasons:
> 
> 1. For interoperability between Cygwin and various UNIX-like
> systems (Linux, *BSD, Solaris, and so on).
>    UNIX-like systems treat the filename as an 8-bit byte array, and many
> applications on those systems send or receive filename information
> without locale information (mercurial, git, rsync, and so on).
> 
> 2. UTF-8 is the only encoding that can handle multiple languages.
> 
> 3. Today, the default encoding of modern UNIX-like systems is UTF-8.

That's an interesting thought.  Do you have a patch and, if so, did you
try it?  Does it, for instance, help for the issue reported in the
thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


[1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

2009-05-12 Thread IWAMURO Motonori

Hi.

I propose that the filename encoding in C locale uses UTF-8 instead of SO/UTF-8.

There are three reasons:

1. For interoperability between Cygwin and various UNIX-like
systems (Linux, *BSD, Solaris, and so on).
   UNIX-like systems treat the filename as an 8-bit byte array, and many
applications on those systems send or receive filename information
without locale information (mercurial, git, rsync, and so on).

2. UTF-8 is the only encoding that can handle multiple languages.

3. Today, the default encoding of modern UNIX-like systems is UTF-8.

Please examine it.

Thanks.
-- 
IWAMURO Motonori

