Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Larry Hall (Cygwin)

On 06/03/2009, Christopher Faylor wrote:

> If you really must know, I was waiting for Corinna to return from
> vacation before advertising the fix because I wanted her to look at it
> first before it was exposed to sound and fury of the cygwin mailing
> list.


Looks like you got your wish. ;-)

--
Larry Hall  http://www.rfk.com
RFK Partners, Inc.  (508) 893-9779 - RFK Office
216 Dalton Rd.  (508) 893-9889 - FAX
Holliston, MA 01746

_

A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting annoying in email?

--
Unsubscribe info:  http://cygwin.com/ml/#unsubscribe-simple
Problem reports:   http://cygwin.com/problems.html
Documentation: http://cygwin.com/docs.html
FAQ:   http://cygwin.com/faq/



Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Christopher Faylor
On Wed, Jun 03, 2009 at 01:19:45PM -0400, Edward Lam wrote:
>Christopher Faylor wrote:
>>As Corinna said above: "Chris implemented using the invalid code point
>>solution"
>>
>>That's what is in Cygwin's CVS and in the latest snapshot.
>
>I see, you silently committed a fix while this discussion was ongoing?
>
>http://www.cygwin.com/snapshots/winsup-changelog-20090530-20090531

Yes, and then Corinna not-so-silently mentioned it in email that you
quoted asking what was going to happen.

If you really must know, I was waiting for Corinna to return from
vacation before advertising the fix because I wanted her to look at it
before it was exposed to the sound and fury of the cygwin mailing
list.

cgf




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Alexey Borzenkov
On Wed, Jun 3, 2009 at 6:27 PM, Corinna Vinschen wrote:
> What's left as questionable is the LANG=C default case.  Due to the
> discussion from the last month we now use UTF-8 as default encoding,
> because it's the only encoding which covers all (valid) characters.
> Sure, we could also convert the command line using the current ANSI
> codepage as Windows does it when calling CreateProcessA in this case.
>
> Maybe we should do that for testing?  Anybody having a strong opinion
> here?

I am strongly against it because, as I showed earlier, this case:

  for filename in `ls`; do
    windowsprogram.exe $filename
  done

should work. Since filenames use cygwin's encoding (UTF-8 for the C
locale, or the one given by LANG), argument conversion should use it
too. It would be confusing otherwise.




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Edward Lam

Christopher Faylor wrote:

> As Corinna said above: "Chris implemented using the invalid code point
> solution"
>
> That's what is in Cygwin's CVS and in the latest snapshot.


I see, you silently committed a fix while this discussion was ongoing?

http://www.cygwin.com/snapshots/winsup-changelog-20090530-20090531

-Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Christopher Faylor
On Wed, Jun 03, 2009 at 12:55:57PM -0400, Edward Lam wrote:
>Corinna Vinschen wrote:
>> On Jun  3 12:02, Christopher Faylor wrote:
>>> On Wed, Jun 03, 2009 at 04:27:55PM +0200, Corinna Vinschen wrote:
>>>> On Jun  3 09:18, Edward Lam wrote:
>>>>> Corinna Vinschen wrote:
>>>>>> The question is, what do you expect?  [...]
>>>>> [...]
>>>>> Wikipedia has several suggestions on how to handle invalid UTF-8 byte
>>>>> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the
>>>>> rule that uses the replacement character.
>>>> Chris implemented using the invalid code point solution.  The discussion
>>>> in http://www.mail-archive.com/linux-u...@nl.linux.org/msg00080.html
>>>> supports this solution.  What's missing so far is the way back, from
>>>> an invalid single second half of a surrogate pair in the 0xDCxx range
>>>> back to the correct byte value.  I'm just looking into that.
>>> The way back was not, AFAIK, needed for Cygwin programs.  I don't think
>>> there is a valid way back for Windows programs.
>> 
>> The way back is not needed for the argv handling in Cygwin, but it
>> gets necessary if you converted to UTF-16 in other circumstances.
>> It's not much of a problem since the way back is a no-brainer, in
>> contrast to the conversion to UTF-16.
>
>What is the current state of affairs in cygwin 1.7.0-48? Is the invalid 
>code point solution currently being used when converting the command 
>line to UTF-16 when spawning non-cygwin processes? What I'm trying to 
>understand is where the command line truncation is taking place, in the 
>parent or child process.
>
>If the truncation is happening in the child process because of the 
>invalid code point, then perhaps we should consider using the 
>replacement character solution when spawning non-cygwin child processes. 
>IMHO, having a bad character is better than having a truncated command 
>line. At least, the problem (invalid UTF-8) then becomes more obvious.

As Corinna said above: "Chris implemented using the invalid code point
solution"

That's what is in Cygwin's CVS and in the latest snapshot.

cgf




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Edward Lam

IWAMURO Motonori wrote:
> And, I think that UTF-8 is best solution when the setting of LC_CTYPE
> category is C.
>
> 2009/6/4 IWAMURO Motonori:
>> I think that this problem is caused by missing setting the locale
>> environment variable.
>> Therefore, I think that the problem can be solved by compelling the
>> setting with setup.exe.


That's not a bad idea. I'm thinking of some option in the installer to 
provide legacy compatibility.


-Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Edward Lam

Corinna Vinschen wrote:
> On Jun  3 12:02, Christopher Faylor wrote:
>> On Wed, Jun 03, 2009 at 04:27:55PM +0200, Corinna Vinschen wrote:
>>> On Jun  3 09:18, Edward Lam wrote:
>>>> Corinna Vinschen wrote:
>>>>> The question is, what do you expect?  [...]
>>>> [...]
>>>> Wikipedia has several suggestions on how to handle invalid UTF-8 byte
>>>> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the
>>>> rule that uses the replacement character.
>>>
>>> Chris implemented using the invalid code point solution.  The discussion
>>> in http://www.mail-archive.com/linux-u...@nl.linux.org/msg00080.html
>>> supports this solution.  What's missing so far is the way back, from
>>> an invalid single second half of a surrogate pair in the 0xDCxx range
>>> back to the correct byte value.  I'm just looking into that.
>>
>> The way back was not, AFAIK, needed for Cygwin programs.  I don't think
>> there is a valid way back for Windows programs.
>
> The way back is not needed for the argv handling in Cygwin, but it
> gets necessary if you converted to UTF-16 in other circumstances.
> It's not much of a problem since the way back is a no-brainer, in
> contrast to the conversion to UTF-16.


What is the current state of affairs in cygwin 1.7.0-48? Is the invalid 
code point solution currently being used when converting the command 
line to UTF-16 when spawning non-cygwin processes? What I'm trying to 
understand is where the command line truncation is taking place, in the 
parent or child process.


If the truncation is happening in the child process because of the 
invalid code point, then perhaps we should consider using the 
replacement character solution when spawning non-cygwin child processes. 
IMHO, having a bad character is better than having a truncated command 
line. At least, the problem (invalid UTF-8) then becomes more obvious.


-Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread IWAMURO Motonori
And I think that UTF-8 is the best solution when the LC_CTYPE
category is set to C.

2009/6/4 IWAMURO Motonori:
> I think that this problem is caused by missing setting the locale
> environment variable.
> Therefore, I think that the problem can be solved by compelling the
> setting with setup.exe.
>
> 2009/6/4 Corinna Vinschen :
>>
>> http://cygwin.com/acronyms/#PCYMTNQREAIYR
>> http://cygwin.com/acronyms/#TOFU
>>
>> On Jun  4 00:03, IWAMURO Motonori wrote:
>>> 2009/6/3 Corinna Vinschen
>>> > What's left as questionable is the LANG=C default case.  Due to the
>>> > discussion from the last month we now use UTF-8 as default encoding,
>>> > because it's the only encoding which covers all (valid) characters.
>>> > Sure, we could also convert the command line using the current ANSI
>>> > codepage as Windows does it when calling CreateProcessA in this case.
>>> >
>>> > Maybe we should do that for testing?  Anybody having a strong opinion
>>> > here?
>>
>>> How about the addition of the setting of the locale environment
>>> variable (like LANG) to the Cygwin installer?
>>
>> I'm sorry, but I don't understand how that's connected to the behaviour
>> of the Cygwin DLL.  Setup.exe is an entirely different beast.
>>
>>
>> Corinna
>>
>> --
>> Corinna Vinschen                  Please, send mails regarding Cygwin to
>> Cygwin Project Co-Leader          cygwin AT cygwin DOT com
>> Red Hat
>>
>>
>>
>
>
>
> --
> IWAMURO Motnori 
>



-- 
IWAMURO Motonori




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread IWAMURO Motonori
I think that this problem is caused by the locale environment variable
not being set.
Therefore, I think that the problem can be solved by having setup.exe
enforce the setting.

2009/6/4 Corinna Vinschen:
>
> http://cygwin.com/acronyms/#PCYMTNQREAIYR
> http://cygwin.com/acronyms/#TOFU
>
> On Jun  4 00:03, IWAMURO Motonori wrote:
>> 2009/6/3 Corinna Vinschen
>> > What's left as questionable is the LANG=C default case.  Due to the
>> > discussion from the last month we now use UTF-8 as default encoding,
>> > because it's the only encoding which covers all (valid) characters.
>> > Sure, we could also convert the command line using the current ANSI
>> > codepage as Windows does it when calling CreateProcessA in this case.
>> >
>> > Maybe we should do that for testing?  Anybody having a strong opinion
>> > here?
>
>> How about the addition of the setting of the locale environment
>> variable (like LANG) to the Cygwin installer?
>
> I'm sorry, but I don't understand how that's connected to the behaviour
> of the Cygwin DLL.  Setup.exe is an entirely different beast.
>
>
> Corinna
>
> --
> Corinna Vinschen                  Please, send mails regarding Cygwin to
> Cygwin Project Co-Leader          cygwin AT cygwin DOT com
> Red Hat
>
>
>



-- 
IWAMURO Motonori




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Corinna Vinschen
On Jun  3 18:12, Corinna Vinschen wrote:
> On Jun  3 12:01, Edward Lam wrote:
> > Corinna Vinschen wrote:
> >> No.  I'm suggesting to convert the command line always using the default
> >> ANSI codepage, same as Windows when calling CreateProcessA.  This only
> >> affects non-Cygwin processes anyway since Cygwin uses another mechanism
> >> to send the command line arguments to the child process.
> >
> > Wouldn't that necessarily break non-Cygwin processes that are UTF-16 aware?
> 
> > How?  They get the command line in UTF-16 anyway.

Ok, I found a problem.  Assuming the argument is a valid filename in
UTF-8 encoding, as it's the default when using LANG=C.  If we try to
convert this string using the ANSI codepage, the native child process
will get a malformed filename as argument.  Looks like always using the
ANSI codepage is not exactly a good solution...
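
A minimal C sketch of that failure, assuming ISO-8859-1 as a stand-in
for the ANSI codepage (CP1252 agrees with it for bytes 0xA0..0xFF).
The function name is hypothetical and this is not Cygwin's conversion
code. It only shows why a UTF-8 filename read as single-byte text no
longer names the same file:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode bytes as ISO-8859-1: every byte becomes the UTF-16 unit with
 * the same numeric value. Writes at most `cap` units and returns the
 * number of units written. */
size_t latin1_to_utf16(const uint8_t *in, size_t n, uint16_t *out, size_t cap)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n && i < cap; i++)
        out[i] = in[i];
    return i;
}
```

Given the two bytes 0xC3 0xA9 (the UTF-8 form of U+00E9), the result is
the two units 0x00C3 0x00A9, so the child would see a two-character
name where the file on disk has a one-character name.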


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Corinna Vinschen
On Jun  3 12:01, Edward Lam wrote:
> Corinna Vinschen wrote:
>> No.  I'm suggesting to convert the command line always using the default
>> ANSI codepage, same as Windows when calling CreateProcessA.  This only
>> affects non-Cygwin processes anyway since Cygwin uses another mechanism
>> to send the command line arguments to the child process.
>
> Wouldn't that necessarily break non-Cygwin processes that are UTF-16 aware?

How?  They get the command line in UTF-16 anyway.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Corinna Vinschen
On Jun  3 12:02, Christopher Faylor wrote:
> On Wed, Jun 03, 2009 at 04:27:55PM +0200, Corinna Vinschen wrote:
> >On Jun  3 09:18, Edward Lam wrote:
> >> Corinna Vinschen wrote:
> >>> The question is, what do you expect?  [...]
> >> [...]
> >> Wikipedia has several suggestions on how to handle invalid UTF-8 byte  
> >> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the  
> >> rule that uses the replacement character.
> >
> >Chris implemented using the invalid code point solution.  The discussion
> >in http://www.mail-archive.com/linux-u...@nl.linux.org/msg00080.html
> >supports this solution.  What's missing so far is the way back, from
> >an invalid single second half of a surrogate pair in the 0xDCxx range
> >back to the correct byte value.  I'm just looking into that.
> 
> The way back was not, AFAIK, needed for Cygwin programs.  I don't think
> there is a valid way back for Windows programs.

The way back is not needed for the argv handling in Cygwin, but it
gets necessary if you converted to UTF-16 in other circumstances.
It's not much of a problem since the way back is a no-brainer, in
contrast to the conversion to UTF-16.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Christopher Faylor
On Wed, Jun 03, 2009 at 04:27:55PM +0200, Corinna Vinschen wrote:
>On Jun  3 09:18, Edward Lam wrote:
>> Corinna Vinschen wrote:
>>> The question is, what do you expect?  [...]
>> [...]
>> Wikipedia has several suggestions on how to handle invalid UTF-8 byte  
>> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the  
>> rule that uses the replacement character.
>
>Chris implemented using the invalid code point solution.  The discussion
>in http://www.mail-archive.com/linux-u...@nl.linux.org/msg00080.html
>supports this solution.  What's missing so far is the way back, from
>an invalid single second half of a surrogate pair in the 0xDCxx range
>back to the correct byte value.  I'm just looking into that.

The way back was not, AFAIK, needed for Cygwin programs.  I don't think
there is a valid way back for Windows programs.

cgf




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Edward Lam

Corinna Vinschen wrote:

> No.  I'm suggesting to convert the command line always using the default
> ANSI codepage, same as Windows when calling CreateProcessA.  This only
> affects non-Cygwin processes anyway since Cygwin uses another mechanism
> to send the command line arguments to the child process.


Wouldn't that necessarily break non-Cygwin processes that are UTF-16 aware?

-Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Corinna Vinschen

http://cygwin.com/acronyms/#PCYMTNQREAIYR
http://cygwin.com/acronyms/#TOFU

On Jun  4 00:03, IWAMURO Motonori wrote:
> 2009/6/3 Corinna Vinschen
> > What's left as questionable is the LANG=C default case.  Due to the
> > discussion from the last month we now use UTF-8 as default encoding,
> > because it's the only encoding which covers all (valid) characters.
> > Sure, we could also convert the command line using the current ANSI
> > codepage as Windows does it when calling CreateProcessA in this case.
> >
> > Maybe we should do that for testing?  Anybody having a strong opinion
> > here?

> How about the addition of the setting of the locale environment
> variable (like LANG) to the Cygwin installer?

I'm sorry, but I don't understand how that's connected to the behaviour
of the Cygwin DLL.  Setup.exe is an entirely different beast.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Corinna Vinschen
On Jun  3 10:43, Edward Lam wrote:
> Corinna Vinschen wrote:
>> What's left as questionable is the LANG=C default case.  Due to the
>> discussion from the last month we now use UTF-8 as default encoding,
>> because it's the only encoding which covers all (valid) characters.
>> Sure, we could also convert the command line using the current ANSI
>> codepage as Windows does it when calling CreateProcessA in this case.
>>
>> Maybe we should do that for testing?  Anybody having a strong opinion
>> here?
>
> I'm not clear on what you're proposing here. Are you suggesting that if  
> the conversion from LANG to UTF-16 fails, that we also try a second  
> attempt from the default system code page to UTF-16?

No.  I'm suggesting to convert the command line always using the default
ANSI codepage, same as Windows when calling CreateProcessA.  This only
affects non-Cygwin processes anyway since Cygwin uses another mechanism
to send the command line arguments to the child process.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread IWAMURO Motonori
Hi.

How about having the Cygwin installer set a locale environment
variable (such as LANG)?

2009/6/3 Corinna Vinschen:
> On Jun  3 09:18, Edward Lam wrote:
>> Corinna Vinschen wrote:
>>> The question is, what do you expect?  [...]
>> [...]
>> Wikipedia has several suggestions on how to handle invalid UTF-8 byte
>> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the
>> rule that uses the replacement character.
>
> Chris implemented using the invalid code point solution.  The discussion
> in http://www.mail-archive.com/linux-u...@nl.linux.org/msg00080.html
> supports this solution.  What's missing so far is the way back, from
> an invalid single second half of a surrogate pair in the 0xDCxx range
> back to the correct byte value.  I'm just looking into that.
>
>> > How is anybody supposed to know that the file which consists
>> > of the single byte 0xa9 has *any* meaning at all?  Why should it be
>> > the copyright sign, of all things?
>>
>> What I was attempting to do was to have NO conversion. In the
>> real case where I ran into this, the "bug.exe" was the one to properly
>> interpret what the byte 0xA9 meant from the command line. Yes, I know
>> there are several workarounds.
>
> The command line is always converted to UTF-16 when calling a native
> Win32 application.  If we don't do it (because we call CreateProcessA),
> Windows would do it.  As matters stand, we have to convert ourselves,
> because we must call CreateProcessW.  Either way, the problem persists.
> We just don't know what the correct conversion is for the given input.
> We have to rely on a correct setting of $LC_ALL/$LANG/$LC_CTYPE.
>
>>> If we default to the ANSI codepage, you will have the same problem,
>>> just upside down.  In both cases you will have even more problems if
>>> you start using characters not available in your default codepage.
>>
>> This is where I disagreed with Alexey. What we're really arguing here is
>> which default will run into the fewest problems for the most
>> common usage. This is subjective of course.
>
> Definitely.  The "right" solution is always only right for a given value
> of right.  What if the user has set LANG to, say, ja_JP.eucJP?  That
> user of course expects that the stuff on the command line is converted
> to UTF-16 using the eucJP encoding.  Everything else would just be very
> surprising.
>
> What's left as questionable is the LANG=C default case.  Due to the
> discussion from the last month we now use UTF-8 as default encoding,
> because it's the only encoding which covers all (valid) characters.
> Sure, we could also convert the command line using the current ANSI
> codepage as Windows does it when calling CreateProcessA in this case.
>
> Maybe we should do that for testing?  Anybody having a strong opinion
> here?
>
>
> Corinna
>
> --
> Corinna Vinschen                  Please, send mails regarding Cygwin to
> Cygwin Project Co-Leader          cygwin AT cygwin DOT com
> Red Hat
>
>
>



-- 
IWAMURO Motonori




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Edward Lam

Corinna Vinschen wrote:

> What's left as questionable is the LANG=C default case.  Due to the
> discussion from the last month we now use UTF-8 as default encoding,
> because it's the only encoding which covers all (valid) characters.
> Sure, we could also convert the command line using the current ANSI
> codepage as Windows does it when calling CreateProcessA in this case.
>
> Maybe we should do that for testing?  Anybody having a strong opinion
> here?


I'm not clear on what you're proposing here. Are you suggesting that if 
the conversion from LANG to UTF-16 fails, that we also try a second 
attempt from the default system code page to UTF-16?


-Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Corinna Vinschen
On Jun  3 09:18, Edward Lam wrote:
> Corinna Vinschen wrote:
>> The question is, what do you expect?  [...]
> [...]
> Wikipedia has several suggestions on how to handle invalid UTF-8 byte  
> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the  
> rule that uses the replacement character.

Chris implemented using the invalid code point solution.  The discussion
in http://www.mail-archive.com/linux-u...@nl.linux.org/msg00080.html
supports this solution.  What's missing so far is the way back, from
an invalid single second half of a surrogate pair in the 0xDCxx range
back to the correct byte value.  I'm just looking into that.
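
To make the two directions concrete, here is a minimal C sketch. It
assumes the invalid byte is kept in the low eight bits of a 0xDCxx
code unit, which is one reading of the 0xDCxx range mentioned above.
The function names are hypothetical and are not taken from Cygwin's
source:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Forward direction: map a byte that could not be decoded as UTF-8 to
 * a lone low surrogate in the range 0xDC80..0xDCFF. */
uint16_t escape_invalid_byte(uint8_t b)
{
    assert(b >= 0x80);   /* ASCII bytes are always valid, never escaped */
    return (uint16_t)(0xDC00u | b);
}

/* The way back: if `unit` is an escaped byte, store the original byte
 * in *out and return 1; otherwise return 0. The caller must already
 * know that `unit` is not the second half of a valid surrogate pair. */
int unescape_unit(uint16_t unit, uint8_t *out)
{
    assert(out != NULL);
    if (unit >= 0xDC80u && unit <= 0xDCFFu) {
        *out = (uint8_t)(unit & 0xFFu);
        return 1;
    }
    return 0;
}
```

With this mapping the byte 0xa9 becomes the code unit 0xDCA9 and comes
back unchanged, which is why the way back is the easy half.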

> > How is anybody supposed to know that the file which consists
> > of the single byte 0xa9 has *any* meaning at all?  Why should it be
> > the copyright sign, of all things?
>
> What I was attempting to do was to have NO conversion. In the
> real case where I ran into this, the "bug.exe" was the one to properly
> interpret what the byte 0xA9 meant from the command line. Yes, I know
> there are several workarounds.

The command line is always converted to UTF-16 when calling a native
Win32 application.  If we don't do it (because we call CreateProcessA),
Windows would do it.  As matters stand, we have to convert ourselves,
because we must call CreateProcessW.  Either way, the problem persists.
We just don't know what the correct conversion is for the given input.
We have to rely on a correct setting of $LC_ALL/$LANG/$LC_CTYPE.

>> If we default to the ANSI codepage, you will have the same problem,
>> just upside down.  In both cases you will have even more problems if
>> you start using characters not available in your default codepage.
>
> This is where I disagreed with Alexey. What we're really arguing here is
> which default will run into the fewest problems for the most
> common usage. This is subjective of course.

Definitely.  The "right" solution is always only right for a given value
of right.  What if the user has set LANG to, say, ja_JP.eucJP?  That
user of course expects that the stuff on the command line is converted
to UTF-16 using the eucJP encoding.  Everything else would just be very
surprising.
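
As a rough illustration of relying on the locale settings, the C
sketch below picks the charset from a locale string such as
ja_JP.eucJP. It assumes the usual POSIX precedence (LC_ALL, then
LC_CTYPE, then LANG) and the UTF-8 fallback discussed in this thread.
Both function names are hypothetical, and this is not how the Cygwin
DLL is actually written:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the first of LC_ALL, LC_CTYPE, LANG that is set and
 * non-empty, or NULL if none is. */
const char *effective_ctype_locale(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);
        if (v != NULL && v[0] != '\0')
            return v;
    }
    return NULL;
}

/* Copy the charset part of `locale` into buf ("eucJP" for
 * "ja_JP.eucJP"). Falls back to "UTF-8" when the locale is NULL or
 * names no charset, as in the LANG=C case. */
void charset_from_locale(const char *locale, char *buf, size_t cap)
{
    const char *dot = locale ? strchr(locale, '.') : NULL;
    const char *cs = (dot && dot[1] != '\0' && dot[1] != '@') ? dot + 1 : "UTF-8";
    size_t len = strcspn(cs, "@");   /* drop a modifier such as "@euro" */

    assert(buf != NULL && cap > 0);
    if (len >= cap)
        len = cap - 1;               /* truncate rather than overflow */
    memcpy(buf, cs, len);
    buf[len] = '\0';
}
```

A caller would combine the two, for example
charset_from_locale(effective_ctype_locale(), buf, sizeof buf), and
then convert the command line from that charset to UTF-16.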

What's left as questionable is the LANG=C default case.  Due to the
discussion from the last month we now use UTF-8 as default encoding,
because it's the only encoding which covers all (valid) characters.
Sure, we could also convert the command line using the current ANSI
codepage as Windows does it when calling CreateProcessA in this case.

Maybe we should do that for testing?  Anybody having a strong opinion
here?


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-03 Thread Edward Lam

Corinna Vinschen wrote:
> On May 29 17:21, Edward Lam wrote:
>> I think the problem I'm running into is:
>> - I give cygwin 1.7's bash a string that is in my system default code page.
>> - cygwin 1.7 thinks the string is actually UTF-8 and tries to convert it
>> as UTF-8 into UTF-16, resulting in a truncated command line that is
>> passed to child process.
>
> The question is, what do you expect?  I know, you expect that it
> "just works", but that's not as easy as you might assume,
> unfortunately.


Yes, Alexey and I had a lengthy argument on this thread already.
Disagreements on the default LANG behaviour notwithstanding, I think
that it still should NOT truncate, substituting the invalid character
with something else instead.

Here's a quote from Alexey previously on this thread:

"In my opinion: truncation is a bug (should use replacement character,
or fail exec altogether), expecting utf-8 is not"

Wikipedia has several suggestions on how to handle invalid UTF-8 byte 
sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the 
rule that uses the replacement character.
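
For comparison with truncation, here is a minimal C sketch of the
replacement-character rule: every byte that cannot be decoded becomes
U+FFFD and conversion carries on. The function names are hypothetical
and this is only an illustration of the rule, not the code in Cygwin:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu

/* Decode one UTF-8 sequence at s (n >= 1 bytes available). Stores the
 * code point in *cp and returns the number of bytes consumed. Any
 * malformed sequence yields U+FFFD and consumes exactly one byte. */
static size_t decode_one(const uint8_t *s, size_t n, uint32_t *cp)
{
    uint8_t b = s[0];
    size_t len;
    uint32_t min;

    if (b < 0x80) { *cp = b; return 1; }
    else if ((b & 0xE0) == 0xC0) { len = 2; min = 0x80;    *cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; min = 0x800;   *cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; min = 0x10000; *cp = b & 0x07; }
    else { *cp = REPLACEMENT; return 1; }  /* stray continuation byte, e.g. 0xA9 */

    if (n < len) { *cp = REPLACEMENT; return 1; }
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = REPLACEMENT; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    /* reject overlong forms, surrogates, and values beyond U+10FFFF */
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF)) {
        *cp = REPLACEMENT;
        return 1;
    }
    return len;
}

/* Convert n bytes of UTF-8 to UTF-16. Returns the number of units
 * written; stops early only when the output buffer is full. */
size_t utf8_to_utf16_replace(const uint8_t *in, size_t n, uint16_t *out, size_t cap)
{
    size_t i = 0, o = 0;
    assert(in != NULL && out != NULL);
    while (i < n) {
        uint32_t cp;
        i += decode_one(in + i, n - i, &cp);
        if (cp >= 0x10000) {               /* needs a surrogate pair */
            if (o + 2 > cap) break;
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (o + 1 > cap) break;
            out[o++] = (uint16_t)cp;
        }
    }
    return o;
}
```

For the bytes of "before ", then 0xA9, then " after", this produces
14 code units with U+FFFD in the eighth position, so the text after
the bad byte still reaches the child.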



> You get the idea.  The character 0xa9 has no meaning in itself.  It
> only has a meaning when you consider the character set or codepage in
> which you use this character.

...
> How is anybody supposed to know that the file which consists
> of the single byte 0xa9 has *any* meaning at all?  Why should it be
> the copyright sign, of all things?

What I was attempting to do was to have NO conversion. In the
real case where I ran into this, the "bug.exe" was the one to properly
interpret what the byte 0xA9 meant from the command line. Yes, I know
there are several workarounds.


> If we default to the ANSI codepage, you will have the same problem,
> just upside down.  In both cases you will have even more problems if
> you start using characters not available in your default codepage.


This is where I disagreed with Alexey. What we're really arguing here is
which default will run into the fewest problems for the most
common usage. This is subjective of course.


-Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-06-02 Thread Corinna Vinschen
On May 29 17:21, Edward Lam wrote:
>
> Alexey Borzenkov wrote:
> > No, the bug is not that it gets wrong number of arguments. In fact,
> > Windows has no concept of arguments, only C runtime does, which parses
> > the command line. If command line is truncated, then C runtime will
> > have missing arguments when it tries to parse it.
>
> Sorry, I had meant to comment on this previously but hit send too soon.
>
> I think the problem I'm running into is:
> - I give cygwin 1.7's bash a string that is in my system default code page.
> - cygwin 1.7 thinks the string is actually UTF-8 and tries to convert it  
> as UTF-8 into UTF-16, resulting in a truncated command line that is  
> passed to child process.
>
> Here's some more investigation:
>
> $ cat bug.c
> #include <stdio.h>
> #include <wchar.h>
>
> int wmain(int argc, wchar_t *argv[], wchar_t *envp[])
> {
>     int i;
>     for (i = 0; i < argc; i++)
>         wprintf(L"%d: %ls\n", i, argv[i]);
>     return 0;
> }
>
> ... and compiled using MSVC 
>
> $ ./bug arg1 "before `cat copyright.txt` after" arg3
> 0: E:\cygwin1.7\tmp\bug.exe
> 1: arg1
> 2: before
>
> So note that even with what seems to be a UNICODE-AWARE child process,
> I'm still getting a truncated command line. In fact, calling
> GetCommandLineW() directly seems to give a truncated command line
> as well.

The question is, what do you expect?  I know, you expect that it "just
works", but that's not as easy as you might assume, unfortunately.

Let's assume you're doing all this in a Windows console.  The character
we're talking about is a singlebyte or multibyte character with the
value 0xa9.  What exactly is this character 0xa9?

- It's the "Copyright" sign in Windows codepage 1252, the default GUI
  (ANSI) codepage for many western languages and, incidentally, in
  ISO-8859-1 and ISO-8859-15.  The Unicode value of this character is
  0xa9.

- It's the "reverse not sign" in Windows codepage 437, the default
  console (OEM) codepage on US systems.  The Unicode value is 0x2310.

- It's the "Registered trademark" sign in Windows codepage 850, the
  default OEM codepage in a couple of Western European languages
  (French, German, Italian, ...).  The Unicode value is 0xae.

- It's the Cyrillic capital letter IE in Windows codepage 855, an
  OEM codepage used for some languages written in Cyrillic.  The
  Unicode value is 0x0415.

You get the idea.  The character 0xa9 has no meaning in itself.  It only
has a meaning when you consider the character set or codepage in which
you use this character.
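The list above can be checked mechanically. The following is a small sketch, not part of the original mail, that decodes the single byte 0xa9 under each charset mentioned, using the codec tables shipped with Python. The function name `meanings` is made up for this illustration.

```python
import unicodedata

# Python's names for the code pages and charsets mentioned above.
CODE_PAGES = ["cp1252", "latin-1", "iso8859-15", "cp437", "cp850", "cp855"]

def meanings(raw, code_pages=CODE_PAGES):
    """Map each code page to (code point, Unicode name) for a one-byte input."""
    result = {}
    for name in code_pages:
        char = raw.decode(name)
        result[name] = (f"U+{ord(char):04X}", unicodedata.name(char))
    return result

for name, (code_point, label) in meanings(b"\xa9").items():
    print(f"{name:>10}: {code_point} {label}")
```

The byte never changes; only the table used to interpret it does, which is the point being made here.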

When converting this character to UTF-16, the converting function has to
know the charset in which the character has been given.  The problem is,
how is Cygwin supposed to know in which codepage or charset the
character has been created?  In your case it's even more weird.  How is
anybody supposed to know that the file which consists of the single byte
0xa9 has *any* meaning at all?  Why should it be the copyright sign, of
all things?

Cygwin now defaults to UTF-8.  In UTF-8 the byte 0xa9 on its own is not
a valid character; it can only occur as a continuation byte of a
multibyte sequence.  The function which converts the command line
therefore fails on it.  Whether this is good or bad is another problem,
but the fact is, Cygwin doesn't know what to do with this value in the
first place.  It doesn't know anything about the
charset used to generate the character with the value 0xa9.  So, even if
you take Cygwin out of the picture, if you create a console application
which writes the multibyte character with value 0xa9 to the console, it
will in all likelihood not be the copyright sign.  If you're printing on
a US system, the default console codepage is 437 and you get the reverse
not sign.  If you call `chcp 1252' and print again, you get the
copyright sign.
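To make the UTF-8 point concrete, here is a minimal sketch using Python's strict UTF-8 decoder as a stand-in for the conversion function. It does not reproduce Cygwin's code; `try_utf8` is a hypothetical helper.

```python
def try_utf8(raw):
    """Decode raw as strict UTF-8; return the text, or None if it is invalid."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

lone_byte = b"\xa9"                    # the content of copyright.txt in this thread
as_utf8 = "\u00a9".encode("utf-8")     # the copyright sign encoded as UTF-8

print(try_utf8(lone_byte) is None)     # the lone byte is rejected
print(as_utf8.hex())                   # two bytes: c2 a9
print(try_utf8(as_utf8) == "\u00a9")   # the two-byte form decodes fine
```

In UTF-8 the copyright sign is the two-byte sequence 0xc2 0xa9, so a file holding only 0xa9 cannot be read as UTF-8 text.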

The bottom line is, whatever default we use, we're screwed in some way,
because it will cause inconvenience for one part of the users and help
the others.  That was already the case for the old
CYGWIN=codepage:{oem|ansi} environment variable setting.

If we default to the OEM charset, you will not get the expected result
for characters created using the ANSI codepage and get problems
interacting with applications using the ANSI codepage.

If we default to the ANSI codepage, you will have the same problem, just
upside down.  In both cases you will have even more problems if you
start using characters not available in your default codepage.

If we default to UTF-8, we have no problem in Cygwin to work with any
Unicode character, but you will have to take some care when interacting
with Windows applications when using non-ASCII characters.  In your case,
in which only you know that 0xa9 is meant to be the copyright char, you
should tell Cygwin which charset you want to use.  Try setting LANG to
en_US.CP1252.  Your example should work then.
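As a rough illustration of why setting LANG helps, the sketch below picks a charset from a LANG-style value and uses it to decode an argument. The parsing is deliberately simplified and is an assumption for this example, not how Cygwin actually evaluates the locale; `charset_from_lang` and `decode_argument` are hypothetical names.

```python
def charset_from_lang(lang, default="utf-8"):
    """Return the charset part of a LANG-style value such as 'en_US.CP1252'.

    Values without a charset (unset, 'C') fall back to the default, which
    mirrors the UTF-8 default described in this thread.
    """
    if not lang or "." not in lang:
        return default
    charset = lang.split(".", 1)[1]
    charset = charset.split("@", 1)[0]   # drop a modifier such as '@euro'
    return charset or default

def decode_argument(raw, lang):
    """Decode argument bytes with the charset implied by lang."""
    return raw.decode(charset_from_lang(lang))

print(charset_from_lang("en_US.CP1252"))
print(charset_from_lang("C"))
print(decode_argument(b"before \xa9 after", "en_US.CP1252") == "before \u00a9 after")
```

With the charset stated explicitly, the byte 0xa9 has a defined meaning and the conversion no longer has to guess.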


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-30 Thread Charles Wilson
Edward Lam wrote:
>>> Ok, so where's the bug tracker so I can log a bug?
>> Isn't this mailing list serving as bug tracker? I just hope that
>> whoever can fix this is reading our emails and will come up with the
>> right solution.
> 
> Given the lack of developer acknowledgment (or refutation), I'm not
> getting my hopes up.

Chill. The developer responsible for the Unicode support has been on
holiday for two weeks. She'll be back next week.

--
Chuck




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-30 Thread Christopher Faylor
On Sat, May 30, 2009 at 04:23:22PM -0400, Edward Lam wrote:
>>> Ok, so where's the bug tracker so I can log a bug?
>>
>> Isn't this mailing list serving as bug tracker? I just hope that
>> whoever can fix this is reading our emails and will come up with the
>> right solution.
>
>Given the lack of developer acknowledgment (or refutation), I'm not
>getting my hopes up.

My eyes glazed over ten messages or so into this interminable thread.

As astonishing as it might seem for an open source project, you could
provide a fix yourself.

cgf




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-30 Thread Edward Lam
>> Ok, so where's the bug tracker so I can log a bug?
>
> Isn't this mailing list serving as bug tracker? I just hope that
> whoever can fix this is reading our emails and will come up with the
> right solution.

Given the lack of developer acknowledgment (or refutation), I'm not
getting my hopes up.

-Edward





[Fwd: Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line]

2009-05-30 Thread Edward Lam
Repost for mailing list.

On Sat, May 30, 2009 at 6:03 PM, Edward Lam  wrote:
>> Here, when I use russian Windows and I don't have LANG set (or when I
>> have LANG=en_US.UTF-8), filename will be utf-8 multibyte string. So
>> both, russian and european/chinese/japanese filenames will be valid.
>> Now there are three possibilities:
> How does the filename get to be a utf-8 multibyte string if you created
> the filename from an ANSI application? Since it sounds like Russian
> Windows uses a code page different from UTF-8.

When you create a file from an ANSI application, Windows converts the
filename to Unicode using your system code page. Cygwin 1.7 uses
Unicode. Cygwin converts filenames to multibyte when it passes them to
Cygwin applications, and converts to Unicode when it accepts data
from Cygwin applications. When LANG is not set, the multibyte encoding is
currently UTF-8 (it could be anything arbitrary, I'm just glad that it's
UTF-8 because it converts data back and forth without losing characters
and there are no problems with SO-UTF8). So Cygwin applications work with
UTF-8 filenames, the console is UTF-8, and Cygwin communicates with
Windows via Unicode. But the multibyte encoding can be overridden via
LC_ALL/LANG.
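The chain of conversions described here can be sketched as follows. This is only a model under stated assumptions: cp1251 stands in for the system code page of a Russian Windows, UTF-8 for the Cygwin multibyte encoding with LANG unset, and all function names are invented for the example.

```python
SYSTEM_CODE_PAGE = "cp1251"   # assumption: ANSI code page of a Russian Windows
CYGWIN_CHARSET = "utf-8"      # assumption: LANG unset, as described in the thread

def created_by_ansi_app(raw_name):
    """ANSI bytes -> Unicode string, using the system code page."""
    return raw_name.decode(SYSTEM_CODE_PAGE)

def seen_by_cygwin_app(name):
    """Unicode string -> multibyte bytes handed to a Cygwin program."""
    return name.encode(CYGWIN_CHARSET)

def returned_by_cygwin_app(raw_name):
    """Multibyte bytes from a Cygwin program -> Unicode string."""
    return raw_name.decode(CYGWIN_CHARSET)

ansi_bytes = b"\xf4\xe0\xe9\xeb.txt"   # a four-letter Cyrillic name in cp1251
stored = created_by_ansi_app(ansi_bytes)
multibyte = seen_by_cygwin_app(stored)

print(ansi_bytes.hex())
print(multibyte.hex())
print(returned_by_cygwin_app(multibyte) == stored)
```

The bytes the ANSI program wrote and the bytes the Cygwin program sees differ, but both name the same Unicode string, so nothing is lost on the way.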

When you are executing Windows applications, it's natural that you
pass either filenames or some text. Since the Cygwin multibyte encoding
is UTF-8 when LANG is not set, it's only natural to use UTF-8 to convert
arguments to Unicode when executing Windows applications. After all,
if you have a UTF-8 filename with Japanese characters, it's only
natural that "cmd.exe /c del /y $filename" and "cmd.exe /c echo
$sometext" will succeed for any text that uses the current Cygwin
encoding.

Think of it like this: since the file is being read by Cygwin, the
copyright.txt in your first email had the wrong encoding. So you need to
either use iconv to convert it (I hope that `iconv -c -f cp1251 ...`
will do the right thing without specifying a target encoding here), or
set LANG to match what you are working with right now.
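The conversion option amounts to re-encoding the file's bytes. A minimal sketch of that step, assuming the source charset is known; `recode` is a hypothetical helper that mimics what `iconv -f SOURCE -t TARGET` does for valid input:

```python
def recode(data, source, target="utf-8"):
    """Re-encode bytes from one charset to another."""
    return data.decode(source).encode(target)

original = b"\xa9"   # the content of copyright.txt in this thread

print(recode(original, "cp1252").hex())
print(recode(original, "iso-8859-1").hex())
print(recode(original, "cp1251").hex())
```

All three source charsets happen to place the copyright sign at 0xa9, so each call yields the same UTF-8 bytes, c2 a9. For other byte values the choice of source charset would matter.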

And if you are using English Windows with English regional settings,
then your LANG should be en_US.CP1252, not en_US.ISO-8859-1 (CP1252 is
what your Windows applications are using!).
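For readers wondering how much CP1252 and ISO-8859-1 actually differ, the sketch below compares the two tables byte by byte using Python's codecs. The helper name `differing_bytes` is made up for this note.

```python
def differing_bytes(a="cp1252", b="latin-1"):
    """Return the byte values that the two single-byte charsets decode differently.

    Bytes undefined in a charset decode to U+FFFD here, so they also count
    as differences.
    """
    diffs = []
    for value in range(256):
        raw = bytes([value])
        if raw.decode(a, errors="replace") != raw.decode(b, errors="replace"):
            diffs.append(value)
    return diffs

diffs = differing_bytes()
print(f"{len(diffs)} bytes differ, from 0x{diffs[0]:02x} to 0x{diffs[-1]:02x}")
print(0xA9 in diffs)
```

The two charsets disagree only in the range 0x80 to 0x9f, where CP1252 has printable characters such as the euro sign and ISO-8859-1 has control codes. The byte 0xa9 is outside that range, which is consistent with en_US.ISO-8859-1 also working for the copyright sign in this thread.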

I really don't know how to better explain all this, since in my head
it's so clear and obvious. :-/

> Ok, so where's the bug tracker so I can log a bug?

Isn't this mailing list serving as bug tracker? I just hope that
whoever can fix this is reading our emails and will come up with the
right solution.






Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-30 Thread Edward Lam
I'm reposting since I didn't mean to send this privately.

On Fri, May 29, 2009 17:22, Alexey Borzenkov wrote:
> Here, when I use russian Windows and I don't have LANG set (or when I
> have LANG=en_US.UTF-8), filename will be utf-8 multibyte string. So
> both, russian and european/chinese/japanese filenames will be valid.
> Now there are three possibilities:

How does the filename get to be a utf-8 multibyte string if you created
the filename from an ANSI application? Since it sounds like Russian
Windows uses a code page different from UTF-8.

> And again, you must have misunderstood me. In my opinion: truncation
> is a bug (should use replacement character, or fail exec altogether),
> expecting utf-8 is not (if you tried to cat your copyright.txt on a
> Linux box that uses utf-8, what would you expect to see on the
> screen?)

Ok, so where's the bug tracker so I can log a bug?

-Edward







Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread Alexey Borzenkov
On Sat, May 30, 2009 at 1:21 AM, Edward Lam  wrote:
> Here's some more investigation:
[...]
> So note that even when it seems to be a UNICODE-AWARE child process, I'm
> still getting a truncated command line. In fact, calling GetCommandLineW()
> directly seems to give a truncated command line
> as well.

And again, you must have misunderstood me. In my opinion: truncation
is a bug (should use replacement character, or fail exec altogether),
expecting utf-8 is not (if you tried to cat your copyright.txt on a
Linux box that uses utf-8, what would you expect to see on the
screen?)




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread Alexey Borzenkov
On Sat, May 30, 2009 at 1:04 AM, Edward Lam  wrote:
> Alexey Borzenkov wrote:
>> It might be safe for you, but not for other people. If you have a
>> Russian default codepage and ever need to work with Chinese/Japanese
>> filenames and cygwin uses default codepage for filesystem operations
>> (as in 1.5 right now), then you are really screwed. In my opinion
>> utf-8 is a silver bullet here, and I'm very glad it went that way.
> I must be missing something here. Suppose you have a default Russian code
> page, with LANG unset (ie. cygwin 1.7 uses UTF-8). Now, if you're using any
> non-Unicode, non-CodePage aware, native application to create a Russian
> filename, isn't Windows going to convert the filename from the Russian code
> page into UTF-16 for storage in NTFS? If that is the case, and then you do
> an ls from cygwin 1.7, aren't you going to get the wrong filename displayed?
> ie. interoperability with non-Unicode, non-CodePage aware native
> applications will be broken for you too with the current default cygwin 1.7
> behaviour.
>
> Or is this, not a case that you care about and you *only* use cygwin
> applications?

No, it is precisely that I care about both ends of interoperability.
Here is a hypothetical situation:

for filename in `ls`; do
  someprogram $filename
done

Here, when I use Russian Windows and I don't have LANG set (or when I
have LANG=en_US.UTF-8), filename will be a UTF-8 multibyte string. So
Russian as well as European/Chinese/Japanese filenames will all be valid.
Now there are three possibilities:

1) someprogram is a Cygwin application; then $filename must be
passed as is, without any conversions
2) someprogram is a Unicode application; then it will have a correct
Unicode argument
3) someprogram is an ANSI application; then Windows (Cygwin has
nothing to do with it) will convert its Unicode arguments to the system
codepage (cp1251 for Russian), and any character that can't be encoded
will be replaced with a question mark. This is solely someprogram's
fault and Cygwin has nothing to do with it.
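The third case can be approximated in a few lines. This sketch uses Python's `errors="replace"` mode, which substitutes a question mark for characters the target charset lacks. The real Windows conversion may behave differently in detail (for example with best-fit mappings), so treat this as an assumption-laden model; `as_seen_by_ansi_app` is a hypothetical name.

```python
def as_seen_by_ansi_app(argument, code_page="cp1251"):
    """Approximate the Unicode -> ANSI step: unencodable characters become '?'."""
    return argument.encode(code_page, errors="replace")

russian = "\u0444\u0430\u0439\u043b.txt"   # Cyrillic name, representable in cp1251
japanese = "\u65e5\u672c.txt"              # Japanese name, not representable

print(as_seen_by_ansi_app(russian).hex())
print(as_seen_by_ansi_app(japanese))
```

The Russian name survives as cp1251 bytes, while the Japanese name degrades to question marks, regardless of how correctly the argument was passed in Unicode.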

All I'm trying to say is that on Windows (since WinNT) arguments are
always in Unicode. It just so happens that when an ANSI application calls
another ANSI application with a sequence of bytes, the sequence first gets
converted to Unicode, then back to ANSI, and you get the same sequence
of bytes. But the arguments are always characters, not bytes.
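The round trip between two ANSI programs can be modelled like this. It is a sketch that assumes both sides use the same code page (cp1251 here) and that UTF-16LE is the wide form handed over in between; the function name is invented.

```python
def pass_between_ansi_apps(raw, code_page="cp1251"):
    """Model caller bytes -> UTF-16 -> callee bytes, with one shared code page."""
    wide = raw.decode(code_page).encode("utf-16-le")    # what is handed over
    return wide.decode("utf-16-le").encode(code_page)   # what the callee reads

sent = b"before \xa9 after"
received = pass_between_ansi_apps(sent)

print(received == sent)
print(len(sent.decode("cp1251").encode("utf-16-le")))
```

The bytes come out unchanged only because both conversions use the same table. In between, the argument exists as 28 bytes of UTF-16 (14 characters), not as the original 14 bytes.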




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread Edward Lam


Alexey Borzenkov wrote:
> No, the bug is not that it gets wrong number of arguments. In fact,
> Windows has no concept of arguments, only C runtime does, which parses
> the command line. If command line is truncated, then C runtime will
> have missing arguments when it tries to parse it.

Sorry, I had meant to comment on this previously but hit send too soon.

I think the problem I'm running into is:
- I give cygwin 1.7's bash a string that is in my system default code page.
- cygwin 1.7 thinks the string is actually UTF-8 and tries to convert it 
as UTF-8 into UTF-16, resulting in a truncated command line that is 
passed to child process.


Here's some more investigation:

$ cat bug.c
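#include <stdio.h>
#include <wchar.h>

int wmain(int argc, wchar_t *argv[], wchar_t *envp[])
{
    int i;
    /* Print each argument the C runtime parsed from the command line. */
    for (i = 0; i < argc; i++)
        wprintf(L"%d: %ls\n", i, argv[i]);
    return 0;
}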

... and compiled using MSVC 

$ ./bug arg1 "before `cat copyright.txt` after" arg3
0: E:\cygwin1.7\tmp\bug.exe
1: arg1
2: before

So note that even when it seems to be a UNICODE-AWARE child process,
I'm still getting a truncated command line. In fact, calling
GetCommandLineW() directly seems to give a truncated command line
as well.

Regards,
-Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread Edward Lam

Alexey Borzenkov wrote:
> It might be safe for you, but not for other people. If you have a
> Russian default codepage and ever need to work with Chinese/Japanese
> filenames and cygwin uses default codepage for filesystem operations
> (as in 1.5 right now), then you are really screwed. In my opinion
> utf-8 is a silver bullet here, and I'm very glad it went that way.

I must be missing something here. Suppose you have a default Russian 
code page, with LANG unset (ie. cygwin 1.7 uses UTF-8). Now, if you're 
using any non-Unicode, non-CodePage aware, native application to create 
a Russian filename, isn't Windows going to convert the filename from the 
Russian code page into UTF-16 for storage in NTFS? If that is the case, 
and then you do an ls from cygwin 1.7, aren't you going to get the wrong 
filename displayed? ie. interoperability with non-Unicode, non-CodePage 
aware native applications will be broken for you too with the current 
default cygwin 1.7 behaviour.


Or is this not a case that you care about, and you *only* use cygwin
applications?


Regards,
-Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread Alexey Borzenkov
On Sat, May 30, 2009 at 12:10 AM, Edward Lam  wrote:
> Thanks for explaining the UTF8 changes in cygwin 1.7. However, the decision
> to use UTF-8 for the C locale is questionable.

Not at all, because UTF-8, as far as I understand, is used for
communication with the system in this context, and does not force
anything on the application. Most modern Unixes use UTF-8 nowadays, which
means that even if you have a C locale, your terminal outputs text in
UTF-8, your input is UTF-8, your filenames are UTF-8 (well, not
really, but the rest of the system sees them that way). Same thing
here, except that launching non-Cygwin processes is communication with
the system as well, and it needs conversion. And where there is
conversion, there is always possible loss of data. One way or the other.

> It seems to me that it would be much safer to use the SYSTEM DEFAULT code
> page (ie. the return value of the system GetACP() function) for CYGWIN
> instead, ensuring compatibility for the large class native Windows
> applications that are non-Unicode, non-CodePage aware.

It might be safe for you, but not for other people. If you have a
Russian default codepage and ever need to work with Chinese/Japanese
filenames, and Cygwin uses the default codepage for filesystem operations
(as in 1.5 right now), then you are really screwed. In my opinion
UTF-8 is a silver bullet here, and I'm very glad it went that way.

> I think it's very bad that changing LANG can result in a truncated *command
> line*, that has nothing to do with printf. The printf in the code was just
> for testing. The HUGE bug is that the application gets the  WRONG NUMBER OF
> ARGUMENTS.

No, the bug is not that it gets the wrong number of arguments. In fact,
Windows has no concept of arguments; only the C runtime does, which parses
the command line. If the command line is truncated, then the C runtime
will have missing arguments when it tries to parse it.
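A toy splitter shows how a truncated command line turns into a wrong argument count. This is a deliberately simplified stand-in for the C runtime's parser: it honours only spaces, tabs and double quotes and ignores the real backslash rules, so it is an illustration rather than a reimplementation.

```python
def split_command_line(command_line):
    """Split on whitespace, honouring double quotes only (no backslash rules)."""
    args, current = [], []
    in_quotes = started = False
    for ch in command_line:
        if ch == '"':
            in_quotes = not in_quotes
            started = True
        elif ch in " \t" and not in_quotes:
            if started:
                args.append("".join(current))
                current, started = [], False
        else:
            current.append(ch)
            started = True
    if started:
        args.append("".join(current))
    return args

full = 'bug.exe arg1 "before \u00a9 after" arg3'
truncated = full[:full.index("\u00a9")]   # everything from the bad character on is lost

print(len(split_command_line(full)))
print(len(split_command_line(truncated)))
```

The full line yields four arguments and the truncated one yields three, matching the argc of 3 reported in the thread. The parser is doing its job; it simply never sees the rest of the line.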

I mentioned wprintf because recently I was wondering why
mkpasswd/mkgroup had a strange truncating behavior with Russian
usernames, and it turned out that wprintf, when it can't encode some
character, stops right there and returns an error code. But, honestly,
who ever checks return codes from printf?

Something similar might be happening here. When the command line is
constructed, some function is called, can't encode some character, and
returns an error status, but the status is never checked, and you get a
truncated command line.
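The suspected failure mode, and the alternative of using a replacement character, can be sketched as below. This is speculation rendered as code, in the same spirit as the paragraph: it uses Python's UTF-8 decoder as the converter and does not claim to mirror the Cygwin sources. Both function names are hypothetical.

```python
def convert_until_error(raw, charset="utf-8"):
    """Convert as far as possible; return (text, ok). Text stops at the first bad byte."""
    try:
        return raw.decode(charset), True
    except UnicodeDecodeError as err:
        return raw[:err.start].decode(charset), False

def convert_with_replacement(raw, charset="utf-8"):
    """Convert everything, substituting U+FFFD for bytes that cannot be decoded."""
    return raw.decode(charset, errors="replace")

raw = b'arg1 "before \xa9 after" arg3'

text, ok = convert_until_error(raw)
print(repr(text), ok)                             # a caller that ignores ok keeps the stub
print(ascii(convert_with_replacement(raw)))
```

A caller that ignores the status flag ends up with `arg1 "before ` and nothing else, which is the truncation reported in this thread. The replacement variant keeps the rest of the line at the cost of one substituted character.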

And btw, I'm not a Cygwin developer, I'm just a speculating user
right now, because I haven't looked for this problem in the code.




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread Edward Lam

Hi Alexey,

Thanks for explaining the UTF8 changes in cygwin 1.7. However, the 
decision to use UTF-8 for the C locale is questionable.


It seems to me that it would be much safer to use the SYSTEM DEFAULT
code page (ie. the return value of the system GetACP() function) for
CYGWIN instead, ensuring compatibility for the large class of native
Windows applications that are non-Unicode, non-CodePage aware.


Reading the original mailing list threads now, it seems like Corinna
Vinschen also mentioned using the system code page[1]. I tried to
dig through the various mails in that thread but didn't find any good
objection to it.


> The only bug here is that the arguments are truncated instead of using
> some kind of a replacement character. Is it related to some POSIX
> compliance, like with wprintf?

I think it's very bad that changing LANG can result in a truncated
*command line*, that has nothing to do with printf. The printf in the
code was just for testing. The HUGE bug is that the application gets the
WRONG NUMBER OF ARGUMENTS.


1. http://www.mail-archive.com/cygwin@cygwin.com/msg96843.html

Regards,
-Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread Alexey Borzenkov
On Fri, May 29, 2009 at 8:22 PM, Edward Lam  wrote:
> I think there is still a bug here? If I set LANG=C, shouldn't it just NOT
> do any encoding, and thus work? If I do this on Linux, it works. If I use a
> cygwin compiled app, it also works.

On Linux, the system internally uses multibyte strings (it is even
encoding agnostic), but on Windows the system uses Unicode strings, so
Cygwin has to decode your byte sequences somehow to pass them to
non-Cygwin processes as Unicode (the fact that Cygwin now understands
Unicode is a huge plus to me). In earlier discussions it was decided that
the Cygwin C locale should use the UTF-8 encoding: because the file system
internally uses Unicode, UTF-8 is the safest default to represent all
possible filenames, etc. In previous Cygwin versions, your byte sequences
were just silently converted using your system's codepage (by the system
itself, even), so if you want the old behavior you should set
LANG=en_US.CP1252.
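The difference between the two worlds can be shown with a few lines. In this sketch the Linux side is modelled as passing bytes through untouched, and the Windows side as a decode step followed by UTF-16LE encoding. Both helpers are hypothetical and only illustrate where the charset choice enters.

```python
def linux_style(raw):
    """Bytes are handed to the child unchanged; no charset is consulted."""
    return raw

def windows_style(raw, charset):
    """Bytes must be decoded with some charset before becoming UTF-16LE."""
    return raw.decode(charset).encode("utf-16-le")

raw = b"\xa9"

print(linux_style(raw).hex())
print(windows_style(raw, "cp1252").hex())
print(windows_style(raw, "cp437").hex())
print(windows_style(b"\xc2\xa9", "utf-8").hex())
```

On the byte-oriented side there is nothing to get wrong. On the Unicode side the same input byte becomes a different UTF-16 code unit depending on the charset assumed, and with UTF-8 the lone byte cannot be converted at all unless it is first re-encoded as c2 a9.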

The only bug here is that the arguments are truncated instead of using
some kind of a replacement character. Is it related to some POSIX
compliance, like with wprintf?




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread Edward Lam

IWAMURO Motonori wrote:

> I think that you should set "export LANG=en_US.ISO-8859-1" instead of
> "export LANG=LANG=en_US.ISO-8859-1".


Ah, sorry, copy/paste error. Yes, that finally works. Thank you!

I think there is still a bug here? If I set LANG=C, shouldn't it just
NOT do any encoding, and thus work? If I do this on Linux, it works. If I
use a cygwin compiled app, it also works.


-Edward


> 2009/5/30 Edward Lam :
>> IWAMURO Motonori wrote:
>>> The encoding of C locale is ASCII, and not ISO-8859-1.
>>> I don't think ASCII is the same as ISO-8859-1.
>>> Does it work on LANG=en_US.ISO-8859-1?
>>
>> No, it doesn't. Mind you though, I haven't managed to get piconv to
>> recognize any of my LANG settings other than C in cygwin 1.7.
>>
>> $ export LANG=LANG=en_US.ISO-8859-1
>>
>> $ piconv
>> perl: warning: Setting locale failed.
>> perl: warning: Please check that your locale settings:
>>        LC_ALL = (unset),
>>        LANG = "LANG=en_US.ISO-8859-1"
>>    are supported and installed on your system.
>>
>> (... usage omitted...)
>>
>> $ ./bug arg1 "before `cat copyright.txt` after" arg3
>> 0: E:\cygwin1.7\tmp\bug.exe
>> 1: arg1
>> 2: before
>>
>> Regards,
>> -Edward
>>
>>> 2009/5/29 Edward Lam :
>>>> Alexey Borzenkov wrote:
>>>>> On Thu, May 28, 2009 at 7:28 PM, Edward Lam  wrote:
>>>>>> PS. In case you haven't noticed, copyright.txt is not a long file. It
>>>>>> consists of a single byte, 0xA9.
>>>>>
>>>>> Did you try utf-8 encoding copyright.txt? Perhaps your locale is utf-8
>>>>> and the encoder fails.
>>>>
>>>> How is one supposed to determine one's locale in cygwin? I do NOT have LANG,
>>>> or any of the LC environment variables set. I even tried explicitly setting
>>>> LANG=C and it still fails.
>>>>
>>>> The problem does seem to stem from the new UTF-8 support in cygwin 1.7.
>>>> However, I think something is going on here that is unexpected because
>>>> trying something similar on Linux has no problems. To confirm that it was an
>>>> UTF-8 related problem, let me repeat the steps slightly differently again.
>>>> Here we assume that I've already got bug.exe compiled which simply prints
>>>> out its arguments.
>>>>
>>>> $ export LANG=C
>>>>
>>>> $ ./bug arg1 "before `cat copyright.txt` after" arg3
>>>> 0: E:\cygwin1.7\tmp\bug.exe
>>>> 1: arg1
>>>> 2: before
>>>>
>>>> *Notice that argc is 3 when it should be 4!*
>>>>
>>>> $ piconv -f iso-8859-1 -t utf8 < copyright.txt > fubar.txt
>>>>
>>>> $ ./bug arg1 "before `cat fubar.txt` after" arg3
>>>> 0: E:\cygwin1.7\tmp\bug.exe
>>>> 1: arg1
>>>> 2: before © after
>>>> 3: arg3
>>>>
>>>> *So now everything works because I converted the character into UTF-8.*
>>>>
>>>> I think what this points to is some form of invalid source encoding of the
>>>> command line argument when spawning NATIVE applications.
>>>>
>>>> Here's what happens when I try to compile bug.c using cygwin's gcc:
>>>>
>>>> $ gcc bug.c -o bug-gcc.exe
>>>>
>>>> $ ./bug-gcc arg1 "before `cat copyright.txt` after" arg3
>>>> 0: ./bug-gcc
>>>> 1: arg1
>>>> 2: before © after
>>>> 3: arg3
>>>>
>>>> So there seems to be some sort of special marshaling of the command line
>>>> arguments that only works when spawning cygwin apps, but breaks when running
>>>> under native apps.
>>>>
>>>> Regards,
>>>> -Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread IWAMURO Motonori
I think that you should set "export LANG=en_US.ISO-8859-1" instead of
"export LANG=LANG=en_US.ISO-8859-1".
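
The quoted session sets LANG to the literal string "LANG=en_US.ISO-8859-1", which is why perl reports the locale as unsupported. Below is a minimal sketch of a sanity check for that kind of typo, assuming locale names of the usual language_TERRITORY.codeset form. The function name check_lang_value is made up for illustration and is not part of Cygwin or any standard tool.

```shell
#!/bin/sh
# Hypothetical helper: a LANG value of the usual form never contains '=',
# so "LANG=en_US.ISO-8859-1" as a value means the name was typed twice.
check_lang_value() {
    case "$1" in
        *=*) printf 'suspicious: %s\n' "$1"; return 1 ;;
        '')  printf 'unset or empty\n'; return 0 ;;
        *)   printf 'ok: %s\n' "$1"; return 0 ;;
    esac
}

# The value from the quoted session, then the intended one.
check_lang_value "LANG=en_US.ISO-8859-1" || true
check_lang_value "en_US.ISO-8859-1"
```

Calling it as check_lang_value "$LANG" before running piconv would flag the doubled prefix. It does not tell you whether the named locale is actually installed.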

2009/5/30 Edward Lam :
> IWAMURO Motonori wrote:
>>
>> The encoding of C locale is ASCII, and not ISO-8859-1.
>> I don't think ASCII is the same as ISO-8859-1.
>> Does it work on LANG=en_US.ISO-8859-1?
>
> No, it doesn't. Mind you though, I haven't managed to get piconv to
> recognize any of my LANG settings other than C in cygwin 1.7.
>
> $ export LANG=LANG=en_US.ISO-8859-1
>
> $ piconv
> perl: warning: Setting locale failed.
> perl: warning: Please check that your locale settings:
>        LC_ALL = (unset),
>        LANG = "LANG=en_US.ISO-8859-1"
>    are supported and installed on your system.
>
> (... usage omitted...)
>
> $ ./bug arg1 "before `cat copyright.txt` after" arg3
> 0: E:\cygwin1.7\tmp\bug.exe
> 1: arg1
> 2: before
>
> Regards,
> -Edward
>
>> 2009/5/29 Edward Lam :
>>>
>>> Alexey Borzenkov wrote:

>>>> On Thu, May 28, 2009 at 7:28 PM, Edward Lam  wrote:
>>>>>
>>>>> PS. In case you haven't noticed, copyright.txt is not a long file. It
>>>>> consists of a single byte, 0xA9.
>>>>
>>>> Did you try utf-8 encoding copyright.txt? Perhaps your locale is utf-8
>>>> and the encoder fails.
>>>
>>> How is one supposed to determine one's locale in cygwin? I do NOT have LANG,
>>> or any of the LC environment variables set. I even tried explicitly setting
>>> LANG=C and it still fails.
>>>
>>> The problem does seem to stem from the new UTF-8 support in cygwin 1.7.
>>> However, I think something is going on here that is unexpected because
>>> trying something similar on Linux has no problems. To confirm that it was an
>>> UTF-8 related problem, let me repeat the steps slightly differently again.
>>> Here we assume that I've already got bug.exe compiled which simply prints
>>> out its arguments.
>>>
>>> $ export LANG=C
>>>
>>> $ ./bug arg1 "before `cat copyright.txt` after" arg3
>>> 0: E:\cygwin1.7\tmp\bug.exe
>>> 1: arg1
>>> 2: before
>>>
>>> *Notice that argc is 3 when it should be 4!*
>>>
>>> $ piconv -f iso-8859-1 -t utf8 < copyright.txt > fubar.txt
>>>
>>> $ ./bug arg1 "before `cat fubar.txt` after" arg3
>>> 0: E:\cygwin1.7\tmp\bug.exe
>>> 1: arg1
>>> 2: before © after
>>> 3: arg3
>>>
>>> *So now everything works because I converted the character into UTF-8.*
>>>
>>> I think what this points to is some form of invalid source encoding of the
>>> command line argument when spawning NATIVE applications.
>>>
>>> Here's what happens when I try to compile bug.c using cygwin's gcc:
>>>
>>> $ gcc bug.c -o bug-gcc.exe
>>>
>>> $ ./bug-gcc arg1 "before `cat copyright.txt` after" arg3
>>> 0: ./bug-gcc
>>> 1: arg1
>>> 2: before © after
>>> 3: arg3
>>>
>>> So there seems to be some sort of special marshaling of the command line
>>> arguments that only works when spawning cygwin apps, but breaks when running
>>> under native apps.
>>>
>>> Regards,
>>> -Edward

-- 
IWAMURO Motnori 




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread Edward Lam

IWAMURO Motonori wrote:

The encoding of C locale is ASCII, and not ISO-8859-1.
I don't think ASCII is the same as ISO-8859-1.
Does it work on LANG=en_US.ISO-8859-1?


No, it doesn't. Mind you though, I haven't managed to get piconv to 
recognize any of my LANG settings other than C in cygwin 1.7.


$ export LANG=LANG=en_US.ISO-8859-1

$ piconv
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LC_ALL = (unset),
LANG = "LANG=en_US.ISO-8859-1"
are supported and installed on your system.

(... usage omitted...)

$ ./bug arg1 "before `cat copyright.txt` after" arg3
0: E:\cygwin1.7\tmp\bug.exe
1: arg1
2: before

Regards,
-Edward


2009/5/29 Edward Lam :

Alexey Borzenkov wrote:

On Thu, May 28, 2009 at 7:28 PM, Edward Lam  wrote:

PS. In case you haven't noticed, copyright.txt is not a long file. It
consists of a single byte, 0xA9.

Did you try utf-8 encoding copyright.txt? Perhaps your locale is utf-8
and the encoder fails.

How is one supposed to determine one's locale in cygwin? I do NOT have LANG,
or any of the LC environment variables set. I even tried explicitly setting
LANG=C and it still fails.

The problem does seem to stem from the new UTF-8 support in cygwin 1.7.
However, I think something is going on here that is unexpected because
trying something similar on Linux has no problems. To confirm that it was an
UTF-8 related problem, let me repeat the steps slightly differently again.
Here we assume that I've already got bug.exe compiled which simply prints
out its arguments.

$ export LANG=C

$ ./bug arg1 "before `cat copyright.txt` after" arg3
0: E:\cygwin1.7\tmp\bug.exe
1: arg1
2: before

*Notice that argc is 3 when it should be 4!*

$ piconv -f iso-8859-1 -t utf8 < copyright.txt > fubar.txt

$ ./bug arg1 "before `cat fubar.txt` after" arg3
0: E:\cygwin1.7\tmp\bug.exe
1: arg1
2: before © after
3: arg3

*So now everything works because I converted the character into UTF-8.*

I think what this points to is some form of invalid source encoding of the
command line argument when spawning NATIVE applications.

Here's what happens when I try to compile bug.c using cygwin's gcc:

$ gcc bug.c -o bug-gcc.exe

$ ./bug-gcc arg1 "before `cat copyright.txt` after" arg3
0: ./bug-gcc
1: arg1
2: before © after
3: arg3

So there seems to be some sort of special marshaling of the command line
arguments that only works when spawning cygwin apps, but breaks when running
under native apps.

Regards,
-Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-29 Thread IWAMURO Motonori
Hi.

The encoding of C locale is ASCII, and not ISO-8859-1.
I don't think ASCII is the same as ISO-8859-1.
Does it work on LANG=en_US.ISO-8859-1?

2009/5/29 Edward Lam :
> Alexey Borzenkov wrote:
>>
>> On Thu, May 28, 2009 at 7:28 PM, Edward Lam  wrote:
>>>
>>> PS. In case you haven't noticed, copyright.txt is not a long file. It
>>> consists of a single byte, 0xA9.
>>
>> Did you try utf-8 encoding copyright.txt? Perhaps your locale is utf-8
>> and the encoder fails.
>
> How is one supposed to determine one's locale in cygwin? I do NOT have LANG,
> or any of the LC environment variables set. I even tried explicitly setting
> LANG=C and it still fails.
>
> The problem does seem to stem from the new UTF-8 support in cygwin 1.7.
> However, I think something is going on here that is unexpected because
> trying something similar on Linux has no problems. To confirm that it was an
> UTF-8 related problem, let me repeat the steps slightly differently again.
> Here we assume that I've already got bug.exe compiled which simply prints
> out its arguments.
>
> $ export LANG=C
>
> $ ./bug arg1 "before `cat copyright.txt` after" arg3
> 0: E:\cygwin1.7\tmp\bug.exe
> 1: arg1
> 2: before
>
> *Notice that argc is 3 when it should be 4!*
>
> $ piconv -f iso-8859-1 -t utf8 < copyright.txt > fubar.txt
>
> $ ./bug arg1 "before `cat fubar.txt` after" arg3
> 0: E:\cygwin1.7\tmp\bug.exe
> 1: arg1
> 2: before © after
> 3: arg3
>
> *So now everything works because I converted the character into UTF-8.*
>
> I think what this points to is some form of invalid source encoding of the
> command line argument when spawning NATIVE applications.
>
> Here's what happens when I try to compile bug.c using cygwin's gcc:
>
> $ gcc bug.c -o bug-gcc.exe
>
> $ ./bug-gcc arg1 "before `cat copyright.txt` after" arg3
> 0: ./bug-gcc
> 1: arg1
> 2: before © after
> 3: arg3
>
> So there seems to be some sort of special marshaling of the command line
> arguments that only works when spawning cygwin apps, but breaks when running
> under native apps.
>
> Regards,
> -Edward

-- 
IWAMURO Motnori 




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-28 Thread Edward Lam

Alexey Borzenkov wrote:

On Thu, May 28, 2009 at 7:28 PM, Edward Lam  wrote:

PS. In case you haven't noticed, copyright.txt is not a long file. It
consists of a single byte, 0xA9.


Did you try utf-8 encoding copyright.txt? Perhaps your locale is utf-8
and the encoder fails.


How is one supposed to determine one's locale in cygwin? I do NOT have 
LANG, or any of the LC environment variables set. I even tried 
explicitly setting LANG=C and it still fails.
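
On POSIX systems in general, the locale used for character handling comes from the first of LC_ALL, LC_CTYPE and LANG that is set and non-empty. Below is a rough sketch of that lookup. The helper name effective_ctype is hypothetical, and whether this Cygwin 1.7 build honours the variables in the same way or ships the `locale` utility is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# effective_ctype is a hypothetical helper, not a Cygwin or POSIX utility.
# It applies the POSIX precedence for character handling:
# LC_ALL overrides LC_CTYPE, which overrides LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Nothing set: the implementation picks its own default.
        printf '(implementation default)\n'
    fi
}

effective_ctype

# Where the POSIX `locale` utility is installed, it reports the character
# set the C library actually selected.
if command -v locale >/dev/null 2>&1; then
    locale charmap || true
fi
```

With none of the variables set, as described above, the function prints "(implementation default)", which is the case where the C library's built-in choice applies.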


The problem does seem to stem from the new UTF-8 support in cygwin 1.7.
However, I think something unexpected is going on here, because trying
something similar on Linux works without problems. To confirm that it is
a UTF-8 related problem, let me repeat the steps slightly differently.
Here we assume that I've already compiled bug.exe, which simply prints
out its arguments.


$ export LANG=C

$ ./bug arg1 "before `cat copyright.txt` after" arg3
0: E:\cygwin1.7\tmp\bug.exe
1: arg1
2: before

*Notice that argc is 3 when it should be 4!*

$ piconv -f iso-8859-1 -t utf8 < copyright.txt > fubar.txt

$ ./bug arg1 "before `cat fubar.txt` after" arg3
0: E:\cygwin1.7\tmp\bug.exe
1: arg1
2: before © after
3: arg3

*So now everything works because I converted the character into UTF-8.*
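
For this one byte, the piconv step amounts to the following arithmetic. ISO-8859-1 byte values equal Unicode code points, so 0xA9 becomes the two-byte UTF-8 sequence c2 a9. This is a small sketch in plain shell arithmetic; the function name latin1_to_utf8_hex is made up for illustration and is not what piconv does internally.

```shell
#!/bin/sh
# latin1_to_utf8_hex is a hypothetical helper. It takes one ISO-8859-1
# byte value (decimal or 0x-prefixed hex) and prints the bytes of its
# UTF-8 encoding in hex.
latin1_to_utf8_hex() {
    b=$(($1))
    if [ "$b" -lt 128 ]; then
        # ASCII range: UTF-8 is the same single byte.
        printf '%02x\n' "$b"
    else
        # 0x80..0xFF: two bytes of the form 110000xx 10xxxxxx.
        printf '%02x %02x\n' $((0xC0 | (b >> 6))) $((0x80 | (b & 0x3F)))
    fi
}

latin1_to_utf8_hex 0xA9   # prints "c2 a9"
latin1_to_utf8_hex 0x63   # prints "63" (the letter c)
```

So copyright.txt holds the single byte a9, which is not valid UTF-8 on its own, while fubar.txt holds c2 a9, which is.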

I think what this points to is some form of invalid source encoding of 
the command line argument when spawning NATIVE applications.


Here's what happens when I try to compile bug.c using cygwin's gcc:

$ gcc bug.c -o bug-gcc.exe

$ ./bug-gcc arg1 "before `cat copyright.txt` after" arg3
0: ./bug-gcc
1: arg1
2: before © after
3: arg3

So there seems to be some sort of special marshaling of the command line
arguments that only works when spawning cygwin apps, but breaks when
running native apps.


Regards,
-Edward




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-28 Thread Alexey Borzenkov
On Thu, May 28, 2009 at 7:28 PM, Edward Lam  wrote:
> PS. In case you haven't noticed, copyright.txt is not a long file. It
> consists of a single byte, 0xA9.

Did you try utf-8 encoding copyright.txt? Perhaps your locale is utf-8
and the encoder fails.




Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-28 Thread Edward Lam
PS. In case you haven't noticed, copyright.txt is not a long file. It 
consists of a single byte, 0xA9.


Edward Lam wrote:

Hi Larry,

 > This sounds a lot like this report to me:
 >
 > 

I don't think it's the same bug because if I replace copyright.txt with 
a single printable character (eg. c), then it works.


Regards,
-Edward

Larry Hall (Cygwin) wrote:

Edward Lam wrote:

Hi Cygwin 1.7 developers,

I think I've encountered a bug in cygwin 1.7.0-48 on WinXP 32-bit. It
seems that passing a character on the command line (from either 
ash.exe or bash.exe) that is greater than 127 to a native win32 
process results in arguments being truncated.


Hopefully you can reproduce and fix. Steps to reproduce outlined below.

$ cat bug.c
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;
    /* Print each argument with its index. */
    for (i = 0; i < argc; i++)
        printf("%d: %s\n", i, argv[i]);
    return 0;
}

$ xxd copyright.txt
000: a9   .

$ $TOOLROOT/bin/cl -I$TOOLROOT/include bug.c /link 
/libpath:$TOOLROOT/lib /libpath:$TOOLROOT/PlatformSDK/lib


Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

bug.c
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:bug.exe
/libpath:e:/msdev7/vc7/lib
/libpath:e:/msdev7/vc7/PlatformSDK/lib
bug.obj

$ ./bug "before `cat copyright.txt` after"
0: E:\cygwin1.7\tmp\bug.exe
1: before

Notice that for argument 1, we never see the contents of 
copyright.txt and the text after it, "after" is never passed to the 
win32 native application.


This sounds a lot like this report to me:



No?










Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-28 Thread Edward Lam

Hi Larry,

> This sounds a lot like this report to me:
>
> 

I don't think it's the same bug because if I replace copyright.txt with 
a single printable character (eg. c), then it works.


Regards,
-Edward

Larry Hall (Cygwin) wrote:

Edward Lam wrote:

Hi Cygwin 1.7 developers,

I think I've encountered a bug in cygwin 1.7.0-48 on WinXP 32-bit. It
seems that passing a character on the command line (from either 
ash.exe or bash.exe) that is greater than 127 to a native win32 
process results in arguments being truncated.


Hopefully you can reproduce and fix. Steps to reproduce outlined below.

$ cat bug.c
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;
    /* Print each argument with its index. */
    for (i = 0; i < argc; i++)
        printf("%d: %s\n", i, argv[i]);
    return 0;
}

$ xxd copyright.txt
000: a9   .

$ $TOOLROOT/bin/cl -I$TOOLROOT/include bug.c /link 
/libpath:$TOOLROOT/lib /libpath:$TOOLROOT/PlatformSDK/lib


Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

bug.c
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:bug.exe
/libpath:e:/msdev7/vc7/lib
/libpath:e:/msdev7/vc7/PlatformSDK/lib
bug.obj

$ ./bug "before `cat copyright.txt` after"
0: E:\cygwin1.7\tmp\bug.exe
1: before

Notice that for argument 1, we never see the contents of copyright.txt 
and the text after it, "after" is never passed to the win32 native 
application.


This sounds a lot like this report to me:



No?







Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-28 Thread Larry Hall (Cygwin)

Edward Lam wrote:

Hi Cygwin 1.7 developers,

I think I've encountered a bug in cygwin 1.7.0-48 on WinXP 32-bit. It
seems that passing a character on the command line (from either ash.exe 
or bash.exe) that is greater than 127 to a native win32 process results 
in arguments being truncated.


Hopefully you can reproduce and fix. Steps to reproduce outlined below.

$ cat bug.c
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;
    /* Print each argument with its index. */
    for (i = 0; i < argc; i++)
        printf("%d: %s\n", i, argv[i]);
    return 0;
}

$ xxd copyright.txt
000: a9   .

$ $TOOLROOT/bin/cl -I$TOOLROOT/include bug.c /link 
/libpath:$TOOLROOT/lib /libpath:$TOOLROOT/PlatformSDK/lib


Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

bug.c
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:bug.exe
/libpath:e:/msdev7/vc7/lib
/libpath:e:/msdev7/vc7/PlatformSDK/lib
bug.obj

$ ./bug "before `cat copyright.txt` after"
0: E:\cygwin1.7\tmp\bug.exe
1: before

Notice that for argument 1, we never see the contents of copyright.txt 
and the text after it, "after" is never passed to the win32 native 
application.


This sounds a lot like this report to me:



No?

--
Larry Hall  http://www.rfk.com
RFK Partners, Inc.  (508) 893-9779 - RFK Office
216 Dalton Rd.  (508) 893-9889 - FAX
Holliston, MA 01746

_

A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting annoying in email?




1.7.0-48: [BUG] Passing characters above 128 from bash command line

2009-05-28 Thread Edward Lam

Hi Cygwin 1.7 developers,

I think I've encountered a bug in cygwin 1.7.0-48 on WinXP 32-bit. It
seems that passing a character on the command line (from either ash.exe 
or bash.exe) that is greater than 127 to a native win32 process results 
in arguments being truncated.


Hopefully you can reproduce and fix. Steps to reproduce outlined below.

$ cat bug.c
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;
    /* Print each argument with its index. */
    for (i = 0; i < argc; i++)
        printf("%d: %s\n", i, argv[i]);
    return 0;
}

$ xxd copyright.txt
000: a9   .

$ $TOOLROOT/bin/cl -I$TOOLROOT/include bug.c /link 
/libpath:$TOOLROOT/lib /libpath:$TOOLROOT/PlatformSDK/lib


Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

bug.c
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:bug.exe
/libpath:e:/msdev7/vc7/lib
/libpath:e:/msdev7/vc7/PlatformSDK/lib
bug.obj

$ ./bug "before `cat copyright.txt` after"
0: E:\cygwin1.7\tmp\bug.exe
1: before

Notice that for argument 1, we never see the contents of copyright.txt,
and the text after it, "after", is never passed to the native win32
application.
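
To check which bytes the shell itself is holding before it spawns anything, a hex dump of each argument can help. This is a minimal sketch; dump_args is a hypothetical helper, and it only shows the shell side. It says nothing about what Cygwin does to the command line when it starts a native process.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): print each argument as hex
# bytes so that non-ASCII bytes such as 0xA9 are visible.
dump_args() {
    i=0
    for arg in "$@"; do
        # od -An -tx1 prints the raw bytes in hex; tr strips spaces/newlines.
        hex=$(printf '%s' "$arg" | od -An -tx1 | tr -d ' \n')
        printf '%d: %s\n' "$i" "$hex"
        i=$((i + 1))
    done
}

# Build the same argument as in the report; \251 is octal for byte 0xA9.
copyright=$(printf '\251')
dump_args arg1 "before $copyright after" arg3
# prints:
# 0: 61726731
# 1: 6265666f726520a9206166746572
# 2: 61726733
```

If the a9 byte shows up here but the native program still sees a truncated argument, the loss happens after the shell, which is consistent with the behaviour described above.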


Thanks,
-Edward

Cygwin Configuration Diagnostics
Current System Time: Thu May 28 10:59:09 2009

Windows XP Professional Ver 5.1 Build 2600 Service Pack 3

Path:   E:\cygwin1.7\usr\local\bin
E:\cygwin1.7\bin
E:\cygwin1.7\bin
E:\cygwin1.7\usr\X11R6\bin
C:\WINDOWS\system32
C:\WINDOWS
C:\WINDOWS\System32\Wbem
F:\apps\Pixar\RenderManProServer-13.5.2\bin
E:\msdev7\Common7\IDE
E:\apps\QuickTime\QTSystem\

Output from e:\cygwin1.7\bin\id.exe (nontsec)
UID: 1008(edward)   GID: 513(None)
544(Administrators) 545(Users)  513(None)

Output from e:\cygwin1.7\bin\id.exe (ntsec)
UID: 1008(edward)   GID: 513(None)
544(Administrators) 545(Users)  513(None)

SysDir: C:\WINDOWS\system32
WinDir: C:\WINDOWS

USER = 'edward'
PWD = '/tmp'
CYGWIN = 'nodosfilewarning'
HOME = '/home/edward'

HOMEPATH = '\home\edward'
AQSISHOME = 'f:\apps\Aqsis\bin'
ALL = '*.[Ch]'
MANPATH = '/usr/local/man:/usr/share/man:/usr/man::/usr/ssl/man'
APPDATA = 'C:\Documents and Settings\edward\Application Data'
RMANTREE = 'F:\apps\Pixar\RenderManProServer-13.5.2'
HOSTNAME = 'crete'
VS71COMNTOOLS = 'E:\msdev7\Common7\Tools\'
INTEL_LICENSE_FILE = 'C:\Program Files\Common Files\Intel\Licenses'
TERM = 'cygwin'
PROCESSOR_IDENTIFIER = 'x86 Family 15 Model 2 Stepping 5, GenuineIntel'
GC = 'e:/apps/GlowCo~1.1'
WINDIR = 'C:\WINDOWS'
OLDPWD = '/usr/bin'
USERDOMAIN = 'CRETE'
OS = 'Windows_NT'
ALLUSERSPROFILE = 'C:\Documents and Settings\All Users'
ANT_OPTS = '-Xmx500M'
SVN_EDITOR = 'gvim -f'
!:: = '::\'
TEMP = '/cygdrive/e/tmp'
COMMONPROGRAMFILES = 'C:\Program Files\Common Files'
QTJAVA = 'e:\apps\Java\jre1.5.0_07\lib\ext\QTJava.zip'
USERNAME = 'edward'
TOOLROOT = 'e:/msdev7/vc7'
PROCESSOR_LEVEL = '15'
FP_NO_HOST_CHECK = 'NO'
SYSTEMDRIVE = 'C:'
JAVA_HOME = 'e:/j2sdk1.4.2_04'
USERPROFILE = 'C:\Documents and Settings\edward'
CLIENTNAME = 'Console'
PS1 = '\[\e]0;\w\a\]\n\[\e[32m\...@\h \[\e[33m\]\w\[\e[0m\]\n\$ '
LOGONSERVER = '\\CRETE'
PROCESSOR_ARCHITECTURE = 'x86'
SPM_HOST = 'beijing.sidefx.com'
SHLVL = '1'
PATHEXT = '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH'
NSPR_LOG_MODULES = 'smtp:5'
NCFTPDIR = 'e:/home/edward/.ncftp'
HOUDINI_TOOLS = 'e:/houdini_tools'
HOMEDRIVE = 'e:'
NSPR_LOG_FILE = 'e:/tmp/nspr_smtp.log'
MI_ROOT = 'C:/Program Files/mental images/mental ray nt-x86 V 3.6.0.31'
HOUDINI_TEXT_CONSOLE = '1'
PROMPT = '$P$G'
COMSPEC = 'C:\WINDOWS\system32\cmd.exe'
TMP = '/cygdrive/e/tmp'
SYSTEMROOT = 'C:\WINDOWS'
PRINTER = 'Xerox Document Centre 430'
CVS_RSH = '/bin/ssh'
PROCESSOR_REVISION = '0205'
JAVA_HOME2 = 'F:/jdk1.5.0_06'
CLASSPATH = 'e:\apps\Java\jre1.5.0_07\lib\ext\QTJava.zip'
DESKTOP = 'c:/DOCUME~1/edward/Desktop'
PathOld = 'e:\cygwin\bin;E:\apps\Tcl\bin;e:\apps\perl\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;F:\apps\Pixar\RenderManProServer-13.5.2/bin;e:\msdev7\Common7\IDE;e:\apps\QuickTime\QTSystem\;'
!E: = 'E:\cygwin1.7\bin'
INFOPATH = '/usr/local/info:/usr/share/info:/usr/info:'
PROGRAMFILES = 'C:\Program Files'
NUMBER_OF_PROCESSORS = '4'
HOUDINI_INTERNAL_IGNORE_SIGNALS = '1'
SESSIONNAME = 'Console'
COMPUTERNAME = 'CRETE'
HOME_OLD = 'e:\home\edward'
_ = '/usr/bin/cygcheck'

HKEY_CURRENT_USER\Software\Cygnus Solutions\Cygwin
HKEY_CURRENT_USER\Software\Cygnus Solutions\Cygwin\mounts v2
HKEY_CURRENT_USER\Software\Cygnus Solutions\Cygwin\Program Options
HKEY_CURRENT_USER\Software\Cygnus Solutions\CYGWIN.DLL setup
HKEY_CURRENT_USER\Software\Cygnus Solutions\CYGWIN.DLL setup\b15.0
HKEY_CURRENT_USER\Software\Cygnus Solutions\CYGWIN.DLL setup\b15.0\mounts
HKEY_CURRENT_USER\Software\Cygnus Solutions\CYGWIN.DLL setup\b15.0\mounts\00
  (default) = 'C:'
  unix = '/'
  fbinary = 0x
  fsilent = 0x