Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Wed, Sep 23, 2009 at 5:30 PM, Ross Smith wrote:
> Corinna Vinschen wrote:
>>
>> However, if we default to UTF-8 for a subset of languages anyway, it
>> gets even more interesting to ask, why not for all languages?  Isn't it
>> better in the long run to have the same default for all Cygwin
>> installations?
>>
>> I'm really wondering if we shouldn't simply default to UTF-8 as charset
>> throughout, in the application, the console, and for the filename
>> conversion.  Yes, not all applications will work OOTB with chars > 0x7f,
>> but it was always a bug to make any assumptions for non-ASCII chars
>> in the C locale.  Applications can be fixed, right?
>
> In support of this plan, it occurs to me that any command line
> applications that don't speak UTF-8 would presumably be showing the
> same behaviour on Linux (e.g. odd column widths). Since one of Cygwin's
> main goals is providing a Linux-like environment on Windows, I don't
> think Cygwin developers should feel obliged to go out of their way to
> do _better_ than Linux in this regard.
>
> -- Ross Smith

I don't have anything to add on the technical side of things, but I
will note that most Linux distributions have been defaulting to UTF-8
lately.  I think it would be highly appropriate to default to UTF-8 in
Cygwin.

Robert Pendell
shi...@elite-systems.org
"A perfect world is one of chaos."

Thawte Web of Trust Notary
CAcert Assurer
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Corinna Vinschen wrote:
>
> However, if we default to UTF-8 for a subset of languages anyway, it
> gets even more interesting to ask, why not for all languages?  Isn't it
> better in the long run to have the same default for all Cygwin
> installations?
>
> I'm really wondering if we shouldn't simply default to UTF-8 as charset
> throughout, in the application, the console, and for the filename
> conversion.  Yes, not all applications will work OOTB with chars > 0x7f,
> but it was always a bug to make any assumptions for non-ASCII chars
> in the C locale.  Applications can be fixed, right?

In support of this plan, it occurs to me that any command line
applications that don't speak UTF-8 would presumably be showing the
same behaviour on Linux (e.g. odd column widths). Since one of Cygwin's
main goals is providing a Linux-like environment on Windows, I don't
think Cygwin developers should feel obliged to go out of their way to
do _better_ than Linux in this regard.

-- Ross Smith
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 23 14:43, Corinna Vinschen wrote:
> On Sep 23 13:34, Andy Koppe wrote:
> > 2009/9/23 Corinna Vinschen:
> > > I have a local patch ready to use the ANSI codepage by default in the
> > > "C" locale.  It appears to work nicely and has the additional positive
> > > side effect to simplify the code in a few places.
> > >
> > > If I only knew that eastern language users could happily live with
> > > this change as well!
> >
> > Here's an idea to circumvent the DBCS troubles: default to UTF-8 when
> > no charset is specified in the locale and the ANSI charset isn't
> > singlebyte.
> >
> > Based on the following grounds:
> > - Full CJK support (and more) out of the box.
> > - DBCSs can't have worked very well in 1.5 in the first place, because
> >   the shell and most applications weren't aware of double-byte
> >   characters. Hence backward compatibility is less of an issue here.
> > - Applications that don't (yet) work with UTF-8 are also unlikely to
> >   work correctly with DBCSs.
> > - Iwamuro Motonori asked for it.
>
> Yeah, I was tinkering with this idea, too, but it's much more tricky to
> implement.
>
> I'll think about it.

Turns out, it's not complicated at all.

However, if we default to UTF-8 for a subset of languages anyway, it
gets even more interesting to ask, why not for all languages?  Isn't it
better in the long run to have the same default for all Cygwin
installations?

I'm really wondering if we shouldn't simply default to UTF-8 as charset
throughout, in the application, the console, and for the filename
conversion.  Yes, not all applications will work OOTB with chars > 0x7f,
but it was always a bug to make any assumptions for non-ASCII chars
in the C locale.  Applications can be fixed, right?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 23 13:34, Andy Koppe wrote:
> 2009/9/23 Corinna Vinschen:
> > I have a local patch ready to use the ANSI codepage by default in the
> > "C" locale.  It appears to work nicely and has the additional positive
> > side effect to simplify the code in a few places.
> >
> > If I only knew that eastern language users could happily live with
> > this change as well!
>
> Here's an idea to circumvent the DBCS troubles: default to UTF-8 when
> no charset is specified in the locale and the ANSI charset isn't
> singlebyte.
>
> Based on the following grounds:
> - Full CJK support (and more) out of the box.
> - DBCSs can't have worked very well in 1.5 in the first place, because
>   the shell and most applications weren't aware of double-byte
>   characters. Hence backward compatibility is less of an issue here.
> - Applications that don't (yet) work with UTF-8 are also unlikely to
>   work correctly with DBCSs.
> - Iwamuro Motonori asked for it.

Yeah, I was tinkering with this idea, too, but it's much more tricky to
implement.

I'll think about it.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
2009/9/23 Corinna Vinschen:
> I have a local patch ready to use the ANSI codepage by default in the
> "C" locale.  It appears to work nicely and has the additional positive
> side effect to simplify the code in a few places.
>
> If I only knew that eastern language users could happily live with
> this change as well!

Here's an idea to circumvent the DBCS troubles: default to UTF-8 when
no charset is specified in the locale and the ANSI charset isn't
singlebyte.

Based on the following grounds:
- Full CJK support (and more) out of the box.
- DBCSs can't have worked very well in 1.5 in the first place, because
  the shell and most applications weren't aware of double-byte
  characters. Hence backward compatibility is less of an issue here.
- Applications that don't (yet) work with UTF-8 are also unlikely to
  work correctly with DBCSs.
- Iwamuro Motonori asked for it.

Andy
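[Editorial note: a minimal sketch of the fallback Andy proposes, using
the Win32 calls GetACP() and GetCPInfo() to test whether the ANSI
codepage is singlebyte.  The helper name default_charset_for_C_locale
is hypothetical, not Cygwin's actual internal interface.]

#include <stdio.h>
#include <windows.h>

/* Keep the ANSI codepage as default charset if it is singlebyte,
   otherwise (932, 936, 949, 950, ...) fall back to UTF-8.  */
static const char *
default_charset_for_C_locale (void)
{
  static char buf[16];
  CPINFO ci;
  UINT acp = GetACP ();           /* ANSI codepage, e.g. 1252 or 932 */

  if (GetCPInfo (acp, &ci) && ci.MaxCharSize == 1)
    {
      sprintf (buf, "CP%u", acp); /* singlebyte, e.g. "CP1252" */
      return buf;
    }
  return "UTF-8";                 /* double/multibyte ANSI codepage */
}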
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 22 19:07, Corinna Vinschen wrote:
> On Sep 22 17:12, Andy Koppe wrote:
> > True, but that's an implementation issue rather than a design issue,
> > i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
> > than invoke the __utf8 functions. Shall I look into creating a patch?
> [...]
> Hmm... maybe it's not that complicated.  The ^N case checks for a valid
> UTF-8 lead byte right now.  The U+DCxx case could be handled by
> generating (in sys_cp_wcstombs) and recognizing (in sys_cp_mbstowcs) a
> non-valid lead byte, like 0xff.

I applied a patch for that.  It wasn't very tricky, but while doing it,
I found a couple of annoyances in the conversion functions related to
the invalid character handling.  So the patch is somewhat bigger than
anticipated.

> Only singlebyte charsets are off the hook.  So, your proposal to switch
> to the default ANSI codepage for the C locale would be good for most
> western languages, but it would still leave the eastern language users
> with double-byte charsets behind.
>
> Note that I'm not as opposed to your proposal to use the ANSI codepage
> as before this discussion.  But I would like to see that the solution
> works for most eastern language users as well.

I have a local patch ready to use the ANSI codepage by default in the
"C" locale.  It appears to work nicely and has the additional positive
side effect to simplify the code in a few places.

If I only knew that eastern language users could happily live with
this change as well!

*** REQUEST FOR HELP ***

Is anybody here set up to build the Cygwin DLL *and* working with an
eastern language Windows, namely using the codepages 932 (SJIS), 936
(GBK), 949 (EUC-KR), or 950 (Big5)?  If so, please build your own
Cygwin DLL using the latest from CVS plus the attached patch, and test
if this setting works for you.

The change will result in using your default Windows codepage in the
"C" locale in all components, that is, in the application itself, as
well as in the console and the filename conversion.  In contrast to the
current implementation using UTF-8 for filename conversion by default,
there will be no state anymore in which the application, the console
window, and the filename conversion routine have a different idea of
the charset to use(*).

Thanks in advance,
Corinna

(*) Except when the application switches the console to the "alternate
    charset", which usually happens when it's going to print
    semi-graphical frame and block characters.


Index: newlib/libc/locale/locale.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/locale/locale.c,v
retrieving revision 1.25
diff -u -p -r1.25 locale.c
--- newlib/libc/locale/locale.c	25 Aug 2009 18:47:24 -0000	1.25
+++ newlib/libc/locale/locale.c	23 Sep 2009 11:53:02 -0000
@@ -61,6 +61,11 @@ backward compatibility with older implem
 xxx in [437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874,
 1125, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258].
 
+Instead of <<"C-">>, you can specify also <<"C.">>.  Both variations allow
+to specify language neutral locales while using other charsets than ASCII,
+for instance <<"C.UTF-8">>, which keeps all settings as in the C locale,
+but uses the UTF-8 charset.
+
 Even when using POSIX locale strings, the only charsets allowed are
 <<"UTF-8">>, <<"JIS">>, <<"EUCJP">>, <<"SJIS">>, <>, <>,
 <<"ISO-8859-x">> with 1 <= x <= 15, or <<"CPxxx">> with xxx in
@@ -431,9 +436,19 @@ loadlocale(struct _reent *p, int categor
   if (!strcmp (locale, "POSIX"))
     strcpy (locale, "C");
   if (!strcmp (locale, "C"))			/* Default "C" locale */
+#ifdef __CYGWIN__
+    __set_charset_from_codepage (GetACP (), charset);
+#else
     strcpy (charset, "ASCII");
-  else if (locale[0] == 'C' && locale[1] == '-')	/* Old newlib style */
-    strcpy (charset, locale + 2);
+#endif
+  else if (locale[0] == 'C'
+	   && (locale[1] == '-'		/* Old newlib style */
+	       || locale[1] == '.'))	/* Extension for the C locale to allow
+					   specifying different charsets while
+					   sticking to the C locale in terms
+					   of sort order, etc.  Proposed in
+					   the Debian project. */
+    strcpy (charset, locale + 2);
   else						/* POSIX style */
     {
       char *c = locale;

Index: newlib/libc/stdlib/sb_charsets.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/stdlib/sb_charsets.c,v
retrieving revision 1.3
diff -u -p -r1.3 sb_charsets.c
--- newlib/libc/stdlib/sb_charsets.c	25 Aug 2009 18:47:24 -0000	1.3
+++ newlib/libc/stdlib/sb_charsets.c	23 Sep 2009 11:53:02 -0000
@@ -24,17 +24,17 @@ wchar_t __iso_8859_conv[14][0x60] = {
   0x111, 0x144, 0x148, 0xf3, 0xf4, 0x151, 0xf6, 0xf7,
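[Editorial note: the patch above calls a Cygwin-internal
__set_charset_from_codepage().  A hypothetical sketch of such a
mapping, covering only the codepage/charset pairs named in this thread;
the real function is more complete and the charset names it produces
may differ.]

#include <stdio.h>
#include <string.h>

static void
set_charset_from_codepage (unsigned cp, char *charset)
{
  switch (cp)
    {
    case 932: strcpy (charset, "SJIS");   break;  /* Japanese */
    case 936: strcpy (charset, "GBK");    break;  /* Simplified Chinese */
    case 949: strcpy (charset, "EUC-KR"); break;  /* Korean */
    case 950: strcpy (charset, "BIG5");   break;  /* Traditional Chinese */
    default:
      /* Singlebyte ANSI codepages such as 1250..1258 keep a CPxxx name. */
      sprintf (charset, "CP%u", cp);
      break;
    }
}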
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
2009/9/22 Corinna Vinschen:
>> >> Therefore, when converting a UTF-16 Windows filename to the current
>> >> charset, 0xDC?? words should be treated like any other UTF-16 word
>> >> that can't be represented in the current charset: it should be encoded
>> >> as a ^N sequence.

(I started writing this before seeing your patch to the singlebyte
codepage tables, which makes plenty of sense. Here goes anyway.)

Having actually looked at strfuncs.cc, my diagnosis was too simplistic,
because the U+DC?? codes are used not only for invalid UTF-8 bytes, but
for invalid bytes in any charset. This even includes CP1252, which has
a few holes in the 0x80..0x9F range.

Therefore, the complete solution would be something like this: when
sys_cp_wcstombs comes across a 0xDC?? code, it checks whether the byte
it encodes is indeed an invalid byte in the current charset. If it is,
it translates it into that invalid byte, because on the way back it
would once again be turned into the same 0xDC?? code. If the byte would
represent (part of) a valid character, however, it would need to be
encoded as a ^N sequence to ensure correct roundtripping.

Now that shouldn't be too difficult to implement for singlebyte
charsets, but it gets somewhat hairy for multibyte charsets, including
UTF-8 itself. Here's how I think it could be done though:

In sys_cp_wcstombs:
* On encountering a DC?? code, extract the encoded byte, and feed it
  into f_mbtowc. A private mbstate for this is needed, starting in the
  initial state for each filename. Switch on the result of f_mbtowc:
** case -2 (incomplete sequence): add the byte to a buffer for this
   purpose
** case -1 (invalid sequence): copy anything already in the buffer plus
   the current byte into the target filename, as we can be sure that
   they'll turn back into U+DCbb again on the way back.
** case >0 (valid sequence): encode buffer contents and current byte as
   ^N codes that don't represent valid UTF-8
* When encountering a non-DC?? code, copy any bytes left in the buffer
  into the target filename.

Unfortunately the latter point still leaves a loophole, in case the
incomplete sequence from the buffer and the subsequent bytes combine
into something valid. Singlebyte charsets aren't affected though,
because they don't have continuation bytes. Nor is UTF-8, because it
was designed such that continuation bytes are distinct from initial
bytes. Which leaves the DBCS charsets.

However, it rather looks like DBCSs are an intractable problem here in
any case, because of issues like this:

http://support.microsoft.com/kb/170559: "There are some codes that are
not matched one-to-one between Shift-JIS (Japanese character set
supported by MS) and Unicode. When an application calls
MultiByteToWideChar() and WideCharToMultiByte() to perform code
conversion between Shift-JIS and Unicode, the function returns the
wrong code value in some cases."

Which leaves me scratching my head regarding the C locale. More later ...

Andy
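[Editorial note: a sketch of the classification step Andy describes,
with the standard mbrtowc() standing in for Cygwin's internal f_mbtowc;
the enum and function names are hypothetical.  The caller would clear
the pending buffer after the INVALID and COMPLETES_CHAR cases.]

#include <string.h>
#include <wchar.h>

enum byte_class { BYTE_INCOMPLETE, BYTE_INVALID, BYTE_COMPLETES_CHAR };

/* Feed one byte extracted from a U+DCxx code into the current charset's
   decoder, using a private shift state and a buffer of pending bytes.  */
static enum byte_class
classify_escaped_byte (char byte, char *buf, size_t *buflen, mbstate_t *ps)
{
  wchar_t wc;

  buf[(*buflen)++] = byte;
  switch (mbrtowc (&wc, buf + *buflen - 1, 1, ps))
    {
    case (size_t) -2:
      return BYTE_INCOMPLETE;        /* keep the byte buffered */
    case (size_t) -1:
      memset (ps, 0, sizeof *ps);    /* reset shift state after bad input */
      return BYTE_INVALID;           /* emit buffered bytes verbatim: they
                                        round-trip to U+DCxx again */
    default:
      return BYTE_COMPLETES_CHAR;    /* emit the buffer as a ^N sequence */
    }
}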
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 22 17:12, Andy Koppe wrote:
> 2009/9/22 Corinna Vinschen:
> >> Therefore, when converting a UTF-16 Windows filename to the current
> >> charset, 0xDC?? words should be treated like any other UTF-16 word
> >> that can't be represented in the current charset: it should be encoded
> >> as a ^N sequence.
> >
> > How? Just like the incoming multibyte character didn't represent a valid
> > UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
> > Therefore, the ^N conversion will fail since U+DCxx can't be converted
> > to valid UTF-8.
>
> True, but that's an implementation issue rather than a design issue,
> i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
> than invoke the __utf8 functions. Shall I look into creating a patch?

Well, sure I'm interested to see that patch (lazy me), but please note
that we need a snail mailed copyright assignment per
http://cygwin.com/assign.txt from you before we can apply any
significant patches.  Sorry for the hassle.

Hmm... maybe it's not that complicated.  The ^N case checks for a valid
UTF-8 lead byte right now.  The U+DCxx case could be handled by
generating (in sys_cp_wcstombs) and recognizing (in sys_cp_mbstowcs) a
non-valid lead byte, like 0xff.

> >> This won't work correctly, because different POSIX filenames will map
> >> to the same Windows filename. For example, the filenames "\xC3\xA4"
> >> (valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
> >> represents a-umlaut in 8859-1), will both map to Windows filename
> >> "U+00C4", i.e a-umlaut in UTF-16. Furthermore, after creating a file
> >> called "\xC4", a readdir() would show that file as "\xC3\xA4".
> >
> > Right, but using your above suggestion will also lead to another filename
> > in readdir, it would just be \x0e\xsome\xthing.
>
> I don't think the suggestion above is directly relevant to the problem
> I tried to highlight here.
>
> Currently, with UTF-8 filename encodings, "\xC3\xA4" turns into U+00C4
> on disk, while "\xC4" turns into U+DCC4, and converting back yields
> the original separate filenames.

Well, right now it doesn't exactly.

> If I understand your proposal
> correctly, both "\xC3\xA4" and "\xC4" would turn into U+00C4, hence
> converting back would yield "\xC3\xA4" for both. This is wrong. Those
> filenames shouldn't be clobbering each other, and a filename shouldn't
> change between open() and readdir(), certainly not without switching
> charset inbetween.

I see your point.  I was more thinking along the lines of how likely
that clobbering is, apart from pathological testcases.

> Having said that, if you did switch charset from UTF-8 e.g. to
> ISO-8859-1, the on-disk U+DCC4 would indeed turn into
> "\x0E\xsome\xthing". However, that issue applies to any UTF-16

You don't have to switch the charset.  Assume you're using any
non-singlebyte charset in which \xC4 is the start of a double- or
multibyte sequence.

  open ("\xC4"); close; readdir();

will return "\x0E\xsome\xthing" on readdir.  Only singlebyte charsets
are off the hook.  So, your proposal to switch to the default ANSI
codepage for the C locale would be good for most western languages, but
it would still leave the eastern language users with double-byte
charsets behind.

Note that I'm not as opposed to your proposal to use the ANSI codepage
as before this discussion.  But I would like to see that the solution
works for most eastern language users as well.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
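[Editorial note: a hypothetical sketch of the scheme Corinna outlines,
representing U+DCxx on the multibyte side as a ^N sequence introduced by
the never-valid UTF-8 lead byte 0xff.  The exact byte layout and the
function names are illustrative, not what Cygwin actually implements.]

#include <stddef.h>
#include <wchar.h>

#define ESC_CHAR 0x0e  /* ^N */

/* sys_cp_wcstombs side: U+DCxx -> ^N 0xff xx */
static int
encode_dcxx (wchar_t wc, unsigned char *out)
{
  if ((wc & 0xff00) != 0xdc00)
    return 0;                          /* not an escaped invalid byte */
  out[0] = ESC_CHAR;
  out[1] = 0xff;                       /* never a valid UTF-8 lead byte */
  out[2] = (unsigned char) (wc & 0xff);
  return 3;
}

/* sys_cp_mbstowcs side: ^N 0xff xx -> U+DCxx */
static int
decode_dcxx (const unsigned char *in, size_t len, wchar_t *wc)
{
  if (len < 3 || in[0] != ESC_CHAR || in[1] != 0xff)
    return 0;
  *wc = (wchar_t) (0xdc00 | in[2]);
  return 3;
}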
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
2009/9/22 Corinna Vinschen:
>> > As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
>> > transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
>> > The problem now is that readdir() will return the transposed characters
>> > as if they are the original characters.
>>
>> Yep, that's where the bug is. Those 0xDC?? words represent invalid
>> UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.
>>
>> Therefore, when converting a UTF-16 Windows filename to the current
>> charset, 0xDC?? words should be treated like any other UTF-16 word
>> that can't be represented in the current charset: it should be encoded
>> as a ^N sequence.
>
> How? Just like the incoming multibyte character didn't represent a valid
> UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
> Therefore, the ^N conversion will fail since U+DCxx can't be converted
> to valid UTF-8.

True, but that's an implementation issue rather than a design issue,
i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
than invoke the __utf8 functions. Shall I look into creating a patch?

>> > So it looks like the current mechanism to handle invalid multibyte
>> > sequences is too complicated for us.  As far as I can see, it would be
>> > much simpler and less error prone to translate the invalid bytes simply
>> > to the equivalent UTF-16 value.  That creates filenames with UTF-16
>> > values from the ISO-8859-1 range.
>>
>> This won't work correctly, because different POSIX filenames will map
>> to the same Windows filename. For example, the filenames "\xC3\xA4"
>> (valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
>> represents a-umlaut in 8859-1), will both map to Windows filename
>> "U+00C4", i.e a-umlaut in UTF-16. Furthermore, after creating a file
>> called "\xC4", a readdir() would show that file as "\xC3\xA4".
>
> Right, but using your above suggestion will also lead to another filename
> in readdir, it would just be \x0e\xsome\xthing.

I don't think the suggestion above is directly relevant to the problem
I tried to highlight here.

Currently, with UTF-8 filename encodings, "\xC3\xA4" turns into U+00C4
on disk, while "\xC4" turns into U+DCC4, and converting back yields the
original separate filenames. If I understand your proposal correctly,
both "\xC3\xA4" and "\xC4" would turn into U+00C4, hence converting
back would yield "\xC3\xA4" for both. This is wrong. Those filenames
shouldn't be clobbering each other, and a filename shouldn't change
between open() and readdir(), certainly not without switching charset
inbetween.

Having said that, if you did switch charset from UTF-8 e.g. to
ISO-8859-1, the on-disk U+DCC4 would indeed turn into
"\x0E\xsome\xthing". However, that issue applies to any UTF-16
character not in the target charset, not just those funny U+DC?? codes
for representing invalid UTF-8 bytes.

The only way to avoid the POSIX filenames changing depending on locale
would be to assume UTF-8 for filenames no matter the locale charset.
That's an entirely different can of worms though, extending the
compatibility problems discussed on the "The C locale" thread to all
non-UTF-8 locales, and putting the onus for converting filenames on
applications.

Andy
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 21 19:54, Andy Koppe wrote:
> 2009/9/21 Corinna Vinschen:
> > As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
> > transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
> > The problem now is that readdir() will return the transposed characters
> > as if they are the original characters.
>
> Yep, that's where the bug is. Those 0xDC?? words represent invalid
> UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.
>
> Therefore, when converting a UTF-16 Windows filename to the current
> charset, 0xDC?? words should be treated like any other UTF-16 word
> that can't be represented in the current charset: it should be encoded
> as a ^N sequence.

How?  Just like the incoming multibyte character didn't represent a
valid UTF-8 char, a single U+DCxx value does not represent a valid
UTF-16 char.  Therefore, the ^N conversion will fail since U+DCxx can't
be converted to valid UTF-8.

> > So it looks like the current mechanism to handle invalid multibyte
> > sequences is too complicated for us.  As far as I can see, it would be
> > much simpler and less error prone to translate the invalid bytes simply
> > to the equivalent UTF-16 value.  That creates filenames with UTF-16
> > values from the ISO-8859-1 range.
>
> This won't work correctly, because different POSIX filenames will map
> to the same Windows filename. For example, the filenames "\xC3\xA4"
> (valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
> represents a-umlaut in 8859-1), will both map to Windows filename
> "U+00C4", i.e a-umlaut in UTF-16. Furthermore, after creating a file
> called "\xC4", a readdir() would show that file as "\xC3\xA4".

Right, but using your above suggestion will also lead to another
filename in readdir, it would just be \x0e\xsome\xthing.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
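[Editorial note: a minimal sketch of the transposition described in
this exchange: an invalid byte b >= 0x80 travels through UTF-16 as
0xDC00 | b, landing in the 0xDC80..0xDCFF low-surrogate range, and is
recovered by masking.  Illustrative only, not Cygwin's actual
sys_cp_* code.]

#include <wchar.h>

static wchar_t
escape_invalid_byte (unsigned char b)  /* b >= 0x80, invalid in charset */
{
  return (wchar_t) (0xdc00 | b);       /* e.g. 0xC4 -> U+DCC4 */
}

static int
unescape_invalid_byte (wchar_t wc, unsigned char *b)
{
  if (wc >= 0xdc80 && wc <= 0xdcff)    /* escaped invalid byte */
    {
      *b = (unsigned char) (wc & 0xff);
      return 1;
    }
  return 0;                            /* ordinary UTF-16 code unit */
}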
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
2009/9/21 Corinna Vinschen:
>> % cat t.c
>> int main() {
>>     fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
>>     fopen("b-\xF6\xE4\xFC\xDFz", "w");
>>     fopen("c-\xF6\xE4\xFC\xDFzz", "w");
>>     fopen("d-\xF6\xE4\xFC\xDFzzz", "w");
>>     fopen("e-\xF6\xE4\xFC\xDF\xF6\xE4\xFC\xDF", "w");
>>     return 0;
>> }
>
> Ok, I see what happens.  The problem is that the mechanism which is
> supposed to handle invalid multibyte sequences handles the first such
> byte, but misses to reset the multibyte shift state after the byte has
> been handled.  Basically, resetting the shift state after such a
> sequence has been encountered fixes that problem.

Great!

> Unfortunately this is only the first half of a solution.  This is what
> `ls' prints after running t:
>
> $ ls -l --show-control-chars
> total 21
> -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 a-öäüß
> -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 c-öäüßzz
> -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 d-öäüßzzz
> -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 e-öäüßöäüß
>
> But this is what ls prints when setting $LANG to something "non-C":
>
> $ setenv LANG en            (implies codepage 1252)
> $ ls -l --show-control-chars
> ls: cannot access a-öäüß: No such file or directory
> ls: cannot access c-öäüßzz: No such file or directory
> ls: cannot access d-öäüßzzz: No such file or directory
> ls: cannot access e-öäüßöäüß: No such file or directory
> total 21
> -? ? ? ? ? ? a-öäüß
> -? ? ? ? ? ? c-öäüßzz
> -? ? ? ? ? ? d-öäüßzzz
> -? ? ? ? ? ? e-öäüßöäüß

Btw, the same thing will happen with en.C-ISO-8859-1 or C.ASCII too.

> As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
> transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
> The problem now is that readdir() will return the transposed characters
> as if they are the original characters.

Yep, that's where the bug is. Those 0xDC?? words represent invalid
UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.

Therefore, when converting a UTF-16 Windows filename to the current
charset, 0xDC?? words should be treated like any other UTF-16 word
that can't be represented in the current charset: it should be encoded
as a ^N sequence.

> ls uses some mbtowc function
> to create a valid widechar string, and then uses the resulting widechar
> string in some wctomb function to call stat().

It's not 'ls' that does that conversion. On the POSIX side, filenames
are simply sequences of bytes, hence 'ls' would be very wrong to do any
conversion between readdir() and stat(). No, it's stat() itself
converting the CP1252 sequence "a-öäüß" to UTF-16, which yields
L"a-öäüß". This does not contain the 0xDC?? codepoints that the actual
filename contained, hence stat() fails.

> So it looks like the current mechanism to handle invalid multibyte
> sequences is too complicated for us.  As far as I can see, it would be
> much simpler and less error prone to translate the invalid bytes simply
> to the equivalent UTF-16 value.  That creates filenames with UTF-16
> values from the ISO-8859-1 range.

This won't work correctly, because different POSIX filenames will map
to the same Windows filename. For example, the filenames "\xC3\xA4"
(valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
represents a-umlaut in 8859-1), will both map to Windows filename
"U+00C4", i.e a-umlaut in UTF-16. Furthermore, after creating a file
called "\xC4", a readdir() would show that file as "\xC3\xA4".

Note also that invalid UTF-8 sequences would be much less of an issue
if the C locale didn't mix UTF-8 filenames with a ISO-8859-1 console.
They'd still occur e.g. when unpacking a tarball with ISO-encoded
filenames while a UTF-8 locale is active. However, that sort of
situation is not handled well on Linux either.

Regards,
Andy
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 16 00:38, Lapo Luchini wrote:
> Andy Koppe wrote:
> > Hmm, we've lost the \xDF somewhere, and I'd guess it was when the
> > filename got translated to UTF-16 in fopen(), which would explain what
> > you're seeing
>
> More data: it's not simply "the last character", is something more
> complex than that.
>
> % cat t.c
> int main() {
>     fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
>     fopen("b-\xF6\xE4\xFC\xDFz", "w");
>     fopen("c-\xF6\xE4\xFC\xDFzz", "w");
>     fopen("d-\xF6\xE4\xFC\xDFzzz", "w");
>     fopen("e-\xF6\xE4\xFC\xDF\xF6\xE4\xFC\xDF", "w");
>     return 0;
> }

Ok, I see what happens.  The problem is that the mechanism which is
supposed to handle invalid multibyte sequences handles the first such
byte, but misses to reset the multibyte shift state after the byte has
been handled.  Basically, resetting the shift state after such a
sequence has been encountered fixes that problem.

Unfortunately this is only the first half of a solution.  This is what
`ls' prints after running t:

$ ls -l --show-control-chars
total 21
-rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 a-öäüß
-rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 c-öäüßzz
-rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 d-öäüßzzz
-rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 e-öäüßöäüß

But this is what ls prints when setting $LANG to something "non-C":

$ setenv LANG en            (implies codepage 1252)
$ ls -l --show-control-chars
ls: cannot access a-öäüß: No such file or directory
ls: cannot access c-öäüßzz: No such file or directory
ls: cannot access d-öäüßzzz: No such file or directory
ls: cannot access e-öäüßöäüß: No such file or directory
total 21
-? ? ? ? ? ? a-öäüß
-? ? ? ? ? ? c-öäüßzz
-? ? ? ? ? ? d-öäüßzzz
-? ? ? ? ? ? e-öäüßöäüß

As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
The problem now is that readdir() will return the transposed characters
as if they are the original characters.  ls uses some mbtowc function
to create a valid widechar string, and then uses the resulting widechar
string in some wctomb function to call stat().  However, *that* string
will use a valid multibyte sequence to represent the character and the
resulting filename is suddenly different from the actual filename on
disk, and stat returns with errno set to ENOENT.  Since the conversion
from and to is independent of each other, there's no way to detect
whether the incoming string of a wctomb was originally based on a
transposed character or not.  I'm not sure if I could explain this
clearly enough...

So it looks like the current mechanism to handle invalid multibyte
sequences is too complicated for us.  As far as I can see, it would be
much simpler and less error prone to translate the invalid bytes simply
to the equivalent UTF-16 value.  That creates filenames with UTF-16
values from the ISO-8859-1 range.

I tested this with the files created by the above testcase.  While the
filenames appeared to be different dependent on the used charset, ls
always handled the files gracefully.

Any objections?  I can also just check it in and the entire locale
challenged part of the community can test it...


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
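[Editorial note: a sketch of the alternative Corinna proposes above:
translate an invalid byte to the UTF-16 value of the same number (the
ISO-8859-1 range) instead of transposing it to 0xDC00 | b.
Illustrative only; as the replies above point out, this mapping is not
injective, e.g. valid UTF-8 "\xC3\xA4" and invalid "\xC4" both end up
as U+00C4.]

#include <wchar.h>

static wchar_t
invalid_byte_to_utf16 (unsigned char b)
{
  return (wchar_t) b;  /* e.g. 0xC4 -> U+00C4, a-umlaut in ISO-8859-1 */
}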
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Andy Koppe wrote:
> Hmm, we've lost the \xDF somewhere, and I'd guess it was when the
> filename got translated to UTF-16 in fopen(), which would explain what
> you're seeing

More data: it's not simply "the last character", is something more
complex than that.

% cat t.c
int main() {
    fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
    fopen("b-\xF6\xE4\xFC\xDFz", "w");
    fopen("c-\xF6\xE4\xFC\xDFzz", "w");
    fopen("d-\xF6\xE4\xFC\xDFzzz", "w");
    fopen("e-\xF6\xE4\xFC\xDF\xF6\xE4\xFC\xDF", "w");
    return 0;
}
% gcc -o t t.c
% ./t
% find .
.
./a-???
./b-???
./c-???
./d-???
./e-???
./t.c
./t.exe

It seems that once one "high bit set" byte is encountered, everything
past the last of them (itself included) is lost.

Also, I can confirm this works too:

% rm a-$'\366'$'\344'$'\374'$'\337'

but also this, since the last one doesn't count:

% rm a-$'\366'$'\344'$'\374'$'\336'

BTW: I didn't know about that kind of escaping, but zsh auto-completed
that for me (excluding the last character, of course)

-- 
Lapo Luchini - http://lapo.it/
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
2009/9/10 Lapo Luchini:
> But the real problem with that test is not really what shows and how,
> the biggest problem is that it seems that filenames created with a
> "wrong" filename are quite limited in usage and can't seemingly be
> deleted.
>
> % export LANG=en_EN.UTF-8
> % cat t.c
> #include <stdio.h>
> int main() {
>     fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
>     fopen("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F", "w"); //UTF-8
>     return 0;
> }
> % gcc -o t t.c
> % mkdir test ; cd test ; ../t ; cd ..
> % ls -l test
> ls: cannot access test/a-▒▒▒: No such file or directory
> total 0
> -? ? ? ? ? ? a-▒▒▒
> -rw-r--r-- 1 lapo None 0 2009-09-10 21:19 b-öäüß
> % find test
> test
> test/a-???
> test/b-öäüß
> % find test -delete
> find: cannot delete `test/a-\366\344\374': No such file or directory

Hmm, we've lost the \xDF somewhere, and I'd guess it was when the
filename got translated to UTF-16 in fopen(), which would explain what
you're seeing: 'find' reads the filename correctly, invokes remove()
on it, which translates it to UTF-16 again, whereby we lose a second
byte, so we're down to a-\366\344, which can't be deleted because it
doesn't exist.

> remove("a-\xF6\xE4\xFC\xDF");

Now here we start with the full name again, so if we lose the last
byte we get what's actually on disk, hence the call succeeds.

Bytes that don't contribute to valid UTF-8 characters get mapped to a
certain subrange of UTF-16 low surrogates at 0xDC80, which is a clever
trick for encoding such bytes into UTF-16 and getting them out again
after decoding. I stared at the code for this in sys_cp_mbstowcs for a
bit, but haven't spotted where those missing bytes might have gone.

Andy
[1.7] Invalid UTF8 while creating a file -> cannot delete?
After a few problems with monotone's unit tests on Cygwin-1.7, I began
searching and experimenting a bit with the new 1.7 support for wide
chars. I also read the full thread about its last change:
http://www.cygwin.com/ml/cygwin/2009-05/msg00344.html
which really makes some sense to me (when I create a file from the
console I want "ls" to show back that file to me with the same
encoding).

Problem is, that unit test assumes filenames are "raw data" and tries
to create three types of filenames: ISO-8859-1, EUC-JP and UTF-8.
Except on OSX, where it only tries UTF-8 as that's the disk format.

Now we have a UTF-16 disk format, except the library is using the
LANG value from process start to initialize some LANG-to-UTF16
conversion, as far as I understood, so there's not really one "correct"
format: it depends on the LANG env value when the test unit is
launched.

OK, that's a side issue, since I can probably modify the tests to
always be launched with LANG=C instead of using the current value, so
that at least it is consistent. And then maybe remove the creation of
the ISO-8859-1 and EUC-JP tests just like on OSX. Which could be
correct... but a bit less so than on OSX itself, where that is really
"the format" and not "the DEFAULT format which could be overridden with
a correct setlocale".

But the real problem with that test is not really what shows and how,
the biggest problem is that it seems that filenames created with a
"wrong" filename are quite limited in usage and can't seemingly be
deleted.

% export LANG=en_EN.UTF-8
% cat t.c
#include <stdio.h>
int main() {
    fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
    fopen("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F", "w"); //UTF-8
    return 0;
}
% gcc -o t t.c
% mkdir test ; cd test ; ../t ; cd ..
% ls -l test
ls: cannot access test/a-▒▒▒: No such file or directory
total 0
-? ? ? ? ? ? a-▒▒▒
-rw-r--r-- 1 lapo None 0 2009-09-10 21:19 b-öäüß
% find test
test
test/a-???
test/b-öäüß
% find test -delete
find: cannot delete `test/a-\366\344\374': No such file or directory
find: cannot delete `test': Directory not empty
% find test
test
test/a-???

Now... I don't know how exactly `find` works, but it seems strange to
me that it isn't capable of deleting something it is capable of
listing. Also it seems strange that `ls` is not capable of stat-ing
something it's capable of listing.

Yep, I do know that filename is "broken" in the first place, but since
in the Unix world such stuff can happen, as filenames are really raw
data, I think probably an error on file creation would be better than
creating a file that can't subsequently be stat-ed or even unlinked.

% cat u.c
#include <stdio.h>
int main() {
    remove("a-\xF6\xE4\xFC\xDF");
    remove("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F");
    return 0;
}
% gcc -o u u.c

OK, a program using a similarly-broken filename can delete it, but the
fact it can't be deleted with "normal" tools is a bit of an
inconvenience...

-- 
Lapo Luchini - http://lapo.it/

“Premature optimisation is the root of all evil in programming.”
(C. A. R. Hoare)