Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-25 Thread Robert Pendell
On Wed, Sep 23, 2009 at 5:30 PM, Ross Smith wrote:
 Corinna Vinschen wrote:

 However, if we default to UTF-8 for a subset of languages anyway, it
 gets even more interesting to ask, why not for all languages?  Isn't it
 better in the long run to have the same default for all Cygwin
 installations?

 I'm really wondering if we shouldn't simply default to UTF-8 as charset
 throughout, in the application, the console, and for the filename
 conversion.  Yes, not all applications will work OOTB with chars > 0x7f,
 but it was always a bug to make any assumptions for non-ASCII chars
 in the C locale.  Applications can be fixed, right?

 In support of this plan, it occurs to me that any command line
 applications that don't speak UTF-8 would presumably be showing the
 same behaviour on Linux (e.g. odd column widths). Since one of Cygwin's
 main goals is providing a Linux-like environment on Windows, I don't
 think Cygwin developers should feel obliged to go out of their way to
 do _better_ than Linux in this regard.

 -- Ross Smith



I don't have anything to add on the technical side of things, but I
will note that most Linux distributions have been defaulting to UTF-8
lately.  I think it would be highly appropriate to default to UTF-8 in
Cygwin.

Robert Pendell
shi...@elite-systems.org

A perfect world is one of chaos.

Thawte Web of Trust Notary
CAcert Assurer




Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-23 Thread Andy Koppe
2009/9/22 Corinna Vinschen:
  Therefore, when converting a UTF-16 Windows filename to the current
  charset, 0xDC?? words should be treated like any other UTF-16 word
  that can't be represented in the current charset: it should be encoded
  as a ^N sequence.

(I started writing this before seeing your patch to the singlebyte
codepage tables, which makes plenty of sense. Here goes anyway.)

Having actually looked at strfuncs.cc, my diagnosis was too
simplistic, because the U+DC?? codes are used not only for invalid
UTF-8 bytes, but for invalid bytes in any charset. This even includes
CP1252, which has a few holes in the 0x80..0x9F range.

Therefore, the complete solution would be something like this: when
sys_cp_wcstombs comes across a 0xDC?? code, it checks whether the byte
it encodes is indeed an invalid byte in the current charset. If it is,
it translates it into that invalid byte, because on the way back it
would once again be turned into the same 0xDC?? code. If the byte
would represent (part of) a valid character, however, it would need to
be encoded as a ^N sequence to ensure correct roundtripping.

Now that shouldn't be too difficult to implement for singlebyte
charsets, but it gets somewhat hairy for multibyte charsets, including
UTF-8 itself. Here's how I think it could be done though:

In sys_cp_wcstombs:

* On encountering a DC?? code, extract the encoded byte, and feed it
into f_mbtowc. A private mbstate for this is needed, starting in the
initial state for each filename. Switch on the result of f_mbtowc:
** case -2 (incomplete sequence): add the byte to a buffer for this purpose
** case -1 (invalid sequence): copy anything already in the buffer
plus the current byte into the target filename, as we can be sure that
they'll turn back into U+DCbb again on the way back.
** case 0 (valid sequence): encode buffer contents and current byte
as ^N codes that don't represent valid UTF-8

* When encountering a non-DC?? code, copy any bytes left in the buffer
into the target filename.

Unfortunately the latter point still leaves a loophole, in case the
incomplete sequence from the buffer and the subsequent bytes combine
into something valid. Singlebyte charsets aren't affected though,
because they don't have continuation bytes. Nor is UTF-8, because it
was designed such that continuation bytes are distinct from initial
bytes. Which leaves the DBCS charsets.
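To make the singlebyte case concrete, here's a rough sketch in C. The
names (demo_mbtowc, handle_dcxx) and the toy validity rule are invented
for illustration; this is not Cygwin's actual sys_cp_wcstombs, and the
case -2 buffering for multibyte charsets is left out:

#include <stdio.h>

/* Stand-in for the charset's f_mbtowc: bytes 0x80..0x9F are invalid,
   roughly like the holes in CP1252.  Returns -1 for an invalid byte,
   1 for a valid singlebyte character. */
static int demo_mbtowc (unsigned char b)
{
  return (b >= 0x80 && b <= 0x9f) ? -1 : 1;
}

/* Convert one U+DCxx word back to bytes; returns bytes written. */
static int handle_dcxx (unsigned int wc, unsigned char *out)
{
  unsigned char b = wc & 0xff;   /* extract the escaped byte */
  if (demo_mbtowc (b) == -1)
    {
      out[0] = b;                /* roundtrips to U+DCxx again */
      return 1;
    }
  out[0] = 0x0e;                 /* valid in the charset: needs a ^N */
  out[1] = b;                    /* escape; the real escape format is */
  return 2;                      /* more involved than this */
}

int main (void)
{
  unsigned char buf[4];
  int n = handle_dcxx (0xdc81, buf);  /* 0x81 invalid: raw byte */
  printf ("0xDC81 -> %d byte(s), first 0x%02x\n", n, buf[0]);
  n = handle_dcxx (0xdca4, buf);      /* 0xA4 valid: ^N escape */
  printf ("0xDCA4 -> %d byte(s), first 0x%02x\n", n, buf[0]);
  return 0;
}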

However, it rather looks like DBCSs are an intractable problem here in
any case, because of issues like this:

http://support.microsoft.com/kb/170559: "There are some codes that are
not matched one-to-one between Shift-JIS (Japanese character set
supported by MS) and Unicode. When an application calls
MultiByteToWideChar() and WideCharToMultiByte() to perform code
conversion between Shift-JIS and Unicode, the function returns the
wrong code value in some cases."

Which leaves me scratching my head regarding the C locale. More later ...

Andy




Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-23 Thread Corinna Vinschen
On Sep 22 19:07, Corinna Vinschen wrote:
 On Sep 22 17:12, Andy Koppe wrote:
  True, but that's an implementation issue rather than a design issue,
  i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
  than invoke the __utf8 functions. Shall I look into creating a patch?
 [...]
 Hmm... maybe it's not that complicated.  The ^N case checks for a valid
 UTF-8 lead byte right now.  The U+DCxx case could be handled by
 generating (in sys_cp_wcstombs) and recognizing (in sys_cp_mbstowcs) a
 non-valid lead byte, like 0xff.

I applied a patch for that.  It wasn't very tricky, but while doing it,
I found a couple of annoyances in the conversion functions related to
the invalid character handling.  So the patch is somewhat bigger than
anticipated.

 Only singlebyte charsets are off the hook.  So, your proposal to switch
 to the default ANSI codepage for the C locale would be good for most
 western languages, but it would still leave the eastern language users
 with double-byte charsets behind.
 
 Note that I'm not as opposed to your proposal to use the ANSI codepage
 as before this discussion.  But I would like to see that the solution
 works for most eastern language users as well.

I have a local patch ready to use the ANSI codepage by default in the
C locale.  It appears to work nicely and has the additional positive
side effect of simplifying the code in a few places.

If I only knew that eastern language users could happily live with
this change as well!

*** REQUEST FOR HELP ***

Is anybody here set up to build the Cygwin DLL *and* working with an
eastern language Windows, namely using the codepages 932 (SJIS), 936
(GBK), 949 (EUC-KR), or 950 (Big5)?

If so, please build your own Cygwin DLL using the latest from CVS plus
the attached patch, and test if this setting works for you.

The change will result in using your default Windows codepage in the C
locale in all components, that is, in the application itself, as well as
in the console and the filename conversion.  In contrast to the current
implementation using UTF-8 for filename conversion by default, there
will be no state anymore in which the application, the console window,
and the filename conversion routine have a different idea of the charset
to use(*).
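
For anyone testing, a quick way to see which charset the C locale
resolves to is nl_langinfo(CODESET).  With the patch applied I'd expect
it to report the ANSI codepage instead of ASCII; that expectation is
mine, not a documented guarantee:

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main (void)
{
  setlocale (LC_ALL, "C");
  /* prints e.g. "CP1252" under the proposed default, "ASCII" before */
  printf ("C locale charset: %s\n", nl_langinfo (CODESET));
  return 0;
}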


Thanks in advance,
Corinna


(*) Except when the application switches the console to the alternate
charset, which usually happens when it's going to print semi-graphical
frame and block characters.


Index: newlib/libc/locale/locale.c
===
RCS file: /cvs/src/src/newlib/libc/locale/locale.c,v
retrieving revision 1.25
diff -u -p -r1.25 locale.c
--- newlib/libc/locale/locale.c 25 Aug 2009 18:47:24 -  1.25
+++ newlib/libc/locale/locale.c 23 Sep 2009 11:53:02 -
@@ -61,6 +61,11 @@ backward compatibility with older implem
 xxx in [437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874, 1125,
 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258].
 
+Instead of "C-", you can specify also "C.".  Both variations allow
+to specify language neutral locales while using other charsets than ASCII,
+for instance "C.UTF-8", which keeps all settings as in the C locale,
+but uses the UTF-8 charset.
+
 Even when using POSIX locale strings, the only charsets allowed are
"UTF-8", "JIS", "EUCJP", "SJIS", "KOI8-R", "KOI8-U",
"ISO-8859-x" with 1 <= x <= 15, or "CPxxx" with xxx in
@@ -431,9 +436,19 @@ loadlocale(struct _reent *p, int categor
   if (!strcmp (locale, "POSIX"))
     strcpy (locale, "C");
   if (!strcmp (locale, "C"))   /* Default C locale */
+#ifdef __CYGWIN__
+__set_charset_from_codepage (GetACP (), charset);
+#else
    strcpy (charset, "ASCII");
-  else if (locale[0] == 'C' && locale[1] == '-')   /* Old newlib style */
-    strcpy (charset, locale + 2);
+#endif
+  else if (locale[0] == 'C'
+           && (locale[1] == '-'   /* Old newlib style */
+               || locale[1] == '.'))  /* Extension for the C locale to allow
+                                         specifying different charsets while
+                                         sticking to the C locale in terms
+                                         of sort order, etc.  Proposed in
+                                         the Debian project. */
+    strcpy (charset, locale + 2);
   else /* POSIX style */
 {
   char *c = locale;
Index: newlib/libc/stdlib/sb_charsets.c
===
RCS file: /cvs/src/src/newlib/libc/stdlib/sb_charsets.c,v
retrieving revision 1.3
diff -u -p -r1.3 sb_charsets.c
--- newlib/libc/stdlib/sb_charsets.c25 Aug 2009 18:47:24 -  1.3
+++ newlib/libc/stdlib/sb_charsets.c23 Sep 2009 11:53:02 -
@@ -24,17 +24,17 @@ wchar_t __iso_8859_conv[14][0x60] = {
 0x111, 0x144, 0x148, 0xf3, 0xf4, 0x151, 0xf6, 0xf7,
 0x159, 0x16f, 0xfa, 0x171, 0xfc, 0xfd, 0x163, 0x2d9 },
   /* ISO-8859-3 */
-  { 

Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-23 Thread Andy Koppe
2009/9/23 Corinna Vinschen:
 I have a local patch ready to use the ANSI codepage by default in the
 C locale.  It appears to work nicely and has the additional positive
 side effect of simplifying the code in a few places.

 If I only knew that eastern language users could happily live with
 this change as well!

Here's an idea to circumvent the DBCS troubles: default to UTF-8 when
no charset is specified in the locale and the ANSI charset isn't
singlebyte.

Based on the following grounds:
- Full CJK support (and more) out of the box.
- DBCSs can't have worked very well in 1.5 in the first place, because
the shell and most applications weren't aware of double-byte
characters. Hence backward compatibility is less of an issue here.
- Applications that don't (yet) work with UTF-8 are also unlikely to
work correctly with DBCSs.
- Iwamuro Motonori asked for it.
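
The check itself should be cheap.  A sketch of the decision using the
Win32 GetCPInfo API, purely illustrative and not an actual patch:

#include <windows.h>
#include <stdio.h>

int main (void)
{
  CPINFO info;
  UINT acp = GetACP ();
  /* MaxCharSize == 1 means the ANSI codepage is singlebyte */
  if (GetCPInfo (acp, &info) && info.MaxCharSize == 1)
    printf ("default charset: CP%u (singlebyte ANSI codepage)\n", acp);
  else
    printf ("default charset: UTF-8 (codepage %u is multibyte)\n", acp);
  return 0;
}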

Andy




Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-23 Thread Corinna Vinschen
On Sep 23 13:34, Andy Koppe wrote:
 2009/9/23 Corinna Vinschen:
  I have a local patch ready to use the ANSI codepage by default in the
  C locale.  It appears to work nicely and has the additional positive
  side effect of simplifying the code in a few places.
 
  If I only knew that eastern language users could happily live with
  this change as well!
 
 Here's an idea to circumvent the DBCS troubles: default to UTF-8 when
 no charset is specified in the locale and the ANSI charset isn't
 singlebyte.
 
 Based on the following grounds:
 - Full CJK support (and more) out of the box.
 - DBCSs can't have worked very well in 1.5 in the first place, because
 the shell and most applications weren't aware of double-byte
 characters. Hence backward compatibility is less of an issue here.
 - Applications that don't (yet) work with UTF-8 are also unlikely to
 work correctly with DBCSs.
 - Iwamuro Motonori asked for it.

Yeah, I was tinkering with this idea, too, but it's much more tricky to
implement.

I'll think about it.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-23 Thread Corinna Vinschen
On Sep 23 14:43, Corinna Vinschen wrote:
 On Sep 23 13:34, Andy Koppe wrote:
  2009/9/23 Corinna Vinschen:
   I have a local patch ready to use the ANSI codepage by default in the
   C locale.  It appears to work nicely and has the additional positive
   side effect of simplifying the code in a few places.
  
   If I only knew that eastern language users could happily live with
   this change as well!
  
  Here's an idea to circumvent the DBCS troubles: default to UTF-8 when
  no charset is specified in the locale and the ANSI charset isn't
  singlebyte.
  
  Based on the following grounds:
  - Full CJK support (and more) out of the box.
  - DBCSs can't have worked very well in 1.5 in the first place, because
  the shell and most applications weren't aware of double-byte
  characters. Hence backward compatibility is less of an issue here.
  - Applications that don't (yet) work with UTF-8 are also unlikely to
  work correctly with DBCSs.
  - Iwamuro Motonori asked for it.
 
 Yeah, I was tinkering with this idea, too, but it's much more tricky to
 implement.
 
 I'll think about it.

Turns out, it's not complicated at all.

However, if we default to UTF-8 for a subset of languages anyway, it
gets even more interesting to ask, why not for all languages?  Isn't it
better in the long run to have the same default for all Cygwin
installations?

I'm really wondering if we shouldn't simply default to UTF-8 as charset
throughout, in the application, the console, and for the filename
conversion.  Yes, not all applications will work OOTB with chars > 0x7f,
but it was always a bug to make any assumptions for non-ASCII chars
in the C locale.  Applications can be fixed, right?


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-23 Thread Ross Smith

Corinna Vinschen wrote:


However, if we default to UTF-8 for a subset of languages anyway, it
gets even more interesting to ask, why not for all languages?  Isn't it
better in the long run to have the same default for all Cygwin
installations?

I'm really wondering if we shouldn't simply default to UTF-8 as charset
throughout, in the application, the console, and for the filename
conversion.  Yes, not all applications will work OOTB with chars > 0x7f,
but it was always a bug to make any assumptions for non-ASCII chars
in the C locale.  Applications can be fixed, right?


In support of this plan, it occurs to me that any command line
applications that don't speak UTF-8 would presumably be showing the
same behaviour on Linux (e.g. odd column widths). Since one of Cygwin's
main goals is providing a Linux-like environment on Windows, I don't
think Cygwin developers should feel obliged to go out of their way to
do _better_ than Linux in this regard.

-- Ross Smith





Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-22 Thread Corinna Vinschen
On Sep 21 19:54, Andy Koppe wrote:
 2009/9/21 Corinna Vinschen:
  As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
  transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
  The problem now is that readdir() will return the transposed characters
  as if they are the original characters.
 
 Yep, that's where the bug is. Those 0xDC?? words represent invalid
 UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.
 
 Therefore, when converting a UTF-16 Windows filename to the current
 charset, 0xDC?? words should be treated like any other UTF-16 word
 that can't be represented in the current charset: it should be encoded
 as a ^N sequence.

How?  Just like the incoming multibyte character didn't represent a valid
UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
Therefore, the ^N conversion will fail since U+DCxx can't be converted
to valid UTF-8.

  So it looks like the current mechanism to handle invalid multibyte
  sequences is too complicated for us.  As far as I can see, it would be
  much simpler and less error prone to translate the invalid bytes simply
  to the equivalent UTF-16 value.  That creates filenames with UTF-16
  values from the ISO-8859-1 range.
 
 This won't work correctly, because different POSIX filenames will map
 to the same Windows filename. For example, the filenames \xC3\x84
 (valid UTF-8 for A-umlaut) and \xC4 (invalid UTF-8 sequence that
 represents A-umlaut in 8859-1) will both map to Windows filename
 U+00C4, i.e. A-umlaut in UTF-16. Furthermore, after creating a file
 called \xC4, a readdir() would show that file as \xC3\x84.

Right, but using your above suggestion will also lead to another filename
in readdir; it would just be \x0e\xsome\xthing.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-22 Thread Andy Koppe
2009/9/22 Corinna Vinschen:
  As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
  transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
  The problem now is that readdir() will return the transposed characters
  as if they are the original characters.

 Yep, that's where the bug is. Those 0xDC?? words represent invalid
 UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.

 Therefore, when converting a UTF-16 Windows filename to the current
 charset, 0xDC?? words should be treated like any other UTF-16 word
 that can't be represented in the current charset: it should be encoded
 as a ^N sequence.

 How?  Just like the incoming multibyte character didn't represent a valid
 UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
 Therefore, the ^N conversion will fail since U+DCxx can't be converted
 to valid UTF-8.

True, but that's an implementation issue rather than a design issue,
i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
than invoke the __utf8 functions. Shall I look into creating a patch?


  So it looks like the current mechanism to handle invalid multibyte
  sequences is too complicated for us.  As far as I can see, it would be
  much simpler and less error prone to translate the invalid bytes simply
  to the equivalent UTF-16 value.  That creates filenames with UTF-16
  values from the ISO-8859-1 range.

 This won't work correctly, because different POSIX filenames will map
 to the same Windows filename. For example, the filenames \xC3\x84
 (valid UTF-8 for A-umlaut) and \xC4 (invalid UTF-8 sequence that
 represents A-umlaut in 8859-1) will both map to Windows filename
 U+00C4, i.e. A-umlaut in UTF-16. Furthermore, after creating a file
 called \xC4, a readdir() would show that file as \xC3\x84.

 Right, but using your above suggestion will also lead to another filename
 in readdir; it would just be \x0e\xsome\xthing.

I don't think the suggestion above is directly relevant to the problem
I tried to highlight here.

Currently, with UTF-8 filename encoding, \xC3\x84 turns into U+00C4
on disk, while \xC4 turns into U+DCC4, and converting back yields
the original separate filenames. If I understand your proposal
correctly, both \xC3\x84 and \xC4 would turn into U+00C4, hence
converting back would yield \xC3\x84 for both. This is wrong. Those
filenames shouldn't be clobbering each other, and a filename shouldn't
change between open() and readdir(), certainly not without switching
charset in between.

Having said that, if you did switch charset from UTF-8 e.g. to
ISO-8859-1, the on-disk U+DCC4 would indeed turn into
\x0E\xsome\xthing. However, that issue applies to any UTF-16
character not in the target charset, not just those funny U+DC?? codes
for representing invalid UTF-8 bytes.

The only way to avoid the POSIX filenames changing depending on locale
would be to assume UTF-8 for filenames no matter the locale charset.
That's an entirely different can of worms though, extending the
compatibility problems discussed on the "The C locale" thread to all
non-UTF-8 locales, and putting the onus for converting filenames on
applications.

Andy




Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-22 Thread Corinna Vinschen
On Sep 22 17:12, Andy Koppe wrote:
 2009/9/22 Corinna Vinschen:
  Therefore, when converting a UTF-16 Windows filename to the current
  charset, 0xDC?? words should be treated like any other UTF-16 word
  that can't be represented in the current charset: it should be encoded
  as a ^N sequence.
 
  How?  Just like the incoming multibyte character didn't represent a valid
  UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
  Therefore, the ^N conversion will fail since U+DCxx can't be converted
  to valid UTF-8.
 
 True, but that's an implementation issue rather than a design issue,
 i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
 than invoke the __utf8 functions. Shall I look into creating a patch?

Well, sure I'm interested to see that patch (lazy me), but please note
that we need a snail-mailed copyright assignment per
http://cygwin.com/assign.txt from you before we can apply any significant
patches.  Sorry for the hassle.

Hmm... maybe it's not that complicated.  The ^N case checks for a valid
UTF-8 lead byte right now.  The U+DCxx case could be handled by
generating (in sys_cp_wcstombs) and recognizing (in sys_cp_mbstowcs) a
non-valid lead byte, like 0xff.
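
In other words, something along these lines; dcxx_to_mb and mb_to_dcxx
are made-up names and the actual escape format may differ:

#include <stdio.h>
#include <wchar.h>

/* Encode U+DCxx as "^N 0xff <byte>": 0xff is never a valid UTF-8
   lead byte, so the sequence can't be mistaken for a real ^N char. */
static int dcxx_to_mb (wchar_t wc, unsigned char *out)
{
  if ((wc & 0xff00) != 0xdc00)
    return 0;              /* not an escaped invalid byte */
  out[0] = 0x0e;           /* ^N */
  out[1] = 0xff;           /* non-valid lead byte as marker */
  out[2] = wc & 0xff;      /* the original invalid byte */
  return 3;
}

/* Recognize the marker on the way back and restore U+DCxx. */
static int mb_to_dcxx (const unsigned char *in, wchar_t *wc)
{
  if (in[0] == 0x0e && in[1] == 0xff)
    {
      *wc = 0xdc00 | in[2];
      return 3;
    }
  return 0;
}

int main (void)
{
  unsigned char mb[3];
  wchar_t wc = 0;
  int n = dcxx_to_mb (0xdcc4, mb);
  mb_to_dcxx (mb, &wc);
  printf ("%d bytes, roundtrip 0x%04x\n", n, (unsigned) wc);
  return 0;
}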

  This won't work correctly, because different POSIX filenames will map
  to the same Windows filename. For example, the filenames \xC3\x84
  (valid UTF-8 for A-umlaut) and \xC4 (invalid UTF-8 sequence that
  represents A-umlaut in 8859-1) will both map to Windows filename
  U+00C4, i.e. A-umlaut in UTF-16. Furthermore, after creating a file
  called \xC4, a readdir() would show that file as \xC3\x84.
 
  Right, but using your above suggestion will also lead to another filename
  in readdir; it would just be \x0e\xsome\xthing.
 
 I don't think the suggestion above is directly relevant to the problem
 I tried to highlight here.
 
 Currently, with UTF-8 filename encoding, \xC3\x84 turns into U+00C4
 on disk, while \xC4 turns into U+DCC4, and converting back yields
 the original separate filenames.

Well, right now it doesn't exactly.

 If I understand your proposal
 correctly, both \xC3\x84 and \xC4 would turn into U+00C4, hence
 converting back would yield \xC3\x84 for both. This is wrong. Those
 filenames shouldn't be clobbering each other, and a filename shouldn't
 change between open() and readdir(), certainly not without switching
 charset in between.

I see your point.  I was more thinking along the lines of how likely
that clobbering is, apart from pathological testcases.

 Having said that, if you did switch charset from UTF-8 e.g. to
 ISO-8859-1, the on-disk U+DCC4 would indeed turn into
 \x0E\xsome\xthing. However, that issue applies to any UTF-16

You don't have to switch the charset.  Assume you're using any
non-singlebyte charset in which \xC4 is the start of a double- or
multibyte sequence.  open ("\xC4"); close; readdir(); will return
"\x0E\xsome\xthing" on readdir.

Only singlebyte charsets are off the hook.  So, your proposal to switch
to the default ANSI codepage for the C locale would be good for most
western languages, but it would still leave the eastern language users
with double-byte charsets behind.

Note that I'm not as opposed to your proposal to use the ANSI codepage
as before this discussion.  But I would like to see that the solution
works for most eastern language users as well.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-21 Thread Corinna Vinschen
On Sep 16 00:38, Lapo Luchini wrote:
 Andy Koppe wrote:
  Hmm, we've lost the \xDF somewhere, and I'd guess it was when the
  filename got translated to UTF-16 in fopen(), which would explain what
  you're seeing
 
 More data: it's not simply the last character, it's something more
 complex than that.
 
 % cat t.c
 int main() {
 fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
 fopen("b-\xF6\xE4\xFC\xDFz", "w");
 fopen("c-\xF6\xE4\xFC\xDFzz", "w");
 fopen("d-\xF6\xE4\xFC\xDFzzz", "w");
 fopen("e-\xF6\xE4\xFC\xDF\xF6\xE4\xFC\xDF", "w");
 return 0;
 }

Ok, I see what happens.  The problem is that the mechanism which is
supposed to handle invalid multibyte sequences handles the first such
byte, but fails to reset the multibyte shift state after the byte has
been handled.  Basically, resetting the shift state after such a
sequence has been encountered fixes that problem.
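
For illustration, the same pitfall exists with the standard mbrtowc
interface; the memset after an invalid sequence is the reset in
question.  Toy code assuming a UTF-8 locale is available, not the
actual Cygwin fix:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main (void)
{
  setlocale (LC_ALL, "en_US.UTF-8");    /* assumed to exist */
  const char *s = "a-\xF6\xE4\xFC\xDF"; /* invalid UTF-8 bytes */
  size_t i = 0, len = strlen (s);
  mbstate_t ps;
  wchar_t wc;
  memset (&ps, 0, sizeof ps);
  while (i < len)
    {
      size_t n = mbrtowc (&wc, s + i, len - i, &ps);
      if (n == (size_t) -1)             /* invalid byte */
        {
          printf ("invalid byte 0x%02x\n", (unsigned char) s[i]);
          memset (&ps, 0, sizeof ps);   /* the crucial reset */
          i++;
        }
      else if (n == (size_t) -2)        /* incomplete at end of input */
        break;
      else
        {
          printf ("char U+%04x\n", (unsigned) wc);
          i += n ? n : 1;
        }
    }
  return 0;
}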

Unfortunately this is only the first half of a solution.  This is what
`ls' prints after running t:

  $ ls -l --show-control-chars
  total 21
  -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 a-öäüß
  -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 c-öäüßzz
  -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 d-öäüßzzz
  -rw-r--r-- 1 corinna vinschen 0 Sep 21 17:35 e-öäüßöäüß

But this is what ls prints when setting $LANG to something non-C:

  $ setenv LANG en  (implies codepage 1252)
  $ ls -l --show-control-chars
  ls: cannot access a-öäüß: No such file or directory
  ls: cannot access c-öäüßzz: No such file or directory
  ls: cannot access d-öäüßzzz: No such file or directory
  ls: cannot access e-öäüßöäüß: No such file or directory
  total 21
  -? ? ?       ?            ?                ? a-öäüß
  -? ? ?       ?            ?                ? c-öäüßzz
  -? ? ?       ?            ?                ? d-öäüßzzz
  -? ? ?       ?            ?                ? e-öäüßöäüß

As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
The problem now is that readdir() will return the transposed characters
as if they are the original characters.  ls uses some mbtowc function
to create a valid widechar string, and then uses the resulting widechar
string in some wctomb function to call stat().  However, *that* string
will use a valid multibyte sequence to represent the character, so the
resulting filename is suddenly different from the actual filename on
disk, and stat() returns with errno set to ENOENT.
Since the conversions from and to widechars are independent of each
other, there's no way to detect whether the incoming string of a wctomb
was originally based on a transposed character or not.

I'm not sure if I could explain this clearly enough...

So it looks like the current mechanism to handle invalid multibyte
sequences is too complicated for us.  As far as I can see, it would be
much simpler and less error prone to translate the invalid bytes simply
to the equivalent UTF-16 value.  That creates filenames with UTF-16
values from the ISO-8859-1 range.  I tested this with the files created
by the above testcase.  While the filenames appeared to be different
depending on the charset used, ls always handled the files gracefully.

Any objections?  I can also just check it in and the entire locale-
challenged part of the community can test it...


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-21 Thread Andy Koppe
2009/9/21 Corinna Vinschen:
 % cat t.c
 int main() {
     fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
     fopen("b-\xF6\xE4\xFC\xDFz", "w");
     fopen("c-\xF6\xE4\xFC\xDFzz", "w");
     fopen("d-\xF6\xE4\xFC\xDFzzz", "w");
     fopen("e-\xF6\xE4\xFC\xDF\xF6\xE4\xFC\xDF", "w");
     return 0;
 }

 Ok, I see what happens.  The problem is that the mechanism which is
 supposed to handle invalid multibyte sequences handles the first such
 byte, but fails to reset the multibyte shift state after the byte has
 been handled.  Basically, resetting the shift state after such a
 sequence has been encountered fixes that problem.

Great!


 Unfortunately this is only the first half of a solution.  This is what
 `ls' prints after running t:

  $ ls -l --show-control-chars
  total 21
  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 a-öäüß
  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 c-öäüßzz
  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 d-öäüßzzz
  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 e-öäüßöäüß

 But this is what ls prints when setting $LANG to something non-C:

  $ setenv LANG en      (implies codepage 1252)
  $ ls -l --show-control-chars
  ls: cannot access a-öäüß: No such file or directory
  ls: cannot access c-öäüßzz: No such file or directory
  ls: cannot access d-öäüßzzz: No such file or directory
  ls: cannot access e-öäüßöäüß: No such file or directory
  total 21
  -? ? ?       ?            ?                ? a-öäüß
  -? ? ?       ?            ?                ? c-öäüßzz
  -? ? ?       ?            ?                ? d-öäüßzzz
  -? ? ?       ?            ?                ? e-öäüßöäüß

Btw, the same thing will happen with en.C-ISO-8859-1 or C.ASCII too.


 As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
 transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
 The problem now is that readdir() will return the transposed characters
 as if they are the original characters.

Yep, that's where the bug is. Those 0xDC?? words represent invalid
UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.

Therefore, when converting a UTF-16 Windows filename to the current
charset, 0xDC?? words should be treated like any other UTF-16 word
that can't be represented in the current charset: it should be encoded
as a ^N sequence.


 ls uses some mbtowc function
 to create a valid widechar string, and then uses the resulting widechar
 string in some wctomb function to call stat().

It's not 'ls' that does that conversion. On the POSIX side, filenames
are simply sequences of bytes, hence 'ls' would be very wrong to do
any conversion between readdir() and stat().

No, it's stat() itself converting the CP1252 sequence "a-öäüß" to
UTF-16, which yields L"a-öäüß". This does not contain the 0xDC??
codepoints that the actual filename contained, hence stat() fails.


 So it looks like the current mechanism to handle invalid multibyte
 sequences is too complicated for us.  As far as I can see, it would be
 much simpler and less error prone to translate the invalid bytes simply
 to the equivalent UTF-16 value.  That creates filenames with UTF-16
 values from the ISO-8859-1 range.

This won't work correctly, because different POSIX filenames will map
to the same Windows filename. For example, the filenames \xC3\x84
(valid UTF-8 for A-umlaut) and \xC4 (invalid UTF-8 sequence that
represents A-umlaut in 8859-1) will both map to Windows filename
U+00C4, i.e. A-umlaut in UTF-16. Furthermore, after creating a file
called \xC4, a readdir() would show that file as \xC3\x84.

Note also that invalid UTF-8 sequences would be much less of an issue
if the C locale didn't mix UTF-8 filenames with an ISO-8859-1 console.
They'd still occur e.g. when unpacking a tarball with ISO-encoded
filenames while a UTF-8 locale is active. However, that sort of
situation is not handled well on Linux either.

Regards,
Andy




Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-15 Thread Lapo Luchini
Andy Koppe wrote:
 Hmm, we've lost the \xDF somewhere, and I'd guess it was when the
 filename got translated to UTF-16 in fopen(), which would explain what
 you're seeing

More data: it's not simply the last character, it's something more
complex than that.

% cat t.c
int main() {
fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
fopen("b-\xF6\xE4\xFC\xDFz", "w");
fopen("c-\xF6\xE4\xFC\xDFzz", "w");
fopen("d-\xF6\xE4\xFC\xDFzzz", "w");
fopen("e-\xF6\xE4\xFC\xDF\xF6\xE4\xFC\xDF", "w");
return 0;
}
% gcc -o t t.c
% ./t
% find .
.
./a-???
./b-???
./c-???
./d-???
./e-???
./t.c
./t.exe

It seems that once one high-bit-set byte is encountered, everything
past the last of them (itself included) is lost.

Also, I can confirm this works too:
% rm a-$'\366'$'\344'$'\374'$'\337'
but also this, since the last byte doesn't count:
% rm a-$'\366'$'\344'$'\374'$'\336'
BTW: I didn't know about that kind of escaping, but zsh auto-completed
it for me (excluding the last character, of course).

-- 
Lapo Luchini - http://lapo.it/





Re: [1.7] Invalid UTF8 while creating a file - cannot delete?

2009-09-10 Thread Andy Koppe
2009/9/10 Lapo Luchini:
 But the real problem with that test is not really what shows and how,
 the biggest problem is that it seems that filenames created with a
 wrong filename are quite limited in usage and can't seemingly be deleted.

 % export LANG=en_EN.UTF-8
 % cat t.c
 #include <stdio.h>
 int main() {
    fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
    fopen("b-\xC3\xB6\xC3\xA4\xC3\xBC\xC3\x9F", "w"); //UTF-8
    return 0;
 }
 % gcc -o t t.c
 % mkdir test ; cd test ; ../t ; cd ..
 % ls -l test
 ls: cannot access test/a-▒▒▒: No such file or directory
 total 0
 -? ? ?    ?    ?                ? a-▒▒▒
 -rw-r--r-- 1 lapo None 0 2009-09-10 21:19 b-öäüß
 % find test
 test
 test/a-???
 test/b-öäüß
 % find test -delete
 find: cannot delete `test/a-\366\344\374': No such file or directory

Hmm, we've lost the \xDF somewhere, and I'd guess it was when the
filename got translated to UTF-16 in fopen(), which would explain what
you're seeing:

'find' reads the filename correctly, invokes remove() on it, which
translates it to UTF-16 again, whereby we lose a second byte, so we're
down to a-\366\344, which can't be deleted because it doesn't exist.

    remove("a-\xF6\xE4\xFC\xDF");

Now here we start with the full name again, so if we lose the last
byte we get what's actually on disk, hence the call succeeds.

Bytes that don't contribute to valid UTF-8 characters get mapped to a
certain subrange of UTF-16 low surrogates at 0xDC80, which is a clever
trick for encoding such bytes into UTF-16 and getting them out again
after decoding.
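
A minimal demo of that roundtrip; toy code, not Cygwin's converter.
It simply treats every non-ASCII byte of this particular name as
invalid, which happens to hold for this ISO-8859-1 string:

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main (void)
{
  const unsigned char name[] = { 'a', '-', 0xf6, 0xe4, 0xfc, 0xdf, 0 };
  wchar_t wide[8];
  unsigned char back[8];
  int i;

  /* encode: ASCII passes through, invalid bytes get 0xDC00 or'ed in */
  for (i = 0; name[i]; i++)
    wide[i] = name[i] < 0x80 ? name[i] : (0xdc00 | name[i]);
  wide[i] = 0;

  /* decode: strip the 0xDC00 transposition again */
  for (i = 0; wide[i]; i++)
    back[i] = (wide[i] & 0xff00) == 0xdc00
              ? (unsigned char) (wide[i] & 0xff)
              : (unsigned char) wide[i];
  back[i] = 0;

  printf ("roundtrip %s\n", memcmp (name, back, sizeof name) ? "failed" : "ok");
  return 0;
}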

I stared at the code for this in sys_cp_mbstowcs for a bit, but
haven't spotted where those missing bytes might have gone.

Andy

--
Problem reports:   http://cygwin.com/problems.html
FAQ:   http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info:  http://cygwin.com/ml/#unsubscribe-simple