> Date: Wed, 19 Aug 2015 01:43:51 +0200
> From: Ángel González <keis...@gmail.com>
> 
>     +int
>     +wc_utime (unsigned char *filename, struct _utimbuf *times)
>     +{
>     +  wchar_t *w_filename;
>     +  int buffer_size;
>     +
>     +  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, w_filename, 0);
>     +  w_filename = alloca (buffer_size);
>     +  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
>     +  return _wutime (w_filename, times);
>     +}
> 
>     and similar for stat, open, etc. Something similar is what would be
>     needed on Windows?
>     Is his patch usable? Maybe I also commented a little in
>     http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00081.html
>     but after that nothing happened, it seems.
> 
> That would probably work, but would need a review. On a quick look, some of 
> the functions have memory leaks (seems he first used malloc, then changed to 
> alloca just some of them).

Indeed.  Actually, there's no need to allocate memory dynamically,
neither with malloc nor with alloca, since Windows file names have a
fixed size limit that is known in advance.  So each conversion
function can use a fixed-size local wchar_t array.  Doing that will
also avoid the need for 2 calls to MultiByteToWideChar, the first one
made only to find out how much space to allocate.
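
A minimal sketch of the fixed-size variant, not a reviewed patch.  It
assumes MAX_PATH is the limit that applies to these file names, and it
takes a 'const char *' argument instead of the 'unsigned char *' in the
quoted patch; both choices are assumptions made for illustration.

```c
#ifdef _WIN32
#include <windows.h>
#include <sys/utime.h>

int
wc_utime (const char *filename, struct _utimbuf *times)
{
  /* Assumption: MAX_PATH (260 wchar_t, including the terminating
     null) is the limit that applies to these names.  */
  wchar_t w_filename[MAX_PATH];

  /* A single call: the last argument is the buffer size counted in
     wchar_t elements, not in bytes.  The function returns 0 on
     failure, e.g. when the converted name does not fit.  */
  if (MultiByteToWideChar (CP_UTF8, 0, filename, -1,
                           w_filename, MAX_PATH) == 0)
    return -1;

  return _wutime (w_filename, times);
}
#endif /* _WIN32 */
```

Note that the size passed to MultiByteToWideChar is a count of wide
characters; passing a byte count, as the quoted patch does in its
second call, overstates the size of the buffer.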

> And of course, there's the question of what to do if the filename we are 
> trying to convert to utf-16 is not in fact valid utf-8.

The calls to MultiByteToWideChar should pass a flag
(MB_ERR_INVALID_CHARS) in their 2nd argument that makes the function
fail with a distinct error code in that case.  When it fails like
that, the wc_* wrappers should simply call the "normal" unibyte
functions with the original 'char *' argument.  This makes the
modified code fall back on previous behavior when the source file
names are not in UTF-8.
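
The fallback could look roughly like the sketch below.  It assumes a
fixed MAX_PATH-sized buffer and the _utime function as the "normal"
counterpart of _wutime; the handling of other conversion failures
(returning -1 with errno set to EINVAL) is likewise an assumption made
for illustration, not something settled in this thread.

```c
#ifdef _WIN32
#include <windows.h>
#include <errno.h>
#include <sys/utime.h>

int
wc_utime (const char *filename, struct _utimbuf *times)
{
  wchar_t w_filename[MAX_PATH];

  /* MB_ERR_INVALID_CHARS makes the conversion fail on input that is
     not valid UTF-8, instead of silently substituting characters.  */
  if (MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS, filename, -1,
                           w_filename, MAX_PATH) > 0)
    return _wutime (w_filename, times);

  /* Not valid UTF-8: fall back on the previous behavior and hand the
     original 'char *' name to the non-wide function.  */
  if (GetLastError () == ERROR_NO_UNICODE_TRANSLATION)
    return _utime (filename, times);

  /* Any other failure, e.g. a name that does not fit the buffer.  */
  errno = EINVAL;
  return -1;
}
#endif /* _WIN32 */
```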

And regardless, wget should convert file names to the locale's codeset
(on all platforms).  Once the above patches are accepted, the Windows
build will pretend that its locale's codeset is UTF-8, and that will
ensure the conversions with MultiByteToWideChar work in most situations.

