On Wed, Apr 23, 2014 at 04:57:11PM +0300, Bykov Aleksey wrote:
> Greetings, Darshit Shah
>
> This was discussed some (or a long) time ago.
>
> Possible logic: if the locale isn't UTF-8, then process as before; else:
> 1. Convert the string to a wide-character string with mbstowcs().
> 2. For each wide char, check its size with wctomb(). If the size is 1,
>    compare it with the mask. If the char is restricted, then "quoted++;".
> 3. If needed, convert to lower/upper case with towlower()/towupper().
> 4. Recreate the string char by char with wctomb(): convert the char into a
>    temporary buffer. If its size is 1, compare with the mask and replace.
>    Else "memcpy(q, char_buffer, char_size); q += char_size;".
>
> On Windows I can't check it (mbstowcs() doesn't work with UTF-8, so
> MultiByteToWideChar() must be used...).
>
> A patch for Windows (unstructured, unclear, unfinished, but working) is
> attached.
>
> Best Regards, Bykov Aleksey.
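For illustration, a minimal C sketch of the wide-character round trip
described in the quoted steps above. The is_restricted_filechar() mask and
the plain '_' substitution are placeholders, not wget's actual quoting logic
and not the attached patch; the point is only the mbstowcs()/wctomb() flow.

#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder mask: a few shell-unfriendly ASCII chars plus controls. */
static int
is_restricted_filechar (char c)
{
  return strchr ("\"*<>?|\\:", c) != NULL || (unsigned char) c < 32;
}

/* Rebuild NAME char by char: single-byte chars that hit the mask are
   replaced (here simply with '_'), multibyte sequences are copied through
   untouched. */
static char *
transform_filename (const char *name)
{
  size_t wlen = mbstowcs (NULL, name, 0);
  if (wlen == (size_t) -1)
    return strdup (name);            /* invalid sequence: leave unchanged */

  wchar_t *wbuf = malloc ((wlen + 1) * sizeof *wbuf);
  mbstowcs (wbuf, name, wlen + 1);

  /* Worst case every wide char expands to MB_CUR_MAX bytes. */
  char *out = malloc (wlen * MB_CUR_MAX + 1);
  char *q = out;
  char tmp[MB_LEN_MAX];

  for (size_t i = 0; i < wlen; i++)
    {
      int n = wctomb (tmp, wbuf[i]);
      if (n <= 0)
        continue;                    /* unrepresentable: skip */
      if (n == 1 && is_restricted_filechar (tmp[0]))
        *q++ = '_';                  /* single-byte char hit the mask */
      else
        {
          memcpy (q, tmp, n);        /* multibyte char copied verbatim */
          q += n;
        }
    }
  *q = '\0';
  free (wbuf);
  return out;
}

int
main (void)
{
  setlocale (LC_ALL, "");            /* pick up the user's (UTF-8) locale */
  char *r = transform_filename ("caf\xc3\xa9?.txt");
  printf ("%s\n", r);                /* "café_.txt" in a UTF-8 locale */
  free (r);
  return 0;
}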
Good!  However:

- The patch is inside #ifdef WINDOWS ... #endif, while the problem occurs on
  all systems, also on Unix.

- I think all of this is needlessly complicated.  Repeatedly converting
  filenames is not a good plan if the goal is to keep them unchanged.

- UTF-8 has the nice property that the only 7-bit bytes that occur inside a
  character code are those in the ASCII set.  So no conversion is needed to
  test the length: every byte in 0-127 always represents a full character.

- Presently, 0-31 and 127-159 are considered "control".  That is wrong on
  UTF-8 systems, where 128-159 are part of a multibyte character.

If one wants to preserve the filename mangling in the 0-31,127 range, but
wants to do the mangling of 128-159 only when some option asks for it, then
0-31,127 and 128-159 should have different flags in
url.c:static const unsigned char urlchr_table[256], e.g.

  ...
  #define D filechr_highcontrol
  ...
  D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, /* 128-143 */
  D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, /* 144-159 */
  ...
  #undef D

Andries
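For illustration, a minimal sketch of the flag split suggested above. The
flag values, the needs_quoting() helper, and the truncated table are
assumptions for the example, not wget's actual url.c code; only the
filechr_highcontrol name and the 128-159 rows come from the message. In a
UTF-8 locale, bytes 128-159 only occur inside multibyte sequences and are
therefore left alone unless explicitly requested.

#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum {
  filechr_control     = 1,   /* 0-31, 127: always mangled        */
  filechr_highcontrol = 2    /* 128-159: mangled only on request */
};

#define C filechr_control
#define D filechr_highcontrol

static const unsigned char filechr_table[256] =
{
  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C,   /*   0- 15 */
  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C,   /*  16- 31 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,   /*  32- 47 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,   /*  48- 63 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,   /*  64- 79 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,   /*  80- 95 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,   /*  96-111 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, C,   /* 112-127 */
  D, D, D, D,  D, D, D, D,  D, D, D, D,  D, D, D, D,   /* 128-143 */
  D, D, D, D,  D, D, D, D,  D, D, D, D,  D, D, D, D,   /* 144-159 */
  /* 160-255 left as 0 for this sketch */
};

#undef C
#undef D

/* In a UTF-8 locale every byte in 0-127 is a complete character, and bytes
   128-159 can only be continuation bytes of a multibyte character, so they
   must not be quoted unless an option asks for it. */
static bool
needs_quoting (unsigned char b, bool utf8_locale, bool quote_high)
{
  unsigned char flags = filechr_table[b];
  if (flags & filechr_control)
    return true;
  if (flags & filechr_highcontrol)
    return utf8_locale ? quote_high : true;
  return false;
}

static bool
locale_is_utf8 (void)
{
  const char *cs = nl_langinfo (CODESET);   /* requires setlocale() first */
  return cs && (strcmp (cs, "UTF-8") == 0 || strcmp (cs, "utf8") == 0);
}

int
main (void)
{
  setlocale (LC_ALL, "");
  bool utf8 = locale_is_utf8 ();
  /* 0x0a (newline) is always quoted; 0x90 only when asked for under UTF-8. */
  printf ("0x0a: %d  0x90: %d\n",
          needs_quoting (0x0a, utf8, false),
          needs_quoting (0x90, utf8, false));
  return 0;
}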
