On Wed, Apr 23, 2014 at 04:57:11PM +0300, Bykov Aleksey wrote:
> Greetings, Darshit Shah
>
> This was discussed some (or a long) time ago.
>
> Possible logic: if the locale isn't UTF-8, then process as before; else:
> 1. Convert the string to a wide-character string with mbstowcs().
> 2. For each wide char, check its size with wctomb(). If the size is 1,
>    compare it with the mask. If the char is restricted, then "quoted++;".
> 3. If needed, convert to lower/upper case with towlower()/towupper().
> 4. Recreate the string char by char with wctomb(): convert the char into a
>    temporary buffer. If its size is 1, compare with the mask and replace.
>    Else "memcpy(q, char_buffer, char_size); q += char_size;".
>
> On Windows I can't check it (mbstowcs() doesn't work with UTF-8, so
> MultiByteToWideChar() must be used...).
>
> A patch for Windows (unstructured, unclear, unfinished, but working) is
> attached.
>
> Best Regards, Bykov Aleksey.
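For illustration, a minimal C sketch of the wide-character round trip
described in the quoted steps above. The is_restricted_filechar() mask and
the plain '_' substitution are placeholders, not wget's actual quoting logic
and not the attached patch; the point is only the mbstowcs()/wctomb() flow.

#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder mask: a few shell-unfriendly ASCII chars plus controls. */
static int
is_restricted_filechar (char c)
{
  return strchr ("\"*<>?|\\:", c) != NULL || (unsigned char) c < 32;
}

/* Rebuild NAME char by char: single-byte chars that hit the mask are
   replaced (here simply with '_'), multibyte sequences are copied through
   untouched. */
static char *
transform_filename (const char *name)
{
  size_t wlen = mbstowcs (NULL, name, 0);
  if (wlen == (size_t) -1)
    return strdup (name);            /* invalid sequence: leave unchanged */

  wchar_t *wbuf = malloc ((wlen + 1) * sizeof *wbuf);
  mbstowcs (wbuf, name, wlen + 1);

  /* Worst case every wide char expands to MB_CUR_MAX bytes. */
  char *out = malloc (wlen * MB_CUR_MAX + 1);
  char *q = out;
  char tmp[MB_LEN_MAX];

  for (size_t i = 0; i < wlen; i++)
    {
      int n = wctomb (tmp, wbuf[i]);
      if (n <= 0)
        continue;                    /* unrepresentable: skip */
      if (n == 1 && is_restricted_filechar (tmp[0]))
        *q++ = '_';                  /* single-byte char hit the mask */
      else
        {
          memcpy (q, tmp, n);        /* multibyte char copied verbatim */
          q += n;
        }
    }
  *q = '\0';
  free (wbuf);
  return out;
}

int
main (void)
{
  setlocale (LC_ALL, "");            /* pick up the user's (UTF-8) locale */
  char *r = transform_filename ("caf\xc3\xa9?.txt");
  printf ("%s\n", r);                /* "café_.txt" in a UTF-8 locale */
  free (r);
  return 0;
}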
Good!  However:

- The patch is inside #ifdef WINDOWS ... #endif, while the problem occurs on
  all systems, also on Unix.

- I think all of this is needlessly complicated.  Repeatedly converting
  filenames is not a good plan if the goal is to keep them unchanged.

- UTF-8 has the nice property that the only 7-bit bytes that occur inside a
  character code are those in the ASCII set.  So no conversion is needed to
  test the length: every byte in 0-127 always represents a full character.

- Presently, 0-31 and 127-159 are considered "control".  That is wrong on
  UTF-8 systems, where 128-159 are part of a multibyte character.

If one wants to preserve the filename mangling in the 0-31,127 range, but
wants to do the mangling of 128-159 only when some option asks for it, then
0-31,127 and 128-159 should have different flags in
url.c:static const unsigned char urlchr_table[256], e.g.

  ...
  #define D filechr_highcontrol
  ...
  D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, /* 128-143 */
  D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, /* 144-159 */
  ...
  #undef D

Andries
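For illustration, a minimal sketch of the flag split suggested above. The
flag values, the needs_quoting() helper, and the truncated table are
assumptions for the example, not wget's actual url.c code; only the
filechr_highcontrol name and the 128-159 rows come from the message. In a
UTF-8 locale, bytes 128-159 only occur inside multibyte sequences and are
therefore left alone unless explicitly requested.

#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum {
  filechr_control     = 1,   /* 0-31, 127: always mangled        */
  filechr_highcontrol = 2    /* 128-159: mangled only on request */
};

#define C filechr_control
#define D filechr_highcontrol

static const unsigned char filechr_table[256] =
{
  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C,   /*   0- 15 */
  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C,   /*  16- 31 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,   /*  32- 47 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,   /*  48- 63 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,   /*  64- 79 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,   /*  80- 95 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,   /*  96-111 */
  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, C,   /* 112-127 */
  D, D, D, D,  D, D, D, D,  D, D, D, D,  D, D, D, D,   /* 128-143 */
  D, D, D, D,  D, D, D, D,  D, D, D, D,  D, D, D, D,   /* 144-159 */
  /* 160-255 left as 0 for this sketch */
};

#undef C
#undef D

/* In a UTF-8 locale every byte in 0-127 is a complete character, and bytes
   128-159 can only be continuation bytes of a multibyte character, so they
   must not be quoted unless an option asks for it. */
static bool
needs_quoting (unsigned char b, bool utf8_locale, bool quote_high)
{
  unsigned char flags = filechr_table[b];
  if (flags & filechr_control)
    return true;
  if (flags & filechr_highcontrol)
    return utf8_locale ? quote_high : true;
  return false;
}

static bool
locale_is_utf8 (void)
{
  const char *cs = nl_langinfo (CODESET);   /* requires setlocale() first */
  return cs && (strcmp (cs, "UTF-8") == 0 || strcmp (cs, "utf8") == 0);
}

int
main (void)
{
  setlocale (LC_ALL, "");
  bool utf8 = locale_is_utf8 ();
  /* 0x0a (newline) is always quoted; 0x90 only when asked for under UTF-8. */
  printf ("0x0a: %d  0x90: %d\n",
          needs_quoting (0x0a, utf8, false),
          needs_quoting (0x90, utf8, false));
  return 0;
}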
