> Date: Wed, 19 Aug 2015 02:52:57 +0200
> From: "Andries E. Brouwer" <[email protected]>
> Cc: [email protected]
>
> Look at the remote filename.
>
> Assign a character set as follows:
> - if the user specified a from-charset, use that
> - if the name is printable ASCII (in 0x20-0x7f), take ASCII
> - if the name is non-ASCII and valid UTF-8, take UTF-8
> - otherwise take Unknown.

I think this is simpler and produces the same results:

  - if the user specified a from-charset, use that
  - otherwise assume UTF-8

> Determine a local character set as follows:
> - if the user specified a to-charset, use that
> - if the locale uses UTF-8, use that
> - otherwise take ASCII

I suggest this instead:

  - if the user specified a to-charset, use that
  - otherwise, call nl_langinfo(CODESET) to find out the current
    locale's encoding

> Convert the name from from-charset to to-charset:
> - if the user asked for unmodified filenames, do nothing
> - if the name is ASCII, do nothing
> - if the name is UTF-8 and the locale uses UTF-8, do nothing
> - convert from Unknown by hex-escaping the entire name
> - convert to ASCII by hex-escaping the entire name
> - otherwise invoke iconv(); upon failure, escape the illegal bytes

My suggestion:

  - if the user asked for unmodified filenames, do nothing
  - else invoke 'iconv' to convert from remote to local encoding
  - if 'iconv' fails, convert to ASCII by hex-escaping

Hex-escaping only the bytes that fail 'iconv' is better than
hex-escaping all of them, but it's more complex, and I'm not sure it's
worth the hassle.  But if it can be implemented without undue trouble,
I'm all for it, as it will make wget more user-friendly in those
cases.

> Once we know what we want it is trivial to write the code,
> but it may take a while to figure out what we want.
> I think we should start applying the current patch.

Tim says he has some/most of that coded on a branch, so I think we
should start by merging that branch, and then take it from there.
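
For illustration only, here is a rough sketch of the scheme suggested above, including the variant that escapes only the bytes 'iconv' rejects. It is not wget code. The names convert_name and grow are hypothetical, and the %XX escape format is an assumption, since nothing above fixes how the hex-escaping should look. The "unmodified filenames" case is left to the caller, who would simply not call the function.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: double the buffer; frees it and returns NULL on
   allocation failure.  */
static char *
grow (char *buf, size_t *cap)
{
  char *p = realloc (buf, *cap * 2);
  if (p)
    *cap *= 2;
  else
    free (buf);
  return p;
}

/* Hypothetical sketch, not wget code.  Convert NAME from encoding FROM
   to encoding TO.  A NULL FROM means "assume UTF-8"; a NULL TO means
   "use the locale's codeset", which is only meaningful if the program
   has already called setlocale (LC_ALL, "").  Bytes that iconv cannot
   convert are written as %XX (assumed escape format).  Returns a
   malloc'd string, or NULL on failure.  */
char *
convert_name (const char *name, const char *from, const char *to)
{
  assert (name != NULL);
  if (!from)
    from = "UTF-8";
  if (!to)
    to = nl_langinfo (CODESET);

  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return NULL;

  char *in = (char *) name;     /* iconv takes a non-const pointer */
  size_t inleft = strlen (name);
  size_t cap = 4 * inleft + 16, used = 0;
  char *out = malloc (cap);

  while (out && inleft > 0)
    {
      if (cap - used < 16)
        {
          out = grow (out, &cap);
          continue;
        }
      char *outp = out + used;
      /* Hold back 4 bytes: one %XX escape plus the terminating NUL.  */
      size_t outleft = cap - used - 4;
      size_t rc = iconv (cd, &in, &inleft, &outp, &outleft);
      used = (size_t) (outp - out);
      if (rc != (size_t) -1)
        continue;               /* all input consumed */
      if (errno == E2BIG)
        out = grow (out, &cap);
      else if (errno == EILSEQ || errno == EINVAL)
        {
          /* Escape just the offending byte, then resume after it.  */
          used += (size_t) sprintf (out + used, "%%%02X",
                                    (unsigned char) *in);
          in++;
          inleft--;
        }
      else
        {
          free (out);
          out = NULL;
        }
    }
  if (out)
    out[used] = '\0';
  iconv_close (cd);
  return out;
}
```

The extra cost over escaping the whole name is the loop that resumes after each rejected byte. Note that how 'iconv' treats characters that are valid in the source encoding but missing from the target differs between implementations, so the result for such names may vary by platform.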
