On Wednesday 19 August 2015 17:38:39 Eli Zaretskii wrote:
> > Date: Wed, 19 Aug 2015 02:52:57 +0200
> > From: "Andries E. Brouwer" <[email protected]>
> > Cc: [email protected]
> >
> > Look at the remote filename.
> >
> > Assign a character set as follows:
> > - if the user specified a from-charset, use that
> > - if the name is printable ASCII (in 0x20-0x7f), take ASCII
> > - if the name is non-ASCII and valid UTF-8, take UTF-8
> > - otherwise take Unknown.
>
> I think this is simpler and produces the same results:
> - if the user specified a from-charset, use that
> - otherwise assume UTF-8
>
> > Determine a local character set as follows:
> > - if the user specified a to-charset, use that
> > - if the locale uses UTF-8, use that
> > - otherwise take ASCII
>
> I suggest this instead:
> - if the user specified a to-charset, use that
> - otherwise, call nl_langinfo(CODESET) to find out the current
>   locale's encoding
>
> > Convert the name from from-charset to to-charset:
> > - if the user asked for unmodified filenames, do nothing
> > - if the name is ASCII, do nothing
> > - if the name is UTF-8 and the locale uses UTF-8, do nothing
> > - convert from Unknown by hex-escaping the entire name
> > - convert to ASCII by hex-escaping the entire name
> > - otherwise invoke iconv(); upon failure, escape the illegal bytes
>
> My suggestion:
> - if the user asked for unmodified filenames, do nothing
> - else invoke 'iconv' to convert from remote to local encoding
> - if 'iconv' fails, convert to ASCII by hex-escaping
>
> Hex-escaping only the bytes that fail 'iconv' is better than
> hex-escaping all of them, but it's more complex, and I'm not sure it's
> worth the hassle. But if it can be implemented without undue trouble,
> I'm all for it, as it will make wget more user-friendly in those
> cases.
>
> > Once we know what we want it is trivial to write the code,
> > but it may take a while to figure out what we want.
>
> I think we should start applying the current patch.
>
> Tim says he has some/most of that coded on a branch, so I think we
> should start by merging that branch, and then take it from there.

It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you
can't just 'click on the merge button' to merge.

Basically, I keep track of the charset of each URL input (command line,
input file, stdin, downloaded+scanned). So when generating the filename,
we have both the 'to' and the 'from' charset. When iconv fails here
(e.g. Chinese input, ASCII output), escaping takes place.

Tim
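
For illustration only, the simpler procedure Eli suggests above (which
also matches the escape-on-failure behaviour Tim describes) could be
sketched roughly as below. This is a Python stand-in and not code from
wget or from the 'tim/wget2' branch: Python's decode()/encode() take
the place of iconv(), the names local_filename and hex_escape are
hypothetical, and the %XX escape format is an assumption, since the
thread does not say what the escapes look like. On failure it escapes
every non-printable-ASCII byte of the name rather than only the bytes
that failed to convert.

```python
import locale


def hex_escape(raw):
    """Escape each byte outside printable ASCII as %XX (format is assumed)."""
    out = []
    for b in raw:
        if 0x20 <= b < 0x7F:
            out.append(chr(b))
        else:
            out.append("%{:02X}".format(b))
    return "".join(out)


def local_filename(raw, from_charset=None, to_charset=None, unmodified=False):
    """Return the bytes to use for the local file name.

    raw is the remote name as bytes; the charset arguments stand for
    the user's optional from-charset and to-charset choices.
    """
    if unmodified:
        # The user asked for unmodified filenames: do nothing.
        return raw
    # No from-charset given: assume UTF-8.
    src = from_charset or "UTF-8"
    # No to-charset given: ask the current locale (Unix only).
    dst = to_charset or locale.nl_langinfo(locale.CODESET)
    try:
        # Stand-in for iconv(): convert from remote to local encoding.
        return raw.decode(src).encode(dst)
    except (UnicodeError, LookupError):
        # Conversion failed (or the charset is unknown): fall back to ASCII.
        return hex_escape(raw).encode("ascii")
```

With Tim's example of Chinese input and ASCII output,
local_filename("中.txt".encode("utf-8"), to_charset="ASCII") returns
b"%E4%B8%AD.txt", while a plain ASCII name passes through unchanged.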
