Re: [Bug-wget] bad filenames (again)

Tim Ruehsen Thu, 20 Aug 2015 15:33:41 -0700

On Wednesday 19 August 2015 17:38:39 Eli Zaretskii wrote:
> > Date: Wed, 19 Aug 2015 02:52:57 +0200
> > From: "Andries E. Brouwer" <[email protected]>
> > Cc: [email protected]
> > 
> > Look at the remote filename.
> > 
> > Assign a character set as follows:
> > - if the user specified a from-charset, use that
> > - if the name is printable ASCII (in 0x20-0x7e), take ASCII
> > - if the name is non-ASCII and valid UTF-8, take UTF-8
> > - otherwise take Unknown.
> 
> I think this is simpler and produces the same results:
>  - if the user specified a from-charset, use that
>  - otherwise assume UTF-8
> 
> > Determine a local character set as follows:
> > - if the user specified a to-charset, use that
> > - if the locale uses UTF-8, use that
> > - otherwise take ASCII
> 
> I suggest this instead:
>  - if the user specified a to-charset, use that
>  - otherwise, call nl_langinfo(CODESET) to find out the current
>    locale's encoding
> 
> > Convert the name from from-charset to to-charset:
> > - if the user asked for unmodified filenames, do nothing
> > - if the name is ASCII, do nothing
> > - if the name is UTF-8 and the locale uses UTF-8, do nothing
> > - convert from Unknown by hex-escaping the entire name
> > - convert to ASCII by hex-escaping the entire name
> > - otherwise invoke iconv(); upon failure, escape the illegal bytes
> 
> My suggestion:
>  - if the user asked for unmodified filenames, do nothing
>  - else invoke 'iconv' to convert from remote to local encoding
>  - if 'iconv' fails, convert to ASCII by hex-escaping
> 
> Hex-escaping only the bytes that fail 'iconv' is better than
> hex-escaping all of them, but it's more complex, and I'm not sure it's
> worth the hassle.  But if it can be implemented without undue trouble,
> I'm all for it, as it will make wget more user-friendly in those
> cases.
> 
> > Once we know what we want it is trivial to write the code,
> > but it may take a while to figure out what we want.
> > I think we should start applying the current patch.
> 
> Tim says he has some/most of that coded on a branch, so I think we
> should start by merging that branch, and then take it from there.


It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can't 
just 'click on the merge button' to merge.
Basically, I keep track of the charset of each URL input (command line, input 
file, stdin, downloaded+scanned). So when generating the filename we know both 
the source and the target charset. When iconv fails at that point (e.g. Chinese 
input, ASCII output), escaping takes place.
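
To make the flow concrete, here is a minimal sketch in Python (not the C 
code from the branch). The names local_filename and _hex_escape and the %XX 
escape format are assumptions made for illustration only. It assumes UTF-8 
when no from-charset is given, takes the locale's encoding when no to-charset 
is given, and escapes only the characters that cannot be converted.

```python
import codecs
import locale


def _hex_escape(err):
    """Encode-error handler: emit %XX for each UTF-8 byte of the
    characters that the target charset cannot represent."""
    bad = err.object[err.start:err.end]
    escaped = "".join("%{:02X}".format(b) for b in bad.encode("utf-8"))
    return escaped, err.end


# "hexescape" is a hypothetical handler name chosen for this sketch.
codecs.register_error("hexescape", _hex_escape)


def local_filename(raw, from_charset=None, to_charset=None):
    """Convert the raw bytes of a remote name to the local charset."""
    src = from_charset or "utf-8"  # no from-charset given: assume UTF-8
    dst = to_charset or locale.getpreferredencoding(False)
    try:
        text = raw.decode(src)
    except UnicodeDecodeError:
        # The name is not valid in the assumed charset: keep printable
        # ASCII as it is and hex-escape every other byte.
        return "".join(
            chr(b) if 0x20 <= b <= 0x7E else "%{:02X}".format(b)
            for b in raw
        ).encode("ascii")
    # Characters missing from the target charset go through _hex_escape.
    return text.encode(dst, errors="hexescape")
```

With Chinese or Japanese input and an ASCII target, only the non-ASCII 
characters come out escaped and the rest of the name stays readable, e.g. 
local_filename("日本.txt".encode("utf-8"), to_charset="ascii") gives 
b"%E6%97%A5%E6%9C%AC.txt".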

Tim
