Re: [Bug-wget] bad filenames (again)

Andries E. Brouwer Fri, 21 Aug 2015 15:39:01 -0700

On Fri, Aug 21, 2015 at 08:54:28PM +0200, Tim Rühsen wrote:

> > Content-Disposition: attachment;
> > filename="20101202_%EB...%A8-%EB%B0%B1_.sgf"
> > This encodes a valid utf-8 filename, and that name should be used.
> > So wget should save this file under the name
> > 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
> 
> This is a different issue. Here we are talking about the encoding of HTTP 
> headers, especially 'filename' values within Content-Disposition HTTP header.
> Wget simply does not parse this correctly - it is just not coded in.
> It is just Wget missing some code here (worth opening a separate bug).


Good, saved for later.
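
For reference, the decoding step that seems to be missing could look
roughly like the sketch below. It is not Wget code. The function name
decode_percent_filename is hypothetical, and the sample value is only
the last part of the header quoted above, since the middle of that
value is elided. Whether a quoted filename= value should be
percent-decoded at all is exactly the open question, so the sketch
simply assumes that it should.

```python
from urllib.parse import unquote_to_bytes


def decode_percent_filename(value: str) -> str:
    """Percent-decode a filename value and interpret the bytes as UTF-8.

    Hypothetical helper for illustration only; raises UnicodeDecodeError
    if the decoded bytes are not valid UTF-8.
    """
    raw = unquote_to_bytes(value)  # "%EB%B0%B1" -> b"\xeb\xb0\xb1"
    return raw.decode("utf-8")     # strict decoding, no silent replacement


# Tail of the filename from the header above (the elided part is left out).
print(decode_percent_filename("%EB%B0%B1_.sgf"))
```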

> If the server AND the document do not explicitly specify the character 
> encoding, there still is one - namely the default. Has been ISO-8859-1
> a while ago. AFAIR, HTML5 might have changed that (too late for me now
> to look it up).

Yes - that is our main difference. You read the standard and find there
what everyone is supposed to do, or what the default is.
I download stuff from the net and encounter lots of things people do
that are perhaps not according to the most recent standard
and may differ from the default.

As a consequence I prefer to base the decision about what to do
on the form of the filename (ASCII / UTF-8 / other), not on the
headers encountered on the way to this file.

Fortunately, almost all URLs are in ASCII - no problem.
Fortunately, almost all that are not in ASCII are UTF-8.
The good thing about UTF-8 is that it has a quite typical bit pattern:
a non-ASCII filename that is valid UTF-8 is very likely UTF-8.
So, one can recognize ASCII and UTF-8 rather reliably.
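
A minimal sketch of that classification, looking only at the bytes of
the name and ignoring all headers. The function name classify_name is
made up for this illustration, and "other" just means "neither ASCII
nor valid UTF-8"; guessing which legacy charset it really is would be
a separate problem.

```python
def classify_name(raw: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' based only on the byte pattern."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences and overlong forms, which is why a random non-ASCII
        # byte string rarely passes as UTF-8 by accident.
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


print(classify_name(b"index.html"))
print(classify_name("농심".encode("utf-8")))
print(classify_name("café".encode("latin-1")))
```

The last name is ISO-8859-1: its byte 0xE9 looks like the start of a
three-byte UTF-8 sequence but nothing follows it, so it is rejected.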

(By the way, I checked my conjecture that iconv from UTF-8
to UTF-8 need not be the identity map, and that is indeed the case.
On my Ubuntu machine iconv from UTF-8 to UTF-8 converts NFD to NFC.)
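
To show why this matters for filenames: the same visible name has
different byte sequences in NFD and NFC, and both are valid UTF-8.
The sketch below uses Python's unicodedata module rather than iconv,
so it illustrates the two normalization forms, not what iconv does
on any particular system.

```python
import unicodedata

nfc = "\u00e9"                           # 'é' as one precomposed code point
nfd = unicodedata.normalize("NFD", nfc)  # 'e' followed by U+0301 combining acute

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

# Same character on screen, different bytes on disk.
print(nfc_bytes, nfd_bytes, nfc_bytes == nfd_bytes)

# Normalizing NFD back to NFC recovers the precomposed form.
roundtrip = unicodedata.normalize("NFC", nfd)
print(roundtrip == nfc)
```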

Andries
