On Fri, Aug 21, 2015 at 08:54:28PM +0200, Tim Rühsen wrote:

> > Content-Disposition: attachment;
> > filename="20101202_%EB...%A8-%EB%B0%B1_.sgf"
> >
> > This encodes a valid utf-8 filename, and that name should be used.
> > So wget should save this file under the name
> > 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
>
> This is a different issue. Here we are talking about the encoding of HTTP
> headers, especially 'filename' values within Content-Disposition HTTP header.
> Wget simply does not parse this correctly - it is just not coded in.
> It is just Wget missing some code here (worth opening a separate bug).

Good, saved for later.

> If the server AND the document do not explicitly specify the character
> encoding, there still is one - namely the default. Has been ISO-8859-1
> a while ago. AFAIR, HTML5 might have changed that (too late for me now
> to look it up).

Yes - that is our main difference.
You read the standard and find there what everyone is supposed to do,
or what the default is.
I download stuff from the net and encounter lots of things people do
that are perhaps not according to the most recent standard,
and may differ from the default.

As a consequence I prefer to base the decision about what to do
on the form of the filename (ASCII / UTF-8 / other), not on
the headers encountered on the way to this file.

Fortunately, almost all URLs are in ASCII - no problem.
Fortunately, almost all that are not in ASCII are UTF-8.
The good thing about UTF-8 is that it has a quite typical bit pattern.
A non-ASCII filename that is valid UTF-8 is very likely UTF-8.
So, one can recognize ASCII and UTF-8 rather reliably.

(By the way, I checked my conjecture that iconv from UTF-8 to UTF-8
need not be the identity map, and that is indeed the case.
On my Ubuntu machine iconv from UTF-8 to UTF-8 converts NFD to NFC.)

Andries
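
The ASCII / UTF-8 / other distinction described above can be sketched
roughly as follows. This is only an illustration in Python, not Wget code:
the function name classify_filename is hypothetical, and it assumes the
filename is available as raw bytes (i.e. after any percent-decoding).

```python
def classify_filename(raw: bytes) -> str:
    """Classify raw filename bytes as 'ascii', 'utf-8', or 'other'.

    Hypothetical helper for illustration; not part of Wget.
    """
    # Pure ASCII: every byte is below 0x80.
    if all(b < 0x80 for b in raw):
        return "ascii"
    # Non-ASCII: accept as UTF-8 only if the whole byte string decodes
    # under strict rules (correct lead/continuation byte pattern).
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


if __name__ == "__main__":
    print(classify_filename(b"index.html"))                # ascii
    print(classify_filename("바둑.sgf".encode("utf-8")))    # utf-8
    print(classify_filename("Rühsen".encode("latin-1")))   # other
```

The check is a heuristic, as in the text: a byte string that happens to be
valid UTF-8 is only "very likely" UTF-8, not certainly so.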
