On Saturday 22 August 2015 00:39:01 Andries E. Brouwer wrote:
> On Fri, Aug 21, 2015 at 08:54:28PM +0200, Tim Rühsen wrote:
> > > Content-Disposition: attachment;
> > > filename="20101202_%EB...%A8-%EB%B0%B1_.sgf"
> > > This encodes a valid utf-8 filename, and that name should be used.
> > > So wget should save this file under the name
> > > 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
> >
> > This is a different issue. Here we are talking about the encoding of HTTP
> > headers, especially 'filename' values within Content-Disposition HTTP
> > header. Wget simply does not parse this correctly - it is just not coded
> > in. It is just Wget missing some code here (worth opening a separate
> > bug).
> Good, saved for later.
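For illustration, the quoted filename is percent-encoded UTF-8. Below is a minimal Python sketch of how such a value could be decoded. It is not the wget2 code; `decode_filename` is a hypothetical helper, and the ISO-8859-1 fallback for byte sequences that are not valid UTF-8 is an assumption made for this sketch only. The sample input uses just the `%EB%B0%B1_.sgf` tail of the header above, since the full value is elided.

```python
from urllib.parse import unquote_to_bytes


def decode_filename(raw: str) -> str:
    """Percent-decode a Content-Disposition filename value.

    Hypothetical helper: the decoded bytes are used as UTF-8 when they
    form valid UTF-8; otherwise they are read as ISO-8859-1 (an
    assumption of this sketch, not a documented wget rule).
    """
    data = unquote_to_bytes(raw)  # "%EB%B0%B1" -> b"\xeb\xb0\xb1"
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return data.decode("iso-8859-1")


if __name__ == "__main__":
    print(decode_filename("%EB%B0%B1_.sgf"))  # prints 백_.sgf
```

Trying UTF-8 first and falling back only on a decode error follows the idea of deciding by the form of the name rather than by the headers alone.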

Just implemented (or let's say fixed) Content-Disposition in wget2. It now
saves the file as
  20101202_농심신라면배_바둑(다카오신지9단-백_.sgf

Content-Disposition (filename, filename*) is standardized, but browsers seem
to behave/parse very differently, ignoring the standards. See
http://stackoverflow.com/questions/93551/how-to-encode-the-filename-parameter-of-content-disposition-header-in-http
(answer 2 from Martin Ørding-Thomsen)

But that's just FYI. Different issue.

> > If the server AND the document do not explicitly specify the character
> > encoding, there still is one - namely the default. Has been ISO-8859-1
> > a while ago. AFAIR, HTML5 might have changed that (too late for me now
> > to look it up).
>
> Yes - that is our main difference. You read the standard and find there
> what everyone is supposed to do, or what the default is.
> I download stuff from the net and encounter lots of things people do,
> that are perhaps not according to the most recent standard,
> and may differ from the default.
>
> As a consequence I prefer to base the decision about what to do
> on the form of the filename (ASCII / UTF-8 / other), not on the
> headers encountered on the way to this file.

I guess we can find an easy agreement.

1. Wget has to obey the defaults. If that fails or we find a well-known
   misbehavior (server/document fault), handle it automatically. That's how
   we try to do it now.

2. If a problem still arises, the user should be able to intervene, using
   special command line options for fine-tuning Wget's behavior.

Of course we try our best, so that 2. is normally not necessary.

You already gave some examples; one of them (the Content-Disposition example)
already led to an optimization (I'll transfer the code to Wget1.x soon).
The other two obeyed the standards (one had f*cked up content, but that
didn't touch Wget's functionality).

I would ask you to give more examples of websites that you think aren't
standard and/or where Wget has problems parsing out the links.
That would be 50% of the work.

> (By the way, I checked my conjecture that iconv from UTF-8
> to UTF-8 need not be the identity map, and that is indeed the case.
> On my Ubuntu machine iconv from UTF-8 to UTF-8 converts NFD to NFC.)

We should have a 'shortcut', so if to-charset and from-charset are the same,
we don't convert.

Tim
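
A rough sketch of the 'shortcut' idea, in Python rather than Wget's C and not taken from the Wget sources: `convert` is a hypothetical function, and comparing canonical codec names via `codecs.lookup` is an assumption about how "the same charset" could be detected. The point is only that identical source and target charsets return the input bytes untouched, so an NFD-encoded name is not silently rewritten.

```python
import codecs


def convert(data: bytes, from_charset: str, to_charset: str) -> bytes:
    """Convert bytes between charsets, skipping the no-op case.

    Hypothetical helper. codecs.lookup() raises LookupError for an
    unknown charset name; callers would need to handle that.
    """
    # Canonical names make "UTF-8", "utf8" and "UTF_8" compare equal.
    same = codecs.lookup(from_charset).name == codecs.lookup(to_charset).name
    if same:
        return data  # shortcut: hand back the original bytes unchanged
    return data.decode(from_charset).encode(to_charset)


if __name__ == "__main__":
    nfd = "e\u0301".encode("utf-8")  # 'e' + combining acute accent (NFD form)
    assert convert(nfd, "UTF-8", "utf8") == nfd  # bytes are left as they were
    print(convert(b"\xe9", "ISO-8859-1", "UTF-8"))  # prints b'\xc3\xa9'
```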
