On Friday 21 August 2015 14:22:22 Andries E. Brouwer wrote:
> On Fri, Aug 21, 2015 at 01:31:45PM +0200, Tim Ruehsen wrote:
> > > There is a remote site.
> > > Nothing is known about this remote site.
> >
> > Wrong. Regarding HTTP(S), we know exactly the encoding
> > of each downloaded HTML and CSS document
> > (that's what I call 'remote encoding').
>
> You are an optimist. In my experience Firefox rarely gets it right.
> Let me find some random site. Say
> http://web2go.board19.com/gopro/go_view.php?id=12345
I try to be an optimist in all situations, yes :-)

> If I go there with Firefox, I get a go board with a lot of mojibake
> around it. Firefox took the encoding to be Unicode. Trying out what
> I have to say in the "Text encoding" menu, it turns out to be
> "Chinese, Traditional".

The server tells us the document is UTF-8. The document itself also
claims to be UTF-8. But then some moron (there are a lot of these dudes
doing web page 'design') put non-UTF-8 text into the document. That is
like putting plum pudding into a jar labeled 'strawberry jam'. What will
you do? Go back and return it? Or accept it, saying 'uh oh, my strawberry
allergy will bite me, but I am a tough guy'?

*BUT* that is not the point for wget, since wget doesn't mess around with
the textual content (no conversion takes place). When used recursively,
wget extracts URLs from the document, *NOT* from the text but from the
HTML tags/attributes. And *surprise*, all of the links in the document
are UTF-8 / ASCII (otherwise not a single browser in the world would be
able to follow them). And all that matters are the URLs from the HTML
attributes.

> And you say "misconfigured servers", but often one gets a
> Unix or Windows file hierarchy, and several character sets occur.
> The server doesn't know. The sysadmin doesn't know. A university
> machine will have many users with files in several languages
> and character sets.

Trust them, they know. If not, their web site will be heavily broken,
but then there is nothing for us to fix.

> Moreover, the character set of a filename is in general unrelated
> to the character set of the contents of the file. That is most clear
> when the file is not a text file. What character set is the filename
>
> http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg

Wrong question. It is a JPEG file, and content doesn't matter to wget.

Apart from that, if you want to download the file mentioned above and you
have a UTF-8 locale, you have to tell wget via --local-encoding which
encoding the URL is in. But if wget --recursive finds the above URL
within an HTML attribute, you won't need --local-encoding. Using the
rules from http://www.w3.org/TR/html4/charset.html#h-5.2.2, wget will
know the correct encoding and will just do the right thing (after the
currently discussed change regarding charsets / file naming). Wget2
already does this (see the sketches appended below).

$ wget --local-encoding=iso-8859-1 'http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg'
--2015-08-21 16:30:05--  http://www.win.tue.nl/~aeb/linux/lk/kn%C3%A4ckebr%C3%B6d.jpg
Resolving www.win.tue.nl (www.win.tue.nl)... 131.155.0.177
Connecting to www.win.tue.nl (www.win.tue.nl)|131.155.0.177|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2015-08-21 16:30:05 ERROR 404: Not Found.

--2015-08-21 16:30:05--  http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
Reusing existing connection to www.win.tue.nl:80.
HTTP request sent, awaiting response... 200 OK
Length: 11690 (11K) [image/jpeg]
Saving to: ‘knäckebröd.jpg’

knäckebröd.jp 100%[=========================================================================>]  11.42K  --.-KB/s  in 0.002s

2015-08-21 16:30:05 (6.83 MB/s) - ‘knäckebröd.jpg’ saved [11690/11690]

(Old wget having the progress bar bug.)

Tim
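For illustration, a minimal sketch in C of the charset-priority rule from
the HTML 4 spec cited above: the charset parameter of the HTTP Content-Type
header wins, then a meta http-equiv declaration inside the document, then a
caller-supplied default. The function names and the fallback value are
hypothetical, not wget's actual code, and the spec's third source (a
charset attribute on the referring element) is omitted for brevity.

/* Charset-priority sketch per http://www.w3.org/TR/html4/charset.html#h-5.2.2
 * (hypothetical helpers, not wget's actual implementation). */
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <ctype.h>

/* Case-insensitive search for "charset=" in s; copies the value into out.
   Returns 1 on success, 0 if absent. */
static int find_charset(const char *s, char *out, size_t outlen)
{
    if (!s)
        return 0;
    for (; *s; s++) {
        if (strncasecmp(s, "charset=", 8) == 0) {
            s += 8;
            size_t i = 0;
            while (*s && *s != ';' && *s != '"' && *s != '\''
                   && !isspace((unsigned char)*s) && i + 1 < outlen)
                out[i++] = *s++;
            out[i] = '\0';
            return i > 0;
        }
    }
    return 0;
}

/* Decide the document encoding following the HTML 4 priority order. */
static void guess_encoding(const char *content_type_header,
                           const char *meta_content_attr,
                           const char *fallback,
                           char *out, size_t outlen)
{
    if (find_charset(content_type_header, out, outlen))
        return;                            /* 1. HTTP header wins      */
    if (find_charset(meta_content_attr, out, outlen))
        return;                            /* 2. then the meta element */
    snprintf(out, outlen, "%s", fallback); /* 3. then the default      */
}

int main(void)
{
    char enc[64];
    guess_encoding("text/html; charset=UTF-8",
                   "text/html; charset=Big5",
                   "ISO-8859-1", enc, sizeof enc);
    printf("document encoding: %s\n", enc); /* prints UTF-8: header wins */
    return 0;
}

This compiles with any C compiler and prints "UTF-8" for the example
inputs, since the HTTP header outranks the meta declaration.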
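And a small sketch, again hypothetical rather than wget's actual code, of
what --local-encoding=iso-8859-1 implies for kn%e4ckebr%f6d.jpg:
percent-decode to the raw Latin-1 bytes 0xE4 and 0xF6, convert them to
UTF-8 with iconv, and re-encode, which yields exactly the
kn%C3%A4ckebr%C3%B6d.jpg seen in the log above.

/* Latin-1 URL bytes -> UTF-8 percent-encoding (illustration only). */
#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    /* kn%e4ckebr%f6d.jpg percent-decoded: 'kn', 0xE4, 'ckebr', 0xF6, 'd.jpg' */
    char latin1[] = "kn\xE4" "ckebr\xF6" "d.jpg";
    char utf8[64];
    char *in = latin1, *out = utf8;
    size_t inleft = strlen(latin1), outleft = sizeof utf8 - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1
        || iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
        perror("iconv");
        return 1;
    }
    *out = '\0';
    iconv_close(cd);

    /* Re-apply percent-encoding to the non-ASCII bytes. */
    for (unsigned char *p = (unsigned char *)utf8; *p; p++) {
        if (*p < 0x80)
            putchar(*p);
        else
            printf("%%%02X", *p);
    }
    putchar('\n'); /* prints kn%C3%A4ckebr%C3%B6d.jpg */
    return 0;
}

On glibc, iconv is built into libc; on BSD or macOS you may need to link
with -liconv.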