On 2013-09-12 17:37, Tim Ruehsen wrote:
On Thursday 12 September 2013 12:59:00 Björn Mattsson wrote:
I ran into a bug in wget last week.
I have done some digging but can't solve it by myself.
If I try to wget a file whose name contains capital ÅÄÖ, the letters get
converted wrongly, while åäö works fine.
I use wget -m to back up one of my web sites to another machine. It has
worked like a charm for the last 4-5 years, but a couple of weeks ago one
of the files came down wrong. I thought a colleague had uploaded
something wrong, but after some digging it turns out that wget converts wrongly.
I have UTF-8 as the charset on my machine.
If you want to test/see the problem
wget -m http://bmit.se/wget
A request to http://bmit.se/wget/ returns a text/html document without
specifying the charset (AFAIR, the default is iso-8859-1).
Either your server has to tag the response as utf-8 (Content-Type: text/html;
charset=utf-8) or you have to specify utf-8 in your document header.
OK.
But why does åäö work but not ÅÄÖ (the same letters, but capital)?
Or you specify --remote-encoding=utf-8 when calling wget.
That didn't work.
Could you give it a try, maybe with -d to see what is going on?
Done; the log file is attached.
Tim
// Björn
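
Tim's point about the missing charset can be illustrated with a small sketch. It is not how wget decides on an encoding; it only shows the fallback rule he describes, using Python's standard header parser. The function name charset_of is hypothetical, and the iso-8859-1 default is taken from his "AFAIR" remark rather than verified here.

```python
from email.message import Message


def charset_of(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value.

    Falls back to `default` when the header carries no charset,
    which is the situation Tim describes for http://bmit.se/wget/.
    The default value is an assumption, not something wget guarantees.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


# The two Content-Type values that appear in the attached log,
# plus the tagged variant Tim suggests.
for value in ("text/html",
              "text/html; charset=iso-8859-1",
              "text/html; charset=utf-8"):
    print(value, "->", charset_of(value))
```

Under that rule, the index page in the log (plain "text/html") would be treated as iso-8859-1 unless the server or the document says otherwise.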
Script started on Thu 12 Sep 2013 09:16:23 PM CEST
bmt@ronneby:/tmp$ wget -m -d http://bmit.se/wget/
DEBUG output created by Wget 1.12 on linux-gnu.
Enqueuing http://bmit.se/wget/ at depth 0
Queue count 1, maxcount 1.
[IRI Enqueuing "http://bmit.se/wget/" with None
Dequeuing http://bmit.se/wget/ at depth 0
Queue count 0, maxcount 1.
--2013-09-12 21:16:39-- http://bmit.se/wget/
Resolving bmit.se... 31.209.29.190
Caching bmit.se => 31.209.29.190
Connecting to bmit.se|31.209.29.190|:80... connected.
Created socket 3.
Releasing 0x09089948 (new refcount 1).
---request begin---
GET /wget/ HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: bmit.se
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Date: Thu, 12 Sep 2013 19:16:39 GMT
Server: Apache/2.2.22 (Debian)
Last-Modified: Wed, 11 Sep 2013 15:24:38 GMT
ETag: "ac004-78-4e61d3830b980"
Accept-Ranges: bytes
Content-Length: 120
Vary: Accept-Encoding
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html
---response end---
200 OK
Registered socket 3 for persistent reuse.
Length: 120 [text/html]
Saving to: "bmit.se/wget/index.html"
0% [
] 0 --.-K/s
100%[=============================================================================================================================>]
120 --.-K/s in 0s
2013-09-12 21:16:40 (2.64 MB/s) - "bmit.se/wget/index.html" saved [120/120]
Loaded bmit.se/wget/index.html (size 120).
bmit.se/wget/index.html: merge("http://bmit.se/wget/", "index.html") ->
http://bmit.se/wget/index.html
appending "http://bmit.se/wget/index.html" to urlpos.
bmit.se/wget/index.html: merge("http://bmit.se/wget/", "test") ->
http://bmit.se/wget/test
appending "http://bmit.se/wget/test" to urlpos.
bmit.se/wget/index.html: merge("http://bmit.se/wget/",
"teståäöÃ\205Ã\204Ã\226") -> http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226
appending "http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226" to urlpos.
no-follow in bmit.se/wget/index.html: 0
Deciding whether to enqueue "http://bmit.se/wget/index.html".
Loading robots.txt; please ignore errors.
--2013-09-12 21:16:40-- http://bmit.se/robots.txt
Reusing existing connection to bmit.se:80.
Reusing fd 3.
---request begin---
GET /robots.txt HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: bmit.se
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 404 Not Found
Date: Thu, 12 Sep 2013 19:16:40 GMT
Server: Apache/2.2.22 (Debian)
Vary: Accept-Encoding
Content-Length: 281
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
---response end---
404 Not Found
Skipping 281 bytes of body: [<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /robots.txt was not found on this server.</p>
<hr>
<address>Apache/2.2.22 (Debian) Server at bmit.se Port 80</address>
</body></html>
] done.
2013-09-12 21:16:40 ERROR 404: Not Found.
Decided to load it.
Enqueuing http://bmit.se/wget/index.html at depth 1
Queue count 1, maxcount 1.
[IRI Enqueuing "http://bmit.se/wget/index.html" with None
Deciding whether to enqueue "http://bmit.se/wget/test".
Decided to load it.
Enqueuing http://bmit.se/wget/test at depth 1
Queue count 2, maxcount 2.
[IRI Enqueuing "http://bmit.se/wget/test" with None
Deciding whether to enqueue "http://bmit.se/wget/teståäöÅÄÖ".
Decided to load it.
Enqueuing http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226 at depth 1
Queue count 3, maxcount 3.
[IRI Enqueuing "http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226" with None
Dequeuing http://bmit.se/wget/index.html at depth 1
Queue count 2, maxcount 3.
--2013-09-12 21:16:40-- http://bmit.se/wget/index.html
Reusing existing connection to bmit.se:80.
Reusing fd 3.
---request begin---
HEAD /wget/index.html HTTP/1.0
Referer: http://bmit.se/wget/
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: bmit.se
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Date: Thu, 12 Sep 2013 19:16:40 GMT
Server: Apache/2.2.22 (Debian)
Last-Modified: Wed, 11 Sep 2013 15:24:38 GMT
ETag: "ac004-78-4e61d3830b980"
Accept-Ranges: bytes
Content-Length: 120
Vary: Accept-Encoding
Keep-Alive: timeout=5, max=98
Connection: Keep-Alive
Content-Type: text/html
---response end---
200 OK
Length: 120 [text/html]
Server file no newer than local file "bmit.se/wget/index.html" -- not retrieving.
Loaded bmit.se/wget/index.html (size 120).
bmit.se/wget/index.html: merge("http://bmit.se/wget/index.html", "index.html")
-> http://bmit.se/wget/index.html
appending "http://bmit.se/wget/index.html" to urlpos.
bmit.se/wget/index.html: merge("http://bmit.se/wget/index.html", "test") ->
http://bmit.se/wget/test
appending "http://bmit.se/wget/test" to urlpos.
bmit.se/wget/index.html: merge("http://bmit.se/wget/index.html",
"teståäöÃ\205Ã\204Ã\226") -> http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226
appending "http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226" to urlpos.
no-follow in bmit.se/wget/index.html: 0
Deciding whether to enqueue "http://bmit.se/wget/index.html".
Already on the black list.
Decided NOT to load it.
Deciding whether to enqueue "http://bmit.se/wget/test".
Already on the black list.
Decided NOT to load it.
Deciding whether to enqueue "http://bmit.se/wget/teståäöÅÄÖ".
Already on the black list.
Decided NOT to load it.
Dequeuing http://bmit.se/wget/test at depth 1
Queue count 1, maxcount 3.
--2013-09-12 21:16:40-- http://bmit.se/wget/test
Reusing existing connection to bmit.se:80.
Reusing fd 3.
---request begin---
GET /wget/test HTTP/1.0
Referer: http://bmit.se/wget/
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: bmit.se
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Date: Thu, 12 Sep 2013 19:16:40 GMT
Server: Apache/2.2.22 (Debian)
Last-Modified: Wed, 11 Sep 2013 15:20:53 GMT
ETag: "ac002-0-4e61d2ac77f40"
Accept-Ranges: bytes
Content-Length: 0
Keep-Alive: timeout=5, max=97
Connection: Keep-Alive
---response end---
200 OK
Length: 0
Saving to: "bmit.se/wget/test"
[<=>
] 0 --.-K/s
[ <=>
] 0 --.-K/s in
0s
2013-09-12 21:16:40 (0.00 B/s) - "bmit.se/wget/test" saved [0/0]
Loaded bmit.se/wget/test (size 0).
no-follow in bmit.se/wget/test: 0
Dequeuing http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226 at depth 1
Queue count 0, maxcount 3.
--2013-09-12 21:16:40-- http://bmit.se/wget/test%C3%A5%C3%A4%C3%B6%C3%85%C3%84%C3%96
Reusing existing connection to bmit.se:80.
Reusing fd 3.
---request begin---
GET /wget/test%C3%A5%C3%A4%C3%B6%C3%85%C3%84%C3%96 HTTP/1.0
Referer: http://bmit.se/wget/
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: bmit.se
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Date: Thu, 12 Sep 2013 19:16:40 GMT
Server: Apache/2.2.22 (Debian)
Last-Modified: Wed, 11 Sep 2013 15:21:01 GMT
ETag: "ac003-0-4e61d2b419140"
Accept-Ranges: bytes
Content-Length: 0
Keep-Alive: timeout=5, max=96
Connection: Keep-Alive
---response end---
200 OK
Length: 0
Saving to: "bmit.se/wget/teståäöÃ%85Ã%84Ã%96"
[<=>
] 0 --.-K/s
[ <=>
] 0 --.-K/s in
0s
2013-09-12 21:16:40 (0.00 B/s) - "bmit.se/wget/teståäöÃ%85Ã%84Ã%96" saved [0/0]
Loaded bmit.se/wget/teståäöÃ%85Ã%84Ã%96 (size 0).
no-follow in bmit.se/wget/teståäöÃ%85Ã%84Ã%96: 0
FINISHED --2013-09-12 21:16:40--
Downloaded: 3 files, 120 in 0s (2.64 MB/s)
bmt@ronneby:/tmp$ exit
Script done on Thu 12 Sep 2013 09:16:41 PM CEST
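
One possible reading of the log, offered as a sketch rather than a diagnosis: the request line carries the correct UTF-8 percent-encoding of the name, and only the local file name differs. The wget manual says that the default "unix" mode of --restrict-file-names escapes control characters in the ranges 0-31 and 128-159. The second UTF-8 byte of each capital letter (0x85, 0x84, 0x96) falls in that range, while the second byte of each lowercase letter (0xA5, 0xA4, 0xB6) does not. The helper escape_control_bytes below is hypothetical and models only that one rule; it is not wget's code.

```python
from urllib.parse import quote


def escape_control_bytes(raw: bytes) -> bytes:
    """Percent-escape bytes 0-31 and 128-159; copy all other bytes unchanged.

    This imitates only the control-character part of the documented
    "unix" file-name restriction; it is an illustration, not wget's logic.
    """
    out = bytearray()
    for b in raw:
        if b < 0x20 or 0x80 <= b <= 0x9F:
            out += f"%{b:02X}".encode("ascii")
        else:
            out.append(b)
    return bytes(out)


lower = "åäö".encode("utf-8")  # bytes c3 a5, c3 a4, c3 b6
upper = "ÅÄÖ".encode("utf-8")  # bytes c3 85, c3 84, c3 96

# Lowercase bytes pass through untouched, so the name stays valid UTF-8.
print(escape_control_bytes(lower) == lower)

# Uppercase: every second byte is escaped, leaving a lone 0xC3 before each %XX.
print(escape_control_bytes(upper))

# The path as sent in the GET request: plain UTF-8 percent-encoding.
print(quote("/wget/teståäöÅÄÖ"))
```

If this reading is right, it matches the "Saving to:" line in the log, where each capital letter shows up as a stray Ã followed by %85, %84 or %96. The manual also describes a "nocontrol" mode of --restrict-file-names that turns off the escaping of control characters, which may be worth trying.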