On 2013-09-12 17:37, Tim Ruehsen wrote:
On Thursday 12 September 2013 12:59:00 Björn Mattsson wrote:
I ran into a bug in wget last week.
I have done some digging but can't solve it by myself.

If I try to wget a file whose name contains capital ÅÄÖ, the letters get
converted wrongly, while åäö works fine.

I use wget -m to back up one of my websites to another machine. It has
worked like a charm for the last 4-5 years, but a couple of weeks ago one
of the files came down wrong. I thought a colleague had uploaded
something wrong, but after some digging it turns out that wget converts it wrongly.

I have UTF-8 as charset on my machine.

If you want to test/see the problem:

wget -m http://bmit.se/wget
A request to http://bmit.se/wget/ returns a text/html document without
specifying the charset (AFAIR, the default is iso-8859-1).
Either your server has to tag the response as utf-8 (Content-Type: text/html;
charset=utf-8) or you have to specify utf-8 in your document header.
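
To illustrate the point about the missing charset, here is a minimal Python
sketch that parses the Content-Type values seen in the debug log further down.
It uses the standard library's email.message.Message; the helper name
effective_charset is hypothetical, and the fallback to iso-8859-1 simply
mirrors the default recalled above. It is not taken from wget's source.

```python
from email.message import Message


def effective_charset(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or `default`.

    `default` is an assumption of this sketch (the fallback recalled in
    the thread), not a value read from wget.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns failobj when no charset parameter exists.
    return msg.get_content_charset(failobj=default)


# Header sent for /wget/ in the log: no charset parameter at all.
print(effective_charset("text/html"))
# What the server would need to send for the page to be tagged as UTF-8.
print(effective_charset("text/html; charset=utf-8"))
# Header sent with the 404 page for /robots.txt in the log.
print(effective_charset("text/html; charset=iso-8859-1"))
```

With an untagged response the caller has to guess, which is why the reply
suggests tagging the response or the document explicitly.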
Ok.
But why does åäö work but not ÅÄÖ (the same letters, but capital)?
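
One possible explanation for the difference, offered as a sketch and not as a
confirmed diagnosis: in UTF-8 all six letters start with the byte 0xC3, but
the second byte of Å, Ä and Ö falls in the range 0x80-0x9F, which ISO-8859-1
reserves for C1 control characters, while the second byte of å, ä and ö is a
printable Latin-1 character. The octal escapes \205, \204 and \226 in the
debug log are exactly those second bytes. The wget manual's description of
--restrict-file-names mentions escaping control characters in the ranges 0-31
and 128-159, which would match, but this thread does not confirm that as the
cause. The helper names below are hypothetical.

```python
LOWER = "\u00e5\u00e4\u00f6"  # the lowercase letters from the report
UPPER = "\u00c5\u00c4\u00d6"  # the capital letters from the report


def utf8_second_byte(ch):
    """Return the second byte of a character that is two bytes in UTF-8."""
    data = ch.encode("utf-8")
    assert len(data) == 2, "expected a two-byte UTF-8 character"
    return data[1]


def is_c1_control(byte):
    """True if the byte value lies in the Latin-1 C1 control range."""
    return 0x80 <= byte <= 0x9F


for ch in LOWER + UPPER:
    b = utf8_second_byte(ch)
    # Print the code point instead of the character to stay ASCII-only.
    print(f"U+{ord(ch):04X}: C3 {b:02X} (octal \\{b:03o}) "
          f"C1 control: {is_c1_control(b)}")
```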

Or you specify --remote-encoding=utf-8 when calling wget.
That didn't work.

Could you give it a try, maybe with -d to see what is going on.
Done; the log file is attached.

Tim

// Björn
Script started on Thu 12 Sep 2013 09:16:23 PM CEST
bmt@ronneby:/tmp$ wget -m -d http://bmit.se/wget/
DEBUG output created by Wget 1.12 on linux-gnu.

Enqueuing http://bmit.se/wget/ at depth 0
Queue count 1, maxcount 1.
[IRI Enqueuing "http://bmit.se/wget/" with None
Dequeuing http://bmit.se/wget/ at depth 0
Queue count 0, maxcount 1.
--2013-09-12 21:16:39--  http://bmit.se/wget/
Resolving bmit.se... 31.209.29.190
Caching bmit.se => 31.209.29.190
Connecting to bmit.se|31.209.29.190|:80... connected.
Created socket 3.
Releasing 0x09089948 (new refcount 1).

---request begin---
GET /wget/ HTTP/1.0

User-Agent: Wget/1.12 (linux-gnu)

Accept: */*

Host: bmit.se

Connection: Keep-Alive



---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 200 OK

Date: Thu, 12 Sep 2013 19:16:39 GMT

Server: Apache/2.2.22 (Debian)

Last-Modified: Wed, 11 Sep 2013 15:24:38 GMT

ETag: "ac004-78-4e61d3830b980"

Accept-Ranges: bytes

Content-Length: 120

Vary: Accept-Encoding

Keep-Alive: timeout=5, max=100

Connection: Keep-Alive

Content-Type: text/html



---response end---
200 OK
Registered socket 3 for persistent reuse.
Length: 120 [text/html]
Saving to: "bmit.se/wget/index.html"


 0% [                                       ] 0           --.-K/s
100%[======================================>] 120         --.-K/s   in 0s

2013-09-12 21:16:40 (2.64 MB/s) - "bmit.se/wget/index.html" saved [120/120]

Loaded bmit.se/wget/index.html (size 120).
bmit.se/wget/index.html: merge("http://bmit.se/wget/", "index.html") -> http://bmit.se/wget/index.html
appending "http://bmit.se/wget/index.html" to urlpos.
bmit.se/wget/index.html: merge("http://bmit.se/wget/", "test") -> http://bmit.se/wget/test
appending "http://bmit.se/wget/test" to urlpos.
bmit.se/wget/index.html: merge("http://bmit.se/wget/", "teståäöÃ\205Ã\204Ã\226") -> http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226
appending "http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226" to urlpos.
no-follow in bmit.se/wget/index.html: 0
Deciding whether to enqueue "http://bmit.se/wget/index.html".
Loading robots.txt; please ignore errors.
--2013-09-12 21:16:40--  http://bmit.se/robots.txt
Reusing existing connection to bmit.se:80.
Reusing fd 3.

---request begin---
GET /robots.txt HTTP/1.0

User-Agent: Wget/1.12 (linux-gnu)

Accept: */*

Host: bmit.se

Connection: Keep-Alive



---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 404 Not Found

Date: Thu, 12 Sep 2013 19:16:40 GMT

Server: Apache/2.2.22 (Debian)

Vary: Accept-Encoding

Content-Length: 281

Keep-Alive: timeout=5, max=99

Connection: Keep-Alive

Content-Type: text/html; charset=iso-8859-1



---response end---
404 Not Found
Skipping 281 bytes of body: [<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /robots.txt was not found on this server.</p>
<hr>
<address>Apache/2.2.22 (Debian) Server at bmit.se Port 80</address>
</body></html>
] done.
2013-09-12 21:16:40 ERROR 404: Not Found.

Decided to load it.
Enqueuing http://bmit.se/wget/index.html at depth 1
Queue count 1, maxcount 1.
[IRI Enqueuing "http://bmit.se/wget/index.html" with None
Deciding whether to enqueue "http://bmit.se/wget/test".
Decided to load it.
Enqueuing http://bmit.se/wget/test at depth 1
Queue count 2, maxcount 2.
[IRI Enqueuing "http://bmit.se/wget/test" with None
Deciding whether to enqueue "http://bmit.se/wget/teståäöÅÄÖ".
Decided to load it.
Enqueuing http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226 at depth 1
Queue count 3, maxcount 3.
[IRI Enqueuing "http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226" with None
Dequeuing http://bmit.se/wget/index.html at depth 1
Queue count 2, maxcount 3.
--2013-09-12 21:16:40--  http://bmit.se/wget/index.html
Reusing existing connection to bmit.se:80.
Reusing fd 3.

---request begin---
HEAD /wget/index.html HTTP/1.0

Referer: http://bmit.se/wget/

User-Agent: Wget/1.12 (linux-gnu)

Accept: */*

Host: bmit.se

Connection: Keep-Alive



---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 200 OK

Date: Thu, 12 Sep 2013 19:16:40 GMT

Server: Apache/2.2.22 (Debian)

Last-Modified: Wed, 11 Sep 2013 15:24:38 GMT

ETag: "ac004-78-4e61d3830b980"

Accept-Ranges: bytes

Content-Length: 120

Vary: Accept-Encoding

Keep-Alive: timeout=5, max=98

Connection: Keep-Alive

Content-Type: text/html



---response end---
200 OK
Length: 120 [text/html]
Server file no newer than local file "bmit.se/wget/index.html" -- not retrieving.

Loaded bmit.se/wget/index.html (size 120).
bmit.se/wget/index.html: merge("http://bmit.se/wget/index.html", "index.html") -> http://bmit.se/wget/index.html
appending "http://bmit.se/wget/index.html" to urlpos.
bmit.se/wget/index.html: merge("http://bmit.se/wget/index.html", "test") -> http://bmit.se/wget/test
appending "http://bmit.se/wget/test" to urlpos.
bmit.se/wget/index.html: merge("http://bmit.se/wget/index.html", "teståäöÃ\205Ã\204Ã\226") -> http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226
appending "http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226" to urlpos.
no-follow in bmit.se/wget/index.html: 0
Deciding whether to enqueue "http://bmit.se/wget/index.html".
Already on the black list.
Decided NOT to load it.
Deciding whether to enqueue "http://bmit.se/wget/test".
Already on the black list.
Decided NOT to load it.
Deciding whether to enqueue "http://bmit.se/wget/teståäöÅÄÖ".
Already on the black list.
Decided NOT to load it.
Dequeuing http://bmit.se/wget/test at depth 1
Queue count 1, maxcount 3.
--2013-09-12 21:16:40--  http://bmit.se/wget/test
Reusing existing connection to bmit.se:80.
Reusing fd 3.

---request begin---
GET /wget/test HTTP/1.0

Referer: http://bmit.se/wget/

User-Agent: Wget/1.12 (linux-gnu)

Accept: */*

Host: bmit.se

Connection: Keep-Alive



---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 200 OK

Date: Thu, 12 Sep 2013 19:16:40 GMT

Server: Apache/2.2.22 (Debian)

Last-Modified: Wed, 11 Sep 2013 15:20:53 GMT

ETag: "ac002-0-4e61d2ac77f40"

Accept-Ranges: bytes

Content-Length: 0

Keep-Alive: timeout=5, max=97

Connection: Keep-Alive



---response end---
200 OK
Length: 0
Saving to: "bmit.se/wget/test"


    [<=>                                    ] 0           --.-K/s
    [ <=>                                   ] 0           --.-K/s   in 0s

2013-09-12 21:16:40 (0.00 B/s) - "bmit.se/wget/test" saved [0/0]

Loaded bmit.se/wget/test (size 0).
no-follow in bmit.se/wget/test: 0
Dequeuing http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226 at depth 1
Queue count 0, maxcount 3.
--2013-09-12 21:16:40--  http://bmit.se/wget/test%C3%A5%C3%A4%C3%B6%C3%85%C3%84%C3%96
Reusing existing connection to bmit.se:80.
Reusing fd 3.

---request begin---
GET /wget/test%C3%A5%C3%A4%C3%B6%C3%85%C3%84%C3%96 HTTP/1.0

Referer: http://bmit.se/wget/

User-Agent: Wget/1.12 (linux-gnu)

Accept: */*

Host: bmit.se

Connection: Keep-Alive



---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 200 OK

Date: Thu, 12 Sep 2013 19:16:40 GMT

Server: Apache/2.2.22 (Debian)

Last-Modified: Wed, 11 Sep 2013 15:21:01 GMT

ETag: "ac003-0-4e61d2b419140"

Accept-Ranges: bytes

Content-Length: 0

Keep-Alive: timeout=5, max=96

Connection: Keep-Alive



---response end---
200 OK
Length: 0
Saving to: "bmit.se/wget/teståäöÃ%85Ã%84Ã%96"


    [<=>                                    ] 0           --.-K/s
    [ <=>                                   ] 0           --.-K/s   in 0s

2013-09-12 21:16:40 (0.00 B/s) - "bmit.se/wget/teståäöÃ%85Ã%84Ã%96" saved [0/0]

Loaded bmit.se/wget/teståäöÃ%85Ã%84Ã%96 (size 0).
no-follow in bmit.se/wget/teståäöÃ%85Ã%84Ã%96: 0
FINISHED --2013-09-12 21:16:40--
Downloaded: 3 files, 120 in 0s (2.64 MB/s)
bmt@ronneby:/tmp$ exit

Script done on Thu 12 Sep 2013 09:16:41 PM CEST
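
Reading the log: the request line carries the link percent-encoded as UTF-8,
while the local file name keeps the lowercase letters as they are and escapes
only three single bytes. The Python sketch below reproduces both strings for
comparison. urllib.parse.quote is the standard library call; escape_c1_bytes
is a hypothetical helper that imitates the observed file name by
percent-escaping bytes in the range 0x80-0x9F. It is an assumption made for
illustration and is not taken from wget's source.

```python
from urllib.parse import quote

# "test" followed by the six letters from the report, written as escapes
# so that the example does not depend on the source file's encoding.
NAME = "test\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"


def escape_c1_bytes(name):
    """Percent-escape UTF-8 bytes in 0x80-0x9F and keep all other bytes raw."""
    out = bytearray()
    for b in name.encode("utf-8"):
        if 0x80 <= b <= 0x9F:
            out += ("%%%02X" % b).encode("ascii")
        else:
            out.append(b)
    return bytes(out)


# Path component as it appears in the GET request of the log.
print(quote(NAME))
# Byte string resembling the saved file name: C3 stays raw, 85/84/96 do not.
print(escape_c1_bytes(NAME))
```

The second result is no longer valid UTF-8, because each capital letter has
lost its continuation byte, which is consistent with the broken names in the
"Saving to:" lines.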
