[Bug-wget] Problem with ÅÄÖ and wget

2013-09-12 Thread Björn Mattsson

I ran into a bug in wget last week.
I have done some digging but can't solve it by myself.

If I try to wget a file whose name contains capital ÅÄÖ, the letters
get converted wrongly, while lowercase åäö works fine.


I use wget -m to back up one of my web sites to another machine. It has
worked like a charm for the last 4-5 years, but a couple of weeks ago one
of the files came down wrong. I thought a colleague had uploaded
something wrong, but after some digging it turns out that wget converts
the name wrongly.


I have UTF-8 as the charset on my machine.

If you want to test/see the problem:

wget -m http://bmit.se/wget

--
Best regards
Björn Mattsson
Network engineer
IT Department
Blekinge Institute of Technology
 
bjorn.matts...@bth.se

Office: +46 (0)455-385163
IT Helpdesk: +46 (0)455-385100




Re: [Bug-wget] Problem with ÅÄÖ and wget

2013-09-12 Thread Tim Rühsen
On Thursday, 12 September 2013, 12:59:00, Björn Mattsson wrote:
> If you want to test/see the problem
>
> wget -m http://bmit.se/wget

Just use 
wget --restrict-file-names=nocontrol -m http://bmit.se/wget

Tim




Re: [Bug-wget] Problem with ÅÄÖ and wget

2013-09-12 Thread Björn Mattsson

On 2013-09-12 17:37, Tim Ruehsen wrote:
> On Thursday 12 September 2013 12:59:00 Björn Mattsson wrote:
>> If you want to test/see the problem
>>
>> wget -m http://bmit.se/wget
>
> A request to http://bmit.se/wget/ returns a text/html document without
> specifying the charset (AFAIR, the default is iso-8859-1).
> Either your server has to tag the response as utf-8 (Content-Type:
> text/html; charset=utf-8) or you have to specify utf-8 in your
> document header.

OK.
But why does åäö work while ÅÄÖ (the same letters, but capital) does not?

> Or you specify --remote-encoding=utf-8 when calling wget.

That didn't work.

> Could you give it a try, maybe with -d to see what is going on?

Done, and I have attached the log file.

> Tim

// Björn
Script started on Thu 12 Sep 2013 09:16:23 PM CEST
bmt@ronneby:/tmp$ wget -m -d http://bmit.se/wget/
DEBUG output created by Wget 1.12 on linux-gnu.

Enqueuing http://bmit.se/wget/ at depth 0
Queue count 1, maxcount 1.
[IRI Enqueuing http://bmit.se/wget/ with None
Dequeuing http://bmit.se/wget/ at depth 0
Queue count 0, maxcount 1.
--2013-09-12 21:16:39--  http://bmit.se/wget/
Resolving bmit.se... 31.209.29.190
Caching bmit.se = 31.209.29.190
Connecting to bmit.se|31.209.29.190|:80... connected.
Created socket 3.
Releasing 0x09089948 (new refcount 1).

---request begin---
GET /wget/ HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: bmit.se
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 200 OK
Date: Thu, 12 Sep 2013 19:16:39 GMT
Server: Apache/2.2.22 (Debian)
Last-Modified: Wed, 11 Sep 2013 15:24:38 GMT
ETag: ac004-78-4e61d3830b980
Accept-Ranges: bytes
Content-Length: 120
Vary: Accept-Encoding
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html

---response end---
200 OK
Registered socket 3 for persistent reuse.
Length: 120 [text/html]
Saving to: bmit.se/wget/index.html


 0% [ ] 0   --.-K/s
100%[=] 120 --.-K/s   in 0s

2013-09-12 21:16:40 (2.64 MB/s) - bmit.se/wget/index.html saved [120/120]

Loaded bmit.se/wget/index.html (size 120).
bmit.se/wget/index.html: merge(http://bmit.se/wget/, index.html) -> http://bmit.se/wget/index.html
appending http://bmit.se/wget/index.html to urlpos.
bmit.se/wget/index.html: merge(http://bmit.se/wget/, test) -> http://bmit.se/wget/test
appending http://bmit.se/wget/test to urlpos.
bmit.se/wget/index.html: merge(http://bmit.se/wget/, teståäöÃ\205Ã\204Ã\226) -> http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226
appending http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226 to urlpos.
no-follow in bmit.se/wget/index.html: 0
Deciding whether to enqueue http://bmit.se/wget/index.html.
Loading robots.txt; please ignore errors.
--2013-09-12 21:16:40--  http://bmit.se/robots.txt
Reusing existing connection to bmit.se:80.
Reusing fd 3.

---request begin---
GET /robots.txt HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: bmit.se
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 404 Not Found
Date: Thu, 12 Sep 2013 19:16:40 GMT
Server: Apache/2.2.22 (Debian)
Vary: Accept-Encoding
Content-Length: 281
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

---response end---
404 Not Found
Skipping 281 bytes of body: [<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /robots.txt was not found on this server.</p>
<hr>
<address>Apache/2.2.22 (Debian) Server at bmit.se Port 80</address>
</body></html>
] done.
2013-09-12 21:16:40 ERROR 404: Not Found.

Decided to load it.
Enqueuing http://bmit.se/wget/index.html at depth 1
Queue count 1, maxcount 1.
[IRI Enqueuing http://bmit.se/wget/index.html with None
Deciding whether to enqueue http://bmit.se/wget/test.
Decided to load it.
Enqueuing http://bmit.se/wget/test at depth 1
Queue count 2, maxcount 2.
[IRI Enqueuing http://bmit.se/wget/test with None
Deciding whether to enqueue http://bmit.se/wget/teståäöÅÄÖ.
Decided to load it.
Enqueuing http://bmit.se/wget/teståäöÃ\205Ã\204Ã\226 at depth 1
Queue count 3, maxcount 

Re: [Bug-wget] Problem with ÅÄÖ and wget

2013-09-12 Thread Tim Rühsen
On Thursday, 12 September 2013, 17:37:17, Tim Ruehsen wrote:
> On Thursday 12 September 2013 12:59:00 Björn Mattsson wrote:
>> If you want to test/see the problem
>>
>> wget -m http://bmit.se/wget
>
> A request to http://bmit.se/wget/ returns a text/html document without
> specifying the charset (AFAIR, the default is iso-8859-1).
> Either your server has to tag the response as utf-8 (Content-Type:
> text/html; charset=utf-8) or you have to specify utf-8 in your
> document header.
>
> Or you specify --remote-encoding=utf-8 when calling wget.
>
> Could you give it a try, maybe with -d to see what is going on?

Sorry, forget my answer.
In the meantime I was able to run some tests in a UTF-8 environment, and
yes, Wget 1.14 (the Debian package as well as current git) has the
problem you described.

I am not sure whether we can change it without breaking backward
compatibility.

Tim




Re: [Bug-wget] Problem with ÅÄÖ and wget

2013-09-12 Thread Ángel González

Tim Rühsen wrote:
> Sorry, forget my answer.
> In the meantime I was able to run some tests in a UTF-8 environment, and
> yes, Wget 1.14 (the Debian package as well as current git) has the
> problem you described.
>
> I am not sure whether we can change it without breaking backward
> compatibility.
>
> Tim

Wasn't that problem always there?
Looks like bug 37564 [1]; you can work around it with
--restrict-file-names=nocontrol

You may find some more information in the list archives.

1- https://savannah.gnu.org/bugs/index.php?37564
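
A byte-level look may explain why only the capital letters are affected.
The sketch below is an illustration, not wget code: the helper names are
made up, and it assumes what the wget manual describes for
--restrict-file-names on Unix, namely that bytes in the ranges 0-31 and
128-159 are treated as control characters and escaped by default. In
UTF-8, Å, Ä and Ö end in the bytes 0x85, 0x84 and 0x96 (the \205, \204
and \226 in the debug log above), which fall inside 128-159, while å, ä
and ö end in 0xA5, 0xA4 and 0xB6, which do not.

```python
# Illustrative sketch only; these helpers are hypothetical and not wget code.
# Assumption: by default, wget on Unix treats bytes 0-31 and 128-159 as
# control characters and percent-escapes them in local file names.


def is_control_byte(b):
    """True if the byte lies in one of the assumed control ranges."""
    return b <= 31 or 128 <= b <= 159


def control_bytes(text):
    """Return the UTF-8 bytes of text that fall into the control ranges."""
    return [b for b in text.encode("utf-8") if is_control_byte(b)]


def escape_control_bytes(name):
    """Percent-escape the control-range bytes, leaving all others as-is."""
    out = bytearray()
    for b in name.encode("utf-8"):
        if is_control_byte(b):
            out += b"%%%02X" % b  # e.g. 0x85 becomes b"%85"
        else:
            out.append(b)
    return bytes(out)


if __name__ == "__main__":
    for ch in "åäöÅÄÖ":
        encoded = ch.encode("utf-8")
        as_hex = " ".join("%02x" % b for b in encoded)
        hits = [hex(b) for b in control_bytes(ch)]
        print(ch, as_hex, "control-range bytes:", hits or "none")
    print(escape_control_bytes("teståäöÅÄÖ"))
```

If that assumption holds, only the second byte of each capital letter is
escaped and the lowercase letters pass through untouched, which would
match the symptom reported in this thread. It would also be consistent
with --restrict-file-names=nocontrol helping, since the manual describes
that mode as switching off the escaping of control characters.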