wget does not uncompress compressed HTML files when running via squid

2006-02-02 Thread Steffen Kaiser

Hello,

I have a problem with this command:

wget http://www.nirsoft.net/faq.html

The file is downloaded, but it is stored gzip'ed on disk. Hence, when
mirroring, the links inside it cannot be parsed and followed.
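
A quick way to confirm what actually landed on disk (the exact wording
of file(1)'s output varies between versions):

$ file faq.html    # reports "gzip compressed data" when the compressed variant was served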


I run this command via squid 2.4.6-2woody11.
wget is 1.9.1-12 (Debian Sarge) and 1.8.1-6.1 (Debian Woody).

If I run the command bypassing the proxy, I get the uncompressed file.
If I run curl 7.13.2-2sarge4 _via_ the proxy, I get the uncompressed
file, too.
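
For comparison, the request headers each client sends through the proxy
can be dumped like this (a sketch; wget's -d flag needs a build with
debug support, and both commands assume http_proxy points at the cache):

$ wget -d -O /dev/null http://www.nirsoft.net/faq.html 2>&1 | less
$ curl -v -o /dev/null http://www.nirsoft.net/faq.html

Neither client sends "Accept-Encoding: gzip" by default, so neither
should be handed a compressed variant.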


Reading http://www.mail-archive.com/wget%40sunsite.dk/msg06952.html, I
have the impression that it should work.


Please see step 3, too:

step 1)

$ wget -S http://www.nirsoft.net/search_freeware.html
--10:28:50--  http://www.nirsoft.net/search_freeware.html
   => `search_freeware.html'
Resolving www-cache... 10.20.10.6
Connecting to www-cache[10.20.10.6]:8080... connected.
Proxy request sent, awaiting response...
 1 HTTP/1.0 200 OK
 2 Date: Thu, 02 Feb 2006 08:02:07 GMT
 3 Server: Apache/1.3.33 (Unix) mod_gzip/1.3.26.1a mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.3.11 FrontPage/5.0.2.2634a mod_ssl/2.8.22 OpenSSL/0.9.7a
 4 Last-Modified: Thu, 27 Jan 2005 03:23:51 GMT
 5 ETag: b00fe-a0d-41f85ec7
 6 Accept-Ranges: bytes
 7 Content-Type: text/html
 8 Content-Encoding: gzip
 9 Content-Length: 896
10 X-Cache: MISS from www-cache.inf.fh-bonn-rhein-sieg.de
11 X-Cache-Lookup: MISS from www-cache.inf.fh-bonn-rhein-sieg.de:3128
12 Age: 5204
13 X-Cache: HIT from www-cache2.inf.fh-bonn-rhein-sieg.de
14 X-Cache-Lookup: HIT from www-cache2.inf.fh-bonn-rhein-sieg.de:3128
15 Proxy-Connection: close

100%[===] 896           --.--K/s


10:28:50 (8.54 MB/s) - `search_freeware.html' saved [896/896]

The file is still gzip'ed on the disk.
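
As a workaround, the downloaded file can be unpacked by hand (a sketch;
gunzip insists on a known suffix such as .gz, hence the rename):

$ mv search_freeware.html search_freeware.html.gz
$ gunzip search_freeware.html.gz    # leaves an uncompressed search_freeware.html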

step 2)
$ curl http://www.nirsoft.net/search_freeware.html > file

The file is uncompressed.

step 3)

$ rm search_freeware.html ; wget -S http://www.nirsoft.net/search_freeware.html

--10:33:27--  http://www.nirsoft.net/search_freeware.html
   => `search_freeware.html'
Resolving www-cache... 10.20.10.6
Connecting to www-cache[10.20.10.6]:8080... connected.
Proxy request sent, awaiting response...
 1 HTTP/1.0 200 OK
 2 Date: Thu, 02 Feb 2006 09:33:13 GMT
 3 Server: Apache/1.3.33 (Unix) mod_gzip/1.3.26.1a mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.3.11 FrontPage/5.0.2.2634a mod_ssl/2.8.22 OpenSSL/0.9.7a
 4 Last-Modified: Thu, 27 Jan 2005 03:23:51 GMT
 5 ETag: b00fe-a0d-41f85ec7
 6 Accept-Ranges: bytes
 7 Content-Length: 2573
 8 Content-Type: text/html
 9 X-Cache: MISS from www-cache.inf.fh-bonn-rhein-sieg.de
10 X-Cache-Lookup: HIT from www-cache.inf.fh-bonn-rhein-sieg.de:3128
11 Age: 16
12 X-Cache: HIT from www-cache2.inf.fh-bonn-rhein-sieg.de
13 X-Cache-Lookup: HIT from www-cache2.inf.fh-bonn-rhein-sieg.de:3128
14 Proxy-Connection: close

100%[===] 2,573         --.--K/s


10:33:27 (24.54 MB/s) - `search_freeware.html' saved [2573/2573]

The file is now uncompressed, too! Presumably squid, whose 2.4 release
predates proper Vary support and keeps a single variant per URL, cached
the identity-encoded copy that curl fetched in step 2 and now hands that
copy to wget as well; the low "Age: 16" header is consistent with this.
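
To take the cache out of the picture entirely, wget can ask the proxy to
refetch from the origin (a sketch; the option is spelled --cache=off in
wget 1.8/1.9 and --no-cache in later releases, and makes wget send
"Pragma: no-cache"):

$ wget -S --cache=off http://www.nirsoft.net/search_freeware.html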

Bye,

--
Steffen Kaiser


compressed html files?

2004-09-24 Thread Leonid Petrov
Juhana,
Looks like wget ignores the Content-Encoding header; at least I did not
find any trace of code that parses it in the sources.

> What can be done?

I think it is a good feature request. You are welcome to implement
it and submit a patch to [EMAIL PROTECTED]
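
Until then, a mirrored tree can be cleaned up after the fact. A minimal
sketch (the mirror/ directory name is hypothetical, and file(1)'s
wording may differ between versions):

$ find mirror -name '*.html' | while read -r f; do
>   if file "$f" | grep -q 'gzip compressed'; then
>     mv "$f" "$f.gz" && gunzip "$f.gz"
>   fi
> done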


compressed html files?

2004-09-23 Thread Juhana Sadeharju
Hello.
The file
  http://www.cs.utah.edu/~gooch/JOT/index.html
is compressed and wget could not follow the URLs in it.
What can be done? Should wget uncompress compressed *.htm
and *.html files? What about *.asp and *.php?
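
One way to check what the server claims about that file (a sketch;
curl's -I flag issues a HEAD request):

$ curl -sI http://www.cs.utah.edu/~gooch/JOT/index.html | grep -i '^content-'

If the response carries "Content-Encoding: gzip", the client is expected
to undo the compression before interpreting the HTML, which wget
currently does not do.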

Juhana