Re: Feature suggestion: change detection for "wget -c"

2006-09-15 Thread John McCabe-Dansted

On 9/15/06, Mauro Tortonesi <[EMAIL PROTECTED]> wrote:

reliable detection of changes in the resource to be downloaded would be
a very interesting feature. but do you really think that checking the
last X (< 100) bytes would be enough to be reasonably sure the resource
was (not) modified? what about resources which are updated by appending
information, such as log files?


In terms of corruption prevention, wget -c is safe if the resources
are updated only by appending.

Two weaknesses I can think of are logs with fixed-width repetitive
messages, e.g.

 12:05 Disks not mirrored
 12:10 Disks not mirrored

Then if we did a wget -c on the new log file

 11:40 Disks not mirrored
 11:45 Disks not mirrored
 11:50 Disks not mirrored

we would get an invalid log file. However, I imagine most log files
have at least a few variable-length messages, so this technique would
work on a majority of log files (well over 50%).

Another weakness would be uncompressed database files...

However I suspect that comparing the last 4 bytes would catch 90% of
the real world snafus. I can't verify this without doing a survey of
wget users, but I can say that this would have caught 100% of my own
snafus.

There are two problems common enough to be mentioned in the man page:
proxies that append "transfer interrupted" to the end of failed
downloads, and inappropriate use of "wget -c -r".  Checking the last 4
bytes would catch ~100% of cases of "transfer interrupted" being
appended. If wget acts recursively on a directory (wget -c -r), there
are many more opportunities for corruption to be detected.
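The tail check described above can be sketched as a small helper (a hypothetical illustration in Python, not wget code; the helper name and the 4-byte window are my own):

```python
def tail_matches(local_data: bytes, remote_tail: bytes) -> bool:
    """Return True if the last len(remote_tail) bytes of the local file
    equal the remote bytes fetched from the same offsets (e.g. via an
    HTTP Range request or FTP REST), i.e. the local file still looks
    like a valid prefix of the remote resource."""
    n = len(remote_tail)
    if n == 0 or len(local_data) < n:
        return False
    return local_data[-n:] == remote_tail


# Append-only update: the old bytes are untouched, so the check passes
# and "wget -c" can safely resume.
local = b"12:05 Disks not mirrored\n12:10 Disks not mirrored\n"
appended = local + b"12:15 Disks mirrored\n"
assert tail_matches(local, appended[len(local) - 4:len(local)])

# Rotated log with fixed-width repetitive lines: the content changed,
# but the bytes at those offsets happen to coincide -- the false pass
# described above.
rotated = b"11:40 Disks not mirrored\n11:45 Disks not mirrored\n"
assert tail_matches(local, rotated[len(local) - 4:len(local)])
```

The second assertion is exactly the fixed-width-log weakness: the check passes even though the file was replaced.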

--
John C. McCabe-Dansted
PhD Student
University of Western Australia


Re: wget "how do I do..."

2006-09-15 Thread Steven M. Schweda
From: Craig A. Finseth

   It might help to know which version of Wget you're using ("wget -V"),
and on which system type you're running it.  Adding "-d" to the wget
command line might give you more clues as to what it's trying to do. 
Seeing the debug output might save considerable code tracing, as I, for
example, don't have access (so far as I know) to an FTP server which
acts that way.

   Probably useless guesswork: Does it help to add a trailing "/" to the
URL ("ftp://...:...@//")?  Same behavior with "-r"?



   Steven M. Schweda   [EMAIL PROTECTED]
   382 South Warwick Street(+1) 651-699-9818
   Saint Paul  MN  55105-2547


error running tests: Can't locate object method "new" via package "HTTPTest"

2006-09-15 Thread Ryan Barrett

hi all. is anyone successfully running the perl unit tests? i have perl 5.8.0
and libwww-perl 5.65 happily installed, but i'm getting this error:

heaven:~/wget/tests> ./Test1.px
Can't locate object method "new" via package "HTTPTest" at ./Test1.px line 38.

the "new" method is defined in Test, which is HTTPTest's base class. evidently
it finds and loads the Test and HTTPTest packages ok; it just can't locate the
"new" method in either package. from "Programming Perl":

  Can't locate object method "%s" via package "%s"

  (F) You called a method correctly, and it correctly indicated a package
  functioning as a class, but the package doesn't define that method name,
  nor do any of its base classes (which is why the message says "via"
  rather than "in").

thoughts?

-Ryan

--
http://snarfed.org/


wget "how do I do..."

2006-09-15 Thread Craig A. Finseth
I am trying to mirror an FTP site which has access control in that it
doesn't let you do a "dir" on the root (it returns an empty list).  In
other words, if you manually do:

ftp 
username: ...
password: ...
dir

You get an empty list.  But if you do:

cd 
dir

You get your files.  Note that I've tried doing:

wget ftp://...:...@/

and it still fails.  It appears that wget is getting the empty listing
and (not unreasonably) deciding that there is nothing there.

In essence, what I want to do is insert a "cd " after login but
before wget tries to do anything else.

Is there an existing way to do this, or do I need to modify the code?

FWIW, it's a Windows server.
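One thing worth trying before modifying the code: put the directory itself in the URL, so wget changes into it before requesting a listing rather than listing the (empty) root. Host, credentials, and path below are placeholders:

```shell
# Placeholders throughout -- substitute the real host and directory.
# The trailing slash makes wget treat pub/data as a directory to
# CWD into and list, instead of listing the root.
wget -r "ftp://username:password@ftp.example.com/pub/data/"
```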

Craig A. Finseth[EMAIL PROTECTED]
Systems Architect   +1 651 201 1011 desk
State of Minnesota, Office of Enterprise Technology
658 Cedar Ave   +1 651 297 5368 fax
St Paul MN 55155+1 651 297  NOC, for reporting problems




Re: Bug

2006-09-15 Thread Mauro Tortonesi

Reece ha scritto:

Found a bug (sort of).

When trying to get all the images in the directory below:
http://www.netstate.com/states/maps/images/

It gives 403 Forbidden errors for most of the images even after
setting the agent string to firefox's, and setting -e robots=off

After a packet capture, it appears that the site will give the
forbidden error if the Referer is not exactly correct.  However,
since wget actually uses the domain www.netstate.com:80 instead of
the bare hostname, it screws it all up.  I've been unable to find any
way to tell wget not to insert the port in the requesting URL and
referrer URL.

Here is the full command I was using:

wget -r -l 1 -H -U "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT
5.0)" -e robots=off -d -nh http://www.netstate.com/states/maps/images/
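A workaround that may be worth testing (a hedged suggestion, not verified against this site): wget's --referer option sets the Referer header explicitly, without the :80 suffix that wget otherwise derives from the parent URL:

```shell
# Same retrieval as above, but with the Referer pinned explicitly
# (no :80 port suffix); whether this site accepts it is untested.
wget -r -l 1 -H -U "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" \
     -e robots=off --referer="http://www.netstate.com/states/maps/images/" \
     http://www.netstate.com/states/maps/images/
```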


hi reece,

that's an interesting bug. i've just added it to my "THINGS TO FIX" list.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: --html-extension and --convert-links don't work together

2006-09-15 Thread Mauro Tortonesi

Ryan Barrett ha scritto:

hi wget developers! nicolas mizel reported a bug with --html-extension and
--convert-links about a year and a half ago. in a nutshell, --html-extension
appends .html to non-html filenames, but --convert-links doesn't use the
.html filenames when it converts links.

http://www.mail-archive.com/wget@sunsite.dk/msg07688.html

he reported it against 1.9.1, but it's still broken in 1.10.2. any 
chance it could be fixed in the next release?


in my opinion, this is a serious bug. we should fix it ASAP.

i have a lot on my plate right now, but if it'd help, i could probably 
whip up a patch in a few weeks or so...


that would be great. thanks.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget: ignores Content-Disposition header

2006-09-15 Thread Mauro Tortonesi

Jochen Roderburg ha scritto:

Noèl Köthe schrieb:

Hello,

I can reproduce the following with 1.10.2 and 1.11.beta1:

Wget ignores Content-Disposition header described in RFC 2616,
19.5.1 Content-Disposition.

an example URL is:

http://bugs.debian.org/cgi-bin/bugreport.cgi/%252Ftmp%252Fupdate-grub.patch?bug=168715;msg=5;att=1 


Sorry, I don't see any Content-Disposition header in this example URL  ;-)

Result of a HEAD request:

200 OK
Connection: close
Date: Fri, 15 Sep 2006 12:58:14 GMT
Server: Apache/1.3.33 (Debian GNU/Linux)
Content-Type: text/html; charset=utf-8
Last-Modified: Mon, 04 Aug 2003 21:18:10 GMT
Client-Date: Fri, 15 Sep 2006 12:58:14 GMT
Client-Response-Num: 1


My own experience is that the 1.11 alpha/beta versions (where this 
feature was introduced) worked fine with the examples I encountered.


Jochen is right:

[EMAIL PROTECTED]:~/tmp$ LANG=C ~/code/svn/wget/src/wget -S -d 
http://bugs.debian.org/cgi-bin/bugreport.cgi/%252Ftmp%252Fupdate-grub.patch?bug=168715;msg=5;att=1

DEBUG output created by Wget 1.10+devel on linux-gnu.

--16:58:52--  http://bugs.debian.org/cgi-bin/bugreport.cgi/%252Ftmp%252Fupdate-grub.patch?bug=168715

Resolving bugs.debian.org... 140.211.166.43
Caching bugs.debian.org => 140.211.166.43
Connecting to bugs.debian.org|140.211.166.43|:80... connected.
Created socket 3.
Releasing 0x00556550 (new refcount 1).

---request begin---
GET /cgi-bin/bugreport.cgi/%252Ftmp%252Fupdate-grub.patch?bug=168715 HTTP/1.0

User-Agent: Wget/1.10+devel
Accept: */*
Host: bugs.debian.org
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.0 200 OK
Date: Fri, 15 Sep 2006 14:54:55 GMT
Content-Type: text/html; charset=utf-8
Server: Apache/1.3.33 (Debian GNU/Linux)
Via: 1.1 proxy (NetCache NetApp/5.6.2R1)

---response end---

  HTTP/1.0 200 OK
  Date: Fri, 15 Sep 2006 14:54:55 GMT
  Content-Type: text/html; charset=utf-8
  Server: Apache/1.3.33 (Debian GNU/Linux)
  Via: 1.1 proxy (NetCache NetApp/5.6.2R1)
Length: unspecified [text/html]
Saving to: `%2Ftmp%2Fupdate-grub.patch?bug=168715'

[<=>                                  ] 20,018      32.6K/s   in 0.6s


Closed fd 3
16:58:54 (32.6 KB/s) - `%2Ftmp%2Fupdate-grub.patch?bug=168715' saved [20018]


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: help downloading site

2006-09-15 Thread Mauro Tortonesi

Tate Mitchell ha scritto:


Would it be possible to download each lesson individually, so that as
lessons are added, or finished, I can download them without re-downloading
the whole site? Could someone tell me how please? Or would it be possible to
download the whole thing and just re-download parts that have been added
since the previous download?


why don't you try something like:

wget -m -k -np 
http://www.ncsu.edu/project/hindi_lessons/Hindi.Less.01/index.html


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: one more thing.

2006-09-15 Thread Mauro Tortonesi

Tate Mitchell ha scritto:

If anyone could show me how to do this on the wget gui, that would be
appreciated, too.

http://www.jensroesner.de/wgetgui/


wget and wgetgui are related programs, but they are developed by two 
different teams. you should ask the wgetgui authors this question.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: REST - error for files bigger than 4GB

2006-09-15 Thread Mauro Tortonesi

Steven M. Schweda ha scritto:


   Are you certain that the FTP _server_ can handle file offsets greater
than 4GB in the REST command?


i agree with steven here. it's very likely to be a server-side problem.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget 1.11 beta 1 released

2006-09-15 Thread Mauro Tortonesi

Oliver Schulze L. ha scritto:

Does this version have the connection cache code?


no, not yet. i have some preliminary code for connection caching, but i 
am not going to finish it and merge it into the trunk before wget 1.11 
is released.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: timestamp and backup

2006-09-15 Thread Mauro Tortonesi

Olav Mørkrid ha scritto:

hi

let's say i fetch 10 files from a server with wget.

then i want to download any modifications to these files.

HOWEVER, if a new version of a file is downloaded, i want a backup of
the old file (e.g. write to .bak, or possibly .001
and .002) to keep a record of all versions of a file.

can wget do this?


yes. if file X is already present in your filesystem, by default wget 
downloads the new file and saves it as "X.1".
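for instance (example.com is a placeholder), running the same retrieval twice:

```shell
# First run saves index.html; the second, without -N or -nc, keeps the
# old copy and writes the new one as index.html.1 (then .2, .3, ...).
wget http://example.com/index.html
wget http://example.com/index.html
```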



i tried to combine -N with -nc, which would seem logical (do timestamp
checking, and prevent overwriting), but wget protests that they are
mutually exclusive.

and if i use no options, then wget fetches a new file even though it's
not updated.


you should not use -nc, just -N.
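since -N and -nc can't be combined, one possible workaround (a sketch, with placeholder file names and URLs) is a small wrapper that snapshots each file before letting wget -N overwrite it:

```shell
# Back up the current copy, then let wget -N fetch only if the remote
# file is newer; the .bak always holds the previous local version.
for f in file1 file2; do
    [ -f "$f" ] && cp -p "$f" "$f.bak"
    wget -N "http://example.com/$f"
done
```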

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Feature suggestion: change detection for "wget -c"

2006-09-15 Thread Mauro Tortonesi

John McCabe-Dansted ha scritto:

"Wget has no way of verifying that the local file is
  really a valid prefix of the remote file"

Couldn't wget redownload the last 4 bytes (or so) of the file?

For a few bytes per file we could detect changes to almost all
compressed files and the majority of uncompressed files.


reliable detection of changes in the resource to be downloaded would be 
a very interesting feature. but do you really think that checking the 
last X (< 100) bytes would be enough to be reasonably sure the resource 
was (not) modified? what about resources which are updated by appending 
information, such as log files?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: -P ignored by parse_content_disposition

2006-09-15 Thread Mauro Tortonesi

Ashley Bone ha scritto:

When wget determines the local filename from Content-Disposition,
the -P (--directory-prefix) option is ignored.  The file is always
downloaded to the current directory.  Looking at
parse_content_disposition(), I think this may be by design.  Does
anyone know for sure?


no, it's clearly a bug.


If not, I can submit a patch.


yes, please do it if you can.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget: ignores Content-Disposition header

2006-09-15 Thread Jochen Roderburg

Noèl Köthe schrieb:

Hello,

I can reproduce the following with 1.10.2 and 1.11.beta1:

Wget ignores Content-Disposition header described in RFC 2616,
19.5.1 Content-Disposition.

an example URL is:

http://bugs.debian.org/cgi-bin/bugreport.cgi/%252Ftmp%252Fupdate-grub.patch?bug=168715;msg=5;att=1




Sorry, I don't see any Content-Disposition header in this example URL  ;-)

Result of a HEAD request:

200 OK
Connection: close
Date: Fri, 15 Sep 2006 12:58:14 GMT
Server: Apache/1.3.33 (Debian GNU/Linux)
Content-Type: text/html; charset=utf-8
Last-Modified: Mon, 04 Aug 2003 21:18:10 GMT
Client-Date: Fri, 15 Sep 2006 12:58:14 GMT
Client-Response-Num: 1


My own experience is that the 1.11 alpha/beta versions (where this 
feature was introduced) worked fine with the examples I encountered.


Best regards,

Jochen Roderburg
ZAIK/RRZK
University of Cologne
Robert-Koch-Str. 10 Tel.:   +49-221/478-7024
D-50931 Koeln   E-Mail: [EMAIL PROTECTED]
Germany



wget: ignores Content-Disposition header

2006-09-15 Thread Noèl Köthe
Hello,

I can reproduce the following with 1.10.2 and 1.11.beta1:

Wget ignores Content-Disposition header described in RFC 2616,
19.5.1 Content-Disposition.

an example URL is:

http://bugs.debian.org/cgi-bin/bugreport.cgi/%252Ftmp%252Fupdate-grub.patch?bug=168715;msg=5;att=1


-- 
Noèl Köthe 
Debian GNU/Linux, www.debian.org

