Re: OT: wget bug
On Sat, 18 Jul 2009, Andrew Brampton wrote:

  Date: Sat, 18 Jul 2009 18:09:54 +0100
  From: Andrew Brampton <brampton+free...@gmail.com>
  To: Joe R. Jah <j...@cloud.ccsf.cc.ca.us>
  Cc: freebsd-questions@freebsd.org
  Subject: Re: OT: wget bug

> 2009/7/18 Joe R. Jah <j...@cloud.ccsf.cc.ca.us>:
>> Thank you Andrew. Yes, the server is truly returning 401. I have
>> already reconfigured wget to download everything regardless of
>> timestamps, but that wastes bandwidth, because most of the site is
>> unchanged. Do you know of any workaround in wget, or an alternative
>> tool that downloads ONLY newer files over HTTP?
>
> Joe,
> There are two ways to check whether a file has changed: one, read the
> time the file was last changed, or two, read the file and compare it
> with an old copy. Wget was obviously trying option 1, but that is
> denied by the remote server. You could most likely get it to do
> option 2; however, you would then be wasting bandwidth downloading
> unchanged files just to check whether they had changed.
>
> If you have control over the remote webserver, the simplest way to
> solve this problem is to configure the webserver not to return 401
> when wget sends the If-Modified-Since HTTP header. A better solution,
> again assuming you have control of the remote server, is to use rsync,
> as it is designed for this kind of task.
>
> If you don't have control over the remote server, then you are stuck
> with your current solution.
>
> Andrew

Thank you Andrew.

Regards,

Joe
--
ahj...@cloud.ccsf.cc.ca.us
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org
Re: OT: wget bug
On Sat, 18 Jul 2009, Karl Vogel wrote:

  Date: Sat, 18 Jul 2009 19:34:24 -0400 (EDT)
  From: Karl Vogel <vogelke+u...@pobox.com>
  To: freebsd-questions@freebsd.org
  Subject: Re: OT: wget bug

> On Sat, 18 Jul 2009 09:41:00 -0700 (PDT), Joe R. Jah
> <j...@cloud.ccsf.cc.ca.us> said:
>
> J> Do you know of any workaround in wget, or an alternative tool to
> J> ONLY download newer files by http?
>
> curl can help with things like this. For example, if you're getting
> just a few files, fetch only the header and check the Last-Modified
> date:
>
>   me% curl -I http://curl.haxx.se/docs/manual.html
>   HTTP/1.1 200 OK
>   Proxy-Connection: Keep-Alive
>   Connection: Keep-Alive
>   Date: Sat, 18 Jul 2009 23:24:24 GMT
>   Server: Apache/2.2.3 (Debian) mod_python/3.2.10 Python/2.4.4
>   Last-Modified: Mon, 20 Apr 2009 17:46:02 GMT
>   ETag: 5d63c-b2c5-1a936a80
>   Accept-Ranges: bytes
>   Content-Length: 45765
>   Content-Type: text/html; charset=ISO-8859-1
>
> You can download a file only if the remote one is newer than a local
> copy:
>
>   me% curl -z local.html http://remote.server.com/remote.html
>
> Or download the file only if it was updated since Jan 12, 2009:
>
>   me% curl -z "Jan 12 2009" http://remote.server.com/remote.html
>
> Curl tries to use persistent connections for transfers, so put as many
> URLs on the same command line as you can if you're looking to mirror a
> site. I don't know how to make curl do something like walking a
> directory for a recursive download. You can get the source at
> http://curl.haxx.se/download.html

Thank you Karl. I already have curl installed, but I don't believe it can
fetch an entire website given only the base URL.

Regards,

Joe
Re: OT: wget bug
2009/7/17 Joe R. Jah <j...@cloud.ccsf.cc.ca.us>:
> Hello all,
> I want to wget a site at regular intervals and only get the updated
> pages, so I use this wget command line:
>
>   wget -b -m -nH http://host.domain/Directory/file.html
>
> It works fine on the first try, but it fails on subsequent tries with
> the following error message:
>
> --8<--
> Connecting to host.domain ... connected.
> HTTP request sent, awaiting response... 401 Unauthorized
> Authorization failed.
> --8<--

This looks to me like the remote server is replying with 401. Perhaps
wget is sending the If-Modified-Since HTTP header and the remote server
does not support it. I would confirm this by running tcpdump (or
wireshark) to sniff the traffic and see what the remote server is
replying with.

If the remote server is truly returning 401, then you might need to
either use an alternative tool or configure wget differently.

Hope this helps
Andrew
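[Editorial note: the failure mode Andrew hypothesizes, a server that answers 401 whenever a request carries If-Modified-Since, can be reproduced offline with a throwaway local HTTP server. This is a minimal illustrative sketch, not the actual remote host; the path and handler names are made up.]

```python
# Offline sketch of the suspected behaviour: a (hypothetical, misbehaving)
# server that answers 401 to any GET carrying If-Modified-Since, and 200
# to a plain GET -- which is why wget's first run works and later runs fail.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class PickyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reject conditional requests outright -- the suspected behaviour.
        status = 401 if "If-Modified-Since" in self.headers else 200
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), PickyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def fetch(headers):
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    conn.request("GET", "/Directory/file.html", headers=headers)
    status = conn.getresponse().status
    conn.close()
    return status

plain_status = fetch({})            # first mirror run: no local timestamps
conditional_status = fetch(         # later runs: wget adds a timestamp check
    {"If-Modified-Since": "Fri, 17 Jul 2009 00:00:00 GMT"})
print(plain_status, conditional_status)  # -> 200 401
server.shutdown()
```

A well-behaved server would answer the conditional request with 304 Not Modified (or 200 with the new content), never 401.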
Re: OT: wget bug
On Sat, 18 Jul 2009, Andrew Brampton wrote:

  Date: Sat, 18 Jul 2009 12:52:07 +0100
  From: Andrew Brampton <brampton+free...@gmail.com>
  To: Joe R. Jah <j...@cloud.ccsf.cc.ca.us>
  Cc: freebsd-questions@freebsd.org
  Subject: Re: OT: wget bug

> 2009/7/17 Joe R. Jah <j...@cloud.ccsf.cc.ca.us>:
>> Hello all,
>> I want to wget a site at regular intervals and only get the updated
>> pages, so I use this wget command line:
>>
>>   wget -b -m -nH http://host.domain/Directory/file.html
>>
>> It works fine on the first try, but it fails on subsequent tries with
>> the following error message:
>>
>> --8<--
>> Connecting to host.domain ... connected.
>> HTTP request sent, awaiting response... 401 Unauthorized
>> Authorization failed.
>> --8<--
>
> This looks to me like the remote server is replying with 401. Perhaps
> wget is sending the If-Modified-Since HTTP header and the remote server
> does not support it. I would confirm this by running tcpdump (or
> wireshark) to sniff the traffic and see what the remote server is
> replying with.
>
> If the remote server is truly returning 401, then you might need to
> either use an alternative tool or configure wget differently.
>
> Hope this helps
> Andrew

Thank you Andrew. Yes, the server is truly returning 401. I have already
reconfigured wget to download everything regardless of timestamps, but
that wastes bandwidth, because most of the site is unchanged. Do you know
of any workaround in wget, or an alternative tool that downloads ONLY
newer files over HTTP?

Regards,

Joe
Re: OT: wget bug
2009/7/18 Joe R. Jah <j...@cloud.ccsf.cc.ca.us>:
> Thank you Andrew. Yes, the server is truly returning 401. I have
> already reconfigured wget to download everything regardless of
> timestamps, but that wastes bandwidth, because most of the site is
> unchanged. Do you know of any workaround in wget, or an alternative
> tool that downloads ONLY newer files over HTTP?

Joe,
There are two ways to check whether a file has changed: one, read the
time the file was last changed, or two, read the file and compare it
with an old copy. Wget was obviously trying option 1, but that is denied
by the remote server. You could most likely get it to do option 2;
however, you would then be wasting bandwidth downloading unchanged files
just to check whether they had changed.

If you have control over the remote webserver, the simplest way to solve
this problem is to configure the webserver not to return 401 when wget
sends the If-Modified-Since HTTP header. A better solution, again
assuming you have control of the remote server, is to use rsync, as it
is designed for this kind of task.

If you don't have control over the remote server, then you are stuck
with your current solution.

Andrew
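[Editorial note: Andrew's "option 2" can be sketched in a few lines of Python. The download still costs bandwidth, exactly as he warns; all it saves is rewriting unchanged local files (which also preserves their mtimes). The function name is made up for illustration.]

```python
# Sketch of "option 2": fetch the page anyway, then compare the fresh
# bytes with the old copy (via a hash) and only overwrite on a change.
import hashlib
import pathlib

def save_if_changed(new_bytes: bytes, local: pathlib.Path) -> bool:
    """Overwrite `local` only when the content differs; return True if written."""
    if local.exists():
        old_digest = hashlib.sha256(local.read_bytes()).digest()
        if old_digest == hashlib.sha256(new_bytes).digest():
            return False   # identical: leave the old copy (and mtime) alone
    local.write_bytes(new_bytes)
    return True
```

In a mirror loop this would wrap each page fetch (e.g. urllib.request.urlopen(...).read()); the URL handling is omitted since it depends on the site being mirrored.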
Re: OT: wget bug
On Sat, 18 Jul 2009 09:41:00 -0700 (PDT), Joe R. Jah
<j...@cloud.ccsf.cc.ca.us> said:

J> Do you know of any workaround in wget, or an alternative tool to ONLY
J> download newer files by http?

curl can help with things like this. For example, if you're getting just
a few files, fetch only the header and check the Last-Modified date:

  me% curl -I http://curl.haxx.se/docs/manual.html
  HTTP/1.1 200 OK
  Proxy-Connection: Keep-Alive
  Connection: Keep-Alive
  Date: Sat, 18 Jul 2009 23:24:24 GMT
  Server: Apache/2.2.3 (Debian) mod_python/3.2.10 Python/2.4.4
  Last-Modified: Mon, 20 Apr 2009 17:46:02 GMT
  ETag: 5d63c-b2c5-1a936a80
  Accept-Ranges: bytes
  Content-Length: 45765
  Content-Type: text/html; charset=ISO-8859-1

You can download a file only if the remote one is newer than a local copy:

  me% curl -z local.html http://remote.server.com/remote.html

Or download the file only if it was updated since Jan 12, 2009:

  me% curl -z "Jan 12 2009" http://remote.server.com/remote.html

Curl tries to use persistent connections for transfers, so put as many
URLs on the same command line as you can if you're looking to mirror a
site. I don't know how to make curl do something like walking a directory
for a recursive download. You can get the source at
http://curl.haxx.se/download.html

-- 
Karl Vogel
I don't speak for the USAF or my company

If lawyers are disbarred and clergymen defrocked, doesn't it follow that
electricians can be delighted, musicians denoted, cowboys deranged,
models deposed, tree surgeons debarked and dry cleaners depressed?
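[Editorial note: Karl's header-checking idea reduces to one comparison: parse the Last-Modified date out of the response headers and test it against the local file's mtime. A minimal sketch of just that decision follows; the HEAD request itself (e.g. `curl -I`) is left out, and curl's own `-z` rule may differ in detail.]

```python
# Decide whether to re-download by comparing the server's Last-Modified
# header (as shown by `curl -I`) against the local copy's mtime.
import os
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def remote_is_newer(last_modified: str, local_path: str) -> bool:
    if not os.path.exists(local_path):
        return True                      # no local copy yet: fetch it
    remote = parsedate_to_datetime(last_modified)
    local = datetime.fromtimestamp(os.path.getmtime(local_path),
                                   tz=timezone.utc)
    return remote > local
```

For the headers in Karl's example, `remote_is_newer("Mon, 20 Apr 2009 17:46:02 GMT", "manual.html")` would return True only if the local copy is missing or older than the remote page.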
OT: wget bug
Hello all,

I want to wget a site at regular intervals and only get the updated
pages, so I use this wget command line:

  wget -b -m -nH http://host.domain/Directory/file.html

It works fine on the first try, but it fails on subsequent tries with the
following error message:

--8<--
Connecting to host.domain ... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Authorization failed.
--8<--

I can change the directory from which I run wget every time, but that
defeats the purpose of downloading only the changed files. I googled
"wget fails on second try" and found this small patch in a Linux group
that should supposedly fix the problem:

--8<--
--- wget-1.10.2/src/ftp.c.cwd   2006-12-03 13:23:08.801467652 +0100
+++ wget-1.10.2/src/ftp.c       2006-12-03 20:30:24.641876672 +0100
@@ -1172,7 +1172,7 @@
       len = 0;
       err = getftp (u, &len, restval, con);
 
-      if (con->csock != -1)
+      if (con->csock == -1)
         con->st &= ~DONE_CWD;
       else
         con->st |= DONE_CWD;
--8<--

My wget is the latest version in the ports, 1.11.4. Any ideas or advice
are greatly appreciated.

Regards,

Joe
Re: OT: wget bug
On Friday 17 July 2009 06:12:33 pm Joe R. Jah wrote:
> I want to wget a site at regular intervals and only get the updated
> pages, so I use this wget command line:
>
>   wget -b -m -nH http://host.domain/Directory/file.html
>
> It works fine on the first try, but it fails on subsequent tries with
> the following error message:
>
> --8<--
> Connecting to host.domain ... connected.
> HTTP request sent, awaiting response... 401 Unauthorized
> Authorization failed.
> --8<--
>
> I can change the directory from which I run wget every time, but that
> defeats the purpose of downloading only the changed files. I googled
> "wget fails on second try" and found this small patch in a Linux group
> that should supposedly fix the problem:
>
> --8<--
> --- wget-1.10.2/src/ftp.c.cwd   2006-12-03 13:23:08.801467652 +0100
> +++ wget-1.10.2/src/ftp.c       2006-12-03 20:30:24.641876672 +0100
> @@ -1172,7 +1172,7 @@
>        len = 0;
>        err = getftp (u, &len, restval, con);
>
> -      if (con->csock != -1)
> +      if (con->csock == -1)
>          con->st &= ~DONE_CWD;
>        else
>          con->st |= DONE_CWD;
> --8<--
>
> My wget is the latest version in the ports, 1.11.4. Any ideas or advice
> are greatly appreciated.

I can't tell if your patch has already been applied upstream or if it's a
reverse patch. The current distfile matches the "+++" version at line
1185. (Normally the "+++" file is the new version, but it's easy to get
the order reversed if you're not used to running diff.)

You could always just try the patch. Something along these lines:

  cd /usr/ports/ftp/wget
  make clean
  make patch        # extract the distfiles and apply FreeBSD patches
  cd work/wget-1.11.4/src
  vi ftp.c          # or any editor you like
  ...go to line 1185 and change == to !=
  ...save and quit the editor
  cd /usr/ports/ftp/wget
  make
  make deinstall
  make reinstall

...then try your procedure again. If you don't like the results, a
"make clean" will erase your (modified) work directory and you can build
the original version again.

JN
Re: OT: wget bug
On Fri, 17 Jul 2009, John Nielsen wrote:

  Date: Fri, 17 Jul 2009 18:52:46 -0400
  From: John Nielsen <li...@jnielsen.net>
  To: freebsd-questions@freebsd.org
  Cc: Joe R. Jah <j...@cloud.ccsf.cc.ca.us>
  Subject: Re: OT: wget bug

> On Friday 17 July 2009 06:12:33 pm Joe R. Jah wrote:
>> I want to wget a site at regular intervals and only get the updated
>> pages, so I use this wget command line:
>>
>>   wget -b -m -nH http://host.domain/Directory/file.html
>>
>> It works fine on the first try, but it fails on subsequent tries with
>> the following error message:
>>
>> --8<--
>> Connecting to host.domain ... connected.
>> HTTP request sent, awaiting response... 401 Unauthorized
>> Authorization failed.
>> --8<--
>>
>> I can change the directory from which I run wget every time, but that
>> defeats the purpose of downloading only the changed files. I googled
>> "wget fails on second try" and found this small patch in a Linux group
>> that should supposedly fix the problem:
>>
>> --8<--
>> --- wget-1.10.2/src/ftp.c.cwd   2006-12-03 13:23:08.801467652 +0100
>> +++ wget-1.10.2/src/ftp.c       2006-12-03 20:30:24.641876672 +0100
>> @@ -1172,7 +1172,7 @@
>>        len = 0;
>>        err = getftp (u, &len, restval, con);
>>
>> -      if (con->csock != -1)
>> +      if (con->csock == -1)
>>          con->st &= ~DONE_CWD;
>>        else
>>          con->st |= DONE_CWD;
>> --8<--
>>
>> My wget is the latest version in the ports, 1.11.4. Any ideas or
>> advice are greatly appreciated.
>
> I can't tell if your patch has already been applied upstream or if it's
> a reverse patch. The current distfile matches the "+++" version at line
> 1185. (Normally the "+++" file is the new version, but it's easy to get
> the order reversed if you're not used to running diff.)
>
> You could always just try the patch. Something along these lines:
>
>   cd /usr/ports/ftp/wget
>   make clean
>   make patch        # extract the distfiles and apply FreeBSD patches
>   cd work/wget-1.11.4/src
>   vi ftp.c          # or any editor you like
>   ...go to line 1185 and change == to !=
>   ...save and quit the editor
>   cd /usr/ports/ftp/wget
>   make
>   make deinstall
>   make reinstall
>
> ...then try your procedure again. If you don't like the results, a
> "make clean" will erase your (modified) work directory and you can
> build the original version again.
>
> JN

Thank you John. That was a simple procedure, but unfortunately the patch
did not fix the problem.

Regards,

Joe