Re: OT: wget bug

2009-07-19 Thread Joe R. Jah
On Sat, 18 Jul 2009, Andrew Brampton wrote:

 Date: Sat, 18 Jul 2009 18:09:54 +0100
 From: Andrew Brampton brampton+free...@gmail.com
 To: Joe R. Jah j...@cloud.ccsf.cc.ca.us
 Cc: freebsd-questions@freebsd.org
 Subject: Re: OT: wget bug

 2009/7/18 Joe R. Jah j...@cloud.ccsf.cc.ca.us:
  Thank you Andrew.  Yes, the server is truly returning 401.  I have
  already reconfigured wget to download everything regardless of
  timestamps, but that wastes bandwidth, because most of the site is
  unchanged.
 
  Do you know of any workaround in wget, or an alternative tool to ONLY
  download newer files by http?
 

 Joe,
 There are two ways to check whether a file has changed: one, read the
 time the file was last modified, or two, download the file and compare
 it to an old copy. Wget was obviously trying option 1, but this is
 denied by the remote server. You could most likely get it to do option
 2, but then you are wasting bandwidth downloading unchanged files just
 to check whether they have changed.

 If you have control over the remote webserver, then the simplest way
 to solve this problem is to configure the webserver not to return 401
 when wget sends the If-Modified-Since HTTP header. A better solution,
 again assuming you have control of the remote server, is to use
 rsync as it is designed for this kind of task.

 If you don't have control over the remote server, then you are stuck
 with your current solution.

 Andrew

Thank you Andrew.

Regards,

Joe
-- 
 _/   _/_/_/   _/  __o
 _/   _/   _/  _/ __ _-\,_
 _/  _/   _/_/_/   _/  _/ ..(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ahj...@cloud.ccsf.cc.ca.us
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org

Re: OT: wget bug

2009-07-19 Thread Joe R. Jah
On Sat, 18 Jul 2009, Karl Vogel wrote:

 Date: Sat, 18 Jul 2009 19:34:24 -0400 (EDT)
 From: Karl Vogel vogelke+u...@pobox.com
 To: freebsd-questions@freebsd.org
 Subject: Re: OT: wget bug

  On Sat, 18 Jul 2009 09:41:00 -0700 (PDT),
  Joe R. Jah j...@cloud.ccsf.cc.ca.us said:

 J Do you know of any workaround in wget, or an alternative tool to ONLY
 J download newer files by http?

curl can help for things like this.  For example, if you're getting
just a few files, fetch only the header and check the last-modified date:

   me% curl -I http://curl.haxx.se/docs/manual.html
   HTTP/1.1 200 OK
   Proxy-Connection: Keep-Alive
   Connection: Keep-Alive
   Date: Sat, 18 Jul 2009 23:24:24 GMT
   Server: Apache/2.2.3 (Debian) mod_python/3.2.10 Python/2.4.4
   Last-Modified: Mon, 20 Apr 2009 17:46:02 GMT
   ETag: 5d63c-b2c5-1a936a80
   Accept-Ranges: bytes
   Content-Length: 45765
   Content-Type: text/html; charset=ISO-8859-1

You can download files only if the remote one is newer than a local copy:

   me% curl -z local.html http://remote.server.com/remote.html

Or only download the file if it was updated since Jan 12, 2009:

   me% curl -z "Jan 12 2009" http://remote.server.com/remote.html

Curl tries to use persistent connections for transfers, so put as many
URLs on the same line as you can if you're looking to mirror a site.  I
don't know how to make curl do something like walking a directory for a
recursive download.

You can get the source at http://curl.haxx.se/download.html

Thank you Karl.  I already have curl installed, but I don't believe it
can mirror an entire website given only the base URL.

Regards,

Joe
-- 
 _/   _/_/_/   _/  __o
 _/   _/   _/  _/ __ _-\,_
 _/  _/   _/_/_/   _/  _/ ..(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ahj...@cloud.ccsf.cc.ca.us


Re: OT: wget bug

2009-07-18 Thread Andrew Brampton
2009/7/17 Joe R. Jah j...@cloud.ccsf.cc.ca.us:

 Hello all,

 I want to wget a site at regular intervals and only get the updated pages,
 so I use this wget command line:

 wget -b -m -nH http://host.domain/Directory/file.html

 It works fine on the first try, but it fails on subsequent tries with the
 following error message:

 --8--
 Connecting to host.domain ... connected.
 HTTP request sent, awaiting response... 401 Unauthorized
 Authorization failed.
 --8--

It seems to me that the remote server is replying with 401. Perhaps
wget is sending the If-Modified-Since HTTP header and the remote
server does not support it. I would confirm this by running tcpdump
(or Wireshark) to sniff the traffic and see exactly what the remote
server is replying with.

If the remote server is truly returning 401, then you might either
need to use an alternative tool, or configure wget differently.

Hope this helps
Andrew


Re: OT: wget bug

2009-07-18 Thread Joe R. Jah
On Sat, 18 Jul 2009, Andrew Brampton wrote:

 Date: Sat, 18 Jul 2009 12:52:07 +0100
 From: Andrew Brampton brampton+free...@gmail.com
 To: Joe R. Jah j...@cloud.ccsf.cc.ca.us
 Cc: freebsd-questions@freebsd.org
 Subject: Re: OT: wget bug

 2009/7/17 Joe R. Jah j...@cloud.ccsf.cc.ca.us:
 
  Hello all,
 
  I want to wget a site at regular intervals and only get the updated pages,
  so I use this wget command line:
 
  wget -b -m -nH http://host.domain/Directory/file.html
 
  It works fine on the first try, but it fails on subsequent tries with the
  following error message:
 
  --8--
  Connecting to host.domain ... connected.
  HTTP request sent, awaiting response... 401 Unauthorized
  Authorization failed.
  --8--

 It seems to me that the remote server is replying with 401. Perhaps
 wget is sending the If-Modified-Since HTTP header and the remote
 server does not support it. I would confirm this by running tcpdump
 (or Wireshark) to sniff the traffic and see exactly what the remote
 server is replying with.

 If the remote server is truly returning 401, then you might either
 need to use an alternative tool, or configure wget differently.

 Hope this helps
 Andrew

Thank you Andrew.  Yes, the server is truly returning 401.  I have
already reconfigured wget to download everything regardless of
timestamps, but that wastes bandwidth, because most of the site is
unchanged.

Do you know of any workaround in wget, or an alternative tool to ONLY
download newer files by http?

Regards,

Joe
-- 
 _/   _/_/_/   _/  __o
 _/   _/   _/  _/ __ _-\,_
 _/  _/   _/_/_/   _/  _/ ..(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ahj...@cloud.ccsf.cc.ca.us


Re: OT: wget bug

2009-07-18 Thread Andrew Brampton
2009/7/18 Joe R. Jah j...@cloud.ccsf.cc.ca.us:
 Thank you Andrew.  Yes, the server is truly returning 401.  I have
 already reconfigured wget to download everything regardless of
 timestamps, but that wastes bandwidth, because most of the site is
 unchanged.

 Do you know of any workaround in wget, or an alternative tool to ONLY
 download newer files by http?


Joe,
There are two ways to check whether a file has changed: one, read the
time the file was last modified, or two, download the file and compare
it to an old copy. Wget was obviously trying option 1, but this is
denied by the remote server. You could most likely get it to do option
2, but then you are wasting bandwidth downloading unchanged files just
to check whether they have changed.
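Option 2 above can at least avoid rewriting unchanged files to disk, even though the download cost is already paid. A minimal, hypothetical Python sketch, assuming a simple JSON hash cache of my own invention (nothing wget actually maintains):

```python
import hashlib
import json
import os

def changed_since_last_fetch(content, url, cache_path="hashes.json"):
    """Return True if `content` differs from what we recorded for `url`.

    Note Andrew's trade-off: the bandwidth to download `content` is
    already spent; this only detects whether the bytes changed.
    """
    digest = hashlib.sha256(content).hexdigest()
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if cache.get(url) == digest:
        return False  # same bytes as the last run
    cache[url] = digest  # record the new digest for next time
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return True
```

A mirroring script would call this on each fetched page and skip writing (and re-processing) anything that returns False.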

If you have control over the remote webserver, then the simplest way
to solve this problem is to configure the webserver not to return 401
when wget sends the If-Modified-Since HTTP header. A better solution,
again assuming you have control of the remote server, is to use
rsync as it is designed for this kind of task.

If you don't have control over the remote server, then you are stuck
with your current solution.

Andrew


Re: OT: wget bug

2009-07-18 Thread Karl Vogel
 On Sat, 18 Jul 2009 09:41:00 -0700 (PDT), 
 Joe R. Jah j...@cloud.ccsf.cc.ca.us said:

J Do you know of any workaround in wget, or an alternative tool to ONLY
J download newer files by http?

   curl can help for things like this.  For example, if you're getting
   just a few files, fetch only the header and check the last-modified date:

  me% curl -I http://curl.haxx.se/docs/manual.html
  HTTP/1.1 200 OK
  Proxy-Connection: Keep-Alive
  Connection: Keep-Alive
  Date: Sat, 18 Jul 2009 23:24:24 GMT
  Server: Apache/2.2.3 (Debian) mod_python/3.2.10 Python/2.4.4
  Last-Modified: Mon, 20 Apr 2009 17:46:02 GMT
  ETag: 5d63c-b2c5-1a936a80
  Accept-Ranges: bytes
  Content-Length: 45765
  Content-Type: text/html; charset=ISO-8859-1

   You can download files only if the remote one is newer than a local copy:

  me% curl -z local.html http://remote.server.com/remote.html

   Or only download the file if it was updated since Jan 12, 2009:

  me% curl -z "Jan 12 2009" http://remote.server.com/remote.html

   Curl tries to use persistent connections for transfers, so put as many
   URLs on the same line as you can if you're looking to mirror a site.  I
   don't know how to make curl do something like walking a directory for a
   recursive download.

   You can get the source at http://curl.haxx.se/download.html
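Under the hood, `-z local.html` amounts to a conditional request: curl turns the local file's mtime into an If-Modified-Since header, and the server answers 304 Not Modified if nothing is newer. A rough Python sketch of just the header-building step (an illustration of the idea, not curl's actual code):

```python
import email.utils
import os

def if_modified_since_value(local_path):
    """Format a local file's mtime as an HTTP-date (RFC 1123, GMT),
    the kind of value curl derives from the file named by -z,
    e.g. 'Mon, 20 Apr 2009 17:46:02 GMT'."""
    mtime = os.path.getmtime(local_path)
    return email.utils.formatdate(mtime, usegmt=True)
```

Sending that value as an If-Modified-Since request header is what makes the server skip unchanged files, provided the server honors conditional requests at all, which is exactly what failed in the original problem.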

-- 
Karl Vogel  I don't speak for the USAF or my company

If lawyers are disbarred and clergymen defrocked, doesn't it follow
that electricians can be delighted, musicians denoted, cowboys deranged,
models deposed, tree surgeons debarked and dry cleaners depressed?


OT: wget bug

2009-07-17 Thread Joe R. Jah
Hello all,

I want to wget a site at regular intervals and only get the updated pages,
so I use this wget command line:

wget -b -m -nH http://host.domain/Directory/file.html

It works fine on the first try, but it fails on subsequent tries with the
following error message:

--8--
Connecting to host.domain ... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Authorization failed.
--8--

I can change the directory from which I run wget every time, but that
defeats the purpose of downloading only the changed files.

I googled "wget fails on second try" and found this small patch in a
Linux group that supposedly fixes the problem:

--8--
--- wget-1.10.2/src/ftp.c.cwd   2006-12-03 13:23:08.801467652 +0100
+++ wget-1.10.2/src/ftp.c   2006-12-03 20:30:24.641876672 +0100
@@ -1172,7 +1172,7 @@
       len = 0;
       err = getftp (u, &len, restval, con);
 
-      if (con->csock != -1)
+      if (con->csock == -1)
         con->st &= ~DONE_CWD;
       else
         con->st |= DONE_CWD;
--8--
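For what it's worth, the patch only flips when wget clears its "CWD already done" state bit: if the FTP control socket is gone (csock == -1), the server-side working directory is lost and must be re-established on the next transfer. The bit logic, restated as a small Python stand-in (DONE_CWD's value here is arbitrary, not wget's real constant):

```python
DONE_CWD = 0x4  # arbitrary stand-in for wget's internal flag value

def update_cwd_state(st, csock):
    """Patched behaviour: losing the control socket clears DONE_CWD,
    forcing a fresh CWD on the next transfer; a live socket keeps it set."""
    if csock == -1:
        st &= ~DONE_CWD   # con->st &= ~DONE_CWD;
    else:
        st |= DONE_CWD    # con->st |= DONE_CWD;
    return st
```

Note the patch touches ftp.c, while the failure above is an HTTP 401, which may explain why it does not help here.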

My wget is the latest version in the ports, 1.11.4.

Any ideas or advice is greatly appreciated.

Regards,

Joe
-- 
 _/   _/_/_/   _/  __o
 _/   _/   _/  _/ __ _-\,_
 _/  _/   _/_/_/   _/  _/ ..(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ahj...@cloud.ccsf.cc.ca.us


Re: OT: wget bug

2009-07-17 Thread John Nielsen
On Friday 17 July 2009 06:12:33 pm Joe R. Jah wrote:
 I want to wget a site at regular intervals and only get the updated
 pages, so I use this wget command line:

 wget -b -m -nH http://host.domain/Directory/file.html

 It works fine on the first try, but it fails on subsequent tries with
 the following error message:

 --8--
 Connecting to host.domain ... connected.
 HTTP request sent, awaiting response... 401 Unauthorized
 Authorization failed.
 --8--

 I can change the directory from which I run wget every time, but that
 defeats the purpose of downloading only the changed files.

 I googled "wget fails on second try" and found this small patch in a
 Linux group that supposedly fixes the problem:

 --8--
 --- wget-1.10.2/src/ftp.c.cwd   2006-12-03 13:23:08.801467652 +0100
 +++ wget-1.10.2/src/ftp.c   2006-12-03 20:30:24.641876672 +0100
 @@ -1172,7 +1172,7 @@
        len = 0;
        err = getftp (u, &len, restval, con);
 
 -      if (con->csock != -1)
 +      if (con->csock == -1)
          con->st &= ~DONE_CWD;
        else
          con->st |= DONE_CWD;
 --8--

 My wget is the latest version in the ports, 1.11.4.

 Any ideas or advice is greatly appreciated.

I can't tell if your patch has already been applied upstream or if it's 
a reverse patch. The current distfile matches the +++ version at line 
1185. (Normally the +++ file is the new version, but it's easy to get 
the order reversed if you're not used to running diff.)

You could always just try the patch. Something along the lines of this:

cd /usr/ports/ftp/wget
make clean
make patch  #extract the distfiles and apply FreeBSD patches
cd work/wget-1.11.4/src
vi ftp.c#or any editor you like
  ...go to line 1185 and change == to !=
  ...save and quit the editor
cd /usr/ports/ftp/wget
make
make deinstall && make reinstall
  ... try your procedure again.

If you don't like the results, a make clean will erase your (modified) 
work directory and you can build the original version again.

JN


Re: OT: wget bug

2009-07-17 Thread Joe R. Jah
On Fri, 17 Jul 2009, John Nielsen wrote:

 Date: Fri, 17 Jul 2009 18:52:46 -0400
 From: John Nielsen li...@jnielsen.net
 To: freebsd-questions@freebsd.org
 Cc: Joe R. Jah j...@cloud.ccsf.cc.ca.us
 Subject: Re: OT: wget bug

 On Friday 17 July 2009 06:12:33 pm Joe R. Jah wrote:
  I want to wget a site at regular intervals and only get the updated
  pages, so I use this wget command line:
 
  wget -b -m -nH http://host.domain/Directory/file.html
 
  It works fine on the first try, but it fails on subsequent tries with
  the following error message:
 
  --8--
  Connecting to host.domain ... connected.
  HTTP request sent, awaiting response... 401 Unauthorized
  Authorization failed.
  --8--
 
  I can change the directory from which I run wget every time, but that
  defeats the purpose of downloading only the changed files.
 
  I googled "wget fails on second try" and found this small patch in a
  Linux group that supposedly fixes the problem:
 
  --8--
  --- wget-1.10.2/src/ftp.c.cwd   2006-12-03 13:23:08.801467652 +0100
  +++ wget-1.10.2/src/ftp.c   2006-12-03 20:30:24.641876672 +0100
  @@ -1172,7 +1172,7 @@
         len = 0;
         err = getftp (u, &len, restval, con);
  
  -      if (con->csock != -1)
  +      if (con->csock == -1)
           con->st &= ~DONE_CWD;
         else
           con->st |= DONE_CWD;
  --8--
 
  My wget is the latest version in the ports, 1.11.4.
 
  Any ideas or advice is greatly appreciated.

 I can't tell if your patch has already been applied upstream or if it's
 a reverse patch. The current distfile matches the +++ version at line
 1185. (Normally the +++ file is the new version, but it's easy to get
 the order reversed if you're not used to running diff.)

 You could always just try the patch. Something along the lines of this:

 cd /usr/ports/ftp/wget
 make clean
 make patch  #extract the distfiles and apply FreeBSD patches
 cd work/wget-1.11.4/src
 vi ftp.c  #or any editor you like
   ...go to line 1185 and change == to !=
   ...save and quit the editor
 cd /usr/ports/ftp/wget
 make
 make deinstall && make reinstall
   ... try your procedure again.

 If you don't like the results, a make clean will erase your (modified)
 work directory and you can build the original version again.

Thank you John.  That was a simple procedure, but unfortunately the patch
did not fix the problem.

Regards,

Joe
-- 
 _/   _/_/_/   _/  __o
 _/   _/   _/  _/ __ _-\,_
 _/  _/   _/_/_/   _/  _/ ..(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ahj...@cloud.ccsf.cc.ca.us