Re: new wget bug when doing incremental backup of very large site

2006-10-21 Thread Steven M. Schweda
From dev:

 I checked and the .wgetrc file has continue=on. Is there any way to
 suppress the sending of the byte-range request? I will read through the
 email and see if I can gather some more information that may be needed.

   Remove continue=on from .wgetrc?

   Consider:

  -N,  --timestamping            don't re-retrieve files unless newer than
                                 local.
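
   For example (an untested sketch; http://domain is the placeholder
from the original report), dropping the continue line from .wgetrc and
relying on the timestamping which -m already implies would look
something like this:

      # In $HOME/.wgetrc, comment out (or delete) the continue line:
      #continue=on

      # Then re-run the same command; -m implies -N (timestamping):
      wget -m -l inf --convert-links --page-requisites http://domain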



   Steven M. Schweda               [EMAIL PROTECTED]
   382 South Warwick Street        (+1) 651-699-9818
   Saint Paul  MN  55105-2547


new wget bug when doing incremental backup of very large site

2006-10-15 Thread dev
I was running wget to test mirroring an internal development site, using
large binary database dumps as part of the content so that the test had a
large number of binary files.  I wanted to see whether wget would run to
completion and download roughly 500K files, about 100GB of data in total.


The test was going fine: wget ran flawlessly for three days, downloading
almost the entire contents of the test site, and I was at 85GB.  It would
have run to the very end and passed the test by downloading all 100GB of
the test files.


Then a power outage occurred; my local test box was not on battery
backup, so I had to restart wget and the test.  wget did not refetch the
binary backup files, and gave the following message for each file that
had already been retrieved:


-
   => `domain/database/dbdump_107899.gz'
Connecting to domain|ip|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

   The file is already fully retrieved; nothing to do.
-

wget continued to run for about eight hours and gave the above message
for several thousand files, then crashed with:

   wget: realloc: Failed to allocate 536870912 bytes; memory exhausted.


This was surprising because wget ran flawlessly on the initial download
for several days, but on a refresh (incremental backup) of the data it
crashed after eight hours.

I believe it has something to do with the code path that runs when wget
finds an existing local file with the same name and sends a range
request.  Perhaps some data structure keeps growing until it exhausts
the memory on my test box, which has 2GB.  No other programs were
running on the test box.


This may be a bug.  To work around it for the purposes of my test, I
would like to know whether there is any way (any switch) to tell wget
not to send a range request of any kind when a local file with the same
name exists, but simply to skip sending any request at all for that
file.  I do not want it to check whether the file is newer or whether it
is complete; it should just skip it and go on to the next file.
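
For what it is worth, one hedged, untested possibility: wget's
-nc/--no-clobber option is supposed to skip a download when the local
file already exists, but it cannot be combined with the timestamping
that -m implies, so the -m shortcut would have to be spelled out and
continue=on removed from .wgetrc, roughly:

   # assumes continue=on has been removed from $HOME/.wgetrc
   wget -r -l inf -nc --convert-links --page-requisites http://domain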



I was running wget under cygwin on a Windows XP box.

The wget command that I ran was the following:
   wget -m -l inf --convert-links --page-requisites http://domain

I had the following .wgetrc file ($HOME/.wgetrc):
#backup_converted=on
page_requisites=on
continue=on
dirstruct=on
#mirror=on
#noclobber=on
#recursive=on
wait=3
http_user=username
http_passwd=passwd
#convert_links=on
verbose=on
user_agent=firefox
dot_style=binary


Re: new wget bug when doing incremental backup of very large site

2006-10-15 Thread Steven M. Schweda
   1. It would help to know the wget version (wget -V).

   2. It might help to see some output when you add -d to the wget
command line.  (One existing file should be enough.)  It's not
immediately clear whose fault the 416 error is.  It might also help to
know which Web server is running on the server, and how big the file is
which you're trying to re-fetch.
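
   Something like this would capture most of that for one
already-downloaded file (untested; the URL is just the placeholder path
from your log excerpt, and -c stands in for the continue=on setting in
.wgetrc).  wget writes its -d output to stderr, and the Server: response
header in that output should also identify the Web server:

      wget -V
      wget -d -c 'http://domain/database/dbdump_107899.gz' 2> wget-debug.log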

 This was surprising [...]

   You're easily surprised.

 wget: realloc: Failed to allocate 536870912 bytes; memory exhausted.

   500MB (536870912 bytes is exactly 2^29, or 512MB) sounds to me like a lot.

 [...] it exhausts the memory on my test box which has 2GB.

   A memory exhausted complaint here probably refers to virtual
memory, not physical memory.

 [...] I do not want it to check to see if the file is
 newer, if the file is complete, just skip it and go on to the next
 file.

   I haven't checked the code, but with continue=on, I'd expect wget
to check the size and date together, and not download any real data if
the size checks, and the local file date is later.  The 416 error
suggests that it's trying to do a partial (byte-range) download, and is
failing because either it's sending a bad byte range, or the server is
misinterpreting a good byte range.  Adding -d should show what wget
thinks that it's sending.  Knowing that and the actual file size might
show a problem.
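
   For illustration only (the Host value is the placeholder from the log
excerpt, and the byte offset is made up), the exchange behind that 416
presumably looks roughly like this, with the range starting at the size
of the local file:

      GET /database/dbdump_107899.gz HTTP/1.0
      Host: domain
      Range: bytes=104857600-

      HTTP/1.1 416 Requested Range Not Satisfiable

A server may legitimately answer 416 when the requested range starts at
or beyond the current length of the resource, which is exactly the
situation when the local copy is already complete, so the -d output
should show whether the offset wget sends really matches the local file
size.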

   If the -d output looks reasonable, the fault may lie with the
server, and an actual URL may be needed to pursue the diagnosis from
there.

   The memory allocation failure could be a bug, but finding it could be
difficult.



   Steven M. Schweda               [EMAIL PROTECTED]
   382 South Warwick Street        (+1) 651-699-9818
   Saint Paul  MN  55105-2547