I was running wget to test mirroring an internal development site, using large binary database dumps as part of the content so that the site held a large number of binary files. For the test I wanted to see whether wget could download 500K files totalling 100GB of transferred data.

The test was going fine: wget ran flawlessly for 3 days, downloaded almost the entire contents of the test site, and was at 85GB, well on track to finish all 100GB of the test files.

Then a power outage occurred. My local test box was not on battery backup, so I had to restart wget and the test. On the restart wget did not refetch the binary backup files, and for each file that had already been retrieved it gave the following message:

-
               => `<domain>/database/dbdump_107899.gz'
Connecting to <domain>|<ip>|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

   The file is already fully retrieved; nothing to do.
-

---
wget continued to run for about eight hours, gave the above message for several thousand files, then crashed with:
   wget: realloc: Failed to allocate 536870912 bytes; memory exhausted.


This was surprising, because wget ran flawlessly for several days on the initial download, yet on a "refresh" (incremental) pass over the same data it crashed after eight hours. I believe it has something to do with the code path taken when wget finds an existing local file with the same name and sends a "Range" request. Perhaps some data structure on that path keeps growing until it exhausts memory; the failed allocation was 536870912 bytes (512MB) on a test box with 2GB of RAM and no other programs running.
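If I rerun the refresh, I plan to confirm the growth by sampling wget's resident set size while it churns through the 416 responses. A rough sketch (assumes a Unix-like ps, which cygwin provides; watch_rss is just a throwaway helper of mine):

```shell
# Throwaway helper: print a process's resident set size (in KB) every
# INTERVAL seconds until the process exits.  Steady growth during the
# "already fully retrieved; nothing to do" phase would support the
# leak theory.
watch_rss() {
    pid="$1"
    interval="${2:-60}"
    while kill -0 "$pid" 2>/dev/null; do
        ps -o rss= -p "$pid"   # one RSS sample per line
        sleep "$interval"
    done
}

# usage: start the mirror in the background, then watch it:
#   wget -m -l inf --convert-links --page-requisites http://<domain> &
#   watch_rss $!
```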

This may be a bug. To get around it for the purposes of my test, I would like to know whether there is any switch that tells wget, when a local file with the same name already exists, to skip sending any request at all (range or otherwise). I do not want it to check whether the file is newer or whether it is complete; it should just skip the file and go on to the next one.
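The closest thing I have found in the manual is --no-clobber, which, if I read it correctly, makes a recursive wget keep the existing local copy and skip re-downloading it. I have not verified how it interacts with my continue=on setting, so this is only a sketch of the invocation I would try:

```shell
# Untested: -nc (--no-clobber) should make wget skip files that already
# exist locally rather than send a Range request for them.  continue=on
# would presumably have to come out of .wgetrc first, since continuing a
# partial download and never clobbering look mutually exclusive; the
# manual also suggests -nc may not mix cleanly with --convert-links.
wget -m -l inf -nc --page-requisites http://<domain>
```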

----
I was running wget under cygwin on a Windows XP box.

The wget command that I ran was the following:
   wget -m -l inf --convert-links --page-requisites http://<domain>

I had the following $HOME/.wgetrc file:
#backup_converted=on
page_requisites=on
continue=on
dirstruct=on
#mirror=on
#noclobber=on
#recursive=on
wait=3
http_user=<username>
http_passwd=<passwd>
#convert_links=on
verbose=on
user_agent=firefox
dot_style=binary
