Re: new wget bug when doing incremental backup of very large site
From dev: I checked, and the .wgetrc file has continue=on. Is there any way to suppress the sending of the byte-range request? I will read through the email and see if I can gather some more information that may be needed.

Remove continue=on from .wgetrc? Consider:

  -N, --timestamping    don't re-retrieve files unless newer than local.

Steven M. Schweda                   [EMAIL PROTECTED]
382 South Warwick Street            (+1) 651-699-9818
Saint Paul MN 55105-2547
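A sketch of what the suggested .wgetrc change might look like (hedged; option names should be checked against your wget version's documentation):

```
# $HOME/.wgetrc sketch
# continue=on      <- removed: this is what makes wget send byte-range requests
timestamping=on    # same as -N: re-retrieve only if the remote copy is newer
```

If the real goal is to skip any file that already exists locally, regardless of date or size, `-nc` / `--no-clobber` (or `noclobber=on` in .wgetrc) is closer to that behavior; note that in the versions I've used, wget refuses to combine it with `-c` or `-N`.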
new wget bug when doing incremental backup of very large site
I was running wget to test mirroring an internal development site, using large database dumps (binary format) as part of the content to provide a large number of binary files for the test. I wanted to see whether wget could download 500K files with 100GB of total data transferred. The test was going fine: wget ran flawlessly for three days, downloading almost the entire contents of the test site, and I was at 85GB. It would have run to the very end and passed the test, downloading all 100GB, but then a power outage occurred. My local test box was not on battery backup, so I had to restart wget and the test.

On the restart, wget did not refetch the binary backup files and gave (for each file that had already been retrieved) the following message:

  => `domain/database/dbdump_107899.gz'
  Connecting to domain|ip|:80... connected.
  HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable
  The file is already fully retrieved; nothing to do.

wget continued to run for about eight hours, gave the above message on several thousand files, then crashed with:

  wget: realloc: Failed to allocate 536870912 bytes; memory exhausted.

This was surprising, because wget ran flawlessly on the initial download for several days, yet on a refresh or incremental backup of the same data it crashed after eight hours. I believe it has something to do with the code path taken when wget finds a local file with the same name and sends a range request. Perhaps some data structure keeps growing until it exhausts the memory on my test box, which has 2GB. There were no other programs running on the test box. This may be a bug.

To work around this for the purposes of my test, I would like to know if there is any way (any switch) to tell wget not to send a range request at all when the local filename exists -- to skip sending any request whatsoever if it finds a file with the same name.
I do not want it to check whether the file is newer or whether the file is complete; just skip it and go on to the next file.

I was running wget under Cygwin on a Windows XP box. The command that I ran was the following:

  wget -m -l inf --convert-links --page-requisites http://domain

I had the following $HOME/.wgetrc file:

  #backup_converted=on
  page_requisites=on
  continue=on
  dirstruct=on
  #mirror=on
  #noclobber=on
  #recursive=on
  wait=3
  http_user=username
  http_passwd=passwd
  #convert_links=on
  verbose=on
  user_agent=firefox
  dot_style=binary
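For context on the 416 above: per the HTTP range-request semantics (RFC 7233), a Range whose start offset is at or past the end of the resource is unsatisfiable, which is exactly what a continue=on re-fetch of an already-complete file triggers. A minimal self-contained sketch with a toy local server (not wget itself, and not necessarily how the site's server behaves):

```python
import threading
import urllib.request
import urllib.error
from http.server import BaseHTTPRequestHandler, HTTPServer

CONTENT = b"x" * 100  # stand-in for a fully mirrored 100-byte file

class RangeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        rng = self.headers.get("Range")
        if rng and rng.startswith("bytes="):
            start = int(rng[len("bytes="):].split("-")[0])
            if start >= len(CONTENT):
                # Range begins at or past end of file -> 416
                self.send_response(416)
                self.send_header("Content-Range", "bytes */%d" % len(CONTENT))
                self.end_headers()
                return
            body = CONTENT[start:]
            self.send_response(206)
        else:
            body = CONTENT
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Resuming an already-complete 100-byte file sends "Range: bytes=100-"
req = urllib.request.Request(
    "http://127.0.0.1:%d/dump.gz" % port,
    headers={"Range": "bytes=%d-" % len(CONTENT)},
)
try:
    urllib.request.urlopen(req)
    status = 200
except urllib.error.HTTPError as e:
    status = e.code
print(status)  # 416: "the file is already fully retrieved"
server.shutdown()
```

So the 416 itself is the expected, harmless outcome for a complete file; the crash that followed is the separate problem.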
Re: new wget bug when doing incremental backup of very large site
1. It would help to know the wget version (wget -V).

2. It might help to see some output when you add -d to the wget command line. (One existing file should be enough.) It's not immediately clear whose fault the 416 error is. It might also help to know which Web server is running on the server, and how big the file is which you're trying to re-fetch.

> This was surprising [...]

You're easily surprised.

> wget: realloc: Failed to allocate 536870912 bytes; memory exhausted.

500MB sounds to me like a lot.

> [...] it exhausts the memory on my test box which has 2GB.

A "memory exhausted" complaint here probably refers to virtual memory, not physical memory.

> [...] I do not want it to check to see if the file is newer, if the
> file is complete, just skip it and go on to the next file.

I haven't checked the code, but with continue=on I'd expect wget to check the size and date together, and not download any real data if the size checks out and the local file date is later. The 416 error suggests that it's trying to do a partial (byte-range) download, and failing because either it's sending a bad byte range or the server is misinterpreting a good one. Adding -d should show what wget thinks it's sending. Knowing that, and the actual file size, might show a problem. If the -d output looks reasonable, the fault may lie with the server, and an actual URL may be needed to pursue the diagnosis from there. The memory allocation failure could be a bug, but finding it could be difficult.

Steven M. Schweda                   [EMAIL PROTECTED]
382 South Warwick Street            (+1) 651-699-9818
Saint Paul MN 55105-2547
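One observation on the 536870912-byte figure: it is exactly 2**29, which is consistent with (though by no means proof of) a buffer that doubles on each realloc and is never reset between files. That is a hypothesis about the crash, not a reading of the wget source. The arithmetic:

```python
# 536870912 == 2**29: the size a doubling buffer reaches after
# 29 doublings starting from 1 byte (or 10 doublings from 512 KB).
size = 1
doublings = 0
while size < 536_870_912:
    size *= 2
    doublings += 1
print(size, doublings)  # 536870912 29
```

A power-of-two allocation failure like this often points at geometric buffer growth somewhere, which would fit the "some data structure keeps growing" guess in the original report.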