Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
Tim Ruehsen tim.rueh...@gmx.de writes:

> And SIGBUS could also occur for any other reason (e.g. real bugs in Wget).
>
> As was already said, replacing mmap by read would not crash
> (wget_read_file() reads as many bytes as there are, without checking the
> length of the file beforehand). But without additional logic, it might read
> random data (many processes writing into the file at the same time, not
> necessarily the same data). Wget would try to parse / change (-k) it, the
> result would be broken, but no error would be printed. So, replacing mmap
> is not a solution, but maybe part of a solution.
>
> Now to the possible solutions that come to my mind:
>
> 1. While downloading / writing data, Wget could build a checksum of the
>    file. It allows checking later when re-reading the file. In this case we
>    could really tell the user: hey, someone trashed our file while we were
>    working... To get this working, we must remove the mmap code.
>
> 2. Using tempfiles / tempdirs only and moving them to the right place.
>    That would bring in some kind of atomicity, though there are still
>    conflicts to solve (e.g. a second Wget instance is faster - should we
>    overwrite existing files / directories?).
>
> 3. Keeping html/css files in memory after downloading. These are the ones
>    we later re-read to parse them for links/URLs. Writing them to disk
>    after parsing, using a tempfile and a move/rename to have atomicity.
>
> 4. Using (advisory) file locks just helps against other Wget instances (is
>    that enough?). And with -k you have to keep the descriptor open for
>    each file until Wget is done with downloading everything. This is not
>    practical, since there could be (10-, 100-)thousands of files to be
>    downloaded.
>
> If someone would like to work on a patch, here is my opinion: I would
> implement 1. as the least complex to code (but it needs more CPU). Point 4
> would not work in all cases.
I don't think we should aim at supporting multiple instances of wget running on the same tree, but we can aim at having at least atomic operations per file.

That said, I think that using temp files is more than enough, and we shouldn't really care about possible conflicts; another instance of wget is just a separate entity that we should not consider. File locks could be implemented as an additional level of safety, enabled by an explicit argument, but I still don't see the point, since we don't support multiple processes running simultaneously on the same data.

Hopefully we will merge the parallel-wget branch soon, so we will have threads instead of processes :-)

Giuseppe
Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
----- Original Message -----
> Very well, if this would be possible. Right now I have no idea how to
> print something like the above.
>
> I made Tomas Hozza's test with valgrind and wget having debug info. I got
> 18x (out of 20x) SIGBUS, but at completely different places in the code.
> Within the misuse test situation, SIGBUS could occur at any place that
> accesses (reads or writes) memory allocated by wget_read_file() -
> absolutely randomly / unpredictably, if an outside process changes the
> file size and/or content at the same time. And SIGBUS could also occur for
> any other reason (e.g. real bugs in Wget).
>
> As was already said, replacing mmap by read would not crash, but without
> additional logic it might read random data. So, replacing mmap is not a
> solution, but maybe part of a solution.
>
> Now to the possible solutions that come to my mind:
>
> 1. While downloading / writing data, Wget could build a checksum of the
>    file. [...]
> 2. Using tempfiles / tempdirs only and moving them to the right place.
>    [...]
> 3. Keeping html/css files in memory after downloading. [...]
> 4. Using (advisory) file locks just helps against other Wget instances (is
>    that enough?). [...]
> And with -k you have to keep the descriptor open for each file until Wget
> is done with downloading everything. This is not practical, since there
> could be (10-, 100-)thousands of files to be downloaded.
>
> If someone would like to work on a patch, here is my opinion: I would
> implement 1. as the least complex to code (but it needs more CPU). Point 4
> would not work in all cases.
>
> Regards, Tim

Thanks for the brainstorming. Solution #1 seems the most reasonable to me. I was thinking about 2. and 4., but they have the possible issues that you've already mentioned. I had a look at the source, but unfortunately the changes needed to create and verify the checksum of downloaded files are not trivial.

Regards, Tomas
Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
On Tuesday 03 September 2013 23:00:57 Daniel Stenberg wrote:
> On Tue, 3 Sep 2013, Tim Ruehsen wrote:
>> There was an unexpected signal SIGBUS. It may be a bug or a misuse of
>> Wget, or your hardware is broken. Please think about it.
>
> If you think SIGBUS is the ultimate way to inform a user about an error
> situation, then by all means do that. I wouldn't.

I am not sure why you say this... I did not say that I think that. I said the opposite (I would not print something like that to the user, since it has no additional value, and I have no idea how to add more value to the error message).

Daniel, why don't you say what your favorite error message would look like?

Regards, Tim
Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
On 04/09/13 09:38, Tim Ruehsen wrote:
> Very well, if this would be possible. Right now I have no idea how to
> print something like the above.
>
> I made Tomas Hozza's test with valgrind and wget having debug info. I got
> 18x (out of 20x) SIGBUS, but at completely different places in the code.
> [...] Absolutely randomly / unpredictably, if an outside process changes
> the file size and/or content at the same time. And SIGBUS could also occur
> for any other reason (e.g. real bugs in Wget).

Yes, that would be a bit hard. I suspect it could be possible (on some systems) by setting the handler with SA_SIGINFO and checking whether the si_addr member of the siginfo_t argument falls inside the memory area we allocated for a file. Probably not worth the code complexity. :)
[Bug-wget] Race condition on downloaded files among multiple wget instances
Hello.

In Fedora I have a bug [1] from a guy who is using wget to test web server network load. He runs multiple instances of wget to download some site recursively. Something like this:

for i in `seq 20`; do
    wget -r http://www.makerwise.com/
done

Some of those wget instances are killed with SIGBUS. The problem seems to be that when wget downloads the site, it tries to parse it, but in the meantime the file is being rewritten by another wget. When parsing the web page, wget uses mmap() to map the file into memory. Unfortunately, mmap behaviour is undefined if the file is changed after the mmap call. There is also a stat() call to get the file size before calling mmap, but the file may already have changed (to a different size) by the time mmap is called.

I tried the fallback code (used when mmap is not available) that uses read() to get the file, and it worked without problems even when multiple instances of wget were run. Another concern might be that read() is slower than mmap(), but from what I tried there is no noticeable difference.

Even though I think the described use case is a misuse of wget, I'm wondering if wget shouldn't use read() rather than mmap() to prevent the crashes in such situations. Thanks!

Regards, Tomas Hozza

[1] https://bugzilla.redhat.com/show_bug.cgi?id=999647
Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
----- Original Message -----
> On Tuesday 03 September 2013 03:17:56 Tomas Hozza wrote:
>> In Fedora I have a bug [1] from a guy who is using wget to test web
>> server network load. He runs multiple instances of wget to download some
>> site recursively. Something like this:
>>
>> for i in `seq 20`; do
>>     wget -r http://www.makerwise.com/
>> done
>
> Currently, I call this a misuse, not a bug. Wget could care for such
> cases, but right now it just doesn't.

I have the same opinion, just wanted to check with other people.

> He could make his test with something like this:
>
> for i in `seq 20`; do
>     mkdir $i
>     wget -P $i -r http://www.makerwise.com/
> done
>
> to have a separate directory for each Wget instance.

Yes, I know. I already advised him to do this.

BTW, read() instead of mmap() would not completely help in this case. It would eliminate the SIGBUS, because the length of the file is determined from the number of bytes read(). This way we might end up with a truncated file, but we will not try to access memory where we shouldn't.

Regards, Tomas
Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
On Tuesday 03 September 2013 03:48:15 Tomas Hozza wrote:
> BTW, read() instead of mmap() would not completely help in this case. It
> would eliminate the SIGBUS, because the length of the file is determined
> from the number of bytes read(). This way we might end up with a truncated
> file, but we will not try to access memory where we shouldn't.

In this case some (or all) of the links couldn't be followed, and the test could have "succeeded" without performing the expected action. The tester might not have recognized the fault. In your special case the tester maybe would not mind, but in general it is a good idea not to suppress errors or misbehavior just to make people feel better.

Regards, Tim
Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
On Tue, 3 Sep 2013, Tim Ruehsen wrote:
> but in general it is a good idea not to suppress errors or misbehavior,
> just to make people feel better.

Then it should return an error and an error message etc.; it shouldn't crash with a SIGBUS...

-- 
/ daniel.haxx.se
Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
>> but in general it is a good idea not to suppress errors or misbehavior,
>> just to make people feel better.
>
> Then it should return an error and an error message etc.; it shouldn't
> crash with a SIGBUS...

We could add a simple signal handler to catch a SIGBUS and gracefully exit.

-- 
Thanking You,
Darshit Shah
Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
On Tuesday 03 September 2013 10:59:20 Daniel Stenberg wrote:
> On Tue, 3 Sep 2013, Tim Ruehsen wrote:
>> but in general it is a good idea not to suppress errors or misbehavior,
>> just to make people feel better.
>
> Then it should return an error and an error message etc.; it shouldn't
> crash with a SIGBUS...

What should it say then? My ideas are limited to something like:

  There was an unexpected signal SIGBUS. It may be a bug or a misuse of
  Wget, or your hardware is broken. Please think about it.

This does not give more information than a SIGBUS. Ideas welcome.

Regards, Tim
Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
On Tue, 3 Sep 2013, Tim Ruehsen wrote:
> There was an unexpected signal SIGBUS. It may be a bug or a misuse of
> Wget, or your hardware is broken. Please think about it.

If you think SIGBUS is the ultimate way to inform a user about an error situation, then by all means do that. I wouldn't.

-- 
/ daniel.haxx.se
Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
On 03/09/13 11:16, Tim Ruehsen wrote:
> What should it say then? My ideas are limited to something like:
>
>   There was an unexpected signal SIGBUS. It may be a bug or a misuse of
>   Wget, or your hardware is broken. Please think about it.
>
> This does not give more information than a SIGBUS. Ideas welcome.

Well, if it shall provide more information...

  Error reading the file links.html. I was expecting it to have 23K, but it
  now suddenly has only 420 bytes. It seems that another program has changed
  it behind my back. It is unacceptable to perform my job under these
  conditions. *wget exited*