Re: [Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-10 Thread Giuseppe Scrivano
Tim Ruehsen tim.rueh...@gmx.de writes:

 And SIGBUS could also occur out of any other reason (e.g. real bugs in Wget).

 As was already said, replacing mmap by read would not crash (wget_read_file() 
 reads as many bytes as there are without prior checking the length of the 
 file). But without additional logic, it might read random data (many 
 processes 
 writing into the file at the same time, not necessarily the same data). Wget 
 would try to parse / change (-k) it, the result would be broken, but no error 
 would be printed. So, replacing mmap is not a solution, but maybe a part of a 
 solution.

 Now to the possible solutions that come into my mind:
 1. While downloading / writing data, Wget could build a checksum of the file.
 It allows checking later when re-reading the file. In this case we could 
 really tell the user: hey, someone trashed our file while we are working...
 To get this working, we must remove the mmap code.

 2. Using tempfiles / tempdirs only and move them to the right place. That 
 would bring in some kind of atomicity, though there are still conflicts to 
 solve (e.g. a second Wget instance is faster - should we overwrite existing 
 files / directories).

 3. Keeping html/css files in memory after downloading. These are the ones we 
 later re-read to parse them for links/URLs. Writing them to disk after 
 parsing 
 using a tempfile and a move/rename to have atomicity.

 4. Using (advisory) file-locks just helps against other Wget instances (is 
 that enough ?). And with -k you have to keep the descriptor open for each 
 file 
 until Wget is done with downloading everything. This is not practical, since 
 there could be (10-, 100-)thousands of files to be downloaded.

 If someone likes to work on a patch, here is my opinion: I would implement 1. 
 as the least complex to code (but it needs more CPU). Point 4 is would not 
 work in any cases.

I don't think we should aim at supporting more instances of wget that
can run on the same tree, but we can aim at having at least atomic
operations per file.

Said so, I think that using temp files is more than enough and we
shouldn't really care about possible conflicts, another instance of wget
is just a separate entity for us that we should not consider.

File locks can be implemented as an additional level of security and
requiring an explicit argument to enable them, but still I don't see the
point since we don't support multiple processes running simultaneously
on the same data.

Hopefully we will merge the parallel-wget branch soon, so we will have
threads instead of processes :-)

Giuseppe



Re: [Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-09 Thread Tomas Hozza
- Original Message -
 Very well, if this would be possible. Right now I have no idea how to print
 something like the above. I made Tomas Hozza's test with valgrind and wget
 having debug info. I got 18x (out of 20x) SIGBUS, but on completely different
 places in the code. Within the misuse test situation, SIGBUS could occur at
 any place where memory access (read or write) allocated by wget_read_file().
 Absolutely randomly / unpredictable if an outside process changes the file
 size and/or content at the same time.
 
 And SIGBUS could also occur out of any other reason (e.g. real bugs in Wget).
 
 As was already said, replacing mmap by read would not crash (wget_read_file()
 reads as many bytes as there are without prior checking the length of the
 file). But without additional logic, it might read random data (many
 processes
 writing into the file at the same time, not necessarily the same data). Wget
 would try to parse / change (-k) it, the result would be broken, but no error
 would be printed. So, replacing mmap is not a solution, but maybe a part of a
 solution.
 
 Now to the possible solutions that come into my mind:
 1. While downloading / writing data, Wget could build a checksum of the file.
 It allows checking later when re-reading the file. In this case we could
 really tell the user: hey, someone trashed our file while we are working...
 To get this working, we must remove the mmap code.
 
 2. Using tempfiles / tempdirs only and move them to the right place. That
 would bring in some kind of atomicity, though there are still conflicts to
 solve (e.g. a second Wget instance is faster - should we overwrite existing
 files / directories).
 
 3. Keeping html/css files in memory after downloading. These are the ones we
 later re-read to parse them for links/URLs. Writing them to disk after
 parsing
 using a tempfile and a move/rename to have atomicity.
 
 4. Using (advisory) file-locks just helps against other Wget instances (is
 that enough ?). And with -k you have to keep the descriptor open for each
 file
 until Wget is done with downloading everything. This is not practical, since
 there could be (10-, 100-)thousands of files to be downloaded.
 
 If someone likes to work on a patch, here is my opinion: I would implement 1.
 as the least complex to code (but it needs more CPU). Point 4 is would not
 work in any cases.
 
 Regards, Tim

Thanks for the brainstorming. The solution #1 seems as most reasonable
to me. I was thinking about 2. and 4., but there are possible issues
that you've already mentioned.

I had a look at the source, but unfortunately the changes to create and
verify the checksum of downloaded files is not trivial.

Regards,

Tomas



Re: [Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-04 Thread Tim Ruehsen
On Tuesday 03 September 2013 23:00:57 Daniel Stenberg wrote:
 On Tue, 3 Sep 2013, Tim Ruehsen wrote:
  There was an unexpected signal SIGBUS. It may be a bug or a misuse of
  Wget
  or your hardware is broken. Please think about it..
 
 If you think SIGBUS is the ultimate way to inform a user about an error
 situation, then by all means do that. I wouldn't.

I am not sure why you say this...
I did not say that I think that. I said the opposite (I would not print 
something like that to the user since it has no additional value and I have no 
idea how to add more value to the error message).

Daniel, why don't you say, what your favorite error message would look like ?

Regards, Tim




Re: [Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-04 Thread Ángel González

On 04/09/13 09:38, Tim Ruehsen wrote:

Very well, if this would be possible. Right now I have no idea how to print
something like the above. I made Tomas Hozza's test with valgrind and wget
having debug info. I got 18x (out of 20x) SIGBUS, but on completely different
places in the code. Within the misuse test situation, SIGBUS could occur at
any place where memory access (read or write) allocated by wget_read_file().
Absolutely randomly / unpredictable if an outside process changes the file
size and/or content at the same time.

And SIGBUS could also occur out of any other reason (e.g. real bugs in Wget).

Yes, that would be a bit hard.
I suspect it could be possible (on some systems) by setting the handler 
with SA_SIGINFO
and checking whether the si_addr member of the siginfo_t argument falls 
inside the

memory area we allocated for a file.
Probably not worth the code complexity. :)




[Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-03 Thread Tomas Hozza
Hello.

In Fedora I have a bug [1] from guy that is using wget
to test web server network load. He runs multiple 
instances of wget to download some site recursively.
Something like this:

for i in `seq 20`; do
wget -r http://www.makerwise.com/ 
done

Some of those wget instances are killed with SIGBUS.
The problem seems to be that when wget downloads the
site, it tries to parse it, but in the meantime the
file is being rewritten by another wget. When parsing
the web page wget uses mmap() to map the file into the
memory. Unfortunately mmap behaviour is undefined if
the file is changed after mmap call. Also there is 
stat() call to get the file size before calling mmap,
but the file can be already changed (with different size)
when calling mmap.

I tried the fallback (if mmap is not available) code that
uses read() to get the file and it worked without problems
also if multiple instances of wget were run. Another problem
might be that read() is slower than mmap(), but from what I
tried it is not any noticeable difference.

Even though I think the described use case is a misuse of
wget I'm wondering if wget shouldn't use read() rather than
mmap() to prevent the crashes in such a situations.

Thanks!

Regards,

Tomas Hozza


[1] https://bugzilla.redhat.com/show_bug.cgi?id=999647



Re: [Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-03 Thread Tomas Hozza


- Original Message -
 On Tuesday 03 September 2013 03:17:56 Tomas Hozza wrote:
  In Fedora I have a bug [1] from guy that is using wget
  to test web server network load. He runs multiple
  instances of wget to download some site recursively.
  Something like this:
  
  for i in `seq 20`; do
  wget -r http://www.makerwise.com/ 
  done
 
 Currently, I call this misuse not bug.
 Wget could care for such cases, but right now it just doesn't.

I have the same opinion, just wanted to check with other people.

 He could make his test with something like this:
 for i in `seq 20`; do
 mkdir $i
 wget -P $i -r http://www.makerwise.com/ 
 done
 To have an own directory for each Wget instance.

Yes, I know. I already advised him to do this. 

 BTW, read() instead of mmap() would not help in this case.

It would eliminate the SIGBUS because the length of the file
is determined from the number of read bytes by read(). This way
we might end up with truncated file, but will not try to access
the memory where we shouldn't.


Regards,

Tomas



Re: [Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-03 Thread Tim Ruehsen
On Tuesday 03 September 2013 03:48:15 Tomas Hozza wrote:
  BTW, read() instead of mmap() would not help in this case.
 
 It would eliminate the SIGBUS because the length of the file
 is determined from the number of read bytes by read(). This way
 we might end up with truncated file, but will not try to access
 the memory where we shouldn't.

In this case, the (or some) links couldn't be followed and the test could have 
succeeded without doing the expected action. The tester maybe wouldn't have 
recognized the fault. In your special case, the tester maybe would not mind, 
but in general it is a good idea not to suppress errors or misbehavior, just 
to make people feel better.

Regards, Tim



Re: [Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-03 Thread Daniel Stenberg

On Tue, 3 Sep 2013, Tim Ruehsen wrote:

but in general it is a good idea not to suppress errors or misbehavior, just 
to make people feel better.


Then it should return an error and error message etc, it shouldn't crash with 
a SIGBUS...


--

 / daniel.haxx.se



Re: [Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-03 Thread Darshit Shah
  but in general it is a good idea not to suppress errors or misbehavior,
 just to make people feel better.


 Then it should return an error and error message etc, it shouldn't crash
 with a SIGBUS...


We could add a simple signal handler to catch a SIGBUS and gracefully exit.

-- 
Thanking You,
Darshit Shah


Re: [Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-03 Thread Tim Ruehsen
On Tuesday 03 September 2013 10:59:20 Daniel Stenberg wrote:
 On Tue, 3 Sep 2013, Tim Ruehsen wrote:
  but in general it is a good idea not to suppress errors or misbehavior,
  just to make people feel better.
 
 Then it should return an error and error message etc, it shouldn't crash
 with a SIGBUS...

What should it say than ?
My ideas are limited to something like
There was an unexpected signal SIGBUS. It may be a bug or a misuse of Wget or 
your hardware is broken. Please think about it..

This does not give more information than a SIGBUS.
Ideas welcome.

Regards, Tim



Re: [Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-03 Thread Daniel Stenberg

On Tue, 3 Sep 2013, Tim Ruehsen wrote:

There was an unexpected signal SIGBUS. It may be a bug or a misuse of Wget 
or your hardware is broken. Please think about it..


If you think SIGBUS is the ultimate way to inform a user about an error 
situation, then by all means do that. I wouldn't.


--

 / daniel.haxx.se



Re: [Bug-wget] Race condition on downloaded files among multiple wget instances

2013-09-03 Thread Ángel González

On 03/09/13 11:16, Tim Ruehsen wrote:

What should it say than ?
My ideas are limited to something like
There was an unexpected signal SIGBUS. It may be a bug or a misuse of Wget or
your hardware is broken. Please think about it..

This does not give more information than a SIGBUS.
Ideas welcome.

Well, if it shall provide more information...

Error reading links.html file. I was expecting it to have 23K, but it 
now suddenly has
only 420 bytes. Seems that another program has changed it behind my 
back. It is

unacceptable to perform my job under this conditions.
*wget exited*