Re: race condition destroying condition variables
Brent W. Baccala, on Wed, 27 Dec 2017 15:07:51 -0500, wrote:
> On Wed, Dec 27, 2017 at 2:31 PM, Samuel Thibault <samuel.thiba...@gnu.org>
> wrote:
> > Ok, so even if Posix explicitly says that it has undefined behavior,
> > since nptl behaves fine we should probably behave fine too.
>
> My point was that since Linux waits for the other threads in
> pthread_cond_destroy(), we should too.

Well, that's exactly what I wrote :)

Samuel
Re: race condition destroying condition variables
On Wed, Dec 27, 2017 at 2:31 PM, Samuel Thibault wrote:
> Hello,
>
> Brent W. Baccala, on Tue, 26 Dec 2017 23:06:13 -0500, wrote:
> > Also, the Linux source code in nptl/ includes the following comment:
> >
> >     /* If there are waiters which have been already signalled or
> >        broadcasted, but still are using the pthread_cond_t structure,
> >        pthread_cond_destroy needs to wait for them. */
>
> Ok, so even if Posix explicitly says that it has undefined behavior,
> since nptl behaves fine we should probably behave fine too.

Let me clarify - that comment precedes a block of code in
pthread_cond_destroy() that waits for the other threads.  See
glibc-2.23/nptl/pthread_cond_destroy.c, around line 50.

That code's been rewritten in glibc-2.25, but the requirement is still
there.  glibc-2.25/nptl/pthread_cond_destroy.c, lines 38-40:

    Thus, we can assume that all waiters that are still accessing the condvar
    have been woken.  We wait until they have confirmed to have woken up by
    decrementing __wrefs.

... and then it waits for "wrefs >> 3" to become zero.

My point was that since Linux waits for the other threads in
pthread_cond_destroy(), we should too.

agape
brent
Re: race condition destroying condition variables
Hello,

Brent W. Baccala, on Tue, 26 Dec 2017 23:06:13 -0500, wrote:
> Also, the Linux source code in nptl/ includes the following comment:
>
>     /* If there are waiters which have been already signalled or
>        broadcasted, but still are using the pthread_cond_t structure,
>        pthread_cond_destroy needs to wait for them. */

Ok, so even if Posix explicitly says that it has undefined behavior,
since nptl behaves fine we should probably behave fine too.

> The problem that I've got now is that I've changed the size of the
> condition variables by adding an extra field,

Not a good idea, for the reasons you mention :)

> First, I'd like to build the packages using the same scripts used to
> build the Debian distribution packages.  Is this available somewhere?

Just run dpkg-buildpackage from the package source, or use pdebuild, or
use a wanna-build infrastructure.  Nothing simple, at any rate.

> Next, how will modifying the memory layout of condition variables
> affect our packaging?  I figure that we'll have to bump libc0.3 to
> libc0.4, and that should pretty much do the trick, right?

Yes, but that's a *HUGE* transition, which is painful.  Better use
symbol versioning, so that old non-recompiled programs use the old
implementation, and recompiled programs use the new one.

> Finally, I'm wondering if we need to change this at all.  The "__data"
> field in struct __pthread_cond seems unused, at least in the
> libpthreads directory.  Is there any use for this pointer?  Or can I
> use those bits?

AFAICT it was used for compatibility with libcthreads, which we don't
care about any more, so I'd say feel free to use it, and avoid the
transition altogether.

Samuel
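Samuel's suggestion - reuse the bits of the unused "__data" pointer
instead of growing the structure - could look roughly like this.  This
is a hypothetical sketch: fake_cond, WREFS, and the helper names are all
invented, and the real struct __pthread_cond layout differs; the point
is only that a pointer-sized slot can carry an atomic waiter count
without changing sizeof.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct __pthread_cond: the real structure
   has an unused "__data" pointer; its bits can hold a waiter count
   without changing sizeof, so no ABI transition is needed.  All names
   here are invented for illustration. */
struct fake_cond {
    void *queue;   /* stand-in for the real wait-queue fields */
    void *data;    /* formerly-unused pointer, reused as a counter */
};

/* The pointer slot reinterpreted as an atomic counter; uintptr_t fits
   exactly in the pointer's bits. */
#define WREFS(c) ((_Atomic uintptr_t *)&(c)->data)

_Static_assert(sizeof(_Atomic uintptr_t) == sizeof(void *),
               "counter must fit in the reused pointer field");

static void cond_enter(struct fake_cond *c) { atomic_fetch_add(WREFS(c), 1); }
static void cond_leave(struct fake_cond *c) { atomic_fetch_sub(WREFS(c), 1); }
static uintptr_t cond_users(struct fake_cond *c) { return atomic_load(WREFS(c)); }

/* Self-check: two users enter, one leaves. */
static unsigned data_field_demo(void)
{
    struct fake_cond c = { 0, 0 };
    cond_enter(&c);
    cond_enter(&c);
    cond_leave(&c);
    return (unsigned)cond_users(&c);   /* one user still inside */
}
```

Zero-initializing the pointer field leaves the counter at zero, so
existing static initializers keep working.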
Re: race condition destroying condition variables
Well, I've tried both Samuel's (swap space) and Svante's (nocheck)
suggestions and have found that both allow me to successfully build the
glibc packages!

The problem that I've got now is that I've changed the size of the
condition variables by adding an extra field, an integer that tracks the
number of waiting threads.  This means that I have to recompile anything
that depends on the size of a condition variable, which is potentially
everything on the system.

First, I'd like to build the packages using the same scripts used to
build the Debian distribution packages.  Is this available somewhere?

Next, how will modifying the memory layout of condition variables affect
our packaging?  I figure that we'll have to bump libc0.3 to libc0.4, and
that should pretty much do the trick, right?

Finally, I'm wondering if we need to change this at all.  The "__data"
field in struct __pthread_cond seems unused, at least in the libpthreads
directory.  Is there any use for this pointer?  Or can I use those bits?

Also, the Linux source code in nptl/ includes the following comment:

    /* If there are waiters which have been already signalled or
       broadcasted, but still are using the pthread_cond_t structure,
       pthread_cond_destroy needs to wait for them. */

...which is the conclusion I've come to as well.

agape
brent
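To illustrate why the added field forces that system-wide rebuild: any
program compiled against the old headers bakes sizeof the old structure
into its own binary.  A hedged sketch, with invented layouts standing in
for the real struct __pthread_cond:

```c
#include <assert.h>

/* Illustrative layouts only, not the real struct __pthread_cond: an
   "old" condvar, and the same condvar with an extra waiter-count field
   appended.  Every program that embeds the type, puts it in an array,
   or malloc()s sizeof() it bakes the old size into its own binary, so
   growing the struct is an ABI break for all non-recompiled users. */
struct old_cond { unsigned lock; unsigned seq; unsigned flags; };
struct new_cond { unsigned lock; unsigned seq; unsigned flags;
                  unsigned nwaiters;   /* the added field */ };

/* A guard like this makes the size change loud at compile time. */
_Static_assert(sizeof(struct new_cond) > sizeof(struct old_cond),
               "adding a field grew the structure: ABI transition needed");
```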
Re: race condition destroying condition variables
On Fri, 2017-12-22 at 00:22 +0100, Samuel Thibault wrote:
> > The testsuite for glibc is known to be broken.  Just
> > export DEB_BUILD_OPTIONS=nocheck before building
> > with dpkg-buildpackage -b ...
>
> Broken tests are marked as such, there aren't so many marked as such.
>
> Using nocheck is a way to avoid the issue indeed.
>
> The build boxes have 3GB memory and 4GB swap, and do pass the
> testsuite fine (minus the marked tests)

Well, I've never managed to build glibc with tests enabled.  And it
takes a much longer time to build.  But of course YMMV.
Re: race condition destroying condition variables
Svante Signell, on Thu, 21 Dec 2017 23:58:20 +0100, wrote:
> On Thu, 2017-12-21 at 17:32 -0500, Brent W. Baccala wrote:
> > Well, I've got a patch that might work, but I'm having a lot of
> > trouble testing it.
> >
> > I can't dpkg-buildpackage the Debian glibc package.
> >
> > It gets into the test routines, then a bunch of the math tests crash
> > with SIGSEGVs and SIGILLs, then I get a bunch of kernel errors:
> >
> > no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2423))
> > no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2423))
> > no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2423))
> > no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2814))
> > no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2814))
> > no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2814))
> >
> > This happens in either tst-vfprintf-width-prec or tst-mallocfork2,
> > depending on whether I allocate 2 GB or 4 GB to the virtual machine.
> >
> > Any ideas what the problem might be?
> >
> > And how are the Debian packages built for downloading?  It must be
> > something different than what I'm doing...
>
> Hi Brent,
>
> The testsuite for glibc is known to be broken.  Just
> export DEB_BUILD_OPTIONS=nocheck before building
> with dpkg-buildpackage -b ...

Broken tests are marked as such, there aren't so many marked as such.

Using nocheck is a way to avoid the issue indeed.

The build boxes have 3GB memory and 4GB swap, and do pass the testsuite
fine (minus the marked tests)

Samuel
Re: race condition destroying condition variables
On Thu, 2017-12-21 at 17:32 -0500, Brent W. Baccala wrote:
> Well, I've got a patch that might work, but I'm having a lot of
> trouble testing it.
>
> I can't dpkg-buildpackage the Debian glibc package.
>
> It gets into the test routines, then a bunch of the math tests crash
> with SIGSEGVs and SIGILLs, then I get a bunch of kernel errors:
>
> no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2423))
> no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2423))
> no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2423))
> no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2814))
> no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2814))
> no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2814))
>
> This happens in either tst-vfprintf-width-prec or tst-mallocfork2,
> depending on whether I allocate 2 GB or 4 GB to the virtual machine.
>
> Any ideas what the problem might be?
>
> And how are the Debian packages built for downloading?  It must be
> something different than what I'm doing...

Hi Brent,

The testsuite for glibc is known to be broken.  Just
export DEB_BUILD_OPTIONS=nocheck before building
with dpkg-buildpackage -b ...

HTH
Re: race condition destroying condition variables
Well, I've got a patch that might work, but I'm having a lot of trouble
testing it.

I can't dpkg-buildpackage the Debian glibc package.

It gets into the test routines, then a bunch of the math tests crash
with SIGSEGVs and SIGILLs, then I get a bunch of kernel errors:

    no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2423))
    no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2423))
    no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2423))
    no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2814))
    no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2814))
    no more room in ee26a908 ((...i386-libc/elf/ld.so.1(2814))

This happens in either tst-vfprintf-width-prec or tst-mallocfork2,
depending on whether I allocate 2 GB or 4 GB to the virtual machine.

Any ideas what the problem might be?

And how are the Debian packages built for downloading?  It must be
something different than what I'm doing...

agape
brent
Re: race condition destroying condition variables
On Tue, Dec 19, 2017 at 3:25 AM, Samuel Thibault wrote:
> Brent W. Baccala, on Tue, 19 Dec 2017 00:08:44 -0500, wrote:
> > Looks like there's a race condition when we destroy a condition
> > variable.  My understanding of the expected behavior is that once
> > all the threads have been signaled (i.e., pthread_cond_broadcast is
> > called), the condition variable can be safely destroyed with
> > pthread_cond_destroy.
>
> Err, I don't think that POSIX allows to assume that.  The fact that
> pthread_cond_broadcast has returned doesn't mean that other threads
> have finished with pthread_cond_wait.

POSIX seems a little unclear on that, but C++ seems to require it.

POSIX: "It shall be safe to destroy an initialized condition variable
upon which no threads are currently blocked." [1] and "The
pthread_cond_broadcast() function shall unblock all threads currently
blocked on the specified condition variable cond." [2]

A little further down on [1], in an "informative" section, is the
following snippet of code:

    (A) pthread_cond_broadcast(&ep->notbusy);
        pthread_mutex_unlock(&lp->lm);
    (B) pthread_cond_destroy(&rp->notbusy);

...accompanied by the following comment: "In this example, the
condition variable and its list element may be freed (line B)
immediately after all threads waiting for it are awakened (line A)..."
which is what makes sense to me.

Of course, our pthread_cond_destroy() does nothing, but assuming that
the condvar is now freed in a step (C), this code won't work reliably in
our current implementation.

I found a discussion of this on stackexchange [3], where the top answer
observes that 'The standard only says "shall unblock" and not "on
successful return from this call they will be unblocked".'

The C++ standard, however, is more explicit, in section 30.5.1: "Only
the notification to unblock the wait must happen before destruction".
cppreference.com [4] says: "It is only safe to invoke the destructor if
all threads have been notified.  It is not required that they have
exited their respective wait functions: some threads may still be
waiting to reacquire the associated lock, or may be waiting to be
scheduled to run after reacquiring it."

That's the behavior I was counting on.  I'm using C++, and have found
that I can't just notify_all() a condition variable, then destroy it.
And, yes, I've got everything locked so that another thread can't jump
in there with a wait() between those two events.

On Tue, Dec 19, 2017 at 4:17 AM, Richard Braun wrote:
> Besides, the threads should also all go through reacquiring the
> associated mutex, usually sitting right next to the condition
> variable, and usually both embedded in a parent object.  What you're
> normally really interested in is releasing this parent object,
> including destroying the mutex, which means you also have to wait for
> all threads to unlock it.  One common way to deal with this is
> reference counters on the parent object.

In my case, I've got a lot more condition variables than mutexes,
because I don't want threads waking up for events other than the one
they're waiting for.  One mutex for the entire pager, and a condition
variable for every outstanding lock and write.  So, for example, if a
thread is waiting for a particular write to finish, it waits on that
write's condition variable, which gets signaled and destroyed when the
write finishes, while the pager object and the mutex continue to exist.

I was thinking about wrapping such a counter as you suggest into the
condition variable structure, to ensure that the POSIX informative
behavior and the C++ behavior work reliably.  Increment the counter
(atomically) when we're about to block on a condition variable, and
decrement it when we're done with the post-block processing.  Then
pthread_cond_destroy() can first check to make sure that the wait queue
is empty (return EBUSY if not), then spin-wait for the counter to reach
zero.

I think that should ensure that we can free() the condition variable
after pthread_cond_destroy() has returned, and that, in turn, will
ensure that the C++ destructor works reliably, too.

agape
brent

[1] http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_cond_destroy.html
[2] http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_broadcast.html
[3] https://stackoverflow.com/questions/7598457/when-can-a-cond-var-be-used-to-synchronize-its-own-destruction-unmapping
[4] http://en.cppreference.com/w/cpp/thread/condition_variable/~condition_variable
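A minimal sketch of the counter scheme described above - an invented
wrapper around an ordinary pthread_cond_t, not the actual libpthread
patch: increment before blocking, decrement after the post-block
processing, and have destroy spin until every in-flight waiter is gone.
(The EBUSY check for still-blocked threads needs the wait-queue
internals and is elided here.)  It ends with a small self-check that
broadcasts and then destroys.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Invented wrapper illustrating the proposal: nwaiters counts threads
   between "about to block" and "finished post-block processing". */
struct counted_cond {
    pthread_cond_t cond;      /* stand-in for the real wait machinery */
    atomic_uint    nwaiters;  /* threads still touching the condvar */
};

static void counted_cond_wait(struct counted_cond *c, pthread_mutex_t *m)
{
    atomic_fetch_add(&c->nwaiters, 1);   /* about to block */
    pthread_cond_wait(&c->cond, m);
    /* ... post-block processing would happen here ... */
    atomic_fetch_sub(&c->nwaiters, 1);   /* done with the condvar */
}

static int counted_cond_destroy(struct counted_cond *c)
{
    /* The real patch would first return EBUSY if the wait queue still
       held unsignalled threads; that bookkeeping lives inside the wait
       queue and is elided here.  Spin until signalled waiters are out. */
    while (atomic_load(&c->nwaiters) != 0)
        sched_yield();
    return pthread_cond_destroy(&c->cond);
}

/* Self-check: broadcast, then destroy; destroy must not return until
   the woken thread has left counted_cond_wait(). */
static struct counted_cond cc = { PTHREAD_COND_INITIALIZER, 0 };
static pthread_mutex_t demo_mtx = PTHREAD_MUTEX_INITIALIZER;
static int demo_done = 0;

static void *demo_waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&demo_mtx);
    while (!demo_done)
        counted_cond_wait(&cc, &demo_mtx);
    pthread_mutex_unlock(&demo_mtx);
    return NULL;
}

static int counted_cond_demo(void)
{
    pthread_t t;
    if (pthread_create(&t, NULL, demo_waiter, NULL) != 0)
        return -1;
    pthread_mutex_lock(&demo_mtx);
    demo_done = 1;
    pthread_cond_broadcast(&cc.cond);
    pthread_mutex_unlock(&demo_mtx);
    int rc = counted_cond_destroy(&cc);  /* waits for the waiter to leave */
    pthread_join(t, NULL);
    return rc != 0 ? rc : (int)atomic_load(&cc.nwaiters);
}
```

A real implementation would sleep on the counter (a futex/gsync wait)
rather than spin with sched_yield(), but the ordering guarantee is the
same: once destroy returns, no thread is still touching the structure,
so it is safe to free().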
Re: race condition destroying condition variables
On Tue, Dec 19, 2017 at 09:25:16AM +0100, Samuel Thibault wrote:
> Brent W. Baccala, on Tue, 19 Dec 2017 00:08:44 -0500, wrote:
> > Looks like there's a race condition when we destroy a condition
> > variable.  My understanding of the expected behavior is that once
> > all the threads have been signaled (i.e., pthread_cond_broadcast is
> > called), the condition variable can be safely destroyed with
> > pthread_cond_destroy.
>
> Err, I don't think that POSIX allows to assume that.  The fact that
> pthread_cond_broadcast has returned doesn't mean that other threads
> have finished with pthread_cond_wait.

Besides, the threads should also all go through reacquiring the
associated mutex, usually sitting right next to the condition variable,
and usually both embedded in a parent object.  What you're normally
really interested in is releasing this parent object, including
destroying the mutex, which means you also have to wait for all threads
to unlock it.  One common way to deal with this is reference counters
on the parent object.

-- 
Richard Braun
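The parent-object reference counting described here might be sketched
like this (all names invented; in the pager case the parent would be
the pager object itself):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch of the idiom: the mutex and condvar are embedded in a parent
   object, every thread that can touch the object holds a reference,
   and only the thread dropping the last reference destroys the mutex
   and condvar - so no thread can still be unlocking them. */
struct parent {
    pthread_mutex_t lock;
    pthread_cond_t  notbusy;
    atomic_uint     refs;
};

static struct parent *parent_create(void)
{
    struct parent *p = malloc(sizeof *p);
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->notbusy, NULL);
    atomic_init(&p->refs, 1);           /* creator's reference */
    return p;
}

static void parent_ref(struct parent *p)
{
    atomic_fetch_add(&p->refs, 1);
}

static void parent_unref(struct parent *p)
{
    if (atomic_fetch_sub(&p->refs, 1) == 1) {
        pthread_cond_destroy(&p->notbusy);
        pthread_mutex_destroy(&p->lock);
        free(p);
    }
}

/* Self-check: two references taken, two dropped, object freed once. */
static unsigned parent_demo(void)
{
    struct parent *p = parent_create();
    parent_ref(p);                      /* a second user appears */
    unsigned during = atomic_load(&p->refs);
    parent_unref(p);                    /* second user done */
    parent_unref(p);                    /* creator done: p freed here */
    return during;
}
```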
Re: race condition destroying condition variables
Brent W. Baccala, on Tue, 19 Dec 2017 00:08:44 -0500, wrote:
> Looks like there's a race condition when we destroy a condition
> variable.  My understanding of the expected behavior is that once all
> the threads have been signaled (i.e., pthread_cond_broadcast is
> called), the condition variable can be safely destroyed with
> pthread_cond_destroy.

Err, I don't think that POSIX allows to assume that.  The fact that
pthread_cond_broadcast has returned doesn't mean that other threads
have finished with pthread_cond_wait.

Samuel