On 7/29/17 8:46 AM, William Allen Simpson wrote:
On 7/28/17 7:37 AM, Dominique Martinet wrote:
I also get a random crash with ASAN during init, one out of 3-4 times
[...]
I'm not able to reproduce. But Matt and I agree that you've found a
race condition that has been there for a long time. It affects both
UDP and TCP (and probably RDMA).
The window is between the return from epoll_wait() and the SVC_REF()
for the task that will handle the signal. A different thread could
decrement the counter during that window, find it is transiently zero,
and destroy the transport.
Looking back at patches over the years, we've been trying to fix this
problem for a very long time:
fe2db638 (Matt Benjamin 2012-02-13 16:17:56 -0500 871) xprt
= (SVCXPRT *) ev->data.ptr;
e0687ee7 (Matt Benjamin 2012-09-18 13:06:49 -0400 872)
xp_ev = (struct svc_xprt_ev *) xprt->xp_ev;
b36f0878 (Matt Benjamin 2013-03-04 14:29:53 -0800 873)
e0687ee7 (Matt Benjamin 2012-09-18 13:06:49 -0400 874) if (!
(xp_ev->flags & XP_EV_FLAG_BLOCKED)) {
9f7e29a6 (Matt Benjamin 2012-12-28 06:10:27 -0500 875)
/* check for valid xprt */
9f7e29a6 (Matt Benjamin 2012-12-28 06:10:27 -0500 876)
mutex_lock(&xprt->xp_lock);
dabdef7d (Matt Benjamin 2013-03-15 09:10:12 -0700 877)
9f7e29a6 (Matt Benjamin 2012-12-28 06:10:27 -0500 878) if ((!
(xprt->xp_flags & SVC_XPRT_FLAG_DESTROYED)) &&
9f7e29a6 (Matt Benjamin 2012-12-28 06:10:27 -0500 879)
(xprt->xp_refcnt > 0)) {
9f7e29a6 (Matt Benjamin 2012-12-28 06:10:27 -0500 880)
/* XXX take extra ref, callout will release */
9f7e29a6 (Matt Benjamin 2012-12-28 06:10:27 -0500 881)
refcnt = xprt->xp_refcnt;
dabdef7d (Matt Benjamin 2013-03-15 09:10:12 -0700 882)
9f7e29a6 (Matt Benjamin 2012-12-28 06:10:27 -0500 883)
__warnx(TIRPC_DEBUG_FLAG_REFCNT,
dabdef7d (Matt Benjamin 2013-03-15 09:10:12 -0700 884)
"%s: pre getreq ref %p %u",
dabdef7d (Matt Benjamin 2013-03-15 09:10:12 -0700 885)
__func__, xprt, refcnt);
9f7e29a6 (Matt Benjamin 2012-12-28 06:10:27 -0500 886)
b36f0878 (Matt Benjamin 2013-03-04 14:29:53 -0800 887)
__warnx(TIRPC_DEBUG_FLAG_SVC_RQST,
b36f0878 (Matt Benjamin 2013-03-04 14:29:53 -0800 888)
"%s: event ix %d fd or ptr (%d:%p) "
b36f0878 (Matt Benjamin 2013-03-04 14:29:53 -0800 889)
"EPOLL event %d (refs %d)",
b36f0878 (Matt Benjamin 2013-03-04 14:29:53 -0800 890)
__func__, ix, ev->data.fd, ev->data.ptr,
dabdef7d (Matt Benjamin 2013-03-15 09:10:12 -0700 891)
ev->events, refcnt);
b36f0878 (Matt Benjamin 2013-03-04 14:29:53 -0800 892)
9f7e29a6 (Matt Benjamin 2012-12-28 06:10:27 -0500 893)
SVC_REF(xprt, SVC_REF_FLAG_LOCKED);
Although I've looked at and refactored this code over the years, it
never occurred to me that was sometimes using already freed memory to
check the flags and reference counter.
Because it will be an API change to SVC_RELEASE() -- a macro and inline
function -- we cannot backport. Obviously, we haven't seen this very
often. But it will be a good reason for moving downstreams to V2.6.
We don't have many crashes in old code because the very recently freed
memory is still accessible. But still, the new code will be better
and faster.
Many thanks to Dominique for spending 5 hours finding this subtle and
pernicious bug, providing the diagnostic "Eureka!" moment.
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Nfs-ganesha-devel mailing list
Nfs-ganesha-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel