Thanks. Yeah, I didn't track it down, but I suspect this behavior has always been there. I do think it's ultimately incorrect behavior (i.e., a violation of edge triggering semantics as I note in the initial report). The implications of this is that we'd like the option to construct hierarchical epoll sets and use them in generic code that uses epoll_wait and expects edge triggering, but we can't (because epoll fd's behave differently from other fd's). I'm a little surprised others haven't come across this.
I'm not really a kernel developer nor am I particularly familiar with this code, so I'm unclear how ugly a fix would be...I can imagine there may be locking issues with recursively traversing child epoll fd's?... Nick On Wed, Jan 17, 2018 at 9:21 AM, Jason Baron <jba...@akamai.com> wrote: > > > On 01/12/2018 07:06 PM, Nick Murphy wrote: >> [1.] One line summary of the problem: >> epoll_wait does not obey edge triggering semantics for file >> descriptors which are themselves epoll file descriptors (i.e., epoll >> fd's added to an epoll set with the EPOLLET flag) >> >> [2.] Full description of the problem/report: >> When executing the following sequence: >> 1) create and add an event fd (for example) to an inner epoll set >> 2) add the inner epoll fd to an outer epoll set (with EPOLLET flag set) >> 3) write to (increase the value of) the event fd >> 4) epoll_wait on outer fd >> 5) epoll_wait on outer fd again >> >> Edge triggering semantics imply that the epoll_wait in step 5 should >> block (nothing has changed). It does not. It returns immediately. >> >> If epoll_wait is called on the inner fd between steps 4 and 5, the >> epoll_wait in step 5 will then block as expected. >> >> Does not seem to matter if the event is added to the inner epoll set >> with EPOLLET set or not. >> >> [3.] Keywords (i.e., modules, networking, kernel): epoll, epoll_wait, >> edge triggering >> >> [4.] Kernel version (from /proc/version): 4.4.0-103-generic (gcc version >> 4.8.4) >> >> [6.] A small shell script or example program which triggers the >> problem (if possible) >> > > Interesting - it seems that epoll can excessively queue wakeup events > when not desired. Here's a small patch which cures this case, if you > want to try it out. The semantics around nested edge trigger, though do > seem unexpected. For example, in the test case you presented. If one > does epoll_wait() on the outer first and then the inner both > epoll_wait() will return, however if one does the inner first and then > the outer, only the inner will return an event. This has to do with how > epoll implements its polling, it seems odd as well and trickier to fix. > Afaict its always acted like this. > > Thanks, > > -Jason > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c > index afd548e..6bd1f46 100644 > --- a/fs/eventpoll.c > +++ b/fs/eventpoll.c > @@ -713,6 +713,7 @@ static int ep_scan_ready_list(struct eventpoll *ep, > if (!ep_is_linked(&epi->rdllink)) { > list_add_tail(&epi->rdllink, &ep->rdllist); > ep_pm_stay_awake(epi); > + pwake = 1; > } > } > /* > @@ -728,7 +729,8 @@ static int ep_scan_ready_list(struct eventpoll *ep, > list_splice(&txlist, &ep->rdllist); > __pm_relax(ep->ws); > > - if (!list_empty(&ep->rdllist)) { > + if (unlikely(pwake)) { > + pwake = 0; > /* > * Wake up (if active) both the eventpoll wait list and > * the ->poll() wait list (delayed after we release the > lock). > @@ -744,7 +746,7 @@ static int ep_scan_ready_list(struct eventpoll *ep, > mutex_unlock(&ep->mtx); > > /* We have to call this outside the lock */ > - if (pwake) > + if (unlikely(pwake)) > ep_poll_safewake(&ep->poll_wait); > > return error; > > >> #include <fcntl.h> >> #include <stdio.h> >> #include <sys/epoll.h> >> #include <sys/eventfd.h> >> #include <unistd.h> >> >> int main(int argc, char** argv) { >> struct epoll_event ev, events[1]; >> int inner_ep, outer_ep, sem, flags, ret; >> long long val = 1; >> >> if ((sem = eventfd(0, 0)) < 0) { >> fprintf(stderr, "eventfd failed"); >> return -1; >> } >> >> if ((inner_ep = epoll_create(1)) < 0) { >> fprintf(stderr, "inner epoll_create failed"); >> return -1; >> } >> >> // Set inner to be non-blocking (probably irrelevant, but...) >> if ((flags = fcntl(inner_ep, F_GETFL, 0)) < 0) { >> fprintf(stderr, "fcntl get failed"); >> return -1; >> } >> flags |= O_NONBLOCK; >> if (fcntl(inner_ep, F_SETFL, flags) < 0) { >> fprintf(stderr, "fcntl set failed"); >> return -1; >> } >> >> // Add the event to the inner epoll instance. >> ev.events = EPOLLIN | EPOLLET; >> ev.data.fd = sem; >> if (epoll_ctl(inner_ep, EPOLL_CTL_ADD, sem, &ev) < 0) { >> fprintf(stderr, "inner add failed"); >> return -1; >> } >> >> if ((outer_ep = epoll_create(1)) < 0) { >> fprintf(stderr, "outer epoll_create failed"); >> return -1; >> } >> >> // Add the inner epoll instance to the outer. Note the EPOLLET! >> ev.events = EPOLLIN | EPOLLET; >> ev.data.fd = inner_ep; >> if (epoll_ctl(outer_ep, EPOLL_CTL_ADD, inner_ep, &ev) < 0) { >> fprintf(stderr, "outer add failed"); >> return -1; >> } >> >> // Write to the event. >> if (write(sem, &val, sizeof(val)) != sizeof(val)) { >> fprintf(stderr, "write failed"); >> return -1; >> } >> >> // This should return immediately. >> printf("First wait.\n"); >> if ((ret = epoll_wait(outer_ep, events, 1, 10000)) < 0) { >> fprintf(stderr, "epoll_wait failed"); >> return -1; >> } >> printf("...returned %d.\n", ret); >> >> // This should _not_ return immediately if edge triggered! >> printf("Second wait."); >> if ((ret = epoll_wait(outer_ep, events, 1, 10000)) < 0) { >> fprintf(stderr, "epoll_wait failed"); >> return -1; >> } >> printf("...returned %d.\n", ret); >> >> printf("Done.\n"); >> >> return 0; >> } >> >> [7.] Environment >> (Should not be sensitive to hardware or kernel version, I don't >> think...a basic flaw with epoll_wait logic) >>