Hi all

A recent update of libevent seems to have caused a regression on our side.

On our cluster of 32-core nodes, processes launched directly with srun hang in MPI_Init: they appear to deadlock, looping endlessly in opal_event_loop().

Here is the changeset that seems to be the cause of this regression:
changeset:    17590:58d39172e7b3
branch:       v1.5
parent:       17552:6fdc0376d114
user:         bosilca
date:         Tue Feb 23 22:38:06 2010 +0000
summary:      Refresh the libevent to 1.4.13.

It seems that libevent 1.4.13 was modified when it was merged into Open MPI. The regression disappears if I apply the attached patch, which restores the original libevent code.
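
To illustrate the difference, here is a minimal sketch of the two growth rules. It is not the actual Open MPI or libevent code: the helper names (grow_strict, grow_inclusive, slot_is_valid, growth_demo) and the starting table size of 32 are my own, chosen for illustration, and I am assuming the caller indexes fds[max] right after the recalc returns.

```c
#include <assert.h>

/* Hypothetical helper: growth rule as currently found in opal/event
 * ("max > nfds" and "nfds < max"). Returns the resulting table size.
 * Assumes nfds > 0, otherwise the doubling loop would never end. */
static int grow_strict(int nfds, int max)
{
    if (max > nfds) {
        while (nfds < max)
            nfds <<= 1;
    }
    return nfds;
}

/* Hypothetical helper: growth rule restored by the patch
 * ("max >= nfds" and "nfds <= max"). Same assumption on nfds. */
static int grow_inclusive(int nfds, int max)
{
    if (max >= nfds) {
        while (nfds <= max)
            nfds <<= 1;
    }
    return nfds;
}

/* Returns 1 if index max is a valid slot in a table of nfds entries. */
static int slot_is_valid(int nfds, int max)
{
    return max >= 0 && max < nfds;
}

/* Boundary case: a table of 32 entries asked to hold index 32. */
void growth_demo(void)
{
    /* Strict rule: table stays at 32 entries, so slot 32 is out of range. */
    assert(grow_strict(32, 32) == 32);
    assert(!slot_is_valid(grow_strict(32, 32), 32));

    /* Inclusive rule: table doubles to 64 entries, so slot 32 is usable. */
    assert(grow_inclusive(32, 32) == 64);
    assert(slot_is_valid(grow_inclusive(32, 32), 32));
}
```

With a table of 32 entries and max equal to 32, the strict rule leaves the size at 32, so slot 32 is out of range, while the inclusive rule grows the table to 64. That boundary case would be consistent with what we see on 32-core nodes, though I have not confirmed that it is the exact trigger.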

Is there a reason for this difference between Open MPI and the official libevent?
Do you think my fix is correct?

Thanks
Damien

libevent changes

diff -r 36e71ea5d92f opal/event/devpoll.c
--- a/opal/event/devpoll.c      Wed May 19 11:22:01 2010 +0200
+++ b/opal/event/devpoll.c      Fri Jun 04 11:36:52 2010 +0200
@@ -144,7 +144,7 @@ devpoll_init(struct event_base *base)

        if (getrlimit(RLIMIT_NOFILE, &rl) == 0 &&
            rl.rlim_cur != RLIM_INFINITY)
-               nfiles = rl.rlim_cur - 1;
+               nfiles = rl.rlim_cur;

        /* Initialize the kernel queue */
        if ((dpfd = open("/dev/poll", O_RDWR)) == -1) {
@@ -192,12 +192,12 @@ devpoll_recalc(struct event_base *base, 
 {
        struct devpollop *devpollop = arg;

-       if (max > devpollop->nfds) {
+       if (max >= devpollop->nfds) {
                struct evdevpoll *fds;
                int nfds;

                nfds = devpollop->nfds;
-               while (nfds < max)
+               while (nfds <= max)
                        nfds <<= 1;

                fds = realloc(devpollop->fds, nfds * sizeof(struct evdevpoll));
diff -r 36e71ea5d92f opal/event/epoll.c
--- a/opal/event/epoll.c        Wed May 19 11:22:01 2010 +0200
+++ b/opal/event/epoll.c        Fri Jun 04 11:36:52 2010 +0200
@@ -167,12 +167,12 @@ epoll_recalc(struct event_base *base, vo
 {
        struct epollop *epollop = arg;

-       if (max > epollop->nfds) {
+       if (max >= epollop->nfds) {
                struct evepoll *fds;
                int nfds;

                nfds = epollop->nfds;
-               while (nfds < max)
+               while (nfds <= max)
                        nfds <<= 1;

                fds = realloc(epollop->fds, nfds * sizeof(struct evepoll));
