Hi all
A recent update of libevent seems to have caused a regression on our side.
On our cluster of 32-core nodes, processes launched with pure srun
deadlock in MPI_Init: they loop endlessly in opal_event_loop().
Here is the changeset that seems to be the cause of this regression:
changeset: 17590:58d39172e7b3
branch: v1.5
parent: 17552:6fdc0376d114
user: bosilca
date: Tue Feb 23 22:38:06 2010 +0000
summary: Refresh the libevent to 1.4.13.
It seems that libevent 1.4.13 was modified while being merged into
Open MPI. The regression disappears if I apply the attached patch, which
restores the original libevent code.
Is there a reason for this difference between Open MPI and the official
libevent?
Do you think my fix is correct?
Thanks
Damien
libevent changes
diff -r 36e71ea5d92f opal/event/devpoll.c
--- a/opal/event/devpoll.c Wed May 19 11:22:01 2010 +0200
+++ b/opal/event/devpoll.c Fri Jun 04 11:36:52 2010 +0200
@@ -144,7 +144,7 @@ devpoll_init(struct event_base *base)

 	if (getrlimit(RLIMIT_NOFILE, &rl) == 0 &&
 	    rl.rlim_cur != RLIM_INFINITY)
-		nfiles = rl.rlim_cur - 1;
+		nfiles = rl.rlim_cur;

 	/* Initialize the kernel queue */
 	if ((dpfd = open("/dev/poll", O_RDWR)) == -1) {
@@ -192,12 +192,12 @@ devpoll_recalc(struct event_base *base,
 {
 	struct devpollop *devpollop = arg;

-	if (max > devpollop->nfds) {
+	if (max >= devpollop->nfds) {
 		struct evdevpoll *fds;
 		int nfds;

 		nfds = devpollop->nfds;
-		while (nfds < max)
+		while (nfds <= max)
 			nfds <<= 1;

 		fds = realloc(devpollop->fds, nfds * sizeof(struct evdevpoll));
diff -r 36e71ea5d92f opal/event/epoll.c
--- a/opal/event/epoll.c Wed May 19 11:22:01 2010 +0200
+++ b/opal/event/epoll.c Fri Jun 04 11:36:52 2010 +0200
@@ -167,12 +167,12 @@ epoll_recalc(struct event_base *base, vo
 {
 	struct epollop *epollop = arg;

-	if (max > epollop->nfds) {
+	if (max >= epollop->nfds) {
 		struct evepoll *fds;
 		int nfds;

 		nfds = epollop->nfds;
-		while (nfds < max)
+		while (nfds <= max)
 			nfds <<= 1;

 		fds = realloc(epollop->fds, nfds * sizeof(struct evepoll));