[OMPI devel] Refresh the libevent to 1.4.13.
Hi all,

A recent update of libevent seems to cause a regression on our side. On our cluster of 32-core nodes, processes launched with pure srun deadlock in MPI_Init, looping endlessly in opal_event_loop().

Here is the changeset that seems to be the cause of this regression:

changeset:   17590:58d39172e7b3
branch:      v1.5
parent:      17552:6fdc0376d114
user:        bosilca
date:        Tue Feb 23 22:38:06 2010 +
summary:     Refresh the libevent to 1.4.13.

It seems that libevent 1.4.13 was modified while being merged into Open MPI. The regression disappears if I apply the attached patch, which restores the original libevent code.

Is there a reason for this difference between Open MPI and the official libevent? Do you think my fix is correct?

Thanks,
Damien

libevent changes

diff -r 36e71ea5d92f opal/event/devpoll.c
--- a/opal/event/devpoll.c  Wed May 19 11:22:01 2010 +0200
+++ b/opal/event/devpoll.c  Fri Jun 04 11:36:52 2010 +0200
@@ -144,7 +144,7 @@ devpoll_init(struct event_base *base)

     if (getrlimit(RLIMIT_NOFILE, &rl) == 0 &&
         rl.rlim_cur != RLIM_INFINITY)
-        nfiles = rl.rlim_cur - 1;
+        nfiles = rl.rlim_cur;

     /* Initialize the kernel queue */
     if ((dpfd = open("/dev/poll", O_RDWR)) == -1) {
@@ -192,12 +192,12 @@ devpoll_recalc(struct event_base *base,
 {
     struct devpollop *devpollop = arg;

-    if (max > devpollop->nfds) {
+    if (max >= devpollop->nfds) {
         struct evdevpoll *fds;
         int nfds;

         nfds = devpollop->nfds;
-        while (nfds < max)
+        while (nfds <= max)
             nfds <<= 1;

         fds = realloc(devpollop->fds, nfds * sizeof(struct evdevpoll));
diff -r 36e71ea5d92f opal/event/epoll.c
--- a/opal/event/epoll.c  Wed May 19 11:22:01 2010 +0200
+++ b/opal/event/epoll.c  Fri Jun 04 11:36:52 2010 +0200
@@ -167,12 +167,12 @@ epoll_recalc(struct event_base *base, vo
 {
     struct epollop *epollop = arg;

-    if (max > epollop->nfds) {
+    if (max >= epollop->nfds) {
         struct evepoll *fds;
         int nfds;

         nfds = epollop->nfds;
-        while (nfds < max)
+        while (nfds <= max)
             nfds <<= 1;

         fds = realloc(epollop->fds, nfds * sizeof(struct evepoll));
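
The difference between the two variants of the recalc test in the patch above can be sketched in isolation. This is only an illustration, not the Open MPI or libevent source: the function names are hypothetical, and it assumes that max is later used as an index into the fds array (the patch excerpt does not show the caller). Under that assumption, the "-" variant leaves the array too small by one slot whenever max equals the current capacity. Whether this is what produces the hang on 32-core nodes is not established by the thread.

```c
#include <assert.h>

/* Hypothetical helper mirroring the "+" lines of the patch:
 * grow until the capacity is strictly greater than max, so that
 * fds[max] would be a valid slot. */
static int sketch_grow_upstream(int nfds, int max)
{
    assert(nfds > 0);           /* doubling zero would never terminate */
    if (max >= nfds) {
        while (nfds <= max)
            nfds <<= 1;
    }
    return nfds;
}

/* Hypothetical helper mirroring the "-" lines of the patch:
 * grow only until the capacity reaches max, so a capacity equal
 * to max is accepted and fds[max] would be one past the end. */
static int sketch_grow_modified(int nfds, int max)
{
    assert(nfds > 0);
    if (max > nfds) {
        while (nfds < max)
            nfds <<= 1;
    }
    return nfds;
}
```

With a capacity of 32 and max equal to 32, the first helper returns 64 while the second returns 32 unchanged; for any max below the capacity both leave it alone.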
Re: [OMPI devel] Migrate OpenMPI to the VxWorks
Sure - just configure with

  --enable-mca-no-build=filem-rsh,ess-singleton

That will avoid building either of those.

On Jun 6, 2010, at 9:46 PM, 张晶 wrote:

> I find the calls to fork/exec in orte/mca/ess/singleton and
> orte/mca/filem/rsh. Since rsh is the only component for filem,
> I wonder whether I can also omit orte/mca/filem/rsh?
>
> 2010/6/4 Ralph Castain:
>> Jeff is correct - create an orte/odls/vxworks and do whatever you need for
>> that platform to launch a local child process.
>>
>> I believe you will also find calls to fork/exec in the
>> orte/mca/ess/singleton area. You may want to add a configure.m4 to that
>> component to tell it not to build for vxworks.
>>
>>
>> 2010/6/4 Jeff Squyres
>>>
>>> Maybe gettimeofday() could be replaced with opal_gettimeofday(), which
>>> could do the Right Thing on different platforms...?
>>>
>>> Also, for fork/exec, I think that should be mostly limited to
>>> orte/odls/default, right? If so, perhaps the right thing to do is to clone
>>> that plugin and adapt it for your platform.
>>>
>>>
>>> On Jun 4, 2010, at 1:43 AM, 张晶 wrote:
>>>

Hi Castain,

Your last mail to me was really helpful. I met most of the issues listed and fixed them, either with the off-list solutions or with my own. Also, as the Open MPI code has changed, there are some other issues (mostly missing functions) that were not reported. For example, the POSIX function gettimeofday() is not implemented by the VxWorks library, so I wrote a small library for those functions.

So far I have successfully compiled libopen-rte.a and libopen-pal.a, but now I am stuck on fork and exec, which are not available in VxWorks. It is not possible to implement fork and exec myself, so I have to read through the code that uses fork and substitute rtpspawn(). It is challenging work. I would really like to know how Brian Barrett dealt with fork() and exec().
Thanks,
Jing

2010/3/18 Ralph Castain:
> Hi Jing
>
> Someone else took a look at this off-list a few years ago. It was mostly a
> problem with the build system (some flags are different) and header file
> names. I don't believe the port was ever completed though.
>
> I have appended the results of that conversation - the last message
> contained a list of the issues. You would need to update that to the trunk
> of course as the code has changed considerably since that discussion took
> place. Brian Barrett subsequently created a first-cut at fixing some of
> these, but that appears to have been lost in the years since it was done -
> and wouldn't really be current anyway.
>
> I would be happy to assist as I can.
>
> Ralph
>
> > 1. configure issues with "checking prefix for global symbol labels"
> >
> > 1a. VxWorks assembler (CCAS=asppc) generates a.out by default (vs.
> > conftest.o that we need subsequently)
> >
> > there is this fragment to determine the way to assemble conftest.s:
> >
> > if test "$CC" = "$CCAS" ; then
> >    ompi_assemble="$CCAS $CCASFLAGS -c conftest.s >conftest.out 2>&1"
> > else
> >    ompi_assemble="$CCAS $CCASFLAGS conftest.s >conftest.out 2>&1"
> > fi
> >
> > The subsequent link fails because conftest.o does not exist:
> >
> > ompi_link="$CC $CFLAGS conftest_c.$OBJEXT conftest.$OBJEXT -o conftest > conftest.link 2>&1"
> >
> > To work around the problem, I did not set CCAS. This gives me the first
> > invocation that includes the -c argument to CC=ccppc, generating
> > conftest.o output.
> >
> > 1b. linker fails because LDFLAGS are not passed
> >
> > The same linker command line caused problems because $CFLAGS were passed
> > to the linker
> >
> > ompi_link="$CC $CFLAGS conftest_c.$OBJEXT conftest.$OBJEXT -o conftest > conftest.link 2>&1"
> >
> > In my environment, I set CC/CFLAGS/LDFLAGS as follows:
> >
> > CC=ccppc
> > CFLAGS=-ggdb3 -std=c99 -pedantic -mrtp -msoft-float -mstrict-align
> >   -mregnames -fno-builtin -fexceptions'
> > LDFLAGS=-mrtp -msoft-float -Wl,--start-group -Wl,--end-group
> >   -L/amd/raptor/root/opt/WindRiver/vxworks-6.3/target/usr/lib/ppc/PPC32/sfcommon
> >
> > The linker flags are not passed because the ompi_link
> >
> > [xp-kcain1:build_vxworks] ccppc -ggdb3 -std=c99 -pedantic -mrtp
> >   -msoft-float -mstrict-align -mregnames -fno-builtin -fexceptions -o
> >   hello hello.c
> > /amd/raptor/root/opt/WindRiver/gnu/3.4.4-vxworks-6.3/x86-linux2/bin/../lib/gcc/powerpc-wrs-vxworks/3.4.4/../../../../powerpc-wrs-vxworks/bin/ld:
> > cannot find -lc_internal
> > collect2: ld returned 1 exit status
> >
> > 2. OPAL ato
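
Jeff's suggestion in the thread above, a wrapper that does the right thing per platform instead of calling gettimeofday() directly, could look roughly like the following. This is a sketch under stated assumptions, not the opal_gettimeofday() that was proposed: the names sketch_timeval and sketch_gettimeofday are hypothetical, and it assumes the target offers POSIX clock_gettime() with CLOCK_REALTIME, which would need to be checked against the VxWorks documentation for the release in use.

```c
#define _POSIX_C_SOURCE 200112L  /* ask for clock_gettime() under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for struct timeval, so the sketch does not
 * depend on which header declares it on a given platform. */
struct sketch_timeval {
    long tv_sec;   /* seconds since the epoch */
    long tv_usec;  /* microseconds, 0..999999 */
};

/* Hypothetical wrapper: fills *tv from the realtime clock.
 * Returns 0 on success and -1 on failure, like gettimeofday(). */
static int sketch_gettimeofday(struct sketch_timeval *tv)
{
    struct timespec ts;

    assert(tv != NULL);
    if (clock_gettime(CLOCK_REALTIME, &ts) != 0)
        return -1;
    tv->tv_sec = (long)ts.tv_sec;
    tv->tv_usec = ts.tv_nsec / 1000;  /* nanoseconds to microseconds */
    return 0;
}
```

Callers would then use the wrapper everywhere, so a platform without gettimeofday() only needs a different body in one place.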
[OMPI devel] amd64 atomic.h warnings
I'm getting these warnings from PGI 7.0.7. Do they matter? Is "clobber" important?

  CXX    mpicxx.lo
"../../../opal/include/opal/sys/amd64/atomic.h", line 91: warning: "cc"
          clobber ignored
                         : "memory", "cc");
                                     ^

"../../../opal/include/opal/sys/amd64/atomic.h", line 83: warning: parameter
          "oldval" was set but never used
                                          int32_t oldval, int32_t newval)
                                                  ^

"../../../opal/include/opal/sys/amd64/atomic.h", line 112: warning: "cc"
          clobber ignored
                         : "memory", "cc"
                                     ^

"../../../opal/include/opal/sys/amd64/atomic.h", line 104: warning: parameter
          "oldval" was set but never used
                                           int64_t oldval, int64_t newval)
                                                   ^

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] amd64 atomic.h warnings
On Jun 7, 2010, at 19:47 , Jeff Squyres wrote:

> I'm getting these warnings from PGI 7.0.7. Do they matter? Is "clobber"
> important?

The clobber might be the most important piece of information there: it warns the compiler that the condition code register has been altered. This code is protected by OMPI_GCC_INLINE_ASSEMBLY, so if we're compiling it, it means that somehow we detected that PGI supports the GCC inline assembly. Now, if they only half-support it, there is not much we can do.

Can you send the assembly instructions generated by the PGI compiler for the following code:

    int32_t oldval;

    do {
        oldval = *addr;
    } while (0 == opal_atomic_cmpset_32(addr, oldval, oldval + delta));
    return (oldval + delta);

Thanks,
george.

>
>   CXX    mpicxx.lo
> "../../../opal/include/opal/sys/amd64/atomic.h", line 91: warning: "cc"
>           clobber ignored
>                          : "memory", "cc");
>                                      ^
>
> "../../../opal/include/opal/sys/amd64/atomic.h", line 83: warning: parameter
>           "oldval" was set but never used
>                                           int32_t oldval, int32_t newval)
>                                                   ^
>
> "../../../opal/include/opal/sys/amd64/atomic.h", line 112: warning: "cc"
>           clobber ignored
>                          : "memory", "cc"
>                                      ^
>
> "../../../opal/include/opal/sys/amd64/atomic.h", line 104: warning: parameter
>           "oldval" was set but never used
>                                            int64_t oldval, int64_t newval)
>                                                    ^
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
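
For readers who have not seen the construct being discussed, here is a minimal sketch of a compare-and-set written with GCC-style inline assembly, including the "memory" and "cc" clobbers that the PGI warnings mention, followed by the retry loop george quotes. It is not the contents of opal/sys/amd64/atomic.h: the sketch_ names are hypothetical, the constraints are one plausible choice, and on compilers or targets without x86 GCC inline assembly it falls back to the __sync_bool_compare_and_swap builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compare-and-set: if *addr == oldval, store newval and
 * return 1; otherwise leave *addr alone and return 0. */
static inline int sketch_cmpset_32(volatile int32_t *addr,
                                   int32_t oldval, int32_t newval)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned char ret;

    /* cmpxchg compares EAX (oldval) with *addr and sets ZF on a match;
     * sete copies ZF into ret. The instruction rewrites the flags and
     * memory, which is what the "cc" and "memory" clobbers declare. */
    __asm__ __volatile__("lock; cmpxchgl %3, %2\n\t"
                         "sete %0"
                         : "=q"(ret), "+a"(oldval), "+m"(*addr)
                         : "r"(newval)
                         : "memory", "cc");
    return (int)ret;
#else
    /* Assumed fallback for other targets: a compiler builtin. */
    return __sync_bool_compare_and_swap(addr, oldval, newval);
#endif
}

/* Atomic add built from the retry loop quoted in the message:
 * reread and retry until the compare-and-set succeeds. */
static int32_t sketch_add_32(volatile int32_t *addr, int32_t delta)
{
    int32_t oldval;

    assert(addr != NULL);
    do {
        oldval = *addr;
    } while (0 == sketch_cmpset_32(addr, oldval, oldval + delta));
    return oldval + delta;
}
```

In this sketch oldval is bound with "+a", so the assembly both reads and writes it. That may be related to the "set but never used" warning, although the thread does not confirm it.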
Re: [OMPI devel] v1.5 .so version numbers
Totally insane ... but I was not talking about your rationale. ^^

How did we end up with the following situation:

-libmca_common_sm_so_version=1:0:0
-libmca_common_mx_so_version=0:0:0
+libmca_common_sm_so_version=2:0:0
+libmca_common_mx_so_version=1:0:0

where the same type of component (common sm and mx here) has different version numbers?

Thanks,
george.

On Jun 5, 2010, at 06:08 , Ralf Wildenhues wrote:

> Hi Jeff,
>
> * Jeff Squyres wrote on Thu, Jun 03, 2010 at 09:34:16PM CEST:
>> SHORT VERSION: We broke ABI from the 1.4 series to the v1.5 series. I
>> propose changing all the libtool .so version numbers as shown below to
>> enforce that break. Can someone sanity check this?
>
> Looks sane to me, with the details you have given.
>
> Cheers,
> Ralf
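
As background for the numbers in this thread: the values are libtool current:revision:age triples, and on Linux libtool maps them to a file name suffix of .so.(current - age).(age).(revision). The sketch below only illustrates that mapping. The helper names are hypothetical and other platforms use different naming schemes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical holder for the three numbers in a Linux .so suffix. */
struct sketch_so_name {
    unsigned major;  /* current - age */
    unsigned minor;  /* age */
    unsigned patch;  /* revision */
};

/* Map a libtool current:revision:age triple to the Linux suffix parts.
 * Returns 0 on success, -1 if age exceeds current (invalid for libtool). */
static int sketch_so_name_from_cra(unsigned current, unsigned revision,
                                   unsigned age, struct sketch_so_name *out)
{
    assert(out != NULL);
    if (age > current)
        return -1;
    out->major = current - age;
    out->minor = age;
    out->patch = revision;
    return 0;
}

/* Write the suffix, for example ".so.2.0.0", into buf.
 * Returns what snprintf returns: the suffix length on success. */
static int sketch_so_format(const struct sketch_so_name *n,
                            char *buf, size_t len)
{
    return snprintf(buf, len, ".so.%u.%u.%u", n->major, n->minor, n->patch);
}
```

Under this mapping, 2:0:0 gives a suffix of .so.2.0.0 and 1:0:0 gives .so.1.0.0, so bumping current with age 0 changes the major number that the runtime linker matches on.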