Re: [OMPI devel] singleton broken on master

2016-07-21 Thread Ralph Castain
Fix included in PR https://github.com/open-mpi/ompi/pull/1897 > On Jul 21, 2016, at 5:34 AM, Gilles Gouaillardet > wrote: > > Ralph, > > I noted singletons are broken on master. > git bisect points to the commit in which PMIx_tool were introduced. > if you revert to this commit, orted forked b

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Artem Polyakov
Thank you! 2016-07-21 22:05 GMT+06:00 Ralph Castain : > I’ve got this fixed in PR https://github.com/open-mpi/ompi/pull/1897 > > > > On Jul 21, 2016, at 8:31 AM, Jeff Squyres (jsquyres) > wrote: > > > > FWIW, we have the Travis issue solved on master (see > https://github.com/open-mpi/ompi/commi

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Ralph Castain
I’ve got this fixed in PR https://github.com/open-mpi/ompi/pull/1897 > On Jul 21, 2016, at 8:31 AM, Jeff Squyres (jsquyres) > wrote: > > FWIW, we have the Travis issue solved on master (see > https://github.com/open-mpi/ompi/commit/af23dcc1239188e06c1b71f0735a83edc45178f2 > if you care). I

Re: [OMPI devel] LANL jenkins update

2016-07-21 Thread Jeff Squyres (jsquyres)
Done. > On Jul 21, 2016, at 11:42 AM, Josh Hursey wrote: > > Awesome. Can you add that to this wiki page for future reference: > > https://github.com/open-mpi/ompi/wiki/PRJenkins#how-to-re-trigger-jenkins-testing > > On Thu, Jul 21, 2016 at 10:39 AM, Pritchard Jr., Howard > wrote: > Hi Fol

Re: [OMPI devel] LANL jenkins update

2016-07-21 Thread Josh Hursey
Awesome. Can you add that to this wiki page for future reference: https://github.com/open-mpi/ompi/wiki/PRJenkins#how-to-re-trigger-jenkins-testing On Thu, Jul 21, 2016 at 10:39 AM, Pritchard Jr., Howard wrote: > Hi Folks, > > The LANL/(soon to not be iu) jenkins should now work with > > bot:la

[OMPI devel] LANL jenkins update

2016-07-21 Thread Pritchard Jr., Howard
Hi Folks, The LANL/(soon to not be iu) jenkins should now work with bot:lanl:retest Also, NERSC Cori system went down this morning for maintenance during CI check of PR 1896 on master. I didn't see any others impacted by the cori maintenance. Howard -- Howard Pritchard HPC-DES Los Alamos Nat

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Jeff Squyres (jsquyres)
FWIW, we have the Travis issue solved on master (see https://github.com/open-mpi/ompi/commit/af23dcc1239188e06c1b71f0735a83edc45178f2 if you care). I just filed a v2.x PR to get the fix over there, too. However, it looks like Travis doesn't merge to current HEAD when it's building, so ex

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Ralph Castain
Yeah - Travis was dead for the issues cited elsewhere, and Mellanox failed for other reasons (thread-related, distclean, or some such as I recall). I’m checking the builds now - suspect it has to do with the new PMIx_Get retrieval rules > On Jul 21, 2016, at 8:25 AM, Artem Polyakov wrote: >

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Artem Polyakov
Correction: 3 out of 5 checks passed. 2016-07-21 21:24 GMT+06:00 Artem Polyakov : > Yes, I thought so as well. I see that only 2 checks had passed when your PR > was merged, so it might be. > > 2016-07-21 21:23 GMT+06:00 Ralph Castain : > >> I’m checking this - could be something to do with the rece

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Artem Polyakov
Yes, I thought so as well. I see that only 2 checks had passed when your PR was merged, so it might be. 2016-07-21 21:23 GMT+06:00 Ralph Castain : > I’m checking this - could be something to do with the recent PMIx update > > On Jul 21, 2016, at 8:21 AM, Artem Polyakov wrote: > > I see the same err

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Ralph Castain
I’m checking this - could be something to do with the recent PMIx update > On Jul 21, 2016, at 8:21 AM, Artem Polyakov wrote: > > I see the same error with `sm,self` and `vader,self` in the PR > https://github.com/open-mpi/ompi/pull/1883 > . > > `op

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Artem Polyakov
I see the same error with `sm,self` and `vader,self` in the PR https://github.com/open-mpi/ompi/pull/1883. `openib` and `tcp` work fine. Seems like a regression. 2016-07-21 20:11 GMT+06:00 Jeff Squyres (jsquyres) : > On Jul 21, 2016, at 3:53 AM, Gilles Gouaillardet > wrote: > > > > Folks, > > >

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Artem Polyakov
We run autogen.pl when doing make_tarball; this breaks Jenkins. 2016-07-21 20:26 GMT+06:00 Gilles Gouaillardet < gilles.gouaillar...@gmail.com>: > I explicitly removed this directory in autogen.sh of > https://github.com/open-mpi/ompi/pull/1891 > > if only this pr is causing this error, then plea

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Artem Polyakov
Ha! funny that we chose exactly your PR to test jenkins :) 2016-07-21 20:26 GMT+06:00 Gilles Gouaillardet < gilles.gouaillar...@gmail.com>: > I explicitly removed this directory in autogen.sh of > https://github.com/open-mpi/ompi/pull/1891 > > if only this pr is causing this error, then please d

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Gilles Gouaillardet
I reproduced the error, then ran git bisect (in which I remove both install and build dir). git bisect pointed to the most recent commit. I rebuilt it (after removing both install and build dir) and the error was gone, so I concluded the cause was a dirty install/build dir. I will double check that to

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Gilles Gouaillardet
I explicitly removed this directory in autogen.sh of https://github.com/open-mpi/ompi/pull/1891 if only this pr is causing this error, then please disregard it until I update it tomorrow. note this log suggests a workspace shared by all pr, so I guess this is obsolete now Cheers, Gilles On Th

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Artem Polyakov
We see the following error: *14:26:55* + taskset -c 2,3 timeout -s SIGSEGV 15m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 8 -bind-to none -mca pml ob1 -mca btl self,tcp taskset -c 2,3 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Jeff Squyres (jsquyres)
On Jul 21, 2016, at 3:53 AM, Gilles Gouaillardet wrote: > > Folks, > > Mellanox Jenkins marks recent PR's as failed for very surprising reasons. > > mpirun --mca btl sm,self ... > > failed because processes could not contact each other. i was able to > reproduce this once on my workstation, >

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Jeff Squyres (jsquyres)
Sweet; thanks Artem! > On Jul 21, 2016, at 9:24 AM, Artem Polyakov wrote: > > This is fixed now. Jenkins update dropped this setting. > We are dealing with some other issue now. Will update later. > > On Thursday, July 21, 2016, Jeff Squyres (jsquyres) wrote: > Gilles: Oh, sweet! This

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Artem Polyakov
This is fixed now. Jenkins update dropped this setting. We are dealing with some other issue now. Will update later. On Thursday, July 21, 2016, Jeff Squyres (jsquyres) wrote: > Gilles: Oh, sweet! This could answer a long-standing question: why PR's > sometimes fail with unexplained Libt

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Jeff Squyres (jsquyres)
Gilles: Oh, sweet! This could answer a long-standing question: why PR's sometimes fail with unexplained Libtool / depcomp problems. Artem: I'm mailing list several hours after your initial exchange with Gilles, so you may have solved this by now, but since your Jenkins was running multiple Ope

[OMPI devel] singleton broken on master

2016-07-21 Thread Gilles Gouaillardet
Ralph, I noted singletons are broken on master. git bisect points to the commit in which PMIx_tool were introduced. If you revert to this commit, orted forked by the singleton crashes. IIRC, latest master does not work, but orted does not crash either. Sorry for the lack of details, I am afk unti

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Artem Polyakov
Thank you for the input by the way. It sounds very useful! 2016-07-21 13:54 GMT+06:00 Artem Polyakov : > Gilles, we are aware and working on this. > > 2016-07-21 13:53 GMT+06:00 Gilles Gouaillardet : > >> Folks, >> >> >> Mellanox Jenkins marks recent PR's as failed for very surprising reasons. >>

Re: [OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Artem Polyakov
Gilles, we are aware and working on this. 2016-07-21 13:53 GMT+06:00 Gilles Gouaillardet : > Folks, > > > Mellanox Jenkins marks recent PR's as failed for very surprising reasons. > > > mpirun --mca btl sm,self ... > > > failed because processes could not contact each other. i was able to > repro

[OMPI devel] about Mellanox Jenkins

2016-07-21 Thread Gilles Gouaillardet
Folks, Mellanox Jenkins marks recent PR's as failed for very surprising reasons. mpirun --mca btl sm,self ... failed because processes could not contact each other. i was able to reproduce this once on my workstation, and found the root cause was a dirty build and/or install dir. i adde