Hello
I found that SLURM installations that use the cgroup plugin and
have TaskAffinity=yes in cgroup.conf have problems with OpenMPI: all
processes on the non-launch nodes are assigned to one core. This leads to
quite poor performance.
The problem can be seen only when using mpirun to start parallel applications.
1) node1, node2 with 12 CPUs,
2) node3 with 7 CPUs,
then it uses separate srun's for each group.
The weakness of this patch is that we need to deal with several srun's, and
I am not sure that cleanup will be performed correctly. I plan to test this
case additionally.
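For reference, a minimal way to observe the bad pinning (a diagnostic sketch, run inside an allocation spanning both nodes; OMPI_COMM_WORLD_RANK is the environment variable mpirun sets for each rank):

    # TaskAffinity=yes in cgroup.conf is the setting that triggers the issue.
    # Print the CPU list each rank actually received; on an affected setup,
    # all ranks on the non-launch node report the same single core.
    mpirun -np 4 sh -c \
        'echo "$(hostname) rank $OMPI_COMM_WORLD_RANK: $(grep Cpus_allowed_list /proc/self/status)"'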
2014-02-12 17:42 GMT+07:00 Artem
Good idea :)!
On Wednesday, May 7, 2014, Ralph Castain wrote:
> Jeff actually had a useful suggestion (gasp!). He proposed that we separate
> the PMI-1 and PMI-2 codes into separate components so you could select them
> at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs
. There are several places in
> OMPI where the distinction between PMI1 and PMI2 is made, not only in
> grpcomm. DB and ESS frameworks off the top of my head.
>
> Josh
>
>
> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov wrote:
>
>> Good idea :)!
>>
>> On Wednesday, May 7, 2014, Ralph Castain wrote:
In other places we'll just use the flag saying what PMI version to use.
Does that sound reasonable?
2014-05-07 23:10 GMT+07:00 Artem Polyakov :
> That's a good point. There is actually a bunch of modules in ompi, opal
> and orte that have to be duplicated.
>
> On Wednesday, May 7, 2014, Joshua Ladd wrote:
>
are legal. If not - we'll do that sequentially.
> In other places we'll just use the flag saying what PMI version to use.
> Does that sound reasonable?
>
> 2014-05-07 23:10 GMT+07:00 Artem Polyakov :
>
>> That's a good point. There is actually a bunch of module
to implement.
2. Or to have two separate common modules, one for PMI1 and one for PMI2; and
does this fit the opal/mca/common/ ideology at all? (See the layout sketch
below.)
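To make the two options concrete, here is how they would look on disk (a sketch only; the directory names are assumed by analogy with existing components such as opal/mca/common/sm):

    # Option 1: one common module that handles both PMI versions
    #     opal/mca/common/pmi/
    # Option 2: one common module per PMI version
    #     opal/mca/common/pmi1/
    #     opal/mca/common/pmi2/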
2014-05-08 6:44 GMT+07:00 Artem Polyakov :
>
> 2014-05-08 5:54 GMT+07:00 Ralph Castain :
>
> Ummm, no, I don't think that's right. I
ming the codes are mostly common
> in the individual frameworks.
>
>
> On May 7, 2014, at 4:51 PM, Artem Polyakov wrote:
>
> Just reread your suggestions in our out-of-list discussion and found that
> I misunderstood them. So no parallel PMI! Take all possible code into
> opal/mca/common
> selected at runtime
>
> * moving some additional functions into that code area and out of the
> individual components
>
Ok, that is pretty clear now. I will do exactly #2.
Thank you.
>
>
> On May 7, 2014, at 5:08 PM, Artem Polyakov wrote:
>
> I like #2 too.
> B
Hi Chris.
The current design is to provide a runtime parameter for PMI version
selection. It would be even more flexible than configuration-time selection
and (with my current understanding) not very hard to achieve.
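As an illustration of runtime selection on the launcher side, SLURM already exposes this choice through real srun options (the OMPI-side parameter name was still being decided at this point):

    # List the PMI plugins this SLURM build supports:
    srun --mpi=list
    # Launch the same binary against PMI-2 instead of PMI-1, no rebuild needed:
    srun --mpi=pmi2 -n 16 ./app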
2014-05-08 8:15 GMT+07:00 Christopher Samuel :
That is interesting. I think I will reconstruct your experiments on my
system when I am testing the PMI selection logic. Judging by your resource
count numbers, I can do that. I will publish my results in the list.
2014-05-08 8:51 GMT+07:00 Christopher Samuel :
2014-05-08 9:54 GMT+07:00 Ralph Castain :
>
> On May 7, 2014, at 6:15 PM, Christopher Samuel
> wrote:
>
> >
> > Hi all,
> >
> > Apologies for having dropped out of the thread, night intervened here.
> ;-)
> >
> > On 08/05/14 00:45, Ralph Casta
Hi, all.
Ralph committed the code that was developed for this RFC (r31908). This
commit will break PMI1 support. If you are in a hurry, apply the attached
patch. Ralph will apply it once he is online. I have no commit rights for
that yet.
2014-05-19 21:18 GMT+07:00 Ralph Castain :
> WHAT: Refactor the PM
Thank you, Mike!
2014-06-01 13:43 GMT+07:00 Mike Dubman :
> applied here: https://svn.open-mpi.org/trac/ompi/changeset/31909
>
>
> On Sun, Jun 1, 2014 at 9:15 AM, Artem Polyakov wrote:
>
>> Hi, all.
>>
>> Ralph commited the code that was developed for this
Hello, while testing the new PMI implementation I faced a problem with OpenIB
and/or usNIC support.
The cluster I use is built on Mellanox QDR. We don't use Cisco hardware,
thus no Cisco Virtual Interface Card. To exclude the possibility of influence
from the new PMI code, I used mpirun to launch the job. Slurm job
P.S.
1. Just to make sure, I tried the same program with the old ompi-1.6.5 that
is installed on our cluster, without any problem.
2. My testing program just sends data around the ring.
2014-06-01 13:57 GMT+07:00 Artem Polyakov :
> Hello, while testing the new PMI implementation I faced a problem w
> Gilles
>
>
>
> On Sun, Jun 1, 2014 at 3:57 PM, Artem Polyakov wrote:
>
>>
>> 2. With fixed OpenIB support (add export OMPI_MCA_btl="openib,self" in
>> attached batch script) I get the following error:
>> hellompi:
>> /home/research/artpol
he
> openib BTL)
>
>
> On Jun 1, 2014, at 2:57 AM, Artem Polyakov wrote:
>
> > Hello, while testing the new PMI implementation I faced a problem with
> OpenIB and/or usNIC support.
> > The cluster I use is built on Mellanox QDR. We don't use Cisco hardware,
>
2014-06-01 14:24 GMT+07:00 Gilles Gouaillardet <
gilles.gouaillar...@gmail.com>:
> export OMPI_MCA_btl_openib_use_eager_rdma=0
Gilles,
I tested your approach, both ways:
a) export OMPI_MCA_btl_openib_use_eager_rdma=0
b) applying your patch and running without "export
OMPI_MCA_btl_openib_use_eager_rdma=0"
I checked this with SLURM 2.6.5.
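For anyone else hitting this, the two ways of applying the workaround are equivalent (standard MCA parameter mechanics; the binary name is just an example):

    # a) as an environment variable, e.g. inside the batch script:
    export OMPI_MCA_btl_openib_use_eager_rdma=0
    mpirun -np 16 ./hellompi

    # b) as a command-line MCA parameter:
    mpirun --mca btl_openib_use_eager_rdma 0 -np 16 ./hellompi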
2014-06-01 20:31 GMT+07:00 Ralph Castain :
> That really wasn't necessary - I had tested it under PMI-1 and it was
> fine. Artem: did you test it, or just assume it wasn't right?
>
>
> On May 31, 2014, at 11:47 PM, Artem Polyakov w
Here is a quick fix for the OMPI timing facility. Currently the first
measurement is bogus because OMPI_PROC_MY_NAME is not initialized at the
time of the first timing setup during OMPI start:
*time from start to completion of rte_init 1348381643658244 usec*
time from completion of rte_init to modex 17585 usec
time to execute
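The size of the bogus figure is itself the clue: it is on the order of the absolute Unix time in microseconds, which is what you get when the interval is computed against a start timestamp that was never recorded. A generic shell illustration of the failure mode (not OMPI code):

    start_usec=0                        # the start time that was never taken
    now_usec=$(($(date +%s%N) / 1000))  # current time in usec (GNU date)
    echo "time from start: $((now_usec - start_usec)) usec"   # ~1.4e15, as above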
Hello,
I would like to participate in PMI and modex discussions remotely.
2014-06-19 22:44 GMT+07:00 Jeff Squyres (jsquyres) :
> We have a bunch of topics listed on the wiki, but no real set agenda:
>
> https://svn.open-mpi.org/trac/ompi/wiki/Jun14Meeting
>
> We had remote-attendance reques
ime.
>
>
> On Jun 19, 2014, at 9:26 PM, Artem Polyakov wrote:
>
> > Hello,
> >
> > I would like to participate in PMI and modex discussions remotely.
> >
> >
> > 2014-06-19 22:44 GMT+07:00 Jeff Squyres (jsquyres) :
> > We ha
> pshmem_put_f.c:36:5: note: in expansion of macro 'MCA_SPML_CALL'
>      MCA_SPML_CALL(put(FPTR_2_VOID_PTR(target),
>      ^
>
>
>
>
--
Best regards, Artem Polyakov
(Mobile mail)
Ok, thank you. We will take a look
On Monday, July 18, 2016, Ralph Castain wrote:
> Sorry - this is on today’s master
>
> On Jul 17, 2016, at 8:31 PM, Artem Polyakov wrote:
>
> What is it? What repository?
>
> On Monday, July 18, 2016,
We have the fix. Will PR shortly.
On Monday, July 18, 2016, Ralph Castain wrote:
> Sorry - this is on today’s master
>
> On Jul 17, 2016, at 8:31 PM, Artem Polyakov wrote:
>
> What is it? What repository?
>
> On Monday, July 18, 2016,
Gilles, we are aware and working on this.
2016-07-21 13:53 GMT+06:00 Gilles Gouaillardet :
> Folks,
>
>
> Mellanox Jenkins marks recent PR's as failed for very surprising reasons.
>
>
> mpirun --mca btl sm,self ...
>
>
> failed because processes could not contact each other. I was able to
> repro
Thank you for the input by the way. It sounds very useful!
2016-07-21 13:54 GMT+06:00 Artem Polyakov :
> Gilles, we are aware and working on this.
>
> 2016-07-21 13:53 GMT+06:00 Gilles Gouaillardet :
>
>> Folks,
>>
>>
>> Mellanox Jenkins marks recent PR
> different build dir, different install dir.
>
>
>
>
> > On Jul 21, 2016, at 3:56 AM, Artem Polyakov wrote:
> >
> > Thank you for the input by the way. It sounds very useful!
> >
> > 2016-07-21 13:54 GMT+06:00 Artem Polyakov :
> > Gilles,
We see the following error:
14:26:55 + taskset -c 2,3 timeout -s SIGSEGV 15m
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun
-np 8 -bind-to none -mca pml ob1 -mca btl self,tcp taskset -c 2,3
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello
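The same pattern can be reproduced locally (paths trimmed; OMPI_COMM_WORLD_RANK is set by mpirun): the outer taskset restricts everything to cores 2,3, and with "-bind-to none" all 8 ranks inherit that two-core mask, which this makes visible:

    taskset -c 2,3 mpirun -np 8 -bind-to none -mca pml ob1 -mca btl self,tcp \
        sh -c 'echo "rank $OMPI_COMM_WORLD_RANK: $(grep Cpus_allowed_list /proc/self/status)"'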
error, then please disregard it until I
> update it tomorrow.
>
> note this log suggests a workspace shared by all PRs, so I guess this is
> obsolete now
>
> Cheers,
>
> Gilles
>
>
>
> On Thursday, July 21, 2016, Artem Polyakov wrote:
>
>> We see th
ausing this error, then please disregard it until I
> update it tomorrow.
>
> note this log suggests a workspace shared by all PRs, so I guess this is
> obsolete now
>
> Cheers,
>
> Gilles
>
>
>
> On Thursday, July 21, 2016, Artem Polyakov wrote:
>
>> We s
I see the same error with `sm,self` and `vader,self` in the PR
https://github.com/open-mpi/ompi/pull/1883.
`openib` and `tcp` work fine. Seems like a regression.
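For the record, this is how I separated the transports (standard BTL selection; the binary name is just an example):

    # Shared-memory transports - both fail:
    mpirun -np 2 --mca btl self,sm ./hello
    mpirun -np 2 --mca btl self,vader ./hello
    # Network transports - both work:
    mpirun -np 2 --mca btl self,tcp ./hello
    mpirun -np 2 --mca btl self,openib ./hello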
2016-07-21 20:11 GMT+06:00 Jeff Squyres (jsquyres) :
> On Jul 21, 2016, at 3:53 AM, Gilles Gouaillardet
> wrote:
> >
> > Folks,
> >
>
Yes, I thought so as well. I see that only 2 checks were passed when your PR
was merged, so it might be.
2016-07-21 21:23 GMT+06:00 Ralph Castain :
> I’m checking this - could be something to do with the recent PMIx update
>
> On Jul 21, 2016, at 8:21 AM, Artem Polyakov wrote:
>
>
Correction: 3 out of 5 checks passed.
2016-07-21 21:24 GMT+06:00 Artem Polyakov :
> Yes, I thought so as well. I see that only 2 checks were passed when your
> PR was merged, so it might be.
>
> 2016-07-21 21:23 GMT+06:00 Ralph Castain :
>
>> I’m checking this - could be som
some such as I
> recall). I’m checking the builds now - suspect it has to do with the new
> PMIx_Get retrieval rules
> >>
> >>
> >>> On Jul 21, 2016, at 8:25 AM, Artem Polyakov
> wrote:
> >>>
> >>> correction: 3 out of 5 chec
up probably next week. I have to access
> UTK machine for that.
> * I did some tests and yes, I have seen some openib hangs in the
> multithreaded case.
> Thank you,
> Arm
>
> From: devel < devel-boun...@lists.open-mpi.org > on behalf of Artem
> Polyakov < art
l do the 2.0.1rc in the next days as well.
>
> Is it possible to add me to the results repository at github or should I
> fork and request you to pull?
>
> Best
> Christoph
>
>
> - Original Message -
> From: "Artem Polyakov" >
> To: "Open M
George Bosilca wrote:
> Arm repo is a good location until we converge to a well-defined set of
> tests.
>
> George.
>
>
> On Thu, Aug 25, 2016 at 1:44 PM, Artem Polyakov wrote:
>
>> That's a good question. I have results myself and I don't know where to
>
developers meeting a few
> weeks ago, but we barely defined what we think will be necessary for trivial
> tests such as single-threaded bandwidth. It might be worth having a regular
> phone call (in addition to the Tuesday morning one) to make progress.
>
> George.
>
>
> On Thu, Aug 25,
ts such as single threaded bandwidth. It might be worth having a regular
>>> phone call (in addition to the Tuesday morning) to make progress.
>>>
>>> George.
>>>
>>>
>>> On Thu, Aug 25, 2016 at 9:37 PM, Artem Polyakov
>>> wrote:
>
> Let me know if you want one.
>
>
> > On Aug 26, 2016, at 8:46 AM, Artem Polyakov wrote:
> >
> > I've marked the first week.
> >
> > 2016-08-26 19:26 GMT+07:00 George Bosilca :
> > Let's go regular for a period and then adapt.
> >
> > Fo
Sufficient. Probably I missed it. No need to do anything.
2016-08-26 21:31 GMT+07:00 Jeff Squyres (jsquyres) :
> Just curious: is https://github.com/open-mpi/2016-summer-perf-testing not
> sufficient?
>
>
>
> > On Aug 26, 2016, at 10:28 AM, Artem Polyakov wrote:
Howard,
can you link to the commits you are referring to?
Do you mean this one, for example:
https://github.com/open-mpi/ompi/commit/15098161a331168c66b29a696522fe52c8b2d8f5
?
2016-12-01 15:28 GMT-08:00 Howard Pritchard :
> Hi Gilles
>
> I didn't see a merge commit for all these commits,
> hence my conc
All systems are different, and it is hard to compete in coverage with our
set of Jenkins instances :).
2016-12-01 14:51 GMT-08:00 r...@open-mpi.org :
> FWIW: I verified it myself, and it was fine on my systems
>
> On Dec 1, 2016, at 2:46 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>
But I guess that we can verify that things are not broken using other PRs.
Looks like all is good: https://github.com/open-mpi/ompi/pull/2493
2016-12-01 15:38 GMT-08:00 Artem Polyakov :
> All systems are different, and it is hard to compete in coverage with our
> set of Jenkins instances
en-mpi/ompi/pull/2488
>>
>> So please don’t jump to conclusions
>>
>> On Dec 1, 2016, at 3:49 PM, Artem Polyakov wrote:
>>
>> But I guess that we can verify that things are not broken using other
>> PRs.
>> Looks like all is good: https://gith
+1 to Paul.
I have had to git-bisect OMPI only several times, but it was always a
non-trivial task. PRs group commits logically and are good for bookkeeping
(see the bisect sketch below).
Also, you never know what a "trivial fix" will turn into, and in what
circumstances/configurations.
IMO all changes need to go through PRs.
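The bisect workflow that merge-based history keeps painless looks like this (plain git; the known-good tag and the test script are placeholders):

    git bisect start
    git bisect bad HEAD              # current master shows the failure
    git bisect good v2.0.0           # placeholder: last known-good point
    # any script that builds, runs a reproducer, and exits non-zero on failure:
    git bisect run ./build-and-test.sh
    git bisect reset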
With regard to timezones: we have developers in close timezones, so I don't
think this is a reasonable argument.
2016-12-01 16:49 GMT-08:00 Artem Polyakov :
> +1 to Paul.
>
> I have had to git-bisect OMPI only several times, but it was always a
> non-trivial task. PRs
Brian, I'm going to push for the fix tonight. If it doesn't work, we will do
as you advised.
2017-06-21 17:23 GMT-07:00 Barrett, Brian via devel <
devel@lists.open-mpi.org>:
> In the meantime, is it possible to disable the jobs that listen for pull
> requests on Open MPI’s repos? I’m trying to get
t couple of days from Mellanox?
>
> Thanks,
>
> Brian
Brian,
Have you had a chance to put this on the wiki? If so, can you send the
link? I can't find it.
2017-07-19 16:47 GMT-07:00 Barrett, Brian via devel <
devel@lists.open-mpi.org>:
> I’ll update the wiki (and figure out where on our wiki to put more general
> information), but the basics are:
Jeff and others,
1. The benchmark was updated to support the shared memory case.
2. The wiki was updated with the benchmark description:
https://github.com/open-mpi/ompi/wiki/Request-refactoring-test#benchmark-prototype
Let me know if we want to put this prototype in some general place. I think
it ma
tyle
> commenting/referencing.
>
>
> Arm
>
>
>
>
> On 7/28/16, 3:02 PM, "devel on behalf of Jeff Squyres (jsquyres)" <
> devel-boun...@lists.open-mpi.org on behalf of jsquy...@cisco.com> wrote:
>
> On Jul 28, 2016, at 6:28 AM, Artem Polyakov wr
P.S. For future reference, we also need to keep the launch scripts that were
used, to be able to reproduce carefully. Jeff mentioned that on the wiki
page, IIRC.
2016-07-29 12:42 GMT+07:00 Artem Polyakov :
> Thank you, Arm!
>
> Good to have vader results (I haven't tried it my
Hello,
I would like to introduce the OMPI timing framework that was included into
the trunk yesterday (r32738). The code is new, so if you hit some bugs, just
let me know.
The framework consists of a set of macros and routines for internal OMPI
usage, plus a standalone tool, mpisync, and a few additional
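A usage sketch, to my current understanding (--enable-timing is the configure switch the timing code sits behind; the MCA parameter name below is an assumption, so verify it with ompi_info on your build):

    # Build with the timing code compiled in:
    ./configure --enable-timing && make && make install
    # Run with timing enabled (parameter name assumed; check with
    # 'ompi_info --param all all | grep -i timing'):
    mpirun -np 16 --mca ompi_timing 1 ./app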
?
>
> - opal_config.h should be the first include in opal/util/timings.c.
>
> - If timing support is not to be compiled in, then opal/util/timings.c
> should not be compiled via the Makefile.am (rather than being entirely
> #if'ed out).
>
> It looks like this work is about 9
think it isn't that hard for them to
> configure it.
>
>
> On Sep 18, 2014, at 7:16 AM, Artem Polyakov wrote:
>
> Jeff, thank you for the feedback! All of the mentioned issues are clear,
> and I will fix them shortly.
>
> One important thing that needs additional
to use it.
>
> - There are "TODO" comments in opal/util/timings.c; should those be fixed?
>
> - opal_config.h should be the first include in opal/util/timings.c.
>
> - If timing support is not to be compiled in, then opal/util/timings.c
> should not be compiled via
Hello, I have troubles with the latest trunk if I use PMI1.
For example, if I use 2 nodes, the application hangs. See backtraces from
both nodes below. From them I can see that the second (non-launching) node
hangs in bcol component selection. Here is the default setting of the
bcol_base_string parameter:
bcol
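To inspect the default and to try reduced combinations (standard ompi_info/MCA usage; the component pair below is only an example, the real default is the value quoted above):

    # Show the current value of the parameter on this build
    # (add '--level 9' if nothing is printed):
    ompi_info --param bcol all | grep bcol_base_string
    # Try a smaller combination to narrow down which one hangs:
    mpirun -np 2 --mca bcol_base_string basesmuma,ptpcoll ./app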
th a tentative
> fix.
>
> Could you please give it a try and report if it solves your problem?
>
> Cheers
>
> Gilles
>
>
> Artem Polyakov wrote:
> Hello, I have troubles with the latest trunk if I use PMI1.
>
> For example, if I use 2 nodes the application hangs. See b
ave the same problem.
>
but mine is with the bcol framework, not coll. And as you can see, the
modules themselves don't break the program, only some of their combinations.
Also I am curious why the basesmuma module is listed twice.
> Best regards,
> Elena
>
> On Fri, Oct 17, 2014 at 7:01 PM, Artem
I think this might be related to the configuration problem I was fixing
with Jeff a few months ago. Refer here:
https://github.com/open-mpi/ompi/pull/240
2014-12-02 10:15 GMT+06:00 Ralph Castain :
> If it isn’t too much trouble, it would be good to confirm that it remains
> broken. I strongly suspe
was also
included into the 1.8 branch. I am not sure that this is the same issue, but
they look similar.
>
>
> On Dec 1, 2014, at 9:40 PM, Artem Polyakov wrote:
>
> I think this might be related to the configuration problem I was fixing
> with Jeff a few months ago. Refer here:
> https:/
> Thanks
>
>
> On Dec 2, 2014, at 3:17 AM, Artem Polyakov wrote:
>
>
>
> 2014-12-02 17:13 GMT+06:00 Ralph Castain :
>
>> Hmmm…if that is true, then it didn’t fix this problem as it is being
>> reported in the master.
>>
>
> I had this problem on my la
>> *If* you add that workaround (which is a whole separate discussion), I
>> would suggest adding a configure.m4 test to see if adding the additional
>> -llibs are necessary. Perhaps AC_LINK_IFELSE looking for a symbol, and
>> then if that fails, AC_LINK_IFELSE again with the additio
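What that probe boils down to, expressed as plain shell (a sketch of the logic AC_LINK_IFELSE generates; the symbol and the -l flags are placeholders for whichever ones the workaround needs):

    # First try linking the probe program without the extra libs...
    printf 'extern int lt_dladvise_init();\nint main(void){ return lt_dladvise_init(); }\n' > conftest.c
    if ! cc conftest.c -o conftest 2>/dev/null; then
        # ...and only if that fails, retry with the additional -l flags.
        cc conftest.c -o conftest -lltdl -ldl
    fi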
let me add the config.log file, since it is
> too large, I can forward the output to you directly as well (as I did to
> Jeff).
> >>
> >> I honestly have not looked into the configure logic, I can just tell
> that OPAL_HAVE_LTDL_ADVISE is not set on my linux system for m
2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres) :
> On Dec 2, 2014, at 8:43 PM, Artem Polyakov wrote:
>
> > Jeff, your fix breaks my system again. Actually you just reverted my
> > changes.
>
> No, I didn't just revert them -- I made changes. I did forget about the
Howard, does the current master fix your problems?
On Wednesday, December 3, 2014, Artem Polyakov wrote:
>
> 2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres) :
>
>> On Dec 2, 2014, at 8:43 PM, Artem Polyakov wrote:
>>
>> > Jeff, your fix br
the unified solution.
2014-12-03 10:23 GMT+06:00 Ralph Castain :
> It is working for me, but I’m not sure if that is because of these changes
> or if it always worked for me. I haven’t tested the slurm integration in
> a while.
>
>
> On Dec 2, 2014, at 7:59 PM, Artem Polyakov wro
Sure, will do that asap.
>
>
> On Dec 3, 2014, at 5:56 AM, Artem Polyakov wrote:
>
> > I finally found the clear reason for this strange situation!
> >
> > In ompi, opal_setup_libltdl.m4 has the following content:
> > CPPFLAGS="-I$srcdir -I$srcd
> Thanks!
>
> On Dec 3, 2014, at 7:03 AM, Artem Polyakov wrote:
>
> >
> >
> > On Wednesday, December 3, 2014, Jeff Squyres (jsquyres) wrote:
> > They were equivalent until yesterday. :-)
> > I see. Got that!
> >
> > I was going to file
2014-12-04 17:29 GMT+06:00 Jeff Squyres (jsquyres) :
> On Dec 3, 2014, at 11:35 PM, Artem Polyakov wrote:
>
> > Jeff, I must admit that I don't completely understand how your fix works.
> Can you explain to me why this variant was failing:
> >
> >
2015-03-26 17:58 GMT+06:00 Gianmario Pozzi :
> Hi everyone,
> I'm an Italian M.Sc. student in Computer Engineering at Politecnico di
> Milano.
>
> My team and I are trying to integrate OpenMPI with a real-time resource
> manager written by a group of students, named BBQ (
> http://bosp.dei.polimi.i
P.S. Also check ESS (orte/mca/ess) for environment setup.
2015-03-26 18:06 GMT+06:00 Artem Polyakov :
>
> 2015-03-26 17:58 GMT+06:00 Gianmario Pozzi :
>
>> Hi everyone,
>> I'm an Italian M.Sc. student in Computer Engineering at Politecnico di
>> Milano.
Hello, is there any progress on this topic? This affects our PMIx
measurements.
2015-10-30 21:21 GMT+06:00 Ralph Castain :
> I’ve verified that the orte/util/listener thread is not being started, so
> I don’t think it should be involved in this problem.
>
> HTH
> Ralph
>
> On Oct 30, 2015, at 8:0
then
>>>> send to it, but the OS hasn’t yet set it up. In those cases, you can hang
>>>> the socket. However, I’ve tried adding some artificial delay, and while it
>>>> helped, it didn’t completely solve the problem.
>>>>
>>>> I have an idea
2015-11-09 22:42 GMT+06:00 Artem Polyakov :
> That is a very good point, Nysal!
>
> This is definitely a problem, and I can say even more: on average 3 out of
> every 10 tasks were affected by this bug. Once the PR (
> https://github.com/pmix/master/pull/8) was applied, I was able to ru