Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2
Turning off enable_picky, I get it to compile with the following warnings:

pget_elements_x_f.c:70: warning: no previous prototype for 'ompi_get_elements_x_f'
pstatus_set_elements_x_f.c:70: warning: no previous prototype for 'ompi_status_set_elements_x_f'
ptype_get_extent_x_f.c:69: warning: no previous prototype for 'ompi_type_get_extent_x_f'
ptype_get_true_extent_x_f.c:69: warning: no previous prototype for 'ompi_type_get_true_extent_x_f'
ptype_size_x_f.c:69: warning: no previous prototype for 'ompi_type_size_x_f'

I also found that OpenSHMEM is still building by default. Is that intended? I thought you were only going to build it if --with-shmem (or whatever the option is) was given.

Looks like some cleanup is required.

On Aug 10, 2013, at 8:54 PM, Ralph Castain wrote:

> FWIW, I couldn't get it to build - this is on a simple Xeon-based system under CentOS 6.2:
>
> cc1: warnings being treated as errors
> spml_yoda_getreq.c: In function 'mca_spml_yoda_get_completion':
> spml_yoda_getreq.c:98: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
> spml_yoda_getreq.c:98: error: signed and unsigned type in conditional expression
> cc1: warnings being treated as errors
> spml_yoda_putreq.c: In function 'mca_spml_yoda_put_completion':
> spml_yoda_putreq.c:81: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
> spml_yoda_putreq.c:81: error: signed and unsigned type in conditional expression
> make[2]: *** [spml_yoda_getreq.lo] Error 1
> make[2]: *** Waiting for unfinished jobs
> make[2]: *** [spml_yoda_putreq.lo] Error 1
> cc1: warnings being treated as errors
> spml_yoda.c: In function 'mca_spml_yoda_put_internal':
> spml_yoda.c:725: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
> spml_yoda.c:725: error: signed and unsigned type in conditional expression
> spml_yoda.c: In function 'mca_spml_yoda_get':
> spml_yoda.c:1107: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
> spml_yoda.c:1107: error: signed and unsigned type in conditional expression
> make[2]: *** [spml_yoda.lo] Error 1
> make[1]: *** [all-recursive] Error 1
>
> Only configure arguments:
>
> enable_picky=yes
> enable_debug=yes
>
> gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)
>
> On Aug 10, 2013, at 7:21 PM, "Barrett, Brian W" wrote:
>
>> On 8/6/13 10:30 AM, "Joshua Ladd" wrote:
>>
>>> Dear OMPI Community,
>>>
>>> Please find on Bitbucket the latest round of OSHMEM changes based on community feedback. Please git and test at your leisure.
>>>
>>> https://bitbucket.org/jladd_math/mlnx-oshmem.git
>>
>> Josh -
>>
>> In general, I think everything looks ok. However, the "right" thing doesn't happen if the CM PML is used (at least, when using the Portals 4 MTL). When configured with:
>>
>> ./configure --enable-mca-no-build=pml-ob1,pml-bfo,pml-v,btl,bml,mpool
>>
>> the resulting build segfaults trying to run a SHMEM program:
>>
>> mpirun -np 2 ./bcast
>> [shannon:90397] *** Process received signal ***
>> [shannon:90397] Signal: Segmentation fault (11)
>> [shannon:90397] Signal code: Address not mapped (1)
>> [shannon:90397] Failing at address: (nil)
>> [shannon:90398] *** Process received signal ***
>> [shannon:90398] Signal: Segmentation fault (11)
>> [shannon:90398] Signal code: Address not mapped (1)
>> [shannon:90398] Failing at address: (nil)
>> [shannon:90397] [ 0] /lib64/libpthread.so.0() [0x38b7a0f4a0]
>> [shannon:90397] *** End of error message ***
>> [shannon:90398] [ 0] /lib64/libpthread.so.0() [0x38b7a0f4a0]
>> [shannon:90398] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 90398 on node shannon exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>>
>> Brian
>>
>> --
>> Brian W. Barrett
>> Scalable System Software Group
>> Sandia National Laboratories
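
[Editor's note on the signedness errors above: opal_atomic_add_32() takes a volatile int32_t * (per the compiler note pointing at opal/include/opal/sys/amd64/atomic.h), while the yoda completion counters are apparently declared uint32_t. Below is a minimal standalone sketch of the mismatch and the two usual fixes. The stub atomic, the "toy_request" structure, and all field names are invented for illustration and do not reflect the real spml_yoda code; the stub's prototype is only approximate (argument 1 matches the compiler note, the rest is assumed).]

    #include <stdint.h>

    /* Stand-in for OPAL's opal_atomic_add_32(); the real one lives in
     * opal/include/opal/sys/<arch>/atomic.h.  Argument 1 matches the
     * compiler note above; the return type is assumed here. */
    static int32_t opal_atomic_add_32(volatile int32_t *addr, int delta)
    {
        return __sync_add_and_fetch(addr, delta);
    }

    /* Hypothetical request object -- not the real mca_spml_yoda request. */
    struct toy_request {
        uint32_t active_puts;   /* unsigned counter: triggers the warning */
        int32_t  active_gets;   /* signed counter: matches the atomic API */
    };

    static void toy_completion(struct toy_request *req)
    {
        /* The warned-about pattern would be:
         *     opal_atomic_add_32(&req->active_puts, -1);
         * Fix 1: declare the counter int32_t so the pointer types agree. */
        opal_atomic_add_32(&req->active_gets, -1);

        /* Fix 2: keep the field unsigned and cast explicitly at the call. */
        opal_atomic_add_32((volatile int32_t *)&req->active_puts, -1);
    }

    int main(void)
    {
        struct toy_request req = { 1, 1 };
        toy_completion(&req);
        return 0;
    }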
Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2
Ralph -

I think those warnings are just because of when they last synced with the trunk; it looks like they haven't updated in the last week, when those (and some usnic fixes) went in.

More concerning is the --enable-picky stuff and the disabling of SHMEM in the right places.

Brian

On 8/11/13 11:24 AM, "Ralph Castain" wrote:

> Turning off enable_picky, I get it to compile with the following warnings:
>
> pget_elements_x_f.c:70: warning: no previous prototype for 'ompi_get_elements_x_f'
> pstatus_set_elements_x_f.c:70: warning: no previous prototype for 'ompi_status_set_elements_x_f'
> ptype_get_extent_x_f.c:69: warning: no previous prototype for 'ompi_type_get_extent_x_f'
> ptype_get_true_extent_x_f.c:69: warning: no previous prototype for 'ompi_type_get_true_extent_x_f'
> ptype_size_x_f.c:69: warning: no previous prototype for 'ompi_type_size_x_f'
>
> I also found that OpenSHMEM is still building by default. Is that intended? I thought you were only going to build it if --with-shmem (or whatever the option is) was given.
>
> Looks like some cleanup is required.
>
> [...]

--
Brian W. Barrett
Scalable System Software Group
Sandia National Laboratories
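
[Editor's note: the "no previous prototype" messages come from gcc's -Wmissing-prototypes, which the picky flags enable: an externally visible function is defined without a prototype having been seen first. A toy illustration of the pattern and the usual fix follows; the file and function names are invented and are not the actual OMPI Fortran-binding sources, where the declaration would normally come from the owning header.]

    /* toy_get_elements_x_f.c -- illustrative only; names are invented.
     *
     * gcc -Wmissing-prototypes warns when an extern function definition is
     * not preceded by a prototype.  Declaring it first -- normally by
     * including the header that owns the declaration -- silences it. */

    void toy_get_elements_x_f(int *count, int *ierr);   /* usually lives in a header */

    void toy_get_elements_x_f(int *count, int *ierr)
    {
        *count = 0;
        *ierr  = 0;
    }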
Re: [OMPI devel] Bad header guard in /opal/memoryhooks/memory.h
Thanks! Fixed in trunk and CMRd for 1.7.3.

On Aug 9, 2013, at 1:07 AM, Michael Schlottke wrote:

> Hi there,
>
> I don't know if this is the right place to post this, but it seems like the header guard in /opal/memoryhooks/memory.h does not work as intended: the header guard is written as
>
> #ifndef OPAL_MEMORY_MEMORY_H
> #define OPAl_MEMORY_MEMORY_H
>
> where in the second line it probably should read "OPAL_…" and not "OPAl_…". This is openmpi-1.7.2.
>
> Regards,
>
> Michael
>
> --
> Michael Schlottke
>
> SimLab Highly Scalable Fluids & Solids Engineering
> Jülich Aachen Research Alliance (JARA-HPC)
> RWTH Aachen University
> Wüllnerstraße 5a
> 52062 Aachen
> Germany
>
> Phone: +49 (241) 80 95188
> Fax: +49 (241) 80 92257
> Mail: m.schlot...@aia.rwth-aachen.de
> Web: http://www.jara.org/jara-hpc
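
[Editor's note: for completeness, a sketch of the corrected guard, with the body of the header elided. With the original lowercase-l typo, OPAL_MEMORY_MEMORY_H was never defined, so the guard never prevented re-inclusion.]

    /* opal/memoryhooks/memory.h -- corrected include guard; file body elided.
     * The #define must spell the same macro as the #ifndef ("OPAL_", not
     * "OPAl_"); otherwise the macro is never defined and the header is
     * re-processed on every inclusion. */
    #ifndef OPAL_MEMORY_MEMORY_H
    #define OPAL_MEMORY_MEMORY_H

    /* ... declarations ... */

    #endif /* OPAL_MEMORY_MEMORY_H */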
Re: [OMPI devel] [slurm-dev] slurm-dev Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun)
I can't speak to what you get from sacct, but I can say that things will definitely be different when launched directly via srun vs. indirectly thru mpirun. The reason is that mpirun uses srun to launch the ORTE daemons, which then fork/exec all the application processes under them (as opposed to launching those app procs thru srun). This means two things:

1. Slurm has no direct knowledge of or visibility into the application procs themselves when launched by mpirun - Slurm only sees the ORTE daemons. I'm sure that Slurm rolls up all the resources used by those daemons and their children, so the totals should include them.

2. Since all Slurm can do is roll everything up, the resources shown in sacct will include those used by the daemons and mpirun as well as the application procs. Slurm doesn't include its own daemons or the slurmctld in its accounting, so the two numbers will be significantly different. If you are attempting to limit overall resource usage, you may need to leave some slack for the daemons and mpirun.

You should also see an extra "step" in the mpirun-launched job, as mpirun itself generally takes the first step and the launch of the daemons occupies a second step.

As for the strange numbers you are seeing, it looks to me like you are hitting a mismatch of unsigned vs signed values. When adding them up, that could cause all kinds of erroneous behavior.

On Aug 6, 2013, at 11:55 PM, Christopher Samuel wrote:

> On 07/08/13 16:19, Christopher Samuel wrote:
>
>> Anyone seen anything similar, or any ideas on what could be going on?
>
> Sorry, this was with:
>
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
>
> Since those initial tests we've started enforcing memory limits (the system is not yet in full production) and found that this causes jobs to get killed.
>
> We tried the cgroups gathering method, but jobs still die with mpirun, and now the numbers don't seem to be right for mpirun or srun either:
>
> mpirun (killed):
>
> [samuel@barcoo-test Mem]$ sacct -j 94564 -o JobID,MaxRSS,MaxVMSize
>        JobID     MaxRSS  MaxVMSize
> ------------ ---------- ----------
> 94564
> 94564.batch    -523362K          0
> 94564.0         394525K          0
>
> srun:
>
> [samuel@barcoo-test Mem]$ sacct -j 94565 -o JobID,MaxRSS,MaxVMSize
>        JobID     MaxRSS  MaxVMSize
> ------------ ---------- ----------
> 94565
> 94565.batch        998K          0
> 94565.0          88663K          0
>
> All the best,
> Chris
>
> --
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/      http://twitter.com/vlsci
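
[Editor's note on the signed-vs-unsigned point: below is a standalone illustration (not Slurm or Open MPI code) of how folding a signed delta into an unsigned running total, and then reinterpreting that total as signed, produces values like the -523362K MaxRSS shown above. The variable names and numbers are invented, chosen only to mirror that output; the negative reinterpretation shown is what happens on typical two's-complement systems.]

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t total_kb = 100000;     /* running memory total, kept unsigned */
        int32_t  delta_kb = -623362;    /* a signed decrement from one sample  */

        /* The signed value is converted to unsigned and wraps modulo 2^32. */
        total_kb += (uint32_t)delta_kb;

        printf("as unsigned: %" PRIu32 " K\n", total_kb);          /* 4294443934 K */
        printf("as signed:   %" PRId32 " K\n", (int32_t)total_kb); /* -523362 K    */
        return 0;
    }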