Hi George,

I did not manage to trigger the core dump in a simpler test case, but I can reproduce the other parts: I can trigger a failing jsrun, and from that point on all subsequent jsruns also fail.
To reproduce, perform the following steps:

--------------------------------------------
$ bsub -W 1:00 -nnodes 32 -P BIP178 -Is /bin/bash
# on the batch node
$ cp /autofs/nccs-svm1_home1/merzky1/jsrun_test.tgz .
$ tar zxf jsrun_test.tgz
$ cd jsrun_test
$ source runme.sh
--------------------------------------------

After that, you can watch your jsrun processes with `ps`. If you trigger the error, you will see non-empty `unit.*.err` files in that directory. If you don't see those, and all jsrun processes are done, you may need to remove all *.out and *.err files and try the last command again. I never needed more than 3 attempts to see failing jsruns, and usually it 'worked' on the first attempt.

Best, Andre.

On Tue, Feb 12, 2019 at 12:53 AM Andre Merzky <an...@merzky.net> wrote:
>
> On Mon, Feb 11, 2019 at 10:04 PM George Markomanolis via RT <h...@nccs.gov> wrote:
> >
> > Hi Andre,
> >
> > I have no permissions to copy the files from your home. I would need the core
> > file and the binary to check if I can extract something more. Was it possible
> > to reproduce it with more simple cases?
>
> Apologies - I fixed the permission for the core file. As for the
> binary: the binary is jsrun, which is a system utility - it is *not*
> my own application:
>
> core.113220: ELF 64-bit LSB core file 64-bit PowerPC or cisco 7500,
> version 1 (SYSV), SVR4-style, from
> '/opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun -U
> /ccs/home/merzky1/radical.pil', real uid: 13416, effective uid: 13416,
> real gid: 24502, effective gid: 24502, execfn:
> '/opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun', platform: 'power9'
>
> Best, Andre.
>
> > regards,
> > George
> >
> > On Wed Feb 06 17:59:08 2019, an...@merzky.net wrote:
> >
> > Hi George.
> >
> > comments inlined below.
> >
> > On Wed, Feb 6, 2019 at 7:35 PM George Markomanolis via RT <h...@nccs.gov> wrote:
> > >
> > > Hi Andre,
> > >
> > > Initially, could you unload xalt for testing before you submit your job?
> > > module unload xalt
> >
> > Alas, the problem persists. I should note that I have been running
> > similar workloads successfully over the last days (or at least more
> > successful). Now it fails consistently with the error described here
> > (I only saw one core dump though).
> >
> > I should check if this is workload dependent - I'll ping back if I
> > see a difference in that respect.
> >
> > > The second error, we have seen it sometimes and disappears but we can't
> > > reproduce it. Could you send us the submission script?
> >
> > The submission mechanism is unfortunately not a single script, but a
> > rather involved framework. But basically we create a resource file
> > like this:
> >
> > $ cat unit.000115/unit.000115.rs
> > RS 0: { host: 3 cpu: 35 36 37 }
> > RS 1: { host: 3 cpu: 38 39 40 }
> >
> > and then run with this command:
> >
> > $ grep jsrun unit.000115/unit.000115.sh
> > /sw/summit/xalt/1.1.3/bin/jsrun -U /ccs/home/merzky1/radical.pilot.sandbox/rp.session.login2.merzky1.017933.0015/pilot.0000/unit.000115//unit.000115.rs -a 1 -E "LD_LIBRARY_PATH" -E "PATH" -E "PYTHONPATH" -E "NODE_LFS_PATH" /bin/sleep "10"
> >
> > The resource files vary and can specify up to 100 nodes, and the
> > workload can also vary - here it is just a test obviously.
> >
> > I'll try to reproduce this in a simple submission script.
> >
> > > I don't have access to the core file, could you copy it somewhere with access
> > > as also the binary?
> >
> > I copied the core file into my home directory, which should be world
> > readable.
> > The executable is jsrun - otherwise I would not have
> > bothered you guys :-)
> >
> > $ file rp.session.login2.merzky1.017933.0015/pilot.0000/unit.000113/core.113220
> > rp.session.login2.merzky1.017933.0015/pilot.0000/unit.000113/core.113220:
> > ELF 64-bit LSB core file 64-bit PowerPC or cisco 7500, version 1
> > (SYSV), SVR4-style, from
> > '/opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun -U
> > /ccs/home/merzky1/radical.pil', real uid: 13416, effective uid: 13416,
> > real gid: 24502, effective gid: 24502, execfn:
> > '/opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun', platform: 'power9'
> >
> > > Just to be sure you have one job with 128 jsrun calls?
> >
> > Yes, aehm, this is a small scale test. We are working with a pilot
> > system which launches many small tasks within a larger job allocation.
> > We are just getting started on summit and are now testing jsrun
> > capabilities. As said earlier; I had several runs at larger scale w/o
> > seeing this specific problem.
> >
> > Let me know if you need more info or want me to run any tests.
> >
> > Thanks, Andre.
> >
> > > regards,
> > > George
> > >
> > > On Wed Feb 06 05:26:32 2019, an...@merzky.net wrote:
> > >
> > > Your name
> > > Andre Merzky
> > >
> > > Your username
> > > merzky1
> > >
> > > Email address
> > > [1]an...@merzky.net
> > >
> > > Subject of your question/problem
> > > jsrun core dump
> > >
> > > Describe your question/problem
> > > Dear support,
> > >
> > > I am executing 128 jsruns in a sufficiently large job. Out of those,
> > > one fails with:
> > >
> > > ```
> > > cat unit.000113/STDERR
> > > /autofs/nccs-svm1_sw/summit/xalt/1.1.3/bin/xalt_helper_functions.sh:
> > > line 185: 113220 Segmentation fault (core dumped) $MY_CMD "$@"
> > > ```
> > >
> > > The core is in
> > > /autofs/nccs-svm1_home1/merzky1/radical.pilot.sandbox/rp.session.login2.merzky1.017933.0015/pilot.0000/unit.000113/core.113220,
> > > I will leave it there.
> > >
> > > 97 more jsruns fail with:
> > > ```
> > > $ cat unit.000112/*ERR
> > > Error: Locate pipe file
> > > /tmp/jsm.batch4.13416/168421/JSM_rm_port_13416_168421 timed out.
> > > Error message: No such file or directory
> > > 02-06-2019 05:18:25:897 112266 main: Error initializing RM connection.
> > > Exiting.
> > > ```
> > > which I assume is caused by the first one failing. While I see jsruns
> > > failing from time to time, this is the first one where subsequent
> > > instances do not succeed.
> > >
> > > Let me know if you need more information.
> > >
> > > Thanks, Andre
> > >
> > > 1. mailto:an...@merzky.net
> >
_______________________________________________
mtt-users mailing list
mtt-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/mtt-users
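
Note on the reproduction steps at the top of this message: the check described there (non-empty `unit.*.err` files mean a jsrun failed; otherwise remove the *.out and *.err files and try again) could be scripted roughly as below. This is only a sketch under that assumption. The function names `count_failed_units` and `reset_outputs` are made up for illustration and are not part of `jsrun_test.tgz` or of jsrun itself.

```shell
# Sketch only: helper names are hypothetical, not part of jsrun_test.tgz.

# Print the number of non-empty unit.*.err files in a directory
# (default: current directory). A non-empty file is assumed to mean
# that the corresponding jsrun failed.
count_failed_units() {
    dir="${1:-.}"
    n=0
    for f in "$dir"/unit.*.err; do
        # -s is true only for existing files with size > 0; if the glob
        # matches nothing, the literal pattern fails this test as well.
        if [ -s "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Remove all *.out and *.err files in a directory (default: current
# directory) so that the test can be started again from a clean state.
reset_outputs() {
    dir="${1:-.}"
    rm -f "$dir"/*.out "$dir"/*.err
}
```

Once all jsrun processes are done, `count_failed_units .` inside `jsrun_test` prints how many units failed. If it prints 0, `reset_outputs .` followed by another `source runme.sh` repeats the attempt.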