Re: [OMPI devel] SM init failures

2009-03-30 Thread Christian Siebert
Hi, as you all have noticed already, ftruncate() does NOT extend the size of a file on all systems. Instead, the preferred way to set a file to a specific size is to call lseek() and then write() one byte (see e.g. [1]). Best regards, Christian [1] Richard Stevens: Advanced Programmi
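A minimal sketch of the lseek()-then-write() approach described above, written in Python for brevity (the os module wraps the same POSIX calls). The helper name extend_file is hypothetical and is not taken from the Open MPI sources; the early return for files that are already large enough is an added safeguard, not something the message prescribes.

```python
import os
import tempfile


def extend_file(fd: int, size: int) -> None:
    """Grow the file behind fd to at least `size` bytes by writing its last byte."""
    if size <= 0:
        raise ValueError("size must be positive")
    if os.fstat(fd).st_size >= size:
        return  # already large enough; avoid overwriting existing data
    # Seek to the final byte of the desired size, then write a single zero byte.
    os.lseek(fd, size - 1, os.SEEK_SET)
    os.write(fd, b"\0")


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        extend_file(f.fileno(), 4096)
        print(os.fstat(f.fileno()).st_size)
```

Only one byte is written, so the cost does not grow with the requested size; the bytes before it read back as zeros.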

Re: [OMPI devel] Error in the versions 1.3 and 1.3.1 of OpenMPI when using SLURM_OVERCOMMIT=1

2009-03-30 Thread Hartmut Häfner
Dear Support, the answer seems to be simple, but it also seems to be wrong! Below you can see the description of how SLURM_OVERCOMMIT should operate. SLURM_CPUS_PER_TASK (default is 1) allows you to assign multiple CPUs to each (multithreaded) process in your job to improve performance. SRUN's

Re: [OMPI devel] Error in the versions 1.3 and 1.3.1 of OpenMPI when using SLURM_OVERCOMMIT=1

2009-03-30 Thread Ralph Castain
I should have been more careful/clear in my answer - we don't look at the SLURM_OVERCOMMIT variable. The srun cmd options are not utilized. If setting SLURM_OVERCOMMIT worked in version 1.2.8, I can assure you it was completely fortuitous - I wrote that code, and we never looked at that var

Re: [OMPI devel] [ewg] Seg fault running OpenMPI-1.3.1rc4

2009-03-30 Thread Pavel Shamis (Pasha)
I think your problem is related to this bug: https://svn.open-mpi.org/trac/ompi/ticket/1823 And it is resolved on the ompi-trunk. Pasha. Steve Wise wrote: When this happens, that node logs this type of message also in /var/log/messages: IMB-MPI1[8859]: segfault at 0018 rip 2b

[OMPI devel] mpirun: symbol lookup error: /usr/local/lib/openmpi/mca_plm_lsf.so: undefined symbol: lsb_init

2009-03-30 Thread Alessandro Surace
Hi guys, I have a problem with the latest stable build and the latest nightly snapshot. When I run a job directly with mpirun, there is no problem. If I try to submit it with lsf: bsub -a openmpi -m grid01 mpirun.lsf /mnt/ewd/mpi/fibonacci/fibonacci_mpi I get the following error: mpirun: symbol lookup error: /usr/l

Re: [OMPI devel] ***SPAM*** Re: [ewg] Seg fault running OpenMPI-1.3.1rc4

2009-03-30 Thread Steve Wise
Hey Pasha, I just applied r20872 and retested, and I still hit this seg fault. So I think this is a new bug. Lemme pull the trunk and try that. Pavel Shamis (Pasha) wrote: I think your problem is related to this bug: https://svn.open-mpi.org/trac/ompi/ticket/1823 And it is resolved on

Re: [OMPI devel] ***SPAM*** Re: [ewg] Seg fault running OpenMPI-1.3.1rc4

2009-03-30 Thread Jeff Squyres
Can you send a gdb bt from a corresponding corefile? That would help immensely... The stack trace we get from glibc unfortunately does not show file/line numbers. On Mar 30, 2009, at 10:56 AM, Steve Wise wrote: Hey Pasha, I just applied r20872 and retested, and I still hit this seg f

[OMPI devel] Error in VT

2009-03-30 Thread Leonardo Fialho
Hi, I'm experiencing the following errors while using Open MPI release 1.3.1 combined with VT. STAT P 2.258062 43.% 488.997562 0 STAT P 2.260121 44.% 485.672638 0 STAT P 2.262175 45.% 486.854935 0 RFG_Regions_stackPop(): Error: Stack underflow RFG_Regions_stackPop(): Error: Stack

Re: [OMPI devel] ***SPAM*** Re: [ewg] Seg fault running OpenMPI-1.3.1rc4

2009-03-30 Thread Pavel Shamis (Pasha)
Steve, If you compile the OMPI code with CFLAGS="-g", generate a segfault core file, and send the core + IMB-MPI1 to me, I will be able to understand the problem better. Regards, Pasha Steve Wise wrote: Hey Pasha, I just applied r20872 and retested, and I still hit this seg fault. So I thi

Re: [OMPI devel] SM init failures

2009-03-30 Thread Jeff Squyres
But don't we need the whole area to be zero filled? On Mar 28, 2009, at 5:02 PM, George Bosilca wrote: It is way too expensive to write the whole file. That's why I proposed to only write the last byte. This will force the OS to really map the file on less POSIX-compliant systems. geor

Re: [OMPI devel] Error in VT

2009-03-30 Thread Jeff Squyres
Can you send all the information listed here: http://www.open-mpi.org/community/help/ On Mar 30, 2009, at 11:46 AM, Leonardo Fialho wrote: Hi, I'm experiencing the following errors while using Open MPI release 1.3.1 combined with VT. STAT P 2.258062 43.% 488.997562 0 STAT P 2.26012

Re: [OMPI devel] ***SPAM*** Re: [ewg] Seg fault running OpenMPI-1.3.1rc4

2009-03-30 Thread Steve Wise
Hey Pasha, I'm tracking this with OFA bug 1579: https://bugs.openfabrics.org/show_bug.cgi?id=1579 I think Jeff Squyres is digging into it. But the core and IMB-MPI1 files are here: http://www.openfabrics.org/~swise/bug1579/core.8175.gz

Re: [OMPI devel] Error in VT

2009-03-30 Thread Leonardo Fialho
Hi Jeff, There are... Thanks a lot, Leonardo Jeff Squyres wrote: Can you send all the information listed here: http://www.open-mpi.org/community/help/ On Mar 30, 2009, at 11:46 AM, Leonardo Fialho wrote: Hi, I'm experiencing the following errors while using Open MPI release 1.3.1

Re: [OMPI devel] SM init failures

2009-03-30 Thread Iain Bason
On Mar 30, 2009, at 12:05 PM, Jeff Squyres wrote: But don't we need the whole area to be zero filled? It will be zero-filled on demand using the lseek/touch method. However, the OS may not reserve space for the skipped pages or disk blocks. Thus one could still get out of memory or file
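One way to observe the behavior described above is to compare a file's apparent size with the space the filesystem has actually allocated for it. This is only a sketch in Python under stated assumptions: the helper name size_and_allocated is hypothetical, st_blocks is treated as 512-byte units as on Linux (POSIX leaves the unit unspecified), and whether the skipped region stays unallocated depends on the filesystem.

```python
import os
import tempfile


def size_and_allocated(fd: int) -> tuple:
    """Return (apparent size, bytes reported as allocated) for an open file."""
    st = os.fstat(fd)
    # Assumption: st_blocks counts 512-byte units, as on Linux.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    size = 64 * 1024 * 1024
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        # Extend the file with the lseek/touch method: write only its last byte.
        os.lseek(fd, size - 1, os.SEEK_SET)
        os.write(fd, b"\0")
        apparent, allocated = size_and_allocated(fd)
        print(f"apparent size: {apparent} bytes, allocated: {allocated} bytes")
```

If the allocated figure comes out far smaller than the apparent size, the skipped blocks have not been reserved, which is the situation in which a later access could still fail for lack of space.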

Re: [OMPI devel] SM init failures

2009-03-30 Thread George Bosilca
Then it looks like the safest solution is to use either ftruncate or the lseek method and then touch the first byte of all memory pages. Unfortunately, I see two problems with this. First, there is a clear performance hit on the startup time. And second, we will have to find a pretty smart
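A rough sketch of what "touch the first byte of all memory pages" could look like, in Python for brevity. The helper name touch_pages is hypothetical and the demo uses an anonymous mapping rather than the file-backed segment discussed in this thread; it is not the Open MPI implementation.

```python
import mmap


def touch_pages(mem: mmap.mmap) -> int:
    """Write to the first byte of every page in the mapping; return the page count."""
    touched = 0
    for offset in range(0, len(mem), mmap.PAGESIZE):
        # Read the byte and store it back so existing contents are preserved
        # while the write still forces the page to be allocated.
        mem[offset] = mem[offset]
        touched += 1
    return touched


if __name__ == "__main__":
    mem = mmap.mmap(-1, 16 * mmap.PAGESIZE)  # anonymous mapping, for the demo only
    print(touch_pages(mem))
    mem.close()
```

The loop makes one write per page, so its cost grows linearly with the size of the segment, which is the startup-time concern raised above. On systems that place a page near the processor that first touches it, which process runs this loop also matters for memory affinity.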

Re: [OMPI devel] SM init failures

2009-03-30 Thread Jeff Squyres
On Mar 30, 2009, at 1:24 PM, Iain Bason wrote: > But don't we need the whole area to be zero filled? It will be zero-filled on demand using the lseek/touch method. Ok. However, the OS may not reserve space for the skipped pages or disk blocks. Thus one could still get out of memory or fil

Re: [OMPI devel] SM init failures

2009-03-30 Thread Tim Mattox
I've been lurking on this conversation, and I am again left with the impression that the underlying shared memory configuration based on sharing a file is flawed. Why not use a System V shared memory segment without a backing file as I described in ticket #1320? On Mon, Mar 30, 2009 at 1:34 PM, G

Re: [OMPI devel] SM init failures

2009-03-30 Thread Patrick Geoffray
George Bosilca wrote: performance hit on the startup time. And second, we will have to find a pretty smart way to do this or we will completely break the memory affinity stuff. I didn't look at the code, but I sure hope that the SM init code does touch each page to force allocation, otherwise

Re: [OMPI devel] SM init failures

2009-03-30 Thread Jeff Squyres
On Mar 30, 2009, at 1:40 PM, Patrick Geoffray wrote: > performance hit on the startup time. And second, we will have to find a > pretty smart way to do this or we will completely break the memory > affinity stuff. I didn't look at the code, but I sure hope that the SM init code does touch eac

Re: [OMPI devel] SM init failures

2009-03-30 Thread Jeff Squyres
It's half done, actually. But it was still going to be an option, not necessarily the only way to do it: http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/shm-sysv/ On Mar 30, 2009, at 1:40 PM, Tim Mattox wrote: I've been lurking on this conversation, and I am again left with the impre

Re: [OMPI devel] SM init failures

2009-03-30 Thread Eugene Loh
George Bosilca wrote: Then it looks like the safest solution is to use either ftruncate or the lseek method and then touch the first byte of all memory pages. Unfortunately, I see two problems with this. First, there is a clear performance hit on the startup time. And second, we will have to

Re: [OMPI devel] SM init failures

2009-03-30 Thread Patrick Geoffray
Jeff Squyres wrote: Why not? The "owning" process can do the touch; then it'll be affinity'ed properly. Right? Yes, that's what I meant by forcing allocation. From the thread, it looked like nobody touched the pages of the mapped file. If it's already done, no need to write in the whole fil

Re: [OMPI devel] SM init failures

2009-03-30 Thread Eugene Loh
Jeff Squyres wrote: It's half done, actually. But it was still going to be an option, not necessarily the only way to do it: http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/shm-sysv/ On Mar 30, 2009, at 1:40 PM, Tim Mattox wrote: I've been lurking on this conversation, and I am again

Re: [OMPI devel] SM init failures

2009-03-30 Thread Eugene Loh
Patrick Geoffray wrote: Jeff Squyres wrote: Why not? The "owning" process can do the touch; then it'll be affinity'ed properly. Right? Yes, that's what I meant by forcing allocation. From the thread, it looked like nobody touched the pages of the mapped file. If it's already done, no nee

Re: [OMPI devel] ***SPAM*** Re: [ewg] Seg fault running OpenMPI-1.3.1rc4

2009-03-30 Thread Jeff Squyres
Fixed in https://svn.open-mpi.org/trac/ompi/changeset/20896; thanks Steve. On Mar 30, 2009, at 12:45 PM, Steve Wise wrote: Hey Pasha, I'm tracking this with OFA bug 1579: https://bugs.openfabrics.org/show_bug.cgi?id=1579 I think Jeff Squyres is digging into it. But the core and IMB-MPI1 f

Re: [OMPI devel] SM init failures

2009-03-30 Thread Eugene Loh
Tim Mattox wrote: I think I remember setting up the MTT tests on Sif so that tests are run both with and without the coll_hierarch component selected. The coll_hierarch component stresses code paths and potential race conditions in its own way. So, if the problems are showing up more frequently

Re: [OMPI devel] SM init failures

2009-03-30 Thread Eugene Loh
Jeff Squyres wrote: On Mar 30, 2009, at 1:40 PM, Patrick Geoffray wrote: > we will have to find a > pretty smart way to do this or we will completely break the memory > affinity stuff. I didn't look at the code, but I sure hope that the SM init code does touch each page to force allocation,

Re: [OMPI devel] SM init failures

2009-03-30 Thread Jeff Squyres
FWIW, George found what looks like a race condition in the sm init code today -- it looks like we don't call maffinity anywhere in the sm btl startup, so we're not actually guaranteed that the memory is local to any particular process(or) (!). This race shouldn't cause segvs, though; it sh