On Apr 1, 2009, at 6:58 PM, Ralph Castain wrote:
IIRC, we certainly used to unlink the file after init. Are you sure
somebody changed that?
It looks like we unlink() it during btl sm component close
(effectively during MPI_FINALIZE), not before.
--
Jeff Squyres
Cisco Systems
IIRC, we certainly used to unlink the file after init. Are you sure
somebody changed that?
On Apr 1, 2009, at 4:29 PM, Jeff Squyres wrote:
So everyone hates SYSV. Ok. :-)
Given that part of the problems we've been having with mmap have
been due to filesystem issues, should we just unlin
So everyone hates SYSV. Ok. :-)
Given that part of the problems we've been having with mmap have been
due to filesystem issues, should we just unlink() the file once all
processes have mapped it? I believe we didn't do that originally for
two reasons:
- leave it around for debugging pu
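The unlink-after-mapping idea raised above can be sketched as follows. This is a minimal single-process illustration, not the Open MPI sm code: `map_then_unlink` and the path passed to it are hypothetical, and a real job would need its own synchronization so that every local process has mapped the file before the name is removed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a backing file, map it, then unlink it.
 * Returns 0 if the mapping is still usable after the name is gone. */
static int map_then_unlink(const char *path, size_t len)
{
    assert(len >= 16);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); unlink(path); return -1; }

    char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping survives close() */
    if (seg == MAP_FAILED) { unlink(path); return -1; }

    /* Assumption: all peers have already mapped the file at this point. */
    if (unlink(path) != 0) { munmap(seg, len); return -1; }

    struct stat st;
    int name_gone = (stat(path, &st) != 0);   /* no directory entry left */
    strcpy(seg, "still mapped");              /* memory is still valid */
    int ok = name_gone && strcmp(seg, "still mapped") == 0;

    munmap(seg, len);                /* storage is released on last unmap */
    return ok ? 0 : -1;
}
```

Once the name is unlinked, nothing is left on the filesystem to clean up if the job dies, but the file is also no longer there to inspect when debugging.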
On Tue, 2009-03-31 at 11:00 -0400, Jeff Squyres wrote:
> On Mar 31, 2009, at 3:45 AM, Sylvain Jeaugey wrote:
> > System V shared memory used to be the main way to do shared memory on
> > MPICH and from my (little) experience, this was truly painful :
> > - Cleanup issues : does shmctl(IPC_RMID) s
On Mar 31, 2009, at 11:00 AM, Jeff Squyres wrote:
On Mar 31, 2009, at 3:45 AM, Sylvain Jeaugey wrote:
Sorry to continue off-topic but going to System V shm would be for me
like going back in the past.
System V shared memory used to be the main way to do shared memory on
MPICH and from my (li
Jeff Squyres wrote:
On Mar 31, 2009, at 3:06 PM, Eugene Loh wrote:
The thing I was wondering about was memory barriers. E.g., you
initialize stuff and then post the FIFO pointer. The other guy sees the
FIFO pointer before the initialized memory.
We do do memory barriers during that SM s
On Mar 31, 2009, at 3:06 PM, Eugene Loh wrote:
The thing I was wondering about was memory barriers. E.g., you
initialize stuff and then post the FIFO pointer. The other guy sees the
FIFO pointer before the initialized memory.
We do do memory barriers during that SM startup sequence. I
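The ordering concern described above (initialize, then post the pointer) can be illustrated with C11 atomics. This is a sketch only: it swaps in `<stdatomic.h>` release/acquire ordering rather than showing whatever barrier primitives the sm startup code actually uses, and `struct fifo`, `publish_fifo`, and `try_get_fifo` are hypothetical names. In a real cross-process segment one would post an offset rather than a raw pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical FIFO header, for illustration only. */
struct fifo { int head; int tail; int size; };

/* The "posted" pointer that the peer polls. */
static _Atomic(struct fifo *) posted_fifo = NULL;

/* Writer: initialize the FIFO, then publish it with release ordering so
 * the initializing stores cannot become visible after the pointer. */
static void publish_fifo(struct fifo *f, int size)
{
    assert(f != NULL);
    f->head = 0;
    f->tail = 0;
    f->size = size;
    atomic_store_explicit(&posted_fifo, f, memory_order_release);
}

/* Reader: acquire ordering pairs with the release store above, so a
 * non-NULL result implies the fields written before it are visible. */
static struct fifo *try_get_fifo(void)
{
    return atomic_load_explicit(&posted_fifo, memory_order_acquire);
}
```

Without the release/acquire pair (or equivalent explicit barriers), the reader may see the pointer before the initialized fields on weakly ordered hardware.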
Jeff Squyres wrote:
On Mar 31, 2009, at 1:46 AM, Eugene Loh wrote:
> FWIW, George found what looks like a race condition in the sm init
> code today -- it looks like we don't call maffinity anywhere in the
> sm btl startup, so we're not actually guaranteed that the memory is
> local to any p
On Mar 31, 2009, at 3:45 AM, Sylvain Jeaugey wrote:
Sorry to continue off-topic but going to System V shm would be for me
like going back in the past.
System V shared memory used to be the main way to do shared memory on
MPICH and from my (little) experience, this was truly painful :
- Cleanu
On Mar 31, 2009, at 1:46 AM, Eugene Loh wrote:
> FWIW, George found what looks like a race condition in the sm init
> code today -- it looks like we don't call maffinity anywhere in the
> sm btl startup, so we're not actually guaranteed that the memory is
> local to any particular process(or)
Sorry to continue off-topic but going to System V shm would be for me
like going back in the past.
System V shared memory used to be the main way to do shared memory on
MPICH and from my (little) experience, this was truly painful :
- Cleanup issues : does shmctl(IPC_RMID) solve _all_ cases ?
Jeff Squyres wrote:
FWIW, George found what looks like a race condition in the sm init
code today -- it looks like we don't call maffinity anywhere in the
sm btl startup, so we're not actually guaranteed that the memory is
local to any particular process(or) (!). This race shouldn't cause
FWIW, George found what looks like a race condition in the sm init
code today -- it looks like we don't call maffinity anywhere in the sm
btl startup, so we're not actually guaranteed that the memory is local
to any particular process(or) (!). This race shouldn't cause segvs,
though; it sh
Jeff Squyres wrote:
On Mar 30, 2009, at 1:40 PM, Patrick Geoffray wrote:
> we will have to find a
> pretty smart way to do this or we will completely break the memory
> affinity stuff.
I didn't look at the code, but I sure hope that the SM init code does
touch each page to force allocation,
Tim Mattox wrote:
I think I remember setting up the MTT tests on Sif so that tests
are run both with and without the coll_hierarch component selected.
The coll_hierarch component stresses code paths and potential
race conditions in its own way. So, if the problems are showing up
more frequently
Patrick Geoffray wrote:
Jeff Squyres wrote:
Why not? The "owning" process can do the touch; then it'll be
affinity'ed properly. Right?
Yes, that's what I meant by forcing allocation. From the thread, it
looked like nobody touched the pages of the mapped file. If it's
already done, no nee
Jeff Squyres wrote:
It's half done, actually. But it was still going to be an option,
not necessarily the only way to do it:
http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/shm-sysv/
On Mar 30, 2009, at 1:40 PM, Tim Mattox wrote:
I've been lurking on this conversation, and I am again
Jeff Squyres wrote:
Why not? The "owning" process can do the touch; then it'll be
affinity'ed properly. Right?
Yes, that's what I meant by forcing allocation. From the thread, it
looked like nobody touched the pages of the mapped file. If it's already
done, no need to write in the whole fil
George Bosilca wrote:
Then it looks like the safest solution is to use either ftruncate or
the lseek method and then touch the first byte of all memory pages.
Unfortunately, I see two problems with this. First, there is a clear
performance hit on the startup time. And second, we will have to
It's half done, actually. But it was still going to be an option, not
necessarily the only way to do it:
http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/shm-sysv/
On Mar 30, 2009, at 1:40 PM, Tim Mattox wrote:
I've been lurking on this conversation, and I am again left with the
impre
On Mar 30, 2009, at 1:40 PM, Patrick Geoffray wrote:
> performance hit on the startup time. And second, we will have to find a
> pretty smart way to do this or we will completely break the memory
> affinity stuff.
I didn't look at the code, but I sure hope that the SM init code does
touch eac
George Bosilca wrote:
performance hit on the startup time. And second, we will have to find a
pretty smart way to do this or we will completely break the memory
affinity stuff.
I didn't look at the code, but I sure hope that the SM init code does
touch each page to force allocation, otherwise
I've been lurking on this conversation, and I am again left with the impression
that the underlying shared memory configuration based on sharing a file
is flawed. Why not use a System V shared memory segment without a
backing file as I described in ticket #1320?
On Mon, Mar 30, 2009 at 1:34 PM, G
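For comparison, a System V segment with no backing file might be set up roughly as below. This is a hedged sketch, not the approach from ticket #1320: `sysv_segment_demo` is a hypothetical name, the segment is private to one process, and it relies on the Linux behavior that a segment marked with `IPC_RMID` stays usable until the last detach.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical demo: create, attach, mark for removal, use, detach.
 * Returns 0 on success, -1 on any failure. */
static int sysv_segment_demo(size_t len)
{
    assert(len >= 16);
    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0) return -1;

    char *seg = shmat(id, NULL, 0);
    if (seg == (void *)-1) { shmctl(id, IPC_RMID, NULL); return -1; }

    /* Mark for removal right away so a crash cannot leak the segment.
     * Assumption: every process that needs it has already attached. */
    if (shmctl(id, IPC_RMID, NULL) != 0) { shmdt(seg); return -1; }

    strcpy(seg, "no backing file");
    int ok = strcmp(seg, "no backing file") == 0;

    shmdt(seg);                      /* last detach destroys the segment */
    return ok ? 0 : -1;
}
```

The cleanup question raised elsewhere in the thread still applies: if a process dies between `shmget` and `shmctl(IPC_RMID)`, the segment is left behind.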
On Mar 30, 2009, at 1:24 PM, Iain Bason wrote:
> But don't we need the whole area to be zero filled?
It will be zero-filled on demand using the lseek/touch method.
Ok.
However, the OS may not reserve space for the skipped pages or disk
blocks. Thus one could still get out of memory or fil
Then it looks like the safest solution is to use either ftruncate or
the lseek method and then touch the first byte of all memory pages.
Unfortunately, I see two problems with this. First, there is a clear
performance hit on the startup time. And second, we will have to find
a pretty smart
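Touching the first byte of every page, as proposed above, could look like the sketch below. `touch_pages` is a hypothetical helper, and it assumes the region is fresh and zero-filled, so writing 0 changes nothing. For memory affinity, the process that should own the pages would have to be the one doing the touching.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write the first byte of each page in [base, base+len)
 * so the OS must actually allocate it. Returns the number of pages touched. */
static size_t touch_pages(volatile char *base, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page = (size_t)ps;
    size_t touched = 0;

    for (size_t off = 0; off < len; off += page) {
        base[off] = 0;   /* assumes a fresh, zero-filled region */
        touched++;
    }
    return touched;
}
```

The cost is one page fault per page at startup, which is the performance hit mentioned above.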
On Mar 30, 2009, at 12:05 PM, Jeff Squyres wrote:
But don't we need the whole area to be zero filled?
It will be zero-filled on demand using the lseek/touch method.
However, the OS may not reserve space for the skipped pages or disk
blocks. Thus one could still get out of memory or file
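One way to address the reservation concern, not proposed in the thread itself, is `posix_fallocate()`, which asks the filesystem to allocate the blocks up front. The sketch below uses hypothetical helper names, and whether it behaves well on a given filesystem is an assumption to verify.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: reserve len bytes of backing store for fd.
 * posix_fallocate() returns 0 or an error number; it does not set errno. */
static int reserve_backing_store(int fd, off_t len)
{
    assert(len > 0);
    return posix_fallocate(fd, 0, len);
}

/* Demo: returns the file size after reservation, or -1 on failure. */
static long long reserve_demo(const char *path, off_t len)
{
    struct stat st;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;

    int rc = reserve_backing_store(fd, len);
    if (rc == 0 && fstat(fd, &st) != 0) rc = -1;

    close(fd);
    unlink(path);
    return rc == 0 ? (long long)st.st_size : -1;
}
```

If the reservation fails (for example with `ENOSPC`), the error shows up at startup instead of as a fault when a skipped page is first touched.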
But don't we need the whole area to be zero filled?
On Mar 28, 2009, at 5:02 PM, George Bosilca wrote:
It is way too expensive to write the whole file. That's why I proposed
to only write the last byte. This will force the OS to really map the
file on less POSIX-compliant systems.
geor
Hi,
as you all have noticed already, ftruncate() does NOT extend the size
of a file on all systems. Instead, the preferred way to set a file to
a specific size is to call lseek() and then write() one byte (see e.g.
[1]).
Best regards,
Christian
[1] Richard Stevens: Advanced Programmi
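The lseek()-then-write() approach described above can be sketched as follows. `extend_by_seek_write` and `extend_demo` are hypothetical names, and error handling is kept minimal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: set the file to len bytes by seeking to the last
 * byte and writing a single zero there. Returns 0 on success, -1 on error. */
static int extend_by_seek_write(int fd, off_t len)
{
    assert(len > 0);
    if (lseek(fd, len - 1, SEEK_SET) == (off_t)-1) return -1;
    return write(fd, "", 1) == 1 ? 0 : -1;   /* writes one NUL byte */
}

/* Demo: returns the resulting file size, or -1 on failure. */
static long long extend_demo(const char *path, off_t len)
{
    struct stat st;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;

    int rc = extend_by_seek_write(fd, len);
    if (rc == 0 && fstat(fd, &st) != 0) rc = -1;

    close(fd);
    unlink(path);
    return rc == 0 ? (long long)st.st_size : -1;
}
```

The skipped range reads back as zeros, but the filesystem may leave it as a hole rather than allocating blocks for it.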
It is way too expensive to write the whole file. That's why I proposed
to only write the last byte. This will force the OS to really map the
file on less POSIX-compliant systems.
george.
On Mar 28, 2009, at 13:50 , Jeff Squyres wrote:
How about just write()ing a bunch of 0's instead o
How about just write()ing a bunch of 0's instead of using ftruncate?
On Mar 27, 2009, at 11:09 PM, Eugene Loh wrote:
Paul H. Hargrove wrote:
> Quoting from a different manpage for ftruncate:
>[T]he POSIX standard allows two behaviours for ftruncate
>when length exceeds the file
Paul H. Hargrove wrote:
Quoting from a different manpage for ftruncate:
[T]he POSIX standard allows two behaviours for ftruncate
when length exceeds the file length [...]: either returning an
error, or extending the file.
So, if that is to be trusted, it is not legal by PO
Quoting from a different manpage for ftruncate:
[T]he POSIX standard allows two behaviours for ftruncate
when length exceeds the file length [...]: either returning an
error, or extending the file.
So, if that is to be trusted, it is not legal by POSIX to *silently* not
extend
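Given that reading of the manpage, a defensive caller would check both the return value of ftruncate() and the size it actually produced. The sketch below is an illustration under that assumption; `set_size_checked` and `set_size_demo` are hypothetical names and do not come from the Open MPI sources.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: call ftruncate() and confirm the file really has
 * the requested size afterwards. Returns 0 on success, -1 otherwise. */
static int set_size_checked(int fd, off_t len)
{
    struct stat st;
    assert(len >= 0);
    if (ftruncate(fd, len) != 0) return -1;   /* errno says why */
    if (fstat(fd, &st) != 0) return -1;
    return st.st_size == len ? 0 : -1;        /* catch a silent no-op */
}

/* Demo on a scratch file: returns 0 if the size check passes. */
static int set_size_demo(const char *path, off_t len)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    int rc = set_size_checked(fd, len);
    close(fd);
    unlink(path);
    return rc;
}
```

A failed check gives the caller a clean error path before mmap(), rather than an "Address not mapped" fault later.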
Talking with Aurelien here @ UT, we think we came up with a possible
way to get such an error. Before explaining this, let me lay the groundwork.
There are 2 critical functions used in setting up the shared memory
file. One is ftruncate, the other is mmap. Here are two snippets from
these function
Eugene,
I think I remember setting up the MTT tests on Sif so that tests
are run both with and without the coll_hierarch component selected.
The coll_hierarch component stresses code paths and potential
race conditions in its own way. So, if the problems are showing up
more frequently for the test
Josh Hursey wrote:
Sif is also running the coll_hierarch component on some of those
tests which has caused some additional problems. I don't know if that
is related or not.
Indeed. Many of the MTT stack traces (for both 1.3.1 and 1.3.2 and that
have seg faults and call out mca_btl_sm.so)
FWIW, when I was looking into this before, the problem was definitely
during MPI_INIT. I ran out of time before being able to track it down
further, but it was definitely something during the sm startup --
during add_procs, IIRC.
It *looked* like there was some kind of bogus value in the b
Hmmm...Eugene, you need to be a tad less sensitive. Nobody was
attempting to indict you or in any way attack you or your code.
What I was attempting to point out is that there are a number of sm
failures during sm init. I didn't single you out. I posted it to the
community because (a) it is
On Mar 26, 2009, at 6:41 PM, Ralph Castain wrote:
I suspect Josh or someone at IU could tell you the compiler. I
would be very surprised if it wasn't gcc, but I don't know what
version.
All the MTT runs on Sif are using gcc 4.1.2:
-bash-3.2$ gcc --version
gcc (GCC) 4.1.2 20080704 (Red Hat
Ralph Castain wrote:
You are correct - the Sun errors are in a version prior to the
insertion of the SM changes. We didn't relabel the version to 1.3.2
until -after- those changes went in, so you have to look for anything
with an r number >= 20839.
The sif errors are all in that group - I
You are correct - the Sun errors are in a version prior to the
insertion of the SM changes. We didn't relabel the version to 1.3.2
until -after- those changes went in, so you have to look for anything
with an r number >= 20839.
The sif errors are all in that group - I would suggest starting
Ralph Castain wrote:
It looks like the SM revisions we inserted into 1.3.2 are a great
detector for shared memory init failures - it segfaulted 143 times
last night on IU's sif computer, 34 times on Sun/Linux, and 3 times
on Sun/SunOS...almost every single time due to "Address not mapped"
Ralph Castain wrote:
Hi folks
Er, perhaps pronounced "Eugene". :^(
It looks like the SM revisions we inserted into 1.3.2 are a great
detector for shared memory init failures
How delicately put! I appreciate the gentleness.
- it segfaulted 143 times last night on IU's sif computer, 34
Hi folks
It looks like the SM revisions we inserted into 1.3.2 are a great
detector for shared memory init failures - it segfaulted 143 times
last night on IU's sif computer, 34 times on Sun/Linux, and 3 times on
Sun/SunOS...almost every single time due to "Address not mapped"
errors in t