Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-05-03 Thread Samuel K. Gutierrez

Hi all,

Does anyone know of a relatively portable way to query a given system
for the shmctl behavior that I am relying on, or is this going to be a
nightmare?  If I am reading this thread correctly, checking for shmget
and for Linux is not sufficient to determine an adequate level of SysV
support.
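
One option might be a small compile-and-run probe that exercises the
exact sequence in question: create a segment, attach, mark it with
IPC_RMID, then try to attach again.  A minimal sketch follows (an
illustration only, not Open MPI's actual configure test; a fork()-based
variant that attaches from a second process would mirror the real usage
more closely):

  /* shm_rmid_probe.c: does shmat(2) still work on a segment that has
   * already been marked for removal with IPC_RMID?  Illustrative only. */
  #include <stdio.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>

  int main(void)
  {
      int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
      if (id < 0) { perror("shmget"); return 2; }

      void *a = shmat(id, NULL, 0);
      if (a == (void *)-1) { perror("shmat"); return 2; }

      /* Mark the segment for destruction while still attached. */
      if (shmctl(id, IPC_RMID, NULL) < 0) { perror("shmctl"); return 2; }

      /* The behavior the sysv component relies on: a further attach by
       * id still succeeds (Linux documents this; many systems do not). */
      void *b = shmat(id, NULL, 0);
      if (b == (void *)-1) {
          printf("no - shmat after IPC_RMID fails on this system\n");
          return 1;
      }
      printf("yes - shmat after IPC_RMID still works\n");
      shmdt(b);
      shmdt(a);   /* the last detach actually destroys the segment */
      return 0;
  }

If the second attach fails - as it apparently does on Solaris 10, where
Ethan saw shmat(2) return EINVAL - the component could fall back to
deferring IPC_RMID or to the mmap component.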


Thanks!

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On May 2, 2010, at 7:48 AM, N.M. Maclaren wrote:


On May 2 2010, Ashley Pittman wrote:

On 2 May 2010, at 04:03, Samuel K. Gutierrez wrote:

As to performance, there should be no difference in use between SysV
shared memory and file-backed shared memory; the instructions issued
and the MMU flags for the page should both be the same, so the
performance should be identical.


Not necessarily, and possibly not so even for far-future Linuces.
On at least one system I used, the poxious kernel wrote the complete
file to disk before returning - all right, it did that for System V
shared memory, too, just to a 'hidden' file!  But, if I recall, on
another it did that only for file-backed shared memory - however, it's
a decade ago now and I may be misremembering.

Of course, that's a serious issue mainly for large segments.  I was
using multi-GB ones.  I don't know how big the ones you need are.

The one area you do need to keep an eye on for performance is NUMA
machines, where it matters which process on a node touches each page
first: you can end up using different areas (pages, not regions) for
communicating in different directions between the same pair of
processes.  I don't believe this is any different from mmap-backed
shared memory, though.


On some systems it may be, but in bizarre, inconsistent, undocumented
and unpredictable ways :-(  Also, there are usually several system (and
sometimes user) configuration options that change the behaviour, so you
have to allow for that.  My experience of trying to use those is that
different uses have incompatible requirements, and most of the critical
configuration parameters apply to ALL uses!

In my view, the configuration variability is the number one nightmare
for trying to write portable code that uses any form of shared memory.
ARMCI seem to agree.

Because of this, sysv support may be limited to Linux systems - that is,
until we can get a better sense of which systems provide the shmctl
IPC_RMID behavior that I am relying on.


And, I suggest, whether they have an evil gotcha on one of the areas that
Ashley Pittman noted.


Regards,
Nick Maclaren.






Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-05-02 Thread N.M. Maclaren

On May 2 2010, Ashley Pittman wrote:

On 2 May 2010, at 04:03, Samuel K. Gutierrez wrote:

As to performance, there should be no difference in use between SysV
shared memory and file-backed shared memory; the instructions issued and
the MMU flags for the page should both be the same, so the performance
should be identical.


Not necessarily, and possibly not so even for far-future Linuces.
On at least one system I used, the poxious kernel wrote the complete
file to disk before returning - all right, it did that for System V
shared memory, too, just to a 'hidden' file!  But, if I recall, on
another it did that only for file-backed shared memory - however, it's
a decade ago now and I may be misremembering.

Of course, that's a serious issue mainly for large segments.  I was
using multi-GB ones.  I don't know how big the ones you need are.

The one area you do need to keep an eye on for performance is NUMA
machines, where it matters which process on a node touches each page
first: you can end up using different areas (pages, not regions) for
communicating in different directions between the same pair of processes.
I don't believe this is any different from mmap-backed shared memory,
though.


On some systems it may be, but in bizarre, inconsistent, undocumented
and unpredictable ways :-(  Also, there are usually several system (and
sometimes user) configuration options that change the behaviour, so you
have to allow for that.  My experience of trying to use those is that
different uses have incompatible requirements, and most of the critical
configuration parameters apply to ALL uses!

In my view, the configuration variability is the number one nightmare
for trying to write portable code that uses any form of shared memory.
ARMCI seem to agree.


Because of this, sysv support may be limited to Linux systems - that is,
until we can get a better sense of which systems provide the shmctl
IPC_RMID behavior that I am relying on.


And, I suggest, whether they have an evil gotcha on one of the areas that
Ashley Pittman noted.


Regards,
Nick Maclaren.




Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-05-02 Thread Christopher Samuel
On 02/05/10 06:49, Ashley Pittman wrote:

> I think you should look into this a little deeper; it
> certainly used to be the case on Linux that setting
> IPC_RMID would also prevent any further processes from
> attaching to the segment.

That certainly appears to be the case in the current master
of the kernel: the segment's key is set to IPC_PRIVATE with the
comment:

 /* Do not find it any more */

With the key set to IPC_PRIVATE, ipcget() - used by sys_shmget() -
takes a different code path and calls ipcget_new() rather than
ipcget_public(), so the segment can no longer be looked up by key.

cheers,
Chris
-- 
  Christopher Samuel - Senior Systems Administrator
  VLSCI - Victorian Life Sciences Computational Initiative
  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
  http://www.vlsci.unimelb.edu.au/


Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-05-02 Thread Christopher Samuel
On 01/05/10 23:03, Samuel K. Gutierrez wrote:

> I call shmctl IPC_RMID immediately after one process has
> attached to the segment because, at least on Linux, this
> only marks the segment for destruction.

That's correct.  Looking at the kernel code (at least in the
current git master), the function that handles this - do_shm_rmid()
in ipc/shm.c - only destroys the segment if nobody is attached
to it; otherwise it sets the segment's key to IPC_PRIVATE to stop
others from finding it and marks it with SHM_DEST so that it is
automatically destroyed on the last detach.

cheers,
Chris
-- 
  Christopher Samuel - Senior Systems Administrator
  VLSCI - Victorian Life Sciences Computational Initiative
  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
  http://www.vlsci.unimelb.edu.au/


Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-05-02 Thread Ashley Pittman

On 2 May 2010, at 04:03, Samuel K. Gutierrez wrote:
> As far as I can tell, calling shmctl IPC_RMID immediately destroys
> the shared memory segment even though there is at least one process
> attached to it.  This is interesting and confusing because Solaris 10's
> documented behavior for shmctl IPC_RMID is similar to Linux's.
> 
> I call shmctl IPC_RMID immediately after one process has attached to the
> segment because, at least on Linux, this only marks the segment for
> destruction.  The segment is only actually destroyed after all attached
> processes have terminated.  I'm relying on this behavior for resource
> cleanup upon application termination (normal/abnormal).

I think you should look into this a little deeper; it certainly used to be the
case on Linux that setting IPC_RMID would also prevent any further processes
from attaching to the segment.

You're right that minimising the window during which the region exists
without that bit set is good, both in terms of wall-clock time and lines of
code.  What we used to do here was to have all processes on a node perform
an out-of-band intra-node barrier before creating the segment and another
in-band barrier immediately after creating it.  Without this, if one process
on a node has problems and aborts during startup before it gets to the
shared memory code, then you are almost guaranteed to leave an unattached
segment behind.
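
For illustration, a rough sketch of that ordering (intra_node_barrier() and
exchange_shmid() are hypothetical stand-ins for whatever out-of-band
mechanism the runtime provides; this is not Open MPI's actual startup code):

  #include <stddef.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>

  /* Hypothetical out-of-band helpers provided by the runtime. */
  extern void intra_node_barrier(void);
  extern int  exchange_shmid(int shmid, int am_creator);

  void *setup_node_segment(int am_creator, size_t size)
  {
      int shmid = -1;

      /* Barrier before creation: if a local peer dies earlier in startup,
       * no segment has been created yet, so nothing can be leaked. */
      intra_node_barrier();

      if (am_creator) {
          shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
          if (shmid < 0) return NULL;
      }
      shmid = exchange_shmid(shmid, am_creator);

      void *base = shmat(shmid, NULL, 0);
      if (base == (void *)-1) return NULL;

      /* Barrier immediately after creation/attach, keeping the window in
       * which the segment exists without the removal bit set as small as
       * possible. */
      intra_node_barrier();

      /* Marking for removal here is portable; doing it right after the
       * creator's own attach instead relies on the Linux-specific
       * attach-after-IPC_RMID behaviour discussed in this thread. */
      if (am_creator) {
          shmctl(shmid, IPC_RMID, NULL);  /* destroyed on the last detach */
      }
      return base;
  }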

As to performance, there should be no difference in use between SysV shared
memory and file-backed shared memory; the instructions issued and the MMU
flags for the page should both be the same, so the performance should be
identical.

The one area you do need to keep an eye on for performance is NUMA machines,
where it matters which process on a node touches each page first: you can end
up using different areas (pages, not regions) for communicating in different
directions between the same pair of processes.  I don't believe this is any
different from mmap-backed shared memory, though.
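
To make the first-touch point concrete, here is a small sketch (the segment
layout and offset argument are hypothetical, and it assumes the kernel's
default first-touch/local-allocation NUMA policy):

  #include <stddef.h>
  #include <unistd.h>

  /* Each local process touches the pages of the area it will write to
   * (e.g. its outgoing FIFO) before any peer does, so those pages end up
   * on that process's local NUMA node. */
  void first_touch_my_area(void *segment_base, size_t my_offset, size_t my_len)
  {
      char *p    = (char *)segment_base + my_offset;
      long  page = sysconf(_SC_PAGESIZE);

      for (size_t off = 0; off < my_len; off += (size_t)page) {
          p[off] = 0;   /* the first write faults the page in locally */
      }
  }

The placement works the same whether the pages come from shmget() or from a
file-backed mmap(), which is consistent with the point above.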

> Because of this, sysv support may be limited to Linux systems - that is,
> until we can get a better sense of which systems provide the shmctl
> IPC_RMID behavior that I am relying on.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-05-02 Thread Samuel K. Gutierrez
Hi Ethan,

Sorry about the lag.

As far as I can tell, calling shmctl IPC_RMID immediately destroys
the shared memory segment even though there is at least one process
attached to it.  This is interesting and confusing because Solaris 10's
documented behavior for shmctl IPC_RMID is similar to Linux's.

I call shmctl IPC_RMID immediately after one process has attached to the
segment because, at least on Linux, this only marks the segment for
destruction.  The segment is only actually destroyed after all attached
processes have terminated.  I'm relying on this behavior for resource
cleanup upon application termination (normal/abnormal).

Because of this, sysv support may be limited to Linux systems - that is,
until we can get a better sense of which systems provide the shmctl
IPC_RMID behavior that I am relying on.

Any other ideas are greatly appreciated.

Thanks for testing!

--
Samuel K. Gutierrez
Los Alamos National Laboratory

> On Thu, Apr/29/2010 02:52:24PM, Samuel K. Gutierrez wrote:
>>  Hi Ethan,
>>  Bummer.  What does the following command show?
>>  sysctl -a | grep shm
>
> In this case, I think the Solaris equivalent to sysctl is prctl, e.g.,
>
>   $ prctl -i project group.staff
>   project: 10: group.staff
>   NAME                    PRIVILEGE    VALUE    FLAG    ACTION    RECIPIENT
>   ...
>   project.max-shm-memory
>                           privileged   3.92GB   -       deny      -
>                           system       16.0EB   max     deny      -
>   project.max-shm-ids
>                           privileged   128      -       deny      -
>                           system       16.8M    max     deny      -
>   ...
>
> Is that the info you need?
>
> -Ethan
>
>>  Thanks!
>>  --
>>  Samuel K. Gutierrez
>>  Los Alamos National Laboratory
>>  On Apr 29, 2010, at 1:32 PM, Ethan Mallove wrote:
>> > Hi Samuel,
>> >
>> > I'm trying to run off your HG clone, but I'm seeing issues with
>> > c_hello, e.g.,
>> >
>> >  $ mpirun -mca mpi_common_sm sysv --mca btl self,sm,tcp --host
>> >    burl-ct-v440-2,burl-ct-v440-2 -np 2 ./c_hello
>> >  --------------------------------------------------------------------------
>> >  A system call failed during shared memory initialization that should not
>> >  have.  It is likely that your MPI job will now either abort or experience
>> >  performance degradation.
>> >
>> >    Local host:  burl-ct-v440-2
>> >    System call: shmat(2)
>> >    Process:     [[43408,1],1]
>> >    Error:       Invalid argument (errno 22)
>> >  --------------------------------------------------------------------------
>> >  ^Cmpirun: killing job...
>> >
>> >  $ uname -a
>> >  SunOS burl-ct-v440-2 5.10 Generic_118833-33 sun4u sparc SUNW,Sun-Fire-V440
>> >
>> > The same test works okay if I s/sysv/mmap/.
>> >
>> > Regards,
>> > Ethan
>> >
>> >
>> > On Wed, Apr/28/2010 07:16:12AM, Samuel K. Gutierrez wrote:
>> >> Hi,
>> >>
>> >> Faster component initialization/finalization times is one of the main
>> >> motivating factors of this work.  The general idea is to get away from
>> >> creating a rather large backing file.  With respect to module bandwidth
>> >> and latency, mmap and sysv seem to be comparable - at least that is what
>> >> my preliminary tests have shown.  As it stands, I have not come across a
>> >> situation where the mmap SM component doesn't work or is slower.
>> >>
>> >> Hope that helps,
>> >>
>> >> --
>> >> Samuel K. Gutierrez
>> >> Los Alamos National Laboratory
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Apr 28, 2010, at 5:35 AM, Bogdan Costescu wrote:
>> >>
>> >>> On Tue, Apr 27, 2010 at 7:55 PM, Samuel K. Gutierrez wrote:
>> >>>> With Jeff and Ralph's help, I have completed a System V shared memory
>> >>>> component for Open MPI.
>> >>>
>> >>> What is the motivation for this work?  Are there situations where the
>> >>> mmap based SM component doesn't work or is slow(er)?
>> >>>
>> >>> Kind regards,
>> >>> Bogdan





Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-04-29 Thread Samuel K. Gutierrez

Hi Ethan,

Bummer.  What does the following command show?

sysctl -a | grep shm

Thanks!

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Apr 29, 2010, at 1:32 PM, Ethan Mallove wrote:


Hi Samuel,

I'm trying to run off your HG clone, but I'm seeing issues with
c_hello, e.g.,

  $ mpirun -mca mpi_common_sm sysv --mca btl self,sm,tcp --host burl-ct-v440-2,burl-ct-v440-2 -np 2 ./c_hello
  --------------------------------------------------------------------------
  A system call failed during shared memory initialization that should
  not have.  It is likely that your MPI job will now either abort or
  experience performance degradation.

    Local host:  burl-ct-v440-2
    System call: shmat(2)
    Process:     [[43408,1],1]
    Error:       Invalid argument (errno 22)
  --------------------------------------------------------------------------
  ^Cmpirun: killing job...

  $ uname -a
  SunOS burl-ct-v440-2 5.10 Generic_118833-33 sun4u sparc SUNW,Sun-Fire-V440


The same test works okay if I s/sysv/mmap/.

Regards,
Ethan


On Wed, Apr/28/2010 07:16:12AM, Samuel K. Gutierrez wrote:

Hi,

Faster component initialization/finalization times is one of the main
motivating factors of this work.  The general idea is to get away from
creating a rather large backing file.  With respect to module bandwidth and
latency, mmap and sysv seem to be comparable - at least that is what my
preliminary tests have shown.  As it stands, I have not come across a
situation where the mmap SM component doesn't work or is slower.

Hope that helps,

--
Samuel K. Gutierrez
Los Alamos National Laboratory





On Apr 28, 2010, at 5:35 AM, Bogdan Costescu wrote:

On Tue, Apr 27, 2010 at 7:55 PM, Samuel K. Gutierrez wrote:

With Jeff and Ralph's help, I have completed a System V shared memory
component for Open MPI.


What is the motivation for this work?  Are there situations where the
mmap based SM component doesn't work or is slow(er)?

Kind regards,
Bogdan




Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-04-28 Thread Samuel K. Gutierrez

Hi,

Faster component initialization/finalization times is one of the main  
motivating factors of this work.  The general idea is to get away from  
creating a rather large backing file.  With respect to module  
bandwidth and latency, mmap and sysv seem to be comparable - at least  
that is what my preliminary tests have shown.  As it stands, I have  
not come across a  situation where the mmap SM component doesn't work  
or is slower.


Hope that helps,

--
Samuel K. Gutierrez
Los Alamos National Laboratory





On Apr 28, 2010, at 5:35 AM, Bogdan Costescu wrote:

On Tue, Apr 27, 2010 at 7:55 PM, Samuel K. Gutierrez wrote:

With Jeff and Ralph's help, I have completed a System V shared memory
component for Open MPI.


What is the motivation for this work?  Are there situations where the
mmap based SM component doesn't work or is slow(er)?

Kind regards,
Bogdan




Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-04-28 Thread Bogdan Costescu
On Tue, Apr 27, 2010 at 7:55 PM, Samuel K. Gutierrez  wrote:
> With Jeff and Ralph's help, I have completed a System V shared memory
> component for Open MPI.

What is the motivation for this work?  Are there situations where the
mmap based SM component doesn't work or is slow(er)?

Kind regards,
Bogdan