Re: [OMPI devel] RFC: Add ompi-top tool

2008-12-13 Thread Jeff Squyres
This works for me.  LAM had a similar tool to query daemons and find
the current state of running MPI procs (although it didn't get top-like
statistics of the apps).


On Dec 12, 2008, at 3:20 PM, Ralph Castain wrote:



WHAT: Add new tool to retrieve/monitor process stats


WHY: Several of us have had user requests to provide a
convenient way of obtaining reports on memory usage and
other typical stats from their MPI procs. The notion was to
create a tool that would allow a user to specify multiple ranks
(which could be on any number of nodes), and have the tool
query mpirun to get the info. This would avoid the necessity
of users remotely logging into multiple nodes to run top, ps,
or other stat tools - and of having to use something heavy
like Totalview for such a small purpose.


WHERE: Involves the following:

1. new opal framework "opal/mca/pstat" with components
to support obtaining process stats from the different OSes
(a rough sketch of the kind of query such a component might
perform follows this list). Note that application procs do
-not- open this framework. The open/select functions are
-only- in the orte_init procedures for the HNP and orteds.
This is because an app would never have any reason to call
this framework, so there is no reason to open it.

2. new "orte-top" tool (also avail as ompi-top) that sends
the top request to the specified mpirun and prints out
the returned data. No fancy screen handling - just basic
printout

3. slight mods to orted_comm to receive and process the
new cmd

4. added new cmd flag define to orte/mca/odls/odls_types.h

5. added new base function to orte/mca/odls/base/odls_base_default_fns.c
to look up the specified child and call opal_pstat to get the info
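
(For illustration only, here is a rough sketch of the kind of per-PID
query a Linux component could perform by reading /proc. The struct and
function names are hypothetical, not the proposed opal/mca/pstat
interface.)

    #include <stdio.h>
    #include <sys/types.h>

    typedef struct {
        long vsize_pages;     /* total program size in pages (/proc/<pid>/statm) */
        long rss_pages;       /* resident set size in pages */
        unsigned long utime;  /* user-mode clock ticks (/proc/<pid>/stat, field 14) */
        unsigned long stime;  /* kernel-mode clock ticks (field 15) */
    } my_pstat_t;

    static int query_pstat(pid_t pid, my_pstat_t *st)
    {
        char path[64];
        FILE *fp;

        snprintf(path, sizeof(path), "/proc/%d/statm", (int) pid);
        if (NULL == (fp = fopen(path, "r"))) return -1;
        if (2 != fscanf(fp, "%ld %ld", &st->vsize_pages, &st->rss_pages)) {
            fclose(fp);
            return -1;
        }
        fclose(fp);

        snprintf(path, sizeof(path), "/proc/%d/stat", (int) pid);
        if (NULL == (fp = fopen(path, "r"))) return -1;
        /* naive parse: skip pid, comm, state, and fields 4-13, then read
         * utime/stime; this breaks if the comm field contains spaces */
        if (2 != fscanf(fp,
                        "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu %lu",
                        &st->utime, &st->stime)) {
            fclose(fp);
            return -1;
        }
        fclose(fp);
        return 0;
    }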


WHEN: I would like to do this before the holiday break, if
possible, given that Sun, Cisco, and IU are all aware and
supportive of this change. However, since a number of
community members are tied up with the MPI Forum next week,
I propose to see if there are any immediate concerns and, if so,
wait until after the holiday to more thoroughly discuss them.


TIMEOUT: Dec 23





--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] shared-memory allocations

2008-12-13 Thread Richard Graham

> On 12/12/08 8:21 PM, "Eugene Loh"  wrote:
> 
> Richard Graham wrote:
> The memory allocation is intended to take into account that two separate
procs may be touching the same memory, so the intent is to reduce cache
conflicts (false sharing)
> Got it.  I'm totally fine with that.  Separate cachelines.
> and put the memory close to the process that is using it.
> Problematic concept, but ... okay, I'll read on.
> When the code first went in, there was no explicit memory affinity
implemented, so first-touch was relied on to get the memory in the "correct"
location.
> 
> Okay.
> If I remember correctly, the head and the tail are each written by a
different process, and are where the pointers and counters used to manage the
fifo are maintained.  They need to be close to the writer, and on separate cache
lines, to avoid false sharing.
> Why close to the writer (versus reader)?
> 
> Anyhow, so far as I can tell, the 2d structure ompi_fifo_t
fifo[receiver][sender] is organized by receiver.  That is, the main ompi_fifo_t
FIFO data structures are local to receivers.
> 
> But then, each FIFO is initialized (that is, circular buffers and associated
allocations) by senders.  E.g.,
https://svn.open-mpi.org/trac/ompi/browser/branches/v1.3/ompi/mca/btl/sm/btl_sm.c?version=19785#L537
> In the call to ompi_fifo_init(), all the circular buffer (CB) data structures
are allocated by the sender.  On different cachelines -- even different pages --
but all by the sender.

It does not make a difference who allocates it; what makes a difference is
who touches it first.
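
(A minimal sketch of the first-touch behaviour being described; the file
name and helper names are illustrative, not the actual sm BTL code.)

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Every process maps the same backing file, much like an mmap'd
     * shared-memory segment. */
    static void *attach_segment(const char *file, size_t len)
    {
        int fd = open(file, O_RDWR | O_CREAT, 0600);
        if (fd < 0) return MAP_FAILED;
        if (0 != ftruncate(fd, (off_t) len)) { close(fd); return MAP_FAILED; }
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p;
    }

    /* First touch: the physical pages in [offset, offset+len) are bound to
     * the NUMA node of whichever process writes them first, not of the
     * process that created the file or called mmap.  So the process that
     * will use this region should be the one doing this initial write. */
    static void claim_portion(char *base, size_t offset, size_t len)
    {
        memset(base + offset, 0, len);
    }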

> 
> Specifically, one accesses the FIFO on the receiver side and then follows
pointers to the sender's side.  Doesn't matter if you're talking head, tail, or queue.
> The queue itself is accessed most often by the reader,
> You mean because it's polling often, but writer writes only once?

Yes - it is polling volatile memory, so has to load from memory on every
read.

> so it should be closer to the reader.
> Are there measurements to substantiate this?  Seems to me that in a
cache-based system, a reader could poll on a remote location all it wanted and
there'd be traffic only if the cached copy were invalidated.  Conceivably, a
transfer could go cache-to-cache and not hit memory at all.  I tried some
measurements and found no difference for any location -- close to writer, close
to reader, or far from both.
> I honestly don't remember much about the wrapper - would have to go back to
the code to look at it.  If we no longer allow multiple FIFOs per pair, the
wrapper layer can go away - it is there to manage multiple FIFOs per pair.
> 
> There is support for multiple circular buffers per FIFO.

The code is there, but I believe Gleb disabled using multiple FIFOs and
added a list to hold pending messages, so now we are paying two overheads ...
I could be wrong here, but am pretty sure I am not.
I don't know if George has touched the code since.


> As far as granularity of allocation - it needs to be large enough to
accommodate the smallest shared memory hierarchy, so I suppose in the most
general case this may be the tertiary cache?
> 
> I don't get this.  I understand how certain things should be on separate
cachelines.  Beyond that, we just figure out what should be local to a process
and allocate all those things together.  That takes us from 3*n*n allocations
(and pages) to just n of them.

Not sure what your point is here.  The cost per process is linear in the
total number of processes, so overall the cost scales as the number of procs
squared.  This was designed for small SMPs, to reduce coordination costs
between processes, and where memory costs are not large.  One can go to very
simple schemes that are constant with respect to memory footprint, but then
pay the cost of multiple writers to a single queue - this is what LA-MPI did.

> No reason not to allocate objects that need to be associated with the same
process on the same page, as long as one avoids false sharing.
> Got it.
> So it seems like each process could have all of its receive FIFOs on the same
page, and these could also be shared with either the heads or the tails of each
queue.

Yes, this makes sense.

Rich
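
(To make that layout concrete, a hypothetical sketch - none of these names
are the actual sm data structures: everything a given receiver owns sits in
one per-receiver block, but each counter gets its own cache line so peers
never false-share.)

    #include <stdint.h>
    #include <stdlib.h>

    #define CACHE_LINE 128   /* assumed; in Open MPI this is a configured value */

    /* One FIFO counter, padded to a full cache line so that counters for
     * different senders never share a line. */
    typedef struct {
        volatile int32_t counter;
        char pad[CACHE_LINE - sizeof(int32_t)];
    } padded_counter_t;

    /* All the counters one receiver owns, allocated together: n allocations
     * (one per receiver) instead of one per (sender, receiver) pair, with
     * false sharing still avoided by the padding above. */
    static padded_counter_t *alloc_receiver_block(int n_senders)
    {
        padded_counter_t *block;
        if (0 != posix_memalign((void **) &block, CACHE_LINE,
                                n_senders * sizeof(padded_counter_t))) {
            return NULL;
        }
        return block;
    }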

> 
> I will propose some specifics and run them by y'all.  I think I know enough to
get started.  Thanks for the comments.
> 
> 



Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes

2008-12-13 Thread George Bosilca

Brian,

I found a second problem with rebuilding the datatype on the remote.
Originally, the displacements were wrongly computed. This is now fixed.
However, the data at the end of the fence is still not correct on the
remote.


I can confirm that the packed message contains only 0s instead of the
real values, but I couldn't figure out how these 0s got there. The pack
function works correctly for MPI_Send; I don't see any reason it should
not do the same for MPI_Put. As you're the one-sided guy in ompi, can you
take a look at MPI_Put to see why the data is incorrect?


  george.

On Dec 11, 2008, at 19:14 , Brian Barrett wrote:

I think that's a reasonable solution.  However, the words "not it"  
come to mind.  Sorry, but I have way too much on my plate this  
month.  By the way, in case no one noticed, I had e-mailed my  
findings to devel.  Someone might want to reply to Dorian's e-mail  
on users.



Brian

On Dec 11, 2008, at 2:31 PM, George Bosilca wrote:


Brian,

You're right, the datatype is being too cautious with the
boundaries when detecting the overlap. There is no good solution to
detect the overlap except parsing the whole memory layout to check
the status of every predefined type. As one can imagine, this is a
very expensive operation. This is the reason I preferred to use the
true extent and the size of the data to try to detect the overlap.
This approach is a lot faster, but has poor accuracy.


The best solution I can think of in the short term is to completely
remove the overlap check. This will have absolutely no impact on the way
we pack the data, but it can lead to unexpected results when we unpack
and the data overlap. But I guess this can be considered a user error, as
the MPI standard clearly states that the result of such an operation
is ... unexpected.
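
(For illustration, one cheap check in the spirit described above - not the
actual Open MPI code, whose details surely differ: compare the bytes the
type carries against the span it occupies.)

    #include <mpi.h>

    static int looks_overlapping(MPI_Datatype type)
    {
        int size;
        MPI_Aint true_lb, true_extent;

        MPI_Type_size(type, &size);
        MPI_Type_get_true_extent(type, &true_lb, &true_extent);

        /* If the type carries more bytes than the span it covers, some
         * bytes must overlap.  If it does not, overlap may still exist;
         * answering that precisely would mean walking the whole memory
         * layout, which is exactly the expensive operation being avoided. */
        return (MPI_Aint) size > true_extent;
    }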


george.

On Dec 10, 2008, at 22:20 , Brian Barrett wrote:


Hi all -

I looked into this, and it appears to be datatype related.  If the
displacements are set to 3, 2, 1, 0, then the datatype will fail
the type checks for one-sided because is_overlapped() returns 1
for the datatype.  My reading of the standard seems to indicate
this should not be.  I haven't looked into the problems with the
displacements set to 0, 1, 2, 3, but I'm guessing it has something
to do with the reverse problem.
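
(An illustrative construction of the case described above - not the actual
test code: four disjoint blocks of one int each, merely listed in reverse
order, which a correct overlap check should accept.)

    #include <mpi.h>

    static MPI_Datatype make_reversed_type(void)
    {
        MPI_Datatype newtype;
        int displs[4] = { 3, 2, 1, 0 };   /* in units of the old type (int) */

        MPI_Type_create_indexed_block(4, 1, displs, MPI_INT, &newtype);
        MPI_Type_commit(&newtype);
        return newtype;
    }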


This looks like a datatype issue, so it's out of my realm of  
expertise.  Can someone else take a look?


Brian

Begin forwarded message:


From: doriankrause 
Date: December 10, 2008 4:07:55 PM MST
To: us...@open-mpi.org
Subject: [OMPI users] Onesided + derived datatypes
Reply-To: Open MPI Users 

Hi List,

I have an MPI program which uses one-sided communication with derived
datatypes (MPI_Type_create_indexed_block). I developed the code with
MPICH2 and unfortunately didn't think about trying it out with
OpenMPI. Now that I'm "porting" the application to OpenMPI I'm facing
some problems. On most machines I get a SIGSEGV in MPI_Win_fence;
sometimes an invalid datatype error shows up. I ran the program in
Valgrind and didn't get anything valuable. Since I can't see a reason
for this problem (at least if I understand the standard correctly), I
wrote the attached test program.

Here are my experiences:

* If I compile without ONESIDED defined, everything works and V1 and V2
give the same results.
* If I compile with ONESIDED and V2 defined (MPI_Type_contiguous) it works.
* ONESIDED + V1 + O2: No errors, but obviously nothing is sent? (Am I
right in assuming that V1+O2 and V2 should be equivalent?)
* ONESIDED + V1 + O1:
[m02:03115] *** An error occurred in MPI_Put
[m02:03115] *** on win
[m02:03115] *** MPI_ERR_TYPE: invalid datatype
[m02:03115] *** MPI_ERRORS_ARE_FATAL (goodbye)

I didn't get a segfault as in the "real life example", but if ompitest.cc
is correct it means that OpenMPI is buggy when it comes to one-sided
communication and (some) derived datatypes, so the problem is probably
not in my code.

I'm using OpenMPI-1.2.8 with the newest gcc 4.3.2, but the same behaviour
can be seen with gcc-3.3.1 and intel 10.1.

Please correct me if ompitest.cc contains errors. Otherwise I would be
glad to hear how I should report these problems to the developers (if
they don't read this).

Thanks + best regards

Dorian
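
(A minimal sketch of the pattern Dorian describes - not his actual
ompitest.cc; run with at least two ranks: rank 0 puts four ints into
rank 1's window through an indexed-block datatype and both ranks fence.)

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, winbuf[4] = { 0, 0, 0, 0 }, sendbuf[4] = { 1, 2, 3, 4 };
        int displs[4] = { 3, 2, 1, 0 };
        MPI_Datatype type;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Type_create_indexed_block(4, 1, displs, MPI_INT, &type);
        MPI_Type_commit(&type);

        MPI_Win_create(winbuf, 4 * sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (0 == rank) {
            /* origin buffer is contiguous; the derived type describes the
             * target layout inside rank 1's window */
            MPI_Put(sendbuf, 4, MPI_INT, 1, 0, 1, type, win);
        }
        MPI_Win_fence(0, win);

        if (1 == rank) {
            printf("winbuf = %d %d %d %d\n",
                   winbuf[0], winbuf[1], winbuf[2], winbuf[3]);
        }

        MPI_Win_free(&win);
        MPI_Type_free(&type);
        MPI_Finalize();
        return 0;
    }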












Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes

2008-12-13 Thread Brian Barrett
Sorry, I really won't have time to look until after Christmas.  I'll  
put it on the to-do list, but that's as soon as it has a prayer of  
reaching the top.


Brian


Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes

2008-12-13 Thread Jeff Squyres

No problem-o.

George -- can you please file a bug?


On Dec 13, 2008, at 3:11 PM, Brian Barrett wrote:

Sorry, I really won't have time to look until after Christmas.  I'll  
put it on the to-do list, but that's as soon as it has a prayer of  
reaching the top.


Brian


Re: [OMPI devel] shared-memory allocations

2008-12-13 Thread Patrick Geoffray

Richard Graham wrote:
Yes - it is polling volatile memory, so has to load from memory on every 
read.


Actually, it will poll in cache, and only load from memory when the
cache coherency protocol invalidates the cache line. The volatile
semantics only prevent compiler optimizations.


It does not matter much where the pages are (closer to the reader or the
writer) on NUMAs, as long as they are equally distributed among all
sockets (i.e. the choice is consistent). Cache prefetching is slightly
more efficient on local socket, so closer to reader may be a bit better.


Patrick
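
(For reference, the polling pattern being discussed, with made-up names:
the reader spins on a flag the writer sets once.  volatile only keeps the
compiler from hoisting the load out of the loop; the loads themselves are
normally served from the local cache until the writer's store invalidates
the line.)

    #include <stdint.h>

    typedef struct {
        volatile int32_t flag;   /* written once by the producer */
    } slot_t;

    static void wait_for_data(slot_t *slot)
    {
        /* Spin: repeated cached reads; no memory/bus traffic until the
         * producer's write invalidates this cache line. */
        while (0 == slot->flag) {
            ;
        }
    }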


Re: [OMPI devel] shared-memory allocations

2008-12-13 Thread Paul H. Hargrove

To expand slightly on Patrick's last comment:

>  Cache prefetching is slightly
> more efficient on local socket, so closer to reader may be a bit better.

Ideally one polls from cache, but in the event that the line is evicted the 
next poll after the eviction will pay a lower cost if the memory is near to 
the reader.


-Paul





--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900