Re: [OMPI users] shared memory zero size segment

2016-02-11 Thread Jeff Hammond
No clarification necessary. The standard is not a user guide. The semantics are
clear from what is defined. Users who don't like the interface can write a
library that does what they want.

Jeff

On Thursday, February 11, 2016, Nathan Hjelm  wrote:

>
> I should also say that I think this is something that may be worth
> clarifying in the standard. Either semantic is fine with me but there is
> no reason to change the behavior if it does not violate the standard.
>
> -Nathan
>
> On Thu, Feb 11, 2016 at 01:35:28PM -0700, Nathan Hjelm wrote:
> >
> > Jeff probably ran with MPICH. Open MPI's are consistent with our choice
> > of definition for size=0:
> >
> > query: me=1, them=0, size=0, disp=1, base=0x0
> > query: me=1, them=1, size=4, disp=1, base=0x1097e30f8
> > query: me=1, them=2, size=4, disp=1, base=0x1097e30fc
> > query: me=1, them=3, size=4, disp=1, base=0x1097e3100
> > query: me=2, them=0, size=0, disp=1, base=0x0
> > query: me=2, them=1, size=4, disp=1, base=0x109fe10f8
> > query: me=2, them=2, size=4, disp=1, base=0x109fe10fc
> > query: me=2, them=3, size=4, disp=1, base=0x109fe1100
> > query: me=2, them=PROC_NULL, size=4, disp=1, base=0x109fe10f8
> > query: me=3, them=0, size=0, disp=1, base=0x0
> > query: me=3, them=1, size=4, disp=1, base=0x1088f30f8
> > query: me=3, them=2, size=4, disp=1, base=0x1088f30fc
> > query: me=3, them=3, size=4, disp=1, base=0x1088f3100
> > query: me=3, them=PROC_NULL, size=4, disp=1, base=0x1088f30f8
> > query: me=0, them=0, size=0, disp=1, base=0x0
> > query: me=0, them=1, size=4, disp=1, base=0x105b890f8
> > query: me=0, them=2, size=4, disp=1, base=0x105b890fc
> > query: me=0, them=3, size=4, disp=1, base=0x105b89100
> > query: me=0, them=PROC_NULL, size=4, disp=1, base=0x105b890f8
> > query: me=1, them=PROC_NULL, size=4, disp=1, base=0x1097e30f8
> >
> > To be portable only use MPI_Win_shared_query and do not rely on the
> > return value of base if you pass size = 0.
> >
> > -Nathan
> >
> > On Thu, Feb 11, 2016 at 08:23:16PM +, Peter Wind wrote:
> > >Thanks Jeff, that was an interesting result. The pointers are here
> well
> > >defined, also for the zero size segment.
> > >However I can't reproduce your output. I still get null pointers
> (output
> > >below).
> > >(I tried both 1.8.5 and 1.10.2 versions)
> > >What could be the difference?
> > >Peter
> > >mpirun -np 4 a.out
> > >query: me=0, them=0, size=0, disp=1, base=(nil)
> > >query: me=0, them=1, size=4, disp=1, base=0x2aee280030d0
> > >query: me=0, them=2, size=4, disp=1, base=0x2aee280030d4
> > >query: me=0, them=3, size=4, disp=1, base=0x2aee280030d8
> > >query: me=0, them=PROC_NULL, size=4, disp=1, base=0x2aee280030d0
> > >query: me=1, them=0, size=0, disp=1, base=(nil)
> > >query: me=1, them=1, size=4, disp=1, base=0x2aabbb9ce0d0
> > >query: me=1, them=2, size=4, disp=1, base=0x2aabbb9ce0d4
> > >query: me=1, them=3, size=4, disp=1, base=0x2aabbb9ce0d8
> > >query: me=1, them=PROC_NULL, size=4, disp=1, base=0x2aabbb9ce0d0
> > >query: me=2, them=0, size=0, disp=1, base=(nil)
> > >query: me=2, them=1, size=4, disp=1, base=0x2b1579dd40d0
> > >query: me=2, them=2, size=4, disp=1, base=0x2b1579dd40d4
> > >query: me=2, them=3, size=4, disp=1, base=0x2b1579dd40d8
> > >query: me=2, them=PROC_NULL, size=4, disp=1, base=0x2b1579dd40d0
> > >query: me=3, them=0, size=0, disp=1, base=(nil)
> > >query: me=3, them=1, size=4, disp=1, base=0x2ac8d2c350d0
> > >query: me=3, them=2, size=4, disp=1, base=0x2ac8d2c350d4
> > >query: me=3, them=3, size=4, disp=1, base=0x2ac8d2c350d8
> > >query: me=3, them=PROC_NULL, size=4, disp=1, base=0x2ac8d2c350d0
> > >
> > >
> --
> > >
> > >  See attached.  Output below.  Note that the base you get for
> ranks 0 and
> > >  1 is the same, so you need to use the fact that size=0 at rank=0
> to know
> > >  not to dereference that pointer and expect to be writing into
> rank 0's
> > >  memory, since you will write into rank 1's.
> > >  I would probably add "if (size==0) base=NULL;" for good measure.
> > >  Jeff
> > >
> > >  $ mpirun -n 4 ./a.out
> > >
> > >  query: me=0, them=0, size=0, disp=1, base=0x10bd64000
> > >
> > >  query: me=0, them=1, size=4, disp=1, base=0x10bd64000
> > >
> > >  query: me=0, them=2, size=4, disp=1, base=0x10bd64004
> > >
> > >  query: me=0, them=3, size=4, disp=1, base=0x10bd64008
> > >
> > >  query: me=0, them=PROC_NULL, size=4, disp=1, base=0x10bd64000
> > >
> > >  query: me=1, them=0, size=0, disp=1, base=0x102d3b000
> > >
> > >  query: me=1, them=1, size=4, disp=1, base=0x102d3b000
> > >
> > >  query: me=1, them=2, size=4, disp=1, base=0x102d3b004
> > >
> > >  query: me=1, them=3, size=4, disp=1, base=0x102d3b008
> > >
> > >  query: me=1, them=PROC_NULL, size=4, disp=1, base=0x102d3b000
> > >
> > >  query: 

Re: [OMPI users] shared memory zero size segment

2016-02-11 Thread Jeff Hammond
Indeed, I ran with MPICH. But I like Open MPI's choice better here, which is
why I said that I would explicitly set the pointer to null when size is
zero.

Jeff

On Thursday, February 11, 2016, Nathan Hjelm  wrote:

>
> Jeff probably ran with MPICH. Open MPI's are consistent with our choice
> of definition for size=0:
>
> query: me=1, them=0, size=0, disp=1, base=0x0
> query: me=1, them=1, size=4, disp=1, base=0x1097e30f8
> query: me=1, them=2, size=4, disp=1, base=0x1097e30fc
> query: me=1, them=3, size=4, disp=1, base=0x1097e3100
> query: me=2, them=0, size=0, disp=1, base=0x0
> query: me=2, them=1, size=4, disp=1, base=0x109fe10f8
> query: me=2, them=2, size=4, disp=1, base=0x109fe10fc
> query: me=2, them=3, size=4, disp=1, base=0x109fe1100
> query: me=2, them=PROC_NULL, size=4, disp=1, base=0x109fe10f8
> query: me=3, them=0, size=0, disp=1, base=0x0
> query: me=3, them=1, size=4, disp=1, base=0x1088f30f8
> query: me=3, them=2, size=4, disp=1, base=0x1088f30fc
> query: me=3, them=3, size=4, disp=1, base=0x1088f3100
> query: me=3, them=PROC_NULL, size=4, disp=1, base=0x1088f30f8
> query: me=0, them=0, size=0, disp=1, base=0x0
> query: me=0, them=1, size=4, disp=1, base=0x105b890f8
> query: me=0, them=2, size=4, disp=1, base=0x105b890fc
> query: me=0, them=3, size=4, disp=1, base=0x105b89100
> query: me=0, them=PROC_NULL, size=4, disp=1, base=0x105b890f8
> query: me=1, them=PROC_NULL, size=4, disp=1, base=0x1097e30f8
>
> To be portable only use MPI_Win_shared_query and do not rely on the
> return value of base if you pass size = 0.
>
> -Nathan
>
> On Thu, Feb 11, 2016 at 08:23:16PM +, Peter Wind wrote:
> >Thanks Jeff, that was an interesting result. The pointers are here
> well
> >defined, also for the zero size segment.
> >However I can't reproduce your output. I still get null pointers
> (output
> >below).
> >(I tried both 1.8.5 and 1.10.2 versions)
> >What could be the difference?
> >Peter
> >mpirun -np 4 a.out
> >query: me=0, them=0, size=0, disp=1, base=(nil)
> >query: me=0, them=1, size=4, disp=1, base=0x2aee280030d0
> >query: me=0, them=2, size=4, disp=1, base=0x2aee280030d4
> >query: me=0, them=3, size=4, disp=1, base=0x2aee280030d8
> >query: me=0, them=PROC_NULL, size=4, disp=1, base=0x2aee280030d0
> >query: me=1, them=0, size=0, disp=1, base=(nil)
> >query: me=1, them=1, size=4, disp=1, base=0x2aabbb9ce0d0
> >query: me=1, them=2, size=4, disp=1, base=0x2aabbb9ce0d4
> >query: me=1, them=3, size=4, disp=1, base=0x2aabbb9ce0d8
> >query: me=1, them=PROC_NULL, size=4, disp=1, base=0x2aabbb9ce0d0
> >query: me=2, them=0, size=0, disp=1, base=(nil)
> >query: me=2, them=1, size=4, disp=1, base=0x2b1579dd40d0
> >query: me=2, them=2, size=4, disp=1, base=0x2b1579dd40d4
> >query: me=2, them=3, size=4, disp=1, base=0x2b1579dd40d8
> >query: me=2, them=PROC_NULL, size=4, disp=1, base=0x2b1579dd40d0
> >query: me=3, them=0, size=0, disp=1, base=(nil)
> >query: me=3, them=1, size=4, disp=1, base=0x2ac8d2c350d0
> >query: me=3, them=2, size=4, disp=1, base=0x2ac8d2c350d4
> >query: me=3, them=3, size=4, disp=1, base=0x2ac8d2c350d8
> >query: me=3, them=PROC_NULL, size=4, disp=1, base=0x2ac8d2c350d0
> >
> >
> --
> >
> >  See attached.  Output below.  Note that the base you get for ranks
> 0 and
> >  1 is the same, so you need to use the fact that size=0 at rank=0 to
> know
> >  not to dereference that pointer and expect to be writing into rank
> 0's
> >  memory, since you will write into rank 1's.
> >  I would probably add "if (size==0) base=NULL;" for good measure.
> >  Jeff
> >
> >  $ mpirun -n 4 ./a.out
> >
> >  query: me=0, them=0, size=0, disp=1, base=0x10bd64000
> >
> >  query: me=0, them=1, size=4, disp=1, base=0x10bd64000
> >
> >  query: me=0, them=2, size=4, disp=1, base=0x10bd64004
> >
> >  query: me=0, them=3, size=4, disp=1, base=0x10bd64008
> >
> >  query: me=0, them=PROC_NULL, size=4, disp=1, base=0x10bd64000
> >
> >  query: me=1, them=0, size=0, disp=1, base=0x102d3b000
> >
> >  query: me=1, them=1, size=4, disp=1, base=0x102d3b000
> >
> >  query: me=1, them=2, size=4, disp=1, base=0x102d3b004
> >
> >  query: me=1, them=3, size=4, disp=1, base=0x102d3b008
> >
> >  query: me=1, them=PROC_NULL, size=4, disp=1, base=0x102d3b000
> >
> >  query: me=2, them=0, size=0, disp=1, base=0x10aac1000
> >
> >  query: me=2, them=1, size=4, disp=1, base=0x10aac1000
> >
> >  query: me=2, them=2, size=4, disp=1, base=0x10aac1004
> >
> >  query: me=2, them=3, size=4, disp=1, base=0x10aac1008
> >
> >  query: me=2, them=PROC_NULL, size=4, disp=1, base=0x10aac1000
> >
> >  query: me=3, them=0, size=0, disp=1, base=0x100fa2000
> >
> >  query: me=3, them=1, size=4, disp=1, base=0x100fa2000
> >
> >  query: me=3, them=2, size=4, 

Re: [OMPI users] shared memory zero size segment

2016-02-11 Thread Nathan Hjelm

I should also say that I think this is something that may be worth
clarifying in the standard. Either semantic is fine with me but there is
no reason to change the behavior if it does not violate the standard.

-Nathan

On Thu, Feb 11, 2016 at 01:35:28PM -0700, Nathan Hjelm wrote:
> 
> Jeff probably ran with MPICH. Open MPI's are consistent with our choice
> of definition for size=0:
> 
> query: me=1, them=0, size=0, disp=1, base=0x0
> query: me=1, them=1, size=4, disp=1, base=0x1097e30f8
> query: me=1, them=2, size=4, disp=1, base=0x1097e30fc
> query: me=1, them=3, size=4, disp=1, base=0x1097e3100
> query: me=2, them=0, size=0, disp=1, base=0x0
> query: me=2, them=1, size=4, disp=1, base=0x109fe10f8
> query: me=2, them=2, size=4, disp=1, base=0x109fe10fc
> query: me=2, them=3, size=4, disp=1, base=0x109fe1100
> query: me=2, them=PROC_NULL, size=4, disp=1, base=0x109fe10f8
> query: me=3, them=0, size=0, disp=1, base=0x0
> query: me=3, them=1, size=4, disp=1, base=0x1088f30f8
> query: me=3, them=2, size=4, disp=1, base=0x1088f30fc
> query: me=3, them=3, size=4, disp=1, base=0x1088f3100
> query: me=3, them=PROC_NULL, size=4, disp=1, base=0x1088f30f8
> query: me=0, them=0, size=0, disp=1, base=0x0
> query: me=0, them=1, size=4, disp=1, base=0x105b890f8
> query: me=0, them=2, size=4, disp=1, base=0x105b890fc
> query: me=0, them=3, size=4, disp=1, base=0x105b89100
> query: me=0, them=PROC_NULL, size=4, disp=1, base=0x105b890f8
> query: me=1, them=PROC_NULL, size=4, disp=1, base=0x1097e30f8
> 
> To be portable only use MPI_Win_shared_query and do not rely on the
> return value of base if you pass size = 0.
> 
> -Nathan
> 
> On Thu, Feb 11, 2016 at 08:23:16PM +, Peter Wind wrote:
> >Thanks Jeff, that was an interesting result. The pointers are here well
> >defined, also for the zero size segment.
> >However I can't reproduce your output. I still get null pointers (output
> >below).
> >(I tried both 1.8.5 and 1.10.2 versions)
> >What could be the difference?
> >Peter
> >mpirun -np 4 a.out
> >query: me=0, them=0, size=0, disp=1, base=(nil)
> >query: me=0, them=1, size=4, disp=1, base=0x2aee280030d0
> >query: me=0, them=2, size=4, disp=1, base=0x2aee280030d4
> >query: me=0, them=3, size=4, disp=1, base=0x2aee280030d8
> >query: me=0, them=PROC_NULL, size=4, disp=1, base=0x2aee280030d0
> >query: me=1, them=0, size=0, disp=1, base=(nil)
> >query: me=1, them=1, size=4, disp=1, base=0x2aabbb9ce0d0
> >query: me=1, them=2, size=4, disp=1, base=0x2aabbb9ce0d4
> >query: me=1, them=3, size=4, disp=1, base=0x2aabbb9ce0d8
> >query: me=1, them=PROC_NULL, size=4, disp=1, base=0x2aabbb9ce0d0
> >query: me=2, them=0, size=0, disp=1, base=(nil)
> >query: me=2, them=1, size=4, disp=1, base=0x2b1579dd40d0
> >query: me=2, them=2, size=4, disp=1, base=0x2b1579dd40d4
> >query: me=2, them=3, size=4, disp=1, base=0x2b1579dd40d8
> >query: me=2, them=PROC_NULL, size=4, disp=1, base=0x2b1579dd40d0
> >query: me=3, them=0, size=0, disp=1, base=(nil)
> >query: me=3, them=1, size=4, disp=1, base=0x2ac8d2c350d0
> >query: me=3, them=2, size=4, disp=1, base=0x2ac8d2c350d4
> >query: me=3, them=3, size=4, disp=1, base=0x2ac8d2c350d8
> >query: me=3, them=PROC_NULL, size=4, disp=1, base=0x2ac8d2c350d0
> > 
> >  --
> > 
> >  See attached.  Output below.  Note that the base you get for ranks 0 
> > and
> >  1 is the same, so you need to use the fact that size=0 at rank=0 to 
> > know
> >  not to dereference that pointer and expect to be writing into rank 0's
> >  memory, since you will write into rank 1's.
> >  I would probably add "if (size==0) base=NULL;" for good measure.
> >  Jeff
> > 
> >  $ mpirun -n 4 ./a.out
> > 
> >  query: me=0, them=0, size=0, disp=1, base=0x10bd64000
> > 
> >  query: me=0, them=1, size=4, disp=1, base=0x10bd64000
> > 
> >  query: me=0, them=2, size=4, disp=1, base=0x10bd64004
> > 
> >  query: me=0, them=3, size=4, disp=1, base=0x10bd64008
> > 
> >  query: me=0, them=PROC_NULL, size=4, disp=1, base=0x10bd64000
> > 
> >  query: me=1, them=0, size=0, disp=1, base=0x102d3b000
> > 
> >  query: me=1, them=1, size=4, disp=1, base=0x102d3b000
> > 
> >  query: me=1, them=2, size=4, disp=1, base=0x102d3b004
> > 
> >  query: me=1, them=3, size=4, disp=1, base=0x102d3b008
> > 
> >  query: me=1, them=PROC_NULL, size=4, disp=1, base=0x102d3b000
> > 
> >  query: me=2, them=0, size=0, disp=1, base=0x10aac1000
> > 
> >  query: me=2, them=1, size=4, disp=1, base=0x10aac1000
> > 
> >  query: me=2, them=2, size=4, disp=1, base=0x10aac1004
> > 
> >  query: me=2, them=3, size=4, disp=1, base=0x10aac1008
> > 
> >  query: me=2, them=PROC_NULL, size=4, disp=1, base=0x10aac1000
> > 
> >  query: me=3, them=0, size=0, disp=1, base=0x100fa2000
> > 
> >  query: me=3, them=1, 

Re: [OMPI users] shared memory zero size segment

2016-02-11 Thread Nathan Hjelm

Jeff probably ran with MPICH. Open MPI's results are consistent with our choice
of definition for size=0:

query: me=1, them=0, size=0, disp=1, base=0x0
query: me=1, them=1, size=4, disp=1, base=0x1097e30f8
query: me=1, them=2, size=4, disp=1, base=0x1097e30fc
query: me=1, them=3, size=4, disp=1, base=0x1097e3100
query: me=2, them=0, size=0, disp=1, base=0x0
query: me=2, them=1, size=4, disp=1, base=0x109fe10f8
query: me=2, them=2, size=4, disp=1, base=0x109fe10fc
query: me=2, them=3, size=4, disp=1, base=0x109fe1100
query: me=2, them=PROC_NULL, size=4, disp=1, base=0x109fe10f8
query: me=3, them=0, size=0, disp=1, base=0x0
query: me=3, them=1, size=4, disp=1, base=0x1088f30f8
query: me=3, them=2, size=4, disp=1, base=0x1088f30fc
query: me=3, them=3, size=4, disp=1, base=0x1088f3100
query: me=3, them=PROC_NULL, size=4, disp=1, base=0x1088f30f8
query: me=0, them=0, size=0, disp=1, base=0x0
query: me=0, them=1, size=4, disp=1, base=0x105b890f8
query: me=0, them=2, size=4, disp=1, base=0x105b890fc
query: me=0, them=3, size=4, disp=1, base=0x105b89100
query: me=0, them=PROC_NULL, size=4, disp=1, base=0x105b890f8
query: me=1, them=PROC_NULL, size=4, disp=1, base=0x1097e30f8

To be portable only use MPI_Win_shared_query and do not rely on the
return value of base if you pass size = 0.

-Nathan

On Thu, Feb 11, 2016 at 08:23:16PM +, Peter Wind wrote:
>Thanks Jeff, that was an interesting result. The pointers are here well
>defined, also for the zero size segment.
>However I can't reproduce your output. I still get null pointers (output
>below).
>(I tried both 1.8.5 and 1.10.2 versions)
>What could be the difference?
>Peter
>mpirun -np 4 a.out
>query: me=0, them=0, size=0, disp=1, base=(nil)
>query: me=0, them=1, size=4, disp=1, base=0x2aee280030d0
>query: me=0, them=2, size=4, disp=1, base=0x2aee280030d4
>query: me=0, them=3, size=4, disp=1, base=0x2aee280030d8
>query: me=0, them=PROC_NULL, size=4, disp=1, base=0x2aee280030d0
>query: me=1, them=0, size=0, disp=1, base=(nil)
>query: me=1, them=1, size=4, disp=1, base=0x2aabbb9ce0d0
>query: me=1, them=2, size=4, disp=1, base=0x2aabbb9ce0d4
>query: me=1, them=3, size=4, disp=1, base=0x2aabbb9ce0d8
>query: me=1, them=PROC_NULL, size=4, disp=1, base=0x2aabbb9ce0d0
>query: me=2, them=0, size=0, disp=1, base=(nil)
>query: me=2, them=1, size=4, disp=1, base=0x2b1579dd40d0
>query: me=2, them=2, size=4, disp=1, base=0x2b1579dd40d4
>query: me=2, them=3, size=4, disp=1, base=0x2b1579dd40d8
>query: me=2, them=PROC_NULL, size=4, disp=1, base=0x2b1579dd40d0
>query: me=3, them=0, size=0, disp=1, base=(nil)
>query: me=3, them=1, size=4, disp=1, base=0x2ac8d2c350d0
>query: me=3, them=2, size=4, disp=1, base=0x2ac8d2c350d4
>query: me=3, them=3, size=4, disp=1, base=0x2ac8d2c350d8
>query: me=3, them=PROC_NULL, size=4, disp=1, base=0x2ac8d2c350d0
> 
>  --
> 
>  See attached.  Output below.  Note that the base you get for ranks 0 and
>  1 is the same, so you need to use the fact that size=0 at rank=0 to know
>  not to dereference that pointer and expect to be writing into rank 0's
>  memory, since you will write into rank 1's.
>  I would probably add "if (size==0) base=NULL;" for good measure.
>  Jeff
> 
>  $ mpirun -n 4 ./a.out
> 
>  query: me=0, them=0, size=0, disp=1, base=0x10bd64000
> 
>  query: me=0, them=1, size=4, disp=1, base=0x10bd64000
> 
>  query: me=0, them=2, size=4, disp=1, base=0x10bd64004
> 
>  query: me=0, them=3, size=4, disp=1, base=0x10bd64008
> 
>  query: me=0, them=PROC_NULL, size=4, disp=1, base=0x10bd64000
> 
>  query: me=1, them=0, size=0, disp=1, base=0x102d3b000
> 
>  query: me=1, them=1, size=4, disp=1, base=0x102d3b000
> 
>  query: me=1, them=2, size=4, disp=1, base=0x102d3b004
> 
>  query: me=1, them=3, size=4, disp=1, base=0x102d3b008
> 
>  query: me=1, them=PROC_NULL, size=4, disp=1, base=0x102d3b000
> 
>  query: me=2, them=0, size=0, disp=1, base=0x10aac1000
> 
>  query: me=2, them=1, size=4, disp=1, base=0x10aac1000
> 
>  query: me=2, them=2, size=4, disp=1, base=0x10aac1004
> 
>  query: me=2, them=3, size=4, disp=1, base=0x10aac1008
> 
>  query: me=2, them=PROC_NULL, size=4, disp=1, base=0x10aac1000
> 
>  query: me=3, them=0, size=0, disp=1, base=0x100fa2000
> 
>  query: me=3, them=1, size=4, disp=1, base=0x100fa2000
> 
>  query: me=3, them=2, size=4, disp=1, base=0x100fa2004
> 
>  query: me=3, them=3, size=4, disp=1, base=0x100fa2008
> 
>  query: me=3, them=PROC_NULL, size=4, disp=1, base=0x100fa2000
> 
>  On Thu, Feb 11, 2016 at 8:55 AM, Jeff Hammond 
>  wrote:
> 
>On Thu, Feb 11, 2016 at 8:46 AM, Nathan Hjelm  wrote:
>>
>>
>> On Thu, Feb 11, 2016 at 02:17:40PM +, Peter Wind 

Re: [OMPI users] shared memory zero size segment

2016-02-11 Thread Peter Wind
You may be right semantically. But the sentence "the first address in the
memory segment of process i is consecutive with the last address in the memory
segment of process i - 1" is also not easy to interpret correctly for a zero-size
segment.

There may be good reasons not to return a valid pointer for a zero-size segment.
What I am trying to say is that a new user reading the documentation will not
expect this behaviour before trying it out.
Couldn't a short sentence in the documentation, such as "the pointer should not be
used for zero-size segments", clarify this?

Peter

- Original Message -
> 
> On Thu, Feb 11, 2016 at 02:17:40PM +, Peter Wind wrote:
> >I would add that the present situation is bound to give problems for
> >some
> >users.
> >It is natural to divide an array in segments, each process treating its
> >own segment, but needing to read adjacent segments too.
> >MPI_Win_allocate_shared seems to be designed for this.
> >This will work fine as long as no segment as size zero. It can also be
> >expected that most testing would be done with all segments larger than
> >zero.
> >The document adding "size = 0 is valid", would also make people
> >confident
> >that it will be consistent for that special case too.
> 
> Nope, that statement says its ok for a rank to specify that the local
> shared memory segment is 0 bytes. Nothing more. The standard
> unfortunately does not define what pointer value is returned for a rank
> that specifies size = 0. Not sure if the RMA working group intentionally
> left that undefine... Anyway, Open MPI does not appear to be out of
> compliance with the standard here.
> 
> To be safe you should use MPI_Win_shared_query as suggested. You can
> pass MPI_PROC_NULL as the rank to get the pointer for the first non-zero
> sized segment in the shared memory window.
> 
> -Nathan
> HPC-5, LANL
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28506.php


Re: [OMPI users] shared memory zero size segment

2016-02-11 Thread Peter Wind
Thanks Jeff, that was an interesting result. The pointers are well defined
here, also for the zero-size segment.
However, I can't reproduce your output; I still get null pointers (output
below).
(I tried both the 1.8.5 and 1.10.2 versions.)
What could be the difference?

Peter 

mpirun -np 4 a.out 
query: me=0, them=0, size=0, disp=1, base=(nil) 
query: me=0, them=1, size=4, disp=1, base=0x2aee280030d0 
query: me=0, them=2, size=4, disp=1, base=0x2aee280030d4 
query: me=0, them=3, size=4, disp=1, base=0x2aee280030d8 
query: me=0, them=PROC_NULL, size=4, disp=1, base=0x2aee280030d0 
query: me=1, them=0, size=0, disp=1, base=(nil) 
query: me=1, them=1, size=4, disp=1, base=0x2aabbb9ce0d0 
query: me=1, them=2, size=4, disp=1, base=0x2aabbb9ce0d4 
query: me=1, them=3, size=4, disp=1, base=0x2aabbb9ce0d8 
query: me=1, them=PROC_NULL, size=4, disp=1, base=0x2aabbb9ce0d0 
query: me=2, them=0, size=0, disp=1, base=(nil) 
query: me=2, them=1, size=4, disp=1, base=0x2b1579dd40d0 
query: me=2, them=2, size=4, disp=1, base=0x2b1579dd40d4 
query: me=2, them=3, size=4, disp=1, base=0x2b1579dd40d8 
query: me=2, them=PROC_NULL, size=4, disp=1, base=0x2b1579dd40d0 
query: me=3, them=0, size=0, disp=1, base=(nil) 
query: me=3, them=1, size=4, disp=1, base=0x2ac8d2c350d0 
query: me=3, them=2, size=4, disp=1, base=0x2ac8d2c350d4 
query: me=3, them=3, size=4, disp=1, base=0x2ac8d2c350d8 
query: me=3, them=PROC_NULL, size=4, disp=1, base=0x2ac8d2c350d0 

- Original Message -

> See attached. Output below. Note that the base you get for ranks 0 and 1 is
> the same, so you need to use the fact that size=0 at rank=0 to know not to
> dereference that pointer and expect to be writing into rank 0's memory,
> since you will write into rank 1's.

> I would probably add "if (size==0) base=NULL;" for good measure.

> Jeff

> $ mpirun -n 4 ./a.out

> query: me=0, them=0, size=0, disp=1, base=0x10bd64000

> query: me=0, them=1, size=4, disp=1, base=0x10bd64000

> query: me=0, them=2, size=4, disp=1, base=0x10bd64004

> query: me=0, them=3, size=4, disp=1, base=0x10bd64008

> query: me=0, them=PROC_NULL, size=4, disp=1, base=0x10bd64000

> query: me=1, them=0, size=0, disp=1, base=0x102d3b000

> query: me=1, them=1, size=4, disp=1, base=0x102d3b000

> query: me=1, them=2, size=4, disp=1, base=0x102d3b004

> query: me=1, them=3, size=4, disp=1, base=0x102d3b008

> query: me=1, them=PROC_NULL, size=4, disp=1, base=0x102d3b000

> query: me=2, them=0, size=0, disp=1, base=0x10aac1000

> query: me=2, them=1, size=4, disp=1, base=0x10aac1000

> query: me=2, them=2, size=4, disp=1, base=0x10aac1004

> query: me=2, them=3, size=4, disp=1, base=0x10aac1008

> query: me=2, them=PROC_NULL, size=4, disp=1, base=0x10aac1000

> query: me=3, them=0, size=0, disp=1, base=0x100fa2000

> query: me=3, them=1, size=4, disp=1, base=0x100fa2000

> query: me=3, them=2, size=4, disp=1, base=0x100fa2004

> query: me=3, them=3, size=4, disp=1, base=0x100fa2008

> query: me=3, them=PROC_NULL, size=4, disp=1, base=0x100fa2000

> On Thu, Feb 11, 2016 at 8:55 AM, Jeff Hammond < jeff.scie...@gmail.com >
> wrote:

> > On Thu, Feb 11, 2016 at 8:46 AM, Nathan Hjelm < hje...@lanl.gov > wrote:
> 
> > >
> 
> > >
> 
> > > On Thu, Feb 11, 2016 at 02:17:40PM +, Peter Wind wrote:
> 
> > > > I would add that the present situation is bound to give problems for
> > > > some
> 
> > > > users.
> 
> > > > It is natural to divide an array in segments, each process treating its
> 
> > > > own segment, but needing to read adjacent segments too.
> 
> > > > MPI_Win_allocate_shared seems to be designed for this.
> 
> > > > This will work fine as long as no segment as size zero. It can also be
> 
> > > > expected that most testing would be done with all segments larger than
> 
> > > > zero.
> 
> > > > The document adding "size = 0 is valid", would also make people
> > > > confident
> 
> > > > that it will be consistent for that special case too.
> 
> > >
> 
> > > Nope, that statement says its ok for a rank to specify that the local
> 
> > > shared memory segment is 0 bytes. Nothing more. The standard
> 
> > > unfortunately does not define what pointer value is returned for a rank
> 
> > > that specifies size = 0. Not sure if the RMA working group intentionally
> 
> > > left that undefine... Anyway, Open MPI does not appear to be out of
> 
> > > compliance with the standard here.
> 
> > >
> 

> > MPI_Alloc_mem doesn't say what happens if you pass size=0 either. The RMA
> > working group intentionally tries to maintain consistency with the rest of
> > the MPI standard whenever possible, so we did not create a new semantic
> > here.
> 

> > MPI_Win_shared_query text includes this:
> 

> > "If all processes in the group attached to the window specified size = 0,
> > then the call returns size = 0 and a baseptr as if MPI_ALLOC_MEM was called
> > with size = 0."
> 

> > >
> 
> > > To be safe you should use MPI_Win_shared_query as suggested. You can
> 
> > > pass MPI_PROC_NULL as the 

Re: [OMPI users] shared memory zero size segment

2016-02-11 Thread Jeff Hammond
See attached.  Output below.  Note that the base you get for ranks 0 and 1
is the same, so you need to use the fact that size=0 at rank=0 to know not
to dereference that pointer and expect to be writing into rank 0's memory,
since you will write into rank 1's.

I would probably add "if (size==0) base=NULL;" for good measure.
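A minimal C sketch of that guard, as a hypothetical helper (the window and rank
arguments are illustrative, not taken from the attached test):

#include <mpi.h>
#include <stddef.h>

/* Hypothetical helper: query rank `them` in a shared-memory window and force
   the base pointer to NULL when that rank contributed a zero-size segment. */
static int *query_base_or_null(MPI_Win win, int them)
{
    MPI_Aint size;
    int disp;
    int *base = NULL;

    MPI_Win_shared_query(win, them, &size, &disp, &base);
    if (size == 0)
        base = NULL;   /* the "if (size==0) base=NULL;" suggested above */
    return base;       /* never dereference when the segment is empty */
}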

Jeff

$ mpirun -n 4 ./a.out

query: me=0, them=0, size=0, disp=1, base=0x10bd64000

query: me=0, them=1, size=4, disp=1, base=0x10bd64000

query: me=0, them=2, size=4, disp=1, base=0x10bd64004

query: me=0, them=3, size=4, disp=1, base=0x10bd64008

query: me=0, them=PROC_NULL, size=4, disp=1, base=0x10bd64000

query: me=1, them=0, size=0, disp=1, base=0x102d3b000

query: me=1, them=1, size=4, disp=1, base=0x102d3b000

query: me=1, them=2, size=4, disp=1, base=0x102d3b004

query: me=1, them=3, size=4, disp=1, base=0x102d3b008

query: me=1, them=PROC_NULL, size=4, disp=1, base=0x102d3b000

query: me=2, them=0, size=0, disp=1, base=0x10aac1000

query: me=2, them=1, size=4, disp=1, base=0x10aac1000

query: me=2, them=2, size=4, disp=1, base=0x10aac1004

query: me=2, them=3, size=4, disp=1, base=0x10aac1008

query: me=2, them=PROC_NULL, size=4, disp=1, base=0x10aac1000

query: me=3, them=0, size=0, disp=1, base=0x100fa2000

query: me=3, them=1, size=4, disp=1, base=0x100fa2000

query: me=3, them=2, size=4, disp=1, base=0x100fa2004

query: me=3, them=3, size=4, disp=1, base=0x100fa2008

query: me=3, them=PROC_NULL, size=4, disp=1, base=0x100fa2000

On Thu, Feb 11, 2016 at 8:55 AM, Jeff Hammond 
wrote:

>
>
> On Thu, Feb 11, 2016 at 8:46 AM, Nathan Hjelm  wrote:
> >
> >
> > On Thu, Feb 11, 2016 at 02:17:40PM +, Peter Wind wrote:
> > >I would add that the present situation is bound to give problems
> for some
> > >users.
> > >It is natural to divide an array in segments, each process treating
> its
> > >own segment, but needing to read adjacent segments too.
> > >MPI_Win_allocate_shared seems to be designed for this.
> > >This will work fine as long as no segment as size zero. It can also
> be
> > >expected that most testing would be done with all segments larger
> than
> > >zero.
> > >The document adding "size = 0 is valid", would also make people
> confident
> > >that it will be consistent for that special case too.
> >
> > Nope, that statement says its ok for a rank to specify that the local
> > shared memory segment is 0 bytes. Nothing more. The standard
> > unfortunately does not define what pointer value is returned for a rank
> > that specifies size = 0. Not sure if the RMA working group intentionally
> > left that undefine... Anyway, Open MPI does not appear to be out of
> > compliance with the standard here.
> >
>
> MPI_Alloc_mem doesn't say what happens if you pass size=0 either.  The RMA
> working group intentionally tries to maintain consistency with the rest of
> the MPI standard whenever possible, so we did not create a new semantic
> here.
>
> MPI_Win_shared_query text includes this:
>
> "If all processes in the group attached to the window specified size = 0,
> then the call returns size = 0 and a baseptr as if MPI_ALLOC_MEM was called
> with size = 0."
>
> >
> > To be safe you should use MPI_Win_shared_query as suggested. You can
> > pass MPI_PROC_NULL as the rank to get the pointer for the first non-zero
> > sized segment in the shared memory window.
>
> Indeed!  I forgot about that.  MPI_Win_shared_query solves this problem
> for the user brilliantly.
>
> Jeff
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>



-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
#include <stdio.h>
#include <mpi.h>

/* test zero size segment.
   run on at least 3 cpus
   mpirun -np 4 a.out */

int main(int argc, char** argv)
{
    MPI_Init(NULL, NULL);

    int wsize, wrank;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    MPI_Comm ncomm = MPI_COMM_NULL;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &ncomm);

    /* rank 0 contributes a zero-size segment, every other rank one int */
    MPI_Aint size = (wrank==0) ? 0 : sizeof(int);
    MPI_Win win = MPI_WIN_NULL;
    int * ptr = NULL;
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &ptr, &win);

    int nsize, nrank;
    MPI_Comm_size(MPI_COMM_WORLD, &nsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &nrank);

    /* query every rank, then MPI_PROC_NULL; this prints the
       "query: me=..., them=..., size=..., disp=..., base=..." lines shown above */
    for (int r=0; r<nsize; r++) {
        MPI_Aint qsize;
        int qdisp;
        int * qptr = NULL;
        MPI_Win_shared_query(win, r, &qsize, &qdisp, &qptr);
        printf("query: me=%d, them=%d, size=%zu, disp=%d, base=%p\n",
               nrank, r, (size_t)qsize, qdisp, (void*)qptr);
    }

    MPI_Aint qsize;
    int qdisp;
    int * qptr = NULL;
    MPI_Win_shared_query(win, MPI_PROC_NULL, &qsize, &qdisp, &qptr);
    printf("query: me=%d, them=PROC_NULL, size=%zu, disp=%d, base=%p\n",
           nrank, (size_t)qsize, qdisp, (void*)qptr);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Re: [OMPI users] shared memory zero size segment

2016-02-11 Thread Jeff Hammond
On Thu, Feb 11, 2016 at 8:46 AM, Nathan Hjelm  wrote:
>
>
> On Thu, Feb 11, 2016 at 02:17:40PM +, Peter Wind wrote:
> >I would add that the present situation is bound to give problems for
some
> >users.
> >It is natural to divide an array in segments, each process treating
its
> >own segment, but needing to read adjacent segments too.
> >MPI_Win_allocate_shared seems to be designed for this.
> >This will work fine as long as no segment as size zero. It can also
be
> >expected that most testing would be done with all segments larger
than
> >zero.
> >The document adding "size = 0 is valid", would also make people
confident
> >that it will be consistent for that special case too.
>
> Nope, that statement says its ok for a rank to specify that the local
> shared memory segment is 0 bytes. Nothing more. The standard
> unfortunately does not define what pointer value is returned for a rank
> that specifies size = 0. Not sure if the RMA working group intentionally
> left that undefine... Anyway, Open MPI does not appear to be out of
> compliance with the standard here.
>

MPI_Alloc_mem doesn't say what happens if you pass size=0 either.  The RMA
working group intentionally tries to maintain consistency with the rest of
the MPI standard whenever possible, so we did not create a new semantic
here.

MPI_Win_shared_query text includes this:

"If all processes in the group attached to the window specified size = 0,
then the call returns size = 0 and a baseptr as if MPI_ALLOC_MEM was called
with size = 0."

>
> To be safe you should use MPI_Win_shared_query as suggested. You can
> pass MPI_PROC_NULL as the rank to get the pointer for the first non-zero
> sized segment in the shared memory window.

Indeed!  I forgot about that.  MPI_Win_shared_query solves this problem for
the user brilliantly.

Jeff

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [OMPI users] shared memory zero size segment

2016-02-11 Thread Nathan Hjelm

On Thu, Feb 11, 2016 at 02:17:40PM +, Peter Wind wrote:
>I would add that the present situation is bound to give problems for some
>users.
>It is natural to divide an array in segments, each process treating its
>own segment, but needing to read adjacent segments too.
>MPI_Win_allocate_shared seems to be designed for this.
>This will work fine as long as no segment as size zero. It can also be
>expected that most testing would be done with all segments larger than
>zero.
>The document adding "size = 0 is valid", would also make people confident
>that it will be consistent for that special case too.

Nope, that statement says it's OK for a rank to specify that the local
shared memory segment is 0 bytes. Nothing more. The standard
unfortunately does not define what pointer value is returned for a rank
that specifies size = 0. Not sure if the RMA working group intentionally
left that undefined... Anyway, Open MPI does not appear to be out of
compliance with the standard here.

To be safe you should use MPI_Win_shared_query as suggested. You can
pass MPI_PROC_NULL as the rank to get the pointer for the first non-zero
sized segment in the shared memory window.
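
A minimal sketch of that query in C (the helper and names below are
illustrative, not part of the attached test):

#include <mpi.h>

/* Illustrative helper: return a pointer into the shared window even on a rank
   whose own contribution was size = 0, by asking MPI_PROC_NULL for the first
   non-zero sized segment. */
static int *first_nonzero_segment(MPI_Win win)
{
    MPI_Aint qsize;
    int qdisp;
    int *qbase = NULL;

    MPI_Win_shared_query(win, MPI_PROC_NULL, &qsize, &qdisp, &qbase);
    return qbase;   /* NULL only if every rank passed size = 0 */
}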

-Nathan
HPC-5, LANL




Re: [OMPI users] shared memory zero size segment

2016-02-11 Thread Peter Wind
I would add that the present situation is bound to give problems for some
users.

It is natural to divide an array into segments, each process treating its own
segment, but needing to read adjacent segments too.
MPI_Win_allocate_shared seems to be designed for this.
This will work fine as long as no segment has size zero. It can also be expected
that most testing would be done with all segments larger than zero.
The documentation adding "size = 0 is valid" would also make people confident that
it will be consistent for that special case too.
Then, long down the road in the development of a particular code, some special
case will use a segment of size zero, and it will be hard to trace the error
back to the MPI library.
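
(For reference, the access pattern described above, written portably against
MPI_Win_shared_query as suggested elsewhere in the thread, might look like this
sketch; the names are illustrative.)

#include <mpi.h>

/* Sketch: each rank owns one segment of a shared array but also reads the
   segment of its left neighbour. `win` comes from MPI_Win_allocate_shared. */
static int read_left_neighbour(MPI_Win win, int me)
{
    MPI_Aint sz;
    int disp;
    int *left = NULL;

    if (me == 0)
        return 0;                    /* no left neighbour */
    MPI_Win_shared_query(win, me - 1, &sz, &disp, &left);
    return (sz > 0) ? left[0] : 0;   /* guard against a zero-size segment */
}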

Peter 


- Original Message -



Yes, that is what I meant. 

Enclosed is a C example. 
The point is that the code would logically make sense for task 0, but since it 
asks for a segment of size=0, it only gets a null pointer, which cannot be used 
to access the shared parts. 

Peter 

- Original Message -


I think Peter's point is that if 
- the windows uses contiguous memory 
*and* 
- all tasks knows how much memory was allocated by all other tasks in the 
window 
then it could/should be possible to get rid of MPI_Win_shared_query 

that is likely true if no task allocates zero byte. 
now, if a task allocates zero byte, MPI_Win_allocate_shared could return a null 
pointer and hence makes MPI_Win_shared_query usage mandatory. 

in his example, task 0 allocates zero bytes, so he was expecting the returned 
pointer on task zero points to the memory allocated by task 1. 

if "may enable" should be read as "does enable", then returning a null pointer 
can be seen as a bug. 
if "may enable" can be read as "does not always enable", the returning a null 
pointer is compliant with the standard. 

I am clearly not good at reading/interpreting the standard, so using 
MPI_Win_shared_query is my recommended way to get it work. 
(feel free to call it "bulletproof", "overkill", or even "right") 

Cheers, 

Gilles 

On Thursday, February 11, 2016, Jeff Hammond < jeff.scie...@gmail.com > wrote: 





On Wed, Feb 10, 2016 at 8:44 AM, Peter Wind < peter.w...@met.no > wrote: 



I agree that in practice the best practice would be to use Win_shared_query. 

Still I am confused by this part in the documentation: 
"The allocated memory is contiguous across process ranks unless the info key 
alloc_shared_noncontig is specified. Contiguous across process ranks means that 
the first address in the memory segment of process i is consecutive with the 
last address in the memory segment of process i - 1. This may enable the user 
to calculate remote address offsets with local information only." 

Isn't this an encouragement to use the pointer of Win_allocate_shared directly? 





No, it is not. Win_allocate_shared only gives you the pointer to the portion of 
the allocation that is owned by the calling process. If you want to access the 
whole slab, call Win_shared_query(..,rank=0,..) and use the resulting baseptr. 

I attempted to modify your code to be more correct, but I don't know enough 
Fortran to get it right. If you can parse C examples, I'll provide some of 
those. 

Jeff 




Peter 





I don't know about bulletproof, but Win_shared_query is the *only* valid way to 
get the addresses of memory in other processes associated with a window. 

The default for Win_allocate_shared is contiguous memory, but it can and likely 
will be mapped differently into each process, in which case only relative 
offsets are transferrable. 

Jeff 

On Wed, Feb 10, 2016 at 4:19 AM, Gilles Gouaillardet < 
gilles.gouaillar...@gmail.com > wrote: 


Peter, 

The bulletproof way is to use MPI_Win_shared_query after 
MPI_Win_allocate_shared. 
I do not know if current behavior is a bug or a feature... 

Cheers, 

Gilles 


On Wednesday, February 10, 2016, Peter Wind < peter.w...@met.no > wrote: 


Hi, 

Under fortran, MPI_Win_allocate_shared is called with a window size of zero for 
some processes. 
The output pointer is then not valid for these processes (null pointer). 
Did I understood this wrongly? shouldn't the pointers be contiguous, so that 
for a zero sized window, the pointer should point to the start of the segment 
of the next rank? 
The documentation explicitly specifies "size = 0 is valid". 

Attached a small code, where rank=0 allocate a window of size zero. All the 
other ranks get valid pointers, except rank 0. 

Best regards, 
Peter 
___ 
users mailing list 
us...@open-mpi.org 
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/02/28485.php 




___ 
users mailing list 
us...@open-mpi.org 
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
Link to this post: 

Re: [OMPI users] shared memory zero size segment

2016-02-11 Thread Peter Wind
Yes, that is what I meant. 

Enclosed is a C example. 
The point is that the code would logically make sense for task 0, but since it 
asks for a segment of size=0, it only gets a null pointer, which cannot be used 
to access the shared parts. 

Peter 

- Original Message -

> I think Peter's point is that if
> - the windows uses contiguous memory
> *and*
> - all tasks knows how much memory was allocated by all other tasks in the
> window
> then it could/should be possible to get rid of MPI_Win_shared_query

> that is likely true if no task allocates zero byte.
> now, if a task allocates zero byte, MPI_Win_allocate_shared could return a
> null pointer and hence makes MPI_Win_shared_query usage mandatory.

> in his example, task 0 allocates zero bytes, so he was expecting the returned
> pointer on task zero points to the memory allocated by task 1.

> if "may enable" should be read as "does enable", then returning a null
> pointer can be seen as a bug.
> if "may enable" can be read as "does not always enable", the returning a null
> pointer is compliant with the standard.

> I am clearly not good at reading/interpreting the standard, so using
> MPI_Win_shared_query is my recommended way to get it work.
> (feel free to call it "bulletproof", "overkill", or even "right")

> Cheers,

> Gilles

> On Thursday, February 11, 2016, Jeff Hammond < jeff.scie...@gmail.com >
> wrote:

> > On Wed, Feb 10, 2016 at 8:44 AM, Peter Wind < peter.w...@met.no > wrote:
> 

> > > I agree that in practice the best practice would be to use
> > > Win_shared_query.
> > 
> 

> > > Still I am confused by this part in the documentation:
> > 
> 
> > > "The allocated memory is contiguous across process ranks unless the info
> > > key
> > > alloc_shared_noncontig is specified. Contiguous across process ranks
> > > means
> > > that the first address in the memory segment of process i is consecutive
> > > with the last address in the memory segment of process i - 1. This may
> > > enable the user to calculate remote address offsets with local
> > > information
> > > only."
> > 
> 

> > > Isn't this an encouragement to use the pointer of Win_allocate_shared
> > > directly?
> > 
> 

> > No, it is not. Win_allocate_shared only gives you the pointer to the
> > portion
> > of the allocation that is owned by the calling process. If you want to
> > access the whole slab, call Win_shared_query(..,rank=0,..) and use the
> > resulting baseptr.
> 

> > I attempted to modify your code to be more correct, but I don't know enough
> > Fortran to get it right. If you can parse C examples, I'll provide some of
> > those.
> 

> > Jeff
> 

> > > Peter
> > 
> 

> > > > I don't know about bulletproof, but Win_shared_query is the *only*
> > > > valid
> > > > way
> > > > to get the addresses of memory in other processes associated with a
> > > > window.
> > > 
> > 
> 

> > > > The default for Win_allocate_shared is contiguous memory, but it can
> > > > and
> > > > likely will be mapped differently into each process, in which case only
> > > > relative offsets are transferrable.
> > > 
> > 
> 

> > > > Jeff
> > > 
> > 
> 

> > > > On Wed, Feb 10, 2016 at 4:19 AM, Gilles Gouaillardet <
> > > > gilles.gouaillar...@gmail.com > wrote:
> > > 
> > 
> 

> > > > > Peter,
> > > > 
> > > 
> > 
> 

> > > > > The bulletproof way is to use MPI_Win_shared_query after
> > > > > MPI_Win_allocate_shared.
> > > > 
> > > 
> > 
> 
> > > > > I do not know if current behavior is a bug or a feature...
> > > > 
> > > 
> > 
> 

> > > > > Cheers,
> > > > 
> > > 
> > 
> 

> > > > > Gilles
> > > > 
> > > 
> > 
> 

> > > > > On Wednesday, February 10, 2016, Peter Wind < peter.w...@met.no >
> > > > > wrote:
> > > > 
> > > 
> > 
> 

> > > > > > Hi,
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > Under fortran, MPI_Win_allocate_shared is called with a window size
> > > > > > of
> > > > > > zero
> > > > > > for some processes.
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > The output pointer is then not valid for these processes (null
> > > > > > pointer).
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > Did I understood this wrongly? shouldn't the pointers be
> > > > > > contiguous,
> > > > > > so
> > > > > > that
> > > > > > for a zero sized window, the pointer should point to the start of
> > > > > > the
> > > > > > segment of the next rank?
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > The documentation explicitly specifies "size = 0 is valid".
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > Attached a small code, where rank=0 allocate a window of size zero.
> > > > > > All
> > > > > > the
> > > > > > other ranks get valid pointers, except rank 0.
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > Best regards,
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > Peter
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > ___
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > users mailing list
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > us...@open-mpi.org
> 

Re: [OMPI users] shared memory zero size segment

2016-02-10 Thread Gilles Gouaillardet
I think Peter's point is that if
- the window uses contiguous memory
*and*
- all tasks know how much memory was allocated by all other tasks in the
window
then it could/should be possible to get rid of MPI_Win_shared_query

that is likely true if no task allocates zero bytes.
now, if a task allocates zero bytes, MPI_Win_allocate_shared could return a
null pointer and hence make MPI_Win_shared_query usage mandatory.

in his example, task 0 allocates zero bytes, so he was expecting the
returned pointer on task zero to point to the memory allocated by task 1.

if "may enable" should be read as "does enable", then returning a null
pointer can be seen as a bug.
if "may enable" can be read as "does not always enable", then returning a
null pointer is compliant with the standard.

I am clearly not good at reading/interpreting the standard, so using
MPI_Win_shared_query is my recommended way to get it to work.
(feel free to call it "bulletproof", "overkill", or even "right")

Cheers,

Gilles

On Thursday, February 11, 2016, Jeff Hammond  wrote:

>
>
> On Wed, Feb 10, 2016 at 8:44 AM, Peter Wind  > wrote:
>
>> I agree that in practice the best practice would be to use
>> Win_shared_query.
>>
>> Still I am confused by this part in the documentation:
>> "The allocated memory is contiguous across process ranks unless the info
>> key *alloc_shared_noncontig* is specified. Contiguous across process
>> ranks means that the first address in the memory segment of process i is
>> consecutive with the last address in the memory segment of process i - 1.
>> This may enable the user to calculate remote address offsets with local
>> information only."
>>
>> Isn't this an encouragement to use the pointer of Win_allocate_shared
>> directly?
>>
>>
> No, it is not.  Win_allocate_shared only gives you the pointer to the
> portion of the allocation that is owned by the calling process.  If you
> want to access the whole slab, call Win_shared_query(..,rank=0,..) and use
> the resulting baseptr.
>
> I attempted to modify your code to be more correct, but I don't know
> enough Fortran to get it right.  If you can parse C examples, I'll provide
> some of those.
>
> Jeff
>
>
>> Peter
>>
>> --
>>
>> I don't know about bulletproof, but Win_shared_query is the *only* valid
>> way to get the addresses of memory in other processes associated with a
>> window.
>>
>> The default for Win_allocate_shared is contiguous memory, but it can and
>> likely will be mapped differently into each process, in which case only
>> relative offsets are transferrable.
>>
>> Jeff
>>
>> On Wed, Feb 10, 2016 at 4:19 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com
>> > wrote:
>>
>>> Peter,
>>>
>>> The bulletproof way is to use MPI_Win_shared_query after
>>> MPI_Win_allocate_shared.
>>> I do not know if current behavior is a bug or a feature...
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Wednesday, February 10, 2016, Peter Wind >> > wrote:
>>>
 Hi,

 Under fortran, MPI_Win_allocate_shared is called with a window size of
 zero for some processes.
 The output pointer is then not valid for these processes (null pointer).
 Did I understood this wrongly? shouldn't the pointers be contiguous, so
 that for a zero sized window, the pointer should point to the start of the
 segment of the next rank?
 The documentation explicitly specifies "size = 0 is valid".

 Attached a small code, where rank=0 allocate a window of size zero. All
 the other ranks get valid pointers, except rank 0.

 Best regards,
 Peter
 ___
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post:
 http://www.open-mpi.org/community/lists/users/2016/02/28485.php

>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org 
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2016/02/28493.php
>>>
>>
>>
>>
>> --
>> Jeff Hammond
>> jeff.scie...@gmail.com
>> 
>> http://jeffhammond.github.io/
>>
>> ___
>> users mailing list
>> us...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/02/28496.php
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org 

Re: [OMPI users] shared memory zero size segment

2016-02-10 Thread Jeff Hammond
On Wed, Feb 10, 2016 at 8:44 AM, Peter Wind  wrote:

> I agree that in practice the best practice would be to use
> Win_shared_query.
>
> Still I am confused by this part in the documentation:
> "The allocated memory is contiguous across process ranks unless the info
> key *alloc_shared_noncontig* is specified. Contiguous across process
> ranks means that the first address in the memory segment of process i is
> consecutive with the last address in the memory segment of process i - 1.
> This may enable the user to calculate remote address offsets with local
> information only."
>
> Isn't this an encouragement to use the pointer of Win_allocate_shared
> directly?
>
>
No, it is not.  Win_allocate_shared only gives you the pointer to the
portion of the allocation that is owned by the calling process.  If you
want to access the whole slab, call Win_shared_query(..,rank=0,..) and use
the resulting baseptr.

I attempted to modify your code to be more correct, but I don't know enough
Fortran to get it right.  If you can parse C examples, I'll provide some of
those.
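
A minimal C sketch of the rank-0 query (illustrative names; the default
contiguous layout is assumed, i.e. alloc_shared_noncontig is not set):

#include <mpi.h>

/* Sketch: get the base of the whole contiguous allocation by querying rank 0.
   If rank 0 contributed size = 0, MPI_PROC_NULL can be queried instead, as
   Nathan points out elsewhere in the thread. */
static int *whole_slab(MPI_Win win)
{
    MPI_Aint size0;
    int disp0;
    int *slab = NULL;

    MPI_Win_shared_query(win, 0, &size0, &disp0, &slab);
    return slab;
}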

Jeff


> Peter
>
> --
>
> I don't know about bulletproof, but Win_shared_query is the *only* valid
> way to get the addresses of memory in other processes associated with a
> window.
>
> The default for Win_allocate_shared is contiguous memory, but it can and
> likely will be mapped differently into each process, in which case only
> relative offsets are transferrable.
>
> Jeff
>
> On Wed, Feb 10, 2016 at 4:19 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Peter,
>>
>> The bulletproof way is to use MPI_Win_shared_query after
>> MPI_Win_allocate_shared.
>> I do not know if current behavior is a bug or a feature...
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Wednesday, February 10, 2016, Peter Wind  wrote:
>>
>>> Hi,
>>>
>>> Under fortran, MPI_Win_allocate_shared is called with a window size of
>>> zero for some processes.
>>> The output pointer is then not valid for these processes (null pointer).
>>> Did I understood this wrongly? shouldn't the pointers be contiguous, so
>>> that for a zero sized window, the pointer should point to the start of the
>>> segment of the next rank?
>>> The documentation explicitly specifies "size = 0 is valid".
>>>
>>> Attached a small code, where rank=0 allocate a window of size zero. All
>>> the other ranks get valid pointers, except rank 0.
>>>
>>> Best regards,
>>> Peter
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2016/02/28485.php
>>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/02/28493.php
>>
>
>
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28496.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28497.php
>



-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


sharetest.f90
Description: Binary data


Re: [OMPI users] shared memory zero size segment

2016-02-10 Thread Peter Wind
I agree that in practice the best approach would be to use Win_shared_query.

Still I am confused by this part in the documentation: 
"The allocated memory is contiguous across process ranks unless the info key 
alloc_shared_noncontig is specified. Contiguous across process ranks means that 
the first address in the memory segment of process i is consecutive with the 
last address in the memory segment of process i - 1. This may enable the user 
to calculate remote address offsets with local information only." 

Isn't this an encouragement to use the pointer of Win_allocate_shared directly? 

Peter 

- Original Message -

> I don't know about bulletproof, but Win_shared_query is the *only* valid way
> to get the addresses of memory in other processes associated with a window.

> The default for Win_allocate_shared is contiguous memory, but it can and
> likely will be mapped differently into each process, in which case only
> relative offsets are transferrable.

> Jeff

> On Wed, Feb 10, 2016 at 4:19 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com > wrote:

> > Peter,
> 

> > The bulletproof way is to use MPI_Win_shared_query after
> > MPI_Win_allocate_shared.
> 
> > I do not know if current behavior is a bug or a feature...
> 

> > Cheers,
> 

> > Gilles
> 

> > On Wednesday, February 10, 2016, Peter Wind < peter.w...@met.no > wrote:
> 

> > > Hi,
> > 
> 

> > > Under fortran, MPI_Win_allocate_shared is called with a window size of
> > > zero
> > > for some processes.
> > 
> 
> > > The output pointer is then not valid for these processes (null pointer).
> > 
> 
> > > Did I understood this wrongly? shouldn't the pointers be contiguous, so
> > > that
> > > for a zero sized window, the pointer should point to the start of the
> > > segment of the next rank?
> > 
> 
> > > The documentation explicitly specifies "size = 0 is valid".
> > 
> 

> > > Attached a small code, where rank=0 allocate a window of size zero. All
> > > the
> > > other ranks get valid pointers, except rank 0.
> > 
> 

> > > Best regards,
> > 
> 
> > > Peter
> > 
> 
> > > ___
> > 
> 
> > > users mailing list
> > 
> 
> > > us...@open-mpi.org
> > 
> 
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > 
> 
> > > Link to this post:
> > > http://www.open-mpi.org/community/lists/users/2016/02/28485.php
> > 
> 

> > ___
> 
> > users mailing list
> 
> > us...@open-mpi.org
> 
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> > Link to this post:
> > http://www.open-mpi.org/community/lists/users/2016/02/28493.php
> 

> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/

> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28496.php

Re: [OMPI users] shared memory zero size segment

2016-02-10 Thread Jeff Hammond
I don't know about bulletproof, but Win_shared_query is the *only* valid
way to get the addresses of memory in other processes associated with a
window.

The default for Win_allocate_shared is contiguous memory, but it can and
likely will be mapped differently into each process, in which case only
relative offsets are transferrable.
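
A short sketch of what "relative offsets" means in C (hypothetical helper;
assumes the default contiguous layout):

#include <mpi.h>
#include <stddef.h>

/* Sketch: absolute addresses are process-local, but the offset of a peer's
   segment from the start of the window is the same in every process. */
static ptrdiff_t segment_offset(MPI_Win win, int peer)
{
    MPI_Aint sz, sz0;
    int disp, disp0;
    char *base = NULL, *base0 = NULL;

    MPI_Win_shared_query(win, peer, &sz, &disp, &base);  /* peer's segment, my mapping */
    MPI_Win_shared_query(win, 0, &sz0, &disp0, &base0);  /* start of the slab, my mapping */
    return base - base0;   /* transferable between processes */
}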

Jeff

On Wed, Feb 10, 2016 at 4:19 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Peter,
>
> The bulletproof way is to use MPI_Win_shared_query after
> MPI_Win_allocate_shared.
> I do not know if current behavior is a bug or a feature...
>
> Cheers,
>
> Gilles
>
>
> On Wednesday, February 10, 2016, Peter Wind  wrote:
>
>> Hi,
>>
>> Under fortran, MPI_Win_allocate_shared is called with a window size of
>> zero for some processes.
>> The output pointer is then not valid for these processes (null pointer).
>> Did I understood this wrongly? shouldn't the pointers be contiguous, so
>> that for a zero sized window, the pointer should point to the start of the
>> segment of the next rank?
>> The documentation explicitly specifies "size = 0 is valid".
>>
>> Attached a small code, where rank=0 allocate a window of size zero. All
>> the other ranks get valid pointers, except rank 0.
>>
>> Best regards,
>> Peter
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/02/28485.php
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28493.php
>



-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [OMPI users] shared memory zero size segment

2016-02-10 Thread Gilles Gouaillardet
Peter,

The bulletproof way is to use MPI_Win_shared_query after
MPI_Win_allocate_shared.
I do not know if current behavior is a bug or a feature...

Cheers,

Gilles

On Wednesday, February 10, 2016, Peter Wind  wrote:

> Hi,
>
> Under Fortran, MPI_Win_allocate_shared is called with a window size of
> zero for some processes.
> The output pointer is then not valid for these processes (null pointer).
> Did I understand this wrongly? Shouldn't the pointers be contiguous, so
> that for a zero-sized window, the pointer should point to the start of the
> segment of the next rank?
> The documentation explicitly specifies "size = 0 is valid".
>
> Attached is a small code, where rank=0 allocates a window of size zero. All
> the other ranks get valid pointers, except rank 0.
>
> Best regards,
> Peter
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28485.php
>


Re: [OMPI users] shared memory zero size segment

2016-02-10 Thread Peter Wind
Sorry for that, here is the attachment!

Peter

- Original Message -
> Peter --
> 
> Somewhere along the way, your attachment got lost.  Could you re-send?
> 
> Thanks.
> 
> 
> > On Feb 10, 2016, at 5:56 AM, Peter Wind  wrote:
> > 
> > Hi,
> > 
> > Under Fortran, MPI_Win_allocate_shared is called with a window size of zero
> > for some processes.
> > The output pointer is then not valid for these processes (null pointer).
> > Did I understand this wrongly? Shouldn't the pointers be contiguous, so
> > that for a zero-sized window, the pointer should point to the start of the
> > segment of the next rank?
> > The documentation explicitly specifies "size = 0 is valid".
> > 
> > Attached is a small code, where rank=0 allocates a window of size zero. All the
> > other ranks get valid pointers, except rank 0.
> > 
> > Best regards,
> > Peter
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> > http://www.open-mpi.org/community/lists/users/2016/02/28485.php
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28486.php
> 
program sharetest

! test zero size segment.
! run on at least 3 cpus
! mpirun -np 4 a.out

   use mpi

   use, intrinsic :: iso_c_binding, only : c_ptr, c_f_pointer

   implicit none


   integer, parameter :: nsize = 20
   integer, pointer   :: array(:)
   integer:: num_procs
   integer:: ierr
   integer:: irank, irank_group
   integer:: win
   integer:: disp_unit
   type(c_ptr):: cp1
   type(c_ptr):: cp2

   integer(MPI_ADDRESS_KIND) :: win_size
   integer(MPI_ADDRESS_KIND) :: segment_size

   call MPI_Init(ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, irank, ierr)

   disp_unit = sizeof(1)

   win_size = irank*disp_unit

   call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, MPI_COMM_WORLD, cp1, win, ierr)

!   write(*,*)'rank ', irank,', pointer ',cp1

  call c_f_pointer(cp1, array, [nsize])

77 format(4(A,I3))

   if(irank/=0)then
  array(1)=irank
  CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
  if(irank/=num_procs-1)then
 print 77, ' rank', irank, ':  array(1)', array(1),' shared with next rank: ',array(irank+1)
  else
 print 77, ' rank', irank, ':  array(1)', array(1),' shared with previous rank: ',array(0)
  endif
  CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
   else
 CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
  CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
 if(.not.associated(array))then
print 77, 'zero pointer found, rank', irank
 else
print 77, ' rank', irank, ' array associated '
print 77, ' rank', irank, ':  array(1) ', array(1),' shared with next rank: ',array(irank+1)
 endif
   endif


   call MPI_Finalize(ierr)

 end program sharetest


Re: [OMPI users] shared memory zero size segment

2016-02-10 Thread Peter Wind


- Original Message -
> Peter --
> 
> Somewhere along the way, your attachment got lost.  Could you re-send?
> 
> Thanks.
> 
> 
> > On Feb 10, 2016, at 5:56 AM, Peter Wind  wrote:
> > 
> > Hi,
> > 
> > Under Fortran, MPI_Win_allocate_shared is called with a window size of zero
> > for some processes.
> > The output pointer is then not valid for these processes (null pointer).
> > Did I understand this wrongly? Shouldn't the pointers be contiguous, so
> > that for a zero-sized window, the pointer should point to the start of the
> > segment of the next rank?
> > The documentation explicitly specifies "size = 0 is valid".
> > 
> > Attached is a small code, where rank=0 allocates a window of size zero. All the
> > other ranks get valid pointers, except rank 0.
> > 
> > Best regards,
> > Peter
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> > http://www.open-mpi.org/community/lists/users/2016/02/28485.php
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28486.php
> 
program sharetest

! test zero size segment.
! run on at least 3 cpus
! mpirun -np 4 a.out

   use mpi

   use, intrinsic :: iso_c_binding, only : c_ptr, c_f_pointer

   implicit none


   integer, parameter :: nsize = 20
   integer, pointer   :: array(:)
   integer:: num_procs
   integer:: ierr
   integer:: irank, irank_group
   integer:: win
   integer:: disp_unit
   type(c_ptr):: cp1
   type(c_ptr):: cp2

   integer(MPI_ADDRESS_KIND) :: win_size
   integer(MPI_ADDRESS_KIND) :: segment_size

   call MPI_Init(ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, irank, ierr)

   disp_unit = sizeof(1)

   win_size = irank*disp_unit

   call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, MPI_COMM_WORLD, cp1, win, ierr)

!   write(*,*)'rank ', irank,', pointer ',cp1

  call c_f_pointer(cp1, array, [nsize])

77 format(4(A,I3))

   if(irank/=0)then
  array(1)=irank
  CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
  if(irank/=num_procs-1)then
 print 77, ' rank', irank, ':  array(1)', array(1),' shared with next rank: ',array(irank+1)
  else
 print 77, ' rank', irank, ':  array(1)', array(1),' shared with previous rank: ',array(0)
  endif
  CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
   else
 CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
  CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
 if(.not.associated(array))then
print 77, 'zero pointer found, rank', irank
 else
print 77, ' rank', irank, ' array associated '
print 77, ' rank', irank, ':  array(1) ', array(1),' shared with next rank: ',array(irank+1)
 endif
   endif


   call MPI_Finalize(ierr)

 end program sharetest


Re: [OMPI users] shared memory zero size segment

2016-02-10 Thread Jeff Squyres (jsquyres)
Peter --

Somewhere along the way, your attachment got lost.  Could you re-send?

Thanks.


> On Feb 10, 2016, at 5:56 AM, Peter Wind  wrote:
> 
> Hi,
> 
> Under Fortran, MPI_Win_allocate_shared is called with a window size of zero
> for some processes.
> The output pointer is then not valid for these processes (null pointer).
> Did I understand this wrongly? Shouldn't the pointers be contiguous, so that
> for a zero-sized window, the pointer should point to the start of the segment
> of the next rank?
> The documentation explicitly specifies "size = 0 is valid".
> 
> Attached is a small code, where rank=0 allocates a window of size zero. All the
> other ranks get valid pointers, except rank 0.
> 
> Best regards,
> Peter
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/02/28485.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] shared memory zero size segment

2016-02-10 Thread Peter Wind
Hi,

Under Fortran, MPI_Win_allocate_shared is called with a window size of zero for 
some processes.
The output pointer is then not valid for these processes (null pointer).
Did I understand this wrongly? Shouldn't the pointers be contiguous, so that 
for a zero-sized window, the pointer should point to the start of the segment 
of the next rank?
The documentation explicitly specifies "size = 0 is valid".

Attached is a small code, where rank=0 allocates a window of size zero. All the 
other ranks get valid pointers, except rank 0.

Best regards,
Peter


Re: [OMPI users] shared memory under fortran, bug?

2016-02-03 Thread Gilles Gouaillardet

Peter,

a patch is available at 
https://github.com/ggouaillardet/ompi-release/commit/0b62eabcae403b95274ce55973a7ce29483d0c98.patch


it is now under review

Cheers,

Gilles

On 2/2/2016 11:22 PM, Gilles Gouaillardet wrote:

Thanks Peter,

this is just a workaround for a bug we just identified, the fix will 
come soon


Cheers,

Gilles

On Tuesday, February 2, 2016, Peter Wind wrote:


That worked!

i.e. with the changes you proposed, the code gives the right result.

That was efficient work, thank you Gilles :)

Best wishes,
Peter




Thanks Peter,

that is quite unexpected ...

let s try an other workaround, can you replace

integer:: comm_group

with

integer:: comm_group, comm_tmp

and

call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_group, ierr)

with

call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)

if (irank < (num_procs/2)) then

 comm_group = comm_tmp

else

 call MPI_Comm_dup(comm_tmp, comm_group, ierr)

endif

if it works, I will make a fix tomorrow when I can access my
workstation.
if not, can you please run
mpirun --mca osc_base_verbose 100 ...
and post the output ?

I will then try to reproduce the issue and investigate it

Cheers,

Gilles

On Tuesday, February 2, 2016, Peter Wind 
wrote:

Thanks Gilles,

I get the following output (I guess it is not what you
wanted?).

Peter


$ mpirun --mca osc pt2pt -np 4 a.out

--
A requested component was not found, or was unable to be
opened.  This
means that this component is either not installed or is
unable to be
used on your system (e.g., sometimes this means that
shared libraries
that the component requires are unable to be
found/loaded).  Note that
Open MPI stopped checking at the first component that it
did not find.

Host:  stallo-2.local
Framework: osc
Component: pt2pt

--

--
It looks like MPI_INIT failed for some reason; your
parallel process is
likely to abort.  There are many reasons that a parallel
process can
fail during MPI_INIT; some of which are due to
configuration or environment
problems.  This failure appears to be an internal failure;
here's some
additional information (which may only be relevant to an
Open MPI
developer):

  ompi_osc_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
***and potentially your MPI job)
[stallo-2.local:38415] Local abort before MPI_INIT
completed successfully; not able to aggregate error
messages, and not able to guarantee that all other
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
***and potentially your MPI job)
[stallo-2.local:38418] Local abort before MPI_INIT
completed successfully; not able to aggregate error
messages, and not able to guarantee that all other
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
***and potentially your MPI job)
[stallo-2.local:38416] Local abort before MPI_INIT
completed successfully; not able to aggregate error
messages, and not able to guarantee that all other
processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
***and potentially your MPI job)
[stallo-2.local:38417] 

Re: [OMPI users] shared memory under fortran, bug?

2016-02-02 Thread Gilles Gouaillardet
Thanks Peter,

this is just a workaround for a bug we just identified, the fix will come
soon

Cheers,

Gilles
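
For reference, the workaround being referred to here (spelled out in the quoted
exchange below) boils down to letting the upper half of the ranks use a
duplicate of the communicator returned by MPI_Comm_split, so the two groups no
longer end up on the same shared-memory segment. Condensed, with the variable
names used in Peter's shmem_mpi test program (comm, irank, num_procs and ierr
declared as there):

   integer :: comm_group, comm_tmp

   ! split as before, but give the second group a duplicated communicator
   call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)
   if (irank < (num_procs/2)) then
      comm_group = comm_tmp
   else
      call MPI_Comm_dup(comm_tmp, comm_group, ierr)
   endif

   ! comm_group is then passed to MPI_Win_allocate_shared exactly as before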

On Tuesday, February 2, 2016, Peter Wind  wrote:

> That worked!
>
> i.e. with the changes you proposed, the code gives the right result.
>
> That was efficient work, thank you Gilles :)
>
> Best wishes,
> Peter
>
>
> --
>
> Thanks Peter,
>
> that is quite unexpected ...
>
> let s try an other workaround, can you replace
>
> integer:: comm_group
>
> with
>
> integer:: comm_group, comm_tmp
>
>
> and
>
> call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_group, ierr)
>
>
> with
>
>
> call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_tmp, ierr)
>
> if (irank < (num_procs/2)) then
>
> comm_group = comm_tmp
>
> else
>
> call MPI_Comm_dup(comm_tmp, comm_group, ierr)
>
> endif
>
>
>
> if it works, I will make a fix tomorrow when I can access my workstation.
> if not, can you please run
> mpirun --mca osc_base_verbose 100 ...
> and post the output ?
>
> I will then try to reproduce the issue and investigate it
>
> Cheers,
>
> Gilles
>
> On Tuesday, February 2, 2016, Peter Wind  wrote:
>
>> Thanks Gilles,
>>
>> I get the following output (I guess it is not what you wanted?).
>>
>> Peter
>>
>>
>> $ mpirun --mca osc pt2pt -np 4 a.out
>> --
>> A requested component was not found, or was unable to be opened.  This
>> means that this component is either not installed or is unable to be
>> used on your system (e.g., sometimes this means that shared libraries
>> that the component requires are unable to be found/loaded).  Note that
>> Open MPI stopped checking at the first component that it did not find.
>>
>> Host:  stallo-2.local
>> Framework: osc
>> Component: pt2pt
>> --
>> --
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   ompi_osc_base_open() failed
>>   --> Returned "Not found" (-13) instead of "Success" (0)
>> --
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***and potentially your MPI job)
>> [stallo-2.local:38415] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***and potentially your MPI job)
>> [stallo-2.local:38418] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***and potentially your MPI job)
>> [stallo-2.local:38416] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***and potentially your MPI job)
>> [stallo-2.local:38417] Local abort before MPI_INIT completed
>> successfully; not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>> ---
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> ---
>> --
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing
>> the job to be terminated. The first process to do so was:
>>
>>   Process name: [[52507,1],0]
>>   Exit code:1
>> --
>> [stallo-2.local:38410] 3 more processes have sent help message
>> help-mca-base.txt / find-available:not-valid
>> [stallo-2.local:38410] Set MCA parameter "orte_base_help_aggregate" to 0
>> to see all help / error messages
>> [stallo-2.local:38410] 2 more processes have sent help message
>> help-mpi-runtime 

Re: [OMPI users] shared memory under fortran, bug?

2016-02-02 Thread Peter Wind
Thanks Gilles, 

I get the following output (I guess it is not what you wanted?). 

Peter 

$ mpirun --mca osc pt2pt -np 4 a.out 
-- 
A requested component was not found, or was unable to be opened. This 
means that this component is either not installed or is unable to be 
used on your system (e.g., sometimes this means that shared libraries 
that the component requires are unable to be found/loaded). Note that 
Open MPI stopped checking at the first component that it did not find. 

Host: stallo-2.local 
Framework: osc 
Component: pt2pt 
-- 
-- 
It looks like MPI_INIT failed for some reason; your parallel process is 
likely to abort. There are many reasons that a parallel process can 
fail during MPI_INIT; some of which are due to configuration or environment 
problems. This failure appears to be an internal failure; here's some 
additional information (which may only be relevant to an Open MPI 
developer): 

ompi_osc_base_open() failed 
--> Returned "Not found" (-13) instead of "Success" (0) 
-- 
*** An error occurred in MPI_Init 
*** on a NULL communicator 
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, 
*** and potentially your MPI job) 
[stallo-2.local:38415] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed! 
*** An error occurred in MPI_Init 
*** on a NULL communicator 
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, 
*** and potentially your MPI job) 
[stallo-2.local:38418] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed! 
*** An error occurred in MPI_Init 
*** on a NULL communicator 
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, 
*** and potentially your MPI job) 
[stallo-2.local:38416] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed! 
*** An error occurred in MPI_Init 
*** on a NULL communicator 
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, 
*** and potentially your MPI job) 
[stallo-2.local:38417] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all other 
processes were killed! 
--- 
Primary job terminated normally, but 1 process returned 
a non-zero exit code.. Per user-direction, the job has been aborted. 
--- 
-- 
mpirun detected that one or more processes exited with non-zero status, thus 
causing 
the job to be terminated. The first process to do so was: 

Process name: [[52507,1],0] 
Exit code: 1 
-- 
[stallo-2.local:38410] 3 more processes have sent help message 
help-mca-base.txt / find-available:not-valid 
[stallo-2.local:38410] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
all help / error messages 
[stallo-2.local:38410] 2 more processes have sent help message help-mpi-runtime 
/ mpi_init:startup:internal-failure 

- Original Message -

> Peter,

> at first glance, your test program looks correct.

> can you please try to run
> mpirun --mca osc pt2pt -np 4 ...

> I might have identified a bug with the sm osc component.

> Cheers,

> Gilles

> On Tuesday, February 2, 2016, Peter Wind < peter.w...@met.no > wrote:

> > Enclosed is a short (< 100 lines) fortran code example that uses shared
> > memory.
> 
> > It seems to me it behaves wrongly if openmpi is used.
> 
> > Compiled with SGI/mpt , it gives the right result.
> 

> > To fail, the code must be run on a single node.
> 
> > It creates two groups of 2 processes each. Within each group memory is
> > shared.
> 
> > The error is that the two groups get the same memory allocated, but they
> > should not.
> 

> > Tested with openmpi 1.8.4, 1.8.5, 1.10.2 and gfortran, intel 13.0, intel
> > 14.0
> 
> > all fail.
> 

> > The call:
> 
> > call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL,
> > comm_group,
> > cp1, win, ierr)
> 

> > Should allocate memory only within the group. But when the other group
> > allocates memory, the pointers from the two groups point to the same
> > address
> > in memory.
> 

> > Could you please confirm that this is the wrong behaviour?
> 

> > Best regards,
> 
> > Peter Wind
> 
> ___
> users mailing list
> 

Re: [OMPI users] shared memory under fortran, bug?

2016-02-02 Thread Gilles Gouaillardet
Peter,

at first glance, your test program looks correct.

can you please try to run
mpirun --mca osc pt2pt -np 4 ...

I  might have identified a bug with the sm osc component.

Cheers,

Gilles

On Tuesday, February 2, 2016, Peter Wind  wrote:

> Enclosed is a short (< 100 lines) fortran code example that uses shared
> memory.
> It seems to me it behaves wrongly if openmpi is used.
> Compiled with SGI/mpt , it gives the right result.
>
> To fail, the code must be run on a single node.
> It creates two groups of 2 processes each. Within each group memory is
> shared.
> The error is that the two groups get the same memory allocated, but they
> should not.
>
> Tested with openmpi 1.8.4, 1.8.5, 1.10.2 and gfortran, intel 13.0, intel
> 14.0
> all fail.
>
> The call:
>call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL,
> comm_group, cp1, win, ierr)
>
> Should allocate memory only within the group. But when the other group
> allocates memory, the pointers from the two groups point to the same
> address in memory.
>
> Could you please confirm that this is the wrong behaviour?
>
> Best regards,
> Peter Wind


[OMPI users] shared memory under fortran, bug?

2016-02-02 Thread Peter Wind
Enclosed is a short (< 100 lines) Fortran code example that uses shared memory.
It seems to me it behaves wrongly if Open MPI is used.
Compiled with SGI/MPT, it gives the right result.

To fail, the code must be run on a single node.
It creates two groups of 2 processes each. Within each group memory is shared.
The error is that the two groups get the same memory allocated, but they should 
not.

Tested with openmpi 1.8.4, 1.8.5, 1.10.2 and gfortran, intel 13.0, intel 14.0
all fail.

The call:
   call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, comm_group, 
cp1, win, ierr)

Should allocate memory only within the group. But when the other group 
allocates memory, the pointers from the two groups point to the same address in 
memory.

Could you please confirm that this is the wrong behaviour? 

Best regards,
Peter Wind

program shmem_mpi

   !
   ! in this example two groups are created, within each group memory is shared.
   ! Still the other group gets allocated the same address space, which it shouldn't.
   !
   ! Run with 4 processes, mpirun -np 4 a.out


   use mpi

   use, intrinsic :: iso_c_binding, only : c_ptr, c_f_pointer

   implicit none
!   include 'mpif.h'

   integer, parameter :: nsize = 100
   integer, pointer   :: array(:)
   integer:: num_procs
   integer:: ierr
   integer:: irank, irank_group
   integer:: win
   integer:: comm = MPI_COMM_WORLD
   integer:: disp_unit
   type(c_ptr):: cp1
   type(c_ptr):: cp2
   integer:: comm_group

   integer(MPI_ADDRESS_KIND) :: win_size
   integer(MPI_ADDRESS_KIND) :: segment_size

   call MPI_Init(ierr)
   call MPI_Comm_size(comm, num_procs, ierr)
   call MPI_Comm_rank(comm, irank, ierr)

   disp_unit = sizeof(1)
   call MPI_COMM_SPLIT(comm, irank*2/num_procs, irank, comm_group, ierr)
   call MPI_Comm_rank(comm_group, irank_group, ierr)
!   print *, 'irank=', irank, ' group rank=', irank_group

   if (irank_group == 0) then
  win_size = nsize*disp_unit
   else
  win_size = 0
   endif

   call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, comm_group, cp1, win, ierr)
   call MPI_Win_fence(0, win, ierr)

   call MPI_Win_shared_query(win, 0, segment_size, disp_unit, cp2, ierr)

   call MPI_Win_fence(0, win, ierr)
   CALL MPI_BARRIER(comm, ierr)! allocations finished
!   print *, 'irank=', irank, ' size ', segment_size

   call c_f_pointer(cp2, array, [nsize])

   array(1)=0;array(2)=0
   CALL MPI_BARRIER(comm, ierr)!
77 format(4(A,I3))
   if(irank

Re: [OMPI users] shared memory performance

2015-07-24 Thread Gilles Gouaillardet
Cristian,

one more thing...
two containers on the same host cannot communicate with the sm btl.
you might want to mpirun with --mca btl tcp,self on one physical machine
without a container,
in order to assess the performance degradation due to using the tcp btl
without any containerization effect.
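
Concretely, reusing the single-machine command line from earlier in this
thread, that check would look something like (binary and the remaining options
exactly as already used there):

mpirun -np 8 --mca btl tcp,self --allow-run-as-root mg.C.8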

Cheers,

Gilles

On Friday, July 24, 2015, Harald Servat  wrote:

> Dear Cristian,
>
>   according to your configuration:
>
>   a) - 8 Linux containers on the same machine configured with 2 cores
>   b) - 8 physical machines
>   c) - 1 physical machine
>
>   a) and c) have exactly the same physical computational resources despite
> the fact that a) is being virtualized and the processors are oversubscribed
> (2 virtual cores per physical core). I'm not an expert on virtualization,
> but since a) and c) are exactly the same hardware (in terms of core and
> memory hierarchy), and your application seems memory bounded, I'd expect to
> see what you tabulated and b) is faster because you have 8 times the memory
> cache.
>
> Regards
> PS Your name in the mail is different, maybe you'd like to fix it.
>
> On 22/07/15 10:42, Crisitan RUIZ wrote:
>
>> Thank you for your answer Harald
>>
>> Actually I was already using TAU before but it seems that it is not
>> maintained any more and there are problems when instrumenting
>> applications with the version 1.8.5 of OpenMPI.
>>
>> I was using the OpenMPI 1.6.5 before to test the execution of HPC
>> application on Linux containers. I tested the performance of NAS
>> benchmarks in three different configurations:
>>
>> - 8 Linux containers on the same machine configured with 2 cores
>> - 8 physical machines
>> - 1 physical machine
>>
>> So, as I already described it, each machine counts with 2 processors (8
>> cores each). I instrumented and run all NAS benchmark in these three
>> configurations and I got the results that I attached in this email.
>> In the table "native" corresponds to using 8 physical machines and "SM"
>> corresponds to 1 physical machine using the sm module, time is given in
>> miliseconds.
>>
>> What surprise me in the results is that using containers in the worse
>> case have equal communication time than just using plain mpi processes.
>> Even though the containers use virtual network interfaces to communicate
>> between them. Probably this behaviour is due to process binding and
>> locality, I wanted to redo the test using OpenMPI version 1.8.5 but
>> unfourtunately I couldn't sucessfully instrument the applications. I was
>> looking for another MPI profiler but I couldn't find any. HPCToolkit
>> looks like it is not maintain anymore, Vampir does not maintain any more
>> the tool that instrument the application.  I will probably give a try to
>> Paraver.
>>
>>
>>
>> Best regards,
>>
>> Cristian Ruiz
>>
>>
>>
>> On 07/22/2015 09:44 AM, Harald Servat wrote:
>>
>>>
>>> Cristian,
>>>
>>>   you might observe super-speedup heres because in 8 nodes you have 8
>>> times the cache you have in only 1 node. You can also validate that by
>>> checking for cache miss activity using the tools that I mentioned in
>>> my other email.
>>>
>>> Best regards.
>>>
>>> On 22/07/15 09:42, Crisitan RUIZ wrote:
>>>
 Sorry, I've just discovered that I was using the wrong command to run on
 8 machines. I have to get rid of the "-np 8"

 So, I corrected the command and I used:

 mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
 --allow-run-as-root mg.C.8

 And got these results

 8 cores:

 Mop/s total = 19368.43


 8 machines

 Mop/s total = 96094.35


 Why is the performance of mult-node almost 4 times better than
 multi-core? Is this normal behavior?

 On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:

>
>  Hello,
>
> I'm running OpenMPI 1.8.5 on a cluster with the following
> characteristics:
>
> Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
> cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.
>
> When I run the NAS benchmarks using 8 cores in the same machine, I'm
> getting almost the same performance as using 8 machines running a mpi
> process per machine.
>
> I used the following commands:
>
> for running multi-node:
>
> mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
> --allow-run-as-root mg.C.8
>
> for running in with 8 cores:
>
> mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8
>
>
> I got the following results:
>
> 8 cores:
>
>  Mop/s total = 19368.43
>
> 8 machines:
>
> Mop/s total = 19326.60
>
>
> The results are similar for other benchmarks. Is this behavior normal?
> I was expecting to see a better behavior using 8 cores given that they
> use 

Re: [OMPI users] shared memory performance

2015-07-24 Thread Harald Servat

Dear Cristian,

  according to your configuration:

  a) - 8 Linux containers on the same machine configured with 2 cores
  b) - 8 physical machines
  c) - 1 physical machine

  a) and c) have exactly the same physical computational resources 
despite the fact that a) is being virtualized and the processors are 
oversubscribed (2 virtual cores per physical core). I'm not an expert on 
virtualization, but since a) and c) are exactly the same hardware (in 
terms of core and memory hierarchy), and your application seems memory 
bounded, I'd expect to see what you tabulated and b) is faster because 
you have 8 times the memory cache.


Regards
PS Your name in the mail is different, maybe you'd like to fix it.

On 22/07/15 10:42, Crisitan RUIZ wrote:

Thank you for your answer Harald

Actually I was already using TAU before but it seems that it is not
maintained any more and there are problems when instrumenting
applications with the version 1.8.5 of OpenMPI.

I was using the OpenMPI 1.6.5 before to test the execution of HPC
application on Linux containers. I tested the performance of NAS
benchmarks in three different configurations:

- 8 Linux containers on the same machine configured with 2 cores
- 8 physical machines
- 1 physical machine

So, as I already described it, each machine counts with 2 processors (8
cores each). I instrumented and run all NAS benchmark in these three
configurations and I got the results that I attached in this email.
In the table "native" corresponds to using 8 physical machines and "SM"
corresponds to 1 physical machine using the sm module, time is given in
miliseconds.

What surprise me in the results is that using containers in the worse
case have equal communication time than just using plain mpi processes.
Even though the containers use virtual network interfaces to communicate
between them. Probably this behaviour is due to process binding and
locality, I wanted to redo the test using OpenMPI version 1.8.5 but
unfourtunately I couldn't sucessfully instrument the applications. I was
looking for another MPI profiler but I couldn't find any. HPCToolkit
looks like it is not maintain anymore, Vampir does not maintain any more
the tool that instrument the application.  I will probably give a try to
Paraver.



Best regards,

Cristian Ruiz



On 07/22/2015 09:44 AM, Harald Servat wrote:


Cristian,

  you might observe super-speedup heres because in 8 nodes you have 8
times the cache you have in only 1 node. You can also validate that by
checking for cache miss activity using the tools that I mentioned in
my other email.

Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on
8 machines. I have to get rid of the "-np 8"

So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of mult-node almost 4 times better than
multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following
characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running a mpi
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running in with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal?
I was expecting to see a better behavior using 8 cores given that they
use directly the memory to communicate.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27297.php




Re: [OMPI users] shared memory performance

2015-07-22 Thread David Shrader

Hello Cristian,

TAU is still under active development and the developers respond fairly 
fast to emails. The latest version, 2.24.1, came out just two months 
ago. Check out https://www.cs.uoregon.edu/research/tau/home.php for more 
information.


If you are running in to issues getting the latest version of TAU to 
work with Open MPI 1.8.x, check out the "Contact" page from the above 
URL. The developers are very helpful.


Thanks,
David

On 07/22/2015 02:42 AM, Crisitan RUIZ wrote:

Thank you for your answer Harald

Actually I was already using TAU before but it seems that it is not 
maintained any more and there are problems when instrumenting 
applications with the version 1.8.5 of OpenMPI.


I was using the OpenMPI 1.6.5 before to test the execution of HPC 
application on Linux containers. I tested the performance of NAS 
benchmarks in three different configurations:


- 8 Linux containers on the same machine configured with 2 cores
- 8 physical machines
- 1 physical machine

So, as I already described it, each machine counts with 2 processors 
(8 cores each). I instrumented and run all NAS benchmark in these 
three configurations and I got the results that I attached in this email.
In the table "native" corresponds to using 8 physical machines and 
"SM" corresponds to 1 physical machine using the sm module, time is 
given in miliseconds.


What surprise me in the results is that using containers in the worse 
case have equal communication time than just using plain mpi 
processes. Even though the containers use virtual network interfaces 
to communicate between them. Probably this behaviour is due to process 
binding and locality, I wanted to redo the test using OpenMPI version 
1.8.5 but unfourtunately I couldn't sucessfully instrument the 
applications. I was looking for another MPI profiler but I couldn't 
find any. HPCToolkit looks like it is not maintain anymore, Vampir 
does not maintain any more the tool that instrument the application.  
I will probably give a try to Paraver.




Best regards,

Cristian Ruiz



On 07/22/2015 09:44 AM, Harald Servat wrote:


Cristian,

  you might observe super-speedup heres because in 8 nodes you have 8 
times the cache you have in only 1 node. You can also validate that 
by checking for cache miss activity using the tools that I mentioned 
in my other email.


Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:
Sorry, I've just discovered that I was using the wrong command to 
run on

8 machines. I have to get rid of the "-np 8"

So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of mult-node almost 4 times better than
multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following
characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running a mpi
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running in with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal?
I was expecting to see a better behavior using 8 cores given that they
use directly the memory to communicate.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27297.php





Re: [OMPI users] shared memory performance

2015-07-22 Thread Gus Correa

Hi Christian, list

I haven't been following the shared memory details of OMPI lately,
but my recollection from some time ago is that in the 1.8 series the
default (and recommended) shared memory transport btl switched from
"sm" to "vader", which is the latest greatest.

In this case, I guess the mpirun options would be:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,vader,tcp


I am not even sure if with "vader" the "self" btl is needed,
as it was the case with "sm".

An OMPI developer could jump into this conversation and clarify.
Thank you.

I hope this helps,
Gus Correa


On 07/22/2015 04:42 AM, Crisitan RUIZ wrote:

Thank you for your answer Harald

Actually I was already using TAU before but it seems that it is not
maintained any more and there are problems when instrumenting
applications with the version 1.8.5 of OpenMPI.

I was using the OpenMPI 1.6.5 before to test the execution of HPC
application on Linux containers. I tested the performance of NAS
benchmarks in three different configurations:

- 8 Linux containers on the same machine configured with 2 cores
- 8 physical machines
- 1 physical machine

So, as I already described it, each machine counts with 2 processors (8
cores each). I instrumented and run all NAS benchmark in these three
configurations and I got the results that I attached in this email.
In the table "native" corresponds to using 8 physical machines and "SM"
corresponds to 1 physical machine using the sm module, time is given in
miliseconds.

What surprise me in the results is that using containers in the worse
case have equal communication time than just using plain mpi processes.
Even though the containers use virtual network interfaces to communicate
between them. Probably this behaviour is due to process binding and
locality, I wanted to redo the test using OpenMPI version 1.8.5 but
unfourtunately I couldn't sucessfully instrument the applications. I was
looking for another MPI profiler but I couldn't find any. HPCToolkit
looks like it is not maintain anymore, Vampir does not maintain any more
the tool that instrument the application.  I will probably give a try to
Paraver.



Best regards,

Cristian Ruiz



On 07/22/2015 09:44 AM, Harald Servat wrote:


Cristian,

  you might observe super-speedup heres because in 8 nodes you have 8
times the cache you have in only 1 node. You can also validate that by
checking for cache miss activity using the tools that I mentioned in
my other email.

Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on
8 machines. I have to get rid of the "-np 8"

So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of mult-node almost 4 times better than
multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following
characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running a mpi
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running in with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal?
I was expecting to see a better behavior using 8 cores given that they
use directly the memory to communicate.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27297.php




Re: [OMPI users] shared memory performance

2015-07-22 Thread Crisitan RUIZ

Thank you for your answer Harald

Actually I was already using TAU before, but it seems that it is not 
maintained any more and there are problems when instrumenting 
applications with version 1.8.5 of OpenMPI.


I was using OpenMPI 1.6.5 before to test the execution of HPC 
applications on Linux containers. I tested the performance of the NAS 
benchmarks in three different configurations:


- 8 Linux containers on the same machine configured with 2 cores
- 8 physical machines
- 1 physical machine

So, as I already described, each machine has 2 processors (8 
cores each). I instrumented and ran all NAS benchmarks in these three 
configurations, and I got the results that I attached in this email.
In the table, "native" corresponds to using 8 physical machines and "SM" 
corresponds to 1 physical machine using the sm module; time is given in 
milliseconds.


What surprises me in the results is that using containers gives, in the worst 
case, communication times equal to just using plain MPI processes, 
even though the containers use virtual network interfaces to communicate 
between them. Probably this behaviour is due to process binding and 
locality. I wanted to redo the test using OpenMPI version 1.8.5, but 
unfortunately I couldn't successfully instrument the applications. I was 
looking for another MPI profiler but I couldn't find any. HPCToolkit 
looks like it is not maintained anymore, and Vampir no longer maintains 
the tool that instruments the application. I will probably give Paraver 
a try.




Best regards,

Cristian Ruiz
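
As a side note on the process binding and locality point above: Open MPI's
mpirun can print where every rank ends up bound via its --report-bindings
option, which makes it easy to check whether the single-node runs were pinned
the way you expect. For example, reusing the command line from this thread:

mpirun --report-bindings -np 8 --mca btl self,sm,tcp --allow-run-as-root mg.C.8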



On 07/22/2015 09:44 AM, Harald Servat wrote:


Cristian,

  you might observe super-speedup heres because in 8 nodes you have 8 
times the cache you have in only 1 node. You can also validate that by 
checking for cache miss activity using the tools that I mentioned in 
my other email.


Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on
8 machines. I have to get rid of the "-np 8"

So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of mult-node almost 4 times better than
multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following
characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running a mpi
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running in with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal?
I was expecting to see a better behavior using 8 cores given that they
use directly the memory to communicate.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27297.php



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/07/27298.php




Re: [OMPI users] shared memory performance

2015-07-22 Thread Gilles Gouaillardet

Christian,

one explanation could be that the benchmark is memory bound, so running 
on more sockets means higher memory bandwidth and therefore better performance.


another explanation is that on one node, you are running one OpenMP 
thread per MPI task, and on 8 nodes, you are running 8 OpenMP threads 
per task.


Cheers,

Gilles

On 7/22/2015 4:42 PM, Crisitan RUIZ wrote:
Sorry, I've just discovered that I was using the wrong command to run 
on 8 machines. I have to get rid of the "-np 8"


So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp 
--allow-run-as-root mg.C.8


And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of mult-node almost 4 times better than 
multi-core? Is this normal behavior?


On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following 
characteristics:


Each node is equipped with two Intel Xeon E5-2630v3 processors (with 
8 cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.


When I run the NAS benchmarks using 8 cores in the same machine, I'm 
getting almost the same performance as using 8 machines running a mpi 
process per machine.


I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp 
--allow-run-as-root mg.C.8


for running in with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior 
normal? I was expecting to see a better behavior using 8 cores given 
that they use directly the memory to communicate.


Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/07/27297.php







Re: [OMPI users] shared memory performance

2015-07-22 Thread Harald Servat


Cristian,

  you might observe super-speedup here because in 8 nodes you have 8 
times the cache you have in only 1 node. You can also validate that by 
checking for cache miss activity using the tools that I mentioned in my 
other email.


Best regards.

On 22/07/15 09:42, Crisitan RUIZ wrote:

Sorry, I've just discovered that I was using the wrong command to run on
8 machines. I have to get rid of the "-np 8"

So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of mult-node almost 4 times better than
multi-core? Is this normal behavior?

On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following
characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running a mpi
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running in with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal?
I was expecting to see a better behavior using 8 cores given that they
use directly the memory to communicate.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27297.php





Re: [OMPI users] shared memory performance

2015-07-22 Thread Crisitan RUIZ
Sorry, I've just discovered that I was using the wrong command to run on 
8 machines. I have to get rid of the "-np 8"


So, I corrected the command and I used:

mpirun --machinefile machine_mpi_bug.txt --mca btl self,sm,tcp 
--allow-run-as-root mg.C.8


And got these results

8 cores:

Mop/s total = 19368.43


8 machines

Mop/s total = 96094.35


Why is the performance of multi-node almost 4 times better than 
multi-core? Is this normal behavior?


On 07/22/2015 09:19 AM, Crisitan RUIZ wrote:


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following 
characteristics:


Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8 
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.


When I run the NAS benchmarks using 8 cores in the same machine, I'm 
getting almost the same performance as using 8 machines running a mpi 
process per machine.


I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp 
--allow-run-as-root mg.C.8


for running in with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal? 
I was expecting to see a better behavior using 8 cores given that they 
use directly the memory to communicate.


Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/07/27295.php




Re: [OMPI users] shared memory performance

2015-07-22 Thread Harald Servat


Dear Cristian,

  as you probably know, the C class is one of the large problem classes for the NAS 
benchmarks. That is likely to mean that the application spends much 
more time on the actual computation than on communication. This 
could explain why you see so little difference between the two 
executions: the communication is small compared with the rest.


  In order to validate this reasoning, you can profile or trace the 
application using some of the performance tools available out there 
(Vampir, Paraver, TAU, HPCToolkit, Scalasca, ...) and see how the 
communication compares to the computation.


Regards.

On 22/07/15 09:19, Crisitan RUIZ wrote:


  Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.

When I run the NAS benchmarks using 8 cores in the same machine, I'm
getting almost the same performance as using 8 machines running a mpi
process per machine.

I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp
--allow-run-as-root mg.C.8

for running with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

  Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal? I
was expecting better performance from the 8-core run, given that the
processes communicate directly through shared memory.

Thank you,

Cristian
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/07/27295.php





[OMPI users] shared memory performance

2015-07-22 Thread Crisitan RUIZ


 Hello,

I'm running OpenMPI 1.8.5 on a cluster with the following characteristics:

Each node is equipped with two Intel Xeon E5-2630v3 processors (with 8 
cores each), 128 GB of RAM and a 10 Gigabit Ethernet adapter.


When I run the NAS benchmarks using 8 cores on the same machine, I get 
almost the same performance as using 8 machines running one MPI 
process per machine.


I used the following commands:

for running multi-node:

mpirun -np 8 --machinefile machine_file.txt --mca btl self,sm,tcp 
--allow-run-as-root mg.C.8


for running with 8 cores:

mpirun -np 8  --mca btl self,sm,tcp --allow-run-as-root mg.C.8


I got the following results:

8 cores:

 Mop/s total = 19368.43

8 machines:

Mop/s total = 19326.60


The results are similar for other benchmarks. Is this behavior normal? I 
was expecting better performance from the 8-core run, given that the 
processes communicate directly through shared memory.


Thank you,

Cristian


Re: [OMPI users] Shared Memory - Eager VS Rendezvous

2012-05-23 Thread Simone Pellegrini

On 05/23/2012 03:05 PM, Jeff Squyres wrote:

On May 23, 2012, at 6:05 AM, Simone Pellegrini wrote:


If process A sends a message to process B and the eager protocol is used then I 
assume that the message is written into a shared memory area and picked up by 
the receiver when the receive operation is posted.

Open MPI has a few different shared memory protocols.

For short messages, they always follow what you mention above: CICO.

For large messages, we either use a pipelined CICO (as you surmised below) or 
use direct memory mapping if you have the Linux knem kernel module installed.  
More below.


When the rendezvous is utilized however the message still need to end up in the 
shared memory area somehow. I don't think any RDMA-like transfer exists for 
shared memory communications.

Just to clarify: RDMA = Remote Direct Memory Access, and the "remote" usually 
refers to a different physical address space (e.g., a different server).

In Open MPI's case, knem can use a direct memory copy between two processes.


Therefore you need to buffer this message somehow, however I   assume that 
you don't buffer the whole thing but use some type of pipelined protocol so 
that you reduce the size of the buffer you need to keep in the shared memory.

Correct.  For large messages, when using CICO, we copy the first fragment and 
the necessary meta data to the shmem block.  When the receiver ACKs the first 
fragment, we pipeline CICO the rest of the large message through the shmem 
block.  With the sender and receiver (more or less) simultaneously writing and 
reading to the circular shmem block, we probably won't fill it up -- meaning 
that the sender hypothetically won't need to block.

I'm skipping a bunch of details, but that's the general idea.
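For what it's worth, here is a toy, single-process illustration of that 
pipelined copy-in/copy-out idea. This is NOT Open MPI source -- the real 
code adds fragment headers, ACKs and truly concurrent sender/receiver 
progress -- but it shows how a message larger than the shared block can 
flow through a small circular buffer:

/* cico_pipeline.c - toy, single-process illustration of pipelined
 * copy-in/copy-out through a small circular buffer.  NOT Open MPI source:
 * the real code adds fragment headers, ACKs and concurrent progress. */
#include <stdio.h>
#include <string.h>

#define FRAG   4096                       /* fragment size in bytes (assumed)  */
#define NFRAG  4                          /* circular block holds 4 fragments  */

static char shmem_block[NFRAG][FRAG];     /* stands in for the shared block    */

int main(void)
{
    char src[64 * 1024], dst[64 * 1024];  /* "large" message, bigger than block */
    memset(src, 'x', sizeof src);

    size_t sent = 0, received = 0, n = sizeof src;

    /* Sender and receiver would run concurrently; here they are interleaved:
     * copy a fragment in while there is room, otherwise copy one out (the
     * "ACK"), so the block never has to hold the whole message at once. */
    while (received < n) {
        if (sent < n && sent - received < (size_t)(NFRAG * FRAG)) { /* copy-in  */
            size_t len = (n - sent < FRAG) ? n - sent : FRAG;
            memcpy(shmem_block[(sent / FRAG) % NFRAG], src + sent, len);
            sent += len;
        } else {                                                    /* copy-out */
            size_t len = (n - received < FRAG) ? n - received : FRAG;
            memcpy(dst + received, shmem_block[(received / FRAG) % NFRAG], len);
            received += len;
        }
    }

    printf("pipelined %zu bytes through a %d-byte block: %s\n",
           n, NFRAG * FRAG, memcmp(src, dst, n) == 0 ? "OK" : "MISMATCH");
    return 0;
}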


Is it completely wrong? It would be nice if someone could point me somewhere I 
can find more details about this. In the OpenMPI tuning page there are several 
details regarding the protocol utilized for IB but very little for SM.

Good point.  I'll see if we can get some more info up there.


I think I found the answer to my question on Jeff Squyres' blog:
http://blogs.cisco.com/performance/shared-memory-as-an-mpi-transport-part-2/

However now I have a new question, how do I know if my machine uses the 
copyin/copyout mechanism or the direct mapping?

You need the Linux knem module.  See the OMPI README and do a text search for 
"knem".


Thanks a lot for the clarification.
However, I still have a hard time explaining the following phenomenon.

I have a very simple code performing a ping/pong between 2 processes 
which are allocated on the same computing node. Each process is bound to 
a different CPU via affinity settings.


I perform this operation with 3 cache scenarios (a rough sketch of the 
preloading is given below):
1) Cache is completely invalidated before the send/recv (both at the 
sender and receiver side)
2) Cache is preloaded before the send/recv operation and it's in 
"exclusive" state.
3) Cache is preloaded before the send/recv operation but this time cache 
lines are in a "modified" state
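(For reference, a minimal sketch of what the preloading in scenarios 2 and 
3 could look like, assuming 64-byte cache lines; this is illustrative only 
and not the code actually used in the benchmark:)

/* Rough sketch of the preloading for scenarios 2 and 3 above; assumes
 * 64-byte cache lines.  Illustrative only, not the benchmark's own code. */
#include <stddef.h>

#define DOUBLES_PER_LINE 8      /* 64-byte line / sizeof(double), assumed */

/* Scenario 2: touch every line read-only so it is cached but clean. */
static void preload_clean(const double *buf, size_t n)
{
    volatile double sink = 0.0;
    for (size_t i = 0; i < n; i += DOUBLES_PER_LINE)
        sink += buf[i];
    (void)sink;
}

/* Scenario 3: write every line so it is cached dirty ("modified"). */
static void preload_dirty(double *buf, size_t n)
{
    for (size_t i = 0; i < n; i += DOUBLES_PER_LINE)
        buf[i] = (double)i;
}

int main(void)
{
    static double buf[4096];
    preload_clean(buf, 4096);   /* before the send/recv in scenario 2 */
    preload_dirty(buf, 4096);   /* before the send/recv in scenario 3 */
    return 0;
}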


Now scenario 2 has a speedup over scenario 1, as expected. However, 
scenario 3 is much slower than 1. I observed this for both knem and xpmem.
I assume something is forcing the modified cache lines to be written back 
to memory before the copy is performed, probably because the segment is 
accessed through a volatile pointer, so whatever is in cache has to be 
written back to main memory.


Instead, when the OpenMPI CICO protocol is used, scenarios 2 and 3 have 
exactly the same speedup over 1. Therefore I assume that in this case 
nothing forces the write-back of dirty cache lines. I have been puzzling 
over this issue since yesterday, and it's quite difficult to understand 
without knowing all the internal details.


Is this expected behaviour for you as well, or do you find it surprising? :)

cheers, Simone







Re: [OMPI users] Shared Memory - Eager VS Rendezvous

2012-05-23 Thread Gutierrez, Samuel K

On May 23, 2012, at 7:05 AM, Jeff Squyres wrote:

> On May 23, 2012, at 6:05 AM, Simone Pellegrini wrote:
> 
>>> If process A sends a message to process B and the eager protocol is used 
>>> then I assume that the message is written into a shared memory area and 
>>> picked up by the receiver when the receive operation is posted. 
> 
> Open MPI has a few different shared memory protocols.
> 
> For short messages, they always follow what you mention above: CICO.
> 
> For large messages, we either use a pipelined CICO (as you surmised below) or 
> use direct memory mapping if you have the Linux knem kernel module installed. 
>  More below.
> 
>>> When the rendezvous is utilized however the message still need to end up in 
>>> the shared memory area somehow. I don't think any RDMA-like transfer exists 
>>> for shared memory communications. 
> 
> Just to clarify: RDMA = Remote Direct Memory Access, and the "remote" usually 
> refers to a different physical address space (e.g., a different server).  
> 
> In Open MPI's case, knem can use a direct memory copy between two processes.

In addition, the vader BTL (XPMEM BTL) also provides similar functionality - 
provided the XPMEM kernel module and user library are available on the system.

Based on my limited experience with the two, I've noticed that knem is 
well-suited for Intel architectures, while XPMEM is best for AMD architectures.

Samuel K. Gutierrez
Los Alamos National Laboratory

> 
>>> Therefore you need to buffer this message somehow, however I   assume 
>>> that you don't buffer the whole thing but use some type of pipelined 
>>> protocol so that you reduce the size of the buffer you need to keep in the 
>>> shared memory. 
> 
> Correct.  For large messages, when using CICO, we copy the first fragment and 
> the necessary meta data to the shmem block.  When the receiver ACKs the first 
> fragment, we pipeline CICO the rest of the large message through the shmem 
> block.  With the sender and receiver (more or less) simultaneously writing 
> and reading to the circular shmem block, we probably won't fill it up -- 
> meaning that the sender hypothetically won't need to block.
> 
> I'm skipping a bunch of details, but that's the general idea.
> 
>>> Is it completely wrong? It would be nice if someone could point me 
>>> somewhere I can find more details about this. In the OpenMPI tuning page 
>>> there are several details regarding the protocol utilized for IB but very 
>>> little for SM. 
> 
> Good point.  I'll see if we can get some more info up there.
> 
>> I think I found the answer to my question on Jeff Squyres  blog:
>> http://blogs.cisco.com/performance/shared-memory-as-an-mpi-transport-part-2/
>> 
>> However now I have a new question, how do I know if my machine uses the 
>> copyin/copyout mechanism or the direct mapping? 
> 
> You need the Linux knem module.  See the OMPI README and do a text search for 
> "knem".
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Shared Memory - Eager VS Rendezvous

2012-05-23 Thread Jeff Squyres
On May 23, 2012, at 6:05 AM, Simone Pellegrini wrote:

>> If process A sends a message to process B and the eager protocol is used 
>> then I assume that the message is written into a shared memory area and 
>> picked up by the receiver when the receive operation is posted. 

Open MPI has a few different shared memory protocols.

For short messages, they always follow what you mention above: CICO.

For large messages, we either use a pipelined CICO (as you surmised below) or 
use direct memory mapping if you have the Linux knem kernel module installed.  
More below.

>> When the rendezvous is utilized however the message still need to end up in 
>> the shared memory area somehow. I don't think any RDMA-like transfer exists 
>> for shared memory communications. 

Just to clarify: RDMA = Remote Direct Memory Access, and the "remote" usually 
refers to a different physical address space (e.g., a different server).  

In Open MPI's case, knem can use a direct memory copy between two processes.  

>> Therefore you need to buffer this message somehow, however I   assume 
>> that you don't buffer the whole thing but use some type of pipelined 
>> protocol so that you reduce the size of the buffer you need to keep in the 
>> shared memory. 

Correct.  For large messages, when using CICO, we copy the first fragment and 
the necessary meta data to the shmem block.  When the receiver ACKs the first 
fragment, we pipeline CICO the rest of the large message through the shmem 
block.  With the sender and receiver (more or less) simultaneously writing and 
reading to the circular shmem block, we probably won't fill it up -- meaning 
that the sender hypothetically won't need to block.

I'm skipping a bunch of details, but that's the general idea.

>> Is it completely wrong? It would be nice if someone could point me somewhere 
>> I can find more details about this. In the OpenMPI tuning page there are 
>> several details regarding the protocol utilized for IB but very little for 
>> SM. 

Good point.  I'll see if we can get some more info up there.

> I think I found the answer to my question on Jeff Squyres  blog:
> http://blogs.cisco.com/performance/shared-memory-as-an-mpi-transport-part-2/
> 
> However now I have a new question, how do I know if my machine uses the 
> copyin/copyout mechanism or the direct mapping? 

You need the Linux knem module.  See the OMPI README and do a text search for 
"knem".

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Shared Memory - Eager VS Rendezvous

2012-05-23 Thread Simone Pellegrini

I think I found the answer to my question on Jeff Squyres' blog:
http://blogs.cisco.com/performance/shared-memory-as-an-mpi-transport-part-2/

However now I have a new question, how do I know if my machine uses the 
copyin/copyout mechanism or the direct mapping?


Assuming that I am running on OpenMPI 1.5.x installed on top of a linux 
Kernel 2.6.32?


cheers, Simone

On 05/22/2012 05:29 PM, Simone Pellegrini wrote:

Dear all,
I would like to have a confirmation on the assumptions I have on how 
OpenMPI implements the rendezvous protocol for shared memory.


If process A sends a message to process B and the eager protocol is 
used then I assume that the message is written into a shared memory 
area and picked up by the receiver when the receive operation is posted.


When the rendezvous is utilized however the message still need to end 
up in the shared memory area somehow. I don't think any RDMA-like 
transfer exists for shared memory communications. Therefore you need 
to buffer this message somehow, however I assume that you don't buffer 
the whole thing but use some type of pipelined protocol so that you 
reduce the size of the buffer you need to keep in the shared memory.


Is it completely wrong? It would be nice if someone could point me 
somewhere I can find more details about this. In the OpenMPI tuning 
page there are several details regarding the protocol utilized for IB 
but very little for SM.


thanks in advance,
Simone P.






[OMPI users] Shared Memory - Eager VS Rendezvous

2012-05-22 Thread Simone Pellegrini

Dear all,
I would like to have a confirmation on the assumptions I have on how 
OpenMPI implements the rendezvous protocol for shared memory.


If process A sends a message to process B and the eager protocol is used 
then I assume that the message is written into a shared memory area and 
picked up by the receiver when the receive operation is posted.


When the rendezvous is utilized, however, the message still needs to end up 
in the shared memory area somehow. I don't think any RDMA-like transfer 
exists for shared memory communications. Therefore you need to buffer 
this message somehow; however, I assume that you don't buffer the whole 
thing but use some type of pipelined protocol so that you reduce the 
size of the buffer you need to keep in the shared memory.


Is it completely wrong? It would be nice if someone could point me 
somewhere I can find more details about this. In the OpenMPI tuning page 
there are several details regarding the protocol utilized for IB but 
very little for SM.


thanks in advance,
Simone P.




Re: [OMPI users] Shared Memory Collectives

2011-12-19 Thread Nilesh Mahajan
Hi,
I am trying to implement the following collectives in MPI shared memory 
with zero-copy optimizations: Alltoall, Broadcast, Reduce. So for Reduce, 
my compiler allocates all the send buffers in shared memory (mmap 
anonymous), and allocates only one receive buffer, again in shared memory. 
Then all the processes reduce into the root buffer in a data-parallel 
manner. Now it looks like openmpi is doing something similar, except that 
it must copy from/to the send/receive buffers. So my implementation of 
Reduce should perform better for large buffer sizes. But that is not the 
case. Anybody know why? Any pointers are welcome.
Also the openmpi Reduce performance has large variations. I run Reduce 
with different array sizes with np = 8, 50 times, and for a single array 
size I find that there is a significant number of outliers. Did anybody 
face similar problems?
Thanks,
Nilesh.


Re: [OMPI users] Shared memory optimizations in OMPI

2011-11-22 Thread Jeff Squyres
All the shared memory code is in the "sm" BTL (byte transfer layer) component: 
ompi/mca/btl/sm.  All the TCP MPI code is in the "tcp" BTL component: 
ompi/mca/btl/tcp.  Think of "ob1" as the MPI engine that is the bottom of 
MPI_SEND, MPI_RECV, and friends.  It takes a message to be sent, determines how 
many BTLs can be used to send it, fragments the message as appropriate, and 
chooses from one of several different protocols to actually send the message.  
It then hands off the fragments of that message to the underlying BTLs to 
effect the actual transfer.

So ob1 has no knowledge of shared memory or TCP directly -- it relies on the 
BTLs to say "yes, I can reach peer X at priority Y".  For example, both TCP and 
sm will respond that they can reach a peer that is on the same server node.  
But sm will have a higher priority, so it will get all the fragments destined 
for that process, and TCP will be ignored.
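As a mental model of that priority mechanism (purely illustrative -- not 
the actual ob1/r2 data structures, and the priorities and the 
16-ranks-per-node assumption are made up): each peer ends up with the set 
of modules that claimed reachability, and the highest priority one wins.

/* Toy model of the priority-based selection described above; this is not
 * Open MPI source, and the priorities/topology below are made up. */
#include <stddef.h>
#include <stdio.h>

struct btl {
    const char *name;
    int priority;                  /* higher wins                        */
    int (*reachable)(int peer);    /* would be filled in during MPI_INIT */
};

static int sm_reachable(int peer)  { return peer < 16; }     /* "same node", assumed */
static int tcp_reachable(int peer) { (void)peer; return 1; } /* reaches everyone     */

static struct btl btls[] = {
    { "sm",  50, sm_reachable  },
    { "tcp", 20, tcp_reachable },
};

/* Pick the highest-priority BTL that claims it can reach the peer. */
static const struct btl *select_btl(int peer)
{
    const struct btl *best = NULL;
    for (size_t i = 0; i < sizeof btls / sizeof btls[0]; ++i)
        if (btls[i].reachable(peer) &&
            (best == NULL || btls[i].priority > best->priority))
            best = &btls[i];
    return best;
}

int main(void)
{
    printf("peer  3 -> %s\n", select_btl(3)->name);   /* local peer:  sm  */
    printf("peer 42 -> %s\n", select_btl(42)->name);  /* remote peer: tcp */
    return 0;
}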

Remember: all of this is setup during MPI_INIT.  During MPI_SEND (and friends), 
ob1 (and r2, the BML (BTL multiplexing layer)) is just looking up arrays of 
pointers and invoking function pointers that were previously setup.

So you can look into ob1, but be aware that it's all done by function pointers 
and indirection.  

Your best bet might well be to look at individual function names in the TCP and 
SM BTLs and set breakpoints on those.  The file ompi/mca/btl/btl.h provides 
descriptions of what each of the publicly exported functions from each of the 
BTL components do; this will give you information about what the functions in 
the TCP and SM BTLs are doing.


On Nov 22, 2011, at 10:12 AM, Shamik Ganguly wrote:

> Thanks a lot Jeff.
> 
> PIN is a dynamic binary instrumentation tool from Intel. It runs on top of 
> the Binary in the MPI node. When its given function calls to instrument, it 
> will insert trappings before/after that funtion call in the binary of the 
> program you are instrumenting and you can insert your own functions. 
> 
> I am doing some memory address profiling on benchmarks running on MPI and I 
> was using PIN to get the Load/Store addresses. Furthermore I needed to know 
> which LD/ST were coming from actual socket communication and which are coming 
> from shared memory optimizations. So i needed to know which functions/where 
> exactly were they taking that decision so that I can instrument the 
> appropriate MPI library function call (the actual low level function, not the 
> API like MPI_Sends/Recvs) in PIN. Hence I guess I am actually zooming down to 
> a 1000 ft view :)
> 
> Any suggestion is welcome. I will go into the ob1 directory and try to hunt 
> around to see how exactly its being done.
> 
> Regards,
> Shamik
> 
> On Tue, Nov 22, 2011 at 10:08 AM, Shamik Ganguly  
> wrote:
> Thanks a lot Jeff.
> 
> PIN is a dynamic binary instrumentation tool from Intel. It runs on top of 
> the Binary in the MPI node. When its given function calls to instrument, it 
> will insert trappings before/after that funtion call in the binary of the 
> program you are instrumenting and you can insert your own functions. 
> 
> I am doing some memory address profiling on benchmarks running on MPI and I 
> was using PIN to get the Load/Store addresses. Furthermore I needed to know 
> which LD/ST were coming from actual socket communication and which are coming 
> from shared memory optimizations. So i needed to know which functions/where 
> exactly were they taking that decision so that I can instrument the 
> appropriate MPI library function call (the actual low level function, not the 
> API like MPI_Sends/Recvs) in PIN. Hence I guess I am actually zooming down to 
> a 1000 ft view :)
> 
> I will go into the ob1 directory and try to hunt around to see how exactly 
> its being done.
> 
> Regards,
> Shamik
> 
> 
> On Tue, Nov 22, 2011 at 9:04 AM, Jeff Squyres  wrote:
> On Nov 22, 2011, at 1:09 AM, Shamik Ganguly wrote:
> 
> > I want to trace when the MPI  library prevents an MPI_Send from going to 
> > the socket and makes it access shared memory because the target node is on 
> > the same chip (CMP). I want to use PIN to trace this. Can you please give 
> > me some pointers about which functions are taking this decision so that I 
> > can instrument the appropriate library calls in PIN?
> 
> What's PIN?
> 
> The decision is made in the ob1 PML plugin.  Way back during MPI_INIT, each 
> MPI process creates lists of BTLs to use to contact each MPI process peer.
> 
> When a process is on the same *node* (e.g., a single server) -- not just the 
> same processor socket/chip -- the shared memory BTL is given preference to 
> all other BTLs by use of a priority mechanism.  Hence, the "sm" BTL is put at 
> the front of the BML lists (BML = BTL multiplexing layer -- it's essentially 
> just list management for BTLs).
> 
> Later, when MPI_SEND comes through, it uses the already-setup BML lists to 
> determine which BTL(s) to use to send a 

Re: [OMPI users] Shared memory optimizations in OMPI

2011-11-22 Thread Shamik Ganguly
Thanks a lot Jeff.

PIN is a dynamic binary instrumentation tool from Intel. It runs on top of
the binary on the MPI node. When it's given function calls to instrument, it
will insert trappings before/after each function call in the binary of the
program you are instrumenting, and you can insert your own functions.

I am doing some memory address profiling on benchmarks running on MPI and I
was using PIN to get the load/store addresses. Furthermore, I needed to know
which LD/ST were coming from actual socket communication and which were
coming from shared memory optimizations. So I needed to know which
functions, and where exactly, were taking that decision so that I can
instrument the appropriate MPI library function call (the actual low-level
function, not the API like MPI_Sends/Recvs) in PIN. Hence I guess I am
actually zooming down to a 1000 ft view :)

Any suggestion is welcome. I will go into the ob1 directory and try to hunt
around to see how exactly its being done.

Regards,
Shamik

On Tue, Nov 22, 2011 at 10:08 AM, Shamik Ganguly
wrote:

> Thanks a lot Jeff.
>
> PIN is a dynamic binary instrumentation tool from Intel. It runs on top of
> the Binary in the MPI node. When its given function calls to instrument, it
> will insert trappings before/after that funtion call in the binary of the
> program you are instrumenting and you can insert your own functions.
>
> I am doing some memory address profiling on benchmarks running on MPI and
> I was using PIN to get the Load/Store addresses. Furthermore I needed to
> know which LD/ST were coming from actual socket communication and which are
> coming from shared memory optimizations. So i needed to know which
> functions/where exactly were they taking that decision so that I can
> instrument the appropriate MPI library function call (the actual low level
> function, not the API like MPI_Sends/Recvs) in PIN. Hence I guess I am
> actually zooming down to a 1000 ft view :)
>
> I will go into the ob1 directory and try to hunt around to see how exactly
> its being done.
>
> Regards,
> Shamik
>
>
> On Tue, Nov 22, 2011 at 9:04 AM, Jeff Squyres  wrote:
>
>> On Nov 22, 2011, at 1:09 AM, Shamik Ganguly wrote:
>>
>> > I want to trace when the MPI  library prevents an MPI_Send from going
>> to the socket and makes it access shared memory because the target node is
>> on the same chip (CMP). I want to use PIN to trace this. Can you please
>> give me some pointers about which functions are taking this decision so
>> that I can instrument the appropriate library calls in PIN?
>>
>> What's PIN?
>>
>> The decision is made in the ob1 PML plugin.  Way back during MPI_INIT,
>> each MPI process creates lists of BTLs to use to contact each MPI process
>> peer.
>>
>> When a process is on the same *node* (e.g., a single server) -- not just
>> the same processor socket/chip -- the shared memory BTL is given preference
>> to all other BTLs by use of a priority mechanism.  Hence, the "sm" BTL is
>> put at the front of the BML lists (BML = BTL multiplexing layer -- it's
>> essentially just list management for BTLs).
>>
>> Later, when MPI_SEND comes through, it uses the already-setup BML lists
>> to determine which BTL(s) to use to send a message.
>>
>> That's the 50,000 foot view.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> --
> Shamik Ganguly
>
>


-- 
Shamik Ganguly
2nd year, MS (CSE-Hardware), University of Michigan, Ann Arbor
B.Tech.(E), IITKGP (2008)


Re: [OMPI users] Shared memory optimizations in OMPI

2011-11-22 Thread Jeff Squyres
On Nov 22, 2011, at 1:09 AM, Shamik Ganguly wrote:

> I want to trace when the MPI  library prevents an MPI_Send from going to the 
> socket and makes it access shared memory because the target node is on the 
> same chip (CMP). I want to use PIN to trace this. Can you please give me some 
> pointers about which functions are taking this decision so that I can 
> instrument the appropriate library calls in PIN?

What's PIN?

The decision is made in the ob1 PML plugin.  Way back during MPI_INIT, each MPI 
process creates lists of BTLs to use to contact each MPI process peer.  

When a process is on the same *node* (e.g., a single server) -- not just the 
same processor socket/chip -- the shared memory BTL is given preference to all 
other BTLs by use of a priority mechanism.  Hence, the "sm" BTL is put at the 
front of the BML lists (BML = BTL multiplexing layer -- it's essentially just 
list management for BTLs).  

Later, when MPI_SEND comes through, it uses the already-setup BML lists to 
determine which BTL(s) to use to send a message.

That's the 50,000 foot view.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] Shared memory optimizations in OMPI

2011-11-22 Thread Shamik Ganguly
Hi,

I want to trace when the MPI  library prevents an MPI_Send from going to
the socket and makes it access shared memory because the target node is on
the same chip (CMP). I want to use PIN to trace this. Can you please give
me some pointers about which functions are taking this decision so that I
can instrument the appropriate library calls in PIN?

Regards,
-- 
Shamik Ganguly
2nd year, MS (CSE-Hardware), University of Michigan, Ann Arbor
B.Tech.(E), IITKGP (2008)

P.S.- I am resending this since I had posted this from a different email id
from what I subscribed with, I hope there is no duplication.


Re: [OMPI users] Shared-memory problems

2011-11-03 Thread Ralph Castain
I'm afraid this isn't correct. You definitely don't want the session directory 
in /dev/shm as this will almost always cause problems.

We look thru a progression of envars to find where to put the session directory:

1. the MCA param orte_tmpdir_base

2. the envar OMPI_PREFIX_ENV

3. the envar TMPDIR

4. the envar TEMP

5. the envar TMP

Check all those to see if one is set to /dev/shm. If so, you have a problem to 
resolve. For performance reasons, you probably don't want the session directory 
sitting on a network mounted location. What you need is a good local directory 
- anything you have permission to write in will work fine. Just set one of the 
above to point to it.
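In other words, the first location in that list which is set wins. A sketch 
of the lookup order (illustrative only, not the actual ORTE code; it 
assumes the MCA param is supplied as the OMPI_MCA_orte_tmpdir_base 
environment variable):

/* Sketch of the tmpdir lookup order described above; not the actual ORTE
 * code.  Assumption: the MCA param arrives as OMPI_MCA_orte_tmpdir_base. */
#include <stdio.h>
#include <stdlib.h>

static const char *session_dir_base(void)
{
    /* Checked in order; the first one set wins, else fall back to /tmp. */
    const char *candidates[] = {
        "OMPI_MCA_orte_tmpdir_base",   /* 1. MCA param (as an envar, assumed) */
        "OMPI_PREFIX_ENV",             /* 2. */
        "TMPDIR",                      /* 3. */
        "TEMP",                        /* 4. */
        "TMP",                         /* 5. */
    };
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
        const char *v = getenv(candidates[i]);
        if (v != NULL && v[0] != '\0')
            return v;
    }
    return "/tmp";
}

int main(void)
{
    printf("session directory base would be: %s\n", session_dir_base());
    return 0;
}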


On Nov 3, 2011, at 10:04 AM, Durga Choudhury wrote:

> Since /tmp is mounted across a network and /dev/shm is (always) local,
> /dev/shm seems to be the right place for shared memory transactions.
> If you create temporary files using mktemp is it being created in
> /dev/shm or /tmp?
> 
> 
> On Thu, Nov 3, 2011 at 11:50 AM, Bogdan Costescu  wrote:
>> On Thu, Nov 3, 2011 at 15:54, Blosch, Edwin L  
>> wrote:
>>> -/dev/shm is 12 GB and has 755 permissions
>>> ...
>>> % ls –l output:
>>> 
>>> drwxr-xr-x  2 root root 40 Oct 28 09:14 shm
>> 
>> This is your problem: it should be something like drwxrwxrwt. It might
>> depend on the distribution, f.e. the following show this to be a bug:
>> 
>> https://bugzilla.redhat.com/show_bug.cgi?id=533897
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=317329
>> 
>> and surely you can find some more on the subject with your favorite
>> search engine. Another source could be a paranoid sysadmin who has
>> changed the default (most likely correct) setting the distribution
>> came with - not only OpenMPI but any application using shmem would be
>> affected..
>> 
>> Cheers,
>> Bogdan
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Shared-memory problems

2011-11-03 Thread Durga Choudhury
Since /tmp is mounted across a network and /dev/shm is (always) local,
/dev/shm seems to be the right place for shared memory transactions.
If you create temporary files using mktemp, are they created in
/dev/shm or /tmp?


On Thu, Nov 3, 2011 at 11:50 AM, Bogdan Costescu  wrote:
> On Thu, Nov 3, 2011 at 15:54, Blosch, Edwin L  wrote:
>> -    /dev/shm is 12 GB and has 755 permissions
>> ...
>> % ls –l output:
>>
>> drwxr-xr-x  2 root root 40 Oct 28 09:14 shm
>
> This is your problem: it should be something like drwxrwxrwt. It might
> depend on the distribution, f.e. the following show this to be a bug:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=533897
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=317329
>
> and surely you can find some more on the subject with your favorite
> search engine. Another source could be a paranoid sysadmin who has
> changed the default (most likely correct) setting the distribution
> came with - not only OpenMPI but any application using shmem would be
> affected..
>
> Cheers,
> Bogdan
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



Re: [OMPI users] Shared-memory problems

2011-11-03 Thread Bogdan Costescu
On Thu, Nov 3, 2011 at 15:54, Blosch, Edwin L  wrote:
> -    /dev/shm is 12 GB and has 755 permissions
> ...
> % ls –l output:
>
> drwxr-xr-x  2 root root 40 Oct 28 09:14 shm

This is your problem: it should be something like drwxrwxrwt. It might
depend on the distribution, f.e. the following show this to be a bug:

https://bugzilla.redhat.com/show_bug.cgi?id=533897
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=317329

and surely you can find some more on the subject with your favorite
search engine. Another source could be a paranoid sysadmin who has
changed the default (most likely correct) setting the distribution
came with - not only OpenMPI but any application using shmem would be
affected..

Cheers,
Bogdan



Re: [OMPI users] Shared-memory problems

2011-11-03 Thread Ralph Castain

On Nov 3, 2011, at 8:54 AM, Blosch, Edwin L wrote:

> Can anyone guess what the problem is here?  I was under the impression that 
> OpenMPI (1.4.4) would look for /tmp and would create its shared-memory 
> backing file there, i.e. if you don’t set orte_tmpdir_base to anything.

That is correct

>  
> Well, there IS a /tmp and yet it appears that OpenMPI has chosen to use 
> /dev/shm.  Why?

Looks like a bug to me - it shouldn't be doing that. Will have to take a look - 
first I've heard of that behavior.


>  
> And, next question, why doesn’t it work?  Here are the oddities of this 
> cluster:
> -the cluster is ‘diskless’
> -/tmp is an NFS mount
> -/dev/shm is 12 GB and has 755 permissions
>  
> FilesystemSize  Used Avail Use% Mounted on
> tmpfs  12G  164K   12G   1% /dev/shm
>  
> % ls –l output:
> drwxr-xr-x  2 root root 40 Oct 28 09:14 shm
>  
>  
> The error message below suggests that OpenMPI (1.4.4) has somehow 
> auto-magically decided to use /dev/shm and is failing to be able to use it, 
> for some reason.
>  
> Thanks for whatever help you can offer,
>  
> Ed
>  
>  
> e8315:02942] opal_os_dirpath_create: Error: Unable to create the 
> sub-directory (/dev/shm/openmpi-sessions-estenfte@e8315_0) of 
> (/dev/shm/openmpi-sessions-estenfte@e8315_0/8474/0/1), mkdir failed [1]
> [e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c 
> at line 106
> [e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c 
> at line 399
> [e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file 
> base/ess_base_std_orted.c at line 206
> --
> It looks like orte_init failed for some reason; your parallel process is
> 
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>  
>   orte_session_dir failed
>   --> Returned value Error (-1) instead of ORTE_SUCCESS
> --
> [e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at 
> line 136
> [e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c 
> at line 132
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>  
>   orte_ess_set_name failed
>   --> Returned value Error (-1) instead of ORTE_SUCCESS
> --
> [e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c 
> at line 325
>  
>  
>  
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] Shared-memory problems

2011-11-03 Thread Blosch, Edwin L
Can anyone guess what the problem is here?  I was under the impression that 
OpenMPI (1.4.4) would look for /tmp and would create its shared-memory backing 
file there, i.e. if you don't set orte_tmpdir_base to anything.

Well, there IS a /tmp and yet it appears that OpenMPI has chosen to use 
/dev/shm.  Why?

And, next question, why doesn't it work?  Here are the oddities of this cluster:

-the cluster is 'diskless'

-/tmp is an NFS mount

-/dev/shm is 12 GB and has 755 permissions

FilesystemSize  Used Avail Use% Mounted on
tmpfs  12G  164K   12G   1% /dev/shm

% ls -l output:
drwxr-xr-x  2 root root 40 Oct 28 09:14 shm


The error message below suggests that OpenMPI (1.4.4) has somehow 
auto-magically decided to use /dev/shm and is failing to be able to use it, for 
some reason.

Thanks for whatever help you can offer,

Ed


e8315:02942] opal_os_dirpath_create: Error: Unable to create the sub-directory 
(/dev/shm/openmpi-sessions-estenfte@e8315_0) of 
(/dev/shm/openmpi-sessions-estenfte@e8315_0/8474/0/1), mkdir failed [1]
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at 
line 106
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at 
line 399
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file 
base/ess_base_std_orted.c at line 206
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at 
line 136
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at 
line 132
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c at 
line 325





Re: [OMPI users] Shared Memory Performance Problem.

2011-03-30 Thread Tim Prince

On 3/30/2011 10:08 AM, Eugene Loh wrote:

Michele Marena wrote:

I've launched my app with mpiP both when two processes are on
different node and when two processes are on the same node.

The process 0 is the manager (gathers the results only), processes 1
and 2 are workers (compute).

This is the case processes 1 and 2 are on different nodes (runs in 162s).
@--- MPI Time (seconds)
---
Task AppTime MPITime MPI%
0 162 162 99.99
1 162 30.2 18.66
2 162 14.7 9.04
* 486 207 42.56

The case when processes 1 and 2 are on the same node (runs in 260s).
@--- MPI Time (seconds)
---
Task AppTime MPITime MPI%
0 260 260 99.99
1 260 39.7 15.29
2 260 26.4 10.17
* 779 326 41.82

I think there's a contention problem on the memory bus.

Right. Process 0 spends all its time in MPI, presumably waiting on
workers. The workers spend about the same amount of time on MPI
regardless of whether they're placed together or not. The big difference
is that the workers are much slower in non-MPI tasks when they're
located on the same node. The issue has little to do with MPI. The
workers are hogging local resources and work faster when placed on
different nodes.

However, the message size is 4096 * sizeof(double). Maybe I are wrong
in this point. Is the message size too huge for shared memory?

No. That's not very large at all.


Not even large enough to expect the non-temporal storage issue about 
cache eviction to arise.
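One way to test the memory-bus-contention explanation independently of MPI 
is to run a bandwidth-bound kernel once alone and once as two copies pinned 
to cores of the same node, and compare the per-process rate. A crude 
STREAM-style sketch (array size and repetition count are arbitrary choices):

/* triad.c - crude memory-bandwidth probe.  Run one copy alone, then two
 * copies pinned to cores of the same node, and compare MB/s per process.
 * Illustrative only; sizes and repetition count are arbitrary.
 * Build: cc -O2 triad.c -o triad   (add -lrt on older glibc) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)            /* 128 MB per array */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (size_t i = 0; i < N; ++i) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 10; ++rep)
        for (size_t i = 0; i < N; ++i)
            a[i] = b[i] + 3.0 * c[i];                  /* STREAM-style triad */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 10.0 * 3.0 * N * sizeof(double);    /* read b, read c, write a */
    printf("%.1f MB/s\n", bytes / secs / 1e6);

    free(a); free(b); free(c);
    return 0;
}

If the per-process rate drops sharply when two copies share a node, the 
slowdown is coming from the memory system rather than from MPI.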



--
Tim Prince


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-30 Thread Eugene Loh




Michele Marena wrote:
I've launched my app with mpiP both when two processes are
on different node and when two processes are on the same node.
  
  
  The process 0 is the manager (gathers the results only),
processes 1 and 2 are  workers (compute).
  
  
  This is the case processes 1 and 2 are on different nodes (runs
in 162s).
  
  @--- MPI Time (seconds)
---
  Task    AppTime    MPITime     MPI%
     0        162        162    99.99
     1        162       30.2    18.66
     2        162       14.7     9.04
     *        486        207    42.56
  
  
  The case when processes 1 and 2 are on the same node (runs in
260s).
  
  @--- MPI Time (seconds)
---
  Task    AppTime    MPITime     MPI%
     0        260        260    99.99
     1        260       39.7    15.29
     2        260       26.4    10.17
     *        779        326    41.82
  
  
  
  I think there's a contention problem on the memory bus.
  

Right.  Process 0 spends all its time in MPI, presumably waiting on
workers.  The workers spend about the same amount of time on MPI
regardless of whether they're placed together or not.  The big
difference is that the workers are much slower in non-MPI tasks when
they're located on the same node.  The issue has little to do with
MPI.  The workers are hogging local resources and work faster when
placed on different nodes.

  
  However, the message size is 4096 * sizeof(double). Maybe I are
wrong in this point. Is the message size too huge for shared memory?
  

No.  That's not very large at all.

  
  
  

>>> On Mar 27, 2011, at 10:33 AM, Ralph
Castain wrote:
>>>
>>> >http://www.open-mpi.org/faq/?category=perftools

  
  
  





Re: [OMPI users] Shared Memory Performance Problem.

2011-03-30 Thread Michele Marena
Hi Jeff,
I thank you for your help,
I've launched my app with mpiP both when two processes are on different node
and when two processes are on the same node.

The process 0 is the manager (gathers the results only), processes 1 and 2
are  workers (compute).

This is the case processes 1 and 2 are on different nodes (runs in 162s).

---
@--- MPI Time (seconds) ---
---
Task    AppTime    MPITime     MPI%
   0        162        162    99.99
   1        162       30.2    18.66
   2        162       14.7     9.04
   *        486        207    42.56
---
@--- Aggregate Time (top twenty, descending, milliseconds) 
---
Call                 Site       Time    App%    MPI%     COV
Barrier                 5   1.28e+05   26.24   61.64    0.00
Barrier                14    2.3e+04    4.74   11.13    0.00
Barrier                 6   2.29e+04    4.72   11.08    0.00
Barrier                17   1.77e+04    3.65    8.58    1.41
Recv                    3   1.15e+04    2.37    5.58    0.00
Recv                   30   2.26e+03    0.47    1.09    0.00
Recv                   12        308    0.06    0.15    0.00
Recv                   26        286    0.06    0.14    0.00
Recv                   28        252    0.05    0.12    0.00
Recv                   31        246    0.05    0.12    0.00
Isend                  35        111    0.02    0.05    0.00
Isend                  34        108    0.02    0.05    0.00
Isend                  18        107    0.02    0.05    0.00
Isend                  19        105    0.02    0.05    0.00
Isend                   9       57.7    0.01    0.03    0.05
Isend                  32       39.7    0.01    0.02    0.00
Barrier                25       38.7    0.01    0.02    1.39
Isend                  11       38.6    0.01    0.02    0.00
Recv                   16       34.1    0.01    0.02    0.00
Recv                   27       26.5    0.01    0.01    0.00
---
@--- Aggregate Sent Message Size (top twenty, descending, bytes) --
---
Call Site  Count  Total   Avrg  Sent%
Isend   9   4096   1.34e+08   3.28e+04  58.73
Isend  34   1200   1.85e+07   1.54e+04   8.07
Isend  35   1200   1.85e+07   1.54e+04   8.07
Isend  18   1200   1.85e+07   1.54e+04   8.07
Isend  19   1200   1.85e+07   1.54e+04   8.07
Isend  32240   3.69e+06   1.54e+04   1.61
Isend  11240   3.69e+06   1.54e+04   1.61
Isend  15180   3.44e+06   1.91e+04   1.51
Isend  33 61  2e+06   3.28e+04   0.87
Isend  10 61  2e+06   3.28e+04   0.87
Isend  29 61  2e+06   3.28e+04   0.87
Isend  22 61  2e+06   3.28e+04   0.87
Isend  37180   1.72e+06   9.57e+03   0.75
Isend  24  2 16  8   0.00
Isend  20  2 16  8   0.00
Send8  1  4  4   0.00
Send1  1  4  4   0.00

The case when processes 1 and 2 are on the same node (runs in 260s).
---
@--- MPI Time (seconds) ---
---
Task    AppTime    MPITime     MPI%
   0        260        260    99.99
   1        260       39.7    15.29
   2        260       26.4    10.17
   *        779        326    41.82

---
@--- Aggregate Time (top twenty, descending, milliseconds) 
---
Call                 Site       Time    App%    MPI%     COV
Barrier                 5   2.23e+05   28.64   68.50    0.00
Barrier                 6   2.49e+04    3.20    7.66    0.00
Barrier                14   2.31e+04    2.96    7.09    0.00
Recv                   28   1.35e+04    1.73    4.14    0.00
Recv                   16   1.32e+04    1.70    4.06    0.00
Barrier                17   1.22e+04    1.56    3.74    1.41
Recv                    3   1.16e+04    1.48    3.55    0.00
Recv                   26   1.67e+03    0.21    0.51    0.00
Recv                   30        940    0.12    0.29    0.00
Recv

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-30 Thread Jeff Squyres
How many messages are you sending, and how large are they?  I.e., if your 
message passing is tiny, then the network transport may not be the bottleneck 
here.
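For reference, a bare ping-pong like the sketch below (hypothetical file 
name, message size given on the command line) makes it easy to measure 
exactly that: how the per-message time changes with size for sm vs. tcp on 
the same node.

/* pingpong.c - time round trips between ranks 0 and 1 for one message size.
 * Illustrative sketch:  mpicc pingpong.c -o pingpong
 *                       mpirun -np 2 ./pingpong 32768      (4096 doubles) */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int bytes = (argc > 1) ? atoi(argv[1]) : 32768;
    int reps  = 1000;
    char *buf = malloc(bytes);
    memset(buf, 0, bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; ++i) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0)
        printf("%d bytes: %.2f us per round trip, %.1f MB/s one-way\n",
               bytes, 1e6 * dt / reps, 2.0 * reps * bytes / dt / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}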


On Mar 28, 2011, at 9:41 AM, Michele Marena wrote:

> I run ompi_info --param btl sm and this is the output
> 
>  MCA btl: parameter "btl_base_debug" (current value: "0")
>   If btl_base_debug is 1 standard debug is output, if 
> > 1 verbose debug is output
>  MCA btl: parameter "btl" (current value: )
>   Default selection set of components for the btl 
> framework ( means "use all components that can be found")
>  MCA btl: parameter "btl_base_verbose" (current value: "0")
>   Verbosity level for the btl framework (0 = no 
> verbosity)
>  MCA btl: parameter "btl_sm_free_list_num" (current value: 
> "8")
>  MCA btl: parameter "btl_sm_free_list_max" (current value: 
> "-1")
>  MCA btl: parameter "btl_sm_free_list_inc" (current value: 
> "64")
>  MCA btl: parameter "btl_sm_exclusivity" (current value: 
> "65535")
>  MCA btl: parameter "btl_sm_latency" (current value: "100")
>  MCA btl: parameter "btl_sm_max_procs" (current value: "-1")
>  MCA btl: parameter "btl_sm_sm_extra_procs" (current value: 
> "2")
>  MCA btl: parameter "btl_sm_mpool" (current value: "sm")
>  MCA btl: parameter "btl_sm_eager_limit" (current value: 
> "4096")
>  MCA btl: parameter "btl_sm_max_frag_size" (current value: 
> "32768")
>  MCA btl: parameter "btl_sm_size_of_cb_queue" (current value: 
> "128")
>  MCA btl: parameter "btl_sm_cb_lazy_free_freq" (current 
> value: "120")
>  MCA btl: parameter "btl_sm_priority" (current value: "0")
>  MCA btl: parameter "btl_base_warn_component_unused" (current 
> value: "1")
>   This parameter is used to turn on warning messages 
> when certain NICs are not used
> 
> 
> 2011/3/28 Ralph Castain 
> The fact that this exactly matches the time you measured with shared memory 
> is suspicious. My guess is that you aren't actually using shared memory at 
> all.
> 
> Does your "ompi_info" output show shared memory as being available? Jeff or 
> others may be able to give you some params that would let you check to see if 
> sm is actually being used between those procs.
> 
> 
> 
> On Mar 28, 2011, at 7:51 AM, Michele Marena wrote:
> 
>> What happens with 2 processes on the same node with tcp?
>> With --mca btl self,tcp my app runs in 23s.
>> 
>> 2011/3/28 Jeff Squyres (jsquyres) 
>> Ah, I didn't catch before that there were more variables than just tcp vs. 
>> shmem. 
>> 
>> What happens with 2 processes on the same node with tcp?
>> 
>> Eg, when both procs are on the same node, are you thrashing caches or memory?
>> 
>> Sent from my phone. No type good. 
>> 
>> On Mar 28, 2011, at 6:27 AM, "Michele Marena"  
>> wrote:
>> 
>>> However, I thank you Tim, Ralh and Jeff.
>>> My sequential application runs in 24s (wall clock time).
>>> My parallel application runs in 13s with two processes on different nodes.
>>> With shared memory, when two processes are on the same node, my app runs in 
>>> 23s.
>>> I'm not understand why.
>>> 
>>> 2011/3/28 Jeff Squyres 
>>> If your program runs faster across 3 processes, 2 of which are local to 
>>> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then 
>>> something is very, very strange.
>>> 
>>> Tim cites all kinds of things that can cause slowdowns, but it's still 
>>> very, very odd that simply enabling using the shared memory communications 
>>> channel in Open MPI *slows your overall application down*.
>>> 
>>> How much does your application slow down in wall clock time?  Seconds?  
>>> Minutes?  Hours?  (anything less than 1 second is in the noise)
>>> 
>>> 
>>> 
>>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>> 
>>> >
>>> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>>> >
>>> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
>>> >>> Hi,
>>> >>> My application performs good without shared memory utilization, but with
>>> >>> shared memory I get performance worst than without of it.
>>> >>> Do I make a mistake? Don't I pay attention to something?
>>> >>> I know OpenMPI uses /tmp directory to allocate shared memory and it is
>>> >>> in the local filesystem.
>>> >>>
>>> >>
>>> >> I guess you mean shared memory message passing.   Among relevant 
>>> >> parameters may be the message size where your implementation switches 
>>> >> from cached copy to non-temporal (if you are on a platform where that 
>>> >> terminology is used).  If built with Intel compilers, for example, the 
>>> >> copy may be performed by intel_fast_memcpy, with a 

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Michele Marena
I run ompi_info --param btl sm and this is the output

 MCA btl: parameter "btl_base_debug" (current value: "0")
  If btl_base_debug is 1 standard debug is output,
if > 1 verbose debug is output
 MCA btl: parameter "btl" (current value: )
  Default selection set of components for the btl
framework ( means "use all components that can be found")
 MCA btl: parameter "btl_base_verbose" (current value: "0")
  Verbosity level for the btl framework (0 = no
verbosity)
 MCA btl: parameter "btl_sm_free_list_num" (current value:
"8")
 MCA btl: parameter "btl_sm_free_list_max" (current value:
"-1")
 MCA btl: parameter "btl_sm_free_list_inc" (current value:
"64")
 MCA btl: parameter "btl_sm_exclusivity" (current value:
"65535")
 MCA btl: parameter "btl_sm_latency" (current value: "100")
 MCA btl: parameter "btl_sm_max_procs" (current value: "-1")
 MCA btl: parameter "btl_sm_sm_extra_procs" (current value:
"2")
 MCA btl: parameter "btl_sm_mpool" (current value: "sm")
 MCA btl: parameter "btl_sm_eager_limit" (current value:
"4096")
 MCA btl: parameter "btl_sm_max_frag_size" (current value:
"32768")
 MCA btl: parameter "btl_sm_size_of_cb_queue" (current
value: "128")
 MCA btl: parameter "btl_sm_cb_lazy_free_freq" (current
value: "120")
 MCA btl: parameter "btl_sm_priority" (current value: "0")
 MCA btl: parameter "btl_base_warn_component_unused"
(current value: "1")
  This parameter is used to turn on warning messages
when certain NICs are not used


2011/3/28 Ralph Castain 

> The fact that this exactly matches the time you measured with shared memory
> is suspicious. My guess is that you aren't actually using shared memory at
> all.
>
> Does your "ompi_info" output show shared memory as being available? Jeff or
> others may be able to give you some params that would let you check to see
> if sm is actually being used between those procs.
>
>
>
> On Mar 28, 2011, at 7:51 AM, Michele Marena wrote:
>
> What happens with 2 processes on the same node with tcp?
> With --mca btl self,tcp my app runs in 23s.
>
> 2011/3/28 Jeff Squyres (jsquyres) 
>
>> Ah, I didn't catch before that there were more variables than just tcp vs.
>> shmem.
>>
>> What happens with 2 processes on the same node with tcp?
>>
>> Eg, when both procs are on the same node, are you thrashing caches or
>> memory?
>>
>> Sent from my phone. No type good.
>>
>> On Mar 28, 2011, at 6:27 AM, "Michele Marena" 
>> wrote:
>>
>> However, I thank you Tim, Ralh and Jeff.
>> My sequential application runs in 24s (wall clock time).
>> My parallel application runs in 13s with two processes on different nodes.
>> With shared memory, when two processes are on the same node, my app runs
>> in 23s.
>> I'm not understand why.
>>
>> 2011/3/28 Jeff Squyres < jsquy...@cisco.com>
>>
>>> If your program runs faster across 3 processes, 2 of which are local to
>>> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then
>>> something is very, very strange.
>>>
>>> Tim cites all kinds of things that can cause slowdowns, but it's still
>>> very, very odd that simply enabling using the shared memory communications
>>> channel in Open MPI *slows your overall application down*.
>>>
>>> How much does your application slow down in wall clock time?  Seconds?
>>>  Minutes?  Hours?  (anything less than 1 second is in the noise)
>>>
>>>
>>>
>>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>>
>>> >
>>> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>>> >
>>> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
>>> >>> Hi,
>>> >>> My application performs good without shared memory utilization, but
>>> with
>>> >>> shared memory I get performance worst than without of it.
>>> >>> Do I make a mistake? Don't I pay attention to something?
>>> >>> I know OpenMPI uses /tmp directory to allocate shared memory and it
>>> is
>>> >>> in the local filesystem.
>>> >>>
>>> >>
>>> >> I guess you mean shared memory message passing.   Among relevant
>>> parameters may be the message size where your implementation switches from
>>> cached copy to non-temporal (if you are on a platform where that terminology
>>> is used).  If built with Intel compilers, for example, the copy may be
>>> performed by intel_fast_memcpy, with a default setting which uses
>>> non-temporal when the message exceeds about some preset size, e.g. 50% of
>>> smallest L2 cache for that architecture.
>>> >> A quick search for past posts seems to indicate that OpenMPI doesn't
>>> itself invoke non-temporal, but there appear to be several useful articles
>>> not connected with 

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Ralph Castain
The fact that this exactly matches the time you measured with shared memory is 
suspicious. My guess is that you aren't actually using shared memory at all.

Does your "ompi_info" output show shared memory as being available? Jeff or 
others may be able to give you some params that would let you check to see if 
sm is actually being used between those procs.



On Mar 28, 2011, at 7:51 AM, Michele Marena wrote:

> What happens with 2 processes on the same node with tcp?
> With --mca btl self,tcp my app runs in 23s.
> 
> 2011/3/28 Jeff Squyres (jsquyres) 
> Ah, I didn't catch before that there were more variables than just tcp vs. 
> shmem. 
> 
> What happens with 2 processes on the same node with tcp?
> 
> Eg, when both procs are on the same node, are you thrashing caches or memory?
> 
> Sent from my phone. No type good. 
> 
> On Mar 28, 2011, at 6:27 AM, "Michele Marena"  wrote:
> 
>> However, I thank you Tim, Ralh and Jeff.
>> My sequential application runs in 24s (wall clock time).
>> My parallel application runs in 13s with two processes on different nodes.
>> With shared memory, when two processes are on the same node, my app runs in 
>> 23s.
>> I'm not understand why.
>> 
>> 2011/3/28 Jeff Squyres 
>> If your program runs faster across 3 processes, 2 of which are local to each 
>> other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then 
>> something is very, very strange.
>> 
>> Tim cites all kinds of things that can cause slowdowns, but it's still very, 
>> very odd that simply enabling using the shared memory communications channel 
>> in Open MPI *slows your overall application down*.
>> 
>> How much does your application slow down in wall clock time?  Seconds?  
>> Minutes?  Hours?  (anything less than 1 second is in the noise)
>> 
>> 
>> 
>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>> 
>> >
>> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>> >
>> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
>> >>> Hi,
>> >>> My application performs good without shared memory utilization, but with
>> >>> shared memory I get performance worst than without of it.
>> >>> Do I make a mistake? Don't I pay attention to something?
>> >>> I know OpenMPI uses /tmp directory to allocate shared memory and it is
>> >>> in the local filesystem.
>> >>>
>> >>
>> >> I guess you mean shared memory message passing.   Among relevant 
>> >> parameters may be the message size where your implementation switches 
>> >> from cached copy to non-temporal (if you are on a platform where that 
>> >> terminology is used).  If built with Intel compilers, for example, the 
>> >> copy may be performed by intel_fast_memcpy, with a default setting which 
>> >> uses non-temporal when the message exceeds about some preset size, e.g. 
>> >> 50% of smallest L2 cache for that architecture.
>> >> A quick search for past posts seems to indicate that OpenMPI doesn't 
>> >> itself invoke non-temporal, but there appear to be several useful 
>> >> articles not connected with OpenMPI.
>> >> In case guesses aren't sufficient, it's often necessary to profile 
>> >> (gprof, oprofile, Vtune, ) to pin this down.
>> >> If shared message slows your application down, the question is whether 
>> >> this is due to excessive eviction of data from cache; not a simple 
>> >> question, as most recent CPUs have 3 levels of cache, and your 
>> >> application may require more or less data which was in use prior to the 
>> >> message receipt, and may use immediately only a small piece of a large 
>> >> message.
>> >
>> > There were several papers published in earlier years about shared memory 
>> > performance in the 1.2 series.There were known problems with that 
>> > implementation, which is why it was heavily revised for the 1.3/4 series.
>> >
>> > You might also look at the following links, though much of it has been 
>> > updated to the 1.3/4 series as we don't really support 1.2 any more:
>> >
>> > http://www.open-mpi.org/faq/?category=sm
>> >
>> > http://www.open-mpi.org/faq/?category=perftools
>> >
>> >
>> >>
>> >> --
>> >> Tim Prince
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> 

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Michele Marena
What happens with 2 processes on the same node with tcp?
With --mca btl self,tcp my app runs in 23s.

2011/3/28 Jeff Squyres (jsquyres) 

> Ah, I didn't catch before that there were more variables than just tcp vs.
> shmem.
>
> What happens with 2 processes on the same node with tcp?
>
> Eg, when both procs are on the same node, are you thrashing caches or
> memory?
>
> Sent from my phone. No type good.
>
> On Mar 28, 2011, at 6:27 AM, "Michele Marena" 
> wrote:
>
> However, I thank you Tim, Ralh and Jeff.
> My sequential application runs in 24s (wall clock time).
> My parallel application runs in 13s with two processes on different nodes.
> With shared memory, when two processes are on the same node, my app runs in
> 23s.
> I'm not understand why.
>
> 2011/3/28 Jeff Squyres < jsquy...@cisco.com>
>
>> If your program runs faster across 3 processes, 2 of which are local to
>> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then
>> something is very, very strange.
>>
>> Tim cites all kinds of things that can cause slowdowns, but it's still
>> very, very odd that simply enabling using the shared memory communications
>> channel in Open MPI *slows your overall application down*.
>>
>> How much does your application slow down in wall clock time?  Seconds?
>>  Minutes?  Hours?  (anything less than 1 second is in the noise)
>>
>>
>>
>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>
>> >
>> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>> >
>> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
>> >>> Hi,
>> >>> My application performs good without shared memory utilization, but
>> with
>> >>> shared memory I get performance worst than without of it.
>> >>> Do I make a mistake? Don't I pay attention to something?
>> >>> I know OpenMPI uses /tmp directory to allocate shared memory and it is
>> >>> in the local filesystem.
>> >>>
>> >>
>> >> I guess you mean shared memory message passing.   Among relevant
>> parameters may be the message size where your implementation switches from
>> cached copy to non-temporal (if you are on a platform where that terminology
>> is used).  If built with Intel compilers, for example, the copy may be
>> performed by intel_fast_memcpy, with a default setting which uses
>> non-temporal when the message exceeds about some preset size, e.g. 50% of
>> smallest L2 cache for that architecture.
>> >> A quick search for past posts seems to indicate that OpenMPI doesn't
>> itself invoke non-temporal, but there appear to be several useful articles
>> not connected with OpenMPI.
>> >> In case guesses aren't sufficient, it's often necessary to profile
>> (gprof, oprofile, Vtune, ) to pin this down.
>> >> If shared message slows your application down, the question is whether
>> this is due to excessive eviction of data from cache; not a simple question,
>> as most recent CPUs have 3 levels of cache, and your application may require
>> more or less data which was in use prior to the message receipt, and may use
>> immediately only a small piece of a large message.
>> >
>> > There were several papers published in earlier years about shared memory
>> performance in the 1.2 series.There were known problems with that
>> implementation, which is why it was heavily revised for the 1.3/4 series.
>> >
>> > You might also look at the following links, though much of it has been
>> updated to the 1.3/4 series as we don't really support 1.2 any more:
>> >
>> > 
>> http://www.open-mpi.org/faq/?category=sm
>> >
>> > 
>> http://www.open-mpi.org/faq/?category=perftools
>> >
>> >
>> >>
>> >> --
>> >> Tim Prince
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> 
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > 
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>>  jsquy...@cisco.com
>> For corporate legal information go to:
>>  
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>>  us...@open-mpi.org
>>  
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> 

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Tim Prince

On 3/28/2011 3:29 AM, Michele Marena wrote:

Each node have two processors (no dual-core).

which seems to imply that the 2 processors share memory space and a 
single memory bus, and the question is not about what I originally guessed.


--
Tim Prince


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Jeff Squyres (jsquyres)
Ah, I didn't catch before that there were more variables than just tcp vs. 
shmem. 

What happens with 2 processes on the same node with tcp?

E.g., when both procs are on the same node, are you thrashing caches or memory?

Sent from my phone. No type good. 

On Mar 28, 2011, at 6:27 AM, "Michele Marena"  wrote:

> However, I thank you Tim, Ralh and Jeff.
> My sequential application runs in 24s (wall clock time).
> My parallel application runs in 13s with two processes on different nodes.
> With shared memory, when two processes are on the same node, my app runs in 
> 23s.
> I'm not understand why.
> 
> 2011/3/28 Jeff Squyres 
> If your program runs faster across 3 processes, 2 of which are local to each 
> other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then 
> something is very, very strange.
> 
> Tim cites all kinds of things that can cause slowdowns, but it's still very, 
> very odd that simply enabling using the shared memory communications channel 
> in Open MPI *slows your overall application down*.
> 
> How much does your application slow down in wall clock time?  Seconds?  
> Minutes?  Hours?  (anything less than 1 second is in the noise)
> 
> 
> 
> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
> 
> >
> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
> >
> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
> >>> Hi,
> >>> My application performs good without shared memory utilization, but with
> >>> shared memory I get performance worst than without of it.
> >>> Do I make a mistake? Don't I pay attention to something?
> >>> I know OpenMPI uses /tmp directory to allocate shared memory and it is
> >>> in the local filesystem.
> >>>
> >>
> >> I guess you mean shared memory message passing.   Among relevant 
> >> parameters may be the message size where your implementation switches from 
> >> cached copy to non-temporal (if you are on a platform where that 
> >> terminology is used).  If built with Intel compilers, for example, the 
> >> copy may be performed by intel_fast_memcpy, with a default setting which 
> >> uses non-temporal when the message exceeds about some preset size, e.g. 
> >> 50% of smallest L2 cache for that architecture.
> >> A quick search for past posts seems to indicate that OpenMPI doesn't 
> >> itself invoke non-temporal, but there appear to be several useful articles 
> >> not connected with OpenMPI.
> >> In case guesses aren't sufficient, it's often necessary to profile (gprof, 
> >> oprofile, Vtune, ) to pin this down.
> >> If shared message slows your application down, the question is whether 
> >> this is due to excessive eviction of data from cache; not a simple 
> >> question, as most recent CPUs have 3 levels of cache, and your application 
> >> may require more or less data which was in use prior to the message 
> >> receipt, and may use immediately only a small piece of a large message.
> >
> > There were several papers published in earlier years about shared memory 
> > performance in the 1.2 series.There were known problems with that 
> > implementation, which is why it was heavily revised for the 1.3/4 series.
> >
> > You might also look at the following links, though much of it has been 
> > updated to the 1.3/4 series as we don't really support 1.2 any more:
> >
> > http://www.open-mpi.org/faq/?category=sm
> >
> > http://www.open-mpi.org/faq/?category=perftools
> >
> >
> >>
> >> --
> >> Tim Prince
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Michele Marena
Each node has two processors (not dual-core).

2011/3/28 Michele Marena 

> However, I thank you Tim, Ralh and Jeff.
> My sequential application runs in 24s (wall clock time).
> My parallel application runs in 13s with two processes on different nodes.
> With shared memory, when two processes are on the same node, my app runs in
> 23s.
> I'm not understand why.
>
> 2011/3/28 Jeff Squyres 
>
>> If your program runs faster across 3 processes, 2 of which are local to
>> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then
>> something is very, very strange.
>>
>> Tim cites all kinds of things that can cause slowdowns, but it's still
>> very, very odd that simply enabling using the shared memory communications
>> channel in Open MPI *slows your overall application down*.
>>
>> How much does your application slow down in wall clock time?  Seconds?
>>  Minutes?  Hours?  (anything less than 1 second is in the noise)
>>
>>
>>
>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>
>> >
>> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>> >
>> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
>> >>> Hi,
>> >>> My application performs good without shared memory utilization, but
>> with
>> >>> shared memory I get performance worst than without of it.
>> >>> Do I make a mistake? Don't I pay attention to something?
>> >>> I know OpenMPI uses /tmp directory to allocate shared memory and it is
>> >>> in the local filesystem.
>> >>>
>> >>
>> >> I guess you mean shared memory message passing.   Among relevant
>> parameters may be the message size where your implementation switches from
>> cached copy to non-temporal (if you are on a platform where that terminology
>> is used).  If built with Intel compilers, for example, the copy may be
>> performed by intel_fast_memcpy, with a default setting which uses
>> non-temporal when the message exceeds about some preset size, e.g. 50% of
>> smallest L2 cache for that architecture.
>> >> A quick search for past posts seems to indicate that OpenMPI doesn't
>> itself invoke non-temporal, but there appear to be several useful articles
>> not connected with OpenMPI.
>> >> In case guesses aren't sufficient, it's often necessary to profile
>> (gprof, oprofile, Vtune, ) to pin this down.
>> >> If shared message slows your application down, the question is whether
>> this is due to excessive eviction of data from cache; not a simple question,
>> as most recent CPUs have 3 levels of cache, and your application may require
>> more or less data which was in use prior to the message receipt, and may use
>> immediately only a small piece of a large message.
>> >
>> > There were several papers published in earlier years about shared memory
>> performance in the 1.2 series.There were known problems with that
>> implementation, which is why it was heavily revised for the 1.3/4 series.
>> >
>> > You might also look at the following links, though much of it has been
>> updated to the 1.3/4 series as we don't really support 1.2 any more:
>> >
>> > http://www.open-mpi.org/faq/?category=sm
>> >
>> > http://www.open-mpi.org/faq/?category=perftools
>> >
>> >
>> >>
>> >> --
>> >> Tim Prince
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Michele Marena
However, I thank you, Tim, Ralph and Jeff.
My sequential application runs in 24s (wall clock time).
My parallel application runs in 13s with two processes on different nodes.
With shared memory, when the two processes are on the same node, my app runs in
23s.
I don't understand why.

2011/3/28 Jeff Squyres 

> If your program runs faster across 3 processes, 2 of which are local to
> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then
> something is very, very strange.
>
> Tim cites all kinds of things that can cause slowdowns, but it's still
> very, very odd that simply enabling using the shared memory communications
> channel in Open MPI *slows your overall application down*.
>
> How much does your application slow down in wall clock time?  Seconds?
>  Minutes?  Hours?  (anything less than 1 second is in the noise)
>
>
>
> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>
> >
> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
> >
> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
> >>> Hi,
> >>> My application performs good without shared memory utilization, but
> with
> >>> shared memory I get performance worst than without of it.
> >>> Do I make a mistake? Don't I pay attention to something?
> >>> I know OpenMPI uses /tmp directory to allocate shared memory and it is
> >>> in the local filesystem.
> >>>
> >>
> >> I guess you mean shared memory message passing.   Among relevant
> parameters may be the message size where your implementation switches from
> cached copy to non-temporal (if you are on a platform where that terminology
> is used).  If built with Intel compilers, for example, the copy may be
> performed by intel_fast_memcpy, with a default setting which uses
> non-temporal when the message exceeds about some preset size, e.g. 50% of
> smallest L2 cache for that architecture.
> >> A quick search for past posts seems to indicate that OpenMPI doesn't
> itself invoke non-temporal, but there appear to be several useful articles
> not connected with OpenMPI.
> >> In case guesses aren't sufficient, it's often necessary to profile
> (gprof, oprofile, Vtune, ) to pin this down.
> >> If shared message slows your application down, the question is whether
> this is due to excessive eviction of data from cache; not a simple question,
> as most recent CPUs have 3 levels of cache, and your application may require
> more or less data which was in use prior to the message receipt, and may use
> immediately only a small piece of a large message.
> >
> > There were several papers published in earlier years about shared memory
> performance in the 1.2 series.There were known problems with that
> implementation, which is why it was heavily revised for the 1.3/4 series.
> >
> > You might also look at the following links, though much of it has been
> updated to the 1.3/4 series as we don't really support 1.2 any more:
> >
> > http://www.open-mpi.org/faq/?category=sm
> >
> > http://www.open-mpi.org/faq/?category=perftools
> >
> >
> >>
> >> --
> >> Tim Prince
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-27 Thread Jeff Squyres
If your program runs faster across 3 processes, 2 of which are local to each 
other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then 
something is very, very strange.

Tim cites all kinds of things that can cause slowdowns, but it's still very, 
very odd that simply enabling the shared memory communications channel in 
Open MPI *slows your overall application down*.

How much does your application slow down in wall clock time?  Seconds?  
Minutes?  Hours?  (anything less than 1 second is in the noise)
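
One way to put numbers on that, sketched here with MPI_Wtime() around separate
compute and exchange phases (the buffer size, iteration count and rank pairing
below are made up for illustration; this is not Michele's application):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Times computation and communication separately, so two runs
 * (--mca btl self,tcp vs. --mca btl self,sm,tcp) can be compared
 * phase by phase rather than only on total wall-clock time.       */
int main(int argc, char **argv)
{
    int rank, size, peer;
    const int n = 100000;               /* doubles per message (arbitrary) */
    double *sbuf, *rbuf, t0, t_comp = 0.0, t_comm = 0.0, acc = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    peer = rank ^ 1;                    /* pair ranks 0-1, 2-3, ...          */

    sbuf = malloc(n * sizeof(double));
    rbuf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) sbuf[i] = i;

    for (int iter = 0; iter < 200; iter++) {
        t0 = MPI_Wtime();
        for (int i = 0; i < n; i++) acc += sbuf[i] * 1.0000001;  /* stand-in compute */
        t_comp += MPI_Wtime() - t0;

        if (peer < size) {              /* odd rank out just computes        */
            t0 = MPI_Wtime();
            MPI_Sendrecv(sbuf, n, MPI_DOUBLE, peer, 0,
                         rbuf, n, MPI_DOUBLE, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            t_comm += MPI_Wtime() - t0;
        }
    }

    printf("rank %d: compute %.3f s, communication %.3f s (acc=%g)\n",
           rank, t_comp, t_comm, acc);
    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}

If the compute time stays the same between the two runs but the communication
time grows with the sm btl enabled, the slowdown is in the transport; if the
compute time itself grows, cache effects of the kind Tim describes below are
the more likely suspect.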



On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:

> 
> On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
> 
>> On 3/27/2011 2:26 AM, Michele Marena wrote:
>>> Hi,
>>> My application performs good without shared memory utilization, but with
>>> shared memory I get performance worst than without of it.
>>> Do I make a mistake? Don't I pay attention to something?
>>> I know OpenMPI uses /tmp directory to allocate shared memory and it is
>>> in the local filesystem.
>>> 
>> 
>> I guess you mean shared memory message passing.   Among relevant parameters 
>> may be the message size where your implementation switches from cached copy 
>> to non-temporal (if you are on a platform where that terminology is used).  
>> If built with Intel compilers, for example, the copy may be performed by 
>> intel_fast_memcpy, with a default setting which uses non-temporal when the 
>> message exceeds about some preset size, e.g. 50% of smallest L2 cache for 
>> that architecture.
>> A quick search for past posts seems to indicate that OpenMPI doesn't itself 
>> invoke non-temporal, but there appear to be several useful articles not 
>> connected with OpenMPI.
>> In case guesses aren't sufficient, it's often necessary to profile (gprof, 
>> oprofile, Vtune, ) to pin this down.
>> If shared message slows your application down, the question is whether this 
>> is due to excessive eviction of data from cache; not a simple question, as 
>> most recent CPUs have 3 levels of cache, and your application may require 
>> more or less data which was in use prior to the message receipt, and may use 
>> immediately only a small piece of a large message.
> 
> There were several papers published in earlier years about shared memory 
> performance in the 1.2 series.There were known problems with that 
> implementation, which is why it was heavily revised for the 1.3/4 series.
> 
> You might also look at the following links, though much of it has been 
> updated to the 1.3/4 series as we don't really support 1.2 any more:
> 
> http://www.open-mpi.org/faq/?category=sm
> 
> http://www.open-mpi.org/faq/?category=perftools
> 
> 
>> 
>> -- 
>> Tim Prince
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Shared Memory Performance Problem.

2011-03-27 Thread Ralph Castain

On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:

> On 3/27/2011 2:26 AM, Michele Marena wrote:
>> Hi,
>> My application performs good without shared memory utilization, but with
>> shared memory I get performance worst than without of it.
>> Do I make a mistake? Don't I pay attention to something?
>> I know OpenMPI uses /tmp directory to allocate shared memory and it is
>> in the local filesystem.
>> 
> 
> I guess you mean shared memory message passing.   Among relevant parameters 
> may be the message size where your implementation switches from cached copy 
> to non-temporal (if you are on a platform where that terminology is used).  
> If built with Intel compilers, for example, the copy may be performed by 
> intel_fast_memcpy, with a default setting which uses non-temporal when the 
> message exceeds about some preset size, e.g. 50% of smallest L2 cache for 
> that architecture.
> A quick search for past posts seems to indicate that OpenMPI doesn't itself 
> invoke non-temporal, but there appear to be several useful articles not 
> connected with OpenMPI.
> In case guesses aren't sufficient, it's often necessary to profile (gprof, 
> oprofile, Vtune, ) to pin this down.
> If shared message slows your application down, the question is whether this 
> is due to excessive eviction of data from cache; not a simple question, as 
> most recent CPUs have 3 levels of cache, and your application may require 
> more or less data which was in use prior to the message receipt, and may use 
> immediately only a small piece of a large message.

There were several papers published in earlier years about shared memory 
performance in the 1.2 series. There were known problems with that 
implementation, which is why it was heavily revised for the 1.3/4 series.

You might also look at the following links, though much of it has been updated 
to the 1.3/4 series as we don't really support 1.2 any more:

http://www.open-mpi.org/faq/?category=sm

http://www.open-mpi.org/faq/?category=perftools


> 
> -- 
> Tim Prince
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Shared Memory Performance Problem.

2011-03-27 Thread Tim Prince

On 3/27/2011 2:26 AM, Michele Marena wrote:

Hi,
My application performs good without shared memory utilization, but with
shared memory I get performance worst than without of it.
Do I make a mistake? Don't I pay attention to something?
I know OpenMPI uses /tmp directory to allocate shared memory and it is
in the local filesystem.



I guess you mean shared memory message passing.   Among relevant 
parameters may be the message size where your implementation switches 
from cached copy to non-temporal (if you are on a platform where that 
terminology is used).  If built with Intel compilers, for example, the 
copy may be performed by intel_fast_memcpy, with a default setting which 
uses non-temporal when the message exceeds about some preset size, e.g. 
50% of smallest L2 cache for that architecture.
A quick search for past posts seems to indicate that OpenMPI doesn't 
itself invoke non-temporal, but there appear to be several useful 
articles not connected with OpenMPI.
In case guesses aren't sufficient, it's often necessary to profile 
(gprof, oprofile, Vtune, ) to pin this down.
If shared message slows your application down, the question is whether 
this is due to excessive eviction of data from cache; not a simple 
question, as most recent CPUs have 3 levels of cache, and your 
application may require more or less data which was in use prior to the 
message receipt, and may use immediately only a small piece of a large 
message.


--
Tim Prince
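
For the message-size threshold Tim mentions, a simple ping-pong sweep is often
enough to see where the two transports diverge; this is only a sketch (sizes,
repetition count and the executable name are arbitrary), run once per btl
setting with two processes placed on the same node:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Ping-pong between ranks 0 and 1 over a range of message sizes.
 * Compare, e.g.:  mpirun -np 2 --mca btl self,sm  ./pingpong
 *           vs.:  mpirun -np 2 --mca btl self,tcp ./pingpong       */
int main(int argc, char **argv)
{
    int rank;
    const int reps = 1000;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(1 << 22);                        /* up to 4 MB messages */

    for (int bytes = 1; bytes <= (1 << 22); bytes *= 4) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);
        if (rank == 0)
            printf("%9d bytes: %10.2f us one-way\n", bytes, t * 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

A kink in the curve around a particular size would be consistent with the
copy-strategy switch Tim describes; it does not by itself show cache eviction
in the application, which still needs a profiler.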


Re: [OMPI users] Shared Memory Performance Problem.

2011-03-27 Thread Michele Marena
This is my machinefile
node-1-16 slots=2
node-1-17 slots=2
node-1-18 slots=2
node-1-19 slots=2
node-1-20 slots=2
node-1-21 slots=2
node-1-22 slots=2
node-1-23 slots=2

Each cluster node has 2 processors. I launch my application with 3
processes: one on node-1-16 (the manager) and two on node-1-17 (the workers). The
two processes on node-1-17 communicate with each other.

2011/3/27 Michele Marena 

> Hi,
> My application performs good without shared memory utilization, but with
> shared memory I get performance worst than without of it.
> Do I make a mistake? Don't I pay attention to something?
> I know OpenMPI uses /tmp directory to allocate shared memory and it is in
> the local filesystem.
>
> I thank you.
> Michele.
>


[OMPI users] Shared Memory Performance Problem.

2011-03-27 Thread Michele Marena
Hi,
My application performs well without shared memory, but with
shared memory I get worse performance than without it.
Am I making a mistake? Is there something I'm not paying attention to?
I know OpenMPI uses the /tmp directory to allocate shared memory, and it is in
the local filesystem.

I thank you.
Michele.


Re: [OMPI users] Shared Memory Problem.

2011-03-26 Thread Michele Marena
Yes, it works fine without shared memory. Thank you, Ralph. I will check
the code for logical mistakes; otherwise I will go with the option you
suggested.

2011/3/26 Ralph Castain 

> Your other option is to simply not use shared memory. TCP contains loopback
> support, so you can run with just
>
> -mca btl self,tcp
>
> and shared memory won't be used. It will run a tad slower that way, but at
> least your app will complete.
>
>
> On Mar 26, 2011, at 2:30 PM, Reuti wrote:
>
> > Am 26.03.2011 um 21:16 schrieb Michele Marena:
> >
> >> No, I can't. I'm not a administrator of the cluster and I'm not the
> owner.
> >
> > You can recompile your private version of Open MPI and install it in
> $HOME/local/openmpi-1.4.3 or alike and set paths accordingly during
> compilation of your source and execution.
> >
> > -- Reuti
> >
> >
> >> 2011/3/26 Ralph Castain 
> >> Can you update to a more recent version? That version is several years
> out-of-date - we don't even really support it any more.
> >>
> >>
> >> On Mar 26, 2011, at 1:04 PM, Michele Marena wrote:
> >>
> >>> Yes, the syntax is wrong in the email, but I write it correctly when I
> launch mpirun. When some communicating processes are on the same node the
> application don't terminate, otherwise the application terminate and its
> results are correct. My OpenMPI version is 1.2.7.
> >>>
> >>> 2011/3/26 Ralph Castain 
> >>>
> >>> On Mar 26, 2011, at 11:34 AM, Michele Marena wrote:
> >>>
>  Hi,
>  I've a problem with shared memory. When my application runs using pure
> message passing (one process for node), it terminates and returns correct
> results. When 2 processes share a node and use shared memory for exchanges
> messages, my application don't terminate. At shell I write "mpirun -nolocal
> --mca self,sm,tcp ... (forces interface to eth0)... -np (number of
> processes) executable arguments".
> >>>
> >>> That's not right. If you want the apps to use shared memory, you don't
> have to do anything. However, if you do want to specify, then the correct
> syntax is
> >>>
> >>> mpirun -mca btl self,sm,tcp
> >>>
> >>>
> 
>  I hope you help me.
>  I thank you.
>  Michele ___
>  users mailing list
>  us...@open-mpi.org
>  http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>
> >>>
> >>> ___
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>
> >>> ___
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Shared Memory Problem.

2011-03-26 Thread Ralph Castain
Your other option is to simply not use shared memory. TCP contains loopback 
support, so you can run with just

-mca btl self,tcp

and shared memory won't be used. It will run a tad slower that way, but at 
least your app will complete.


On Mar 26, 2011, at 2:30 PM, Reuti wrote:

> Am 26.03.2011 um 21:16 schrieb Michele Marena:
> 
>> No, I can't. I'm not a administrator of the cluster and I'm not the owner.
> 
> You can recompile your private version of Open MPI and install it in 
> $HOME/local/openmpi-1.4.3 or alike and set paths accordingly during 
> compilation of your source and execution.
> 
> -- Reuti
> 
> 
>> 2011/3/26 Ralph Castain 
>> Can you update to a more recent version? That version is several years 
>> out-of-date - we don't even really support it any more.
>> 
>> 
>> On Mar 26, 2011, at 1:04 PM, Michele Marena wrote:
>> 
>>> Yes, the syntax is wrong in the email, but I write it correctly when I 
>>> launch mpirun. When some communicating processes are on the same node the 
>>> application don't terminate, otherwise the application terminate and its 
>>> results are correct. My OpenMPI version is 1.2.7.
>>> 
>>> 2011/3/26 Ralph Castain 
>>> 
>>> On Mar 26, 2011, at 11:34 AM, Michele Marena wrote:
>>> 
 Hi,
 I've a problem with shared memory. When my application runs using pure 
 message passing (one process for node), it terminates and returns correct 
 results. When 2 processes share a node and use shared memory for exchanges 
 messages, my application don't terminate. At shell I write "mpirun 
 -nolocal --mca self,sm,tcp ... (forces interface to eth0)... -np (number 
 of processes) executable arguments".
>>> 
>>> That's not right. If you want the apps to use shared memory, you don't have 
>>> to do anything. However, if you do want to specify, then the correct syntax 
>>> is
>>> 
>>> mpirun -mca btl self,sm,tcp
>>> 
>>> 
 
 I hope you help me.
 I thank you.
 Michele ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Shared Memory Problem.

2011-03-26 Thread Reuti
Am 26.03.2011 um 21:16 schrieb Michele Marena:

> No, I can't. I'm not a administrator of the cluster and I'm not the owner.

You can compile your own private version of Open MPI, install it in 
$HOME/local/openmpi-1.4.3 or similar, and set the paths accordingly when 
compiling your source and at execution time.

-- Reuti


> 2011/3/26 Ralph Castain 
> Can you update to a more recent version? That version is several years 
> out-of-date - we don't even really support it any more.
> 
> 
> On Mar 26, 2011, at 1:04 PM, Michele Marena wrote:
> 
>> Yes, the syntax is wrong in the email, but I write it correctly when I 
>> launch mpirun. When some communicating processes are on the same node the 
>> application don't terminate, otherwise the application terminate and its 
>> results are correct. My OpenMPI version is 1.2.7.
>> 
>> 2011/3/26 Ralph Castain 
>> 
>> On Mar 26, 2011, at 11:34 AM, Michele Marena wrote:
>> 
>> > Hi,
>> > I've a problem with shared memory. When my application runs using pure 
>> > message passing (one process for node), it terminates and returns correct 
>> > results. When 2 processes share a node and use shared memory for exchanges 
>> > messages, my application don't terminate. At shell I write "mpirun 
>> > -nolocal --mca self,sm,tcp ... (forces interface to eth0)... -np (number 
>> > of processes) executable arguments".
>> 
>> That's not right. If you want the apps to use shared memory, you don't have 
>> to do anything. However, if you do want to specify, then the correct syntax 
>> is
>> 
>> mpirun -mca btl self,sm,tcp
>> 
>> 
>> >
>> > I hope you help me.
>> > I thank you.
>> > Michele ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 




Re: [OMPI users] Shared Memory Problem.

2011-03-26 Thread Michele Marena
No, I can't. I'm not an administrator of the cluster, and I'm not the owner.

2011/3/26 Ralph Castain 

> Can you update to a more recent version? That version is several years
> out-of-date - we don't even really support it any more.
>
>
> On Mar 26, 2011, at 1:04 PM, Michele Marena wrote:
>
> Yes, the syntax is wrong in the email, but I write it correctly when I
> launch mpirun. When some communicating processes are on the same node the
> application don't terminate, otherwise the application terminate and its
> results are correct. My OpenMPI version is 1.2.7.
>
> 2011/3/26 Ralph Castain 
>
>>
>> On Mar 26, 2011, at 11:34 AM, Michele Marena wrote:
>>
>> > Hi,
>> > I've a problem with shared memory. When my application runs using pure
>> message passing (one process for node), it terminates and returns correct
>> results. When 2 processes share a node and use shared memory for exchanges
>> messages, my application don't terminate. At shell I write "mpirun -nolocal
>> --mca self,sm,tcp ... (forces interface to eth0)... -np (number of
>> processes) executable arguments".
>>
>> That's not right. If you want the apps to use shared memory, you don't
>> have to do anything. However, if you do want to specify, then the correct
>> syntax is
>>
>> mpirun -mca btl self,sm,tcp
>>
>>
>> >
>> > I hope you help me.
>> > I thank you.
>> > Michele ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Shared Memory Problem.

2011-03-26 Thread Ralph Castain
Can you update to a more recent version? That version is several years 
out-of-date - we don't even really support it any more.


On Mar 26, 2011, at 1:04 PM, Michele Marena wrote:

> Yes, the syntax is wrong in the email, but I write it correctly when I launch 
> mpirun. When some communicating processes are on the same node the 
> application don't terminate, otherwise the application terminate and its 
> results are correct. My OpenMPI version is 1.2.7.
> 
> 2011/3/26 Ralph Castain 
> 
> On Mar 26, 2011, at 11:34 AM, Michele Marena wrote:
> 
> > Hi,
> > I've a problem with shared memory. When my application runs using pure 
> > message passing (one process for node), it terminates and returns correct 
> > results. When 2 processes share a node and use shared memory for exchanges 
> > messages, my application don't terminate. At shell I write "mpirun -nolocal 
> > --mca self,sm,tcp ... (forces interface to eth0)... -np (number of 
> > processes) executable arguments".
> 
> That's not right. If you want the apps to use shared memory, you don't have 
> to do anything. However, if you do want to specify, then the correct syntax is
> 
> mpirun -mca btl self,sm,tcp
> 
> 
> >
> > I hope you help me.
> > I thank you.
> > Michele ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Shared Memory Problem.

2011-03-26 Thread Michele Marena
Yes, the syntax is wrong in the email, but I write it correctly when I
launch mpirun. When some communicating processes are on the same node, the
application doesn't terminate; otherwise the application terminates and its
results are correct. My OpenMPI version is 1.2.7.

2011/3/26 Ralph Castain 

>
> On Mar 26, 2011, at 11:34 AM, Michele Marena wrote:
>
> > Hi,
> > I've a problem with shared memory. When my application runs using pure
> message passing (one process for node), it terminates and returns correct
> results. When 2 processes share a node and use shared memory for exchanges
> messages, my application don't terminate. At shell I write "mpirun -nolocal
> --mca self,sm,tcp ... (forces interface to eth0)... -np (number of
> processes) executable arguments".
>
> That's not right. If you want the apps to use shared memory, you don't have
> to do anything. However, if you do want to specify, then the correct syntax
> is
>
> mpirun -mca btl self,sm,tcp
>
>
> >
> > I hope you help me.
> > I thank you.
> > Michele ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Shared Memory Problem.

2011-03-26 Thread Ralph Castain

On Mar 26, 2011, at 11:34 AM, Michele Marena wrote:

> Hi,
> I've a problem with shared memory. When my application runs using pure 
> message passing (one process for node), it terminates and returns correct 
> results. When 2 processes share a node and use shared memory for exchanges 
> messages, my application don't terminate. At shell I write "mpirun -nolocal 
> --mca self,sm,tcp ... (forces interface to eth0)... -np (number of processes) 
> executable arguments".

That's not right. If you want the apps to use shared memory, you don't have to 
do anything. However, if you do want to specify, then the correct syntax is

mpirun -mca btl self,sm,tcp


> 
> I hope you help me.
> I thank you.
> Michele ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Shared Memory Problem.

2011-03-26 Thread Michele Marena
Hi,
I have a problem with shared memory. When my application runs using pure
message passing (one process per node), it terminates and returns correct
results. When 2 processes share a node and use shared memory to exchange
messages, my application doesn't terminate. At the shell I write "mpirun -nolocal
--mca self,sm,tcp ... (forces the interface to eth0)... -np (number of
processes) executable arguments".

I hope you help me.
I thank you.
Michele


Re: [OMPI users] Shared memory

2010-10-06 Thread Richard Treumann
When you use MPI message passing in your application, the MPI library 
decides how to deliver the message. The "magic" is simply that when the sender 
process and the receiver process are on the same node (shared memory domain), 
the library uses shared memory to deliver the message from process to 
process.  When the sender process and the receiver process are on different 
nodes, some interconnect method is used.

The MPI API does not have any explicit recognition of shared memory. If 
you are thinking of MPI 1-sided when you mention "MPI-2 shared memory", 
we should be clear that MPI 1-sided communication is only vaguely similar 
to shared memory: it only provides access through MPI calls (MPI_Put, 
MPI_Get and MPI_Accumulate) and does not magically create shared memory 
that you can load/store.


Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363
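
For readers who have not used the 1-sided interface Dick refers to, here is a
minimal sketch of MPI-2 one-sided access (run on at least 2 ranks; the window
size and values are arbitrary) - note that rank 0 reaches rank 1's memory only
through MPI_Put, never by an ordinary load or store:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, local[4] = {0, 0, 0, 0};
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes its own 4-int array as a window. */
    MPI_Win_create(local, 4 * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int data[4] = {10, 20, 30, 40};
        MPI_Put(data, 4, MPI_INT, 1 /* target rank */, 0 /* displacement */,
                4, MPI_INT, win);
    }
    MPI_Win_fence(0, win);              /* completes the Put on both sides */

    if (rank == 1)
        printf("rank 1 window now holds %d %d %d %d\n",
               local[0], local[1], local[2], local[3]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}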




From: Andrei Fokau <andrei.fo...@neutron.kth.se>
To: Open MPI Users <us...@open-mpi.org>
Date: 10/06/2010 10:12 AM
Subject: Re: [OMPI users] Shared memory



Currently we run a code on a cluster with distributed memory, and this 
code needs a lot of memory. Part of the data stored in memory is the same 
for each process, but it is organized as one array - we can split it if 
necessary. So far no magic occurred for us. What do we need to do to make 
the magic working?


On Wed, Oct 6, 2010 at 12:43, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
wrote:
Open MPI will use shared memory to communicate between peers on the sane 
node - but that's hidden beneath the covers; it's not exposed via the MPI 
API. You just MPI-send and magic occurs and the receiver gets the 
message. 

On Oct 4, 2010, at 11:13 AM, "Andrei Fokau" <andrei.fo...@neutron.kth.se> 
wrote:
Does OMPI have shared memory capabilities (as it is mentioned in MPI-2)?
How can I use them?

On Sat, Sep 25, 2010 at 23:19, Andrei Fokau <andrei.fo...@neutron.kth.se> 
wrote:
Here are some more details about our problem. We use a dozen of 
4-processor nodes with 8 GB memory on each node. The code we run needs 
about 3 GB per processor, so we can load only 2 processors out of 4. The 
vast majority of those 3 GB is the same for each processor and is 
accessed continuously during calculation. In my original question I wasn't 
very clear asking about a possibility to use shared memory with Open MPI - 
in our case we do not need to have a remote access to the data, and it 
would be sufficient to share memory within each node only.

Of course, the possibility to access the data remotely (via mmap) is 
attractive because it would allow to store much larger arrays (up to 10 
GB) at one remote place, meaning higher accuracy for our calculations. 
However, I believe that the access time would be too long for the data 
read so frequently, and therefore the performance would be lost.

I still hope that some of the subscribers to this mailing list have an 
experience of using Global Arrays. This library seems to be fine for our 
case, however I feel that there should be a simpler solution. Open MPI 
conforms with MPI-2 standard, and the later has a description of shared 
memory application. Do you see any other way for us to use shared memory 
(within node) apart of using Global Arrays?

On Fri, Sep 24, 2010 at 19:03, Durga Choudhury <dpcho...@gmail.com> wrote:
I think the 'middle ground' approach can be simplified even further if
the data file is in a shared device (e.g. NFS/Samba mount) that can be
mounted at the same location of the file system tree on all nodes. I
have never tried it, though and mmap()'ing a non-POSIX compliant file
system such as Samba might have issues I am unaware of.

However, I do not see why you should not be able to do this even if
the file is being written to as long as you call msync() before using
the mapped pages.

Durga


On Fri, Sep 24, 2010 at 12:31 PM, Eugene Loh <eugene@oracle.com> 
wrote:
> It seems to me there are two extremes.
>
> One is that you replicate the data for each process.  This has the
> disadvantage of consuming lots of memory "unnecessarily."
>
> Another extreme is that shared data is distributed over all processes.  
This
> has the disadvantage of making at least some of the data less 
accessible,
> whether in programming complexity and/or run-time performance.
>
> I'm not familiar with Global Arrays.  I was somewhat familiar with HPF.  
I
> think the natural thing to do with those programming models is to 
distribute
> data over all processes, which may relieve the excessive memory 
consumption
> you're trying to address but which may also just put you at a different
> "extreme" of this spectrum.
>
> The middle ground I think might make

Re: [OMPI users] Shared memory

2010-10-06 Thread Andrei Fokau
Currently we run a code on a cluster with distributed memory, and this code
needs a lot of memory. Part of the data stored in memory is the same for
each process, but it is organized as one array - we can split it if
necessary. So far no magic has occurred for us. What do we need to do to make
the magic work?
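
A minimal sketch of the per-node sharing that Eugene Loh outlines in the text
quoted further down (one rank per node creates and fills a segment, its peers
on the same node map it): it assumes POSIX shm_open()/mmap() are available
(link with -lrt on older Linux); the segment name, array size and
hostname-hash grouping are only illustrative, and error handling is omitted:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define N (1000u * 1000u)                 /* doubles to share (illustrative) */

int main(int argc, char **argv)
{
    int rank, node_rank;
    char host[256];
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Group ranks by hostname so each node gets its own communicator.
       A string hash is used as the split color; a real code should also
       compare the full hostnames in case of hash collisions.             */
    gethostname(host, sizeof(host));
    unsigned color = 5381;
    for (char *p = host; *p; p++) color = color * 33 + (unsigned char)*p;
    MPI_Comm_split(MPI_COMM_WORLD, (int)(color & 0x7fffffff), rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One rank per node creates and fills the shared segment ...          */
    const char *name = "/shared_input_array";     /* hypothetical name */
    if (node_rank == 0) {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        (void)ftruncate(fd, N * sizeof(double));
        double *a = mmap(NULL, N * sizeof(double), PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        for (size_t i = 0; i < N; i++) a[i] = (double)i;  /* stand-in for the file read */
        close(fd);
    }
    MPI_Barrier(node_comm);                       /* data is ready */

    /* ... and its peers on the same node map it read-only.                */
    int fd = shm_open(name, O_RDONLY, 0600);
    const double *array = mmap(NULL, N * sizeof(double), PROT_READ,
                               MAP_SHARED, fd, 0);
    close(fd);

    printf("rank %d (node rank %d) sees array[42] = %g\n",
           rank, node_rank, array[42]);

    MPI_Barrier(node_comm);
    if (node_rank == 0) shm_unlink(name);
    MPI_Finalize();
    return 0;
}

With something like this, only one copy of the big read-only array lives on
each node, while every rank still reads it with plain loads.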


On Wed, Oct 6, 2010 at 12:43, Jeff Squyres (jsquyres) wrote:

> Open MPI will use shared memory to communicate between peers on the sane
> node - but that's hidden beneath the covers; it's not exposed via the MPI
> API. You just MPI-send and magic occurs and the receiver gets the message.
>
> On Oct 4, 2010, at 11:13 AM, "Andrei Fokau" 
> wrote:
>
> Does OMPI have shared memory capabilities (as it is mentioned in MPI-2)?
> How can I use them?
>
> On Sat, Sep 25, 2010 at 23:19, Andrei Fokau <
> andrei.fo...@neutron.kth.se> wrote:
>
>> Here are some more details about our problem. We use a dozen of
>> 4-processor nodes with 8 GB memory on each node. The code we run needs about
>> 3 GB per processor, so we can load only 2 processors out of 4. The vast
>> majority of those 3 GB is the same for each processor and is
>> accessed continuously during calculation. In my original question I wasn't
>> very clear asking about a possibility to use shared memory with Open MPI -
>> in our case we do not need to have a remote access to the data, and it
>> would be sufficient to share memory within each node only.
>>
>> Of course, the possibility to access the data remotely (via mmap) is
>> attractive because it would allow to store much larger arrays (up to 10 GB)
>> at one remote place, meaning higher accuracy for our calculations. However,
>> I believe that the access time would be too long for the data read so
>> frequently, and therefore the performance would be lost.
>>
>> I still hope that some of the subscribers to this mailing list have an
>> experience of using Global Arrays. This library seems to be fine for our
>> case, however I feel that there should be a simpler solution. Open MPI
>> conforms with MPI-2 standard, and the later has a description of shared
>> memory application. Do you see any other way for us to use shared memory
>> (within node) apart of using Global Arrays?
>>
>> On Fri, Sep 24, 2010 at 19:03, Durga Choudhury < 
>> dpcho...@gmail.com> wrote:
>>
>>> I think the 'middle ground' approach can be simplified even further if
>>> the data file is in a shared device (e.g. NFS/Samba mount) that can be
>>> mounted at the same location of the file system tree on all nodes. I
>>> have never tried it, though and mmap()'ing a non-POSIX compliant file
>>> system such as Samba might have issues I am unaware of.
>>>
>>> However, I do not see why you should not be able to do this even if
>>> the file is being written to as long as you call msync() before using
>>> the mapped pages.
>>>
>>> Durga
>>>
>>>
>>> On Fri, Sep 24, 2010 at 12:31 PM, Eugene Loh < 
>>> eugene@oracle.com> wrote:
>>> > It seems to me there are two extremes.
>>> >
>>> > One is that you replicate the data for each process.  This has the
>>> > disadvantage of consuming lots of memory "unnecessarily."
>>> >
>>> > Another extreme is that shared data is distributed over all processes.
>>> This
>>> > has the disadvantage of making at least some of the data less
>>> accessible,
>>> > whether in programming complexity and/or run-time performance.
>>> >
>>> > I'm not familiar with Global Arrays.  I was somewhat familiar with
>>> HPF.  I
>>> > think the natural thing to do with those programming models is to
>>> distribute
>>> > data over all processes, which may relieve the excessive memory
>>> consumption
>>> > you're trying to address but which may also just put you at a different
>>> > "extreme" of this spectrum.
>>> >
>>> > The middle ground I think might make most sense would be to share data
>>> only
>>> > within a node, but to replicate the data for each node.  There are
>>> probably
>>> > multiple ways of doing this -- possibly even GA, I don't know.  One way
>>> > might be to use one MPI process per node, with OMP multithreading
>>> within
>>> > each process|node.  Or (and I thought this was the solution you were
>>> looking
>>> > for), have some idea which processes are collocal.  Have one process
>>> per
>>> > node create and initialize some shared memory -- mmap, perhaps, or SysV
>>> > shared memory.  Then, have its peers map the same shared memory into
>>> their
>>> > address spaces.
>>> >
>>> > You asked what source code changes would be required.  It depends.  If
>>> > you're going to mmap shared memory in on each node, you need to know
>>> which
>>> > processes are collocal.  If you're willing to constrain how processes
>>> are
>>> > mapped to nodes, this could be easy.  (E.g., "every 4 processes are
>>> > collocal".)  If you want to discover dynamically at run time which are
>>> > collocal, it would be 

Re: [OMPI users] Shared memory

2010-10-06 Thread Jeff Squyres (jsquyres)
Open MPI will use shared memory to communicate between peers on the same node - 
but that's hidden beneath the covers; it's not exposed via the MPI API. You 
just MPI-send and magic occurs and the receiver gets the message. 

Sent from my PDA. No type good. 
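
To make "you just MPI-send" concrete, here is a minimal sketch (at least two
ranks; the message content is arbitrary). Nothing in this source mentions
shared memory or TCP - the library picks the transport based on where the two
ranks happen to be placed:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    char msg[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(msg, "hello from rank 0");
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(msg, (int)sizeof(msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}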

On Oct 4, 2010, at 11:13 AM, "Andrei Fokau"  wrote:

> Does OMPI have shared memory capabilities (as it is mentioned in MPI-2)?
> How can I use them?
> 
> Andrei
> 
> 
> On Sat, Sep 25, 2010 at 23:19, Andrei Fokau  
> wrote:
> Here are some more details about our problem. We use a dozen of 4-processor 
> nodes with 8 GB memory on each node. The code we run needs about 3 GB per 
> processor, so we can load only 2 processors out of 4. The vast majority of 
> those 3 GB is the same for each processor and is accessed continuously during 
> calculation. In my original question I wasn't very clear asking about a 
> possibility to use shared memory with Open MPI - in our case we do not need 
> to have a remote access to the data, and it would be sufficient to share 
> memory within each node only.
> 
> Of course, the possibility to access the data remotely (via mmap) is 
> attractive because it would allow to store much larger arrays (up to 10 GB) 
> at one remote place, meaning higher accuracy for our calculations. However, I 
> believe that the access time would be too long for the data read so 
> frequently, and therefore the performance would be lost.
> 
> I still hope that some of the subscribers to this mailing list have an 
> experience of using Global Arrays. This library seems to be fine for our 
> case, however I feel that there should be a simpler solution. Open MPI 
> conforms with MPI-2 standard, and the later has a description of shared 
> memory application. Do you see any other way for us to use shared memory 
> (within node) apart of using Global Arrays?
> 
> Andrei
> 
> 
> On Fri, Sep 24, 2010 at 19:03, Durga Choudhury  wrote:
> I think the 'middle ground' approach can be simplified even further if
> the data file is in a shared device (e.g. NFS/Samba mount) that can be
> mounted at the same location of the file system tree on all nodes. I
> have never tried it, though and mmap()'ing a non-POSIX compliant file
> system such as Samba might have issues I am unaware of.
> 
> However, I do not see why you should not be able to do this even if
> the file is being written to as long as you call msync() before using
> the mapped pages.
> 
> Durga
> 
> 
> On Fri, Sep 24, 2010 at 12:31 PM, Eugene Loh  wrote:
> > It seems to me there are two extremes.
> >
> > One is that you replicate the data for each process.  This has the
> > disadvantage of consuming lots of memory "unnecessarily."
> >
> > Another extreme is that shared data is distributed over all processes.  This
> > has the disadvantage of making at least some of the data less accessible,
> > whether in programming complexity and/or run-time performance.
> >
> > I'm not familiar with Global Arrays.  I was somewhat familiar with HPF.  I
> > think the natural thing to do with those programming models is to distribute
> > data over all processes, which may relieve the excessive memory consumption
> > you're trying to address but which may also just put you at a different
> > "extreme" of this spectrum.
> >
> > The middle ground I think might make most sense would be to share data only
> > within a node, but to replicate the data for each node.  There are probably
> > multiple ways of doing this -- possibly even GA, I don't know.  One way
> > might be to use one MPI process per node, with OMP multithreading within
> > each process|node.  Or (and I thought this was the solution you were looking
> > for), have some idea which processes are collocal.  Have one process per
> > node create and initialize some shared memory -- mmap, perhaps, or SysV
> > shared memory.  Then, have its peers map the same shared memory into their
> > address spaces.
> >
> > You asked what source code changes would be required.  It depends.  If
> > you're going to mmap shared memory in on each node, you need to know which
> > processes are collocal.  If you're willing to constrain how processes are
> > mapped to nodes, this could be easy.  (E.g., "every 4 processes are
> > collocal".)  If you want to discover dynamically at run time which are
> > collocal, it would be harder.  The mmap stuff could be in a stand-alone
> > function of about a dozen lines.  If the shared area is allocated as one
> > piece, substituting the single malloc() call with a call to your mmap
> > function should be simple.  If you have many malloc()s you're trying to
> > replace, it's harder.
> >
> > Andrei Fokau wrote:
> >
> > The data are read from a file and processed before calculations begin, so I
> > think that mapping will not work in our case.
> > Global Arrays look promising indeed. As I said, we need to put just a part
> > of data to the 

Re: [OMPI users] Shared memory

2010-10-04 Thread Andrei Fokau
Does OMPI have shared memory capabilities (as it is mentioned in MPI-2)?
How can I use them?

Andrei


On Sat, Sep 25, 2010 at 23:19, Andrei Fokau wrote:

> Here are some more details about our problem. We use a dozen of 4-processor
> nodes with 8 GB memory on each node. The code we run needs about 3 GB per
> processor, so we can load only 2 processors out of 4. The vast majority of
> those 3 GB is the same for each processor and is accessed continuously
> during calculation. In my original question I wasn't very clear asking about
> a possibility to use shared memory with Open MPI - in our case we do not
> need to have a remote access to the data, and it would be sufficient to
> share memory within each node only.
>
> Of course, the possibility to access the data remotely (via mmap) is
> attractive because it would allow to store much larger arrays (up to 10 GB)
> at one remote place, meaning higher accuracy for our calculations. However,
> I believe that the access time would be too long for the data read so
> frequently, and therefore the performance would be lost.
>
> I still hope that some of the subscribers to this mailing list have an
> experience of using Global Arrays. This library seems to be fine for our
> case, however I feel that there should be a simpler solution. Open MPI
> conforms with MPI-2 standard, and the later has a description of shared
> memory application. Do you see any other way for us to use shared memory
> (within node) apart of using Global Arrays?
>
> Andrei
>
>
> On Fri, Sep 24, 2010 at 19:03, Durga Choudhury  wrote:
>
>> I think the 'middle ground' approach can be simplified even further if
>> the data file is in a shared device (e.g. NFS/Samba mount) that can be
>> mounted at the same location of the file system tree on all nodes. I
>> have never tried it, though and mmap()'ing a non-POSIX compliant file
>> system such as Samba might have issues I am unaware of.
>>
>> However, I do not see why you should not be able to do this even if
>> the file is being written to as long as you call msync() before using
>> the mapped pages.
>>
>> Durga
>>
>>
>> On Fri, Sep 24, 2010 at 12:31 PM, Eugene Loh 
>> wrote:
>> > It seems to me there are two extremes.
>> >
>> > One is that you replicate the data for each process.  This has the
>> > disadvantage of consuming lots of memory "unnecessarily."
>> >
>> > Another extreme is that shared data is distributed over all processes.
>> This
>> > has the disadvantage of making at least some of the data less
>> accessible,
>> > whether in programming complexity and/or run-time performance.
>> >
>> > I'm not familiar with Global Arrays.  I was somewhat familiar with HPF.
>> I
>> > think the natural thing to do with those programming models is to
>> distribute
>> > data over all processes, which may relieve the excessive memory
>> consumption
>> > you're trying to address but which may also just put you at a different
>> > "extreme" of this spectrum.
>> >
>> > The middle ground I think might make most sense would be to share data
>> only
>> > within a node, but to replicate the data for each node.  There are
>> probably
>> > multiple ways of doing this -- possibly even GA, I don't know.  One way
>> > might be to use one MPI process per node, with OMP multithreading within
>> > each process|node.  Or (and I thought this was the solution you were
>> looking
>> > for), have some idea which processes are collocal.  Have one process per
>> > node create and initialize some shared memory -- mmap, perhaps, or SysV
>> > shared memory.  Then, have its peers map the same shared memory into
>> their
>> > address spaces.
>> >
>> > You asked what source code changes would be required.  It depends.  If
>> > you're going to mmap shared memory in on each node, you need to know
>> which
>> > processes are collocal.  If you're willing to constrain how processes
>> are
>> > mapped to nodes, this could be easy.  (E.g., "every 4 processes are
>> > collocal".)  If you want to discover dynamically at run time which are
>> > collocal, it would be harder.  The mmap stuff could be in a stand-alone
>> > function of about a dozen lines.  If the shared area is allocated as one
>> > piece, substituting the single malloc() call with a call to your mmap
>> > function should be simple.  If you have many malloc()s you're trying to
>> > replace, it's harder.
>> >
>> > Andrei Fokau wrote:
>> >
>> > The data are read from a file and processed before calculations begin,
>> so I
>> > think that mapping will not work in our case.
>> > Global Arrays look promising indeed. As I said, we need to put just a
>> part
>> > of data to the shared section. John, do you (or may be other users) have
>> an
>> > experience of working with GA?
>> > http://www.emsl.pnl.gov/docs/global/um/build.html
>> > When GA runs with MPI:
>> > MPI_Init(..)  ! start MPI
>> > GA_Initialize()   ! start global arrays
>> > MA_Init(..)   ! start memory allocator

Re: [OMPI users] Shared memory

2010-09-24 Thread Durga Choudhury
I think the 'middle ground' approach can be simplified even further if
the data file is in a shared device (e.g. NFS/Samba mount) that can be
mounted at the same location of the file system tree on all nodes. I
have never tried it, though and mmap()'ing a non-POSIX compliant file
system such as Samba might have issues I am unaware of.

However, I do not see why you should not be able to do this even if
the file is being written to as long as you call msync() before using
the mapped pages.

Durga
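
A minimal sketch of this idea (not from the original thread; the path and
the double layout are hypothetical, and the msync() concern only matters if
the file is still being written while readers map it):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
  const char *path = "/shared/data.bin";   /* hypothetical file on a common mount */
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }

  struct stat st;
  if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

  /* PROT_READ + MAP_SHARED: every process on a node that maps the same file
     shares one page-cache copy instead of holding a private malloc'd duplicate. */
  void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) { perror("mmap"); return 1; }

  const double *data = (const double *)p;  /* assumes the file holds doubles */
  printf("first value: %g\n", data[0]);

  munmap(p, (size_t)st.st_size);
  close(fd);
  return 0;
}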


On Fri, Sep 24, 2010 at 12:31 PM, Eugene Loh  wrote:
> It seems to me there are two extremes.
>
> One is that you replicate the data for each process.  This has the
> disadvantage of consuming lots of memory "unnecessarily."
>
> Another extreme is that shared data is distributed over all processes.  This
> has the disadvantage of making at least some of the data less accessible,
> whether in programming complexity and/or run-time performance.
>
> I'm not familiar with Global Arrays.  I was somewhat familiar with HPF.  I
> think the natural thing to do with those programming models is to distribute
> data over all processes, which may relieve the excessive memory consumption
> you're trying to address but which may also just put you at a different
> "extreme" of this spectrum.
>
> The middle ground I think might make most sense would be to share data only
> within a node, but to replicate the data for each node.  There are probably
> multiple ways of doing this -- possibly even GA, I don't know.  One way
> might be to use one MPI process per node, with OMP multithreading within
> each process|node.  Or (and I thought this was the solution you were looking
> for), have some idea which processes are collocal.  Have one process per
> node create and initialize some shared memory -- mmap, perhaps, or SysV
> shared memory.  Then, have its peers map the same shared memory into their
> address spaces.
>
> You asked what source code changes would be required.  It depends.  If
> you're going to mmap shared memory in on each node, you need to know which
> processes are collocal.  If you're willing to constrain how processes are
> mapped to nodes, this could be easy.  (E.g., "every 4 processes are
> collocal".)  If you want to discover dynamically at run time which are
> collocal, it would be harder.  The mmap stuff could be in a stand-alone
> function of about a dozen lines.  If the shared area is allocated as one
> piece, substituting the single malloc() call with a call to your mmap
> function should be simple.  If you have many malloc()s you're trying to
> replace, it's harder.
>
> Andrei Fokau wrote:
>
> The data are read from a file and processed before calculations begin, so I
> think that mapping will not work in our case.
> Global Arrays look promising indeed. As I said, we need to put just a part
> of data to the shared section. John, do you (or may be other users) have an
> experience of working with GA?
> http://www.emsl.pnl.gov/docs/global/um/build.html
> When GA runs with MPI:
> MPI_Init(..)      ! start MPI
> GA_Initialize()   ! start global arrays
> MA_Init(..)       ! start memory allocator
>     do work
> GA_Terminate()    ! tidy up global arrays
> MPI_Finalize()    ! tidy up MPI
>                   ! exit program
> On Fri, Sep 24, 2010 at 13:44, Reuti  wrote:
>>
>> On 24.09.2010 at 13:26, John Hearns wrote:
>>
>> > On 24 September 2010 08:46, Andrei Fokau 
>> > wrote:
>> >> We use a C-program which consumes a lot of memory per process (up to
>> >> few
>> >> GB), 99% of the data being the same for each process. So for us it
>> >> would be
>> >> quite reasonable to put that part of data in a shared memory.
>> >
>> > http://www.emsl.pnl.gov/docs/global/
>> >
>> > Is this eny help? Apologies if I'm talking through my hat.
>>
>> I was also thinking of this when I read "data in a shared memory" (besides
>> approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't
>> this also one idea behind "High Performance Fortran" - running in parallel
>> across nodes even without knowing that it's across nodes at all while
>> programming and access all data like it's being local.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



Re: [OMPI users] Shared memory

2010-09-24 Thread Eugene Loh




It seems to me there are two extremes.

One is that you replicate the data for each process.  This has the
disadvantage of consuming lots of memory "unnecessarily."

Another extreme is that shared data is distributed over all processes. 
This has the disadvantage of making at least some of the data less
accessible, whether in programming complexity and/or run-time
performance.

I'm not familiar with Global Arrays.  I was somewhat familiar with
HPF.  I think the natural thing to do with those programming models is
to distribute data over all processes, which may relieve the excessive
memory consumption you're trying to address but which may also just put
you at a different "extreme" of this spectrum.

The middle ground I think might make most sense would be to share data
only within a node, but to replicate the data for each node.  There are
probably multiple ways of doing this -- possibly even GA, I don't
know.  One way might be to use one MPI process per node, with OMP
multithreading within each process|node.  Or (and I thought this was
the solution you were looking for), have some idea which processes are
collocal.  Have one process per node create and initialize some shared
memory -- mmap, perhaps, or SysV shared memory.  Then, have its peers
map the same shared memory into their address spaces.

You asked what source code changes would be required.  It depends.  If
you're going to mmap shared memory in on each node, you need to know
which processes are collocal.  If you're willing to constrain how
processes are mapped to nodes, this could be easy.  (E.g., "every 4
processes are collocal".)  If you want to discover dynamically at run
time which are collocal, it would be harder.  The mmap stuff could be
in a stand-alone function of about a dozen lines.  If the shared area
is allocated as one piece, substituting the single malloc() call with a
call to your mmap function should be simple.  If you have many
malloc()s you're trying to replace, it's harder.
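
A minimal sketch of that middle ground (not from this thread; the segment
name and size are made up, and some systems need -lrt for shm_open). It
finds collocal processes by hashing the host name, an MPI-2-era trick; with
MPI-3 one would rather use MPI_Comm_split_type or MPI_Win_allocate_shared:

#include <mpi.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int world_rank, len;
  char host[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Get_processor_name(host, &len);

  /* Hash the host name into a split color.  (A collision between two
     different hosts would break this; a real code should double-check.) */
  unsigned color = 5381u;
  for (int i = 0; i < len; i++)
    color = color * 33u + (unsigned char)host[i];

  MPI_Comm node;
  MPI_Comm_split(MPI_COMM_WORLD, (int)(color & 0x7fffffffu), world_rank, &node);
  int node_rank;
  MPI_Comm_rank(node, &node_rank);

  const size_t bytes = 1u << 20;              /* hypothetical shared-table size */
  const char  *seg   = "/shared_table_demo";  /* hypothetical segment name */

  int fd;
  if (node_rank == 0) {                        /* one process per node creates it */
    fd = shm_open(seg, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, (off_t)bytes);
  }
  MPI_Barrier(node);                           /* peers open it only once it exists */
  if (node_rank != 0)
    fd = shm_open(seg, O_RDWR, 0600);

  double *table = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (table == MAP_FAILED)
    MPI_Abort(MPI_COMM_WORLD, 1);

  if (node_rank == 0)                          /* the node leader fills the data */
    for (size_t i = 0; i < bytes / sizeof(double); i++)
      table[i] = (double)i;
  MPI_Barrier(node);                           /* readers wait until it is filled */

  printf("rank %d on %s reads table[42] = %g\n", world_rank, host, table[42]);

  MPI_Barrier(node);
  munmap(table, bytes);
  close(fd);
  if (node_rank == 0)
    shm_unlink(seg);
  MPI_Comm_free(&node);
  MPI_Finalize();
  return 0;
}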

Andrei Fokau wrote:

The data are read from a file and processed before calculations begin, so I
think that mapping will not work in our case.

Global Arrays look promising indeed. As I said, we need to put just a part
of data to the shared section. John, do you (or maybe other users) have any
experience of working with GA?

http://www.emsl.pnl.gov/docs/global/um/build.html

When GA runs with MPI:

MPI_Init(..)      ! start MPI
GA_Initialize()   ! start global arrays
MA_Init(..)       ! start memory allocator

    do work

GA_Terminate()    ! tidy up global arrays
MPI_Finalize()    ! tidy up MPI
                  ! exit program

On Fri, Sep 24, 2010 at 13:44, Reuti  wrote:

On 24.09.2010 at 13:26, John Hearns wrote:

> On 24 September 2010 08:46, Andrei Fokau  wrote:
>> We use a C-program which consumes a lot of memory per process (up to few
>> GB), 99% of the data being the same for each process. So for us it would be
>> quite reasonable to put that part of data in a shared memory.
>
> http://www.emsl.pnl.gov/docs/global/
>
> Is this any help? Apologies if I'm talking through my hat.

I was also thinking of this when I read "data in a shared memory" (besides
approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't
this also one idea behind "High Performance Fortran" - running in parallel
across nodes even without knowing that it's across nodes at all while
programming and access all data like it's being local.





Re: [OMPI users] Shared memory

2010-09-24 Thread Andrei Fokau
The data are read from a file and processed before calculations begin, so I
think that mapping will not work in our case.

Global Arrays look promising indeed. As I said, we need to put just a part
of data to the shared section. John, do you (or maybe other users) have any
experience of working with GA?

http://www.emsl.pnl.gov/docs/global/um/build.html
*When GA runs with MPI:*

MPI_Init(..)  ! start MPI
GA_Initialize()   ! start global arrays
MA_Init(..)   ! start memory allocator

    do work

GA_Terminate()! tidy up global arrays
MPI_Finalize()! tidy up MPI
  ! exit program



On Fri, Sep 24, 2010 at 13:44, Reuti  wrote:

> On 24.09.2010 at 13:26, John Hearns wrote:
>
> > On 24 September 2010 08:46, Andrei Fokau 
> wrote:
> >> We use a C-program which consumes a lot of memory per process (up to few
> >> GB), 99% of the data being the same for each process. So for us it would
> be
> >> quite reasonable to put that part of data in a shared memory.
> >
> > http://www.emsl.pnl.gov/docs/global/
> >
> > Is this eny help? Apologies if I'm talking through my hat.
>
> I was also thinking of this when I read "data in a shared memory" (besides
> approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't
> this also one idea behind "High Performance Fortran" - running in parallel
> across nodes even without knowing that it's across nodes at all while
> programming and access all data like it's being local.
>
> -- Reuti
>
>


Re: [OMPI users] Shared memory

2010-09-24 Thread Reuti
On 24.09.2010 at 13:26, John Hearns wrote:

> On 24 September 2010 08:46, Andrei Fokau  wrote:
>> We use a C-program which consumes a lot of memory per process (up to few
>> GB), 99% of the data being the same for each process. So for us it would be
>> quite reasonable to put that part of data in a shared memory.
> 
> http://www.emsl.pnl.gov/docs/global/
> 
> Is this eny help? Apologies if I'm talking through my hat.

I was also thinking of this when I read "data in a shared memory" (besides 
approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't this 
also one idea behind "High Performance Fortran" - running in parallel across 
nodes even without knowing that it's across nodes at all while programming and 
access all data like it's being local.

-- Reuti


> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Shared memory

2010-09-24 Thread John Hearns
On 24 September 2010 08:46, Andrei Fokau  wrote:
> We use a C-program which consumes a lot of memory per process (up to few
> GB), 99% of the data being the same for each process. So for us it would be
> quite reasonable to put that part of data in a shared memory.

http://www.emsl.pnl.gov/docs/global/

Is this any help? Apologies if I'm talking through my hat.


Re: [OMPI users] Shared memory

2010-09-24 Thread Durga Choudhury
Is the data coming from a read-only file? In that case, a better way
might be to memory map that file in the root process and share the map
pointer in all the slave threads. This, like shared memory, will work
only for processes within a node, of course.


On Fri, Sep 24, 2010 at 3:46 AM, Andrei Fokau
 wrote:
> We use a C-program which consumes a lot of memory per process (up to few
> GB), 99% of the data being the same for each process. So for us it would be
> quite reasonable to put that part of data in a shared memory.
> In the source code, the memory is allocated via malloc() function. What
> would it require for us to change in the source code to be able to put that
> repeating data in a shared memory?
> The code is normally run on several nodes.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] Shared memory

2010-09-24 Thread Andrei Fokau
We use a C-program which consumes a lot of memory per process (up to few
GB), 99% of the data being the same for each process. So for us it would be
quite reasonable to put that part of data in a shared memory.

In the source code, the memory is allocated via malloc() function. What
would it require for us to change in the source code to be able to put that
repeating data in a shared memory?

The code is normally run on several nodes.


Re: [OMPI users] shared memory (sm) module not working properly?

2010-01-19 Thread Nicolas Bock
Thanks, that explains it :)

On Tue, Jan 19, 2010 at 15:01, Ralph Castain  wrote:

> Shared memory doesn't extend between comm_spawn'd parent/child processes in
> Open MPI. Perhaps someday it will, but not yet.
>
>
> On Jan 19, 2010, at 1:14 PM, Nicolas Bock wrote:
>
> Hello list,
>
> I think I understand better now what's happening, although I still don't
> know why. I have attached two small C codes that demonstrate the problem.
> The code in main.c uses MPI_Comm_spawn() to start the code in the second
> source, child.c. I can force the issue by running the main.c code with
>
> mpirun -mca btl self,sm -np 1 ./main
>
> and get this error:
>
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[26121,2],0]) is on host: mujo
>   Process 2 ([[26121,1],0]) is on host: mujo
>   BTLs attempted: self sm
>
> Your MPI job is now going to abort; sorry.
> --
>
> Is that because the spawned process is in a different group? They are still
> all running on the same host, so at least in principle they should be able
> to communicate with each other via shared memory.
>
> nick
>
>
>
> On Fri, Jan 15, 2010 at 16:08, Eugene Loh  wrote:
>
>>  Dunno.  Do lower np values succeed?  If so, at what value of np does the
>> job no longer start?
>>
>> Perhaps it's having a hard time creating the shared-memory backing file in
>> /tmp.  I think this is a 64-Mbyte file.  If this is the case, try reducing
>> the size of the shared area per this FAQ item:
>> http://www.open-mpi.org/faq/?category=sm#decrease-sm  Most notably,
>> reduce mpool_sm_min_size below 67108864.
>>
>> Also note trac ticket 2043, which describes problems with the sm BTL
>> exposed by GCC 4.4.x compilers.  You need to get a sufficiently recent build
>> to solve this.  But, those problems don't occur until you start passing
>> messages, and here you're not even starting up.
>>
>>
>> Nicolas Bock wrote:
>>
>> Sorry, I forgot to give more details on what versions I am using:
>>
>> OpenMPI 1.4
>> Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
>> gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1
>>
>> On Fri, Jan 15, 2010 at 15:47, Nicolas Bock wrote:
>>
>>> Hello list,
>>>
>>> I am running a job on a 4 quadcore AMD Opteron. This machine has 16
>>> cores, which I can verify by looking at /proc/cpuinfo. However, when I run a
>>> job with
>>>
>>> mpirun -np 16 -mca btl self,sm job
>>>
>>> I get this error:
>>>
>>>
>>> --
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications.  This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes.  This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other.  This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>>
>>>   Process 1 ([[56972,2],0]) is on host: rust
>>>   Process 2 ([[56972,1],0]) is on host: rust
>>>   BTLs attempted: self sm
>>>
>>> Your MPI job is now going to abort; sorry.
>>>
>>> --
>>>
>>> By adding the tcp btl I can run the job. I don't understand why openmpi
>>> claims that a pair of processes can not reach each other, all processor
>>> cores should have access to all memory after all. Do I need to set some
>>> other btl limit?
>>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
>
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] shared memory (sm) module not working properly?

2010-01-19 Thread Ralph Castain
Shared memory doesn't extend between comm_spawn'd parent/child processes in 
Open MPI. Perhaps someday it will, but not yet.
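
A hedged workaround sketch, assuming the tcp BTL is built: allow a BTL that
can connect the two jobs in addition to sm, e.g.

mpirun -mca btl self,sm,tcp -np 1 ./main

so the parent and the spawned children still have some path to reach each
other.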


On Jan 19, 2010, at 1:14 PM, Nicolas Bock wrote:

> Hello list,
> 
> I think I understand better now what's happening, although I still don't know 
> why. I have attached two small C codes that demonstrate the problem. The code 
> in main.c uses MPI_Comm_spawn() to start the code in the second source, 
> child.c. I can force the issue by running the main.c code with
> 
> mpirun -mca btl self,sm -np 1 ./main
> 
> and get this error:
> 
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>   Process 1 ([[26121,2],0]) is on host: mujo
>   Process 2 ([[26121,1],0]) is on host: mujo
>   BTLs attempted: self sm
> 
> Your MPI job is now going to abort; sorry.
> --
> 
> Is that because the spawned process is in a different group? They are still 
> all running on the same host, so at least in principle they should be able to 
> communicate with each other via shared memory.
> 
> nick
> 
> 
> 
> On Fri, Jan 15, 2010 at 16:08, Eugene Loh  wrote:
> Dunno.  Do lower np values succeed?  If so, at what value of np does the job 
> no longer start?
> 
> Perhaps it's having a hard time creating the shared-memory backing file in 
> /tmp.  I think this is a 64-Mbyte file.  If this is the case, try reducing 
> the size of the shared area per this FAQ item:  
> http://www.open-mpi.org/faq/?category=sm#decrease-sm  Most notably, reduce 
> mpool_sm_min_size below 67108864.
> 
> Also note trac ticket 2043, which describes problems with the sm BTL exposed 
> by GCC 4.4.x compilers.  You need to get a sufficiently recent build to solve 
> this.  But, those problems don't occur until you start passing messages, and 
> here you're not even starting up.
> 
> 
> Nicolas Bock wrote:
>> 
>> Sorry, I forgot to give more details on what versions I am using:
>> 
>> OpenMPI 1.4
>> Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
>> gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1
>> 
>> On Fri, Jan 15, 2010 at 15:47, Nicolas Bock  wrote:
>> Hello list,
>> 
>> I am running a job on a 4 quadcore AMD Opteron. This machine has 16 cores, 
>> which I can verify by looking at /proc/cpuinfo. However, when I run a job 
>> with
>> 
>> mpirun -np 16 -mca btl self,sm job
>> 
>> I get this error:
>> 
>> --
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>> 
>>   Process 1 ([[56972,2],0]) is on host: rust
>>   Process 2 ([[56972,1],0]) is on host: rust
>>   BTLs attempted: self sm
>> 
>> Your MPI job is now going to abort; sorry.
>> --
>> 
>> By adding the tcp btl I can run the job. I don't understand why openmpi 
>> claims that a pair of processes can not reach each other, all processor 
>> cores should have access to all memory after all. Do I need to set some 
>> other btl limit?
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] shared memory (sm) module not working properly?

2010-01-19 Thread Nicolas Bock
Hello list,

I think I understand better now what's happening, although I still don't
know why. I have attached two small C codes that demonstrate the problem.
The code in main.c uses MPI_Comm_spawn() to start the code in the second
source, child.c. I can force the issue by running the main.c code with

mpirun -mca btl self,sm -np 1 ./main

and get this error:

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[26121,2],0]) is on host: mujo
  Process 2 ([[26121,1],0]) is on host: mujo
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--

Is that because the spawned process is in a different group? They are still
all running on the same host, so at least in principle they should be able
to communicate with each other via shared memory.

nick



On Fri, Jan 15, 2010 at 16:08, Eugene Loh  wrote:

>  Dunno.  Do lower np values succeed?  If so, at what value of np does the
> job no longer start?
>
> Perhaps it's having a hard time creating the shared-memory backing file in
> /tmp.  I think this is a 64-Mbyte file.  If this is the case, try reducing
> the size of the shared area per this FAQ item:
> http://www.open-mpi.org/faq/?category=sm#decrease-sm  Most notably, reduce
> mpool_sm_min_size below 67108864.
>
> Also note trac ticket 2043, which describes problems with the sm BTL
> exposed by GCC 4.4.x compilers.  You need to get a sufficiently recent build
> to solve this.  But, those problems don't occur until you start passing
> messages, and here you're not even starting up.
>
>
> Nicolas Bock wrote:
>
> Sorry, I forgot to give more details on what versions I am using:
>
> OpenMPI 1.4
> Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
> gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1
>
> On Fri, Jan 15, 2010 at 15:47, Nicolas Bock  wrote:
>
>> Hello list,
>>
>> I am running a job on a 4 quadcore AMD Opteron. This machine has 16 cores,
>> which I can verify by looking at /proc/cpuinfo. However, when I run a job
>> with
>>
>> mpirun -np 16 -mca btl self,sm job
>>
>> I get this error:
>>
>> --
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>>   Process 1 ([[56972,2],0]) is on host: rust
>>   Process 2 ([[56972,1],0]) is on host: rust
>>   BTLs attempted: self sm
>>
>> Your MPI job is now going to abort; sorry.
>> --
>>
>> By adding the tcp btl I can run the job. I don't understand why openmpi
>> claims that a pair of processes can not reach each other, all processor
>> cores should have access to all memory after all. Do I need to set some
>> other btl limit?
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int
main (int argc, char **argv)
{
  int rank;
  int error_codes[1];
  char buffer[1];
  MPI_Comm intercomm;
  MPI_Status status;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0)
  {
    printf("[master] spawning process\n");
    MPI_Comm_spawn("./other", argv, 1, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, error_codes);

    /* Wait for children to finish. */
    MPI_Recv(buffer, 1, MPI_CHAR, MPI_ANY_SOURCE, 1, intercomm, &status);
  }

  printf("[master (%i)] waiting at barrier\n", rank);
  MPI_Barrier(MPI_COMM_WORLD);
  printf("[master (%i)] done\n", rank);

  MPI_Finalize();
}
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int
main (int argc, char **argv)
{
  int rank;
  char buffer[1];
  MPI_Comm parent;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_get_parent(&parent);

  printf("[slave (%i)] starting up, sleeping...\n", rank);
  sleep(5);
  printf("[slave (%i)] done sleeping, signalling master\n", rank);
  MPI_Send(buffer, 1, MPI_CHAR, 0, 1, parent);

  MPI_Finalize();
}


Re: [OMPI users] shared memory (sm) module not working properly?

2010-01-15 Thread Eugene Loh




Dunno.  Do lower np values succeed?  If so, at what value of np does
the job no longer start?

Perhaps it's having a hard time creating the shared-memory backing file
in /tmp.  I think this is a 64-Mbyte file.  If this is the case, try
reducing the size of the shared area per this FAQ item: 
http://www.open-mpi.org/faq/?category=sm#decrease-sm  Most notably,
reduce mpool_sm_min_size below 67108864.
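
A hedged example of the kind of invocation the FAQ describes (the 16 MB
value is only illustrative, not a recommendation):

mpirun -np 16 -mca btl self,sm -mca mpool_sm_min_size 16777216 ./job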

Also note trac ticket 2043, which describes problems with the sm BTL
exposed by GCC 4.4.x compilers.  You need to get a sufficiently recent
build to solve this.  But, those problems don't occur until you start
passing messages, and here you're not even starting up.

Nicolas Bock wrote:

Sorry, I forgot to give more details on what versions I am using:

OpenMPI 1.4
Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1

On Fri, Jan 15, 2010 at 15:47, Nicolas Bock  wrote:

Hello list,

I am running a job on a 4 quadcore AMD Opteron. This machine has 16
cores, which I can verify by looking at /proc/cpuinfo. However, when I
run a job with

mpirun -np 16 -mca btl self,sm job

I get this error:

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[56972,2],0]) is on host: rust
  Process 2 ([[56972,1],0]) is on host: rust
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--

By adding the tcp btl I can run the job. I don't understand why openmpi
claims that a pair of processes can not reach each other, all processor
cores should have access to all memory after all. Do I need to set some
other btl limit?





[OMPI users] shared memory (sm) module not working properly?

2010-01-15 Thread Nicolas Bock
Hello list,

I am running a job on a 4 quadcore AMD Opteron. This machine has 16 cores,
which I can verify by looking at /proc/cpuinfo. However, when I run a job
with

mpirun -np 16 -mca btl self,sm job

I get this error:

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[56972,2],0]) is on host: rust
  Process 2 ([[56972,1],0]) is on host: rust
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--

By adding the tcp btl I can run the job. I don't understand why openmpi
claims that a pair of processes can not reach each other, all processor
cores should have access to all memory after all. Do I need to set some
other btl limit?

nick


Re: [OMPI users] Shared Memory (SM) module and shared cache implications

2009-06-25 Thread Jeff Squyres

On Jun 25, 2009, at 9:12 AM, Ralph Castain wrote:


Doesn't that still pull the message off-socket? I thought it went
through the kernel for that method, which means moving it to main
memory.



It may or may not.

Sorry -- let me clarify: I was just pointing out other on-node/memory- 
based work going on.  Not necessarily the same thing as sharing cache,  
etc.


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Shared Memory (SM) module and shared cache implications

2009-06-25 Thread Ralph Castain
Doesn't that still pull the message off-socket? I thought it went  
through the kernel for that method, which means moving it to main  
memory.



On Jun 25, 2009, at 6:49 AM, Jeff Squyres wrote:

FWIW: there's also work going on to use direct process-to-process  
copies (vs. using shared memory bounce buffers).  Various MPI  
implementations have had this technology for a while (e.g., QLogic's  
PSM-based MPI); the Open-MX guys are publishing the knem open source  
kernel module for this purpose these days (http://runtime.bordeaux.inria.fr/knem/ 
), etc.



On Jun 25, 2009, at 8:31 AM, Simone Pellegrini wrote:


Ralph Castain wrote:
> At the moment, I believe the answer is the main memory route. We  
have

> a project just starting here (LANL) to implement the cache-level
> exchange, but it won't be ready for release for awhile.
Interesting, actually I am a PhD student and my topic is  
optimization of
MPI applications on multi-core architectures. I will be very  
interested
in collaborating in such project. Can you give me more details  
about it

(links/pointers)?

regards, Simone
>
>
> On Jun 25, 2009, at 2:39 AM, Simone Pellegrini wrote:
>
>> Hello,
>> I have a simple question for the shared memory (sm) module  
developers

>> of Open MPI.
>>
>> In the current implementation, is there any advantage of having
>> shared cache among processes communicating?
>> For example let say we have P1 and P2 placed in the same CPU on 2
>> different physical cores with shared cache, P1 wants to send a
>> message to P2 and the message is already in the cache.
>>
>> How the message is being actually exchanged? Is the cache line
>> invalidated, written to main memory and exchanged by using some  
DMA
>> transfer... or is the message in the cache used (avoiding access  
to

>> the main memory)?
>>
>> thanks in advance, Simone P.
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




--
Jeff Squyres
Cisco Systems

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Shared Memory (SM) module and shared cache implications

2009-06-25 Thread Jeff Squyres
FWIW: there's also work going on to use direct process-to-process  
copies (vs. using shared memory bounce buffers).  Various MPI  
implementations have had this technology for a while (e.g., QLogic's  
PSM-based MPI); the Open-MX guys are publishing the knem open source  
kernel module for this purpose these days (http://runtime.bordeaux.inria.fr/knem/ 
), etc.



On Jun 25, 2009, at 8:31 AM, Simone Pellegrini wrote:


Ralph Castain wrote:
> At the moment, I believe the answer is the main memory route. We  
have

> a project just starting here (LANL) to implement the cache-level
> exchange, but it won't be ready for release for awhile.
Interesting, actually I am a PhD student and my topic is  
optimization of
MPI applications on multi-core architectures. I will be very  
interested
in collaborating in such project. Can you give me more details about  
it

(links/pointers)?

regards, Simone
>
>
> On Jun 25, 2009, at 2:39 AM, Simone Pellegrini wrote:
>
>> Hello,
>> I have a simple question for the shared memory (sm) module  
developers

>> of Open MPI.
>>
>> In the current implementation, is there any advantage of having
>> shared cache among processes communicating?
>> For example let say we have P1 and P2 placed in the same CPU on 2
>> different physical cores with shared cache, P1 wants to send a
>> message to P2 and the message is already in the cache.
>>
>> How the message is being actually exchanged? Is the cache line
>> invalidated, written to main memory and exchanged by using some DMA
>> transfer... or is the message in the cache used (avoiding access to
>> the main memory)?
>>
>> thanks in advance, Simone P.
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Shared Memory (SM) module and shared cache implications

2009-06-25 Thread Simone Pellegrini

Ralph Castain wrote:
At the moment, I believe the answer is the main memory route. We have 
a project just starting here (LANL) to implement the cache-level 
exchange, but it won't be ready for release for awhile.
Interesting, actually I am a PhD student and my topic is optimization of 
MPI applications on multi-core architectures. I will be very interested 
in collaborating in such project. Can you give me more details about it 
(links/pointers)?


regards, Simone



On Jun 25, 2009, at 2:39 AM, Simone Pellegrini wrote:


Hello,
I have a simple question for the shared memory (sm) module developers 
of Open MPI.


In the current implementation, is there any advantage of having 
shared cache among processes communicating?
For example let say we have P1 and P2 placed in the same CPU on 2 
different physical cores with shared cache, P1 wants to send a 
message to P2 and the message is already in the cache.


How the message is being actually exchanged? Is the cache line 
invalidated, written to main memory and exchanged by using some DMA 
transfer... or is the message in the cache used (avoiding access to 
the main memory)?


thanks in advance, Simone P.
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Shared Memory (SM) module and shared cache implications

2009-06-25 Thread Ralph Castain
At the moment, I believe the answer is the main memory route. We have  
a project just starting here (LANL) to implement the cache-level  
exchange, but it won't be ready for release for awhile.



On Jun 25, 2009, at 2:39 AM, Simone Pellegrini wrote:


Hello,
I have a simple question for the shared memory (sm) module  
developers of Open MPI.


In the current implementation, is there any advantage of having  
shared cache among processes communicating?
For example let say we have P1 and P2 placed in the same CPU on 2  
different physical cores with shared cache, P1 wants to send a  
message to P2 and the message is already in the cache.


How the message is being actually exchanged? Is the cache line  
invalidated, written to main memory and exchanged by using some DMA  
transfer... or is the message in the cache used (avoiding access to  
the main memory)?


thanks in advance, Simone P.
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Shared Memory (SM) module and shared cache implications

2009-06-25 Thread Simone Pellegrini

Hello,
I have a simple question for the shared memory (sm) module developers of 
Open MPI.


In the current implementation, is there any advantage of having shared 
cache among processes communicating?
For example let say we have P1 and P2 placed in the same CPU on 2 
different physical cores with shared cache, P1 wants to send a message 
to P2 and the message is already in the cache.


How the message is being actually exchanged? Is the cache line 
invalidated, written to main memory and exchanged by using some DMA 
transfer... or is the message in the cache used (avoiding access to the 
main memory)?


thanks in advance, Simone P.


Re: [OMPI users] SHARED Memory----------------

2009-04-23 Thread shan axida
Hi,
I have read that FAQ.
Does it mean shared memory communication is used by default when sending
messages between processes in the same node?
Are any options or configuration needed for Open MPI shared memory?

THANK YOU!





From: Eugene Loh <eugene@sun.com>
To: Open MPI Users <us...@open-mpi.org>
Sent: Thursday, April 23, 2009 2:08:33 PM
Subject: Re: [OMPI users] SHARED Memory

shan axida wrote:

What I am asking is if I use MPI_Send and MPI_Recv between processes in
a node, does it mean using shared memory or not?

It (typically) does.  (Some edge cases could occur.)  Your question is
addressed by the FAQ I mentioned.

if not, how to use
shared memory among processes which are running in a node?

From: Eugene Loh <eugene@sun.com>
To: Open MPI Users <us...@open-mpi.org>
Sent: Thursday, April 23, 2009 1:20:05 PM
Subject: Re: [OMPI users] SHARED Memory

Just to clarify (since "send to self" strikes me as confusing)...

If you're talking about using shared memory for point-to-point MPI
message passing, OMPI typically uses it automatically between two
processes on the same node.  It is *not* used for a process sending to
itself.  There is a well-written FAQ (in my arrogant opinion!) at
http://www.open-mpi.org/faq/?category=sm -- e.g.,
http://www.open-mpi.org/faq/?category=sm#sm-btl .

If you're talking about some other use of shared memory, let us know
what you had in mind.

Elvedin Trnjanin wrote:
Shared memory is used for send-to-self scenarios such as if you're
making use of multiple slots on the same machine.

shan axida wrote:
Any body know how to make use of shared memory in OpenMPI
implementation?



  

Re: [OMPI users] SHARED Memory----------------

2009-04-23 Thread Eugene Loh




shan axida wrote:

What I am asking is if I use MPI_Send and MPI_Recv between processes in
a node, does it mean using shared memory or not?

It (typically) does.  (Some edge cases could occur.)  Your question is
addressed by the FAQ I mentioned.

if not, how to use
shared memory among processes which are running in a node?

From: Eugene Loh <eugene@sun.com>
To: Open MPI Users <us...@open-mpi.org>
Sent: Thursday, April 23, 2009 1:20:05 PM
Subject: Re: [OMPI users] SHARED Memory

Just to clarify (since "send to self" strikes me as confusing)...

If you're talking about using shared memory for point-to-point MPI
message passing, OMPI typically uses it automatically between two
processes on the same node.  It is *not* used for a process sending to
itself.  There is a well-written FAQ (in my arrogant opinion!) at
http://www.open-mpi.org/faq/?category=sm -- e.g.,
http://www.open-mpi.org/faq/?category=sm#sm-btl .

If you're talking about some other use of shared memory, let us know
what you had in mind.

Elvedin Trnjanin wrote:
Shared memory is used for send-to-self scenarios such as if you're
making use of multiple slots on the same machine.

shan axida wrote:
Any body know how to make use of shared memory in OpenMPI
implementation?





Re: [OMPI users] SHARED Memory----------------

2009-04-23 Thread shan axida
Hi,
What I am asking is if I use MPI_Send and MPI_Recv between processes in
a node, does it mean using shared memory or not? If not, how to use
shared memory among processes which are running in a node?


Thank you!





From: Eugene Loh <eugene@sun.com>
To: Open MPI Users <us...@open-mpi.org>
Sent: Thursday, April 23, 2009 1:20:05 PM
Subject: Re: [OMPI users] SHARED Memory

Just to clarify (since "send to self" strikes me as confusing)...

If you're talking about using shared memory for point-to-point MPI
message passing, OMPI typically uses it automatically between two
processes on the same node.  It is *not* used for a process sending to
itself.  There is a well-written FAQ (in my arrogant opinion!) at
http://www.open-mpi.org/faq/?category=sm -- e.g.,
http://www.open-mpi.org/faq/?category=sm#sm-btl .

If you're talking about some other use of shared memory, let us know
what you had in mind.

Elvedin Trnjanin wrote: 
Shared memory is used for send-to-self scenarios such as if you're
making use of multiple slots on the same machine.

shan axida wrote: 
Any body know how to make use of shared memory in OpenMPI
implementation?


  

Re: [OMPI users] SHARED Memory----------------

2009-04-23 Thread Elvedin Trnjanin
Shared memory is used for send-to-self scenarios such as if you're 
making use of multiple slots on the same machine.


shan axida wrote:

Hi,

Anybody know how to make use of shared memory in the OpenMPI implementation?

Thanks




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



