Re: [OMPI users] Broadcast problem

2013-04-30 Thread Randolph Pullen
Oops, I think I meant gather, not scatter...

[OMPI users] Broadcast problem

2013-04-30 Thread Randolph Pullen
I have a number of processes split into senders and receivers.
Senders read large quantities of randomly organised data into buffers for
transmission to receivers.
When a buffer is full it needs to be transmitted to all receivers; this repeats
until all the data has been transmitted.

The problem is that MPI_Bcast must know the root it is to receive from, and
therefore can't receive 'blind' from whichever sender fills its buffer first.
Scatter would be inefficient because a few senders won't have anything to send,
so it is wasteful to transmit those empty buffers repeatedly.

Any ideas?
Can Bcast receivers be promiscuous?
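
To make the question concrete: the behaviour I'm after is like a receive from
MPI_ANY_SOURCE, but as a collective.  A minimal point-to-point sketch of it
(buffer size and tag are made-up placeholders, not my real code):

    #include <mpi.h>

    #define BUFSZ    65536
    #define TAG_DATA 100

    void receive_blocks(char *buf, int nblocks)
    {
        MPI_Status st;
        for (int i = 0; i < nblocks; i++) {
            /* "Promiscuous" receive: accept the next full buffer from
             * whichever sender finished first; st.MPI_SOURCE says which. */
            MPI_Recv(buf, BUFSZ, MPI_BYTE, MPI_ANY_SOURCE, TAG_DATA,
                     MPI_COMM_WORLD, &st);
            /* ... hand the buffer to the consumer side here ... */
        }
    }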

Thanks Randolph

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-11 Thread Randolph Pullen
That's very interesting, Yevgeny,

Yes tcp,self ran in 12 seconds
tcp,self,sm ran in 27 seconds

Does anyone have any idea how this can be?

About half the data would go to local processes, so SM should pay dividends.



 From: Yevgeny Kliteynik <klit...@dev.mellanox.co.il>
To: Randolph Pullen <randolph_pul...@yahoo.com.au> 
Cc: OpenMPI Users <us...@open-mpi.org> 
Sent: Monday, 10 September 2012 9:11 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling
 
Randolph,

So what you're saying, in short, leaving all the numbers aside, is the following:
In your particular application on your particular setup with this particular 
OMPI version,
1. openib BTL performs faster than shared memory BTL
2. TCP BTL performs faster than shared memory

IMHO, this indicates that you have some problem on your machines,
and this problem is unrelated to interconnect.

Shared memory should be much faster than IB, not to mention IPoIB.

Could you run these two commands?

mpirun --mca btl tcp,self    -H vh2,vh1 -np 9 --bycore prog
mpirun --mca btl tcp,self,sm -H vh2,vh1 -np 9 --bycore prog

You will probably see better numbers w/o sm.
Why? Don't know.

Perhaps someone who has better knowledge in sm BTL can elaborate?

-- YK


On 9/10/2012 6:32 AM, Randolph Pullen wrote:
> See my comments in line...
> 
>
 
--

> *From:* Yevgeny Kliteynik <klit...@dev.mellanox.co.il>
> *To:* Randolph Pullen <randolph_pul...@yahoo.com.au>
> *Cc:* OpenMPI Users <us...@open-mpi.org>
> *Sent:* Sunday, 9 September 2012 6:18 PM
> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
> 
> Randolph,
> 
> On 9/7/2012 7:43 AM, Randolph Pullen wrote:
>  > Yevgeny,
>  > The ibstat results:
>  > CA 'mthca0'
>  > CA type: MT25208 (MT23108 compat mode)
> 
> What you have is an InfiniHost III HCA, which is a 4x SDR card.
> This card has a theoretical peak of 10 Gb/s, which is 1GB/s in IB bit coding.
> 
>  > And more interestingly, ib_write_bw:
>  > Conflicting CPU frequency values detected: 1600.00 != 3301.00
>  >
>  > What does Conflicting CPU frequency values mean?
>  >
>  > Examining the /proc/cpuinfo file however shows:
>  > processor : 0
>  > cpu MHz : 3301.000
>  > processor : 1
>  > cpu MHz : 3301.000
>  > processor : 2
>  > cpu MHz : 1600.000
>  > processor : 3
>  > cpu MHz : 1600.000
>  >
>  > Which seems oddly weird to me...
> 
> You need to have all the cores running at the highest clock to get better numbers.
> Maybe the power governor is not set to optimal performance on these
> machines.
> Google for "Linux CPU scaling governor" to get more info on this subject, or
> contact your system admin and ask him to take care of the CPU frequencies.
> 
> Once this is done, check all the pairs of your machines - ensure that you get
> a good number with ib_write_bw.
> Note that if you have a slower machine in the cluster, general application
> performance will suffer from this.
> 
> I have anchored the clocks speeds to:
> [root@vh1 ~]# cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq
> 360
> 360
> 360
> 360
> 360
> 360
> 360
> 360
> 
> [root@vh2 ~]# cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq
> 320
> 320
> 320
> 320
> 
> However /proc/cpuinfo still reports them incorrectly
> [deepcloud@vh2 c]$ grep MHz /proc/cpuinfo
> cpu MHz : 3300.000
> cpu MHz : 1600.000
> cpu MHz : 1600.000
> cpu MHz : 1600.000
> 
> I don't think this is the problem, so I used the -F option in ib_write_bw to push
> ahead, i.e.:
> [deepcloud@vh2 c]$ ib_write_bw -F vh1
> --
> RDMA_Write BW Test
> N

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-07 Thread Randolph Pullen
One system is actually an i5-2400 - maybe it's throttling back on 2 cores to
save power?
The other (i7) shows a consistent CPU MHz on all cores.



 From: Yevgeny Kliteynik <klit...@dev.mellanox.co.il>
To: Randolph Pullen <randolph_pul...@yahoo.com.au>; OpenMPI Users 
<us...@open-mpi.org> 
Sent: Thursday, 6 September 2012 6:03 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling
 
On 9/3/2012 4:14 AM, Randolph Pullen wrote:
> No RoCE, Just native IB with TCP over the top.

Sorry, I'm confused - it's still not clear what a "Melanox III HCA 10G card" is.
Could you run "ibstat" and post the results?

What is the expected BW on your cards?
Could you run "ib_write_bw" between two machines?

Also, please see below.

> No I haven't used 1.6 I was trying to stick with the standards on the 
> mellanox disk.
> Is there a known problem with 1.4.3 ?
> 
>
 
--

> *From:* Yevgeny Kliteynik <klit...@dev.mellanox.co.il>
> *To:* Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
> <us...@open-mpi.org>
> *Sent:* Sunday, 2 September 2012 10:54 PM
> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
> 
> Randolph,
> 
> Some clarification on the setup:
> 
> "Melanox III HCA 10G cards" - are those ConnectX 3 cards configured to 
> Ethernet?
> That is, when you're using openib BTL, you mean RoCE, right?
> 
> Also, have you had a chance to try some newer OMPI release?
> Any 1.6.x would do.
> 
> 
> -- YK
> 
> On 8/31/2012 10:53 AM, Randolph Pullen wrote:
>  > (reposted with consolidated information)
>  > I have a test rig comprising 2 i7 systems 8GB RAM with Melanox III HCA 10G 
>cards
>  > running Centos 5.7 Kernel 2.6.18-274
>  > Open MPI 1.4.3
>  > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
>  > On a Cisco 24 pt switch
>  > Normal performance is:
>  > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
>  > results in:
>  > Max rate = 958.388867 MB/sec Min latency = 4.529953 usec
>  > and:
>  > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
>  > Max rate = 653.547293 MB/sec Min latency = 19.550323 usec
>  > NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes which seems 
>fine.
>  > log_num_mtt =20 and log_mtts_per_seg params =2
>  > My application exchanges about a gig of data between the processes with 2 
>sender and 2 consumer processes on each node with 1 additional controller 
>process on the starting node.
>  > The program splits the data into 64K blocks and uses non blocking sends 
>and receives with busy/sleep loops to monitor progress until completion.
>  > Each process owns a single buffer for these 64K blocks.
>  > My problem is I see better performance under IPoIB than I do on native IB 
>(RDMA_CM).
>  > My understanding is that IPoIB is limited to about 1G/s so I am at a loss 
>to know why it is faster.
>  > These 2 configurations are equivalent (about 8-10 seconds per cycle)
>  > mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl 
>tcp,self -H vh2,vh1 -np 9 --bycore prog
>  > mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl 
>tcp,self -H vh2,vh1 -np 9 --bycore prog

When you say "--mca btl tcp,self", it means that openib btl is not enabled.
Hence "--mca btl_openib_flags" is irrelevant.

>  > And this one produces similar run times but seems to degrade with repeated 
>cycles:
>  > mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl 
>openib,self -H vh2,vh1 -np 9 --bycore prog

You're running 9 ranks on two machines, but you're using IB for intra-node 
communication.
Is it intentional? If not, you can add "sm" b

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-07 Thread Randolph Pullen
Yevgeny,
The ibstat results:
CA 'mthca0'
    CA type: MT25208 (MT23108 compat mode)
    Number of ports: 2
    Firmware version: 4.7.600
    Hardware version: a0
    Node GUID: 0x0005ad0c21e0
    System image GUID: 0x0005ad000100d050
    Port 1:
    State: Active
    Physical state: LinkUp
    Rate: 10
    Base lid: 4
    LMC: 0
    SM lid: 2
    Capability mask: 0x02510a68
    Port GUID: 0x0005ad0c21e1
    Link layer: IB
    Port 2:
    State: Down
    Physical state: Polling
    Rate: 10
    Base lid: 0
    LMC: 0
    SM lid: 0
    Capability mask: 0x02510a68
    Port GUID: 0x0005ad0c21e2
    Link layer: IB

And more interestingly, ib_write_bw: 
   RDMA_Write BW Test
 Number of qps   : 1
 Connection type : RC
 TX depth    : 300
 CQ Moderation   : 50
 Link type   : IB
 Mtu : 2048
 Inline data is used up to 0 bytes message
 local address: LID 0x04 QPN 0x1c0407 PSN 0x48ad9e RKey 0xd86a0051 VAddr 
0x002ae36287
 remote address: LID 0x03 QPN 0x2e0407 PSN 0xf57209 RKey 0x8d98003b VAddr 
0x002b533d366000
--
 #bytes #iterations    BW peak[MB/sec]    BW average[MB/sec]
Conflicting CPU frequency values detected: 1600.00 != 3301.00
 65536 5000   0.00   0.00   
--

What does Conflicting CPU frequency values mean?

Examining the /proc/cpuinfo file however shows:
processor   : 0
cpu MHz : 3301.000
processor   : 1
cpu MHz : 3301.000
processor   : 2

cpu MHz : 1600.000
processor   : 3
cpu MHz : 1600.000


Which seems oddly weird to me...




 From: Yevgeny Kliteynik <klit...@dev.mellanox.co.il>
To: Randolph Pullen <randolph_pul...@yahoo.com.au>; OpenMPI Users 
<us...@open-mpi.org> 
Sent: Thursday, 6 September 2012 6:03 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling
 
On 9/3/2012 4:14 AM, Randolph Pullen wrote:
> No RoCE, Just native IB with TCP over the top.

Sorry, I'm confused - it's still not clear what a "Melanox III HCA 10G card" is.
Could you run "ibstat" and post the results?

What is the expected BW on your cards?
Could you run "ib_write_bw" between two machines?

Also, please see below.

> No I haven't used 1.6 I was trying to stick with the standards on the 
> mellanox disk.
> Is there a known problem with 1.4.3 ?
> 
>
 
--

> *From:* Yevgeny Kliteynik <klit...@dev.mellanox.co.il>
> *To:* Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
> <us...@open-mpi.org>
> *Sent:* Sunday, 2 September 2012 10:54 PM
> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
> 
> Randolph,
> 
> Some clarification on the setup:
> 
> "Melanox III HCA 10G cards" - are those ConnectX 3 cards configured to 
> Ethernet?
> That is, when you're using openib BTL, you mean RoCE, right?
> 
> Also, have you had a chance to try some newer OMPI release?
> Any 1.6.x would do.
> 
> 
> -- YK
> 
> On 8/31/2012 10:53 AM, Randolph Pullen wrote:
>  > (reposted with consolidated information)
>  > I have a test rig comprising 2 i7 systems 8GB RAM with Melanox III HCA 10G 
>cards
>  > running Centos 5.7 Kernel 2.6.18-274
>  > Open MPI 1.4.3
>  > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
>  > On a Cisco 24 pt switch
>  > Normal performance is:
>  > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
>  > results in:
>  > Max rate = 958.388867 MB/sec Min latency

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-02 Thread Randolph Pullen
No RoCE, just native IB with TCP over the top.



No, I haven't used 1.6 - I was trying to stick with the standards on the mellanox 
disk.
Is there a known problem with 1.4.3 ?



 From: Yevgeny Kliteynik <klit...@dev.mellanox.co.il>
To: Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
<us...@open-mpi.org> 
Sent: Sunday, 2 September 2012 10:54 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling
 
Randolph,

Some clarification on the setup:

"Melanox III HCA 10G
 cards" - are those ConnectX 3 cards configured to Ethernet?
That is, when you're using openib BTL, you mean RoCE, right?

Also, have you had a chance to try some newer OMPI release?
Any 1.6.x would do.


-- YK

On 8/31/2012 10:53 AM, Randolph Pullen wrote:
> (reposted with consolidated information)
> I have a test rig comprising 2 i7 systems 8GB RAM with Melanox III HCA 10G 
> cards
> running Centos 5.7 Kernel 2.6.18-274
> Open MPI 1.4.3
> MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
> On a Cisco 24 pt switch
> Normal performance is:
> $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
> results in:
> Max rate = 958.388867 MB/sec Min latency = 4.529953 usec
> and:
> $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
> Max rate = 653.547293 MB/sec Min latency = 19.550323 usec
> NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes which seems fine.
> log_num_mtt =20 and log_mtts_per_seg params =2
> My application exchanges about a gig of data between the processes with 2 
> sender and 2 consumer processes on each node with 1 additional controller 
> process on the starting node.
> The program splits the data into 64K blocks and uses non blocking sends and 
> receives with busy/sleep loops to monitor progress until completion.
> Each process owns a single buffer for these 64K blocks.
> My problem is I see better performance under IPoIB than I do on native IB 
> (RDMA_CM).
> My understanding is that IPoIB is limited to about 1G/s so I am at a loss to 
> know why it is faster.
> These 2 configurations are equivalent (about 8-10 seconds per cycle)
> mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl tcp,self 
> -H vh2,vh1 -np 9 --bycore prog
> mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog
> And this one produces similar run times but seems to degrade with repeated 
> cycles:
> mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl 
> openib,self -H vh2,vh1 -np 9 --bycore prog
> Other btl_openib_flags settings result in much lower performance.
> Changing the first of the above configs to use openIB results in a 21 second 
> run time at best. Sometimes it takes up to 5 minutes.
> In all cases, OpenIB runs in twice the time it takes TCP, except if I push the 
> small message max to 64K and force short messages. Then the openib times are 
> the same as TCP and no faster.
> With openib:
> - Repeated cycles during a single run seem to slow down with each cycle
> (usually by about 10 seconds).
> - On occasions it seems to stall indefinitely, waiting on a single receive.
> I'm still at a loss as to why. I can’t find any errors logged during the runs.
> Any ideas appreciated.
> Thanks in advance,
> Randolph
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

[OMPI users] Infiniband performance Problem and stalling

2012-08-31 Thread Randolph Pullen
(reposted with consolidated information)
I have a test rig comprising 2 i7 systems 8GB RAM with Melanox III HCA 10G cards
running Centos 5.7 Kernel 2.6.18-274
Open MPI 1.4.3
MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
On a Cisco 24 pt switch

Normal performance is:
$ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
results in:
Max rate = 958.388867 MB/sec  Min latency = 4.529953 usec
and:
$ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
Max rate = 653.547293 MB/sec  Min latency = 19.550323 usec

NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes which seems fine.
log_num_mtt =20 and log_mtts_per_seg params =2
 
My application exchanges about a gig of data between the processes with 2 sender
and 2 consumer processes on each node with 1 additional controller process on
the starting node.
The program splits the data into 64K blocks and uses non blocking sends and
receives with busy/sleep loops to monitor progress until completion.
Each process owns a single buffer for these 64K blocks.
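
For illustration, the sender side has roughly this shape - a simplified sketch,
not the actual program; the names and the single-buffer handling are only there
to show the pattern:

    #include <mpi.h>
    #include <string.h>

    #define BLOCK 65536

    void send_stream(const char *data, long total, int dest, int tag)
    {
        static char buf[BLOCK];
        MPI_Request req = MPI_REQUEST_NULL;

        for (long off = 0; off < total; off += BLOCK) {
            int n = (total - off < BLOCK) ? (int)(total - off) : BLOCK;

            /* The single buffer is owned by the outstanding Isend until it
             * completes, so wait (or busy/sleep on MPI_Test) before refilling. */
            if (req != MPI_REQUEST_NULL)
                MPI_Wait(&req, MPI_STATUS_IGNORE);

            memcpy(buf, data + off, n);
            MPI_Isend(buf, n, MPI_BYTE, dest, tag, MPI_COMM_WORLD, &req);
        }
        if (req != MPI_REQUEST_NULL)
            MPI_Wait(&req, MPI_STATUS_IGNORE);
    }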
 
 
My problem is I see better performance under IPoIB than I do on native IB (RDMA_CM).
My understanding is that IPoIB is limited to about 1G/s so I am at a loss to
know why it is faster.

These 2 configurations are equivalent (about 8-10 seconds per cycle):
mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog
mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog

And this one produces similar run times but seems to degrade with repeated cycles:
mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog

Other btl_openib_flags settings result in much lower performance.
Changing the first of the above configs to use openIB results in a 21 second
run time at best.  Sometimes it takes up to 5 minutes.
In all cases, OpenIB runs in twice the time it takes TCP, except if I push the
small message max to 64K and force short messages.  Then the openib times are
the same as TCP and no faster.

With openib:
- Repeated cycles during a single run seem to slow down with each cycle
(usually by about 10 seconds).
- On occasions it seems to stall indefinitely, waiting on a single receive.

I'm still at a loss as to why.  I can’t find any errors logged during the runs.
Any ideas appreciated.
 
Thanks in advance,
Randolph

Re: [OMPI users] Infiniband performance Problem and stalling

2012-08-30 Thread Randolph Pullen
Paul, I tried NetPipeMPI (belatedly, because their site was down for a
couple of days).

The results show a max of 7.4 Gb/s at 8388605 bytes which seems fine.

But my program still runs slowly and stalls occasionally.
I'm using 1 buffer per process - I assume this is OK.
Is it of any significance that the log_num_mtt and log_mtts_per_seg params
were not set?
Is this a symptom of a broken install?

Reposting original message for clarity - it's been a few days...
2nd posts are below this 1st section

I have a test rig comprising 2 i7 systems 8GB RAM with Melanox III HCA 10G cards
running Centos 5.7 Kernel 2.6.18-274
Open MPI 1.4.3
MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
On a Cisco 24 pt switch

Normal performance is:
$ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts  PingPong
results in:
 Max rate = 958.388867 MB/sec  Min latency = 4.529953 usec
and:
$ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts  PingPong
Max rate = 653.547293 MB/sec  Min latency = 19.550323 usec


My application exchanges about a gig of data between the processes with 2 
sender and 2 consumer processes on each node with 1 additional controller 
process on the starting node.
The program splits the data into 64K blocks and uses non blocking sends and 
receives with busy/sleep loops to monitor progress until completion.

My problem is I see better performance under IPoIB than I do on native IB 
(RDMA_CM).
My understanding is that IPoIB is limited to about 1G/s so I am at a loss to 
know why it is faster.

These 2 configurations are equivalent (about 8-10 seconds per cycle)
mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl tcp,self -H 
vh2,vh1 -np 9 --bycore prog
mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl tcp,self -H 
vh2,vh1 -np 9 --bycore prog

And this one produces similar run times but seems to degrade with repeated 
cycles:
mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl 
openib,self -H vh2,vh1 -np 9 --bycore  prog

Other  btl_openib_flags settings result in much lower performance. 
Changing the first of the above configs to use openIB results in a 21 second 
run time at best.  Sometimes it takes up to 5 minutes.
With openib:
- Repeated cycles during a single run seem to slow down with each cycle.
- On occasions it seems to stall indefinitely, waiting on a single receive. 

Any ideas appreciated.

Thanks in advance,
Randolph



 From: Randolph Pullen <randolph_pul...@yahoo.com.au>
To: Paul Kapinos <kapi...@rz.rwth-aachen.de>; Open MPI Users 
<us...@open-mpi.org> 
Sent: Thursday, 30 August 2012 11:46 AM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling
 

Interesting, the log_num_mtt and log_mtts_per_seg params were not set.
Setting them to utilise 2*8G of my RAM resulted in no change to the stalls or 
run time ie; (19,3) (20,2) (21,1) or (6,16). 
In all cases, OpenIB runs in twice the time it takes TCP, except if I push the 
small message max to 64K and force short messages.  Then the openib times are 
the same as TCP and no faster.

I'm still at a loss as to why...



 From: Paul Kapinos <kapi...@rz.rwth-aachen.de>
To: Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
<us...@open-mpi.org> 
Sent: Tuesday, 28 August 2012 6:13 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling
 
Randolph,
after reading this:

On 08/28/12 04:26, Randolph Pullen wrote:
> - On occasions it seems to stall indefinitely, waiting on a single receive.

... I would make a blind guess: are you aware of the IB card parameters for
registered memory?
http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

"Waiting forever" for a single operation is one of symptoms of the problem 
especially in 1.5.3.


best,
Paul

P.S. The lower performance with 'big' chunks is a known phenomenon, cf.
http://www.scl.ameslab.gov/netpipe/
(image on bottom of the page). But the chunk size of 64k is fairly small




-- Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] Infiniband performance Problem and stalling

2012-08-29 Thread Randolph Pullen
Interesting, the log_num_mtt and log_mtts_per_seg params were not set.
Setting them to utilise 2*8G of my RAM resulted in no change to the stalls or 
run time ie; (19,3) (20,2) (21,1) or (6,16). 
In all cases, OpenIB runs in twice the time it takes TCP, except if I push the 
small message max to 64K and force short messages.  Then the openib times are 
the same as TCP and no faster.

I'm still at a loss as to why...



 From: Paul Kapinos <kapi...@rz.rwth-aachen.de>
To: Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
<us...@open-mpi.org> 
Sent: Tuesday, 28 August 2012 6:13 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling
 
Randolph,
after reading this:

On 08/28/12 04:26, Randolph Pullen wrote:
> - On occasions it seems to stall indefinitely, waiting on a single receive.

... I would make a blind guess: are you aware of the IB card parameters for 
registered memory?
http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

"Waiting forever" for a single operation is one of symptoms of the problem 
especially in 1.5.3.


best,
Paul

P.S. The lower performance with 'big' chunks is a known phenomenon, cf.
http://www.scl.ameslab.gov/netpipe/
(image on bottom of the page). But the chunk size of 64k is fairly small




-- Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915

[OMPI users] Infiniband performance Problem and stalling

2012-08-27 Thread Randolph Pullen
I have a test rig comprising 2 i7 systems with Melanox III HCA 10G cards

running Centos 5.7 Kernel 2.6.18-274
Open MPI 1.4.3
MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
On a Cisco 24 pt switch

Normal performance is:
$ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts  PingPong
results in:
 Max rate = 958.388867 MB/sec  Min latency = 4.529953 usec
and:
$ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts  PingPong
Max rate = 653.547293 MB/sec  Min latency = 19.550323 usec


My application exchanges about a gig of data between the processes with 2 
sender and 2 consumer processes on each node with 1 additional controller 
process on the starting node.
The program splits the data into 64K blocks and uses non blocking sends and 
receives with busy/sleep loops to monitor progress until completion.

My problem is I see better performance under IPoIB than I do on native IB 
(RDMA_CM).
My understanding is that IPoIB is limited to about 1G/s so I am at a loss to 
know why it is faster.

These 2 configurations are equivalent (about 8-10 seconds per cycle)
mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl tcp,self -H 
vh2,vh1 -np 9 --bycore prog
mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl tcp,self -H 
vh2,vh1 -np 9 --bycore prog

And this one produces similar run times but seems to degrade with repeated 
cycles:
mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl 
openib,self -H vh2,vh1 -np 9 --bycore  prog

Other  btl_openib_flags settings result in much lower performance. 
Changing the first of the above configs to use openIB results in a 21 second 
run time at best.  Sometimes it takes up to 5 minutes.
With openib:

- Repeated cycles during a single run seem to slow down with each cycle.
- On occasions it seems to stall indefinitely, waiting on a single receive. 

Any ideas appreciated.

Thanks in advance,
Randolph


Re: [OMPI users] Fw: system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen
Aha! I may have been stupid.

The perl program is calling another small openMPI routine via mpirun and
system().
That is bad, isn't it?

How can MPI tell that a program two system() calls away is MPI?
A better question is: how can I trick it into not knowing that it's MPI, so it
runs just as it does when started manually?
Doh!
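
One untested idea, assuming Open MPI recognises a nested launch through the
OMPI_* (and PMI*) environment variables that mpirun exports: fork, strip those
variables from the child's environment only, and exec the command there, so the
inner mpirun sees no evidence that it is inside an MPI job.  A rough sketch:

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    extern char **environ;

    static void run_outside_mpi(const char *cmd)
    {
        if (fork() == 0) {                        /* child only */
            for (char **e = environ; *e; ) {
                if (!strncmp(*e, "OMPI_", 5) || !strncmp(*e, "PMI", 3)) {
                    char name[256];
                    size_t n = strcspn(*e, "=");
                    if (n >= sizeof(name)) n = sizeof(name) - 1;
                    memcpy(name, *e, n);
                    name[n] = '\0';
                    unsetenv(name);
                    e = environ;                  /* environ may have shifted */
                } else {
                    e++;
                }
            }
            execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
            _exit(127);                           /* exec failed */
        }
        /* parent carries on; the command backgrounds itself with '&' anyway */
    }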


 From: Randolph Pullen <randolph_pul...@yahoo.com.au>
To: Ralph Castain <r...@open-mpi.org>; Open MPI Users <us...@open-mpi.org> 
Sent: Friday, 20 January 2012 2:17 PM
Subject: Re: [OMPI users] Fw:  system() call corrupts MPI processes
 

Removing the redirection to the log makes no difference.

Running the script externally is fine (method #1).  The problem only occurs when
the perl is started by the OpenMPI process (method #2).
Both methods open a TCP socket and both methods have the perl do a(nother)
system call.

Apart from MPI starting the perl, both methods are identical.

BTW - the perl is the server, the openMPI is the client.



 From: Ralph Castain <r...@open-mpi.org>
To: Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
<us...@open-mpi.org> 
Sent: Friday, 20 January 2012 1:57 PM
Subject: Re: [OMPI users] Fw:  system() call corrupts MPI processes
 

That is bizarre. Afraid you have me stumped here - I can't think why an action 
in the perl script would trigger an action in OMPI. If your OMPI proc doesn't 
in any way read the info in "log" (using your example), does it still have a 
problem? In other words, if the perl script still executes a system command, 
but the OMPI proc doesn't interact with it in any way, does the problem persist?

What I'm searching for is the connection. If your OMPI proc reads the results 
of that system command, then it's possible that something in your app is 
corrupting memory during the read operation - e.g., you are reading in more 
info than you have allocated memory.




On Jan 19, 2012, at 7:33 PM, Randolph Pullen wrote:

FYI
>----- Forwarded Message -
>From: Randolph Pullen <randolph_pul...@yahoo.com.au>
>To: Jeff Squyres <jsquy...@cisco.com> 
>Sent: Friday, 20 January 2012 12:45 PM
>Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
>
>I'm using TCP on 1.4.1 (its actually IPoIB)
>OpenIB is compiled in.
>Note that these nodes are containers running in OpenVZ where IB is not 
>available.  there may be some SDP running in system level routines on the VH 
>but this is unlikely.
>OpenIB is not available to the VMs.  they happily get TCP services from the VH
>In any case, the problem still occurs if I use: --mca btl tcp,self
>
>
>I have traced the perl code and observed that OpenMPI gets confused whenever 
>the perl program executes a system command itself
>eg:
>`command 2>&1 1> log`;
>
>
>This probably narrows it down (I hope)
>
>
> From: Jeff Squyres <jsquy...@cisco.com>
>To: Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
><us...@open-mpi.org> 
>Sent: Friday, 20 January 2012 1:52 AM
>Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
>Which network transport are you using, and what version of Open MPI are you 
>using?  Do you have OpenFabrics support compiled into your Open MPI 
>installation?
>
>If you're just using TCP and/or shared memory, I can't think of a reason 
>immediately as to why this wouldn't work, but there may be a subtle 
>interaction in there
 somewhere that causes badness (e.g., memory corruption).
>
>
>On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
>
>> 
>> I have a section in my code running in rank 0 that must start a perl program 
>> that it then connects to via a tcp socket.
>> The initialisation section is shown here:
>> 
>>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>>     int i = system(buf);
>>     printf("system returned %d\n", i);
>> 
>> 
>> Some time after I run this code, while waiting for the data from the perl 
>> program, the error below occurs:
>> 
>> qplan connection
>> DCsession_fetch: waiting for Mcode data...
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file rml_oob_send.c at 
>> line 105
>> [dc1:05387] [[40050,1],0] could not get route to
 [[INVALID],INVALID]
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file 
>> base/plm_base_proxy.c at line 86
>> 
>> 
>> It seems that the linux system() call is breaking OpenMPI internal 
>> connections.  Removing the system() call 

Re: [OMPI users] Fw: system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen
Removing the redirection to the log makes no difference.

Running the script externally is fine (method #1).  The problem only occurs when
the perl is started by the OpenMPI process (method #2).
Both methods open a TCP socket and both methods have the perl do a(nother)
system call.

Apart from MPI starting the perl, both methods are identical.

BTW - the perl is the server, the openMPI is the client.



 From: Ralph Castain <r...@open-mpi.org>
To: Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
<us...@open-mpi.org> 
Sent: Friday, 20 January 2012 1:57 PM
Subject: Re: [OMPI users] Fw:  system() call corrupts MPI processes
 

That is bizarre. Afraid you have me stumped here - I can't think why an action 
in the perl script would trigger an action in OMPI. If your OMPI proc doesn't 
in any way read the info in "log" (using your example), does it still have a 
problem? In other words, if the perl script still executes a system command, 
but the OMPI proc doesn't interact with it in any way, does the problem persist?

What I'm searching for is the connection. If your OMPI proc reads the results 
of that system command, then it's possible that something in your app is 
corrupting memory during the read operation - e.g., you are reading in more 
info than you have allocated memory.




On Jan 19, 2012, at 7:33 PM, Randolph Pullen wrote:

FYI
>- Forwarded Message -
>From: Randolph Pullen <randolph_pul...@yahoo.com.au>
>To: Jeff Squyres <jsquy...@cisco.com> 
>Sent: Friday, 20 January 2012 12:45 PM
>Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
>
>I'm using TCP on 1.4.1 (its actually IPoIB)
>OpenIB is compiled in.
>Note that these nodes are containers running in OpenVZ where IB is not 
>available.  there may be some SDP running in system level routines on the VH 
>but this is unlikely.
>OpenIB is not available to the VMs.  they happily get TCP services from the VH
>In any case, the problem still occurs if I use: --mca btl tcp,self
>
>
>I have traced the perl code and observed that OpenMPI gets confused whenever 
>the perl program executes a system command itself
>eg:
>`command 2>&1 1> log`;
>
>
>This probably narrows it down (I hope)
>
>
> From: Jeff Squyres <jsquy...@cisco.com>
>To: Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
><us...@open-mpi.org> 
>Sent: Friday, 20 January 2012 1:52 AM
>Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
>Which network transport are you using, and what version of Open MPI are you 
>using?  Do you have OpenFabrics support compiled into your Open MPI 
>installation?
>
>If you're just using TCP and/or shared memory, I can't think of a reason 
>immediately as to why this wouldn't work, but there may be a subtle 
>interaction in there
 somewhere that causes badness (e.g., memory corruption).
>
>
>On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
>
>> 
>> I have a section in my code running in rank 0 that must start a perl program 
>> that it then connects to via a tcp socket.
>> The initialisation section is shown here:
>> 
>>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>>     int i = system(buf);
>>     printf("system returned %d\n", i);
>> 
>> 
>> Some time after I run this code, while waiting for the data from the perl 
>> program, the error below occurs:
>> 
>> qplan connection
>> DCsession_fetch: waiting for Mcode data...
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file rml_oob_send.c at 
>> line 105
>> [dc1:05387] [[40050,1],0] could not get route to
 [[INVALID],INVALID]
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file 
>> base/plm_base_proxy.c at line 86
>> 
>> 
>> It seems that the linux system() call is breaking OpenMPI internal 
>> connections.  Removing the system() call and executing the perl code 
>> externally fixes the problem but I can't go into production like that as it's 
>> a security issue.
>> 
>> Any ideas ?
>> 
>> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
>> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>-- 
>Jeff
 Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users

[OMPI users] Fw: system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen


FYI
- Forwarded Message -
From: Randolph Pullen <randolph_pul...@yahoo.com.au>
To: Jeff Squyres <jsquy...@cisco.com> 
Sent: Friday, 20 January 2012 12:45 PM
Subject: Re: [OMPI users] system() call corrupts MPI processes
 

I'm using TCP on 1.4.1 (it's actually IPoIB).
OpenIB is compiled in.
Note that these nodes are containers running in OpenVZ where IB is not
available.  There may be some SDP running in system-level routines on the VH,
but this is unlikely.
OpenIB is not available to the VMs.  They happily get TCP services from the VH.
In any case, the problem still occurs if I use: --mca btl tcp,self

I have traced the perl code and observed that OpenMPI gets confused whenever 
the perl program executes a system command itself
eg:
`command 2>&1 1> log`;

This probably narrows it down (I hope)


 From: Jeff Squyres <jsquy...@cisco.com>
To: Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
<us...@open-mpi.org> 
Sent: Friday, 20 January 2012 1:52 AM
Subject: Re: [OMPI users] system() call corrupts MPI processes
 
Which network transport are you using, and what version of Open MPI are you 
using?  Do you have OpenFabrics support compiled into your Open MPI 
installation?

If you're just using TCP and/or shared memory, I can't think of a reason 
immediately as to why this wouldn't work, but there may be a subtle interaction 
in there
 somewhere that causes badness (e.g., memory corruption).


On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:

> 
> I have a section in my code running in rank 0 that must start a perl program 
> that it then connects to via a tcp socket.
> The initialisation section is shown here:
> 
>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>     int i = system(buf);
>     printf("system returned %d\n", i);
> 
> 
> Some time after I run this code, while waiting for the data from the perl 
> program, the error below occurs:
> 
> qplan connection
> DCsession_fetch: waiting for Mcode data...
> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
> to a process whose contact information is unknown in file rml_oob_send.c at 
> line 105
> [dc1:05387] [[40050,1],0] could not get route to
 [[INVALID],INVALID]
> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
> to a process whose contact information is unknown in file 
> base/plm_base_proxy.c at line 86
> 
> 
> It seems that the linux system() call is breaking OpenMPI internal 
> connections.  Removing the system() call and executing the perl code 
> externally fixes the problem but I can't go into production like that as it's a 
> security issue.
> 
> Any ideas ?
> 
> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff
 Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

Re: [OMPI users] system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen
Hi Ralph,

The port is defined in config as 5000; it is used by both versions, so you would
think both would fail if there were an issue.
Is there any way of reserving ports for non-MPI use?
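
For illustration, one way to avoid hard-coding 5000 would be to let the kernel
pick a free ephemeral port and pass that number to session_server.pl -p.  A
rough sketch (there is still a small race window between closing the probe
socket and the perl script binding the port):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int pick_free_port(void)
    {
        struct sockaddr_in sa;
        socklen_t len = sizeof(sa);
        int port, fd = socket(AF_INET, SOCK_STREAM, 0);

        memset(&sa, 0, sizeof(sa));
        sa.sin_family      = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port        = 0;               /* 0 = kernel picks a free port */

        bind(fd, (struct sockaddr *)&sa, sizeof(sa));
        getsockname(fd, (struct sockaddr *)&sa, &len);
        port = ntohs(sa.sin_port);
        close(fd);                            /* the perl script can now bind it */
        return port;
    }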



 From: Ralph Castain <r...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org> 
Sent: Friday, 20 January 2012 10:30 AM
Subject: Re: [OMPI users] system() call corrupts MPI processes
 
Hi Randolph!

Sorry for delay - was on the road. This isn't an issue of corruption. What ORTE 
is complaining about is that your perl script wound up connecting to the same 
port that your process is listening on via ORTE. ORTE is rather particular 
about the message format - specifically, it requires a header that includes the 
name of the process and a bunch of other stuff.

Where did you get the port that you are passing to your perl script?


On Jan 19, 2012, at 8:22 AM, Durga Choudhury wrote:

> This is just a thought:
> 
> according to the system() man page, 'SIGCHLD' is blocked during the
> execution of the program. Since you are executing your command as a
> daemon in the background, it will be permanently blocked.
> 
> Does OpenMPI daemon depend on SIGCHLD in any way? That is about the
> only difference that I can think of between running the command
> stand-alone (which works) and running via a system() API call (that
> does not work).
> 
> Best
> Durga
> 
> 
> On Thu, Jan 19, 2012 at 9:52 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
>> Which network transport are you using, and what version of Open MPI are you 
>> using?  Do you have OpenFabrics support compiled into your Open MPI 
>> installation?
>> 
>> If you're just using TCP and/or shared memory, I can't think of a reason 
>> immediately as to why this wouldn't work, but there may be a subtle 
>> interaction in there somewhere that causes badness (e.g., memory corruption).
>> 
>> 
>> On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
>> 
>>> 
>>> I have a section in my code running in rank 0 that must start a perl 
>>> program that it then connects to via a tcp socket.
>>> The initialisation section is shown here:
>>> 
>>>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>>>     int i = system(buf);
>>>     printf("system returned %d\n", i);
>>> 
>>> 
>>> Some time after I run this code, while waiting for the data from the perl 
>>> program, the error below occurs:
>>> 
>>> qplan connection
>>> DCsession_fetch: waiting for Mcode data...
>>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be 
>>> sent to a process whose contact information is unknown in file 
>>> rml_oob_send.c at line 105
>>> [dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
>>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be 
>>> sent to a process whose contact information is unknown in file 
>>> base/plm_base_proxy.c at line 86
>>> 
>>> 
>>> It seems that the linux system() call is breaking OpenMPI internal 
>>> connections.  Removing the system() call and executing the perl code 
>>> externally fixes the problem but I can't go into production like that as it's 
>>> a security issue.
>>> 
>>> Any ideas ?
>>> 
>>> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
>>> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen
I assume that the SIGCHLD block was released after starting the daemon, i.e. on
return of the system() call.



 From: Durga Choudhury <dpcho...@gmail.com>
To: Open MPI Users <us...@open-mpi.org> 
Sent: Friday, 20 January 2012 2:22 AM
Subject: Re: [OMPI users] system() call corrupts MPI processes
 
This is just a thought:

according to the system() man page, 'SIGCHLD' is blocked during the
execution of the program. Since you are executing your command as a
daemon in the background, it will be permanently blocked.

Does OpenMPI daemon depend on SIGCHLD in any way? That is about the
only difference that I can think of between running the command
stand-alone (which works) and running via a system() API call (that
does not work).

Best
Durga


On Thu, Jan 19, 2012 at 9:52 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> Which network transport are you using, and what version of Open MPI are you 
> using?  Do you have OpenFabrics support compiled into your Open MPI 
> installation?
>
> If you're just using TCP and/or shared memory, I can't think of a reason 
> immediately as to why this wouldn't work, but there may be a subtle 
> interaction in there somewhere that causes badness (e.g., memory corruption).
>
>
> On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
>
>>
>> I have a section in my code running in rank 0 that must start a perl program 
>> that it then connects to via a tcp socket.
>> The initialisation section is shown here:
>>
>>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>>     int i = system(buf);
>>     printf("system returned %d\n", i);
>>
>>
>> Some time after I run this code, while waiting for the data from the perl 
>> program, the error below occurs:
>>
>> qplan connection
>> DCsession_fetch: waiting for Mcode data...
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file rml_oob_send.c at 
>> line 105
>> [dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file 
>> base/plm_base_proxy.c at line 86
>>
>>
>> It seems that the linux system() call is breaking OpenMPI internal 
>> connections.  Removing the system() call and executing the perl code 
>> externally fixes the problem but I can't go into production like that as it's 
>> a security issue.
>>
>> Any ideas ?
>>
>> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
>> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

[OMPI users] system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen

I have a section in my code running in rank 0 that must start a perl program 
that it then connects to via a tcp socket.
The initialisation section is shown here:

    sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
    int i = system(buf);
    printf("system returned %d\n", i);


Some time after I run this code, while waiting for the data from the perl 
program, the error below occurs:

qplan connection
DCsession_fetch: waiting for Mcode data...
[dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to 
a process whose contact information is unknown in file rml_oob_send.c at line 
105
[dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
[dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to 
a process whose contact information is unknown in file base/plm_base_proxy.c at 
line 86


It seems that the linux system() call is breaking OpenMPI internal connections.
Removing the system() call and executing the perl code externally fixes the
problem, but I can't go into production like that as it's a security issue.

Any ideas ?


(environment: OpenMPI 1.4.1 on kernel Linux dc1 2.6.18-274.3.1.el5.028stab094.3 
 using TCP and mpirun)


Re: [OMPI users] High CPU usage with yield_when_idle =1 on CFS

2011-09-03 Thread Randolph Pullen
I have already implemented test/sleep code, but the main problem is with the
broadcasts that send out the SIMD instructions, because these are blocking and,
when the system is idle, it's these guys who consume the CPU while waiting for
work.
Implementing
echo "1" > /proc/sys/kernel/sched_compat_yield
helps quite a bit (thanks Jeff) in that it makes yield more aggressive, but the
fundamental problem still remains.

A non-blocking broadcast would fix it, and/or a good scheduler.
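
For reference, the test/sleep shape I use on the point-to-point side looks
roughly like this (a sketch only; the 1 ms interval is an arbitrary
placeholder).  The blocking MPI_Bcast in 1.4.x gives no request handle to test,
which is exactly the gap:

    #include <mpi.h>
    #include <time.h>

    static void wait_lazily(MPI_Request *req)
    {
        int done = 0;
        struct timespec ts = { 0, 1000000 };   /* 1 ms */

        while (!done) {
            MPI_Test(req, &done, MPI_STATUS_IGNORE);
            if (!done)
                nanosleep(&ts, NULL);          /* hand the CPU back to the scheduler */
        }
    }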

Do other MPIs use busy loops so extensively in their comms?



From: Jeff Squyres <jsquy...@cisco.com>
To: Open MPI Users <us...@open-mpi.org>
Sent: Friday, 2 September 2011 9:45 PM
Subject: Re: [OMPI users] High CPU usage with yield_when_idle =1 on CFS

This might also be in reference to the issue that sched_yield() really does
nothing in recent Linux kernels (there was a big debate about this at
kernel.org).

IIRC, there's some kernel parameter that you can tweak to make it behave 
better, but I'm afraid I don't remember what it is.  Some googling might find 
it...?


On Sep 1, 2011, at 10:06 PM, Eugene Loh wrote:

> On 8/31/2011 11:48 PM, Randolph Pullen wrote:
>> I recall a discussion some time ago about yield, the Completely F%’d 
>> Scheduler (CFS) and OpenMPI.
>> 
>> My system is currently suffering from massive CPU use while busy waiting.  
>> This gets worse as I try to bump up user concurrency.
> Yup.
>> I am running with yield_when_idle but it's not enough.
> Yup.
>> Is there anything else I can do to release some CPU resource?
>> I recall seeing one post where usleep(1) was inserted around the yields, is 
>> this still feasible?
>> 
>> I'm using 1.4.1 - is there a fix to be found in upgrading?
>> Unfortunately I am stuck  with the CFS as I need Linux.  Currently its 
>> Ubuntu 10.10 with 2.6.32.14
> I think OMPI doesn't yet do (much/any) better than what you've observed.  You 
> might be able to hack something up yourself.  In something I did recently, I 
> replaced blocking sends and receives with test/nanosleep loops.  An "optimum" 
> solution (minimum latency, optimal performance at arbitrary levels of under 
> and oversubscription) might be elusive, but hopefully you'll quickly be able 
> to piece together something for your particular purposes.  In my case, I was 
> lucky and the results were very gratifying... my bottleneck vaporized for 
> modest levels of oversubscription (2-4 more processes than processors).
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

[OMPI users] High CPU usage with yield_when_idle =1 on CFS

2011-09-01 Thread Randolph Pullen
I recall a discussion some time ago about yield, the Completely F%’d Scheduler 
(CFS) and OpenMPI.

My system is currently suffering from massive CPU use while busy waiting.  This 
gets worse as I try to bump up user concurrency.

I am running with yield_when_idle but it's not enough. Is there anything else I 
can do to release some CPU resource?
I recall seeing one post where usleep(1) was inserted around the yields, is 
this still feasible? 

I'm using 1.4.1 - is there a fix to be found in upgrading?
Unfortunately I am stuck with the CFS as I need Linux.  Currently it's Ubuntu 
10.10 with 2.6.32.14


Thanks in advance

Re: [OMPI users] Mpirun only works when n< 3

2011-07-13 Thread Randolph Pullen
Got it.  Building a new openMPI solved it.

I don't know if the standard Ubuntu install was the problem or if it just
didn't like the slightly later kernel.
There seems to be reason to be suspicious of Ubuntu 10.10 OpenMPI builds if you
have anything unusual in your system.
Thanks.
--- On Tue, 12/7/11, Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] Mpirun only works when n< 3
To: randolph_pul...@yahoo.com.au
Cc: "Open MPI Users" <us...@open-mpi.org>
Received: Tuesday, 12 July, 2011, 10:29 PM

On Jul 11, 2011, at 11:31 AM, Randolph Pullen wrote:

> There are no firewalls by default.  I can ssh between both nodes without a 
> password so I assumed that all is good with the comms.

FWIW, ssh'ing is different than "comms" (by which I assume you mean opening
random TCP sockets between two servers).

> I can also get both nodes to participate in the ring program at the same time.
> It's just that I am limited to only 2 processes if they are split between the 
> nodes
> ie:
> mpirun -H A,B ring                         (works)
> mpirun -H A,A,A,A,A,A,A  ring     (works)
> mpirun -H B,B,B,B ring                 (works)
> mpirun -H A,B,A  ring                    (hangs)

It is odd that A,B works and A,B,A does not.

> I have discovered slightly more information:
> When I replace node 'B' from the new cluster with node 'C' from the old 
> cluster
> I get similar behavior but with an error message:
> mpirun -H A,A,A,A,A,A,A  ring     (works from either node)
> mpirun -H C,C,C  ring     (works from either node)
> mpirun -H A,C  ring     (Fails from either node:)
> Process 0 sending 10 to 1, tag 201 (3 processes in ring)
> [C:23465] ***  An error occurred in MPI_Recv
> [C:23465] ***  on communicator MPI_COMM_WORLD
> [C:23465] ***  MPI_ERRORS_ARE FATAL (your job will now abort)
> Process 0 sent to 1
> --
> Running this on either node A or C produces the same result
> Node C runs openMPI 1.4.1 and is an ordinary dual core on FC10 , not an i5 
> 2400 like the others.
> all the binaries are compiled on FC10 with gcc 4.3.2


Are you sure that all the versions of Open MPI being used on all nodes are 
exactly the same?  I.e., are you finding/using Open MPI v1.4.1 on all nodes?

Are the nodes homogeneous in terms of software?  If they're heterogeneous in 
terms of hardware, you *might* need to have separate OMPI installations on each 
machine (vs., for example, a network-filesystem-based install shared to all 3) 
because the compiler's optimizer may produce code tailored for one of the 
machines, and it may therefore fail in unexpected ways on the other(s).  The 
same is true for your executable.

See this FAQ entry about heterogeneous setups:

    http://www.open-mpi.org/faq/?category=building#where-to-install

...hmm.  I could have sworn we had more on the FAQ about heterogeneity, but 
perhaps not.  The old LAM/MPI FAQ on heterogeneity is somewhat outdated, but 
most of its concepts are directly relevant to Open MPI as well:

    http://www.lam-mpi.org/faq/category11.php3

I should probably copy most of that LAM/MPI heterogeneous FAQ to the Open MPI 
FAQ, but it'll be waaay down on my priority list.  :-(  If anyone could help 
out here, I'd be happy to point them in the right direction to convert the 
LAM/MPI FAQ PHP to Open MPI FAQ PHP...  

To be clear: the PHP conversion will be pretty trivial; I stole heavily from 
the LAM/MPI FAQ PHP to create the Open MPI FAQ PHP -- but there are points 
where the LAM/MPI heterogeneity text needs to be updated; that'll take an hour 
or two to update all that content.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Mpirun only works when n< 3

2011-07-11 Thread Randolph Pullen
I have discovered slightly more information:
When I replace node 'B' from the new cluster with node 'C' from the old cluster
I get similar behavior but with an error message:
mpirun -H A,A,A,A,A,A,A  ring     (works from either node)
mpirun -H C,C,C  ring     (works from either node)
mpirun -H A,C  ring     (Fails from either node:)
Process 0 sending 10 to 1, tag 201 (3 processes in ring)
[C:23465] ***  An error occurred in MPI_Recv
[C:23465] ***  on communicator MPI_COMM_WORLD
[C:23465] ***  MPI_ERRORS_ARE FATAL (your job will now abort)
Process 0 sent to 1
--
Running this on either node A or C produces the same result.
Node C runs openMPI 1.4.1 and is an ordinary dual core on FC10, not an i5 2400
like the others.
All the binaries are compiled on FC10 with gcc 4.3.2.
--- On Tue, 12/7/11, Randolph Pullen <randolph_pul...@yahoo.com.au> wrote:

From: Randolph Pullen <randolph_pul...@yahoo.com.au>
Subject: Re: [OMPI users] Mpirun only works when n< 3
To: "Open MPI Users" <us...@open-mpi.org>, "Jeff Squyres" <jsquy...@cisco.com>
Received: Tuesday, 12 July, 2011, 1:31 AM

There are no firewalls by default.  I can ssh between both nodes without a
password so I assumed that all is good with the comms.
I can also get both nodes to participate in the ring program at the same time.
It's just that I am limited to only 2 processes if they are split between the
nodes, i.e.:
mpirun -H A,B ring              (works)
mpirun -H A,A,A,A,A,A,A ring    (works)
mpirun -H B,B,B,B ring          (works)
mpirun -H A,B,A ring            (hangs)

--- On Tue, 12/7/11,
 Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] Mpirun only works when n< 3
To: randolph_pul...@yahoo.com.au, "Open MPI Users" <us...@open-mpi.org>
Received: Tuesday, 12 July, 2011, 12:21 AM

Have you disabled firewalls between your compute nodes?


On Jul 11, 2011, at 9:34 AM, Randolph Pullen wrote:

> This appears to be similar to the problem described in:
> 
> https://svn.open-mpi.org/trac/ompi/ticket/2043
> 
> However, those fixes do not work for me.
> 
> I am running on an 
> 
> - i5 sandy bridge under Ubuntu 10.10  8 G RAM
> 
> - Kernel 2.6.32.14 with OpenVZ
 tweaks
> 
> - OpenMPI V 1.4.1
> 
> I am trying to migrate existing software to a new cluster (A,B)
> 
> Symptoms:
> 
> I can run the ring demo on a single machine, either A or B with any number of 
> processes.
> 
> But when I combine the 2 machines I am limited to 2 processes, any more and 
> MPI hangs.   It gets as far as:
> 
>       Process 0 sending 10 to 1, tag 201 (3 processes in ring)
> 
>       Process 0 sent to 1
> 
> and there it stays...
> 
> Any help greatly appreciated.
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


-Inline Attachment Follows-

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] Mpirun only works when n< 3

2011-07-11 Thread Randolph Pullen
There are no firewalls by default.  I can ssh between both nodes without a
password so I assumed that all is good with the comms.  I can also get both
nodes to participate in the ring program at the same time.  It's just that I am
limited to only 2 processes if they are split between the nodes, ie:

mpirun -H A,B ring               (works)
mpirun -H A,A,A,A,A,A,A ring     (works)
mpirun -H B,B,B,B ring           (works)
mpirun -H A,B,A ring             (hangs)

--- On Tue, 12/7/11, Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] Mpirun only works when n< 3
To: randolph_pul...@yahoo.com.au, "Open MPI Users" <us...@open-mpi.org>
Received: Tuesday, 12 July, 2011, 12:21 AM

Have you disabled firewalls between your compute nodes?


On Jul 11, 2011, at 9:34 AM, Randolph Pullen wrote:

> This appears to be similar to the problem described in:
> 
> https://svn.open-mpi.org/trac/ompi/ticket/2043
> 
> However, those fixes do not work for me.
> 
> I am running on an 
> 
> - i5 sandy bridge under Ubuntu 10.10  8 G RAM
> 
> - Kernel 2.6.32.14 with OpenVZ tweaks
> 
> - OpenMPI V 1.4.1
> 
> I am trying to migrate existing software to a new cluster (A,B)
> 
> Symptoms:
> 
> I can run the ring demo on a single machine, either A or B with any number of 
> processes.
> 
> But when I combine the 2 machines I am limited to 2 processes, any more and 
> MPI hangs.   It gets as far as:
> 
>       Process 0 sending 10 to 1, tag 201 (3 processes in ring)
> 
>       Process 0 sent to 1
> 
> and there it stays...
> 
> Any help greatly appreciated.
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] Mpirun only works when n< 3

2011-07-11 Thread Randolph Pullen

This appears to be similar to the problem described in:

https://svn.open-mpi.org/trac/ompi/ticket/2043

However, those fixes do not work for me. 

I am running on:

- an i5 Sandy Bridge under Ubuntu 10.10, 8 GB RAM

- Kernel 2.6.32.14 with OpenVZ tweaks

- OpenMPI V 1.4.1

I am trying to migrate existing software to a new cluster (A,B)

Symptoms:

I can run the ring demo on a single machine, either A or B,
with any number of processes.

But when I combine the 2 machines I am limited to 2 processes;
any more and MPI hangs.  It gets as far as:

      Process 0 sending 10 to 1, tag 201 (3 processes in ring)
      Process 0 sent to 1

and there it stays...

Any help greatly appreciated.



Re: [OMPI users] is there an equiv of iprove for bcast?

2011-05-10 Thread Randolph Pullen
Thanks,
The messages are small and frequent (they flash metadata across the cluster).
The current approach works fine for small to medium clusters but I want it to
be able to go big, maybe up to several hundred or even a thousand nodes.

It's these larger deployments that concern me.  The current scheme may see the
clearinghouse become overloaded in a very large cluster.

From what you have said, a possible strategy may be to combine the listener
and worker into a single process, using the non-blocking bcast just for that
group, while each worker scans its own port for an incoming request, which
it would in turn bcast to its peers.

As you have indicated though, this would depend on the load the non-blocking
bcast would cause.  At least the load would be fairly even over the cluster.
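
For what it's worth, a minimal sketch of the MPI-3 pattern being discussed,
assuming an MPI_Ibcast-capable library (it is not in the 1.4 series) and with
the socket-servicing work reduced to a comment:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, done = 0;
        char buf[64] = "metadata";          /* placeholder payload */
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Start the broadcast without blocking, then keep doing other work
           (e.g. scanning the listener's port) while testing for completion. */
        MPI_Ibcast(buf, sizeof buf, MPI_CHAR, 0, MPI_COMM_WORLD, &req);
        while (!done) {
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
            /* ... service the incoming TCP port here ... */
        }

        if (rank != 0)
            printf("rank %d received: %s\n", rank, buf);
        MPI_Finalize();
        return 0;
    }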

--- On Mon, 9/5/11, Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] is there an equiv of iprove for bcast?
To: randolph_pul...@yahoo.com.au
Cc: "Open MPI Users" <us...@open-mpi.org>
Received: Monday, 9 May, 2011, 11:27 PM

On May 3, 2011, at 8:20 PM, Randolph Pullen wrote:

> Sorry, I meant to say:
> - on each node there is 1 listener and 1 worker.
> - all workers act together when any of the listeners send them a request.
> - currently I must use an extra clearinghouse process to receive from any of 
> the listeners and bcast to workers, this is unfortunate because of the 
> potential scaling issues
> 
> I think you have answered this in that I must wait for MPI-3's non-blocking 
> collectives.

Yes and no.  If each worker starts N non-blocking broadcasts just to be able to 
test for completion of any of them, you might end up consuming a bunch of 
resources for them (I'm *anticipating* that pending non-blocking collective 
requests may be more heavyweight than pending non-blocking point-to-point 
requests).

But then again, if N is small, it might not matter.

> Can anyone suggest another way?  I don't like the serial clearinghouse 
> approach.

If you only have a few workers and/or the broadcast message is small and/or the 
broadcasts aren't frequent, then MPI's built-in broadcast algorithms might not 
offer much more optimization than doing your own with point-to-point 
mechanisms.  I don't usually recommend this, but it may be possible for your 
case.
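
A rough sketch of that hand-rolled point-to-point alternative, assuming a rank
layout where the workers occupy a contiguous range and a tag of 201 (both
illustrative, not taken from the original code): whichever listener has work
simply sends to every worker, and the workers receive with MPI_ANY_SOURCE so
the sender need not be known in advance.

    #include <mpi.h>

    #define WORK_TAG 201                 /* illustrative tag */

    /* Called by whichever listener has a full buffer to distribute. */
    static void push_to_workers(const char *buf, int len,
                                int first_worker, int nworkers)
    {
        for (int w = 0; w < nworkers; w++)
            MPI_Send((void *)buf, len, MPI_CHAR,
                     first_worker + w, WORK_TAG, MPI_COMM_WORLD);
    }

    /* Called by each worker; it does not care which listener sent the work. */
    static void wait_for_work(char *buf, int len)
    {
        MPI_Status st;
        MPI_Recv(buf, len, MPI_CHAR, MPI_ANY_SOURCE, WORK_TAG,
                 MPI_COMM_WORLD, &st);
    }

If the flat send loop ever becomes the bottleneck, a tree can be layered on
top of it, which is essentially what the library's own collectives do.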

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] is there an equiv of iprove for bcast?

2011-05-04 Thread Randolph Pullen
Sorry, I meant to say:

- on each node there is 1 listener and 1 worker.
- all workers act together when any of the listeners send them a request.
- currently I must use an extra clearinghouse process to receive from any of
  the listeners and bcast to workers; this is unfortunate because of the
  potential scaling issues.

I think you have answered this in that I must wait for MPI-3's non-blocking
collectives.

Can anyone suggest another way?  I don't like the serial clearinghouse approach.

--- On Wed, 4/5/11, Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] is there an equiv of iprove for bcast?
To: randolph_pul...@yahoo.com.au, "Open MPI Users" <us...@open-mpi.org>
Received: Wednesday, 4 May, 2011, 11:19 AM

I don't quite understand your architecture enough to answer your question.  
E.g., someone pointed out to me off-list that if you only have 1 listener, a 
send is effectively the same thing as a broadcast (for which you could 
test/wait on a non-blocking receive, for example).

MPI broadcasts only work on fixed communicators -- meaning that you effectively 
have to know the root and the receivers ahead of time.  If the receivers don't 
know who the root will be beforehand, that's unfortunately not a good match for 
the MPI_Bcast operation.
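
In code terms the constraint is simply that the root is an argument every rank
must agree on before the call (buf and len are placeholders):

    int root = 0;          /* must be known by all ranks in the communicator */
    MPI_Bcast(buf, len, MPI_CHAR, root, MPI_COMM_WORLD);
    /* There is no form of the call that receives from "whichever root fires first". */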



On May 3, 2011, at 4:07 AM, Randolph Pullen wrote:

> 
> From: Randolph Pullen <randolph_pul...@yahoo.com.au>
> Subject: Re: Re: [OMPI users] is there an equiv of iprove for bcast?
> To: us...@open-mpi.or
> Received: Monday, 2 May, 2011, 12:53 PM
> 
> Non blocking Bcasts or tests would do it.
> I currently have the clearing-house solution working but it is unsatisfying 
> because of its serial node. - As it scales it will overload this node.
> 
> The problem rephrased:
> Instead of n*2 processes, I am having to use n*2+1 with the extra process 
> serially receiving listener messages on behalf of the workers before 
> transmitting these messages to workers in its comm_group.
> 
> Is there a way to Bcast directly from each listener to the worker pool?  
> (listeners must monitor their ports most of the time and cant participate in 
> global bcasts)
> Not knowing which listener is going to transmit prevents the correct 
> comm_group being used with Bcast calls.
> 
> --- On Sat, 30/4/11, Jeff Squyres <jsquy...@cisco.com> wrote:
> 
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] is there an equiv of iprove for bcast?
> To: randolph_pul...@yahoo.com.au, "Open MPI Users" <us...@open-mpi.org>
> Received: Saturday, 30 April, 2011, 7:17 AM
> 
> On Apr 29, 2011, at 1:21 AM, Randolph Pullen wrote:
> 
> > I am having a design issue:
> > My server application has 2 processes per node, 1 listener and 1 worker.
> > Each listener monitors a specified port for incoming TCP connections with 
> > the goal that on receipt of a request it will distribute it over the 
> > workers in a SIMD fashion.
> > 
> > My problem is this: how can I get the workers to accept work from any of 
> > the listeners?
> > Making a separate communicator does not help as the sender is unknown.  
> > Other than making a serial 'clearing house' process I cant think of a way  
> > - Iprobe for Bcast would be useful.
> 
> I'm not quite sure I understand your question.
> 
> There currently is no probe for collectives, but MPI-3 has non-blocking 
> collectives which you could MPI_Test for.  There's a 3rd party library 
> implementation called libNBC (non-blocking collectives) that you could use 
> until such things become natively available.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] is there an equiv of iprove for bcast?

2011-05-03 Thread Randolph Pullen

From: Randolph Pullen <randolph_pul...@yahoo.com.au>
Subject: Re: Re: [OMPI users] is there an equiv of iprove for bcast?
To: us...@open-mpi.or
Received: Monday, 2 May, 2011, 12:53 PM

Non-blocking Bcasts or tests would do it.  I currently have the clearing-house
solution working but it is unsatisfying because of its serial node; as it
scales it will overload this node.

The problem rephrased: instead of n*2 processes, I am having to use n*2+1, with
the extra process serially receiving listener messages on behalf of the workers
before transmitting these messages to workers in its comm_group.

Is there a way to Bcast directly from each listener to the worker pool?
(Listeners must monitor their ports most of the time and can't participate in
global bcasts.)  Not knowing which listener is going to transmit prevents the
correct comm_group being used with Bcast calls.
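
For reference, a minimal sketch of that clearing-house step as described above;
BUFSIZE, REQ_TAG and worker_comm are illustrative names, with the
clearing-house assumed to be rank 0 of worker_comm:

    /* Accept a request from whichever listener sent one, then rebroadcast it
       to the workers over their own communicator. */
    char req[BUFSIZE];
    MPI_Status st;

    MPI_Recv(req, BUFSIZE, MPI_CHAR, MPI_ANY_SOURCE, REQ_TAG,
             MPI_COMM_WORLD, &st);
    MPI_Bcast(req, BUFSIZE, MPI_CHAR, 0, worker_comm);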
--- On Sat, 30/4/11, Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] is there an equiv of iprove for bcast?
To: randolph_pul...@yahoo.com.au, "Open MPI Users" <us...@open-mpi.org>
Received: Saturday, 30 April, 2011, 7:17 AM

On Apr 29, 2011, at 1:21 AM, Randolph Pullen wrote:

> I am having a design issue:
> My server application has 2 processes per node, 1 listener and 1 worker.
> Each listener monitors a specified port for incoming TCP connections with the 
> goal that on receipt of a request it will distribute it over the workers in a 
> SIMD fashion.
> 
> My problem is this: how can I get the workers to accept work from any of the 
> listeners?
> Making a separate communicator does not help as the sender is unknown.  Other 
> than making a serial 'clearing house' process I cant think
 of a way  - Iprobe for Bcast would be useful.

I'm not quite sure I understand your question.

There currently is no probe for collectives, but MPI-3 has non-blocking 
collectives which you could MPI_Test for.  There's a 3rd party library 
implementation called libNBC (non-blocking collectives) that you could use 
until such things become natively available.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] is there an equiv of iprove for bcast?

2011-04-29 Thread Randolph Pullen

I am having a design issue: my server application has 2 processes per node,
1 listener and 1 worker.

Each listener monitors a specified port for incoming TCP connections with the
goal that on receipt of a request it will distribute it over the workers in a
SIMD fashion.

My problem is this: how can I get the workers to accept work from any of the
listeners?  Making a separate communicator does not help as the sender is
unknown.  Other than making a serial 'clearing house' process I can't think of
a way; an Iprobe for Bcast would be useful.

Any ideas?

(OpenMPI 1.4.1 on FC10 with TCP)



[OMPI users] MPI_Comm_create prevents external socket connections

2011-04-28 Thread Randolph Pullen

I have a problem with MPI_Comm_create.

My server application has 2 processes per node, 1 listener and 1 worker.

Each listener monitors a specified port for incoming TCP connections with the
goal that on receipt of a request it will distribute it over the workers in a
SIMD fashion.

This all works fine unless MPI_Comm_create is called on the listener process.
Then, after the call, the incoming socket cannot be reached by the external
client processes.  The client reports "Couldn't open socket".  No other error
is apparent.  I have tried using a variety of different sockets but to no
effect.

I use OpenMPI 1.4.1 on FC10 with vanilla TCP.  The install is totally standard
with no changes.

Is this a known issue?

Any help appreciated.
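
For context, the usual shape of the call in question is sketched below;
nworkers and worker_ranks are illustrative placeholders.  Note that
MPI_Comm_create must be invoked by every process of the parent communicator,
including the listeners that will not belong to the new group.

    MPI_Group world_group, worker_group;
    MPI_Comm  worker_comm;

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, nworkers, worker_ranks, &worker_group);
    /* Collective over MPI_COMM_WORLD: listeners get MPI_COMM_NULL back. */
    MPI_Comm_create(MPI_COMM_WORLD, worker_group, &worker_comm);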



Re: [OMPI users] Running on crashing nodes

2010-09-27 Thread Randolph Pullen
I have successfully used a perl program to start mpirun and record its PID.
The monitor can then watch the output from MPI and terminate the mpirun
command with a series of kills or something if it is having trouble.

One method of doing this is to prefix all legal output from your MPI program
with a known short string; if the monitor does not see this string prefixed on
a line, it can terminate MPI, check available nodes and recast the job
accordingly.

Hope this helps,
Randolph
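
The same idea sketched in C rather than perl (the "OK:" prefix, the mpirun
arguments and the ring program are all illustrative assumptions): launch
mpirun, remember its PID, and kill it as soon as a line of output arrives
without the agreed prefix.

    #include <stdio.h>
    #include <string.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fd[2];
        if (pipe(fd) != 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                               /* child: run mpirun */
            dup2(fd[1], STDOUT_FILENO);
            close(fd[0]); close(fd[1]);
            execlp("mpirun", "mpirun", "-H", "A,B", "ring", (char *)NULL);
            _exit(127);                               /* exec failed */
        }

        close(fd[1]);
        FILE *out = fdopen(fd[0], "r");
        char line[4096];
        while (out && fgets(line, sizeof line, out)) {
            if (strncmp(line, "OK:", 3) != 0) {       /* unexpected output */
                fprintf(stderr, "monitor: terminating mpirun (pid %d)\n", (int)pid);
                kill(pid, SIGTERM);
                break;
            }
            fputs(line + 3, stdout);                  /* pass payload through */
        }
        waitpid(pid, NULL, 0);
        return 0;
    }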
--- On Fri, 24/9/10, Joshua Hursey  wrote:

From: Joshua Hursey 
Subject: Re: [OMPI users] Running on crashing nodes
To: "Open MPI Users" 
Received: Friday, 24 September, 2010, 10:18 PM

As one of the Open MPI developers actively working on the MPI layer 
stabilization/recover feature set, I don't think we can give you a specific 
timeframe for availability, especially availability in a stable release. Once 
the initial functionality is finished, we will open it up for user testing by 
making a public branch available. After addressing the concerns highlighted by 
public testing, we will attempt to work this feature into the mainline trunk 
and eventual release.

Unfortunately it is difficult to assess the time needed to go through these 
development stages. What I can tell you is that the work to this point on the 
MPI layer is looking promising, and that as soon as we feel that the code is 
ready we will make it available to the public for further testing.

-- Josh

On Sep 24, 2010, at 3:37 AM, Andrei Fokau wrote:

> Ralph, could you tell us when this functionality will be available in the 
> stable version? A rough estimate will be fine.
> 
> 
> On Fri, Sep 24, 2010 at 01:24, Ralph Castain  wrote:
> In a word, no. If a node crashes, OMPI will abort the currently-running job 
> if it had processes on that node. There is no current ability to "ride-thru" 
> such an event.
> 
> That said, there is work being done to support "ride-thru". Most of that is 
> in the current developer's code trunk, and more is coming, but I wouldn't 
> consider it production-quality just yet.
> 
> Specifically, the code that does what you specify below is done and works. It 
> is recovery of the MPI job itself (collectives, lost messages, etc.) that 
> remains to be completed.
> 
> 
> On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau  
> wrote:
> Dear users,
> 
> Our cluster has a number of nodes which have high probability to crash, so it 
> happens quite often that calculations stop due to one node getting down. May 
> be you know if it is possible to block the crashed nodes during run-time when 
> running with OpenMPI? I am asking about principal possibility to program such 
> behavior. Does OpenMPI allow such dynamic checking? The scheme I am curious 
> about is the following:
> 
> 1. A code starts its tasks via mpirun on several nodes
> 2. At some moment one node gets down
> 3. The code realizes that the node is down (the results are lost) and 
> excludes it from the list of nodes to run its tasks on
> 4. At later moment the user restarts the crashed node
> 5. The code notices that the node is up again, and puts it back to the list 
> of active nodes
> 
> 
> Regards,
> Andrei
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



  

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Randolph Pullen
Std TCP/IP stack.

It hung with an unknown but large(ish) quantity of data.  When I ran just one
Bcast it was fine, but lots of Bcasts in separate MPI_COMM_WORLDs hung.  All
the details are in some recent posts.

I could not figure it out and moved back to my PVM solution.


--- On Wed, 25/8/10, Rahul Nabar <rpna...@gmail.com> wrote:

From: Rahul Nabar <rpna...@gmail.com>
Subject: Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: 
debug ideas?
To: "Open MPI Users" <us...@open-mpi.org>
Received: Wednesday, 25 August, 2010, 3:38 AM

On Mon, Aug 23, 2010 at 8:39 PM, Randolph Pullen
<randolph_pul...@yahoo.com.au> wrote:
>
> I have had a similar load related problem with Bcast.

Thanks Randolph! That's interesting to know! What was the hardware you
were using? Does your bcast fail at the exact same point too?

>
> I don't know what caused it though.  With this one, what about the 
> possibility of a buffer overrun or network saturation?

How can I test for a buffer overrun?

For network saturation I guess I could use something like mrtg to
monitor the bandwidth used. On the other hand, all 32 servers are
connected to a single dedicated Nexus5000. The back-plane carries no
other traffic. Hence I am skeptical that just 41943040 saturated what
Cisco rates as a 10GigE fabric. But I might be wrong.

-- 
Rahul

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



  

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-23 Thread Randolph Pullen
I have had a similar load related problem with Bcast.  I don't know what caused 
it though.  With this one, what about the possibility of a buffer overrun or 
network saturation?


--- On Tue, 24/8/10, Richard Treumann <treum...@us.ibm.com> wrote:

From: Richard Treumann <treum...@us.ibm.com>
Subject: Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: 
debug ideas?
To: "Open MPI Users" <us...@open-mpi.org>
Received: Tuesday, 24 August, 2010, 9:39 AM



It is hard to imagine how a total data load of 41,943,040 bytes could be a
problem.  That is really not much data.  By the time the BCAST is done, each
task (except root) will have received a single half-meg message from one
sender.  That is not much.

IMB does shift the root, so some tasks may be in iteration 9 while some are
still in iteration 8 or 7, but a 1/2 meg message should use rendezvous
protocol, so no message will be injected into the network until the
destination task is ready to receive it.

Any task can be in only one MPI_Bcast at a time, so the total active data
cannot ever exceed the 41,943,040 bytes, no matter how fast the MPI_Bcast loop
tries to iterate.

(There are MPI_Bcast algorithms that chunk the data into smaller messages, but
even with those algorithms, the total concurrent load will not exceed
41,943,040 bytes.)





Dick Treumann  -  MPI Team
          

IBM Systems & Technology Group

Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601

Tele (845) 433-7846         Fax (845) 433-8363





users-boun...@open-mpi.org wrote on 08/23/2010 05:09:56 PM:

> Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
> Rahul Nabar
> to: Open MPI Users
> 08/23/2010 05:11 PM
> Sent by: users-boun...@open-mpi.org
> Please respond to Open MPI Users
> 
> On Sun, Aug 22, 2010 at 9:57 PM, Randolph Pullen <randolph_pul...@yahoo.com.au> wrote:
> 
> > Its a long shot but could it be related to the total data volume ?
> > ie  524288 * 80 = 41943040 bytes active in the cluster
> > 
> > Can you exceed this 41943040 data volume with a smaller message
> > repeated more often or a larger one less often?
> 
> Not so far, so your diagnosis could be right. The failures have been
> at the following data volumes:
> 
> 41.9E6
> 4.1E6
> 8.2E6
> 
> Unfortunately, I'm not sure I can change the repeat rate with the
> OFED/MPI tests. Can I do that? Didn't see a suitable flag.
> 
> In any case, assuming it is related to the total data volume what
> could be causing such a failure?
> 
> -- 
> Rahul
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
-Inline Attachment Follows-

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


  

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-22 Thread Randolph Pullen
Its a long shot but could it be related to the total data volume ?
ie  524288 * 80 = 41943040 bytes active in the cluster

Can you exceed this 41943040 data volume with a smaller message repeated more 
often or a larger one less often?


--- On Fri, 20/8/10, Rahul Nabar  wrote:

From: Rahul Nabar 
Subject: [OMPI users] IMB-MPI broadcast test stalls for large core counts: 
debug ideas?
To: "Open MPI Users" 
Received: Friday, 20 August, 2010, 12:03 PM

My Intel IMB-MPI tests stall, but only in very specific cases: larger
packet sizes + large core counts. Only happens for bcast, gather and
exchange tests. Only for the larger core counts (~256 cores). Other
tests like pingpong and sendrecev run fine even with larger core
counts.

e.g. This bcast test hangs consistently at the 524288 bytes packet
size when invoked on 256 cores. Same test runs fine on 128 cores.

NP=256;mpirun  -np $NP --host [ 32_HOSTS_8_core_each]  -mca btl
openib,sm,self    /mpitests/imb/src/IMB-MPI1 -npmin $NP  bcast

       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            1          130        26.94        27.59        27.25
            2          130        26.44        27.09        26.77
            4          130        75.98        81.07        76.75
            8          130        28.41        29.06        28.74
           16          130        28.70        29.39        29.03
           32          130        28.48        29.15        28.85
           64          130        30.10        30.86        30.48
          128          130        31.62        32.41        32.01
          256          130        31.08        31.72        31.42
          512          130        31.79        32.58        32.13
         1024          130        33.22        34.06        33.65
         2048          130        66.21        67.61        67.21
         4096          130        79.14        80.86        80.37
         8192          130       103.38       105.21       104.70
        16384          130       160.82       163.67       162.97
        32768          130       516.11       541.75       533.46
        65536          130      1044.09      1063.63      1052.88
       131072          130      1740.09      1750.12      1746.78
       262144          130      3587.23      3598.52      3594.52
       524288           80      4000.99      6669.65      5737.78
stalls for at least 5 minutes at this point when I killed the test.

I did more extensive testing for various combinations of test-type and
core counts (see below). I know exactly when the tests fail but I
still cannot see a trend from this data. Any points or further debug
ideas? I do have padb installed and have collected core dumps if that
is going to help? One example below:

http://dl.dropbox.com/u/118481/padb.log.new.new.txt

System Details:
Intel Nehalem 2.2 GHz
10Gig Ethernet Chelsio Cards and Cisco Nexus Switch. Using the OFED drivers.
CentOS 5.4
Open MPI: 1.4.1 / Open RTE: 1.4.1 / OPAL: 1.4.1


--
bcast:
    NP256    hangs
    NP128    OK

Note: "bcast" mostly hangs at:

       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
       524288           80      2682.61      4408.94      3880.68
--
sendrecv:
    NP256    OK
--
gather:
    NP256    hangs
    NP128    hangs
    NP64    hangs
    NP32    OK

Note: "gather" always hangs at the following line of the test:
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
[snip]
         4096         1000       525.80       527.69       526.79
--
exchange:
    NP256    hangs
    NP128    OK

Note: "exchange" always hangs at:

#bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
8192         1000       109.65       110.79       110.37       282.08
--

Note: I kept the --host string the same (all 32 servers) and just
changed the NPMIN. Just in case this matters for how the procs are
mapped out
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



  

Re: [OMPI users] MPI_Bcast issue

2010-08-12 Thread Randolph Pullen
ample, I was under the impression that PVM was solely being used as a 
launcher.  This is apparently not the case -- the original code is a PVM job 
that has been modified to eventually call MPI_INIT.  I don't know how much more 
I can say on this open list.

Hence, I'm throughly confused as to the model that is being used at this point. 
 I don't think I can offer any further help unless a small [non-PVM] example is 
provided to the community that can show the problem.

I also asked a bunch of questions in a prior post that would be helpful to have 
answered before going further.

Sorry!  :-(



On Aug 12, 2010, at 9:32 AM, Richard Treumann wrote:

> 
> You said  "separate MPI  applications doing 1 to > N broadcasts over PVM".  
> You do not mean you are using pvm_bcast though - right? 
> 
> If these N MPI applications are so independent that you could run one at a 
> time or run them on N different clusters and still get the result you want 
> (not the time to solution) then I cannot imagine how there could be cross 
> talk.   
> 
> I have been assuming that when you describe this as an NxN problem, you mean 
> there is some desired interaction among the N MPI worlds.   
> 
> If I have misunderstood and the N MPI worlds stared with N mpirun operations 
> under PVM are each semantically independent of the other (N-1) then I am 
> totally at a loss for an explanation. 
> 
>   
> Dick Treumann  -  MPI Team           
> IBM Systems & Technology Group
> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846         Fax (845) 433-8363
> 
> 
> users-boun...@open-mpi.org wrote on 08/11/2010 08:59:16 PM:
> 
> > Re: [OMPI users] MPI_Bcast issue
> > Randolph Pullen
> > to: Open MPI Users
> > 08/11/2010 09:01 PM
> > Sent by: users-boun...@open-mpi.org
> > Please respond to Open MPI Users
> > 
> > I (a single user) am running N separate MPI  applications doing 1 to
> > N broadcasts over PVM, each MPI application is started on each 
> > machine simultaneously by PVM - the reasons are back in the post history.
> > 
> > The problem is that they somehow collide - yes I know this should 
> > not happen, the question is why.
> > 
> > --- On Wed, 11/8/10, Richard Treumann <treum...@us.ibm.com> wrote: 
> > 
> > From: Richard Treumann <treum...@us.ibm.com>
> > Subject: Re: [OMPI users] MPI_Bcast issue
> > To: "Open MPI Users" <us...@open-mpi.org>
> > Received: Wednesday, 11 August, 2010, 11:34 PM
> 
> > 
> > Randolf 
> > 
> > I am confused about using multiple, concurrent mpirun operations.  
> > If there are M uses of mpirun and each starts N tasks (carried out 
> > under pvm or any other way) I would expect you to have M completely 
> > independent MPI jobs with N tasks (processes) each.  You could have 
> > some root in each of the M MPI jobs do an MPI_Bcast to the other 
> > N-1) in that job but there is no way in MPI (without using 
> > accept.connect) to get tasks of job 0 to give data to tasks of jobs 
> > 1-(m-1). 
> > 
> > With M uses of mpirun, you have M worlds that are forever isolated 
> > from the other M-1 worlds (again, unless you do accept/connect) 
> > 
> > In what sense are you treating this as an single MxN application?   
> > ( I use M & N to keep them distinct. I assume if M == N, we have your case) 
> > 
> > 
> > Dick Treumann  -  MPI Team           
> > IBM Systems & Technology Group
> > Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> > Tele (845) 433-7846         Fax (845) 433-8363 
> > 
> > -Inline Attachment Follows-
> 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users 
> > 
> > 
> >  ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



  

Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Randolph Pullen
Interesting point.

--- On Thu, 12/8/10, Ashley Pittman <ash...@pittman.co.uk> wrote:

From: Ashley Pittman <ash...@pittman.co.uk>
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" <us...@open-mpi.org>
Received: Thursday, 12 August, 2010, 12:22 AM


On 11 Aug 2010, at 05:10, Randolph Pullen wrote:

> Sure, but broadcasts are faster - less reliable apparently, but much faster 
> for large clusters.

Going off-topic here but I think it's worth saying:

If you have a dataset that requires collective communication then use the
function call that best matches what you are trying to do.  Far too many
people try and re-implement the collectives in their own code and it nearly
always goes badly; as someone who's spent many years implementing collectives
I've lost count of the number of times I've made someone's code go faster by
replacing 500+ lines of code with a single call to MPI_Gather().
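
For comparison, the single call being referred to looks something like this
(rank, nprocs, root, MAX_PROCS and the per-rank value are illustrative):

    /* Each rank contributes one double; only root receives the full array. */
    double local = my_local_result;          /* placeholder */
    double all[MAX_PROCS];                   /* sized for the job; placeholder */

    MPI_Gather(&local, 1, MPI_DOUBLE,
               all,    1, MPI_DOUBLE, root, MPI_COMM_WORLD);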

In the rare case that you find that some collectives are slower than they 
should be for your specific network and message size then the best thing to do 
is to work with the Open-MPI developers to tweak the thresholds so a better 
algorithm gets picked by the library.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



  

Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Randolph Pullen
I (a single user) am running N separate MPI  applications doing 1 to N 
broadcasts over PVM, each MPI application is started on each machine 
simultaneously by PVM - the reasons are back in the post history.

The problem is that they somehow collide - yes I know this should not happen, 
the question is why.

--- On Wed, 11/8/10, Richard Treumann  wrote:

From: Richard Treumann 
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" 
Received: Wednesday, 11 August, 2010, 11:34 PM



Randolf

I am confused about using multiple, concurrent mpirun operations.  If there
are M uses of mpirun and each starts N tasks (carried out under pvm or any
other way) I would expect you to have M completely independent MPI jobs with N
tasks (processes) each.  You could have some root in each of the M MPI jobs do
an MPI_Bcast to the other (N-1) in that job, but there is no way in MPI
(without using accept/connect) to get tasks of job 0 to give data to tasks of
jobs 1-(M-1).

With M uses of mpirun, you have M worlds that are forever isolated from the
other M-1 worlds (again, unless you do accept/connect).

In what sense are you treating this as a single MxN application?  (I use M & N
to keep them distinct.  I assume if M == N, we have your case.)

Dick Treumann  -  MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363


-Inline Attachment Follows-

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


  

Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Randolph Pullen
Sure, but broadcasts are faster; less reliable apparently, but much faster for
large clusters.  Jeff says that all OpenMPI calls are implemented with
point-to-point, B-tree style communications of log N transmissions, so I guess
that alltoall would be N log N.

--- On Wed, 11/8/10, Terry Frankcombe <te...@chem.gu.se> wrote:

From: Terry Frankcombe <te...@chem.gu.se>
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" <us...@open-mpi.org>
Received: Wednesday, 11 August, 2010, 1:57 PM

On Tue, 2010-08-10 at 19:09 -0700, Randolph Pullen wrote:
> Jeff thanks for the clarification,
> What I am trying to do is run N concurrent copies of a 1 to N data
> movement program to affect an N to N solution.

I'm no MPI guru, nor do I completely understand what you are doing, but
isn't this an allgather (or possibly an alltoall)?



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



  

Re: [OMPI users] MPI_Bcast issue

2010-08-10 Thread Randolph Pullen
n or two, multiple processes 
have the data and can start pipelining sends to other processes, etc.

- We don't have a multicast-based broadcast for a variety of reasons.  
MPI_Bcast needs to be reliable.  Multicast is not reliable.  There have been 
many good algorithms published over the years to make unreliable multicast be 
reliable, but no one has implemented those in a robust, production-quality 
manner for Open MPI.  Part of the reason for that is the non-uniform support of 
robust multicast implementations by network vendors, the lack of spanning 
multicast across multiple subnets, etc.  In practice, the log(n) algorithms 
that Open MPI uses have generally been "fast enough" such that there hasn't 
been a clamor for a multicast-based broadcast.  To be fair: every once in a 
(great) while, someone says they need it, but to be totally blunt, a) we 
haven't received enough requests to implement it ourselves, or b) no one has 
contributed a patch / plugin that implements it.  That sounds snobby, but I 
don't mean it that way: what I mean is that most of
 the features in Open MPI are customer-driven.  All I'm saying is that we have 
a lot of other higher-priority customer-requested features that we're working 
on.  Multicast-bcast support is not high enough in priority because not enough 
people have asked for it.

I hope that helps...



On Aug 9, 2010, at 10:43 PM, Randolph Pullen wrote:

> The install was completly vanilla - no extras a plain .configure command line 
> (on FC10 x8x_64 linux)
> 
> Are you saying that all broadcast calls are actually implemented as serial 
> point to point calls?
> 
> 
> --- On Tue, 10/8/10, Ralph Castain <r...@open-mpi.org> wrote:
> 
> From: Ralph Castain <r...@open-mpi.org>
> Subject: Re: [OMPI users] MPI_Bcast issue
> To: "Open MPI Users" <us...@open-mpi.org>
> Received: Tuesday, 10 August, 2010, 12:33 AM
> 
> No idea what is going on here. No MPI call is implemented as a multicast - it 
> all flows over the MPI pt-2-pt system via one of the various algorithms.
> 
> Best guess I can offer is that there is a race condition in your program that 
> you are tripping when other procs that share the node change the timing.
> 
> How did you configure OMPI when you built it?
> 
> 
> On Aug 8, 2010, at 11:02 PM, Randolph Pullen wrote:
> 
>> The only MPI calls I am using are these (grep-ed from my code):
>> 
>> MPI_Abort(MPI_COMM_WORLD, 1);
>> MPI_Barrier(MPI_COMM_WORLD);
>> MPI_Bcast([0].hdr, sizeof(BD_CHDR), MPI_CHAR, 0, MPI_COMM_WORLD);
>> MPI_Comm_rank(MPI_COMM_WORLD,);
>> MPI_Comm_size(MPI_COMM_WORLD,); 
>> MPI_Finalize();
>> MPI_Init(, );
>> MPI_Irecv(
>> MPI_Isend(
>> MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, );
>> MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
>> MPI_Test(, , );
>> MPI_Wait(, );  
>> 
>> The big wait happens on receipt of a bcast call that would otherwise work.
>> Its a bit mysterious really...
>> 
>> I presume that bcast is implemented with multicast calls but does it use any 
>> actual broadcast calls at all?  
>> I know I'm scraping the edges here looking for something but I just cant get 
>> my head around why it should fail where it has.
>> 
>> --- On Mon, 9/8/10, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> From: Ralph Castain <r...@open-mpi.org>
>> Subject: Re: [OMPI users] MPI_Bcast issue
>> To: "Open MPI Users" <us...@open-mpi.org>
>> Received: Monday, 9 August, 2010, 1:32 PM
>> 
>> Hi Randolph
>> 
>> Unless your code is doing a connect/accept between the copies, there is no 
>> way they can cross-communicate. As you note, mpirun instances are completely 
>> isolated from each other - no process in one instance can possibly receive 
>> information from a process in another instance because it lacks all 
>> knowledge of it -unless- they wireup into a greater communicator by 
>> performing connect/accept calls between them.
>> 
>> I suspect you are inadvertently doing just that - perhaps by doing 
>> connect/accept in a tree-like manner, not realizing that the end result is 
>> one giant communicator that now links together all the N servers.
>> 
>> Otherwise, there is no possible way an MPI_Bcast in one mpirun can collide 
>> or otherwise communicate with an MPI_Bcast between processes started by 
>> another mpirun.
>> 
>> 
>> 
>> On Aug 8, 2010, at 7:13 PM, Randolph Pullen wrote:
>> 
>>> Thanks,  although “An intercommunicator cannot be used for collective 
>>> communication.” i.e ,  bcast calls., I can see how the MPI_Group_xx calls 
>>> can

Re: [OMPI users] MPI_Bcast issue

2010-08-09 Thread Randolph Pullen
The install was completely vanilla - no extras, a plain ./configure command
line (on FC10 x86_64 linux).

Are you saying that all broadcast calls are actually implemented as serial 
point to point calls?


--- On Tue, 10/8/10, Ralph Castain <r...@open-mpi.org> wrote:

From: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" <us...@open-mpi.org>
Received: Tuesday, 10 August, 2010, 12:33 AM

No idea what is going on here. No MPI call is implemented as a multicast - it 
all flows over the MPI pt-2-pt system via one of the various algorithms.
Best guess I can offer is that there is a race condition in your program that 
you are tripping when other procs that share the node change the timing.
How did you configure OMPI when you built it?

On Aug 8, 2010, at 11:02 PM, Randolph Pullen wrote:
The only MPI calls I am using are these (grep-ed from my code):

MPI_Abort(MPI_COMM_WORLD, 1);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast([0].hdr, sizeof(BD_CHDR), MPI_CHAR, 0, MPI_COMM_WORLD);
MPI_Comm_rank(MPI_COMM_WORLD,);
MPI_Comm_size(MPI_COMM_WORLD,); 
MPI_Finalize();
MPI_Init(, );
MPI_Irecv(
MPI_Isend(
MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, );
MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
MPI_Test(, , );
MPI_Wait(, );  

The big wait happens on receipt of a bcast call that would otherwise work.
Its a bit mysterious really...

I presume that bcast is implemented with multicast calls but does it use any 
actual broadcast calls at all?  
I know I'm scraping the edges here looking for something but I just can't get
my head around why it should fail where it has.

--- On Mon, 9/8/10, Ralph Castain <r...@open-mpi.org> wrote:

From: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" <us...@open-mpi.org>
Received: Monday, 9 August, 2010, 1:32 PM

Hi Randolph
Unless your code is doing a connect/accept between the copies, there is no way 
they can cross-communicate. As you note, mpirun instances are completely 
isolated from each other - no process in one instance can possibly receive 
information from a process in another instance because it lacks all knowledge 
of it -unless- they wireup into a greater communicator by performing 
connect/accept calls between
 them.
I suspect you are inadvertently doing just that - perhaps by doing 
connect/accept in a tree-like manner, not realizing that the end result is one 
giant communicator that now links together all the N servers.
Otherwise, there is no possible way an MPI_Bcast in one mpirun can collide or 
otherwise communicate with an MPI_Bcast between processes started by another 
mpirun.


On Aug 8, 2010, at 7:13 PM, Randolph Pullen wrote:
Thanks,  although “An intercommunicator cannot be
 used for collective communication.” i.e ,  bcast calls., I can see how the 
MPI_Group_xx calls can be used to produce a useful group and then communicator; 
 - thanks again but this is really the side issue to my main question about 
MPI_Bcast.

I seem to have duplicate concurrent processes interfering with each other.  
This would appear to be a breach of the MPI safety dictum, ie MPI_COMM_WORD is 
supposed to only include the processes started by a single mpirun command and 
isolate these processes from other similar groups of processes safely.

So, it would appear to be a bug.  If so this has significant implications for 
environments such as mine, where it may often occur that the same program is 
run by different users simultaneously.  

It is really this issue
 that it concerning me, I can rewrite the code but if it can crash when 2 
copies run at the same time, I have a much bigger problem.

My suspicion is that a within the MPI_Bcast handshaking, a syncronising 
broadcast call may be colliding across the environments.  My only evidence is 
an otherwise working program waits on broadcast reception forever when two or 
more copies are run at [exactly] the same time.

Has anyone else seen similar behavior in concurrently running programs that 
perform lots of broadcasts perhaps?

Randolph


--- On Sun, 8/8/10, David Zhang <solarbik...@gmail.com> wrote:

From: David Zhang <solarbik...@gmail.com>
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" <us...@open-mpi.org>
Received: Sunday, 8 August, 2010, 12:34 PM

In particular, intercommunicators

On 8/7/10, Aurélien Bouteiller <boute...@eecs.utk.edu> wrote:
> You should consider reading about communicators in MPI.
>
> Aurelien
> --
> Aurelien Bouteiller, Ph.D.
> Innovative Computing Laboratory, The University of Tennessee.
>
> Envoyé de mon iPad
>
> Le Aug 7, 2010 à 1:05, Randolph Pullen <randolph_pul...@yahoo.com.au> a
> écrit :
>
>> I seem to be having a problem with
 MPI_Bcast.
>> My massive I/O intensive data movement program must broadcast from n to n
>> nodes.

Re: [OMPI users] MPI_Bcast issue

2010-08-09 Thread Randolph Pullen
The only MPI calls I am using are these (grep-ed from my code):

MPI_Abort(MPI_COMM_WORLD, 1);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast([0].hdr, sizeof(BD_CHDR), MPI_CHAR, 0, MPI_COMM_WORLD);
MPI_Comm_rank(MPI_COMM_WORLD,);
MPI_Comm_size(MPI_COMM_WORLD,); 
MPI_Finalize();
MPI_Init(, );
MPI_Irecv(
MPI_Isend(
MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, );
MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
MPI_Test(, , );
MPI_Wait(, );  
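
(The archive appears to have stripped the arguments that contained '&' from
the list above; with guessed variable names filled back in purely for
readability, the calls presumably looked roughly like this:)

    MPI_Bcast(&buff[0].hdr, sizeof(BD_CHDR), MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* 'rank' is a guessed name */
    MPI_Comm_size(MPI_COMM_WORLD, &size);        /* 'size' is a guessed name */
    MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &status);
    MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
    MPI_Test(&req, &flag, &status);              /* 'req', 'flag', 'status' guessed */
    MPI_Wait(&req, &status);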

The big wait happens on receipt of a bcast call that would otherwise work.
Its a bit mysterious really...

I presume that bcast is implemented with multicast calls but does it use any 
actual broadcast calls at all?  
I know I'm scraping the edges here looking for something but I just can't get my 
head around why it should fail where it has.

--- On Mon, 9/8/10, Ralph Castain <r...@open-mpi.org> wrote:

From: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" <us...@open-mpi.org>
Received: Monday, 9 August, 2010, 1:32 PM

Hi Randolph
Unless your code is doing a connect/accept between the copies, there is no way 
they can cross-communicate. As you note, mpirun instances are completely 
isolated from each other - no process in one instance can possibly receive 
information from a process in another instance because it lacks all knowledge 
of it -unless- they wireup into a greater communicator by performing 
connect/accept calls between them.
I suspect you are inadvertently doing just that - perhaps by doing 
connect/accept in a tree-like manner, not realizing that the end result is one 
giant communicator that now links together all the N servers.
Otherwise, there is no possible way an MPI_Bcast in one mpirun can collide or 
otherwise communicate with an MPI_Bcast between processes started by another 
mpirun.


On Aug 8, 2010, at 7:13 PM, Randolph Pullen wrote:
Thanks,  although “An intercommunicator cannot be used for collective 
communication.” i.e ,  bcast calls., I can see how the MPI_Group_xx calls can 
be used to produce a useful group and then communicator;  - thanks again but 
this is really the side issue to my main question about MPI_Bcast.

I seem to have duplicate concurrent processes interfering with each other.  
This would appear to be a breach of the MPI safety dictum, ie MPI_COMM_WORD is 
supposed to only include the processes started by a single mpirun command and 
isolate these processes from other similar groups of processes safely.

So, it would appear to be a bug.  If so this has significant implications for 
environments such as mine, where it may often occur that the same program is 
run by different users simultaneously.  

It is really this issue
 that it concerning me, I can rewrite the code but if it can crash when 2 
copies run at the same time, I have a much bigger problem.

My suspicion is that a within the MPI_Bcast handshaking, a syncronising 
broadcast call may be colliding across the environments.  My only evidence is 
an otherwise working program waits on broadcast reception forever when two or 
more copies are run at [exactly] the same time.

Has anyone else seen similar behavior in concurrently running programs that 
perform lots of broadcasts perhaps?

Randolph


--- On Sun, 8/8/10, David Zhang <solarbik...@gmail.com> wrote:

From: David Zhang <solarbik...@gmail.com>
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" <us...@open-mpi.org>
Received: Sunday, 8 August, 2010, 12:34 PM

In particular, intercommunicators

On 8/7/10, Aurélien Bouteiller <boute...@eecs.utk.edu> wrote:
> You should consider reading about communicators in MPI.
>
> Aurelien
> --
> Aurelien Bouteiller, Ph.D.
> Innovative Computing Laboratory, The University of Tennessee.
>
> Envoyé de mon iPad
>
> Le Aug 7, 2010 à 1:05, Randolph Pullen <randolph_pul...@yahoo.com.au> a
> écrit :
>
>> I seem to be having a problem with MPI_Bcast.
>> My massive I/O intensive data movement program must broadcast from n to n
>> nodes. My problem starts because I require 2 processes per node, a sender
>> and a receiver and I have implemented
 these using MPI processes rather
>> than tackle the complexities of threads on MPI.
>>
>> Consequently, broadcast and calls like alltoall are not completely
>> helpful.  The dataset is huge and each node must end up with a complete
>> copy built by the large number of contributing broadcasts from the sending
>> nodes.  Network efficiency and run time are paramount.
>>
>> As I don’t want to needlessly broadcast all this data to the sending nodes
>> and I have a perfectly good MPI program that distributes globally from a
>> single node (1 to N), I took the unusual decision to start N copies of
>> this program by spawning the MPI system from the PVM system in an effort
>> 

Re: [OMPI users] MPI_Bcast issue

2010-08-08 Thread Randolph Pullen
Thanks.  Although “an intercommunicator cannot be used for collective
communication”, i.e. bcast calls, I can see how the MPI_Group_xx calls can be
used to produce a useful group and then a communicator; thanks again, but this
is really a side issue to my main question about MPI_Bcast.

I seem to have duplicate concurrent processes interfering with each other.  
This would appear to be a breach of the MPI safety dictum, ie MPI_COMM_WORLD is 
supposed to only include the processes started by a single mpirun command and 
isolate these processes from other similar groups of processes safely.

So, it would appear to be a bug.  If so this has significant implications for 
environments such as mine, where it may often occur that the same program is 
run by different users simultaneously.  

It is really this issue that is concerning me; I can rewrite the code, but if
it can crash when 2 copies run at the same time, I have a much bigger problem.

My suspicion is that within the MPI_Bcast handshaking, a synchronising
broadcast call may be colliding across the environments.  My only evidence is 
an otherwise working program waits on broadcast reception forever when two or 
more copies are run at [exactly] the same time.

Has anyone else seen similar behavior in concurrently running programs that 
perform lots of broadcasts perhaps?

Randolph


--- On Sun, 8/8/10, David Zhang <solarbik...@gmail.com> wrote:

From: David Zhang <solarbik...@gmail.com>
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" <us...@open-mpi.org>
Received: Sunday, 8 August, 2010, 12:34 PM

In particular, intercommunicators

On 8/7/10, Aurélien Bouteiller <boute...@eecs.utk.edu> wrote:
> You should consider reading about communicators in MPI.
>
> Aurelien
> --
> Aurelien Bouteiller, Ph.D.
> Innovative Computing Laboratory, The University of Tennessee.
>
> Envoyé de mon iPad
>
> Le Aug 7, 2010 à 1:05, Randolph Pullen <randolph_pul...@yahoo.com.au> a
> écrit :
>
>> I seem to be having a problem with MPI_Bcast.
>> My massive I/O intensive data movement program must broadcast from n to n
>> nodes. My problem starts because I require 2 processes per node, a sender
>> and a receiver and I have implemented these using MPI processes rather
>> than tackle the complexities of threads on MPI.
>>
>> Consequently, broadcast and calls like alltoall are not completely
>> helpful.  The dataset is huge and each node must end up with a complete
>> copy built by the large number of contributing broadcasts from the sending
>> nodes.  Network efficiency and run time are paramount.
>>
>> As I don’t want to needlessly broadcast all this data to the sending nodes
>> and I have a perfectly good MPI program that distributes globally from a
>> single node (1 to N), I took the unusual decision to start N copies of
>> this program by spawning the MPI system from the PVM system in an effort
>> to get my N to N concurrent transfers.
>>
>> It seems that the broadcasts running on concurrent MPI environments
>> collide and cause all but the first process to hang waiting for their
>> broadcasts.  This theory seems to be confirmed by introducing a sleep of
>> n-1 seconds before the first MPI_Bcast  call on each node, which results
>> in the code working perfectly.  (total run time 55 seconds, 3 nodes,
>> standard TCP stack)
>>
>> My guess is that unlike PVM, OpenMPI implements broadcasts with broadcasts
>> rather than multicasts.  Can someone confirm this?  Is this a bug?
>>
>> Is there any multicast or N to N broadcast where sender processes can
>> avoid participating when they don’t need to?
>>
>> Thanks in advance
>> Randolph
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

-- 
Sent from my mobile device

David Zhang
University of California, San Diego

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



  

[OMPI users] MPI_Bcast issue

2010-08-07 Thread Randolph Pullen
I seem to be having a problem with MPI_Bcast.
My massive I/O intensive data movement program must broadcast from n to n 
nodes. My problem starts because I require 2 processes per node, a sender and a 
receiver and I have implemented these using MPI processes rather than tackle 
the complexities of threads on MPI.

Consequently, broadcast and calls like alltoall are not completely helpful.  
The dataset is huge and each node must end up with a complete copy built by the 
large number of contributing broadcasts from the sending nodes.  Network 
efficiency and run time are paramount.

As I don’t want to needlessly broadcast all this data to the sending nodes and 
I have a perfectly good MPI program that distributes globally from a single 
node (1 to N), I took the unusual decision to start N copies of this program by 
spawning the MPI system from the PVM system in an effort to get my N to N 
concurrent transfers.

It seems that the broadcasts running on concurrent MPI environments collide and 
cause all but the first process to hang waiting for their broadcasts.  This 
theory seems to be confirmed by introducing a sleep of n-1 seconds before the 
first MPI_Bcast  call on each node, which results in the code working 
perfectly.  (total run time 55 seconds, 3 nodes, standard TCP stack)
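
A minimal sketch of that sleep workaround as described, where instance_id (the
0-based number of the PVM-spawned copy) and buf/len are placeholder names:

    /* Stagger the first collective of each concurrently launched MPI world. */
    sleep(instance_id);                              /* 0, 1, ... n-1 seconds */
    MPI_Bcast(buf, len, MPI_CHAR, 0, MPI_COMM_WORLD);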

My guess is that unlike PVM, OpenMPI implements broadcasts with broadcasts 
rather than multicasts.  Can someone confirm this?  Is this a bug?

Is there any multicast or N to N broadcast where sender processes can avoid 
participating when they don’t need to?

Thanks in advance
Randolph




  

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen
It would be excellent if you could address this in 1.4.x or provide an
alternative, as it is an important attribute in fault recovery, particularly
with a large number of nodes where the MTBF is significantly lowered; ie we
can expect node failures from time to time.

A bit of background:
I am building a parallel SQL engine for large scale analytics and need to 
re-map failed nodes to a suitable backup data set, without losing the currently 
running query.
I am assuming this means re-starting mpirun with adjusted parameters but it may 
be possible (although probably very messy) to re-start failed processes on 
backup nodes without losing the current query.

What are your thoughts?

Regards,
Randolph

PS: excellent product, keep up the good work
--- On Thu, 24/6/10, Ralph Castain <r...@open-mpi.org> wrote:

From: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
To: "Open MPI Users" <us...@open-mpi.org>
Received: Thursday, 24 June, 2010, 12:00 AM

mpirun is not an MPI process, so it makes no difference what your processes are 
doing wrt MPI_Abort or any other MPI function call.
A quick glance thru the code shows that mpirun won't properly terminate under 
these conditions. It is waiting to hear that all daemons have terminated, and 
obviously is missing the one that was on the node that you powered off.
This obviously isn't a scenario we regularly test. The work Jeff referred to is 
intended to better handle such situations, but isn't ready for release yet. I'm 
not sure if I'll have time to go back to the 1.4 series and resolve this 
behavior, but I'll put it on my list of things to look at if/when time permits.

On Jun 23, 2010, at 6:53 AM, Randolph Pullen wrote:
ok,
Having confirmed that replacing MPI_Abort with exit() does not work and 
checking that under these conditions the only process left running appears to 
be mpirun,
I think I need to report a bug, ie:
Although the processes themselves can be stopped (by exit if nothing else)
mpirun hangs after a node is powered off and can never exit as it appears to 
wait indefinitely for the missing node to receive or send a signal.


--- On Wed, 23/6/10, Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
To: "Open MPI Users" <us...@open-mpi.org>
Received: Wednesday, 23 June, 2010, 9:10 PM

Open
 MPI's fault tolerance support is fairly rudimentary.  If you kill any process 
without calling MPI_Finalize, Open MPI will -- by default -- kill all the 
others in the job.

Various research work is ongoing to improve fault tolerance in Open MPI, but I 
don't know the state of it in terms of surviving a failed process.  I *think* 
that this kind of stuff is not ready for prime time, but I admit that this is 
not an area that I pay close attention to.



On Jun 23, 2010, at 3:08 AM, Randolph Pullen wrote:

> That is effectively what I have done by changing to the immediate 
> send/receive and waiting in a loop a finite number of times for the transfers 
> to complete - and calling MPI_Abort if they do not complete in a set time.
> It is not clear how I can kill mpirun in a manner consistent with the API.
> Are you implying I should call exit() rather than MPI_Abort?
> 
> --- On Wed, 23/6/10, David Zhang <solarbik...@gmail.com> wrote:
> 
> From: David Zhang <solarbik...@gmail.com>
> Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
> To: "Open MPI Users" <us...@open-mpi.org>
> Received: Wednesday, 23 June, 2010, 4:37 PM
> 
> Since you turned the machine off instead of just killing one of the 
> processes, no signals could be sent to other processes.  Perhaps you could 
> institute some sort of handshaking in your software that periodically checks 
> for the attendance of all machines, and times out if not all are present 
> within some allotted time?
> 
> On Tue, Jun 22, 2010 at 10:43 PM, Randolph Pullen 
> <randolph_pul...@yahoo.com.au> wrote:
> 
> I have an MPI program that aggregates data from multiple SQL systems.  It all 
> runs fine.  To test fault tolerance I switch one of the machines off while it 
> is running.  The result is always a hang, i.e., mpirun never completes.
>  
> To try and avoid this I have replaced the send and receive calls with 
> immediate calls (i.e., MPI_Isend, MPI_Irecv) to try and trap long-waiting 
> sends and receives, but it makes no difference.
> My requirement is that all processes complete or mpirun exits with an error, 
> no matter where they are in their execution when a failure occurs.  This 
> system must continue (i.e., fail over) if a machine dies, regroup, and 
> re-cast the job over the remaining nodes.
> 
> I am running FC10, gcc 4.3.2 and Open MPI 1.4.1
> 4 GB RAM, dual-core Intel, all x86_64
> 
> 

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen
ok,
Having confirmed that replacing MPI_Abort with exit() does not work, and that 
under these conditions the only process left running appears to be mpirun,
I think I need to report a bug, i.e.:
Although the processes themselves can be stopped (by exit if nothing else),
mpirun hangs after a node is powered off and can never exit, as it appears to 
wait indefinitely for the missing node to receive or send a signal.


--- On Wed, 23/6/10, Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
To: "Open MPI Users" <us...@open-mpi.org>
Received: Wednesday, 23 June, 2010, 9:10 PM

Open MPI's fault tolerance support is fairly rudimentary.  If you kill any 
process without calling MPI_Finalize, Open MPI will -- by default -- kill all 
the others in the job.

Various research work is ongoing to improve fault tolerance in Open MPI, but I 
don't know the state of it in terms of surviving a failed process.  I *think* 
that this kind of stuff is not ready for prime time, but I admit that this is 
not an area that I pay close attention to.



On Jun 23, 2010, at 3:08 AM, Randolph Pullen wrote:

> That is effectively what I have done by changing to the immediate 
> send/receive and waiting in a loop a finite number of times for the transfers 
> to complete - and calling MPI_Abort if they do not complete in a set time.
> It is not clear how I can kill mpirun in a manner consistent with the API.
> Are you implying I should call exit() rather than MPI_Abort?
> 
> --- On Wed, 23/6/10, David Zhang <solarbik...@gmail.com> wrote:
> 
> From: David Zhang <solarbik...@gmail.com>
> Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
> To: "Open MPI Users" <us...@open-mpi.org>
> Received: Wednesday, 23 June, 2010, 4:37 PM
> 
> Since you turned the machine off instead of just killing one of the 
> processes, no signals could be sent to other processes.  Perhaps you could 
> institute some sort of handshaking in your software that periodically checks 
> for the attendance of all machines, and times out if not all are present 
> within some allotted time?
> 
> On Tue, Jun 22, 2010 at 10:43 PM, Randolph Pullen 
> <randolph_pul...@yahoo.com.au> wrote:
> 
> I have an MPI program that aggregates data from multiple SQL systems.  It all 
> runs fine.  To test fault tolerance I switch one of the machines off while it 
> is running.  The result is always a hang, i.e., mpirun never completes.
>  
> To try and avoid this I have replaced the send and receive calls with 
> immediate calls (i.e., MPI_Isend, MPI_Irecv) to try and trap long-waiting 
> sends and receives, but it makes no difference.
> My requirement is that all processes complete or mpirun exits with an error, 
> no matter where they are in their execution when a failure occurs.  This 
> system must continue (i.e., fail over) if a machine dies, regroup, and 
> re-cast the job over the remaining nodes.
> 
> I am running FC10, gcc 4.3.2 and Open MPI 1.4.1
> 4 GB RAM, dual-core Intel, all x86_64
> 
> 
> ===
> The commands I have tried:
> mpirun -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"
> 
> mpirun -mca btl ^sm -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"
> 
> mpirun -mca orte_forward_job_control 1 -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"
> 
> 
> ===
> 
> The results:
> recv returned 0 with status 0
> waited  # 202 times - now status is  0 flag is -1976147192
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
> with errorcode 5.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --
> --
> mpirun has exited due to process rank 0 with PID 29141 on
> node bd01 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --
> 
> [*** wait a long time ***]
> [bd01:29136] [[55293,0],0]-[[55293,0],1] mca_oob_tcp_msg_recv: readv failed: 
> Connection reset by peer (104)

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen
That is effectively what I have done by changing to the immediate send/receive 
and waiting in a loop a finite number of times for the transfers to complete - 
and calling MPI_Abort if they do not complete in a set time.
It is not clear how I can kill mpirun in a manner consistent with the API.
Are you implying I should call exit() rather than MPI_Abort?
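For reference, a minimal sketch of that pattern (the poll count, delay and tag
are invented for illustration): post the non-blocking receive, test it a
bounded number of times, and call MPI_Abort if it never completes.

/* Sketch: bounded wait on a non-blocking receive, aborting on timeout. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_POLLS     200       /* give up after this many checks */
#define POLL_DELAY_US 100000    /* 0.1 s between checks */

void recv_or_abort(void *buf, int count, int src, int tag, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Status  status;
    int flag = 0;

    MPI_Irecv(buf, count, MPI_BYTE, src, tag, comm, &req);

    for (int i = 0; i < MAX_POLLS && !flag; i++) {
        MPI_Test(&req, &flag, &status);
        if (!flag)
            usleep(POLL_DELAY_US);
    }

    if (!flag) {
        /* The transfer never completed; give up on the whole job. */
        fprintf(stderr, "receive from rank %d timed out, aborting\n", src);
        MPI_Abort(comm, 5);
    }
}

MPI_Abort does take down the MPI processes here; the bug being reported is
that, with a powered-off node, mpirun itself still does not exit afterwards.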

--- On Wed, 23/6/10, David Zhang <solarbik...@gmail.com> wrote:

From: David Zhang <solarbik...@gmail.com>
Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
To: "Open MPI Users" <us...@open-mpi.org>
Received: Wednesday, 23 June, 2010, 4:37 PM

Since you turned the machine off instead of just killing one of the processes, 
no signals could be sent to other processes.  Perhaps you could institute some 
sort of handshaking in your software that periodically checks for the 
attendance of all machines, and times out if not all are present within some 
allotted time?
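One hedged reading of that suggestion, with made-up tag, polling interval and
deadline values: every rank sends a one-byte heartbeat to rank 0, which polls
the matching non-blocking receives and aborts the job if anyone misses the
deadline.

/* Sketch: periodic attendance check.  Rank 0 collects a 1-byte token from
 * every other rank and aborts if some rank fails to report in time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define HEARTBEAT_TAG    999
#define DEADLINE_SECONDS 10.0

void check_attendance(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    char token = 1;

    if (rank != 0) {
        MPI_Send(&token, 1, MPI_CHAR, 0, HEARTBEAT_TAG, comm);
        return;
    }

    int peers = size - 1;
    MPI_Request *reqs = malloc(peers * sizeof(MPI_Request));
    char        *seen = malloc(peers > 0 ? peers : 1);
    for (int i = 1; i < size; i++)
        MPI_Irecv(&seen[i - 1], 1, MPI_CHAR, i, HEARTBEAT_TAG, comm,
                  &reqs[i - 1]);

    double start = MPI_Wtime();
    int all_in = 0;
    while (!all_in && MPI_Wtime() - start < DEADLINE_SECONDS) {
        MPI_Testall(peers, reqs, &all_in, MPI_STATUSES_IGNORE);
        if (!all_in)
            usleep(100000);     /* 0.1 s between polls */
    }

    if (!all_in) {
        fprintf(stderr, "some ranks missed the %.0f s deadline, aborting\n",
                DEADLINE_SECONDS);
        MPI_Abort(comm, 1);
    }
    free(reqs);
    free(seen);
}

This detects the missing machine at the application level; as the quoted post
below shows, it still cannot make mpirun notice that the powered-off node's
daemon is gone.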



On Tue, Jun 22, 2010 at 10:43 PM, Randolph Pullen 
<randolph_pul...@yahoo.com.au> wrote:





I have an MPI program that aggregates data from multiple SQL systems.  It all 
runs fine.  To test fault tolerance I switch one of the machines off while it 
is running.  The result is always a hang, i.e., mpirun never completes.


 
To try and avoid this I have replaced the send and receive calls with immediate 
calls (i.e., MPI_Isend, MPI_Irecv) to try and trap long-waiting sends and 
receives, but it makes no difference.
My requirement is that all processes complete or mpirun exits with an error, no 
matter where they are in their execution when a failure occurs.  This system 
must continue (i.e., fail over) if a machine dies, regroup, and re-cast the job 
over the remaining nodes.



I am running FC10, gcc 4.3.2 and Open MPI 1.4.1
4 GB RAM, dual-core Intel, all x86_64


===
The commands I have tried:
mpirun -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"

mpirun -mca btl ^sm -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"

mpirun -mca orte_forward_job_control 1 -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"



===

The results:
recv returned 0 with status 0
waited  # 202 times - now status is  0 flag is -1976147192
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 5.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--
--
mpirun has exited due to process rank 0 with PID 29141 on
node bd01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--

[*** wait a long time ***]
[bd01:29136] [[55293,0],0]-[[55293,0],1] mca_oob_tcp_msg_recv: readv failed: 
Connection reset by peer (104)

^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate


===

As you can see, my trap can signal an abort, the TCP layer can time out, but 
mpirun just keeps on running...



Any help greatly appreciated..
Vlad




-- 
David Zhang
University of California, San Diego



  

[OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen

I have an MPI program that aggregates data from multiple SQL systems.  It all 
runs fine.  To test fault tolerance I switch one of the machines off while it 
is running.  The result is always a hang, i.e., mpirun never completes.

To try and avoid this I have replaced the send and receive calls with immediate 
calls (i.e., MPI_Isend, MPI_Irecv) to try and trap long-waiting sends and 
receives, but it makes no difference.
My requirement is that all processes complete or mpirun exits with an error, no 
matter where they are in their execution when a failure occurs.  This system 
must continue (i.e., fail over) if a machine dies, regroup, and re-cast the job 
over the remaining nodes.

I am running FC10, gcc 4.3.2 and Open MPI 1.4.1
4 GB RAM, dual-core Intel, all x86_64


===
The commands I have tried:
mpirun -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"

mpirun -mca btl ^sm -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"

mpirun -mca orte_forward_job_control 1 -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"



===

The results:
recv returned 0 with status 0
waited  # 202 times - now status is  0 flag is -1976147192
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 5.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--
--
mpirun has exited due to process rank 0 with PID 29141 on
node bd01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--

[*** wait a long time ***]
[bd01:29136] [[55293,0],0]-[[55293,0],1] mca_oob_tcp_msg_recv: readv failed: 
Connection reset by peer (104)

^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate


===

As you can see, my trap can signal an abort, the tcp layer can time out but 
mpirun just keeps on running...

Any help greatly appreciated..
Vlad





  
