Re: [OMPI users] users Digest, Vol 1911, Issue 4

2011-05-20 Thread Jason Mackay

"MPI can get through your firewall, right?"

As far as I can tell, the firewall is not the problem. I have tried it with 
firewalls disabled, with automatic firewall policies created from MPI's port 
requests, and with manual exception policies. 
 
Re: [OMPI users] v1.5.3-x64 does not work on Windows 7 workgroup

2011-05-20 Thread Damien

MPI can get through your firewall, right?

Damien

On 20/05/2011 12:53 PM, Jason Mackay wrote:
I have verified that disabling UAC does not fix the problem. xhlp.exe 
starts, threads spin up on both machines, CPU usage is at 80-90% but 
no progress is ever made.


Re: [OMPI users] v1.5.3-x64 does not work on Windows 7 workgroup

2011-05-20 Thread Jason Mackay

I have verified that disabling UAC does not fix the problem. xhlp.exe starts, 
threads spin up on both machines, CPU usage is at 80-90% but no progress is 
ever made.
 
From this state, Ctrl-break on the head node yields the following output:

[REMOTEMACHINE:02032] [[20816,1],0]-[[20816,0],0] mca_oob_tcp_msg_recv: readv 
failed: Unknown error (108)
[REMOTEMACHINE:05064] [[20816,1],1]-[[20816,0],0] mca_oob_tcp_msg_recv: readv 
failed: Unknown error (108)
[REMOTEMACHINE:05420] [[20816,1],2]-[[20816,0],0] mca_oob_tcp_msg_recv: readv 
failed: Unknown error (108)
[REMOTEMACHINE:03852] [[20816,1],3]-[[20816,0],0] mca_oob_tcp_msg_recv: readv 
failed: Unknown error (108)
[REMOTEMACHINE:05436] [[20816,1],4]-[[20816,0],0] mca_oob_tcp_msg_recv: readv 
failed: Unknown error (108)
[REMOTEMACHINE:04416] [[20816,1],5]-[[20816,0],0] mca_oob_tcp_msg_recv: readv 
failed: Unknown error (108)
[REMOTEMACHINE:02032] [[20816,1],0] routed:binomial: Connection to lifeline 
[[20816,0],0] lost
[REMOTEMACHINE:05064] [[20816,1],1] routed:binomial: Connection to lifeline 
[[20816,0],0] lost
[REMOTEMACHINE:05420] [[20816,1],2] routed:binomial: Connection to lifeline 
[[20816,0],0] lost
[REMOTEMACHINE:03852] [[20816,1],3] routed:binomial: Connection to lifeline 
[[20816,0],0] lost
[REMOTEMACHINE:05436] [[20816,1],4] routed:binomial: Connection to lifeline 
[[20816,0],0] lost
[REMOTEMACHINE:04416] [[20816,1],5] routed:binomial: Connection to lifeline 
[[20816,0],0] lost
 
 
 
Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)

2011-05-20 Thread Gus Correa

Hi Salvatore

Just in case ...
You say you have problems when you use "--mca btl openib,self".
Is this a typo in your email?
I guess this will disable the shared memory btl intra-node,
whereas your other choice "--mca btl_tcp_if_include ib0" will not.
Could this be the problem?

Here we use "--mca btl openib,self,sm",
to enable the shared memory btl intra-node as well,
and it works just fine on programs that do use collective calls.

My two cents,
Gus Correa

Re: [OMPI users] Openib with > 32 cores per node

2011-05-20 Thread Jeff Squyres
If you're using QLogic, you might want to try the native PSM Open MPI support 
rather than the verbs support.  QLogic cards only "sorta" support verbs in 
order to say that they're OFED-compliant; their native PSM interface is more 
performant than verbs for MPI.

Assuming you built OMPI with PSM support:

mpirun --mca pml cm --mca mtl psm 

(although probably just the pml/cm setting is sufficient -- the mtl/psm option 
will probably happen automatically)

See the OMPI README file for some more details about MTLs, PMLs, etc. (look for 
"psm"/i in the file)





-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Openib with > 32 cores per node

2011-05-20 Thread Robert Horton
Hi,

Thanks for getting back to me (and thanks to Jeff for the explanation
too).

On Thu, 2011-05-19 at 09:59 -0600, Samuel K. Gutierrez wrote:
> Hi,
> 
> On May 19, 2011, at 9:37 AM, Robert Horton wrote
> 
> > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> >> Hi,
> >> 
> >> Try the following QP parameters that only use shared receive queues.
> >> 
> >> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> >> 
> > 
> > Thanks for that. If I run the job over 2 x 48 cores it now works and the
> > performance seems reasonable (I need to do some more tuning) but when I
> > go up to 4 x 48 cores I'm getting the same problem:
> > 
> > [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> >  error creating qp errno says Cannot allocate memory
> > [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now 
> > abort)
> > 
> > Any thoughts?
> 
> How much memory does each node have?  Does this happen at startup?

Each node has 64GB of RAM. The error happens fairly soon after the job
starts.

> 
> Try adding:
> 
> -mca btl_openib_cpc_include rdmacm

Ah - that looks much better. I can now run hpcc over all 15x48 cores. I
need to look at the performance in a bit more detail but it seems to be
"reasonable" at least :)

One thing is puzzling me - when I compile OpenMPI myself it seems to
lack rdmacm support - however the one compiled by the OFED install
process does include it. I'm compiling with:

'--prefix=/share/apps/openmpi/1.4.3/gcc' '--with-sge' '--with-openib' 
'--enable-openib-rdmacm'

Any idea what might be going on there?

> I'm not sure if your version of OFED supports this feature, but maybe using 
> XRC may help.  I **think** other tweaks are needed to get this going, but I'm 
> not familiar with the details.

I'm using the QLogic (QLE7340) rather than Mellanox cards so that
doesn't seem to be an option to me (?). It would be interesting to know
how much difference it would make though...

Thanks again for your help and have a good weekend.

Rob

-- 
Robert Horton
System Administrator (Research Support) - School of Mathematical Sciences
Queen Mary, University of London
r.hor...@qmul.ac.uk  -  +44 (0) 20 7882 7345



Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)

2011-05-20 Thread Salvatore Podda
We are still struggling with these problems. Actually, the new version of the 
Intel compilers does not seem to be the real issue: we hit the same errors 
when using the `gcc' compilers as well. We did succeed in building an 
openmpi-1.2.8 rpm (with different compiler flavours) from the installation 
of the cluster section where everything seems to work well, and we are now 
running an extensive IMB benchmark campaign.

However, yes, this happens only when we use --mca btl openib,self; on the 
contrary, if we use --mca btl_tcp_if_include ib0, everything works well.
Yes, we can try the flag you suggest. I can check the FAQ and the 
open-mpi.org documentation, but could you kindly explain the meaning of 
this flag?

Thanks

Salvatore Podda

On 20/mag/11, at 03:37, Jeff Squyres wrote:


Sorry for the late reply.

Other users have seen something similar but we have never been able to 
reproduce it.  Is this only when using IB?  If you use 
"mpirun --mca btl_openib_cpc_include rdmacm", does the problem go away?



On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:

I've seen the same thing when I build openmpi 1.4.3 with Intel 12,  
but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1  
then the collectives hangs go away. I don't know what, if anything,  
the higher optimization buys you when compiling openmpi, so I'm not  
sure if that's an acceptable workaround or not.


My system is similar to yours - Intel X5570 with QDR Mellanox IB  
running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm  
using IMB 3.2.2 with a single iteration of Barrier to reproduce the  
hang, and it happens 100% of the time for me when I invoke it like  
this:


# salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier

The hang happens on the first Barrier (64 ranks), and each of the 
participating ranks has this backtrace:


__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0

PMPI_Barrier () from [instdir]/lib/libmpi.so.0
IMB_barrier ()
IMB_init_buffers_iter ()
main ()

The one non-participating rank has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0

PMPI_Barrier () from [instdir]/lib/libmpi.so.0
main ()

If I use more nodes I can get it to hang with 1ppn, so that seems  
to rule out the sm btl (or interactions with it) as a culprit at  
least.


I can't reproduce this with openmpi 1.5.3, interestingly.

-Marcus


On 05/10/2011 03:37 AM, Salvatore Podda wrote:

Dear all,

we succeeded in building several versions of openmpi, from 1.2.8 to 1.4.3, 
with Intel composer XE 2011 (aka 12.0).
However, we found a threshold in the number of cores (depending on the 
application: IMB, xhpl or user applications, and on the number of required 
cores) above which the application hangs (a sort of deadlock).
Building openmpi with 'gcc' and 'pgi' does not show the same limits.
Are there any known incompatibilities of openmpi with this version of the 
Intel compilers?

The characteristics of our computational infrastructure are:

Intel processors E7330, E5345, E5530 e E5620

CentOS 5.3, CentOS 5.5.

Intel composer XE 2011
gcc 4.1.2
pgi 10.2-1

Regards

Salvatore Podda

ENEA UTICT-HPC
Department for Computer Science Development and ICT
Facilities Laboratory for Science and High Performance Computing
C.R. Frascati
Via E. Fermi, 45
PoBox 65
00044 Frascati (Rome)
Italy

Tel: +39 06 9400 5342
Fax: +39 06 9400 5551
Fax: +39 06 9400 5735
E-mail: salvatore.po...@enea.it
Home Page: www.cresco.enea.it



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/





Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-20 Thread Peter Thompson
Thanks Ralph.  I've seen the messages generated in b...@open-mpi.org, so 
I figured something was up!  I was going to provide the unified diff, 
but then ran into another issue in testing where we immediately hit a 
seg fault, even with this fix.  It turns out that pre-pending /lib64 
(and maybe /usr/lib64) to LD_LIBRARY_PATH works around that one though, 
so I don't think it's directly related, but it threw me off, along with 
the beta testing we're doing...


Cheers,
PeterT


Ralph Castain wrote:

Okay, I finally had time to parse this and fix it. Thanks!

On May 16, 2011, at 1:02 PM, Peter Thompson wrote:

  
Hmmm?  We're not removing the putenv() calls.  Just adding a strdup() beforehand, and then calling putenv() with the string duplicated from env[j].  Of course, if the strdup fails, then we bail out. 
As for why it's suddenly a problem, I'm not quite as certain.   The problem we do show is a double free, so someone has already freed that memory used by putenv(), and I do know that while that used to be just flagged as an event before, now we seem to be unable to continue past it.   Not sure if that is our change or a library/system change. 
PeterT



Ralph Castain wrote:


On May 16, 2011, at 12:45 PM, Peter Thompson wrote:

 
  

Hi Ralph,

We've had a number of user complaints about this.  Since it seems on the face 
of it that it is a debugger issue, it may not have made its way back here.  Is 
your objection that the patch basically aborts if it gets a bad value?  I 
could understand that being a concern.  Of course, it aborts under TotalView 
now if we attempt to move forward without this patch.

   


No - my concern is that you appear to be removing the "putenv" calls. OMPI 
places some values into the local environment so the user can control behavior. Removing 
those causes problems.

What I need to know is why, after it has worked with TV for years, these 
putenv's are suddenly a problem. Is the problem occurring during shutdown? Or 
is this something that causes TV to break?


 
  

I've passed your comment back to the engineer, with a suspicion about the 
concerns about the abort, but if you have other objections, let me know.

Cheers,
PeterT


Ralph Castain wrote:
   


That would be a problem, I fear. We need to push those envars into the 
environment.

Is there some particular problem causing what you see? We have no other reports 
of this issue, and orterun has had that code forever.



Sent from my iPad

On May 11, 2011, at 2:05 PM, Peter Thompson  
wrote:

  
  

We've gotten a few reports of problems with memory debugging when using OpenMPI 
under TotalView.  Usually, TotalView will attach to the processes started after 
an MPI_Init.  However in the case where memory debugging is enabled, things 
seemed to run away or fail.   My analysis showed that we had a number of core 
files left over from the attempt, and all were mpirun (or orterun) cores.   It 
seemed to be a regression on our part, since testing seemed to indicate this 
worked okay before TotalView 8.9.0-0, so I filed an internal bug and passed it 
to engineering.   After giving our engineer a brief tutorial on how to build a 
debug version of OpenMPI, he found what appears to be a problem in the code for 
orterun.c.   He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 
1.4.4rc2 and 1.5.3, those being the versions he's tested with so far.He 
doesn't subscribe to this list that I know of, so I offered to pass this by the 
group.   Of course, I'm not sure if this is exactly the right place to submit 
patches, but I'm sure you'd tell me where to put it if I'm in the wrong here.   
It's a short patch, so I'll cut and paste it, and attach as well, since cut and 
paste can do weird things to formatting.

Credit goes to Ariel Burton for this patch.  Of course he used TotalView to 
find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo'   or 'totalview 
mpirun -a -np 4 ./foo'

Cheers,
PeterT


more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c    2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c    2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
  }
  if (NULL != env) {
  size1 = opal_argv_count(env);
  for (j = 0; j < size1; ++j) {
! putenv(env[j]);
  }
  }
  /* All done */
--- 1578,1600 
  }
  if (NULL != env) {
  size1 = opal_argv_count(env);
  for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here.  putenv does not copy
!the string passed to it, and instead stores only the pointer.
!env[j] may be freed later, in which case the pointer
!in environ will now be left dangling into a deallocated
!region.
!So we make a copy of the 
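
For reference, a minimal standalone sketch of the pattern the (truncated) patch
comment describes: putenv() keeps the pointer it is given rather than copying
the string, so the fix is to strdup() first and let the copy live as long as
the environment does. The helper name put_env_copy and the OMPI_EXAMPLE_VAR
variable below are made up for illustration; this is not the actual orterun.c
change.

#include <stdlib.h>   /* putenv */
#include <string.h>   /* strdup */

/* putenv() stores the pointer it is given, not a copy, so it must be handed
   memory that outlives the caller's env[] array.  The duplicate is
   intentionally never freed: the environment keeps using it. */
static int put_env_copy(const char *kv)
{
    char *copy = strdup(kv);
    if (NULL == copy) {
        return -1;            /* allocation failed */
    }
    return putenv(copy);
}

int main(void)
{
    return (0 == put_env_copy("OMPI_EXAMPLE_VAR=1")) ? 0 : 1;
}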

[OMPI users] Issue with mpicc --showme in windows

2011-05-20 Thread AMARNATH, Balachandar
Hello,

Here on my Windows machine, if I run mpicc --showme, I get erroneous output 
like the following:

**
C:\>C:\Users\BAAMARNA5617\Programs\mpi\OpenMPI_v1.5.3-win32\bin\mpicc.exe 
--showme
Cannot open configuration file C:/Users/hpcfan/Documents/OpenMPI/openmpi-1.5.3/i
nstalled-32/share/openmpi\mpif77.exe-wrapper-data.txt
Error parsing data file mpif77.exe: Not found
**


I installed Open MPI from 
http://www.open-mpi.org/software/ompi/v1.5/downloads/OpenMPI_v1.5.3-2_win32.exe 
and ended up with this error.  (I read in a forum that the 1.4 version of 
Open MPI does not support Fortran bindings, and hence obtained one of the 
more recent releases.)  Hope to fix this soon,

With thanks and regards
Balachandar







Re: [OMPI users] Trouble with MPI-IO

2011-05-20 Thread Jeff Squyres
On May 20, 2011, at 6:23 AM, Jeff Squyres wrote:

> Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?

Ok, if I convert ijlena and ijdisp to 1D arrays, I don't get the compile error 
(even though they're allocatable -- so allocate was a red herring, sorry).  
That's all that "use mpi" is complaining about -- that the function signatures 
didn't match.

"use mpi" is your friend -- even if you don't use F90 constructs much.  
Compile-time checking is a Very Good Thing (you were effectively "getting 
lucky" by passing in the 2D arrays, I think).

Attached is my final version.  And with this version, I see the hang when 
running it with the "T" parameter.

That being said, I'm not an expert on the MPI IO stuff -- your code *looks* 
right to me, but I could be missing something subtle in the interpretation of 
MPI_FILE_SET_VIEW.  I tried running your code with MPICH 1.3.2p1 and it also 
hung.

Rob (ROMIO guy) -- can you comment this code?  Is it correct?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


x.f90
Description: Binary data


Re: [OMPI users] MPI_ERR_TRUNCATE with MPI_Allreduce() error, but only sometimes...

2011-05-20 Thread Jeff Squyres
Sorry for the super-late reply.  :-\

Yes, ERR_TRUNCATE means that the receiver didn't have a large enough buffer.

Have you tried upgrading to a newer version of Open MPI?  1.4.3 is the current 
stable release (I have a very dim and not guaranteed to be correct recollection 
that we fixed something in the internals of collectives somewhere with regards 
to ERR_TRUNCATE...?).
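
For illustration, a minimal sketch (not taken from the original report) of how
MPI_ERR_TRUNCATE arises: with the error handler set to MPI_ERRORS_RETURN,
receiving a message into a buffer with a smaller count than was sent returns
an error of class MPI_ERR_TRUNCATE instead of aborting.

#include <mpi.h>
#include <stdio.h>

/* Run with at least two ranks, e.g. mpirun -np 2 ./a.out */
int main(int argc, char **argv)
{
    int rank, err, eclass, big[4] = {1, 2, 3, 4}, small[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (0 == rank) {
        MPI_Send(big, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (1 == rank) {
        /* The receive buffer only holds 2 ints, so the 4-int message is
           truncated and MPI reports MPI_ERR_TRUNCATE. */
        err = MPI_Recv(small, 2, MPI_INT, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
        MPI_Error_class(err, &eclass);
        if (MPI_ERR_TRUNCATE == eclass) {
            printf("got MPI_ERR_TRUNCATE, as expected\n");
        }
    }
    MPI_Finalize();
    return 0;
}

Inside a collective such as MPI_Allreduce, the same error class usually points
at mismatched counts or datatypes between the participating ranks.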


On Apr 25, 2011, at 4:44 PM, Wei Hao wrote:

> Hi:
> 
> I'm running openmpi 1.2.8. I'm working on a project where one part involves 
> communicating an integer, representing the number of data points I'm keeping 
> track of, to all the processors. The line is simple:
> 
>    MPI_Allreduce(&np, &geo_N, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
> 
> where np and geo_N are integers, np is the result of a local calculation, and 
> geo_N has been declared on all the processors. geo_N is nondecreasing. This 
> line works the first time I call it (geo_N goes from 0 to some other 
> integer), but if I call it later in the program, I get the following error:
> 
> 
>[woodhen-039:26189] *** An error occurred in MPI_Allreduce
>[woodhen-039:26189] *** on communicator MPI_COMM_WORLD
>[woodhen-039:26189] *** MPI_ERR_TRUNCATE: message truncated
>[woodhen-039:26189] *** MPI_ERRORS_ARE_FATAL (goodbye)
> 
> 
> As I understand it, MPI_ERR_TRUNCATE means that the output buffer is too 
> small, but I'm not sure where I've made a mistake. It's particularly 
> frustrating because it seems to work fine the first time. Does anyone have 
> any thoughts?
> 
> Thanks
> Wei


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI_Alltoallv function crashes when np > 100

2011-05-20 Thread Jeff Squyres
I missed this email in my INBOX, sorry.

Can you be more specific about what exact error is occurring?  You just say 
that the application crashes...?  Please send all the information listed here:

http://www.open-mpi.org/community/help/


On Apr 26, 2011, at 10:51 PM, 孟宪军 wrote:

> It seems that the SOMAXCONN constant used by the listen() system call causes 
> this problem. Can anybody help me resolve this question?
> 
> 2011/4/25 孟宪军 
> Dear all,
> 
> As I mentioned, when I ran an application with mpirun and the parameter 
> "np = 150" (or bigger), the application, which uses the MPI_Alltoallv 
> function, would crash. The problem would recur no matter how many nodes we used. 
> 
> The Open MPI version: 1.4.1 or 1.4.3
> The OS: Red Hat Linux, kernel 2.6.32
> 
> BTW, my nodes had enough memory to run the application, and the MPI_Alltoall 
> function worked well in my environment.
> Did anybody meet the same problem? Thanks.
> 
> 
> Best Regards
> 
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Trouble with MPI-IO

2011-05-20 Thread Jeff Squyres
On May 19, 2011, at 11:24 PM, Tom Rosmond wrote:

> What fortran compiler did you use?

gfortran.

> In the original script my Intel compile used the -132 option, 
> allowing up to that many columns per line.  

Gotcha.

>> x.f90:99.77:
>> 
>>call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
>>   1  
>> Error: There is no specific subroutine for the generic 'mpi_type_indexed' at 
>> (1)
> 
> Hmmm, very strange, since I am looking right at the MPI standard
> documents with that routine documented.  I too get this compile failure
> when I switch to 'use mpi'.  Could that be a problem with the Open MPI
> fortran libraries???

I think that that error is telling us that there's a compile-time mismatch -- 
that the signature of what you've passed doesn't match the signature of OMPI's 
MPI_Type_indexed subroutine.

>> I looked at our mpi F90 module and see the following:
>> 
>> interface MPI_Type_indexed
>> subroutine MPI_Type_indexed(count, array_of_blocklengths, 
>> array_of_displacements, oldtype, newtype, ierr)
>>  integer, intent(in) :: count
>>  integer, dimension(*), intent(in) :: array_of_blocklengths
>>  integer, dimension(*), intent(in) :: array_of_displacements
>>  integer, intent(in) :: oldtype
>>  integer, intent(out) :: newtype
>>  integer, intent(out) :: ierr
>> end subroutine MPI_Type_indexed
>> end interface

Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?
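
For comparison, a minimal C sketch (not the x.f90 test case from this thread)
of the same pattern with the block lengths and displacements as flat, rank-1
arrays; the file name testio.dat and the interleaved layout are invented for
illustration only.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i, data[4], blocklens[4], displs[4];
    MPI_Datatype filetype;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < 4; i++) {
        data[i] = rank * 4 + i;
        blocklens[i] = 1;               /* one element per block            */
        displs[i] = rank + i * size;    /* interleave the ranks in the file */
    }

    /* blocklens and displs are plain 1-D integer arrays, matching the
       MPI_Type_indexed binding. */
    MPI_Type_indexed(4, blocklens, displs, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "testio.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, data, 4, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}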

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Problem with MPI_Request, MPI_Isend/recv and MPI_Wait/Test

2011-05-20 Thread David Büttner

Hello,

thanks for the quick answer. I am sorry that I forgot to mention this: I did 
compile OpenMPI with MPI_THREAD_MULTIPLE support, and I test whether 
required == provided after the MPI_Init_thread call.



I do not see any mechanism for protecting the accesses to the requests to a 
single thread? What is the thread model you're using?

Again, I am sorry that this was not clear: in the pseudo code below I wanted 
to indicate the access protection I do via thread-id-dependent calls, 
if (0 == thread_id), and via trylock(...) (using pthread mutexes). In the 
code, all accesses concerning one MPI_Request (which are pthread-global 
pointers in my case) are protected and called in sequential order, i.e. 
MPI_Isend/Irecv returns before any thread is allowed to call the 
corresponding MPI_Test, and no one can call MPI_Test any more once a thread 
is allowed to call MPI_Wait.
I did this in the same manner before with other MPI implementations, but 
also on the same machine with the same (untouched) OpenMPI implementation, 
also using pthreads and MPI in combination, but I used


MPI_Request req;

instead of

MPI_Request* req;
(and later)
req = (MPI_Request*)malloc(sizeof(MPI_Request));


In my recent (problem) code, I also tried not using pointers, but got 
the same problem. Also, as I described in the first mail, I tried 
everything concerning the memory allocation of the MPI_Request objects.
I tried not calling malloc. This I guessed wouldn't work, but the 
OpenMPI documentation says this:


" Nonblocking calls allocate a communication request object and 
associate it with the request handle  the argument request). " 
[http://www.open-mpi.org/doc/v1.4/man3/MPI_Isend.3.php] and


" [...] if the communication object was created by a nonblocking send or 
receive, then it is deallocated and the request handle is set to 
MPI_REQUEST_NULL." 
[http://www.open-mpi.org/doc/v1.4/man3/MPI_Test.3.php] and (in slightly 
different words) [http://www.open-mpi.org/doc/v1.4/man3/MPI_Wait.3.php]


So I thought that it might do some kind of optimized memory stuff 
internally.


I also tried allocating req (for each used MPI_Request) once before the 
first use and deallocating it after the last use (which I thought was the 
way it was supposed to work), but that crashes also.


I tried replacing the pointers through global variables

MPI_Request req;

which didn't do the job...

The only thing that seems to work is what I mentioned below: Allocate 
every time I am going to need it in the MPI_Isend/recv, use it in 
MPI_Test/Wait and after that deallocate it by hand each time.
I don't think that this is supposed to be like this since I have to do a 
call to malloc and free so often (for multiple MPI_Request objects in 
each iteration) that it will most likely limit performance...


Anyway I still have the same problem and am still unclear on what kind 
of memory allocation I should be doing for the MPI_Requests. Is there 
anything else (besides MPI_THREAD_MULTIPLE support, thread access 
control, sequential order of MPI_Isend/recv, MPI_Test and MPI_Wait for 
one MPI_Request object) I need to take care of? If not, what could I do 
to find the source of my problem?


Thanks again for any kind of help!

Kind regards,
David
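
For reference, a minimal single-threaded sketch (not the poster's code) of the
request life cycle the man pages describe: MPI_Request can be an ordinary
stack variable, MPI_Isend/MPI_Irecv fill it in, and a successful MPI_Test or
MPI_Wait frees the internal object and resets the handle to MPI_REQUEST_NULL,
so no malloc/free of the handle itself is needed.

#include <mpi.h>
#include <stdio.h>

/* Run with at least two ranks, e.g. mpirun -np 2 ./a.out */
int main(int argc, char **argv)
{
    int rank, size, provided, sendval = 42, recvval = 0, done = 0;
    MPI_Request req = MPI_REQUEST_NULL;   /* ordinary variable, no malloc */

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && 0 == rank) {
        MPI_Isend(&sendval, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* req becomes MPI_REQUEST_NULL */
    } else if (size >= 2 && 1 == rank) {
        MPI_Irecv(&recvval, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        while (!done) {                      /* poll until completion */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
        printf("rank 1 received %d\n", recvval);
    }
    MPI_Finalize();
    return 0;
}

In a threaded setting the same handle discipline applies as long as all
accesses to a given request are serialized, as described above.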




> From an implementation perspective, your code is correct only if you 
initialize the MPI library with MPI_THREAD_MULTIPLE and if the library accepts. 
Otherwise, there is an assumption that the application is single threaded, or that 
the MPI behavior is implementation dependent. Please read the MPI standard 
regarding to MPI_Init_thread for more details.

Regards,
   george.

On May 19, 2011, at 02:34 , David Büttner wrote:


Hello,

I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I am using 
MPI_Isend and MPI_Irecv for communication and MPI_Test/MPI_Wait to check if it 
is done. I do this repeatedly in the outer loop of my code. The MPI_Test is 
used in the inner loop to check if some function can be called which depends on 
the received data.
The program regularly crashed (only when not using printf...) and after 
debugging it I figured out the following problem:

In MPI_Isend I have an invalid read of memory. I fixed the problem not by 
re-using a

MPI_Request req_s, req_r;

but by using

MPI_Request* req_s;
MPI_Request* req_r

and re-allocating them before the MPI_Isend/recv.

The documentation says that in MPI_Wait and MPI_Test (if successful) the 
request objects are deallocated and the handle is set to MPI_REQUEST_NULL.
It also says that MPI_Isend and MPI_Irecv allocate a communication request 
object and associate it with the request handle.

As I understand this, this either means I can use a pointer to MPI_Request 
which I don't have to initialize for this (it doesn't work but crashes), or 
that I can use an MPI_Request pointer which I have initialized with 
malloc(sizeof(MPI_Request)) (or pass the address of an MPI_Request req), 
which is set and unset in the functions. But this version crashes, too.

Re: [OMPI users] Trouble with MPI-IO

2011-05-20 Thread Tom Rosmond
Thanks for looking at my problem; it sounds like you did reproduce it.  I 
have added some comments below.

On Thu, 2011-05-19 at 22:30 -0400, Jeff Squyres wrote:
> Props for that testio script.  I think you win the award for "most easy to 
> reproduce test case."  :-)
> 
> I notice that some of the lines went over 72 columns, so I renamed the file 
> x.f90 and changed all the comments from "c" to "!" and joined the two &-split 
> lines.  The error about implicit type for lenr went away, but then when I 
> enabled better type checking by using "use mpi" instead of "include 
> 'mpif.h'", I got the following:

What fortran compiler did you use?

In the original script my Intel compile used the -132 option, 
allowing up to that many columns per line.  I still think in
F77 fortran much of the time, and use 'c' for comments out
of habit.  The change to '!' doesn't make any difference.


> x.f90:99.77:
> 
> call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
>1  
> Error: There is no specific subroutine for the generic 'mpi_type_indexed' at 
> (1)

Hmmm, very strange, since I am looking right at the MPI standard
documents with that routine documented.  I too get this compile failure
when I switch to 'use mpi'.  Could that be a problem with the Open MPI
fortran libraries???
> 
> I looked at our mpi F90 module and see the following:
> 
> interface MPI_Type_indexed
> subroutine MPI_Type_indexed(count, array_of_blocklengths, 
> array_of_displacements, oldtype, newtype, ierr)
>   integer, intent(in) :: count
>   integer, dimension(*), intent(in) :: array_of_blocklengths
>   integer, dimension(*), intent(in) :: array_of_displacements
>   integer, intent(in) :: oldtype
>   integer, intent(out) :: newtype
>   integer, intent(out) :: ierr
> end subroutine MPI_Type_indexed
> end interface
> 
> I don't quite grok the syntax of the "allocatable" type ijdisp, so that might 
> be the problem here...?

Just a standard F90 'allocatable' statement.  I've written thousands
just like it.
> 
> Regardless, I'm not entirely sure if the problem is the >72 character lines, 
> but then when that is gone, I'm not sure how the allocatable stuff fits in... 
>  (I'm not enough of a Fortran programmer to know)
> 
Anyone else out there who can comment?


T. Rosmond



> 
> On May 10, 2011, at 7:14 PM, Tom Rosmond wrote:
> 
> > I would appreciate someone with experience with MPI-IO look at the
> > simple fortran program gzipped and attached to this note.  It is
> > imbedded in a script so that all that is necessary to run it is do:
> > 'testio' from the command line.  The program generates a small 2-D input
> > array, sets up an MPI-IO environment, and write a 2-D output array
> > twice, with the only difference being the displacement arrays used to
> > construct the indexed datatype.  For the first write, simple
> > monotonically increasing displacements are used, for the second the
> > displacements are 'shuffled' in one dimension.  They are printed during
> > the run.
> > 
> > For the first case the file is written properly, but for the second the
> > program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted manually.
> > Although the program is compiled as an mpi program, I am running on a
> > single processor, which makes the problem more puzzling.
> > 
> > The program should be relatively self-explanatory, but if more
> > information is needed, please ask.  I am on an 8 core Xeon based Dell
> > workstation running Scientific Linux 5.5, Intel fortran 12.0.3, and
> > OpenMPI 1.5.3.  I have also attached output from 'ompi_info'.
> > 
> > T. Rosmond
> > 
> > 