Re: [OMPI users] Could following situations caused by RDMA mcaparameters?

2009-04-22 Thread Tsung Han Shie
Dear Jeff

Thanks for your help.
Unfortunately, after thoroughly examining the entire cluster, I found a bad
node with a busted hard drive. That is why the job hung.
Also, when the job is submitted with one bad node in the machinefile, neither
Open MPI nor my program gives any error message. That is why I couldn't
find the reason the job hung.
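
Since neither Open MPI nor the program reported the bad node, a pre-flight
check over the machinefile may help catch this kind of silent hang. The
following is only a minimal sketch, not something from this thread: the
function names are hypothetical, and it assumes passwordless ssh to every
node and that a node with a failed disk either refuses or stalls on a
trivial remote command.

```python
import subprocess


def parse_machinefile(text):
    """Return the unique host names in machinefile text, in file order."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        host = line.split()[0]  # ignore trailing fields such as slots=4
        if host not in hosts:
            hosts.append(host)
    return hosts


def ssh_probe(host, timeout=10):
    """Return True if a trivial remote command finishes within the timeout.

    Assumption: passwordless ssh works for healthy nodes.
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
             host, "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # stalled node or ssh not available
    return result.returncode == 0


def find_bad_hosts(hosts, probe=ssh_probe):
    """Return the hosts for which the probe fails."""
    return [host for host in hosts if not probe(host)]
```

Calling find_bad_hosts(parse_machinefile(open("machinefile").read()))
before mpirun would list the nodes to remove; "machinefile" here is a
placeholder path.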

Best regards

2009/4/22 Jeff Squyres 

> On Apr 21, 2009, at 11:01 AM, Tsung Han Shie wrote:
>
>  I tried to increase speed of a program with openmpi-1.1.3
>>
>
> Did you mean 1.1.3 or 1.3.1?

I mean 1.1.3.

>
>  by adding following 4 parameters into openmpi-mca-params.conf file.
>>
>> mpi_leave_pinned=1
>> btl_openib_eager_rdma_num=128
>> btl_openib_max_eager_rdma=128
>> btl_openib_eager_limit=1024
>>
>
> If you meant 1.3.1 above, please see the following message about an
> important bug in 1.3 and 1.3.1 with the use of mpi_leave_pinned:
>
> http://www.open-mpi.org/community/lists/announce/2009/03/0029.php
>
>
>  and then, I ran my program twice(124 processes on 31 nodes). one with
>> "mpi_leave_pinned=1", another with "mpi_leave_pinned=0".
>> All of them were stopped abnormally with "ctrl+c" and "killall -9
>> ".
>>
>
> Why -- did they hang?

I just ran my program for a few steps to check the speed, and then I killed
it.

>
>
>  After that, I couldn't start to run that program again.
>>
>
> What exactly was the error?

There were no messages at all.

>
>
>  I checked every node with "free -m" and found that a huge amount of
>> cached memory was in use on each node.
>> Could this situation be caused by those 4 parameters? Is there any way to
>> free it?
>>
>
>
> Probably not.
>
> Can you send all the information listed here:
>
> http://www.open-mpi.org/community/help/
>
> --
> Jeff Squyres
> Cisco Systems


Re: [OMPI users] Could following situations caused by RDMA mcaparameters?

2009-04-22 Thread Jeff Squyres

On Apr 21, 2009, at 11:01 AM, Tsung Han Shie wrote:


> I tried to increase speed of a program with openmpi-1.1.3


Did you mean 1.1.3 or 1.3.1?


> by adding following 4 parameters into openmpi-mca-params.conf file.
>
> mpi_leave_pinned=1
> btl_openib_eager_rdma_num=128
> btl_openib_max_eager_rdma=128
> btl_openib_eager_limit=1024
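
For comparisons like the one described in this thread, the same MCA
parameters can be passed per run with mpirun's --mca option instead of
editing openmpi-mca-params.conf. The sketch below only builds the two
command lines for mpi_leave_pinned=1 and 0; the helper name, the program
name "./my_program" and the machinefile path are placeholders, not taken
from the original post.

```python
def mpirun_command(program, nprocs, machinefile, mca_params):
    """Build an mpirun argument list with one --mca pair per parameter."""
    argv = ["mpirun", "-np", str(nprocs), "-machinefile", machinefile]
    for name, value in mca_params.items():
        argv += ["--mca", name, str(value)]
    argv.append(program)
    return argv


# The three openib settings from the post, shared by both runs.
base_params = {
    "btl_openib_eager_rdma_num": 128,
    "btl_openib_max_eager_rdma": 128,
    "btl_openib_eager_limit": 1024,
}

# Print one command line per mpi_leave_pinned setting.
for pinned in (1, 0):
    params = dict(base_params, mpi_leave_pinned=pinned)
    # "./my_program" and "machinefile" are hypothetical placeholders.
    print(" ".join(mpirun_command("./my_program", 124, "machinefile", params)))
```

Keeping the settings on the command line leaves the system-wide
configuration file untouched between the two runs.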


If you meant 1.3.1 above, please see the following message about an  
important bug in 1.3 and 1.3.1 with the use of mpi_leave_pinned:


http://www.open-mpi.org/community/lists/announce/2009/03/0029.php


> and then, I ran my program twice (124 processes on 31 nodes). one
> with "mpi_leave_pinned=1", another with "mpi_leave_pinned=0".
> All of them were stopped abnormally with "ctrl+c" and "killall -9
> ".


Why -- did they hang?


> After that, I couldn't start to run that program again.


What exactly was the error?

> I checked every node with "free -m" and found that a huge amount of
> cached memory was in use on each node.
> Could this situation be caused by those 4 parameters? Is there
> any way to free it?



Probably not.
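
As background on the "cached" figure from "free -m": on Linux that number
is page cache, which the kernel normally hands back when applications need
memory, so a large value by itself is usually harmless. A minimal sketch
for reading the same counters from /proc/meminfo follows; the function
names are hypothetical and it assumes a Linux node.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its numeric value (kB)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def memory_summary_mb(info):
    """Return total, free, buffers and cached memory in whole MB."""
    keys = {"total": "MemTotal", "free": "MemFree",
            "buffers": "Buffers", "cached": "Cached"}
    return {label: info.get(key, 0) // 1024 for label, key in keys.items()}


def read_local_summary(path="/proc/meminfo"):
    """Read the summary for the current node (Linux only)."""
    with open(path) as handle:
        return memory_summary_mb(parse_meminfo(handle.read()))
```

Running read_local_summary() on each node gives roughly the numbers that
"free -m" reports.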

Can you send all the information listed here:

http://www.open-mpi.org/community/help/

--
Jeff Squyres
Cisco Systems