Re: [OMPI devel] OMPI-MIGRATE error

2011-01-28 Thread Hugo Meyer
Thank you, Joshua.

I will try the procedure with these modifications and I will let you know how
it goes.

Best Regards.

Hugo Meyer

2011/1/27 Joshua Hursey 

> I believe that this is now fixed on the trunk. All the details are in the
> commit message:
>  https://svn.open-mpi.org/trac/ompi/changeset/24317
>
> In my testing yesterday, I did not test the scenario where the node with
> mpirun also contains processes (the test cluster I was using does not by
> default run this way). So I was able to reproduce by running on a single
> node. A couple of bugs emerged that are fixed in the commit.
> The two bugs that were hurting you were the TCP socket cleanup (which caused
> the looping of the automatic recovery), and the incorrect accounting of
> local process termination (which caused the modex errors).
>
> Let me know if that fixes the problems that you were seeing.
>
> Thanks for the bug report and your patience while I pursued a fix.
>
> -- Josh
>
> On Jan 27, 2011, at 11:28 AM, Hugo Meyer wrote:
>
> > Hi Josh.
> >
> > Thanks for your reply. In the next lines I'll tell you what I'm getting
> now from the executions.
> > When I run without taking a checkpoint I get this output, and the
> processes don't finish:
> >
> > [hmeyer@clus9 whoami]$
> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am
> ft-enable-cr-recovery ./whoami 10 10
> > Antes de MPI_Init
> > Antes de MPI_Init
> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file
> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file
> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > Soy el número 1 (1)
> > Terminando, una instrucción antes del finalize
> > Soy el número 0 (1)
> > Terminando, una instrucción antes del finalize
> >
> --
> > Error: The process below has failed. There is no checkpoint available for
> >this job, so we are terminating the application since automatic
> >recovery cannot occur.
> > Internal Name: [[41167,1],0]
> > MCW Rank: 0
> >
> >
> --
> > [clus9:04985] 1 more process has sent help message
> help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
> > [clus9:04985] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> >
> > If I take a checkpoint of the mpirun process from another terminal
> during the execution, I get this output:
> >
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file
> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file
> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past
> end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c
> at line 350
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past
> end of buffer in file
> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() =
> -26
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past
> end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c
> at line 350
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past
> end of buffer in file
> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() =
> -26
> >
> --
> > Notice: The job has been successfully recovered from the
> > last checkpoint.
> >
> --
> > Soy el número 1 (1)
> > Terminando, una instrucción antes del finalize
> > Soy el número 0 (1)
> > Terminando, una instrucción antes del finalize
> > [clus9:06105] 1 more process has sent help message
> help-orte-errmgr-hnp.txt / autor_recovering_job
> > [clus9:06105] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file
> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file
> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past
> end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c
> at line 350
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past
> end of buffer in file
> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() =
> -26

Re: [OMPI devel] OFED question

2011-01-28 Thread Shamis, Pavel
The command line is actually not so magical, but unfortunately we have never
had time to complete the btl_openib_receive_queues documentation. You may find
some initial documentation in the following ticket:
https://svn.open-mpi.org/trac/ompi/ticket/1260
It may be a good idea to define a user-friendly flag to switch to XRC or even
SRQ.
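
Until such a flag exists, one workaround is to put the value in an MCA
parameter file so the long flag does not have to be retyped. This is an
untested sketch; the values are copied from the command line quoted below, and
the per-field meaning is my reading of the initial documentation in ticket
1260, not an authoritative reference:

```ini
# ~/.openmpi/mca-params.conf -- picked up automatically by mpirun.
# Each colon-separated entry describes one XRC receive queue:
#   X,<buffer_size>,<buffer_count>,<low_watermark>,<credit_window>
btl_openib_receive_queues = X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32
```

With this in place, a plain `mpirun -np N ./app` should pick up the XRC
queues without any extra --mca arguments.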

Regards,
Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Jan 27, 2011, at 8:38 PM, Paul H. Hargrove wrote:

> 
> RFE:  Could OMPI implement a short-hand for Pasha's following magical 
> incantation?
> 
> On 1/27/2011 5:34 PM, Shamis, Pavel wrote:
>> --mca btl_openib_receive_queues 
>> X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> HPC Research Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel