Thanks - I'll fix that bug!

On Feb 28, 2012, at 6:48 AM, pascal.dev...@bull.net wrote:

> devel-boun...@open-mpi.org wrote on 28/02/2012 10:54:15:
> 
> > From: Ralph Castain <r...@open-mpi.org> 
> > To: Open MPI Developers <de...@open-mpi.org> 
> > Date: 28/02/2012 10:54 
> > Subject: Re: [OMPI devel] Problem with the openmpi-default-hostfile 
> > (on the trunk) 
> > Sent by: devel-boun...@open-mpi.org 
> > 
> > I'll see what I can do when next I have access to a slurm machine - 
> > hopefully in a day or two. 
> > 
> > Are you sure you are at the top of the trunk? I reviewed the code, 
> > and it clearly detects that the default hostfile is empty and ignores
> > it if so. Like I said, I'm not seeing this behavior, and neither are
> > the slurm machines on MTT. 
> 
> I ran with a version from Feb 12th (I had a synchronization problem). 
> Now with the latest patches (Feb 27th), I no longer have the problem by default. 
> 
> But ... it is no longer possible to change the MCA parameter 
> "orte_default_hostfile". 
> For example, in $HOME/.openmpi/mca-params.conf I put: 
>    orte_default_hostfile=none 
> Then, even with ompi_info, I get a segfault: 
> 
> [xxxx:17426] *** Process received signal *** 
> [xxxx:17426] Signal: Segmentation fault (11) 
> [xxxx:17426] Signal code: Address not mapped (1) 
> [xxxx:17426] Failing at address: (nil) 
> [xxxx:17426] [ 0] /lib64/libpthread.so.0() [0x327220f490] 
> [xxxx:17426] [ 1] /lib64/libc.so.6() [0x3271f24676] 
> [xxxx:17426] [ 2] /..../lib/libopen-rte.so.0(orte_register_params+0xaac) [0x7fa46989677a] 
> [xxxx:17426] [ 3] mpirun(orterun+0xeb) [0x4039ed] 
> [xxxx:17426] [ 4] mpirun(main+0x20) [0x4034b4] 
> [xxxx:17426] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3271e1ec9d] 
> [xxxx:17426] [ 6] mpirun() [0x4033d9] 
> [xxxx:17426] *** End of error message *** 
> 
> After a look at orte/runtime/orte_mca_params.c, I propose the following patch: 
> --- a/orte/runtime/orte_mca_params.c    Mon Feb 27 15:53:14 2012 +0000 
> +++ b/orte/runtime/orte_mca_params.c    Tue Feb 28 14:44:11 2012 +0100 
> @@ -301,7 +301,7 @@ 
>          asprintf(&orte_default_hostfile, "%s/etc/openmpi-default-hostfile", opal_install_dirs.prefix); 
>          /* flag that nothing was given */ 
>          orte_default_hostfile_given = false; 
> -    } else if (0 == strcmp(orte_default_hostfile, "none")) { 
> +    } else if (0 == strcmp(strval, "none")) { 
>          orte_default_hostfile = NULL; 
>          /* flag that it was given */ 
>          orte_default_hostfile_given = true; 
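> 
> For context: at this point in orte_register_params(), the MCA framework has just 
> returned the user's setting in strval, while the global orte_default_hostfile has 
> not been assigned yet, so the original strcmp(orte_default_hostfile, "none") 
> dereferences a NULL pointer (which matches the "Address not mapped" frame inside 
> libc.so.6 in the backtrace above). A minimal standalone sketch of the patched 
> logic (my own simplification, not the actual Open MPI code): 
> 
>     #include <stdio.h>
>     #include <string.h>
> 
>     static char *orte_default_hostfile = NULL;  /* still NULL when the param is checked */
> 
>     /* strval: value returned for the "orte_default_hostfile" MCA param */
>     static void register_hostfile_param(const char *strval)
>     {
>         if (NULL == strval) {
>             /* nothing given: fall back to the installed default hostfile */
>             orte_default_hostfile = strdup("<prefix>/etc/openmpi-default-hostfile");
>         } else if (0 == strcmp(strval, "none")) {  /* the patched comparison */
>             orte_default_hostfile = NULL;          /* user asked to ignore it */
>         } else {
>             orte_default_hostfile = strdup(strval);
>         }
>     }
> 
>     int main(void)
>     {
>         register_hostfile_param("none");  /* as set in mca-params.conf */
>         printf("hostfile: %s\n",
>                orte_default_hostfile ? orte_default_hostfile : "(none)");
>         return 0;
>     }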
> 
> 
> > 
> > On Feb 28, 2012, at 1:25 AM, pascal.dev...@bull.net wrote: 
> > 
> > 
> > devel-boun...@open-mpi.org wrote on 27/02/2012 15:53:06:
> > 
> > > From: Ralph Castain <r...@open-mpi.org> 
> > > To: Open MPI Developers <de...@open-mpi.org> 
> > > Date: 27/02/2012 16:17 
> > > Subject: Re: [OMPI devel] Problem with the openmpi-default-hostfile 
> > > (on the trunk) 
> > > Sent by: devel-boun...@open-mpi.org 
> > > 
> > > That's strange - I run on slurm frequently and never have this 
> > > problem, and my default hostfile is present and empty. Do you have 
> > > anything in your default mca param file that might be telling us to 
> > > use the hostfile? 
> > > 
> > > The only way I can find to get that behavior is if your default mca 
> > > param file includes the orte_default_hostfile value. In that case, 
> > > you are telling us to use the default hostfile, and so we will enforce 
> > > it. 
> > 
> > Hi Ralph, 
> > 
> > On my side, the default value of orte_default_hostfile is a pointer 
> > to etc/openmpi-default-hostfile. 
> > The command ompi_info -a gives: 
> > 
> > MCA orte: parameter "orte_default_hostfile" (current value: <..../etc/openmpi-default-hostfile>, data source: default value) 
> > Name of the default hostfile (relative or absolute path, "none" to ignore environmental or default MCA param setting) 
> > 
> > The following files are empty: 
> >  - .../etc/openmpi-mca-params.conf 
> >  - $HOME/.openmpi/mca-params.conf 
> > Another solution for me is to put "orte_default_hostfile=none" in 
> > one of these files. 
> > 
> > Pascal 
> > 
> > > 
> > > On Feb 27, 2012, at 5:57 AM, pascal.dev...@bull.net wrote: 
> > > 
> > > Hi all, 
> > > 
> > > I have problems with the openmpi-default-hostfile since the 
> > > following patch on the trunk: 
> > > 
> > > changeset:   19874:088fc6c84a9f 
> > > user:        rhc 
> > > date:        Wed Feb 01 17:40:44 2012 +0000 
> > > summary:     In accordance with prior releases, we are supposed to 
> > > default to looking at the openmpi-default-hostfile as a default 
> > > hostfile. Restore that behavior, but ignore the file if it is empty.
> > > Allow the user to ignore any MCA param setting pointing to a default
> > > hostfile by setting the param to "none" (via cmd line or whatever) -
> > > this allows them to override a setting in the system default MCA 
> > > param file. 
> > > 
> > > According to the summary of this patch, the openmpi-default-hostfile
> > > is ignored if it is empty. 
> > > But, when I run my jobs with slurm + mpirun, I get the following message: 
> > > --------------------------------------------------------------------------
> > > No nodes are available for this job, either due to a failure to 
> > > allocate nodes to the job, or allocated nodes being marked 
> > > as unavailable (e.g., down, rebooting, or a process attempting 
> > > to be relocated to another node when none are available). 
> > > --------------------------------------------------------------------------
> > > 
> > > I am able to run my job if: 
> > >  - either I put my node(s) in the file etc/openmpi-default-hostfile 
> > >  - or I use "-mca orte_default_hostfile none" on the mpirun command line 
> > >  - or I "export OMPI_MCA_orte_default_hostfile=none" in my environment 
> > > 
> > > It appears that an empty openmpi-default-hostfile is not ignored. 
> > > This patch seems to be incomplete. 
> > > 
> > > Or do I misunderstand something? 
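> > > 
> > > For what it's worth, "ignore the file if it is empty" presumably has to treat 
> > > a file containing only blank lines and "#" comments as empty, not just test 
> > > the file size; a minimal sketch of such a check (my own illustration, not the 
> > > actual Open MPI code): 
> > > 
> > >     #include <ctype.h>
> > >     #include <stdbool.h>
> > >     #include <stdio.h>
> > > 
> > >     /* true if the hostfile has no usable entry: missing, or containing
> > >      * only blank lines and '#' comments */
> > >     static bool hostfile_is_effectively_empty(const char *path)
> > >     {
> > >         FILE *fp = fopen(path, "r");
> > >         char line[1024];
> > > 
> > >         if (NULL == fp) {
> > >             return true;  /* treat a missing file like an empty one */
> > >         }
> > >         while (NULL != fgets(line, sizeof(line), fp)) {
> > >             char *p = line;
> > >             while (isspace((unsigned char)*p)) {
> > >                 p++;
> > >             }
> > >             if ('\0' != *p && '#' != *p) {
> > >                 fclose(fp);
> > >                 return false;  /* found a host entry */
> > >             }
> > >         }
> > >         fclose(fp);
> > >         return true;
> > >     }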
> > > 
> > > Pascal Devèze
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
