[ 
https://issues.apache.org/jira/browse/TRAFODION-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa reassigned TRAFODION-2692:
-------------------------------------------

    Assignee: Gonzalo E Correa

> Monitor fails to start when node names are not of the right form
> ----------------------------------------------------------------
>
>                 Key: TRAFODION-2692
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2692
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>    Affects Versions: 2.2-incubating
>         Environment: I tried this on an OpenStack cluster, using Hortonworks 
> HDP 5.4. This is the code with the new elasticity feature.
>            Reporter: Hans Zeller
>            Assignee: Gonzalo E Correa
>
> When trying to install Trafodion on a cluster, I ran into various situations 
> where the monitor failed to start, based on how host names were configured 
> and specified. I used three kinds of names:
> NN - a "nickname", a name I made up and put into /etc/hosts. Note: I made the 
> mistake of just adding the nickname, not the actual name in the /etc/hosts 
> line.
> LN - a local, non-qualified name that is also the OpenStack instance name and 
> the host name.
> FQDN - the fully qualified domain host name
> {noformat}
> Case  Name specified  hostname command  sqconfig  What happened
>       in HDP          returns           contains
> ----  --------------  ----------------  --------  --------------------------
>   1   nickname        local name        nickname  sqstart returned an error,
>                                                   saying that sqstart must
>                                                   be executed on one of the
>                                                   nodes of the cluster
>   2   local name      local name        FQDN?     monitor core dump (1)
>   3   local name      FQDN              FQDN      monitor abends (2)
>   4   FQDN            FQDN              FQDN      install succeeds
> {noformat}
> Note (1): The core dump happened because of the following code in file 
> core/sqf/monitor/linux/cluster.cxx:
> {noformat}
>     // Build the monitor's configured view of the cluster
>     if ( IsRealCluster )
>     {   // Map node name to physical node id
>         // (for virtual nodes physical node equals "rank" (previously set))
>         MyPNID = clusterConfig->GetPNid( Node_name );
>     }
>     Nodes->AddNodes( );
>     MyNode = Nodes->GetNode(MyPNID);
>     Nodes->SetupCluster( &Node, &LNode, &indexToPnid_ );
> {noformat}
> Node_name is a local name, but the nodes in the "Nodes" list are named by 
> their FQDN, so the node is not found and MyPNID is set to -1. As a result, 
> Nodes->GetNode(MyPNID) returns NULL, and the monitor later dereferences 
> MyNode, which is a NULL pointer.
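
A minimal sketch of how the failed lookup described above could be reported instead of leading to a NULL dereference. ClusterConfigSketch and ResolveMyPNid are hypothetical stand-ins written for illustration, not the monitor's actual classes; the only behavior taken from the report is that a name that is not found yields a physical node id of -1.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Hypothetical stand-in for the monitor's cluster configuration:
// maps a configured node name to its physical node id.
struct ClusterConfigSketch
{
    std::map<std::string, int> pnidByName;

    // Returns the physical node id, or -1 if the name is not configured.
    int GetPNid(const std::string &name) const
    {
        auto it = pnidByName.find(name);
        return it == pnidByName.end() ? -1 : it->second;
    }
};

// Hypothetical helper: looks up nodeName and stores the result in pnid.
// Returns false (after logging to stderr) when the name is unknown, so the
// caller can stop cleanly before using the id to fetch a node object.
bool ResolveMyPNid(const ClusterConfigSketch &config,
                   const std::string &nodeName,
                   int &pnid)
{
    pnid = config.GetPNid(nodeName);
    if (pnid < 0)
    {
        std::fprintf(stderr,
                     "Node name '%s' not found in cluster configuration\n",
                     nodeName.c_str());
        return false;
    }
    return true;
}
```

With a check of this kind, case 2 would end in a readable error message naming the offending host name rather than a core dump.
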
> Note (2): The third case is the same as the second, with two modifications: Use 
> the "hostname" command to set the host name to the FQDN, and edit /etc/hosts 
> to put the FQDN first in the line and the local name second (case 2 had it 
> the other way round). This time, we get past the problem described in case 2, 
> but we get an error from MPI, which is unable to communicate with all the 
> nodes (sorry, didn't record the exact error message).
> This is how the lines in /etc/hosts look (same layout for all nodes of 
> the cluster):
> {noformat}
> # case 1
> 1.2.3.4 nickname1
> 1.2.3.5 nickname2
> # case 2
> 1.2.3.4 mynode1 mynode1.novalocal
> 1.2.3.5 mynode2 mynode2.novalocal
> # cases 3 and 4
> 1.2.3.4 mynode1.novalocal mynode1
> 1.2.3.5 mynode2.novalocal mynode2
> {noformat}
> My suggestion would be to identify the places where we read node names that 
> are provided by the user and where such node names are compared, and to 
> provide a comparison method that tolerates equivalent forms of names.
> There are related JIRAs: TRAFODION-2480 and TRAFODION-2391.
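
One possible shape for the tolerant comparison suggested above, as a sketch only. NodeNamesEquivalent and its helpers are hypothetical names, not existing Trafodion functions. The sketch assumes that a short name and an FQDN are equivalent when the FQDN's first label equals the short name, and that comparison is case-insensitive. It does not cover the nickname case (case 1), which would need a resolver lookup to compare addresses rather than strings.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: lower-case copy of a host name
// (host names are compared case-insensitively).
static std::string ToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: the part of a host name before the first '.'.
// Returns the whole name if it contains no '.'.
static std::string ShortName(const std::string &name)
{
    return name.substr(0, name.find('.'));
}

// Hypothetical comparison: true if a and b are the same name, or if one is
// the short (unqualified) form of the other.
bool NodeNamesEquivalent(const std::string &a, const std::string &b)
{
    const std::string la = ToLower(a);
    const std::string lb = ToLower(b);
    if (la == lb)
        return true;

    const bool aQualified = la.find('.') != std::string::npos;
    const bool bQualified = lb.find('.') != std::string::npos;

    // Two different fully qualified names are never treated as equivalent,
    // even if their first labels match.
    if (aQualified && bQualified)
        return false;

    return ShortName(la) == ShortName(lb);
}
```

Under these assumptions, "mynode1" and "mynode1.novalocal" compare as equivalent, while "mynode1.a" and "mynode1.b" do not. Routing every node-name comparison through a single function like this would keep the equivalence rule in one place.
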



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
