Hans Zeller created TRAFODION-2692:
--------------------------------------

             Summary: Monitor fails to start when node names are not of the right form
                 Key: TRAFODION-2692
                 URL: https://issues.apache.org/jira/browse/TRAFODION-2692
             Project: Apache Trafodion
          Issue Type: Bug
          Components: foundation
    Affects Versions: 2.2-incubating
         Environment: I tried this on an OpenStack cluster, using Hortonworks HDP 5.4
            Reporter: Hans Zeller

When trying to install Trafodion on a cluster, I ran into several situations where the monitor failed to start, depending on how the host names were configured and specified. I used three kinds of names:

NN - a "nickname", a name I made up and put into /etc/hosts. Note: I made the mistake of adding only the nickname, not the actual name, to the /etc/hosts line.
LN - a local, non-qualified name that is also the OpenStack instance name and the host name.
FQDN - the fully qualified domain host name.

{noformat}
Case  Name specified  hostname command  sqconfig  What happened
      in HDP          returns           contains
----  --------------  ----------------  --------  --------------------------
1     nickname        local name        nickname  sqstart returned an error,
                                                  saying that sqstart must be
                                                  executed on one of the nodes
                                                  of the cluster
2     local name      local name        FQDN?     monitor core dump (1)
3     local name      FQDN              FQDN      monitor abends (2)
4     FQDN            FQDN              FQDN      install succeeds
{noformat}

Notes:

(1) The core dump happened because of the following code in file core/sqf/monitor/linux/cluster.cxx:

{noformat}
    // Build the monitor's configured view of the cluster
    if ( IsRealCluster )
    {
        // Map node name to physical node id
        // (for virtual nodes physical node equals "rank" (previously set))
        MyPNID = clusterConfig->GetPNid( Node_name );
    }
    Nodes->AddNodes( );
    MyNode = Nodes->GetNode(MyPNID);
    Nodes->SetupCluster( &Node, &LNode, &indexToPnid_ );
{noformat}

Node_name is a local name. The names of the nodes in the "Nodes" list are FQDNs, so the node is not found and MyPNID is set to -1. MyNode is then a NULL pointer, and dereferencing it causes the core dump.

(2) The third case is the same as the second, with two modifications: use the "hostname" command to set the host name to the FQDN, and edit /etc/hosts to put the FQDN first in the line and the local name second (case 2 had it the other way round).
This time, we get past the problem described in case 2, but we get an error from MPI, which is unable to communicate with all the nodes (sorry, I didn't record the exact error message).

This is how the lines in /etc/hosts look (same layout for all nodes of the cluster):

{noformat}
# case 1
1.2.3.4 nickname1
1.2.3.5 nickname2

# case 2
1.2.3.4 mynode1 mynode1.novalocal
1.2.3.5 mynode2 mynode2.novalocal

# cases 3 and 4
1.2.3.4 mynode1.novalocal mynode1
1.2.3.5 mynode2.novalocal mynode2
{noformat}

My suggestion would be to identify the places where we read node names that are provided by the user and the places where such node names are compared, and to provide a comparison method that tolerates equivalent forms of names.

There are related JIRAs: TRAFODION-2480 and TRAFODION-2391.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
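
To make the suggested tolerant comparison concrete, here is a minimal sketch of what such a method might look like. It is an illustration only, not code from the Trafodion source: the names hostNamesEquivalent, shortName, and toLower are hypothetical, and the rule it applies (ignore case; if either name is unqualified, compare only the first label) is an assumption about what "equivalent forms" should mean. It covers the local name versus FQDN mismatch from case 2 but not the nickname from case 1, which would need a resolver lookup, and it should not be given IP addresses.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not in the Trafodion source): lower-case copy of s.
std::string toLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: the part of a host name before the first '.',
// or the whole name if it contains no dot.
std::string shortName(const std::string &name)
{
    return name.substr(0, name.find('.'));
}

// Hypothetical comparison that treats "mynode1" and "mynode1.novalocal"
// as the same node. Assumed rule:
//   - comparison is case-insensitive
//   - two qualified names must match in full
//   - if at least one name is unqualified, only the first labels are compared
// Empty names never match. IP addresses are not handled.
bool hostNamesEquivalent(const std::string &a, const std::string &b)
{
    const std::string la = toLower(a);
    const std::string lb = toLower(b);
    if (la.empty() || lb.empty())
        return false;

    const bool aQualified = la.find('.') != std::string::npos;
    const bool bQualified = lb.find('.') != std::string::npos;
    if (aQualified && bQualified)
        return la == lb;

    return shortName(la) == shortName(lb);
}

// Usage with the names from the /etc/hosts examples in this report.
void checkExamples()
{
    assert(hostNamesEquivalent("mynode1", "mynode1.novalocal"));            // case 2 mismatch
    assert(!hostNamesEquivalent("mynode1.novalocal", "mynode2.novalocal")); // different nodes
    assert(!hostNamesEquivalent("nickname1", "mynode1.novalocal"));         // case 1 still fails
}
```

A lookup such as the GetPNid call quoted in note (1) could use a comparison of this kind instead of an exact string match. Independently of that, checking for MyPNID == -1 before using MyNode would turn the core dump into a reportable error.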