[ https://issues.apache.org/jira/browse/TRAFODION-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108178#comment-16108178 ]
ASF GitHub Bot commented on TRAFODION-2692:
-------------------------------------------

Github user DaveBirdsall commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/1192#discussion_r130491986

    --- Diff: core/sqf/monitor/linux/monitor.cxx ---
    @@ -1089,6 +1090,22 @@ int main (int argc, char *argv[])
                     tmpptr++;
                 }

    +            // Remove the domain portion of the name if any
    --- End diff --

    This is the third copy I've seen of code to remove the domain portion. Maybe they should be refactored into a separate function?


> Monitor fails to start when node names are not of the right form
> ----------------------------------------------------------------
>
>                 Key: TRAFODION-2692
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2692
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>    Affects Versions: 2.2-incubating
>         Environment: I tried this on an OpenStack cluster, using Hortonworks
> HDP 5.4. This is the code with the new elasticity feature.
>            Reporter: Hans Zeller
>            Assignee: Gonzalo E Correa
>             Fix For: 2.2-incubating
>
>
> When trying to install Trafodion on a cluster, I ran into various situations
> where the monitor failed to start, based on how host names were configured
> and specified. I used three kinds of names:
> NN - a "nickname", a name I made up and put into /etc/hosts. Note: I made the
> mistake of just adding the nickname, not the actual name in the /etc/hosts
> line.
> LN - a local, non-qualified name that is also the OpenStack instance name and
> the host name.
> FQDN - the fully qualified domain host name
> {noformat}
> Case  Name specified  hostname command  sqconfig  What happened
>       in HDP          returns           contains
> ----  --------------  ----------------  --------  --------------------------
>  1    nickname        local name        nickname  sqstart returned an error,
>                                                   saying that sqstart must
>                                                   be executed on one of the
>                                                   nodes of the cluster
>  2    local name      local name        FQDN?     monitor core dump (1)
>  3    local name      FQDN              FQDN      monitor abends (2)
>  4    FQDN            FQDN              FQDN      install succeeds
> {noformat}
> Notes: (1) The core dump happened because of the following code in file
> core/sqf/monitor/linux/cluster.cxx:
> {noformat}
>     // Build the monitor's configured view of the cluster
>     if ( IsRealCluster )
>     {   // Map node name to physical node id
>         // (for virtual nodes physical node equals "rank" (previously set))
>         MyPNID = clusterConfig->GetPNid( Node_name );
>     }
>     Nodes->AddNodes( );
>     MyNode = Nodes->GetNode(MyPNID);
>     Nodes->SetupCluster( &Node, &LNode, &indexToPnid_ );
> {noformat}
> Node_name is a local name. The name of the nodes in the "Nodes" list is the
> FQDN, so we don't find the node and MyPNID is set to -1. This leads to
> dereferencing MyNode, which is a NULL pointer.
> (2) The third case is the same as the second, with two modifications: Use
> the "hostname" command to set the host name to the FQDN, and edit /etc/hosts
> to put the FQDN first in the line and the local name second (case 2 had it
> the other way round). This time, we get past the problem described in case 2,
> but we get an error from MPI, which is unable to communicate with all the
> nodes (sorry, didn't record the exact error message).
> This is how the lines in /etc/hosts look (same layout for all nodes of
> the cluster):
> {noformat}
> # case 1
> 1.2.3.4 nickname1
> 1.2.3.5 nickname2
> # case 2
> 1.2.3.4 mynode1 mynode1.novalocal
> 1.2.3.5 mynode2 mynode2.novalocal
> # cases 3 and 4
> 1.2.3.4 mynode1.novalocal mynode1
> 1.2.3.5 mynode2.novalocal mynode2
> {noformat}
> My suggestion would be to identify the places where we read node names that
> are provided by the user and where such node names are compared, and to
> provide a comparison method that tolerates equivalent forms of names.
> There are related JIRAs: TRAFODION-2480 and TRAFODION-2391.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
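
The two suggestions in this thread (factoring the repeated "remove the domain portion" code into one function, and comparing node names in a way that tolerates equivalent forms) could be combined roughly as in the sketch below. This is only an illustration: `StripDomain` and `SameNode` are hypothetical names, not functions from the Trafodion source or from pull request 1192, and the sketch assumes that two names denote the same node when their short (pre-domain) parts match, ignoring case.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper, not part of the Trafodion code base.
// Returns the name without its domain portion, e.g.
// "mynode1.novalocal" -> "mynode1". A name without a '.' is returned as is.
// Caveat: a dotted IP address would also be truncated at its first '.'.
std::string StripDomain(const std::string &name)
{
    // find() returns npos when there is no '.', and substr(0, npos)
    // then yields the whole string.
    return name.substr(0, name.find('.'));
}

// Hypothetical helper, not part of the Trafodion code base.
// Treats two node names as equal when their short names match,
// ignoring case (DNS host names are case-insensitive).
// Assumption: nodes in one cluster do not share a short name
// across different domains.
bool SameNode(const std::string &a, const std::string &b)
{
    const std::string sa = StripDomain(a);
    const std::string sb = StripDomain(b);
    return sa.size() == sb.size() &&
           std::equal(sa.begin(), sa.end(), sb.begin(),
                      [](unsigned char x, unsigned char y) {
                          return std::tolower(x) == std::tolower(y);
                      });
}
```

With a comparison like this, a lookup of the local name `mynode1` against a configured `mynode1.novalocal` (case 2 in the table) would match. It would not help with case 1, where the nickname shares nothing with the real host name; that would need a resolver lookup or an explicit alias list. Independently of the name matching, the code that calls `Nodes->GetNode(MyPNID)` would presumably still want to check for a failed lookup before using the result.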