[ 
https://issues.apache.org/jira/browse/TRAFODION-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153850#comment-15153850
 ] 

liu ming commented on TRAFODION-1834:
-------------------------------------

The issue is slightly different. The FQDN is not a problem.

When the binder builds the initial node_map for a given table, it calls 
createNodeMapForHbase(), which truncates the FQDN to the short hostname. The 
NAClusterInfo class, which collects the hostname-to-nodeId mapping, does the 
same truncation, so the two names match.
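
To illustrate why the names still match, here is a minimal sketch of the 
truncation idea. It is not the Trafodion code: shortHostName() and 
lookupNodeId() are hypothetical helpers, and the host names in the usage note 
are made up. The only assumption taken from the comment above is that both 
sides strip the name at the first dot before comparing.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: keep everything before the first dot.
// A name without a dot is returned unchanged.
std::string shortHostName(const std::string& name) {
    return name.substr(0, name.find('.'));
}

// Hypothetical lookup: the map keys are assumed to be short host names
// already (the cluster side did the same truncation). Returns the node ID,
// or -1 if the RegionServer host is not a cluster node.
int lookupNodeId(const std::unordered_map<std::string, int>& clusterNodes,
                 const std::string& regionServerName) {
    auto it = clusterNodes.find(shortHostName(regionServerName));
    return it == clusterNodes.end() ? -1 : it->second;
}
```

With a map built from shortHostName("n001.example.com") and 
shortHostName("n002"), a lookup of "n002.example.com" finds the second node 
because both sides reduce to "n002".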

But there is a new problem that this JIRA will fix.
Consider a cluster with 10 nodes, with HBase RegionServers (RS) installed on 8 
of them. Table bltest has 100M rows, is 250G in size with one replica, and is 
split into 100 regions spread evenly over the 8 RS nodes.

Do a very simple test:
select [last 1]* from bltest;
Without the CQD (TRAF_ALLOW_ESP_COLOCATION) set, the query launches 10 ESPs 
over 10 nodes, and the ESP node_map is printed as:
 
esp_2_node_map ......... (\NSK:-1:-1:-1:-1:-1:-1:-1:-1:-1:-1)
 
which means each ESP is placed on a random node.
With the CQD on, the node_map is different on each run, for example:
 
esp_2_node_map ......... (\NSK:0:0:0:0:0:0:0:0:0:0)
esp_2_node_map ......... (\NSK:7:7:7:7:7:7:7:7:7:7)
 
As you can see, all 10 ESPs are placed on the same node when the CQD is ‘on’, 
and the node number is random.
 
The bug is in the optimizer’s NodeMap::getPopularNodeNumber() function, which 
tries to find the most popular node. In the case above, 10 ESPs read 100 
regions, so the first ESP needs to read 10 regions, 0~9. Regions 0 and 1 may 
be on different RS nodes, so the function looks for the RS that serves the 
most regions from region 0 to region 9 and places the first ESP there.
But the function uses an uninitialized array to do this. It allocates a new 
buffer to use as a counter array, and newly allocated memory is not guaranteed 
to be cleared, so the initial values in the array are unpredictable. If 
node[0] starts with a very big number, it will always win, and we see:
esp_2_node_map ......... (\NSK:0:0:0:0:0:0:0:0:0:0)

The fix is to initialize the array.
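
A minimal sketch of the counting step with an initialized counter array. This 
is not the actual getPopularNodeNumber() code: mostPopularNode(), its 
parameters, and the regionNodes layout are assumptions made for illustration.

```cpp
#include <vector>

// Hypothetical stand-in for the counting step, not the Trafodion source.
// regionNodes[i] is assumed to hold the node ID of the RegionServer hosting
// region i. Regions [beginRegion, endRegion) belong to one ESP.
int mostPopularNode(const std::vector<int>& regionNodes,
                    int beginRegion, int endRegion, int numNodes) {
    // Every counter starts at 0. A raw "new int[numNodes]" would leave the
    // values indeterminate, which is the kind of bug described above.
    std::vector<int> counts(numNodes, 0);

    for (int i = beginRegion; i < endRegion; ++i) {
        int node = regionNodes[i];
        if (node >= 0 && node < numNodes)
            ++counts[node];
    }

    // Pick the node serving the most regions; -1 if nothing was counted.
    // The strict '>' keeps the lowest node ID on a tie.
    int best = -1;
    int bestCount = 0;
    for (int n = 0; n < numNodes; ++n) {
        if (counts[n] > bestCount) {
            best = n;
            bestCount = counts[n];
        }
    }
    return best;
}
```

With the counters zeroed, the result depends only on where the regions 
actually live, so a garbage value in slot 0 can no longer win.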

This change also removes the bias toward the lowest node ID in 
getPopularNodeNumber() and makes the choice random instead.
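
One way to remove a lowest-ID bias is to pick at random among the nodes that 
tie for the highest count. The sketch below assumes that interpretation; the 
comment does not spell out how the random choice is made, and 
pickPopularNode() is a hypothetical name, not the function in the patch.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch: counts[n] is the number of regions served by node n.
// Returns one of the nodes with the highest non-zero count, chosen uniformly
// at random, or -1 if every count is zero.
int pickPopularNode(const std::vector<int>& counts, std::mt19937& rng) {
    int maxCount = 0;
    std::vector<int> winners;  // all nodes currently tied for the maximum

    for (int n = 0; n < static_cast<int>(counts.size()); ++n) {
        if (counts[n] > maxCount) {
            maxCount = counts[n];
            winners.clear();
            winners.push_back(n);
        } else if (counts[n] == maxCount && maxCount > 0) {
            winners.push_back(n);
        }
    }

    if (winners.empty())
        return -1;

    std::uniform_int_distribution<std::size_t> pick(0, winners.size() - 1);
    return winners[pick(rng)];
}
```

Passing the generator in keeps the function testable: a caller can seed 
std::mt19937 with a fixed value to get repeatable plans.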

> ESP colocation (CQD TRAF_ALLOW_ESP_COLOCATION) not working when node names in 
> sqconfig are fully qualified
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: TRAFODION-1834
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-1834
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: sql-cmp
>    Affects Versions: 1.3-incubating
>            Reporter: Atanu Mishra
>            Assignee: liu ming
>
> Paraphrasing Hans, who looked into this briefly: 
> Looking at the code, I can see that we remove anything from the first dot in 
> the node name of the region. Then we search for the unqualified node name in 
> the list of node names of the Trafodion cluster.
> Both of these clusters use FQDNs in their configuration, though, therefore we 
> won't find the node names.
> To Reproduce:
> Do an EXPLAIN of a parallel query with and without TRAF_ALLOW_ESP_COLOCATION 
> set, and the node map will not show a co-located plan when the CQD is on - if 
> the system uses FQDNs in its sqconfig file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
