[ https://issues.apache.org/jira/browse/TRAFODION-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322606#comment-16322606 ]
ASF GitHub Bot commented on TRAFODION-2881:
-------------------------------------------

GitHub user zcorrea opened a pull request:

    https://github.com/apache/trafodion/pull/1392

    [TRAFODION-2881] HA fixes

    Fixed multiple problems in monitor Allgather() socket reconnect logic.
    - Separated node down detection logic from communication errors and
      timeouts to better handle multiple failure scenarios
    - Better handling of network resets
    - Additional trace information
    - Fixed 'node up' hang in monitor shell due to TmSync race condition

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zcorrea/trafodion TRAFODION-2881

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/trafodion/pull/1392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1392

----
commit e832d827507521998567d4cc5d92e4239007d19a
Author: Zalo Correa <zalo.correa@...>
Date:   2018-01-11T17:32:11Z

    [TRAFODION-2881] HA fixes

    Fixed multiple problems in monitor Allgather() socket reconnect logic.
    - Separated node down detection logic from communication errors and
      timeouts to better handle multiple failure scenarios
    - Better handling of network resets
    - Additional trace information
    - Fixed 'node up' hang in monitor shell due to TmSync race condition

----

> Multiple node failures occur during HA testing
> ----------------------------------------------
>
>                 Key: TRAFODION-2881
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2881
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>   Affects Versions: 2.3
>            Reporter: Gonzalo E Correa
>            Assignee: Gonzalo E Correa
>             Fix For: 2.3
>
>
> Inflicting server failure in certain modes will cause multiple monitor
> processes to also bring their nodes down along with the intended target of
> the test.
> Server down modes:
>    init 6
>    reboot -f
>    shutdown -r now
>    shell node down command
> In addition, after a server down, the shell 'node up' command will also fail
> intermittently. Reproducing this requires a longevity HA test that downs and
> ups nodes over a long period of time, such as 24-48 hours.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)