[ 
https://issues.apache.org/jira/browse/TRAFODION-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322606#comment-16322606
 ] 

ASF GitHub Bot commented on TRAFODION-2881:
-------------------------------------------

GitHub user zcorrea opened a pull request:

    https://github.com/apache/trafodion/pull/1392

    [TRAFODION-2881] HA fixes

    Fixed multiple problems in the monitor's Allgather() socket reconnect logic.
    - Separated node down detection logic from communication errors and timeouts
      to better handle multiple failure scenarios
    - Better handling of network resets
    - Additional trace information
    - Fixed 'node up' hang in monitor shell due to TmSync race condition
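
As an illustration of the first bullet only, here is a minimal sketch of keeping the node-down decision separate from individual communication errors and timeouts. It is not the code in this pull request: `PeerStatus`, `PeerState`, `classifyExchange` and the `maxFailures` threshold are all hypothetical names, and the real monitor logic may decide node-down on entirely different criteria.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical types for illustration; none of these names come from the
// Trafodion monitor source.
enum class PeerStatus { Ok, Retry, NodeDown };

struct PeerState {
    int consecutiveFailures = 0;  // failed exchanges since the last success
};

// Classify the outcome of one exchange with a peer.
//   err         errno-style code from the socket call (0 means success)
//   timedOut    true if the exchange exceeded its deadline
//   maxFailures assumed threshold of consecutive failures before the peer
//               is declared down
PeerStatus classifyExchange(PeerState& peer, int err, bool timedOut,
                            int maxFailures) {
    assert(maxFailures > 0);

    if (err == 0 && !timedOut) {
        peer.consecutiveFailures = 0;  // a success clears the failure streak
        return PeerStatus::Ok;
    }

    // A single communication error or timeout only requests a reconnect.
    ++peer.consecutiveFailures;

    // Node-down is a separate decision, reached only after repeated failures.
    if (peer.consecutiveFailures >= maxFailures) {
        return PeerStatus::NodeDown;
    }
    return PeerStatus::Retry;
}
```

With this shape, a transient network reset (for example `ECONNRESET`) yields `Retry` and a reconnect attempt, while only a sustained run of failures marks the peer as down.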

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zcorrea/trafodion TRAFODION-2881

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/trafodion/pull/1392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1392
    
----
commit e832d827507521998567d4cc5d92e4239007d19a
Author: Zalo Correa <zalo.correa@...>
Date:   2018-01-11T17:32:11Z

    [TRAFODION-2881] HA fixes
    Fixed multiple problems in the monitor's Allgather() socket reconnect logic.
    - Separated node down detection logic from communication errors and timeouts
      to better handle multiple failure scenarios
    - Better handling of network resets
    - Additional trace information
    - Fixed 'node up' hang in monitor shell due to TmSync race condition

----


> Multiple node failures occur during HA testing
> ----------------------------------------------
>
>                 Key: TRAFODION-2881
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2881
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>    Affects Versions: 2.3
>            Reporter: Gonzalo E Correa
>            Assignee: Gonzalo E Correa
>             Fix For: 2.3
>
>
> Inflicting a server failure in certain modes causes multiple monitor
> processes to bring their own nodes down along with the intended target of
> the test.
> Server down modes:
> - init 6
> - reboot -f
> - shutdown -r now
> - shell node down command
> In addition, after a server goes down, the shell 'node up' command fails
> intermittently. Reproducing this requires a longevity HA test that downs and
> ups nodes over a long period, such as 24-48 hours.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
