[ 
https://issues.apache.org/jira/browse/ASTERIXDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingyi Bu updated ASTERIXDB-1076:
---------------------------------
    Description: 
When the CPUs in the cluster are saturated by computation, heartbeats from 
slave nodes to the master node might get delayed. In this case, the master 
node thinks a node has failed and can no longer add the node back. Hence, the 
entire cluster becomes unusable and an instance restart is needed.

Two things need to be fixed:
1. (at least) expose AsterixDB configuration parameters so that users can set 
a larger heartbeat threshold;
2. allow a node to leave and re-join a Hyracks cluster.

In the long term, we might need to investigate better liveness check strategies.
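
As a rough illustration of the two fixes above, here is a minimal sketch of a 
master-side liveness tracker with a configurable threshold that lets a node 
re-join on its next heartbeat. It is not the Hyracks implementation; the class 
name HeartbeatMonitor and the parameters heartbeat_period_s and 
max_missed_heartbeats are hypothetical names chosen for this example.

```python
import time


class HeartbeatMonitor:
    """Hypothetical master-side liveness tracker (not Hyracks code)."""

    def __init__(self, heartbeat_period_s=10.0, max_missed_heartbeats=5,
                 clock=time.monotonic):
        # Both values are meant to be user-configurable (fix 1).
        self.heartbeat_period_s = heartbeat_period_s
        self.max_missed_heartbeats = max_missed_heartbeats
        self._clock = clock
        self._last_seen = {}   # node id -> time of last heartbeat
        self._dead = set()     # nodes currently considered failed

    def heartbeat(self, node_id):
        """Record a heartbeat; return True if the node just re-joined."""
        rejoined = node_id in self._dead
        # A late heartbeat brings the node back instead of being ignored (fix 2).
        self._dead.discard(node_id)
        self._last_seen[node_id] = self._clock()
        return rejoined

    def sweep(self):
        """Mark nodes whose heartbeats are overdue; return the failed set."""
        now = self._clock()
        limit = self.heartbeat_period_s * self.max_missed_heartbeats
        for node_id, seen in self._last_seen.items():
            if now - seen > limit:
                self._dead.add(node_id)
        return set(self._dead)

    def is_usable(self):
        """The cluster accepts queries only while no known node is failed."""
        return bool(self._last_seen) and not self._dead
```

With a larger max_missed_heartbeats, a node whose heartbeat is merely delayed 
by CPU saturation is less likely to be declared failed, and a node that was 
declared failed becomes usable again as soon as its next heartbeat arrives.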


To reproduce the issue, overload the slave nodes' CPUs and you will see the 
failure.
The exception "Asterix Cluster Global recovery is not yet complete and The 
system is in ACTIVE state" will be thrown for subsequent queries.

  was:
When the CPUs in the cluster are saturated by computation, heartbeats from 
slave nodes to the master node might get delayed. In this case, the master 
node thinks a node has failed and can no longer add the node back. Hence, the 
entire cluster becomes unusable and an instance restart is needed.

Two things need to be fixed:
1. (at least) expose AsterixDB configuration parameters so that users can set 
a larger heartbeat threshold;
2. allow a node to leave and re-join a Hyracks cluster.

In the long term, we might need to investigate better liveness check strategies.


To reproduce the issue, overload the slave nodes' CPUs and you will see the 
failure.

        Summary: False failures cause denying new queries  (was: False failures 
triggers denying new queries)

> False failures cause denying new queries
> ----------------------------------------
>
>                 Key: ASTERIXDB-1076
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1076
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: AsterixDB
>            Reporter: Yingyi Bu
>            Priority: Critical
>
> When the CPUs in the cluster are saturated by computation, heartbeats from 
> slave nodes to the master node might get delayed. In this case, the master 
> node thinks a node has failed and can no longer add the node back. Hence, the 
> entire cluster becomes unusable and an instance restart is needed.
> Two things need to be fixed:
> 1. (at least) expose AsterixDB configuration parameters so that users can 
> set a larger heartbeat threshold;
> 2. allow a node to leave and re-join a Hyracks cluster.
> In the long term, we might need to investigate better liveness check 
> strategies.
> To reproduce the issue, overload the slave nodes' CPUs and you will see the 
> failure.
> The exception "Asterix Cluster Global recovery is not yet complete and The 
> system is in ACTIVE state" will be thrown for subsequent queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
