Re: [jira] [Commented] (ASTERIXDB-1076) False failures cause denying new queries

Yingyi Bu Fri, 11 Sep 2015 17:06:07 -0700

Right, exposing the configuration parameters is a separate issue.

Best,
Yingyi


On Fri, Sep 11, 2015 at 5:03 PM, Ian Maxon (JIRA) <[email protected]> wrote:

>
>     [
> https://issues.apache.org/jira/browse/ASTERIXDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741761#comment-14741761
> ]
>
> Ian Maxon commented on ASTERIXDB-1076:
> --------------------------------------
>
> Oh, it's good that the heartbeats are at least not stuck in the big ol'
> WorkQueue. I was under the impression that was how it was.
>
> For addressing 1), the parameters for controlling heartbeat interval exist
> in Hyracks but they're command line args to the CC. So actually it is
> possible to change them, you just put them in the normal place where -Xmx
> and so on belong in the asterix-configuration.xml (I think, haven't
> tried... :) )
> It'd probably be easier/clearer to migrate them to be their own attributes
> in that file, otherwise it's kind of impossible to tell that the option
> exists in the first place.
>
> > False failures cause denying new queries
> > ----------------------------------------
> >
> >                 Key: ASTERIXDB-1076
> >                 URL:
> https://issues.apache.org/jira/browse/ASTERIXDB-1076
> >             Project: Apache AsterixDB
> >          Issue Type: Bug
> >          Components: AsterixDB
> >            Reporter: Yingyi Bu
> >            Assignee: Yingyi Bu
> >            Priority: Critical
> >
> > When CPUs in the cluster are saturated by computation, heartbeats
> from slave nodes to the master node might get delayed.  In this case, the
> master node thinks a node has failed and can no longer add the node back.
> Hence, the entire cluster is not usable and an instance restart is needed.
> > Two things need to be fixed:
> > 1.  (at least) expose AsterixDB configuration parameters to allow users
> to set a large heartbeat threshold;
> > 2.  allow a node to leave and re-join a hyracks cluster.
> > In the long term, we might need to investigate better liveness check
> strategies.
> > To reproduce the issue, overload the slave nodes' CPUs.
> > The exception " Asterix Cluster Global recovery is not yet complete and
> The system is in ACTIVE state" will be thrown for subsequent queries.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
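
For readers following the thread, the two fixes listed in the issue (a
larger heartbeat threshold, and letting a node re-join) can be illustrated
with a small sketch. This is not the Hyracks implementation; the class
HeartbeatMonitor, the max_silence_s parameter, and the node id "nc1" are
all hypothetical names made up for illustration.

```python
import time
from typing import Callable, Dict, Set


class HeartbeatMonitor:
    """Toy liveness tracker (hypothetical; not the Hyracks code)."""

    def __init__(self, max_silence_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Hypothetical "heartbeat threshold": how long a node may stay
        # silent before the master treats it as failed.
        self.max_silence_s = max_silence_s
        self._clock = clock
        self._last_seen: Dict[str, float] = {}
        self._dead: Set[str] = set()

    def heartbeat(self, node_id: str) -> None:
        # Record the heartbeat. A node previously marked dead is allowed
        # to re-join, which is fix 2 in the issue description.
        self._last_seen[node_id] = self._clock()
        self._dead.discard(node_id)

    def sweep(self) -> Set[str]:
        # Mark every node whose last heartbeat is older than the threshold
        # as dead, and return the current set of dead nodes.
        now = self._clock()
        for node_id, seen in self._last_seen.items():
            if now - seen > self.max_silence_s:
                self._dead.add(node_id)
        return set(self._dead)

    def usable(self) -> bool:
        # In this sketch the cluster is usable only when no node is dead.
        return not self._dead
```

With a small max_silence_s, a node that is merely busy is reported dead by
sweep(); raising the threshold (fix 1) makes that less likely, and the
discard in heartbeat() lets a late node come back without a restart.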
