The history of where things were placed is retained in the RoleHistory
structures, persisted to HDFS and reread on startup. For each component
type, the list is sorted most-recent-first.

When a container is needed, the AM looks in that history first, scanning
the list of previously used nodes for that component type and skipping any
that already have an instance of that component running. The chosen node is
taken off the list, so there are no duplicates.
 (Exception: if the component type has no locality, the history is still
tracked but isn't used for placement.)
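
Here's a minimal sketch of that selection step (in Java, with illustrative
names; this is not the actual RoleHistory code):

  import java.util.Deque;
  import java.util.Set;

  class NodeSelectionSketch {
    /**
     * Pick a previously used node for a component type, most-recent-first,
     * skipping nodes that already run an instance, and removing the chosen
     * node so it isn't handed out twice.
     */
    static String pickNode(Deque<String> recentNodes,        // most-recent-first
                           Set<String> nodesRunningInstance, // live instances
                           boolean hasLocality) {
      if (!hasLocality) {
        return null;                  // history tracked, but not used here
      }
      for (String node : recentNodes) {
        if (!nodesRunningInstance.contains(node)) {
          recentNodes.remove(node);   // off the list: no duplicates
          return node;                // ask YARN for this specific node
        }
      }
      return null;                    // fall back to an unplaced request
    }
  }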



When a placement on that node comes in, it's taken off the "pending list".

There's one small issue here: there's no way to tie requests to
allocations. We don't really care which request allocated a component to a
node; we just want to track outstanding requests for explicit nodes. The
algorithm is:
 - allocation to a requested node: remove that node from the list of
outstanding explicit requests
 - allocation to another node: do nothing while there are outstanding
requests
 - all outstanding requests satisfied: clean the list of outstanding
"placed" requests

Now, the fun happens when a container fails on a newly allocated node, and
it's here that some policy tuning may be required.

It comes down to this: what is the best way to react when a component fails
to start, either immediately or shortly after startup? This can be a sign
of a major problem ("this node doesn't run my app") or of something
transient ("the port is still considered in use").

If it's a transient problem, there's no harm in asking for that node again.

If it's a permanent problem, we need to make the decision that this node is
bad, at least for that specific component.

I think that right now, on a startup/launch-time failure, the failing node
is placed at the back of the list of recently used nodes, and the failure
counts of both the node and the component are incremented. Although there's
a YARN API through which an application can provide blacklist hints to
YARN, we're not currently using it.
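
Roughly, the reaction looks like this (a sketch with illustrative names,
not the actual RoleHistory/NodeEntry code):

  import java.util.Deque;
  import java.util.Map;

  class LaunchFailureSketch {
    static void onLaunchFailure(String node,
                                String componentType,
                                Deque<String> recentNodes,  // most-recent-first
                                Map<String, Integer> nodeFailures,
                                Map<String, Integer> componentFailures) {
      // demote the node: move it to the back of the recently used list
      recentNodes.remove(node);
      recentNodes.addLast(node);
      // bump failure counts for both the node and the component
      nodeFailures.merge(node, 1, Integer::sum);
      componentFailures.merge(componentType, 1, Integer::sum);
      // note: no blacklist hint is passed to YARN at this point
    }
  }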

I think what you may be seeing is that Slider is repeatedly asking for the
same node: it fails and goes to the back of the list of previously used
nodes, but as there is only one node on that list, it gets asked for again.

We can tune this -maybe- but it gets complex.

1. If the placement policy is STRICT, then we must ask for that previously
used node. (Though thinking about it, the component must have started at
least once at some point in the past... I don't know if the special case of
"previously allocated but never started" is detected and handled.)

2. If the placement is location-preferred (the default), how best to react
to a launch failure? Completely cut that node off the list of suitable
targets? Or try again a few more times? If it's a transient problem,
retrying gives locality without over-reacting. If it's a permanent problem,
retrying is the wrong policy.
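
To make the trade-off concrete, one possible shape (purely hypothetical,
with an assumed per-node retry limit; not current Slider behaviour):

  enum PlacementDecision { REQUEST_SAME_NODE, RELAX_TO_ANY_NODE }

  class LaunchFailurePolicySketch {
    static PlacementDecision onLaunchFailure(boolean strictPlacement,
                                             int launchFailuresOnNode,
                                             int maxNodeRetries) {
      if (strictPlacement) {
        // STRICT: we must keep asking for the previously used node
        return PlacementDecision.REQUEST_SAME_NODE;
      }
      // default/location-preferred: retry a few times to keep locality,
      // then treat the node as bad for this component and relax the request
      return launchFailuresOnNode < maxNodeRetries
          ? PlacementDecision.REQUEST_SAME_NODE
          : PlacementDecision.RELAX_TO_ANY_NODE;
    }
  }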

What should we do here? We are tracking failures in NodeEntry entries, in a
map of the cluster built up in NodeMap, but we're not currently using the
failure counts there to make decisions. If we do start using them, we'll
have to think about not just keeping the count of failures, but also
resetting it on an interval, the way we now do with component failure
counts.
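
Something along these lines, perhaps (again a sketch; the threshold and
reset interval are made-up tunables, not existing options):

  class NodeFailureWindowSketch {
    private int failures;
    private long windowStartMillis = System.currentTimeMillis();

    private static final int FAILURE_THRESHOLD = 3;            // assumed
    private static final long RESET_INTERVAL_MS = 10 * 60_000; // assumed

    synchronized void recordFailure(long nowMillis) {
      maybeReset(nowMillis);
      failures++;
    }

    /** True if the node has failed too often recently to re-request it. */
    synchronized boolean shouldSkip(long nowMillis) {
      maybeReset(nowMillis);
      return failures >= FAILURE_THRESHOLD;
    }

    private void maybeReset(long nowMillis) {
      if (nowMillis - windowStartMillis > RESET_INTERVAL_MS) {
        failures = 0;                  // transient problems age out
        windowStartMillis = nowMillis;
      }
    }
  }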

-steve





On 7 January 2015 at 02:50, Gour Saha <gs...@hortonworks.com> wrote:

> Nitin,
>
> I don't think we have logic where we apply data locality and then, upon a
> certain number of failures (a threshold), try with "no data locality" at
> least once before giving up. It would be a good idea to file a JIRA with
> this requirement.
>
> -Gour
>
>
> On Tue, Jan 6, 2015 at 5:12 PM, Nitin Aggarwal <nitin3588.aggar...@gmail.com> wrote:
>
> > I am running an HBase application, and I prefer data locality. I don't
> > want to give up locality by default. It's OK to lose locality in rare
> > scenarios where something is wrong with one of the local nodes.
> > It's more of a fail-safe that I am looking for: to give up locality if
> > it cannot be satisfied.
> >
> > Thanks
> > Nitin
> >
> >
> > On Tue, Jan 6, 2015 at 4:52 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > Here is the meaning of 2 (see PlacementPolicy):
> > >
> > >   /**
> > >    * No data locality; do not bother trying to ask for any location
> > >    */
> > >   public static final int NO_DATA_LOCALITY = 2;
> > >
> > > On Tue, Jan 6, 2015 at 4:15 PM, Gour Saha <gs...@hortonworks.com> wrote:
> > >
> > > > Try setting property *yarn.component.placement.policy* to 2 for the
> > > > component, something like this -
> > > >
> > > >     "HBASE_MASTER": {
> > > >       "yarn.role.priority": "1",
> > > >       "yarn.component.instances": "1",
> > > >       "yarn.memory": "1500",
> > > >       "yarn.component.placement.policy": "2"
> > > >     },
> > > >
> > > > -Gour
> > > >
> > > > On Tue, Jan 6, 2015 at 3:33 PM, Nitin Aggarwal <nitin3588.aggar...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We keep running into a scenario where one of the nodes in the
> > > > > cluster goes bad (clock out of sync, no disk space, etc.). As a
> > > > > result the container fails to start, and due to locality it is
> > > > > assigned to the same machine again and again, and it fails again
> > > > > and again. After a few failures, when the failure threshold is
> > > > > reached (which is currently also not reset correctly: SLIDER-629),
> > > > > it triggers instance shut-down.
> > > > >
> > > > > Is there a way to give up locality, in case of multiple failures,
> > > > > to avoid this scenario?
> > > > >
> > > > > Thanks
> > > > > Nitin Aggarwal
> > > > >
> > > >
> > >
> >
>
