+1. Also, should a node that has reached the failure threshold ever be considered for allocation again in the future? It is possible that, due to temporary issues with a few nodes, the locality node set could end up with far more nodes than needed.
On Thu, Jan 8, 2015 at 9:19 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> https://issues.apache.org/jira/browse/SLIDER-743 it is then.
>
> On 8 January 2015 at 14:26, Jon Maron <jma...@hortonworks.com> wrote:
>
> > +1. A good way to provide the functionality while leveraging existing
> > mechanisms.
> >
> > On Jan 8, 2015, at 8:46 AM, Gour Saha <gs...@hortonworks.com> wrote:
> >
> > > +1 on that
> > >
> > > That's also what I meant when I said -
> > >>> I don't think we have logic where we apply data locality and then,
> > >>> upon a certain number of failures (threshold), try with "no data
> > >>> locality" at least once before giving up. It would be a good idea
> > >>> to file a JIRA with this requirement.
> > >
> > > -Gour
> > >
> > > - Sent from my iPhone
> > >
> > >> On Jan 8, 2015, at 3:30 AM, Steve Loughran <ste...@hortonworks.com> wrote:
> > >>
> > >> Thinking about this some more, we could use our tracking of node
> > >> reliability to tune our placement decisions:
> > >>
> > >> 1. We add a "recent failures" field to the node entries, alongside
> > >>    the "total failures".
> > >> 2. Our scheduled failure-count resetter will set that field to zero,
> > >>    alongside the component failures.
> > >> 3. When Slider has to request a new container, unless the placement
> > >>    policy is STRICT, we will continue to use the (persisted)
> > >>    placement history.
> > >> 4. Except now, if a node has a recent failure count above some
> > >>    threshold, we don't ask for a container on that node; we just
> > >>    ask for "anywhere" placement.
> > >>
> > >> What do people think?
> > >>
> > >>> On 7 January 2015 at 09:50, Steve Loughran <ste...@hortonworks.com> wrote:
> > >>>
> > >>> The history of where things were is retained in the RoleHistory
> > >>> structures, persisted to HDFS and reread on startup. For each
> > >>> component type, it's sorted most-recent-first.
> > >>>
> > >>> When a container is needed, the AM looks in that history first,
> > >>> looking through the list of "previously used nodes for that
> > >>> component type" and skipping any that already have an instance of
> > >>> that component running. The chosen node is taken off the list, so
> > >>> there are no duplicates. (Exception: if the component type doesn't
> > >>> have any locality, then although the history is tracked, it's not
> > >>> used for placement.)
> > >>>
> > >>> When a placement on the node comes in, it's taken off the "pending
> > >>> list".
> > >>>
> > >>> There's one small issue here: there is no way to tie requests to
> > >>> allocations. We don't really care which request allocates a
> > >>> component to a node; we just like to track outstanding requests
> > >>> for explicit nodes. The algorithm is:
> > >>> - allocation to a requested node: remove the node from the "list
> > >>>   of outstanding explicit requests"
> > >>> - allocation to another node: do nothing while there are
> > >>>   outstanding requests
> > >>> - all outstanding requests satisfied: clean the list of
> > >>>   outstanding "placed" requests
> > >>>
> > >>> Now, the fun happens when a container fails on a newly allocated
> > >>> node, and it's here that there may be some policy tuning required.
> > >>>
> > >>> It comes down to this: what is the best way to react when a
> > >>> component fails to start, either immediately or shortly after
> > >>> startup? This can be a sign of a major problem ("node doesn't run
> > >>> my app") or something transient ("port still considered in use").
> > >>>
> > >>> If it's a transient problem, there's no harm in asking again.
> > >>>
> > >>> If it's a permanent problem, we need to make the decision that
> > >>> this node is bad, at least for that specific component.
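The three-rule request/allocation bookkeeping Steve describes could be sketched as follows. The `OutstandingRequests` class and its method names are hypothetical; they only illustrate the algorithm, not Slider's real request tracking.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of tracking outstanding explicitly-placed requests
// when allocations cannot be tied back to specific requests.
class OutstandingRequests {
    private final Set<String> requestedHosts = new HashSet<>();
    private int outstanding;   // total open container requests

    void request(String host) {
        outstanding++;
        if (host != null) {
            requestedHosts.add(host);   // explicit "placed" request
        }
    }

    void onAllocation(String host) {
        outstanding--;
        // Rule 1: allocation on a requested node takes it off the list.
        requestedHosts.remove(host);
        // Rule 2: allocation elsewhere leaves the list alone while
        // requests remain outstanding...
        if (outstanding == 0) {
            // Rule 3: ...but once everything is satisfied, clean out the
            // list of outstanding "placed" requests.
            requestedHosts.clear();
        }
    }

    boolean isRequested(String host) {
        return requestedHosts.contains(host);
    }
}
```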
> > >>>
> > >>> I think right now, on a startup/launch-time failure, the failing
> > >>> node is placed at the back of the list of recently used nodes, and
> > >>> the failure counts of both the node and the component are
> > >>> incremented. Although there's a YARN API where an application can
> > >>> provide blacklist hints to YARN, we're not currently using it.
> > >>>
> > >>> I think what you may be seeing is that Slider is repeatedly asking
> > >>> for the same node: it's failing and going to the back of the list
> > >>> of previously used nodes, but as there is only one, it's being
> > >>> asked for again.
> > >>>
> > >>> We can tune this, maybe, but it gets complex.
> > >>>
> > >>> 1. If the placement policy is STRICT, then we must ask for that
> > >>> previously used node. (Though, thinking about it, the component
> > >>> must have started at least once at some point in the past... I
> > >>> don't know if the special case of "previously allocated but never
> > >>> started" is detected and handled.)
> > >>>
> > >>> 2. If the placement is location-preferred, the default, how best
> > >>> to react to a launch failure? Completely cut that node off the
> > >>> list of suitable targets? Or try again a few more times? If it's a
> > >>> transient problem, retrying gives locality without over-reacting.
> > >>> If it's a permanent problem, then retrying is the wrong policy.
> > >>>
> > >>> What should we do here? We are tracking failures in NodeEntry
> > >>> entries, in a map of the cluster built up (NodeMap), but we are
> > >>> not currently using the failure counts there to make decisions. If
> > >>> we do think about using them, we'll have to think about not just
> > >>> keeping the count of failures, but resetting them on an interval,
> > >>> the way we now do with component failure counts.
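The interval-based reset of per-node failure counts, done the way component failure counts are already reset, might look roughly like this. The `FailureWindow` class is purely hypothetical; the real NodeEntry/NodeMap structures in Slider differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-node failure counts that are cleared once a
// reset interval elapses, so old failures stop influencing placement.
class FailureWindow {
    private final Map<String, Integer> failures = new HashMap<>();
    private final long resetIntervalMillis;
    private long windowStart;

    FailureWindow(long resetIntervalMillis, long now) {
        this.resetIntervalMillis = resetIntervalMillis;
        this.windowStart = now;
    }

    void recordFailure(String host, long now) {
        maybeReset(now);
        failures.merge(host, 1, Integer::sum);
    }

    int recentFailures(String host, long now) {
        maybeReset(now);
        return failures.getOrDefault(host, 0);
    }

    /** Clear all counts once the interval has elapsed, mirroring how
     *  component failure counts are reset on a schedule. */
    private void maybeReset(long now) {
        if (now - windowStart >= resetIntervalMillis) {
            failures.clear();
            windowStart = now;
        }
    }
}
```

Time is passed in explicitly here only to keep the sketch testable; a real implementation would use a scheduled executor.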
> > >>> -steve
> > >>>
> > >>>> On 7 January 2015 at 02:50, Gour Saha <gs...@hortonworks.com> wrote:
> > >>>>
> > >>>> Nitin,
> > >>>>
> > >>>> I don't think we have logic where we apply data locality and
> > >>>> then, upon a certain number of failures (threshold), try with "no
> > >>>> data locality" at least once before giving up. It would be a good
> > >>>> idea to file a JIRA with this requirement.
> > >>>>
> > >>>> -Gour
> > >>>>
> > >>>> On Tue, Jan 6, 2015 at 5:12 PM, Nitin Aggarwal
> > >>>> <nitin3588.aggar...@gmail.com> wrote:
> > >>>>
> > >>>>> I am running an HBase application, and I prefer data locality. I
> > >>>>> don't want to give up locality by default. It's OK to lose
> > >>>>> locality in rare scenarios, where something is wrong with one of
> > >>>>> the local nodes. It's more of a fail-safe that I am looking for:
> > >>>>> give up locality if it cannot be satisfied.
> > >>>>>
> > >>>>> Thanks
> > >>>>> Nitin
> > >>>>>
> > >>>>>> On Tue, Jan 6, 2015 at 4:52 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >>>>>>
> > >>>>>> Here is the meaning of 2 (see PlacementPolicy):
> > >>>>>>
> > >>>>>>   /**
> > >>>>>>    * No data locality; do not bother trying to ask for any location
> > >>>>>>    */
> > >>>>>>   public static final int NO_DATA_LOCALITY = 2;
> > >>>>>>
> > >>>>>> On Tue, Jan 6, 2015 at 4:15 PM, Gour Saha <gs...@hortonworks.com> wrote:
> > >>>>>>
> > >>>>>>> Try setting the property *yarn.component.placement.policy* to
> > >>>>>>> 2 for the component, something like this -
> > >>>>>>>
> > >>>>>>>   "HBASE_MASTER": {
> > >>>>>>>     "yarn.role.priority": "1",
> > >>>>>>>     "yarn.component.instances": "1",
> > >>>>>>>     "yarn.memory": "1500",
> > >>>>>>>     "yarn.component.placement.policy": "2"
> > >>>>>>>   },
> > >>>>>>>
> > >>>>>>> -Gour
> > >>>>>>>
> > >>>>>>> On Tue, Jan 6, 2015 at 3:33 PM, Nitin Aggarwal
> > >>>>>>> <nitin3588.aggar...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> We keep running into a scenario where one of the nodes in the
> > >>>>>>>> cluster goes bad (clock out of sync, no disk space, etc.). As
> > >>>>>>>> a result, a container fails to start, and due to locality the
> > >>>>>>>> container is assigned to the same machine again and again,
> > >>>>>>>> and it fails again and again. After a few failures, when the
> > >>>>>>>> failure threshold is reached (which is currently also not
> > >>>>>>>> reset correctly: SLIDER-629), it triggers instance shut-down.
> > >>>>>>>>
> > >>>>>>>> Is there a way to give up locality, in case of multiple
> > >>>>>>>> failures, to avoid this scenario?
> > >>>>>>>>
> > >>>>>>>> Thanks
> > >>>>>>>> Nitin Aggarwal
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> CONFIDENTIALITY NOTICE
> > >>>>>>> NOTICE: This message is intended for the use of the individual
> > >>>>>>> or entity to which it is addressed and may contain information
> > >>>>>>> that is confidential, privileged and exempt from disclosure
> > >>>>>>> under applicable law. If the reader of this message is not the
> > >>>>>>> intended recipient, you are hereby notified that any printing,
> > >>>>>>> copying, dissemination, distribution, disclosure or forwarding
> > >>>>>>> of this communication is strictly prohibited. If you have
> > >>>>>>> received this communication in error, please contact the
> > >>>>>>> sender immediately and delete it from your system. Thank You.