Thinking about this some more, we could use our tracking of node
reliability to tune our placement decisions:


   1. We add a "recent failures" field to the node entries, alongside the
   existing "total failures" field.
   2. Our scheduled failure-count resetter sets that field to zero, just as
   it does for the component failure counts.
   3. When Slider has to request a new container, unless the placement
   policy is STRICT, we continue to use the (persisted) placement history.
   4. Except now, if a node has a recent failure count above some
   threshold, we don't ask for a container on that node; we just ask for
   "anywhere" placement. (A rough sketch of this check follows the list.)

What do people think?

On 7 January 2015 at 09:50, Steve Loughran <ste...@hortonworks.com> wrote:

> The history of where things were is retained in the RoleHistory
> structures, persisted to HDFS and reread on startup. For each component
> type, it's sorted most-recent-first.
>
> When a container is needed, the AM looks in that history first and works
> through the list of previously used nodes for that component type,
> skipping any that already have an instance of that component running. The
> chosen node is taken off the list, so there are no duplicates.
> (Exception: if the component type doesn't have any locality, the history
> is still tracked, but it isn't used for placement.)
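>
> To make that lookup concrete, the selection step is roughly this (an
> illustrative sketch, not the actual RoleHistory code; it only needs
> java.util.Deque, Iterator and Set):
>
>   String chooseNode(Deque<String> recentNodesForRole, Set<String> nodesRunningRole) {
>     Iterator<String> it = recentNodesForRole.iterator();  // most-recent-first
>     while (it.hasNext()) {
>       String host = it.next();
>       if (nodesRunningRole.contains(host)) {
>         continue;    // already hosts an instance of this component: skip it
>       }
>       it.remove();   // take it off the list so there are no duplicate requests
>       return host;
>     }
>     return null;     // no suitable previously used node
>   }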
>
>
>
> When a placement on the node comes in, it's taken off the "pending
> list".
>
> There's one small issue here: there's no way to tie requests to
> allocations. We don't really care which request allocates a component to a
> node; we just want to track outstanding requests for explicit nodes. The
> algorithm (sketched below) is:
>  -allocation to a requested node: remove the node from the list of
> outstanding explicit requests
>  -allocation to another node: do nothing while there are outstanding
> requests
>  -all outstanding requests satisfied: clean the list of outstanding
> "placed" requests
>
> Now, the fun happens when a container fails on a newly allocated node, and
> it's here that some policy tuning may be required.
>
> It comes down to this: what is the best way to react when a component
> fails to start, either immediately or shortly after startup? This can be a
> sign of a major problem ("node doesn't run my app") or of something
> transient ("port still considered in use").
>
> If it's a transient problem, there's no harm in asking again.
>
> If it's a permanent problem, we need to make the decision that this node
> is bad, at least for that specific component.
>
> I think right now, on a startup/launch-time failure, the failing node is
> placed at the back of the list of recently used nodes, and the failure
> counts of both the node and the component are incremented. Although there's
> a YARN API through which an application can provide blacklist hints to
> YARN, we're not currently using it.
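>
> For reference, the hint API is AMRMClient#updateBlacklist; if we ever did
> wire it up, the call would be something like this (a sketch only, using
> org.apache.hadoop.yarn.client.api.AMRMClient and java.util.Collections):
>
>   void blacklistNode(AMRMClient<AMRMClient.ContainerRequest> amRmClient,
>       String badHost) {
>     amRmClient.updateBlacklist(
>         Collections.singletonList(badHost),    // hosts to add to the blacklist
>         Collections.<String>emptyList());      // hosts to take off it
>   }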
>
> I think what you may be seeing is that Slider is repeatedly asking for the
> same node: it's failing and going to the back of the list of previously
> used nodes, but as there is only one node, it gets asked for again.
>
> We can tune this -maybe- but it gets complex.
>
> 1. If the placement policy is STRICT, then we must ask for that previously
> used node. (Though thinking about it, the component must have started at
> least once at some point in the past...I don't know if the special case of
> "previously allocated but never started" is detected and handled)
>
> 2. If the placement is the default, location-preferred policy, how best to
> react to a launch failure? Completely cut that node off the list of
> suitable targets? Or try again a few more times? If it's a transient
> problem, retrying gives locality without over-reacting. If it's a permanent
> problem, then retrying is the wrong policy.
>
> What should we do here? We are already tracking failures in NodeEntry
> entries, in a map of the cluster we build up (NodeMap), but we don't
> currently use those failure counts to make decisions. If we do start using
> them, we'll have to think about not just keeping the count of failures, but
> also resetting it on an interval, the way we now do with component failure
> counts.
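>
> That reset could mirror the component-failure one; a hypothetical sketch,
> where the NodeEntry accessor is illustrative (not yet in the class) and the
> scheduling uses plain java.util.concurrent:
>
>   ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
>
>   void scheduleNodeFailureReset(final Collection<NodeEntry> nodeEntries,
>       final long intervalMinutes) {
>     scheduler.scheduleAtFixedRate(new Runnable() {
>       @Override
>       public void run() {
>         for (NodeEntry entry : nodeEntries) {
>           entry.resetRecentFailures();   // illustrative: zero "recent", keep "total"
>         }
>       }
>     }, intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
>   }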
>
> -steve
>
>
>
>
>
> On 7 January 2015 at 02:50, Gour Saha <gs...@hortonworks.com> wrote:
>
>> Nitin,
>>
>> I don't think we have logic where we apply data locality and then, after a
>> certain number of failures (a threshold), try "no data locality" at least
>> once before giving up. It would be a good idea to file a JIRA with this
>> requirement.
>>
>> -Gour
>>
>>
>> On Tue, Jan 6, 2015 at 5:12 PM, Nitin Aggarwal <
>> nitin3588.aggar...@gmail.com
>> > wrote:
>>
>> > I am running an HBase application, and I prefer data locality. I don't
>> > want to give up locality by default. It's OK to lose locality in rare
>> > scenarios, where something is wrong with one of the local nodes.
>> > It's more of a fail-safe that I am looking for: give up locality if it
>> > cannot be satisfied.
>> >
>> > Thanks
>> > Nitin
>> >
>> >
>> > On Tue, Jan 6, 2015 at 4:52 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >
>> > > Here is the meaning of 2 (see PlacementPolicy):
>> > >
>> > >   /**
>> > >    * No data locality; do not bother trying to ask for any location
>> > >    */
>> > >   public static final int NO_DATA_LOCALITY = 2;
>> > >
>> > > On Tue, Jan 6, 2015 at 4:15 PM, Gour Saha <gs...@hortonworks.com>
>> wrote:
>> > >
>> > > > Try setting property *yarn.component.placement.policy* to 2 for the
>> > > > component, something like this -
>> > > >
>> > > >     "HBASE_MASTER": {
>> > > >       "yarn.role.priority": "1",
>> > > >       "yarn.component.instances": "1",
>> > > >       "yarn.memory": "1500",
>> > > >       "yarn.component.placement.policy": "2"
>> > > >     },
>> > > >
>> > > > -Gour
>> > > >
>> > > > On Tue, Jan 6, 2015 at 3:33 PM, Nitin Aggarwal <
>> > > > nitin3588.aggar...@gmail.com
>> > > > > wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > We keep on running into a scenario where one of the nodes in the
>> > > > > cluster goes bad (clock out of sync, no disk space, etc.). As a
>> > > > > result the container fails to start, and due to locality the
>> > > > > container is assigned to the same machine again and again, and it
>> > > > > fails again and again. After a few failures, when the failure
>> > > > > threshold is reached (which is currently also not reset correctly,
>> > > > > see SLIDER-629), it triggers instance shut-down.
>> > > > >
>> > > > > Is there a way to give up locality, in case of multiple failures,
>> > > > > to avoid this scenario?
>> > > > >
>> > > > > Thanks
>> > > > > Nitin Aggarwal
>> > > > >
>> > > >
>> > >
>> >
>>
>>
>
>
