+1. Also, should a node that has reached the failure threshold ever be considered for allocation again in the future? It is possible that, due to temporary issues with a few nodes, the locality node set could end up with far more nodes than needed.
On Thu, Jan 8, 2015 at 9:19 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> https://issues.apache.org/jira/browse/SLIDER-743 it is then.
>
> On 8 January 2015 at 14:26, Jon Maron <jma...@hortonworks.com> wrote:
>
> > +1. A good way to provide the functionality while leveraging existing
> > mechanisms.
> >
> > On Jan 8, 2015, at 8:46 AM, Gour Saha <gs...@hortonworks.com> wrote:
> >
> > > +1 on that
> > >
> > > That's also what I meant when I said -
> > >>> I don't think we have logic where we apply data locality and then,
> > >>> upon a certain number of failures (threshold), try with "no data
> > >>> locality" at least once before giving up. It would be a good idea
> > >>> to file a JIRA with this requirement.
> > >
> > > -Gour
> > >
> > > - Sent from my iPhone
> > >
> > >> On Jan 8, 2015, at 3:30 AM, Steve Loughran <ste...@hortonworks.com> wrote:
> > >>
> > >> Thinking about this some more, we could use our tracking of node
> > >> reliability to tune our placement decisions:
> > >>
> > >> 1. We add a "recent failures" field to the node entries, alongside
> > >>    the "total failures".
> > >> 2. Our scheduled failure-count resetter will set that field to zero,
> > >>    alongside the component failures.
> > >> 3. When Slider has to request a new container, unless the placement
> > >>    policy is STRICT, we will continue to use the (persisted)
> > >>    placement history.
> > >> 4. Except now, if a node has a recent failure count above some
> > >>    threshold, we don't ask for a container on that node; we just
> > >>    ask for "anywhere" placement.
> > >>
> > >> What do people think?
> > >>
> > >>> On 7 January 2015 at 09:50, Steve Loughran <ste...@hortonworks.com> wrote:
> > >>>
> > >>> The history of where things were is retained in the RoleHistory
> > >>> structures, persisted to HDFS and reread on startup. For each
> > >>> component type, it's sorted most-recent-first.
> > >>>
> > >>> When a container is needed, the AM looks in that history first,
> > >>> looking through the list of "previously used nodes for that
> > >>> component type" and skipping any that already have an instance of
> > >>> that component running. The chosen node is taken off the list, so
> > >>> there are no duplicates. (Exception: if the component type doesn't
> > >>> have any locality, then although the history is tracked, it's not
> > >>> used for placement.)
> > >>>
> > >>> When a placement on the node comes in, it's taken off the "pending
> > >>> list".
> > >>>
> > >>> There's one small issue here: there is no way to tie requests to
> > >>> allocations. We don't really care which request allocates a
> > >>> component to a node; we just like to track outstanding requests
> > >>> for explicit nodes. The algorithm is:
> > >>> - allocation to a requested node: remove the node from the "list
> > >>>   of outstanding explicit requests"
> > >>> - allocation to another node: do nothing while there are
> > >>>   outstanding requests
> > >>> - all outstanding requests satisfied: clean the list of
> > >>>   outstanding "placed" requests
> > >>>
> > >>> Now, the fun happens when a container fails on a newly allocated
> > >>> node, and it's here that there may be some policy tuning required.
> > >>>
> > >>> It comes down to this: what is the best way to react when a
> > >>> component fails to start, either immediately or shortly after
> > >>> startup? This can be a sign of a major problem ("node doesn't run
> > >>> my app") or something transient ("port still considered in use").
> > >>>
> > >>> If it's a transient problem, there's no harm in asking again.
> > >>>
> > >>> If it's a permanent problem, we need to make the decision that
> > >>> this node is bad, at least for that specific component.
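The three-rule request/allocation bookkeeping Steve describes could be sketched as follows. The `OutstandingRequests` class and its method names are hypothetical; they only illustrate the algorithm, not Slider's real request tracking.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of tracking outstanding explicitly-placed requests
// when allocations cannot be tied back to specific requests.
class OutstandingRequests {
    private final Set<String> requestedHosts = new HashSet<>();
    private int outstanding;   // total open container requests

    void request(String host) {
        outstanding++;
        if (host != null) {
            requestedHosts.add(host);   // explicit "placed" request
        }
    }

    void onAllocation(String host) {
        outstanding--;
        // Rule 1: allocation on a requested node takes it off the list.
        requestedHosts.remove(host);
        // Rule 2: allocation elsewhere leaves the list alone while
        // requests remain outstanding...
        if (outstanding == 0) {
            // Rule 3: ...but once everything is satisfied, clean out the
            // list of outstanding "placed" requests.
            requestedHosts.clear();
        }
    }

    boolean isRequested(String host) {
        return requestedHosts.contains(host);
    }
}
```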
> > >>>
> > >>> I think right now, on a startup/launch-time failure, the failing
> > >>> node is placed at the back of the list of recently used nodes, and
> > >>> the failure counts of both the node and the component are
> > >>> incremented. Although there's a YARN API where an application can
> > >>> provide blacklist hints to YARN, we're not currently using it.
> > >>>
> > >>> I think what you may be seeing is that Slider is repeatedly asking
> > >>> for the same node: it's failing and going to the back of the list
> > >>> of previously used nodes, but as there is only one, it's being
> > >>> asked for again.
> > >>>
> > >>> We can tune this, maybe, but it gets complex.
> > >>>
> > >>> 1. If the placement policy is STRICT, then we must ask for that
> > >>> previously used node. (Though, thinking about it, the component
> > >>> must have started at least once at some point in the past... I
> > >>> don't know if the special case of "previously allocated but never
> > >>> started" is detected and handled.)
> > >>>
> > >>> 2. If the placement is location-preferred, the default, how best
> > >>> to react to a launch failure? Completely cut that node off the
> > >>> list of suitable targets? Or try again a few more times? If it's a
> > >>> transient problem, retrying gives locality without over-reacting.
> > >>> If it's a permanent problem, then retrying is the wrong policy.
> > >>>
> > >>> What should we do here? We are tracking failures in NodeEntry
> > >>> entries, in a map of the cluster built up (NodeMap), but we are
> > >>> not currently using the failure counts there to make decisions. If
> > >>> we do think about using them, we'll have to think about not just
> > >>> keeping the count of failures, but resetting them on an interval,
> > >>> the way we now do with component failure counts.
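The interval-based reset of per-node failure counts, done the way component failure counts are already reset, might look roughly like this. The `FailureWindow` class is purely hypothetical; the real NodeEntry/NodeMap structures in Slider differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-node failure counts that are cleared once a
// reset interval elapses, so old failures stop influencing placement.
class FailureWindow {
    private final Map<String, Integer> failures = new HashMap<>();
    private final long resetIntervalMillis;
    private long windowStart;

    FailureWindow(long resetIntervalMillis, long now) {
        this.resetIntervalMillis = resetIntervalMillis;
        this.windowStart = now;
    }

    void recordFailure(String host, long now) {
        maybeReset(now);
        failures.merge(host, 1, Integer::sum);
    }

    int recentFailures(String host, long now) {
        maybeReset(now);
        return failures.getOrDefault(host, 0);
    }

    /** Clear all counts once the interval has elapsed, mirroring how
     *  component failure counts are reset on a schedule. */
    private void maybeReset(long now) {
        if (now - windowStart >= resetIntervalMillis) {
            failures.clear();
            windowStart = now;
        }
    }
}
```

Time is passed in explicitly here only to keep the sketch testable; a real implementation would use a scheduled executor.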
> > >>> -steve
> > >>>
> > >>>> On 7 January 2015 at 02:50, Gour Saha <gs...@hortonworks.com> wrote:
> > >>>>
> > >>>> Nitin,
> > >>>>
> > >>>> I don't think we have logic where we apply data locality and
> > >>>> then, upon a certain number of failures (threshold), try with "no
> > >>>> data locality" at least once before giving up. It would be a good
> > >>>> idea to file a JIRA with this requirement.
> > >>>>
> > >>>> -Gour
> > >>>>
> > >>>> On Tue, Jan 6, 2015 at 5:12 PM, Nitin Aggarwal
> > >>>> <nitin3588.aggar...@gmail.com> wrote:
> > >>>>
> > >>>>> I am running an HBase application, and I prefer data locality. I
> > >>>>> don't want to give up locality by default. It's OK to lose
> > >>>>> locality in rare scenarios, where something is wrong with one of
> > >>>>> the local nodes. It's more of a fail-safe that I am looking for:
> > >>>>> give up locality if it cannot be satisfied.
> > >>>>>
> > >>>>> Thanks
> > >>>>> Nitin
> > >>>>>
> > >>>>>> On Tue, Jan 6, 2015 at 4:52 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >>>>>>
> > >>>>>> Here is the meaning of 2 (see PlacementPolicy):
> > >>>>>>
> > >>>>>>   /**
> > >>>>>>    * No data locality; do not bother trying to ask for any location
> > >>>>>>    */
> > >>>>>>   public static final int NO_DATA_LOCALITY = 2;
> > >>>>>>
> > >>>>>> On Tue, Jan 6, 2015 at 4:15 PM, Gour Saha <gs...@hortonworks.com> wrote:
> > >>>>>>
> > >>>>>>> Try setting the property *yarn.component.placement.policy* to
> > >>>>>>> 2 for the component, something like this -
> > >>>>>>>
> > >>>>>>>   "HBASE_MASTER": {
> > >>>>>>>     "yarn.role.priority": "1",
> > >>>>>>>     "yarn.component.instances": "1",
> > >>>>>>>     "yarn.memory": "1500",
> > >>>>>>>     "yarn.component.placement.policy": "2"
> > >>>>>>>   },
> > >>>>>>>
> > >>>>>>> -Gour
> > >>>>>>>
> > >>>>>>> On Tue, Jan 6, 2015 at 3:33 PM, Nitin Aggarwal
> > >>>>>>> <nitin3588.aggar...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> We keep running into a scenario where one of the nodes in the
> > >>>>>>>> cluster goes bad (clock out of sync, no disk space, etc.). As
> > >>>>>>>> a result, a container fails to start, and due to locality the
> > >>>>>>>> container is assigned to the same machine again and again,
> > >>>>>>>> and it fails again and again. After a few failures, when the
> > >>>>>>>> failure threshold is reached (which is currently also not
> > >>>>>>>> reset correctly: SLIDER-629), it triggers instance shut-down.
> > >>>>>>>>
> > >>>>>>>> Is there a way to give up locality, in case of multiple
> > >>>>>>>> failures, to avoid this scenario?
> > >>>>>>>>
> > >>>>>>>> Thanks
> > >>>>>>>> Nitin Aggarwal
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> CONFIDENTIALITY NOTICE
> > >>>>>>> NOTICE: This message is intended for the use of the individual
> > >>>>>>> or entity to which it is addressed and may contain information
> > >>>>>>> that is confidential, privileged and exempt from disclosure
> > >>>>>>> under applicable law. If the reader of this message is not the
> > >>>>>>> intended recipient, you are hereby notified that any printing,
> > >>>>>>> copying, dissemination, distribution, disclosure or forwarding
> > >>>>>>> of this communication is strictly prohibited. If you have
> > >>>>>>> received this communication in error, please contact the
> > >>>>>>> sender immediately and delete it from your system. Thank You.