On 4 Jan 2016, at 18:59, Manoj Samel <manojsamelt...@gmail.com> wrote:

Hi Steve,

Regarding your note "I've put it into HDP, but I think we need to accept
that it's not going to stay"

I'd like to understand the development process for Slider w.r.t. code changes
in other parts of Hadoop. E.g. do you put a change into HDP first and then
merge it into the general Hadoop branch? Is there a timeline for when a Slider
change that got into HDP gets into the general Hadoop codeline?

We are unable to adopt HDP, hence the question.

Thanks,

Manoj


That is the sole outstanding difference between branches that impacts slider, 
and it's a permanent annoyance to me.

The reason the patch hasn't gone in is that the YARN team weren't happy with it, 
on the basis that (a) it added a significant amount of work to an RM that is 
already becoming overloaded, and (b) it gets complex with HA.

I managed to get the patch into HDP by virtue of checking it in myself, but I 
can't do that in the ASF codebase, as you need a +1 vote from somebody else.

And... I've not gone back to revisit that code. I've updated it sporadically; 
it's fairly brittle to some changes in RM state, at least in tests.

I think I can make a good case for getting the automatic creation of user paths 
into the RM: it's got the credentials to do this in a secure cluster, and it 
means the behaviour is the same everywhere (a rough sketch of that setup follows 
the list below). To get that in, though, I think I'd have to cut out the idea of 
having the RM automatically clean up container and app entries when they 
terminate.

Why?
 - There's the outstanding issue of what happens when ZK is down and there's an 
attempt to handle the failure/shutdown of many apps.
 - It's the bit where some aspects of the registry design surface, other than 
just "we need a secure path".
 - As the registry is actually designed to live outside/before a YARN cluster is 
up and running, you can't be confident it'll always be there anyway.
 - If the RM fails, then things will hang about anyway.
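
For the path-setup half, here's a rough sketch of the shape I have in mind, 
using the hadoop-yarn-registry client API (RegistryOperations, RegistryUtils). 
The class and method names are mine, purely illustrative; the real patch would 
use a kerberized registry instance (RegistryOperationsFactory has a 
createKerberosInstance() for that) and set the ZK ACLs properly:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.registry.client.api.RegistryOperations;
  import org.apache.hadoop.registry.client.api.RegistryOperationsFactory;
  import org.apache.hadoop.registry.client.binding.RegistryUtils;

  public class UserPathSetup {
    /**
     * Ensure the registry home path for a user exists.
     * Called (hypothetically) by the RM on application submission.
     */
    public static void ensureUserPath(Configuration conf, String user)
        throws Exception {
      // createInstance() inits the client from the configuration;
      // the caller still has to start it.
      RegistryOperations registry =
          RegistryOperationsFactory.createInstance(conf);
      registry.start();
      try {
        // e.g. /users/<user>; true => create parent paths as needed.
        // In a secure cluster the RM has the credentials to create this
        // and lock it down so only the user (and admins) can write below.
        registry.mknode(RegistryUtils.homePathForUser(user), true);
      } finally {
        registry.stop();
      }
    }
  }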

For Slider, we can put in a patch to remove entries on container 
failure/termination, so only AM-lifespan entries become a problem. There's not 
much we can do there, except maybe add a new client slider yarn-registry 
operation. Would that suit?
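
To make the container-side cleanup concrete, here's a rough sketch of what 
that AM-side patch amounts to; the class name, path layout, and callback 
wiring are illustrative, not Slider's actual code:

  import java.io.IOException;

  import org.apache.hadoop.registry.client.api.RegistryOperations;
  import org.apache.hadoop.yarn.api.records.ContainerStatus;

  public class ContainerEntryCleanup {
    private final RegistryOperations registry;
    private final String serviceBasePath; // e.g. the AM's own service path

    public ContainerEntryCleanup(RegistryOperations registry,
        String serviceBasePath) {
      this.registry = registry;
      this.serviceBasePath = serviceBasePath;
    }

    /** Call from the AM's onContainersCompleted() callback. */
    public void containerCompleted(ContainerStatus status) {
      // Hypothetical layout: one entry per container under "components/".
      String entryPath = serviceBasePath + "/components/"
          + status.getContainerId();
      try {
        registry.delete(entryPath, true);
      } catch (IOException e) {
        // Best-effort: if the registry is unreachable, the entry just
        // lingers, which is no worse than relying on RM-side cleanup.
      }
    }
  }

AM-lifespan entries can't be handled this way, of course, because the AM isn't 
around after its own exit; that's where the client-side purge operation would 
come in.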


Returning to YARN and the YARN-913 wrap-up, then, here's what we could do:

  1.  Cut out from the TLA+ specification and code all notion of automatic 
cleanup. This isn't going to be backwards-incompatible, *as the RM-side feature 
was never in ASF Hadoop*. It does not, therefore, exist as far as the ASF 
codebase is concerned. (We will need to keep the string constants, marked as 
deprecated, to stop compile/link problems.)
  2.  Have a far more minimal RM-side registry which does nothing but path 
setup.
  3.  Get that patch into YARN.
  4.  Update the distshell demo to set up an entry (see the sketch after this 
list).
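
For item 4, here's a hedged sketch of what the distshell AM could do with the 
existing ServiceRecord/RegistryOperations API; the service class name, endpoint 
choice, and class name are mine, not anything that exists in distshell today:

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.registry.client.api.BindFlags;
  import org.apache.hadoop.registry.client.api.RegistryOperations;
  import org.apache.hadoop.registry.client.api.RegistryOperationsFactory;
  import org.apache.hadoop.registry.client.binding.RegistryPathUtils;
  import org.apache.hadoop.registry.client.binding.RegistryTypeUtils;
  import org.apache.hadoop.registry.client.binding.RegistryUtils;
  import org.apache.hadoop.registry.client.types.ServiceRecord;
  import org.apache.hadoop.registry.client.types.yarn.PersistencePolicies;
  import org.apache.hadoop.registry.client.types.yarn.YarnRegistryAttributes;

  public class DistshellRegistration {
    public static void register(Configuration conf, String user,
        String appId, URI trackingUri) throws Exception {
      RegistryOperations registry =
          RegistryOperationsFactory.createInstance(conf);
      registry.start();
      try {
        ServiceRecord record = new ServiceRecord();
        record.set(YarnRegistryAttributes.YARN_ID, appId);
        // Tie the entry to the application's lifespan. With automatic
        // cleanup cut from the RM, this attribute becomes advisory.
        record.set(YarnRegistryAttributes.YARN_PERSISTENCE,
            PersistencePolicies.APPLICATION);
        record.addExternalEndpoint(
            RegistryTypeUtils.webEndpoint("ui", trackingUri));
        // e.g. /users/<user>/org-apache-hadoop-distshell/<appId>
        String path = RegistryUtils.servicePath(
            user, "org-apache-hadoop-distshell", appId);
        registry.mknode(RegistryPathUtils.parentOf(path), true);
        registry.bind(path, record, BindFlags.OVERWRITE);
      } finally {
        registry.stop();
      }
    }
  }

Note that with step 1 done, the RM never deletes that record; it stays until 
the owner (or some client-side operation) removes it.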

Like I said, I can't commit this if I do the work myself. But: I can review, 
vote for, and commit the patch if someone else does it. Accordingly: if someone 
does want to sit down and do it, I will offer my services reviewing the code. 
And before they start, I'll even promise to update my current patch against 
hadoop-trunk.

Any volunteers?

-Steve
