RE: resolve the scalability problem caused by app monitoring in livy with an actor-based design

Arijit Tarafdar Wed, 16 Aug 2017 10:36:03 -0700

We need the following scenarios to be supported:

1. Individual application submission.
2. Individual application status query.
3. Batch application status query.
4. Batch application status query by application status.
5. Batch application status query by application user.
6. Batch application status query by application user and status.

More application status query filters can be added, but let's stop at that. 
We also need to add throttling on top of this; its absence in Livy creates 
lots of issues in production (that is for a subsequent discussion).
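
To make the six scenarios concrete, here is a minimal sketch of the query 
side, assuming a hypothetical AppStatus record (the field names are 
illustrative and not Livy's actual model). Scenarios 3 to 6 collapse into one 
function with two optional filters; scenario 2 is the same lookup narrowed to 
one id.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class AppStatus:
    app_id: str   # hypothetical field names, not Livy's actual model
    user: str
    state: str


def query_status(apps: Iterable[AppStatus],
                 user: Optional[str] = None,
                 state: Optional[str] = None) -> List[AppStatus]:
    """Batch status query; a filter left as None matches everything."""
    return [a for a in apps
            if (user is None or a.user == user)
            and (state is None or a.state == state)]


apps = [
    AppStatus("app_1", "alice", "RUNNING"),
    AppStatus("app_2", "alice", "FINISHED"),
    AppStatus("app_3", "bob", "RUNNING"),
]

running = query_status(apps, state="RUNNING")                 # scenario 4
alice_running = query_status(apps, user="alice", state="RUNNING")  # scenario 6
```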

@Marcelo
Now, given those requirements, suppose we have a YARN API that bulk-queries 
applications and also supports filtering. Are you proposing that Livy maintain 
an application state map (as a cache) that is updated by a single thread at a 
regular interval by querying YARN? If so, the problems I see are the 
following:

1. Livy keeps an additional copy of state that can be queried from YARN on 
request.
2. The design is not event-driven and may query YARN unnecessarily when no 
user/external request is pending.
3. There will always be an issue with stale data and update latency between 
actual YARN state and Livy state map.
4. Size and latency of the response in bulk querying YARN is unknown.
5. YARN bulk API needs to support filtering at the query level.
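
For reference, a minimal sketch of the cache-based design being questioned, 
assuming a hypothetical PollingStateCache and a fetch_all callable standing in 
for the bulk YARN query (neither exists in Livy under these names). It only 
illustrates problems 1 to 3: the extra copy, the unconditional polling, and 
the stale window between two polls.

```python
import threading
from typing import Callable, Dict


class PollingStateCache:
    """Hypothetical cache: one refresher writes, request threads read."""

    def __init__(self, fetch_all: Callable[[], Dict[str, str]]):
        self._fetch_all = fetch_all      # stands in for a bulk YARN query
        self._lock = threading.Lock()
        self._states: Dict[str, str] = {}

    def refresh(self) -> None:
        # Called by the single monitoring thread at a fixed interval,
        # whether or not any client request is pending (problem 2).
        snapshot = self._fetch_all()
        with self._lock:
            self._states = dict(snapshot)

    def get(self, app_id: str) -> str:
        # Served from the local copy (problem 1); may lag YARN (problem 3).
        with self._lock:
            return self._states.get(app_id, "UNKNOWN")


yarn = {"app_1": "ACCEPTED"}             # fake "actual YARN state"
cache = PollingStateCache(lambda: dict(yarn))
cache.refresh()
yarn["app_1"] = "RUNNING"                # YARN moves on
stale = cache.get("app_1")               # cache still holds the old state
cache.refresh()
fresh = cache.get("app_1")               # updated only after the next poll
```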

Thanks, Arijit

-----Original Message-----
From: Nan Zhu [mailto:zhunanmcg...@gmail.com] 
Sent: Wednesday, August 16, 2017 10:31 AM
To: dev@livy.incubator.apache.org
Subject: Re: resolve the scalability problem caused by app monitoring in livy 
with an actor-based design

> I really don't understand what you mean. You need somewhere to keep 
> the application handles you're monitoring regardless of the solution.
> The code making the YARN request needs to somehow update those 
> handles. Whether there's a task per handle that is submitted to a 
> thread pool, or some map or list tracking all available handles that 
> are then updated by the single thread talking to YARN, it doesn't 
> matter.

> In the first case your thread pool is the "shared data structure", in 
> the second case this map of handles is the "shared data structure", so 
> I don't understand why you think there is any difference.

I do not understand why you think there is no difference.

In your words, when the thread pool is the "shared data structure", we do not 
need any synchronization when applying CRUD operations to the handles. 
If you share "some map or list" between the servlet threads and the monitoring 
thread, you have to handle synchronization of this "map or list". And yes, 
with a single monitoring thread, life can be easier (the potential cons of a 
single thread handling everything through bulk operations are another topic in 
this email).
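
A minimal sketch of the task-per-handle style, assuming a hypothetical 
AppHandle class and a dict standing in for YARN. Each task writes only to the 
handle it was given, so the only structure shared across threads is the 
pool's internal work queue, which the library already synchronizes. This is 
an illustration of the distinction, not Livy's actual monitoring code.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class AppHandle:
    """Hypothetical handle; each one is written only by its own task."""

    def __init__(self, app_id: str):
        self.app_id = app_id
        self.state = "UNKNOWN"


def poll(handle: AppHandle, lookup) -> None:
    # One task per handle: the task updates only the handle it received,
    # so no map or list of handles is shared with the worker threads.
    handle.state = lookup(handle.app_id)


fake_yarn = {"app_1": "RUNNING", "app_2": "FINISHED"}   # stand-in for YARN
handles = [AppHandle("app_1"), AppHandle("app_2")]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Submit one polling task per handle and block until all complete.
    wait([pool.submit(poll, h, fake_yarn.get) for h in handles])

states = {h.app_id: h.state for h in handles}
```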

> I'm proposing a different approach that I'm pretty sure is easier on 
> YARN, which is a shared service that we should be trying not to 
> unnecessarily overload. The least I'd expect is for you to consider 
> the suggestion and actually explain why it wouldn't work, but so far 
> you've just been deflecting feedback.

> You can, for example, see if such a bulk API exists and reply "I 
> couldn't find it". I believe it must exist, after all I can go to the 
> RM web UI and see all applications, and get a list of them from the 
> YARN REST API. But if it doesn't exist, that would take care of my 
> suggestion.

You mixed two topics again.

Topic 1 - what you proposed: I keep trying to discuss the pros and cons of the 
single-thread model, and I have said multiple times that it can make life 
easier, but at the cost of additional effort on 1. synchronization over a 
map/list, 2. handling of exceptions (due to the dependency among all running 
apps), etc.

Topic 2 - how we discuss: I am surprised that you blamed my way of discussing 
before correcting your own attitude. When we are conducting a technical 
discussion, I believe you are not in a good position to keep it healthy. For 
example, when I said multiple threads can share the same RPC connections, to 
address your concern that multiple tasks would keep opening/closing 
connections, you replied "Irrelevant" without any explanation. Also, I didn't 
see any concrete evidence from you about why an actor-based solution is not an 
option. On the other hand, I have shown you my concerns about bulk operations, 
which are worth more discussion, even though so far I have only received 
feedback like "Irrelevant" or "how about datacenter is down".

> "I would investigate" is a suggestion that you investigate that as 
> part of proposing your change. It's not me saying that I'll do it 
> myself (that would be "I will investigate").

OK, I found it:
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
which is a RESTful API.

Two concerns:

1. Livy's current implementation is based on YarnClient, not the REST API. Are 
we going to change it?

2. Unlike the RM UI, which fetches only 20 applications per page, we would 
need to fetch all applications (since it looks like we cannot do a "not match" 
filter against application state through this API).
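
A minimal sketch of what a query against the Cluster Applications API could 
look like, assuming a placeholder ResourceManager address (rm-host:8088) and 
hypothetical helper names. The `states` and `user` query parameters and the 
`apps`/`app` response envelope come from the YARN REST documentation. Since 
`states` lists the states to include, one possible workaround for the missing 
"not match" filter is to enumerate the non-terminal states explicitly; 
whether that keeps the response small enough is still an open question.

```python
import json
from urllib.parse import urlencode

# Non-terminal YARN application states, listed explicitly because the
# `states` parameter names states to include, not states to exclude.
ACTIVE_STATES = ["NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED", "RUNNING"]


def apps_url(rm_base, states=None, user=None):
    """Build a Cluster Applications API URL; rm_base is a placeholder."""
    params = {}
    if states:
        params["states"] = ",".join(states)
    if user:
        params["user"] = user
    query = urlencode(params, safe=",")   # keep commas readable
    return f"{rm_base}/ws/v1/cluster/apps" + (f"?{query}" if query else "")


def parse_states(body):
    """Map application id -> state from a JSON response body."""
    apps = (json.loads(body).get("apps") or {}).get("app") or []
    return {a["id"]: a["state"] for a in apps}


url = apps_url("http://rm-host:8088", states=ACTIVE_STATES, user="alice")
sample = '{"apps": {"app": [{"id": "application_1_0001", "state": "RUNNING"}]}}'
parsed = parse_states(sample)
```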

> I'm expecting that errors be handled regardless of the situation. If 
> YARN returns an error to you, regardless of whether it was a request 
> for a single application status or for a bunch of them, your code 
> needs to handle it somehow. The handling will most probably be the 
> same in both cases (retry), and that's my point.

Yes, this is one possible solution. The pro is that it is simple and easy to 
handle; the con is that it makes whether application A's state is stale 
depend on all the other applications, which also needs more discussion.
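
A minimal sketch of the retry-the-whole-batch handling, assuming a 
hypothetical fetch_bulk callable in place of the real YARN call. It shows the 
coupling described above: when the bulk request fails, no application is 
refreshed, so application A stays stale even if only the batch as a whole was 
the problem.

```python
def refresh_all(fetch_bulk, known, attempts=3):
    """Retry one bulk fetch; on total failure every entry stays stale."""
    for _ in range(attempts):
        try:
            latest = fetch_bulk()       # hypothetical bulk YARN call
        except IOError:
            continue                    # retry the whole batch
        known.update(latest)
        return True
    return False                        # no application was refreshed


def always_down():
    raise IOError("RM unreachable")


calls = {"n": 0}


def flaky():
    # Fails once, then returns fresh states for every application.
    calls["n"] += 1
    if calls["n"] < 2:
        raise IOError("transient")
    return {"app_A": "RUNNING", "app_B": "FINISHED"}


known = {"app_A": "ACCEPTED", "app_B": "RUNNING"}
failed = refresh_all(always_down, known)   # every attempt fails
after_failure = dict(known)                # app_A and app_B both still stale
recovered = refresh_all(flaky, known)      # succeeds on the second attempt
```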

Again, please keep the technical discussion as professional as possible.

On Wed, Aug 16, 2017 at 9:44 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> On Wed, Aug 16, 2017 at 9:33 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
> >> What I proposed is having a single request to YARN to get all
> >> applications' statuses, if that's possible. You'd still have multiple
> >> application handles that are independent of each other. They'd all be
> >> updated separately from that one thread talking to YARN. This has
> >> nothing to do with a "shared data structure". There's no shared data
> >> structure here to track application status.
> >
> > You are still avoiding the questions how you make all "application
> > handles" accessible to this thread
>
> I really don't understand what you mean. You need somewhere to keep 
> the application handles you're monitoring regardless of the solution.
> The code making the YARN request needs to somehow update those 
> handles. Whether there's a task per handle that is submitted to a 
> thread pool, or some map or list tracking all available handles that 
> are then updated by the single thread talking to YARN, it doesn't 
> matter.
>
> In the first case your thread pool is the "shared data structure", in 
> the second case this map of handles is the "shared data structure", so 
> I don't understand why you think there is any difference.
>
> I'm proposing a different approach that I'm pretty sure is easier on 
> YARN, which is a shared service that we should be trying not to 
> unnecessarily overload. The least I'd expect is for you to consider 
> the suggestion and actually explain why it wouldn't work, but so far 
> you've just been deflecting feedback.
>
> You can, for example, see if such a bulk API exists and reply "I 
> couldn't find it". I believe it must exist, after all I can go to the 
> RM web UI and see all applications, and get a list of them from the 
> YARN REST API. But if it doesn't exist, that would take care of my 
> suggestion.
>
> > "I would investigate whether there's any API in YARN to do a bulk 
> > get of running applications with a particular filter;" - from your 
> > email
> >
> > If you suggest something, please find evidence to support you
>
> "I would investigate" is a suggestion that you investigate that as 
> part of proposing your change. It's not me saying that I'll do it 
> myself (that would be "I will investigate").
>
> >> What if YARN goes down? What if your datacenter has a massive power
> >> failure? You have to handle errors in any scenario.
> >
> > Again, I am describing one concrete scenario which is always 
> > involved in any bulk operation and even we go to bulk direction, you 
> > have to handle this. Since you proposed this bulk operation, I am 
> > asking you what's your expectation about this.
>
> I'm expecting that errors be handled regardless of the situation. If 
> YARN returns an error to you, regardless of whether it was a request 
> for a single application status or for a bunch of them, your code 
> needs to handle it somehow. The handling will most probably be the 
> same in both cases (retry), and that's my point.
>
> --
> Marcelo
>
