Re: resolve the scalability problem caused by app monitoring in livy with an actor-based design

Nan Zhu Wed, 16 Aug 2017 10:31:49 -0700

> I really don't understand what you mean. You need somewhere to keep the
> application handles you're monitoring regarding of the solution. The code
> making the YARN request needs to somehow update those handles. Whether
> there's a task per handle that is submitted to a thread pool, or some map
> or list tracking all available handles that are then updated by the single
> thread talking to YARN, it doesn't matter.


> In the first case your thread pool is the "shared data structure", in the
> second case this map of handles is the "shared data structure", so I don't
> understand why you think there is any difference.

I do not understand why you think there is no difference.

As you put it, when the thread pool is the "shared data structure", we do
not need any synchronization when applying CRUD operations to the handles,
because each handle is only touched by its own task. If you share "some map
or list" between the servlet threads and the monitoring thread, you have to
synchronize access to that map or list. And yes, with a single monitoring
thread life can be easier (the potential cons of a single thread handling
everything through bulk operations are another topic in this email).
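A minimal Java sketch of the per-handle model described above, assuming a hypothetical `fetchState` function that stands in for one YARN status request. Each handle owns its own periodic task, so creating or deleting a handle means scheduling or cancelling that task, and no map of handles is shared between threads. `MonitoredApp` and its methods are illustrative names, not Livy's API.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical application handle (not Livy's real class). Only this handle's
// own scheduled task writes `state`, so no shared map of handles is needed.
final class MonitoredApp {
    private final String appId;
    private volatile String state = "UNKNOWN";
    private volatile ScheduledFuture<?> poller;

    MonitoredApp(String appId) {
        this.appId = appId;
    }

    String state() {
        return state;
    }

    // "Create": submit one periodic task that polls only this application.
    // fetchState is a hypothetical stand-in for a single YARN status request.
    void start(ScheduledExecutorService pool, Function<String, String> fetchState, long periodMs) {
        poller = pool.scheduleWithFixedDelay(() -> {
            try {
                state = fetchState.apply(appId);
            } catch (RuntimeException e) {
                // Keep the last known state; an uncaught exception would
                // silently suppress all future runs of this periodic task.
            }
        }, 0, periodMs, TimeUnit.MILLISECONDS);
    }

    // "Delete": cancel this handle's task; other handles are unaffected.
    void stop() {
        ScheduledFuture<?> p = poller;
        if (p != null) {
            p.cancel(false);
        }
    }
}
```

In this sketch the pool size, not the number of handles, bounds how many status requests run concurrently.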

> I'm proposing a different approach that I'm pretty sure is easier on YARN,
> which is a shared service that we should be trying not to unnecessarily
> overload. The least I'd expect is for you to consider the suggestion and
> actually explain why it wouldn't work, but so far you've just been
> deflecting feedback.

> You can, for example, see if such a bulk API exists and reply "I couldn't
> find it". I believe it must exist, after all I can go to the RM web UI and
> see all applications, and get a list of them from the YARN REST API. But if
> it doesn't exist, that would take care of my suggestion.

You are mixing two topics again.

Topic 1 - what you proposed: I keep trying to discuss the pros and cons of
the single-thread model, and, as I have said multiple times, it can make
life easier, but at the cost of additional effort on 1. synchronization over
a map/list, 2. exception handling (because all running apps depend on the
same request), etc.
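A minimal sketch of the single-thread alternative, illustrating point 1. It assumes a hypothetical `bulkFetch` supplier that returns `appId -> state` for many applications in one request. Servlet threads call `register`/`unregister` while the monitoring thread calls `pollOnce`, so the registry must be thread-safe; here a `ConcurrentHashMap` is used, and `replace` (rather than `put`) avoids resurrecting a handle that a servlet thread removed mid-cycle. `BulkMonitor` is an illustrative name, not Livy's code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical single-thread monitor (illustrative, not Livy's code).
final class BulkMonitor {
    // Shared between servlet threads and the monitoring thread.
    private final Map<String, String> states = new ConcurrentHashMap<>();
    // Hypothetical bulk call: appId -> state for many applications at once.
    private final Supplier<Map<String, String>> bulkFetch;

    BulkMonitor(Supplier<Map<String, String>> bulkFetch) {
        this.bulkFetch = bulkFetch;
    }

    // Called from servlet threads.
    void register(String appId) {
        states.putIfAbsent(appId, "UNKNOWN");
    }

    void unregister(String appId) {
        states.remove(appId);
    }

    String state(String appId) {
        return states.get(appId);
    }

    // Called only from the single monitoring thread. Exceptions from the
    // bulk call propagate to the caller, which must decide how to handle them.
    void pollOnce() {
        Map<String, String> snapshot = bulkFetch.get();
        for (String appId : states.keySet()) { // weakly consistent iteration
            String s = snapshot.get(appId);
            if (s != null) {
                // replace() only updates keys still present, so a handle
                // unregistered during this cycle is not re-added.
                states.replace(appId, s);
            }
        }
    }
}
```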

Topic 2 - how we discuss: I am surprised that you criticized the way I
discuss before correcting your own attitude. In this technical discussion,
I do not think you are helping to keep it healthy. For example, when I said
that multiple threads can share the same RPC connections, which addresses
your concern that multiple tasks would keep opening and closing
connections, you replied "Irrelevant" without any explanation. I also have
not seen any concrete evidence from you about why the actor-based solution
is not an option. On the other hand, I have shown you my concerns about bulk
operations, which deserve more discussion, even though so far I have only
received feedback like "Irrelevant" or "what if the datacenter is
down"
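The point about sharing RPC connections can be sketched as follows, assuming a hypothetical thread-safe `StatusClient` whose construction stands in for opening a connection. Many monitoring tasks run on a pool, but they all call the same client instance, so the number of connections opened does not grow with the number of tasks. Both class names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical thread-safe client; constructing one models opening a connection.
final class StatusClient {
    static final AtomicInteger CONNECTIONS_OPENED = new AtomicInteger();

    StatusClient() {
        CONNECTIONS_OPENED.incrementAndGet();
    }

    // Stand-in for one status request over the already-open connection.
    String fetchState(String appId) {
        return "RUNNING";
    }
}

final class SharedClientPolling {
    // Polls every app on a thread pool, with all tasks sharing one client.
    static List<String> pollAll(StatusClient client, List<String> appIds, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String id : appIds) {
                tasks.add(() -> client.fetchState(id)); // no per-task open/close
            }
            List<String> states = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) {
                states.add(f.get());
            }
            return states;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```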


> "I would investigate" is a suggestion that you investigate that as part
> of proposing your change. It's not me saying that I'll do it myself (that
> would be "I will investigate").

OK, I found it. It is a RESTful API:
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API

Two concerns:

1. Livy's current implementation is based on YarnClient, not on the REST
API. Are we going to change that?

2. Unlike the RM UI, which fetches only 20 applications per page, we would
need to fetch all applications (since it looks like we cannot do a "not
match" filter on application state through this API).
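To make concern 2 concrete: if the bulk call can only return the full application list, the server has to download everything and then keep only what it tracks. A minimal sketch with a hypothetical `AppReport` type standing in for one entry of the response; the terminal-state set matches YARN's `FINISHED`, `FAILED`, and `KILLED` application states.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for one application entry in a bulk response.
final class AppReport {
    final String id;
    final String state;

    AppReport(String id, String state) {
        this.id = id;
        this.state = state;
    }
}

final class ClientSideFilter {
    // YARN's terminal application states.
    static final Set<String> TERMINAL = Set.of("FINISHED", "FAILED", "KILLED");

    // The whole cluster listing has already been transferred; only the
    // entries this server tracks are kept.
    static Map<String, String> trackedStates(List<AppReport> all, Set<String> tracked) {
        Map<String, String> out = new HashMap<>();
        for (AppReport r : all) {
            if (tracked.contains(r.id)) {
                out.put(r.id, r.state);
            }
        }
        return out;
    }

    // A "not match" on state, done client-side after the full download.
    static long countNonTerminal(List<AppReport> all) {
        return all.stream().filter(r -> !TERMINAL.contains(r.state)).count();
    }
}
```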


> I'm expecting that errors be handled regardless of the situation. If YARN
> returns an error to you, regardless of whether it was a request for a
> single application status or for a bunch of them, your code needs to handle
> it somehow. The handling will most probably be the same in both cases
> (retry), and that's my point.

Yes, this is one possible solution. The pro is that it is simple and easy
to handle; the con is that whether application A's state is stale then
depends on all the other applications, which also needs more discussion.
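The trade-off above can be sketched as a bulk poll with bounded retries, assuming the same kind of hypothetical `bulkFetch` call. Because one response feeds every handle, there is a single `lastSuccessMs` timestamp: when the bulk call fails, every tracked application's state becomes stale at the same rate, regardless of that application's own condition. `RetryingBulkPoller` is an illustrative name and is meant to be used from one monitoring thread only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical bulk poller with bounded retries (single-threaded use only).
final class RetryingBulkPoller {
    private final Supplier<Map<String, String>> bulkFetch; // hypothetical bulk call
    private final int maxAttempts;
    private final Map<String, String> lastKnown = new HashMap<>();
    private long lastSuccessMs = -1; // -1 means no successful poll yet

    RetryingBulkPoller(Supplier<Map<String, String>> bulkFetch, int maxAttempts) {
        this.bulkFetch = bulkFetch;
        this.maxAttempts = maxAttempts;
    }

    // Returns true if one of up to maxAttempts bulk calls succeeded.
    boolean poll(long nowMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                lastKnown.putAll(bulkFetch.get());
                lastSuccessMs = nowMs;
                return true;
            } catch (RuntimeException e) {
                // A real poller would log and back off before the next attempt.
            }
        }
        return false;
    }

    String state(String appId) {
        return lastKnown.get(appId);
    }

    // appId is deliberately unused: every application shares the same
    // staleness, because all of them depend on the same bulk response.
    long stalenessMs(String appId, long nowMs) {
        return lastSuccessMs < 0 ? Long.MAX_VALUE : nowMs - lastSuccessMs;
    }
}
```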


Again, please keep the technical discussion as professional as possible.



On Wed, Aug 16, 2017 at 9:44 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> On Wed, Aug 16, 2017 at 9:33 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
> >> What I proposed is having a single request to YARN to get all
> >> applications' statuses, if that's possible. You'd still have multiple
> >> application handles that are independent of each other. They'd all be
> >> updated separately from that one thread talking to YARN. This has
> >> nothing to do with a "shared data structure". There's no shared data
> >> structure here to track application status.
> >
> > You are still avoiding the questions how you make all "application
> > handles" accessible to this thread
>
> I really don't understand what you mean. You need somewhere to keep
> the application handles you're monitoring regarding of the solution.
> The code making the YARN request needs to somehow update those
> handles. Whether there's a task per handle that is submitted to a
> thread pool, or some map or list tracking all available handles that
> are then updated by the single thread talking to YARN, it doesn't
> matter.
>
> In the first case your thread pool is the "shared data structure", in
> the second case this map of handles is the "shared data structure", so
> I don't understand why you think there is any difference.
>
> I'm proposing a different approach that I'm pretty sure is easier on
> YARN, which is a shared service that we should be trying not to
> unnecessarily overload. The least I'd expect is for you to consider
> the suggestion and actually explain why it wouldn't work, but so far
> you've just been deflecting feedback.
>
> You can, for example, see if such a bulk API exists and reply "I
> couldn't find it". I believe it must exist, after all I can go to the
> RM web UI and see all applications, and get a list of them from the
> YARN REST API. But if it doesn't exist, that would take care of my
> suggestion.
>
> > "I would investigate whether there's any API in YARN to do a bulk get of
> > running applications with a particular filter;" - from your email
> >
> > If you suggest something, please find evidence to support you
>
> "I would investigate" is a suggestion that you investigate that as
> part of proposing your change. It's not me saying that I'll do it
> myself (that would be "I will investigate").
>
> >> What if YARN goes down? What if your datacenter has a massive power
> > failure? You have to handle errors in any scenario.
> >
> > Again, I am describing one concrete scenario which is always involved in
> > any bulk operation and even we go to bulk direction, you have to handle
> > this. Since you proposed this bulk operation, I am asking you what's your
> > expectation about this.
>
> I'm expecting that errors be handled regardless of the situation. If
> YARN returns an error to you, regardless of whether it was a request
> for a single application status or for a bunch of them, your code
> needs to handle it somehow. The handling will most probably be the
> same in both cases (retry), and that's my point.
>
> --
> Marcelo
>
