I like that approach on paper, although right now I don't have much
time to review the PR and provide decent feedback.

I think that regardless of the approach, one goal should probably be
to separate what is being monitored from how it's being monitored;
that way you can later make the monitoring code smarter without
having to change the rest of the code that calls into it. I remember
reviewing this code when it was first submitted, and it could
definitely use some refactoring.
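
To make that concrete, here's a rough sketch of the split I mean; the
names (AppMonitor, AppStatus, YarnPollingMonitor) are made up for
illustration, not anything in the current code:

import scala.concurrent.Future

// "What" is monitored: callers only ever depend on this.
case class AppStatus(state: String,
                     trackingUrl: Option[String],
                     diagnostics: Option[String])

trait AppMonitor {
  def status(appId: String): Future[AppStatus]
}

// "How" it is monitored: this implementation could poll YARN per app
// today and be replaced by a batched or REST-based one later, without
// touching any of the calling code.
class YarnPollingMonitor extends AppMonitor {
  override def status(appId: String): Future[AppStatus] =
    Future.successful(AppStatus("RUNNING", None, None)) // placeholder
}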

> we have Livy multi-node HA, i.e. Livy running on 6 servers for each cluster,

I'm not really familiar with how multi-node HA was implemented (I
stopped at session recovery), but why isn't a single server doing the
update and storing the results in ZK? Unless it's actually doing
load-balancing, it seems like that would avoid multiple servers having
to hit YARN.
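
Something along these lines is what I'm picturing; just a sketch using
Curator's LeaderLatch, where the paths, connect string, and
serialization are placeholders rather than Livy's actual ZK layout:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.curator.retry.ExponentialBackoffRetry

object SinglePollerSketch {
  def main(args: Array[String]): Unit = {
    val client = CuratorFrameworkFactory.newClient(
      "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    // Only one of the Livy servers wins the latch and polls YARN.
    // (A real implementation would use a LeaderLatchListener instead of
    // checking hasLeadership once.)
    val latch = new LeaderLatch(client, "/livy/yarn-poller-leader")
    latch.start()

    val path = "/livy/yarn-app-reports"
    if (latch.hasLeadership) {
      // The leader hits YARN and writes the results to a well-known path.
      val reports = "...".getBytes("UTF-8") // serialized reports (placeholder)
      if (client.checkExists().forPath(path) == null) {
        client.create().creatingParentsIfNeeded().forPath(path, reports)
      } else {
        client.setData().forPath(path, reports)
      }
    } else {
      // The other servers read what the leader wrote instead of
      // hitting YARN themselves.
      val cached = client.getData.forPath(path)
      println(new String(cached, "UTF-8"))
    }
  }
}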



On Wed, Aug 16, 2017 at 4:18 PM, Prabhu Kasinathan
<vasurampra...@gmail.com> wrote:
> As Meisam highlighted, in our case we have Livy multi-node HA, i.e. Livy
> running on 6 servers for each cluster, load-balanced, sharing Livy metadata
> on ZooKeeper, and running thousands of applications. With the changes below,
> we are seeing good improvements from batching the requests (one per Livy
> node) instead of each Livy node making multiple requests. Please review the
> changes and let us know if improvements are needed; we are also open to
> exploring an alternative option if it works better.
>
>> We are making one big request to get ApplicationReports. Then we make
>> individual requests on a thread pool to get the tracking URL, Spark UI URL,
>> YARN diagnostics, etc. for each application separately. For our cluster
>> settings and our workloads, one big request turned out to be the better
>> solution, but we were limited to the API provided by YarnClient. With the
>> home-made REST client a separate request is not needed, and that can change
>> the whole equation.
>
>
>
> On Wed, Aug 16, 2017 at 3:33 PM, Meisam Fathi <meisam.fa...@gmail.com>
> wrote:
>
>>
>> On Wed, Aug 16, 2017 at 2:09 PM Nan Zhu <zhunanmcg...@gmail.com> wrote:
>>
>>> As time goes on, the reply from YARN can only get larger and larger. Given
>>> a consistent workload pattern, the cost of a large query can eventually
>>> become larger than that of individual requests
>>>
>>
>> I am under the impression that there is a limit to the number of reports
>> that YARN retains, which is set by
>> yarn.resourcemanager.max-completed-applications
>> in yarn-site.xml and defaults to 10,000. But I could be wrong about the
>> semantics of yarn.resourcemanager.max-completed-applications.
>>
>>> I would say go with individual requests + a thread pool, or a large batch
>>> for all, first; if any performance issue is observed, add the optimization
>>> on top of it
>>>
>>
>> We are making one big request to get ApplicationReports. Then we make
>> individual requests on a thread pool to get the tracking URL, Spark UI URL,
>> YARN diagnostics, etc. for each application separately. For our cluster
>> settings and our workloads, one big request turned out to be the better
>> solution, but we were limited to the API provided by YarnClient. With the
>> home-made REST client a separate request is not needed, and that can change
>> the whole equation.
>>
>> @Prabhu, can you chime in?
>>
>>
>>> However, even with the REST API, there are some corner cases, e.g. a
>>> long-running app lasting for days (training some models), and some short
>>> ones which last only for minutes
>>>
>>
>> We are running Spark streaming jobs on Livy that virtually run forever.
>>
>> Thanks,
>> Meisam
>>
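
For reference, the batched call plus per-app lookups on a thread pool
described in the quoted messages would look roughly like this with the
stock YarnClient API; the pool size and the follow-up fields are
placeholders, not the actual PR code:

import java.util.concurrent.Executors
import scala.collection.JavaConverters._
import scala.concurrent.{ExecutionContext, Future}

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object YarnPollingSketch {
  def main(args: Array[String]): Unit = {
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // One big request: every ApplicationReport the RM still retains
    // (bounded by yarn.resourcemanager.max-completed-applications).
    val reports = yarnClient.getApplications().asScala

    // Per-application follow-ups on the thread pool, e.g. refreshing the
    // tracking URL and diagnostics for the apps Livy cares about.
    val details = reports.map { r =>
      Future {
        val fresh = yarnClient.getApplicationReport(r.getApplicationId)
        (fresh.getApplicationId.toString, fresh.getTrackingUrl, fresh.getDiagnostics)
      }
    }
  }
}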



-- 
Marcelo
