I like that approach on paper, although I currently don't have much time to review the PR and provide decent feedback.
I think that regardless of the approach, one goal should probably be to separate what is being monitored from how it's being monitored; that way you can later change the monitoring code to be smarter without having to change the rest of the code that calls into it. I remember reviewing this code when it was first submitted, and it could definitely use some refactoring.

> we have Livy Multi-Node HA i.e livy running on 6 servers for each cluster,

I'm not really familiar with how multi-node HA was implemented (I stopped at session recovery), but why isn't a single server doing the update and storing the results in ZK? Unless it's actually doing load balancing, it seems like that would avoid multiple servers having to hit YARN.

On Wed, Aug 16, 2017 at 4:18 PM, Prabhu Kasinathan <vasurampra...@gmail.com> wrote:

> As Meisam highlighted, in our case we have Livy multi-node HA, i.e. Livy
> running on 6 servers for each cluster, load-balanced, sharing Livy metadata
> on ZooKeeper and running thousands of applications. With the changes below,
> we are seeing good improvements from batching the requests (one per Livy
> node) instead of each Livy node making multiple requests. Please review the
> changes and let us know if improvements are needed; we are also open to
> exploring other alternatives if they work better.
>
>> We are making one big request to get ApplicationReports. Then we make an
>> individual request, on a thread pool, to get the tracking URL, Spark UI URL,
>> YARN diagnostics, etc. for each application separately. For our cluster
>> settings and our workloads, one big request turned out to be the better
>> solution. But we were limited to the API provided by YarnClient. With the
>> home-made REST client a separate request is not needed, and that can change
>> the whole equation.
>
>
> On Wed, Aug 16, 2017 at 3:33 PM, Meisam Fathi <meisam.fa...@gmail.com> wrote:
>
>> On Wed, Aug 16, 2017 at 2:09 PM Nan Zhu <zhunanmcg...@gmail.com> wrote:
>>
>>> As time goes on, the reply from YARN can only get larger and larger. Given
>>> a consistent workload pattern, the cost of a large query can eventually
>>> become greater than that of individual requests.
>>
>> I am under the impression that there is a limit to the number of reports
>> that YARN retains, which is set by
>> yarn.resourcemanager.max-completed-applications
>> in yarn-site.xml and defaults to 10,000. But I could be wrong about the
>> semantics of yarn.resourcemanager.max-completed-applications.
>>
>>> I would say go with individual requests + a thread pool, or one large batch
>>> for all, first; if any performance issue is observed, add the optimization
>>> on top of it.
>>
>> We are making one big request to get ApplicationReports. Then we make an
>> individual request, on a thread pool, to get the tracking URL, Spark UI URL,
>> YARN diagnostics, etc. for each application separately. For our cluster
>> settings and our workloads, one big request turned out to be the better
>> solution. But we were limited to the API provided by YarnClient. With the
>> home-made REST client a separate request is not needed, and that can change
>> the whole equation.
>>
>> @Prabhu, can you chime in?
>>
>>> However, even with the REST API, there are some corner cases, e.g. a
>>> long-running app lasting for days (training some models), and some short
>>> ones which last only minutes.
>>
>> We are running Spark streaming jobs on Livy that virtually run forever.
>>
>> Thanks,
>> Meisam

-- 
Marcelo
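For illustration only (not code from the thread or from Livy itself), here is a minimal Scala sketch of the separation being suggested: callers depend on a small interface that says *what* to monitor, while the batched getApplications() strategy and the per-app getApplicationReport() + thread-pool strategy discussed above become interchangeable implementations. The trait and class names (AppStatusSource, BatchedYarnSource, PerAppYarnSource) are hypothetical; the YarnClient calls are the standard Hadoop API mentioned in the thread.

```scala
import java.util.EnumSet
import java.util.concurrent.Executors

import scala.collection.JavaConverters._
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

import org.apache.hadoop.yarn.api.records.{ApplicationId, ApplicationReport, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient

// Hypothetical interface: callers only state *what* they want monitored.
// *How* the reports are fetched (one batch vs. one request per app) stays
// behind this boundary and can change without touching the callers.
trait AppStatusSource {
  def reports(appIds: Set[ApplicationId]): Map[ApplicationId, ApplicationReport]
}

// One big getApplications() call, filtered down to the apps Livy cares about.
class BatchedYarnSource(client: YarnClient) extends AppStatusSource {
  override def reports(appIds: Set[ApplicationId]): Map[ApplicationId, ApplicationReport] = {
    client
      .getApplications(Set("SPARK").asJava, EnumSet.allOf(classOf[YarnApplicationState]))
      .asScala
      .filter(r => appIds.contains(r.getApplicationId))
      .map(r => r.getApplicationId -> r)
      .toMap
  }
}

// One getApplicationReport() call per app, fanned out over a thread pool.
class PerAppYarnSource(client: YarnClient, poolSize: Int = 8) extends AppStatusSource {
  private implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(poolSize))

  override def reports(appIds: Set[ApplicationId]): Map[ApplicationId, ApplicationReport] = {
    val futures = appIds.toSeq.map { id =>
      Future(id -> client.getApplicationReport(id))
    }
    Await.result(Future.sequence(futures), 60.seconds).toMap
  }
}
```

With a boundary like this, swapping the batched strategy for the per-app one (or for a REST-based client later) is a one-line change where the source is constructed, rather than a change to every call site that polls YARN.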