Hi Maxim, I have submitted a first patch closely following your initial proposal. The patch needs another iteration or two, so please let me know what you think.
https://reviews.apache.org/r/35498/

Thanks,
Stephan
________________________________________
From: Maxim Khutornenko <ma...@apache.org>
Sent: Monday, June 1, 2015 6:30 PM
To: dev@aurora.apache.org
Subject: Re: non-prod SLA stats

Hi Stephan,

Thanks for your analysis. I must mention, though, that the SLA algorithms were optimized for readability rather than CPU performance. Given the current minutely run cycle, I would not be concerned about calculation delay unless it threatens to break the schedule. In a large cluster with tens of thousands of SLA metrics (both prod and non-prod), the average observed SLA run time is around 4 seconds, which gives us plenty of headroom for growth.

I am more concerned about the heap space used to store computed counters here. This may quickly become a bottleneck and, as a side effect, make our /vars endpoint unusable. Hence the suggestion to make non-essential stats fully toggle-able.

That said, if you envision a different use case with a much larger metric set or anticipate a more frequent run schedule, feel free to propose patches. I'd also encourage you to invest some time in an SLA benchmark using our JMH-based harness to back your changes with real perf data.

Thanks,
Maxim

On Mon, Jun 1, 2015 at 3:26 AM, Erb, Stephan <stephan....@blue-yonder.com> wrote:
> Hi Maxim,
>
> introducing some toggles for metric collection should definitely work and can
> be contributed via a simple pull request.
>
> However, if you are only concerned about a potential performance hit, we
> might as well think about tuning the existing metric calculation. I have
> skimmed the code, and there seem to be several more or less low-hanging
> fruits:
>
> * The uptime computation performs the task enumeration and sorting operation
> for every percentile, whereas this only needs to be done once.
> * The current approach used to compute a percentile takes O(n log n) time.
> There are alternative solutions running in only O(n) time.
> * There are some unnecessary allocations, i.e., SlaUtil.percentile() is
> always called on a temporary list. However, the first thing it does is to
> create a copy of that list.
>
> How about this: I will file a ticket for non-prod SLA stats and contribute a
> simple patch with toggles. If it turns out that these are unusable at
> Twitter scale, we can look into basic performance tuning.
>
> Best Regards,
> Stephan
> ________________________________________
> From: Maxim Khutornenko <ma...@apache.org>
> Sent: Friday, May 29, 2015 7:23 PM
> To: dev@aurora.apache.org
> Subject: Re: non-prod SLA stats
>
> Hi Stephan,
>
> Tracking the same set of metrics for all non-prod jobs could be
> somewhat expensive on both collection and consumption sides. The only
> metrics we currently chose to collect are MTTA/R to help us monitor
> scheduling rate in view of reduced cluster capacity (AURORA-774).
> Perhaps we could put non-prod collection behind a set of command line
> switches (Arg<Boolean>)? E.g.:
>
> SLA_COLLECT_NON_PROD_MEDIANS
> SLA_COLLECT_NON_PROD_JOB_UPTIMES
> SLA_COLLECT_NON_PROD_PLATFORM_UPTIMES
>
> These could be defined in SlaModule and injected into MetricCalculator
> to let us finely tune the required non-prod collection set. What do
> you think?
>
> Thanks,
> Maxim
>
> On Fri, May 29, 2015 at 7:09 AM, Erb, Stephan
> <stephan....@blue-yonder.com> wrote:
>> Hi everyone,
>>
>> we are interested in the job uptime percentiles and the aggregate
>> cluster uptime percentage not only for production jobs, but also for our
>> non-production jobs.
>>
>> Are there any reasons why those are not available in a non-prod version,
>> similar to the current handling of MTTA and MTTR [1]? If there are no
>> objections, I will prepare a patch.
>>
>> Regards,
>> Stephan
>>
>> [1]
>> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java#L69
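
Illustration for the first tuning point in the June 1 message (sorting once instead of once per percentile): a minimal Java sketch, not Aurora's SlaUtil. The class name PercentileSketch, the method percentiles(), and the nearest-rank percentile definition are assumptions made for this example; Aurora's own definition may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aurora.
public final class PercentileSketch {
  private PercentileSketch() {}

  /**
   * Computes several percentiles from one sorted copy of the samples.
   * Assumes the nearest-rank definition and each p in (0, 100].
   */
  public static long[] percentiles(List<Long> samples, double... ps) {
    if (samples.isEmpty()) {
      throw new IllegalArgumentException("samples must not be empty");
    }

    // Copy and sort exactly once: O(n log n), independent of how many
    // percentiles are requested.
    long[] sorted = new long[samples.size()];
    int i = 0;
    for (long sample : samples) {
      sorted[i++] = sample;
    }
    Arrays.sort(sorted);

    // Each percentile is then a constant-time index lookup.
    long[] result = new long[ps.length];
    for (int k = 0; k < ps.length; k++) {
      int rank = (int) Math.ceil(ps[k] * sorted.length / 100.0);
      result[k] = sorted[Math.max(rank, 1) - 1];
    }
    return result;
  }
}
```

With this shape, requesting more percentiles adds only index lookups. A selection algorithm such as quickselect could bring a single percentile down to O(n) on average, at the cost of more code.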