Hi Maxim, introducing some toggles for metric collection should definitely work and can be contributed via a simple pull request.
However, if you are only concerned about a potential performance hit, we might as well think about tuning the existing metric calculation. I have skimmed the code, and there seem to be several more or less low-hanging fruits:

* The uptime computation performs the task enumeration and sorting operation for every percentile, whereas this only needs to be done once.
* The current approach used to compute a percentile takes O(n log n) time. There are alternative solutions running in only O(n) time.
* There are some unnecessary allocations, i.e., SlaUtil.percentile() is always called on a temporary list. However, the first thing it does is to create a copy of that list.

How about this: I will file a ticket for non-prod SLA stats and contribute a simple patch with toggles. If it turns out that these are unusable at Twitter scale, we can look into basic performance tuning.

Best Regards,
Stephan

________________________________________
From: Maxim Khutornenko <[email protected]>
Sent: Friday, May 29, 2015 7:23 PM
To: [email protected]
Subject: Re: non-prod SLA stats

Hi Stephan,

Tracking the same set of metrics for all non-prod jobs could be somewhat expensive on both collection and consumption sides. The only metrics we currently chose to collect are MTTA/R to help us monitor scheduling rate in view of reduced cluster capacity (AURORA-774).

Perhaps we could put non-prod collection behind a set of command line switches (Arg<Boolean>)? E.g.:

SLA_COLLECT_NON_PROD_MEDIANS
SLA_COLLECT_NON_PROD_JOB_UPTIMES
SLA_COLLECT_NON_PROD_PLATFORM_UPTIMES

These could be defined in SlaModule and injected into MetricCalculator to let us finely tune the required non-prod collection set.

What do you think?

Thanks,
Maxim

On Fri, May 29, 2015 at 7:09 AM, Erb, Stephan <[email protected]> wrote:
> Hi everyone,
>
> we are interested in the job uptime percentiles and the aggregate cluster
> uptime percentage not only for production jobs, but also for our
> non-production jobs.
>
> Are there any reasons why those are not available in a non-prod version,
> similar to the current handling of mtta and mttr [1]? If there are no
> objections, I will prepare a patch.
>
> Regards,
> Stephan
>
> [1]
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java#L69
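
For reference, the O(n) percentile idea from the reply at the top of this thread could look roughly like the sketch below. It uses quickselect with a random pivot, which runs in expected (not worst-case) linear time. The class name QuickselectPercentile, the long[] input, and the nearest-rank percentile definition are all assumptions made for illustration; this is not the existing SlaUtil.percentile() code and may not match its exact semantics.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper, not part of Aurora: percentile via quickselect. */
final class QuickselectPercentile {
  private QuickselectPercentile() {}

  /**
   * Returns the nearest-rank percentile of values, for p in (0, 100].
   * The array is reordered in place, so no defensive copy is made.
   */
  static long percentile(long[] values, double p) {
    if (values.length == 0) {
      throw new IllegalArgumentException("values must not be empty");
    }
    if (p <= 0.0 || p > 100.0) {
      throw new IllegalArgumentException("p must be in (0, 100]");
    }
    // 1-based nearest rank, converted to a 0-based index.
    int rank = (int) Math.ceil(p * values.length / 100.0);
    return select(values, Math.max(rank, 1) - 1);
  }

  /** Returns the k-th smallest element (0-based) in expected O(n) time. */
  static long select(long[] a, int k) {
    int lo = 0;
    int hi = a.length - 1;
    while (lo < hi) {
      long pivot = a[ThreadLocalRandom.current().nextInt(lo, hi + 1)];
      int i = lo;
      int j = hi;
      // Hoare-style partition around the pivot value.
      while (i <= j) {
        while (a[i] < pivot) {
          i++;
        }
        while (a[j] > pivot) {
          j--;
        }
        if (i <= j) {
          long tmp = a[i];
          a[i] = a[j];
          a[j] = tmp;
          i++;
          j--;
        }
      }
      // Now a[lo..j] <= pivot <= a[i..hi]; anything strictly between j and i equals pivot.
      if (k <= j) {
        hi = j;
      } else if (k >= i) {
        lo = i;
      } else {
        return a[k];
      }
    }
    return a[k];
  }
}
```

Because the selection works in place, a caller that already passes a temporary collection would not need a second copy. If several percentiles are needed for the same data set, sorting once and indexing into the sorted array may still be the simpler option.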
