Hi Maxim,

introducing some toggles for metric collection should definitely work and can 
be contributed via a simple pull request. 

However, if you are only concerned about a potential performance hit, we might 
as well think about tuning the existing metric calculation. I have skimmed the 
code, and there seems to be some low-hanging fruit:

* The uptime computation performs the task enumeration and sorting operation 
for every percentile, whereas this only needs to be done once.
* The current approach used to compute a percentile takes O(n log n) time. 
There are alternative solutions running in only O(n) time.
* There are some unnecessary allocations, e.g., SlaUtil.percentile() is always 
called on a temporary list, yet the first thing it does is create a copy of 
that list.
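
To make the first two points concrete, here is a minimal sketch. It is not 
the actual SlaUtil code: the class name PercentileSketch, its methods, and the 
nearest-rank percentile definition are assumptions chosen for illustration. 
Variant 1 sorts once and reads every percentile from the same array. Variant 2 
uses quickselect, which runs in expected (not worst-case) O(n) time for a 
single percentile.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of Aurora.
final class PercentileSketch {
  private static final Random RANDOM = new Random();

  private PercentileSketch() { }

  // Nearest-rank index for percentile p (0 < p <= 100) over n samples.
  // This definition is an assumption; SlaUtil may define percentiles differently.
  static int rankIndex(int n, double p) {
    if (n == 0) {
      throw new IllegalArgumentException("no samples");
    }
    int rank = (int) Math.ceil(p / 100.0 * n);
    return Math.max(rank, 1) - 1;
  }

  // Variant 1: one copy, one sort, then any number of percentile lookups.
  static long[] percentilesSortedOnce(long[] samples, double... percentiles) {
    long[] sorted = samples.clone();
    Arrays.sort(sorted);
    long[] result = new long[percentiles.length];
    for (int i = 0; i < percentiles.length; i++) {
      result[i] = sorted[rankIndex(sorted.length, percentiles[i])];
    }
    return result;
  }

  // Variant 2: quickselect, expected O(n). Returns the k-th smallest value
  // (0-based) and reorders `values` in place, so no extra copy is needed
  // when the caller already passes a temporary array.
  static long select(long[] values, int k) {
    int lo = 0;
    int hi = values.length - 1;
    while (lo < hi) {
      long pivot = values[lo + RANDOM.nextInt(hi - lo + 1)];
      int i = lo;
      int j = hi;
      // Hoare-style partition around the pivot value.
      while (i <= j) {
        while (values[i] < pivot) {
          i++;
        }
        while (values[j] > pivot) {
          j--;
        }
        if (i <= j) {
          long tmp = values[i];
          values[i] = values[j];
          values[j] = tmp;
          i++;
          j--;
        }
      }
      if (k <= j) {
        hi = j;          // k-th element is in the left part
      } else if (k >= i) {
        lo = i;          // k-th element is in the right part
      } else {
        return values[k]; // k sits between the parts, so it equals the pivot
      }
    }
    return values[k];
  }
}
```

When several percentiles are needed for the same task set, variant 1 is 
probably the simpler win; variant 2 only pays off when a single percentile is 
requested per data set.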

How about this: I will file a ticket for non-prod SLA stats and contribute a 
simple patch with toggles. If it turns out that these are unusable at 
Twitter scale, we can look into basic performance tuning.

Best Regards,
Stephan
________________________________________
From: Maxim Khutornenko <[email protected]>
Sent: Friday, May 29, 2015 7:23 PM
To: [email protected]
Subject: Re: non-prod SLA stats

Hi Stephan,

Tracking the same set of metrics for all non-prod jobs could be
somewhat expensive on both collection and consumption sides. The only
metrics we currently chose to collect are MTTA/R to help us monitor
scheduling rate in view of reduced cluster capacity (AURORA-774).
Perhaps we could put non-prod collection behind a set of command line
switches (Arg<Boolean>)? E.g.:

SLA_COLLECT_NON_PROD_MEDIANS
SLA_COLLECT_NON_PROD_JOB_UPTIMES
SLA_COLLECT_NON_PROD_PLATFORM_UPTIMES

These could be defined in SlaModule and injected into MetricCalculator
to let us finely tune the required non-prod collection set. What do
you think?
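
A rough sketch of how such switches might gate collection, using plain
booleans rather than the real Arg<Boolean> wiring. The class
NonProdSlaToggles, the Category enum, and the method name are hypothetical
and only illustrate the idea of selecting a non-prod collection set:

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration, not Aurora code: maps three boolean switches
// to the set of non-prod metric categories that should be collected.
final class NonProdSlaToggles {
  enum Category { MEDIANS, JOB_UPTIMES, PLATFORM_UPTIMES }

  private NonProdSlaToggles() { }

  static Set<Category> enabledCategories(
      boolean collectMedians,
      boolean collectJobUptimes,
      boolean collectPlatformUptimes) {

    Set<Category> enabled = EnumSet.noneOf(Category.class);
    if (collectMedians) {
      enabled.add(Category.MEDIANS);
    }
    if (collectJobUptimes) {
      enabled.add(Category.JOB_UPTIMES);
    }
    if (collectPlatformUptimes) {
      enabled.add(Category.PLATFORM_UPTIMES);
    }
    return enabled;
  }
}
```

A calculator could then skip any non-prod metric whose category is not in
the returned set.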

Thanks,
Maxim

On Fri, May 29, 2015 at 7:09 AM, Erb, Stephan
<[email protected]> wrote:
> Hi everyone,
>
> we are interested in the job uptime percentiles and the aggregate cluster 
> uptime percentage not only for production jobs, but also for our 
> non-production jobs.
>
> Are there any reasons why those are not available in a non-prod version, 
> similar to the current handling of mtta and mttr [1]?  If there are no 
> objections, I will prepare a patch.
>
> Regards,
> Stephan
>
> [1] 
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java#L69
