Re: non-prod SLA stats

Erb, Stephan Tue, 16 Jun 2015 02:14:44 -0700

Hi Maxim,

I have submitted a first patch closely following your initial proposal.  The 
patch needs another iteration or two, so please let me know what you think.


https://reviews.apache.org/r/35498/ 

Thanks,
Stephan

________________________________________
From: Maxim Khutornenko <[email protected]>
Sent: Monday, June 1, 2015 6:30 PM
To: [email protected]
Subject: Re: non-prod SLA stats

Hi Stephan,

Thanks for your analysis. I must mention, though, that SLA algorithms
were optimized for readability rather than CPU performance.

Given the current minutely run cycle, I would not be concerned about
calculation delay unless it threatens to break the schedule. In a
large cluster with tens of thousands of SLA metrics (both prod and
non-prod) the average observed SLA run time is around 4 seconds, which
gives us plenty of headroom for growth.

I am more concerned about the heap space used to store computed
counters here. This may quickly become a bottleneck and as a side
effect make our /vars endpoint unusable. Hence, the suggestion to make
non-essential stats fully toggle-able.

That said, if you envision a different use case with a much larger
metric set or anticipate a more frequent run schedule - feel free to
propose patches. I'd also encourage you to invest some time into an SLA
benchmark using our JMH-based harness to back your changes with real
perf data.

Thanks,
Maxim

On Mon, Jun 1, 2015 at 3:26 AM, Erb, Stephan
<[email protected]> wrote:
> Hi Maxim,
>
> introducing some toggles for metric collection should definitely work and can 
> be contributed via a simple pull request.
>
> However, if you are only concerned about a potential performance hit, we 
> might as well think about tuning the existing metric calculation. I have 
> skimmed the code, and there seem to be several more or less low-hanging 
> fruits:
>
> * The uptime computation performs the task enumeration and sorting operation 
> for every percentile, whereas this only needs to be done once.
> * The current approach used to compute a percentile takes O(n log n) time. 
> There are alternative solutions running in only O(n) time.
> * There are some unnecessary allocations, i.e., SlaUtil.percentile() is 
> always called on a temporary list. However, the first thing it does is to 
> create a copy of that list.
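
The first two points above can be illustrated with a minimal sketch. It is not the Aurora implementation: the class name, the method names, and the nearest-rank percentile definition are all assumptions chosen for illustration, and SlaUtil.percentile() may define percentiles differently. The first method sorts once and reads off every requested percentile; the second uses quickselect, which runs in O(n) on average (O(n^2) in the worst case with this simple pivot choice).

```java
import java.util.Arrays;

// Hypothetical helper, not part of Aurora.
final class PercentileSketch {
  private PercentileSketch() {}

  // Nearest-rank index for percentile p (0 < p <= 100) in n sorted values.
  // This percentile definition is an assumption made for the sketch.
  static int rank(int n, double p) {
    if (n == 0) {
      throw new IllegalArgumentException("no values");
    }
    int r = (int) Math.ceil(p / 100.0 * n) - 1;
    return Math.max(0, Math.min(n - 1, r));
  }

  // Sort a copy once, then look up each percentile: O(n log n) in total
  // instead of O(n log n) per percentile.
  static long[] percentilesSortOnce(long[] values, double... ps) {
    long[] sorted = values.clone();
    Arrays.sort(sorted);
    long[] out = new long[ps.length];
    for (int i = 0; i < ps.length; i++) {
      out[i] = sorted[rank(sorted.length, ps[i])];
    }
    return out;
  }

  // Quickselect on a copy: average O(n) for a single percentile.
  static long percentileQuickselect(long[] values, double p) {
    long[] a = values.clone();
    int k = rank(a.length, p);
    int lo = 0;
    int hi = a.length - 1;
    while (lo < hi) {
      long pivot = a[lo + (hi - lo) / 2];
      int i = lo;
      int j = hi;
      // Hoare-style partition around the pivot value.
      while (i <= j) {
        while (a[i] < pivot) {
          i++;
        }
        while (a[j] > pivot) {
          j--;
        }
        if (i <= j) {
          long tmp = a[i];
          a[i] = a[j];
          a[j] = tmp;
          i++;
          j--;
        }
      }
      if (k <= j) {
        hi = j;       // target lies in the left part
      } else if (k >= i) {
        lo = i;       // target lies in the right part
      } else {
        break;        // a[k] already equals the pivot
      }
    }
    return a[k];
  }
}
```

When several percentiles are needed for the same task set, the sort-once variant is usually the simpler choice; quickselect mainly pays off when only one percentile is requested.
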
>
> How about this: I will file a ticket for non-prod SLA stats and contribute a 
> simple patch with toggles. If it turns out that these are unusable at 
> Twitter scale, we can look into basic performance tuning.
>
> Best Regards,
> Stephan
> ________________________________________
> From: Maxim Khutornenko <[email protected]>
> Sent: Friday, May 29, 2015 7:23 PM
> To: [email protected]
> Subject: Re: non-prod SLA stats
>
> Hi Stephan,
>
> Tracking the same set of metrics for all non-prod jobs could be
> somewhat expensive on both collection and consumption sides. The only
> metrics we currently chose to collect are MTTA/R to help us monitor
> scheduling rate in view of reduced cluster capacity (AURORA-774).
> Perhaps we could put non-prod collection behind a set of command line
> switches (Arg<Boolean>)? E.g.:
>
> SLA_COLLECT_NON_PROD_MEDIANS
> SLA_COLLECT_NON_PROD_JOB_UPTIMES
> SLA_COLLECT_NON_PROD_PLATFORM_UPTIMES
>
> These could be defined in SlaModule and injected into MetricCalculator
> to let us finely tune the required non-prod collection set. What do
> you think?
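
As a rough illustration of the toggle idea above, the three switches could gate which non-prod metric groups get computed. This is a plain-Java sketch under assumptions: it uses ordinary booleans rather than Arg<Boolean>, and the class, enum, and method names are hypothetical and do not reflect the actual SlaModule or MetricCalculator code.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch, not Aurora code.
final class NonProdSlaToggles {
  private NonProdSlaToggles() {}

  // One entry per proposed switch.
  enum Metric { MEDIANS, JOB_UPTIMES, PLATFORM_UPTIMES }

  // Translates the three boolean switches into the set of non-prod
  // metric groups a calculator would be allowed to collect.
  static Set<Metric> enabled(
      boolean collectMedians,
      boolean collectJobUptimes,
      boolean collectPlatformUptimes) {

    Set<Metric> result = EnumSet.noneOf(Metric.class);
    if (collectMedians) {
      result.add(Metric.MEDIANS);
    }
    if (collectJobUptimes) {
      result.add(Metric.JOB_UPTIMES);
    }
    if (collectPlatformUptimes) {
      result.add(Metric.PLATFORM_UPTIMES);
    }
    return result;
  }
}
```

A calculator holding the resulting set could skip any non-prod metric group that is absent from it, so disabled stats are never computed or exported.
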
>
> Thanks,
> Maxim
>
> On Fri, May 29, 2015 at 7:09 AM, Erb, Stephan
> <[email protected]> wrote:
>> Hi everyone,
>>
>> we are interested in the job uptime percentiles and the aggregate 
>> cluster uptime percentage not only for production jobs, but also for our 
>> non-production jobs.
>>
>> Are there any reasons why those are not available in a non-prod version, 
>> similar to the current handling of mtta and mttr [1]?  If there are no 
>> objections, I will prepare a patch.
>>
>> Regards,
>> Stephan
>>
>> [1] 
>> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java#L69
