Re: [openstack-dev] [Ceilometer] Approximate alarming

2014-04-28 Thread Eoghan Glynn


- Original Message -
 Hey everyone!
 
 I’d like to get your gut reaction on an idea for the future of alarming.
 Should I or should I not put it up for debate at the design summit?

Hi Nejc,

Yes this is certainly worthy of discussion at the design summit.

Because the algorithms being discussed are quite technical, it
would be good to prepare an etherpad in advance with a gentle
introduction to the ideas involved, e.g. some worked examples etc.

This would make for a more profitable discussion at summit, where
you don't have to burn too much time explaining the intricacies of
the streaming approaches.

Cheers,
Eoghan


 ---TL;DR
 Online algorithms for computing stream statistics over sliding windows would
 allow us to provide sample statistics within an error bound (e.g. The
 average cpu utilization in the last hour was 85% +/- 1%”), while
 significantly reducing the load and memory requirements of the computation.
 —
 
 Alarm evaluation currently recalculates the aggregate values each time the
 alarm is evaluated, which is problematic because of the load it puts on the
 system. There have been multiple ideas on how to solve this problem, from
 precalculating aggregate values
 (https://wiki.openstack.org/wiki/Ceilometer/Alerting#Precalculation_of_aggregate_values)
 to re-architecting the alarms into the sample pipeline
 (https://wiki.openstack.org/wiki/Ceilometer/AlarmImprovements). While
 Sandy's suggestions make sense from the performance viewpoint, the problem
 of scalability remains. Samples in the pipeline need to be kept in-memory
 for the whole evaluation window, which requires O(N) memory for a window of
 size N.
 
 We could tackle this problem by using cutting edge research in streaming
 algorithms, namely the papers by Datar et al. [1], and Arasu et al. [2].
 They provide algorithms for computing stream statistics over sliding
 windows, such as *count, avg, min, max* and even *percentile*, **online**
 and with polylogarithmic space requirements. The tradeoff is of course
 precision, but the algorithms are bounded on the relative error - which
 could be specified by the user.
 
 If we can tell the user The average cpu utilization in the last hour was 85%
 +/- 1%, would that not be enough for most use cases, while severely
 reducing the load on the system? We could still support *error_rate=0*,
 which would simply use O(N) space and provide a precise answer for the cases
 where such an answer is needed.
 
 These algorithms were developed with telcos and computer network monitoring
 in mind, in which information about current network performance—latency,
 bandwidth, etc.—is generated online and is used to monitor and adjust
 network performance dynamically[1]. IIUC the main user of alarms is Heat
 autoscaling, which is exactly the kind of problem suitable to 'soft'
 calculations, with a certain tolerance for error.
 
 [1] Datar, Mayur, et al. Maintaining stream statistics over sliding
 windows. *SIAM Journal on Computing* 31.6 (2002): 1794-1813. PDF @
 http://ilpubs.stanford.edu:8090/504/1/2001-34.pdf
 
 [2] Arasu, Arvind, and Gurmeet Singh Manku. Approximate counts and quantiles
 over sliding windows. *Proceedings of the twenty-third ACM
 SIGMOD-SIGACT-SIGART symposium on Principles of database systems.* ACM,
 2004. PDF @ http://ilpubs.stanford.edu:8090/624/1/2003-72.pdf
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Ceilometer] Approximate alarming

2014-04-17 Thread Nejc Saje
Hey everyone!

I’d like to get your gut reaction on an idea for the future of alarming. Should 
I or should I not put it up for debate at the design summit?

---TL;DR
Online algorithms for computing stream statistics over sliding windows would 
allow us to provide sample statistics within an error bound (e.g. The average 
cpu utilization in the last hour was 85% +/- 1%”), while significantly reducing 
the load and memory requirements of the computation.
—

Alarm evaluation currently recalculates the aggregate values each time the 
alarm is evaluated, which is problematic because of the load it puts on the 
system. There have been multiple ideas on how to solve this problem, from 
precalculating aggregate values 
(https://wiki.openstack.org/wiki/Ceilometer/Alerting#Precalculation_of_aggregate_values)
 to re-architecting the alarms into the sample pipeline 
(https://wiki.openstack.org/wiki/Ceilometer/AlarmImprovements). While Sandy's 
suggestions make sense from the performance viewpoint, the problem of 
scalability remains. Samples in the pipeline need to be kept in-memory for the 
whole evaluation window, which requires O(N) memory for a window of size N.

We could tackle this problem by using cutting edge research in streaming 
algorithms, namely the papers by Datar et al. [1], and Arasu et al. [2]. They 
provide algorithms for computing stream statistics over sliding windows, such 
as *count, avg, min, max* and even *percentile*, **online** and with 
polylogarithmic space requirements. The tradeoff is of course precision, but 
the algorithms are bounded on the relative error - which could be specified by 
the user.

If we can tell the user The average cpu utilization in the last hour was 85% 
+/- 1%, would that not be enough for most use cases, while severely reducing 
the load on the system? We could still support *error_rate=0*, which would 
simply use O(N) space and provide a precise answer for the cases where such an 
answer is needed.

These algorithms were developed with telcos and computer network monitoring in 
mind, in which information about current network performance—latency, 
bandwidth, etc.—is generated online and is used to monitor and adjust network 
performance dynamically[1]. IIUC the main user of alarms is Heat autoscaling, 
which is exactly the kind of problem suitable to 'soft' calculations, with a 
certain tolerance for error.

[1] Datar, Mayur, et al. Maintaining stream statistics over sliding windows. 
*SIAM Journal on Computing* 31.6 (2002): 1794-1813. PDF @ 
http://ilpubs.stanford.edu:8090/504/1/2001-34.pdf

[2] Arasu, Arvind, and Gurmeet Singh Manku. Approximate counts and quantiles 
over sliding windows. *Proceedings of the twenty-third ACM 
SIGMOD-SIGACT-SIGART symposium on Principles of database systems.* ACM, 2004. 
PDF @ http://ilpubs.stanford.edu:8090/624/1/2003-72.pdf


signature.asc
Description: Message signed with OpenPGP using GPGMail
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev