Hi All,

I have implemented another frequency counting algorithm[1] which is a classic 
algorithm for mining frequent items in a data stream. Basically users can 
specify a minimum average value of frequent items with an error value. 

This algorithm will accept two user-specified parameters: a support threshold s 
[0-1] and an error parameter e [0-1] such that e << s and recommended number 
for e is normally s/10 or s/20. Let N denote the current length
of the stream, i.e., the number of tuples seen so far. At any point of time, 
this algorithm can be asked to produce a list of events along with their 
estimated frequencies. The answers produced by this algorithm will have the
following guarantees:
1. All item(set)s whose true frequency exceeds sN are outputs.
 There are no false negatives .
2. No events whose true frequency is less than (s - e) N 
is output.
 3. Estimated frequencies are less  than the true frequencies
by at most eN.
 .
The incoming stream is conceptually divided in to buckets of width w = 
ceiling(1/e) transactions each. Buckets are labeled with bucket ids , starting 
from 1.We denote the current bucket id  by bcurrent whose value is 
ceiling(N/w). For an element , we denote its true frequency in the stream seen 
so far by "fe"  . Note that e and w are fixed for a data stream while N, 
bcurrent and fe are the variables whose value changes when the stream progress.
Here the data structure ,  is a set of entries are tuples with the form of 
(event, f, delta), where event is the actual event and "f" is the  is an 
integer representing its estimated frequency, and delta is the maximum possible 
error in "fe".

 In this algorithm every new event will anyways added in to the data-structure 
optimistically and there is no initial condition to enter in to the 
data-structure but iteratively if certain events are not match to the condition 
provided
by the user those events will be removed when N%windowSize = 0, during this I 
output those events as expired-events in the window. For the incoming events if 
event is already exists I output those as current-events and if its a new
event I just add it to the data-structure and iterate through all the events 
available and find the matching events based on s and e values and only those 
events will be output.

I am not sure based on the window definition this is the correct approach or 
may be windows aren't the best way to associate this algorithm in to siddhi. I 
have attached my patch to jira[2] and if you can look that would be great.
    
Siddhi Query will looks like below,
from  cseEventStream#window.lossyFrequent(0.1,0.01) " +
                                                       "select symbol, price " +
                                                       "insert into StockQuote;


[1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency counts 
over data streams. In Proceedings of the 28th international conference on Very 
Large Data Bases (VLDB '02). VLDB Endowment 346-357.

Regards
Lahiru
On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote:

> Hi All,
> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote:
> 
>> Hi All,
>> 
>> I am planning to evaluate different event stream clustering algorithms as 
>> part of my studies(I am a graduate student at indiana University). I think 
>> Siddhi is a good place to experiment this, As per my understanding based on 
>> the docs Siddhi doesn't have a stream clustering interface I can use 
>> directly to plug my own algorithm. So I am thinking of first come up an 
>> interface for different clustering algorithms and add implementation of 
>> algorithms for each event stream by invoking an operation like 
>> SiddhiManager.addQuery. Or I can make the algorithm configure as part of 
>> query language. If the second option is more consistent with current model I 
>> can wrap-up the work in that way but initially focussing on first approach 
>> will be easier for me. So each algorithm can be associated to a desired 
>> event Stream or can be associated globally. If its associated with each 
>> stream algorithm will run local to each stream otherwise it will run in 
>> global context. Based on the algorithm I can provide a way to configure it 
>> with parameters.
>> 
> I am sure I have confused with above implementation details, after looking in 
> to Siddhi extension points I figured out I just have to implement a new 
> window type. I have implemented one algorithm to keep the most frequent 
> events 
> came in a event stream. So queries can looks like below,
> 
> from  cseEventStream#window.frequent(2) " +
>                                                        "select symbol, price 
> " +
>                                                        "insert into 
> StockQuote;
> 
> There are multiple algorithms to keep the most frequent events in a given 
> window size for now I just implemented a simple algorithm[1] with the 
> processing complexity of O(1) and space complexity O(n) where n is the limit 
> of the most frequent items. I have created a patch and attached it to jira[2].
> 
> [1]   Jayadev,        and     David   Gries   Misra,  "Finding        
> repeated        elements,"      in      Science of      computer        
> programming     2,      no.     
> 2     (1982): 143-152.
> [2]https://wso2.org/jira/browse/CEP-877
> 
> Thanks
> Lahiru
>> To start this I hope to implement a frequent item set mining algorithm which 
>> can be used to find out most frequent items of an event stream. Search 
>> engines use these kind of data to find out most frequent searches in a given 
>> time window and optimize the search queries. I can start with some 
>> algorithms like Misra-Gries algorithm[1] and Manku and   Motwani [2] and 
>> then move towards more of data clustering algorithms. For the time being I 
>> will write the clustering results in to a file and later I think I can use 
>> more stable storage (either wso2 registry or other prefered way in wso2 
>> product stack). If Siddhi or WSO2 CEP already have the capability of 
>> frequent item mining I will start with a more classification type algorithm.
>> 
>> Your feedback will be very useful for my work. If you have requirement for 
>> any specific type of algorithms based on the real client interactions you 
>> have, I would like to know them and implement them with Siddhi and do the 
>> comparison.
>> 
>> Thanks
>> Lahiru
>> _______________________________________________
>> Architecture mailing list
>> Architecture@wso2.org
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
> 
> _______________________________________________
> Architecture mailing list
> Architecture@wso2.org
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Reply via email to