Hi Lahiru, Sorry for not responding earlier. I was traveling last week.
I guess you know frequent item set != clustering algorithms. Can we have a chat sometime to discuss details about the implementation. --Srinath On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <lginn...@indiana.edu> wrote: > Hi All, > > I have implemented another frequency counting algorithm[1] which is a > classic algorithm for mining frequent items in a data stream. Basically > users can specify a minimum average value of frequent items with an error > value. > > This algorithm will accept two user-specified parameters: a support > threshold s [0-1] and an error parameter e [0-1] such that e << s and > recommended number for e is normally s/10 or s/20. Let N denote the > current length > of the stream, i.e., the number of tuples seen so far. At any point of > time, this algorithm can be asked to produce a list of events along with > their estimated frequencies. The answers produced by this algorithm will > have the > following guarantees: > 1. All item(set)s whose true frequency exceeds sN are outputs. > There are no false negatives . > 2. No events whose true frequency is less than (s - e) N > is output. > 3. Estimated frequencies are less than the true frequencies > by at most eN. > . > The incoming stream is conceptually divided in to buckets of width w = > ceiling(1/e) transactions each. Buckets are labeled with bucket ids , > starting from 1.We denote the current bucket id by bcurrent whose value > is ceiling(N/w). For an element , we denote its true frequency in the > stream seen so far by "fe" . Note that e and w are fixed for a data > stream while N, bcurrent and fe are the variables whose value changes when > the stream progress. > Here the data structure , is a set of entries are tuples with the form > of (event, f, delta), where event is the actual event and "f" is the is an > integer representing its estimated frequency, and delta is the maximum > possible error in "fe". > > In this algorithm every new event will anyways added in to the > data-structure optimistically and there is no initial condition to enter in > to the data-structure but iteratively if certain events are not match to > the condition provided > by the user those events will be removed when N%windowSize = 0, during > this I output those events as expired-events in the window. For the > incoming events if event is already exists I output those as current-events > and if its a new > event I just add it to the data-structure and iterate through all the > events available and find the matching events based on s and e values and > only those events will be output. > > I am not sure based on the window definition this is the correct approach > or may be windows aren't the best way to associate this algorithm in to > siddhi. I have attached my patch to jira[2] and if you can look that would > be great. > > Siddhi Query will looks like below, > from cseEventStream#window.lossyFrequent(0.1,0.01) " + > "select symbol, > price " + > "insert into > StockQuote; > > > [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency > counts over data streams. In Proceedings of the 28th international > conference on Very Large Data Bases (VLDB '02). VLDB Endowment 346-357. > > Regards > Lahiru > On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote: > > Hi All, > On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote: > > Hi All, > > I am planning to evaluate different event stream clustering algorithms as > part of my studies(I am a graduate student at indiana University). I think > Siddhi is a good place to experiment this, As per my understanding based on > the docs Siddhi doesn't have a stream clustering interface I can use > directly to plug my own algorithm. So I am thinking of first come up an > interface for different clustering algorithms and add implementation of > algorithms for each event stream by invoking an operation like > SiddhiManager.addQuery. Or I can make the algorithm configure as part of > query language. If the second option is more consistent with current model > I can wrap-up the work in that way but initially focussing on first > approach will be easier for me. So each algorithm can be associated to a > desired event Stream or can be associated globally. If its associated with > each stream algorithm will run local to each stream otherwise it will run > in global context. Based on the algorithm I can provide a way to configure > it with parameters. > > I am sure I have confused with above implementation details, after looking > in to Siddhi extension points I figured out I just have to implement a new > window type. I have implemented one algorithm to keep the most frequent > events > came in a event stream. So queries can looks like below, > > from cseEventStream#window.frequent(2) " + > "select symbol, > price " + > "insert into > StockQuote; > > There are multiple algorithms to keep the most frequent events in a given > window size for now I just implemented a simple algorithm[1] with the > processing complexity of O(1) and space complexity O(n) where n is the > limit of the most frequent items. I have created a patch and attached it to > jira[2]. > > [1] Jayadev, and David Gries Misra, "Finding repeated elements," in > Science of computer programming 2, no. > 2 (1982): 143-152. > [2]https://wso2.org/jira/browse/CEP-877 > > > Thanks > Lahiru > > To start this I hope to implement a frequent item set mining algorithm > which can be used to find out most frequent items of an event stream. > Search engines use these kind of data to find out most frequent searches in > a given time window and optimize the search queries. I can start with some > algorithms like Misra-Gries algorithm[1] and Manku and Motwani [2] and > then move towards more of data clustering algorithms. For the time being I > will write the clustering results in to a file and later I think I can use > more stable storage (either wso2 registry or other prefered way in wso2 > product stack). If Siddhi or WSO2 CEP already have the capability of > frequent item mining I will start with a more classification type algorithm. > > Your feedback will be very useful for my work. If you have requirement for > any specific type of algorithms based on the real client interactions you > have, I would like to know them and implement them with Siddhi and do the > comparison. > > Thanks > Lahiru > _______________________________________________ > Architecture mailing list > Architecture@wso2.org > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > > _______________________________________________ > Architecture mailing list > Architecture@wso2.org > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > > > _______________________________________________ > Architecture mailing list > Architecture@wso2.org > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > -- ============================ Srinath Perera, Ph.D. http://people.apache.org/~hemapani/ http://srinathsview.blogspot.com/
_______________________________________________ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture