On Fri, Jun 27, 2014 at 9:05 AM, Srinath Perera <srin...@wso2.com> wrote:
> Hi All, > > Lahiru and myself had a call today morning. > > Plan is to > 1) Lahiru to look at hoeffding tree and other classification algorithms > and select one to implement. He will compare the performance against MOA or > some other implementation. > 2) then we will use it for a Fraud analysis scenario as a proof of its > validity. > > Then we will decide how to continue from that point. > > Mohan could you look at the patch? Lahiru will write test cases that you > can use to verify. > Sure.. Will do.. Thanks, Mohan > > --Srinath > > > On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake <lginn...@indiana.edu> > wrote: > >> Hi Srinath, >> >> Thanks for the response. >> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote: >> >> Hi Lahiru, >> >> Sorry for not responding earlier. I was traveling last week. >> >> I guess you know frequent item set != clustering algorithms. >> >> +1 >> >> >> Can we have a chat sometime to discuss details about the implementation. >> >> Yes sure, that would be very useful. >> >> Thanks >> Lahiru >> >> >> --Srinath >> >> >> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <lginn...@indiana.edu >> > wrote: >> >>> Hi All, >>> >>> I have implemented another frequency counting algorithm[1] which is a >>> classic algorithm for mining frequent items in a data stream. Basically >>> users can specify a minimum average value of frequent items with an error >>> value. >>> >>> This algorithm will accept two user-specified parameters: a support >>> threshold s [0-1] and an error parameter e [0-1] such that e << s and >>> recommended number for e is normally s/10 or s/20. Let N denote the >>> current length >>> of the stream, i.e., the number of tuples seen so far. At any point of >>> time, this algorithm can be asked to produce a list of events along >>> with their estimated frequencies. The answers produced by this >>> algorithm will have the >>> following guarantees: >>> 1. All item(set)s whose true frequency exceeds sN are outputs. >>> There are no false negatives . >>> 2. No events whose true frequency is less than (s - e) N >>> is output. >>> 3. Estimated frequencies are less than the true frequencies >>> by at most eN. >>> . >>> The incoming stream is conceptually divided in to buckets of width w = >>> ceiling(1/e) transactions each. Buckets are labeled with bucket ids , >>> starting from 1.We denote the current bucket id by bcurrent whose >>> value is ceiling(N/w). For an element , we denote its true frequency in the >>> stream seen so far by "fe" . Note that e and w are fixed for a data >>> stream while N, bcurrent and fe are the variables whose value changes when >>> the stream progress. >>> Here the data structure , is a set of entries are tuples with the form >>> of (event, f, delta), where event is the actual event and "f" is the is an >>> integer representing its estimated frequency, and delta is the maximum >>> possible error in "fe". >>> >>> In this algorithm every new event will anyways added in to the >>> data-structure optimistically and there is no initial condition to enter in >>> to the data-structure but iteratively if certain events are not match to >>> the condition provided >>> by the user those events will be removed when N%windowSize = 0, during >>> this I output those events as expired-events in the window. For the >>> incoming events if event is already exists I output those as current-events >>> and if its a new >>> event I just add it to the data-structure and iterate through all the >>> events available and find the matching events based on s and e values and >>> only those events will be output. >>> >>> I am not sure based on the window definition this is the correct >>> approach or may be windows aren't the best way to associate this algorithm >>> in to siddhi. I have attached my patch to jira[2] and if you can look that >>> would be great. >>> >>> Siddhi Query will looks like below, >>> from cseEventStream#window.lossyFrequent(0.1,0.01) " + >>> "select symbol, >>> price " + >>> "insert into >>> StockQuote; >>> >>> >>> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency >>> counts over data streams. In Proceedings of the 28th international >>> conference on Very Large Data Bases (VLDB '02). VLDB Endowment 346-357. >>> >>> Regards >>> Lahiru >>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote: >>> >>> Hi All, >>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote: >>> >>> Hi All, >>> >>> I am planning to evaluate different event stream clustering algorithms >>> as part of my studies(I am a graduate student at indiana University). I >>> think Siddhi is a good place to experiment this, As per my understanding >>> based on the docs Siddhi doesn't have a stream clustering interface I can >>> use directly to plug my own algorithm. So I am thinking of first come up an >>> interface for different clustering algorithms and add implementation of >>> algorithms for each event stream by invoking an operation like >>> SiddhiManager.addQuery. Or I can make the algorithm configure as part of >>> query language. If the second option is more consistent with current model >>> I can wrap-up the work in that way but initially focussing on first >>> approach will be easier for me. So each algorithm can be associated to a >>> desired event Stream or can be associated globally. If its associated with >>> each stream algorithm will run local to each stream otherwise it will run >>> in global context. Based on the algorithm I can provide a way to configure >>> it with parameters. >>> >>> I am sure I have confused with above implementation details, after >>> looking in to Siddhi extension points I figured out I just have to >>> implement a new window type. I have implemented one algorithm to keep the >>> most frequent events >>> came in a event stream. So queries can looks like below, >>> >>> from cseEventStream#window.frequent(2) " + >>> "select symbol, >>> price " + >>> "insert into >>> StockQuote; >>> >>> There are multiple algorithms to keep the most frequent events in a >>> given window size for now I just implemented a simple algorithm[1] with the >>> processing complexity of O(1) and space complexity O(n) where n is the >>> limit of the most frequent items. I have created a patch and attached it to >>> jira[2]. >>> >>> [1] Jayadev, and David Gries Misra, "Finding repeated elements," in >>> Science of computer programming 2, no. >>> 2 (1982): 143-152. >>> [2]https://wso2.org/jira/browse/CEP-877 >>> >>> >>> Thanks >>> Lahiru >>> >>> To start this I hope to implement a frequent item set mining algorithm >>> which can be used to find out most frequent items of an event stream. >>> Search engines use these kind of data to find out most frequent searches in >>> a given time window and optimize the search queries. I can start with some >>> algorithms like Misra-Gries algorithm[1] and Manku and Motwani [2] and >>> then move towards more of data clustering algorithms. For the time being I >>> will write the clustering results in to a file and later I think I can use >>> more stable storage (either wso2 registry or other prefered way in wso2 >>> product stack). If Siddhi or WSO2 CEP already have the capability of >>> frequent item mining I will start with a more classification type algorithm. >>> >>> Your feedback will be very useful for my work. If you have requirement >>> for any specific type of algorithms based on the real client interactions >>> you have, I would like to know them and implement them with Siddhi and do >>> the comparison. >>> >>> Thanks >>> Lahiru >>> _______________________________________________ >>> Architecture mailing list >>> Architecture@wso2.org >>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>> >>> >>> _______________________________________________ >>> Architecture mailing list >>> Architecture@wso2.org >>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>> >>> >>> >>> _______________________________________________ >>> Architecture mailing list >>> Architecture@wso2.org >>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>> >>> >> >> >> -- >> ============================ >> Srinath Perera, Ph.D. >> http://people.apache.org/~hemapani/ >> http://srinathsview.blogspot.com/ >> _______________________________________________ >> Architecture mailing list >> Architecture@wso2.org >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> >> >> _______________________________________________ >> Architecture mailing list >> Architecture@wso2.org >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> > > > -- > ============================ > Srinath Perera, Ph.D. > http://people.apache.org/~hemapani/ > http://srinathsview.blogspot.com/ > -- *V. Mohanadarshan* *Software Engineer,* *Data Technologies Team,* *WSO2, Inc. http://wso2.com <http://wso2.com> * *lean.enterprise.middleware.* email: mo...@wso2.com phone:(+94) 771117673
_______________________________________________ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture