Hi Lahiru,

Sorry for not responding earlier. I was traveling last week.

I guess you know frequent item set !=  clustering algorithms.

Can we have a chat sometime to discuss details about the implementation.

--Srinath


On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <lginn...@indiana.edu>
wrote:

> Hi All,
>
> I have implemented another frequency counting algorithm[1] which is a
> classic algorithm for mining frequent items in a data stream. Basically
> users can specify a minimum average value of frequent items with an error
> value.
>
> This algorithm will accept two user-specified parameters: a support
> threshold s [0-1] and an error parameter e [0-1] such that e << s and
> recommended number for e is normally s/10 or s/20. Let N denote the
> current length
> of the stream, i.e., the number of tuples seen so far. At any point of
> time, this algorithm can be asked to produce a list of events along with
> their estimated frequencies. The answers produced by this algorithm will
> have the
> following guarantees:
> 1. All item(set)s whose true frequency exceeds sN are outputs.
>  There are no false negatives .
> 2. No events whose true frequency is less than (s - e) N
> is output.
>  3. Estimated frequencies are less  than the true frequencies
> by at most eN.
>  .
> The incoming stream is conceptually divided in to buckets of width w =
> ceiling(1/e) transactions each. Buckets are labeled with bucket ids ,
> starting from 1.We denote the current bucket id  by bcurrent whose value
> is ceiling(N/w). For an element , we denote its true frequency in the
> stream seen so far by "fe"  . Note that e and w are fixed for a data
> stream while N, bcurrent and fe are the variables whose value changes when
> the stream progress.
> Here the data structure ,  is a set of entries are tuples with the form
> of (event, f, delta), where event is the actual event and "f" is the  is an
> integer representing its estimated frequency, and delta is the maximum
> possible error in "fe".
>
>  In this algorithm every new event will anyways added in to the
> data-structure optimistically and there is no initial condition to enter in
> to the data-structure but iteratively if certain events are not match to
> the condition provided
> by the user those events will be removed when N%windowSize = 0, during
> this I output those events as expired-events in the window. For the
> incoming events if event is already exists I output those as current-events
> and if its a new
> event I just add it to the data-structure and iterate through all the
> events available and find the matching events based on s and e values and
> only those events will be output.
>
> I am not sure based on the window definition this is the correct approach
> or may be windows aren't the best way to associate this algorithm in to
> siddhi. I have attached my patch to jira[2] and if you can look that would
> be great.
>
> Siddhi Query will looks like below,
> from  cseEventStream#window.lossyFrequent(0.1,0.01) " +
>                                                        "select symbol,
> price " +
>                                                        "insert into
> StockQuote;
>
>
> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency
> counts over data streams. In Proceedings of the 28th international
> conference on Very Large Data Bases (VLDB '02). VLDB Endowment 346-357.
>
> Regards
> Lahiru
> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote:
>
> Hi All,
> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote:
>
> Hi All,
>
> I am planning to evaluate different event stream clustering algorithms as
> part of my studies(I am a graduate student at indiana University). I think
> Siddhi is a good place to experiment this, As per my understanding based on
> the docs Siddhi doesn't have a stream clustering interface I can use
> directly to plug my own algorithm. So I am thinking of first come up an
> interface for different clustering algorithms and add implementation of
> algorithms for each event stream by invoking an operation like
> SiddhiManager.addQuery. Or I can make the algorithm configure as part of
> query language. If the second option is more consistent with current model
> I can wrap-up the work in that way but initially focussing on first
> approach will be easier for me. So each algorithm can be associated to a
> desired event Stream or can be associated globally. If its associated with
> each stream algorithm will run local to each stream otherwise it will run
> in global context. Based on the algorithm I can provide a way to configure
> it with parameters.
>
> I am sure I have confused with above implementation details, after looking
> in to Siddhi extension points I figured out I just have to implement a new
> window type. I have implemented one algorithm to keep the most frequent
> events
> came in a event stream. So queries can looks like below,
>
> from  cseEventStream#window.frequent(2) " +
>                                                        "select symbol,
> price " +
>                                                        "insert into
> StockQuote;
>
> There are multiple algorithms to keep the most frequent events in a given
> window size for now I just implemented a simple algorithm[1] with the
> processing complexity of O(1) and space complexity O(n) where n is the
> limit of the most frequent items. I have created a patch and attached it to
> jira[2].
>
> [1] Jayadev, and David Gries Misra, "Finding repeated elements," in
> Science of computer programming 2, no.
> 2 (1982): 143-152.
> [2]https://wso2.org/jira/browse/CEP-877
>
>
> Thanks
> Lahiru
>
> To start this I hope to implement a frequent item set mining algorithm
> which can be used to find out most frequent items of an event stream.
> Search engines use these kind of data to find out most frequent searches in
> a given time window and optimize the search queries. I can start with some
> algorithms like Misra-Gries algorithm[1] and Manku and Motwani [2] and
> then move towards more of data clustering algorithms. For the time being I
> will write the clustering results in to a file and later I think I can use
> more stable storage (either wso2 registry or other prefered way in wso2
> product stack). If Siddhi or WSO2 CEP already have the capability of
> frequent item mining I will start with a more classification type algorithm.
>
> Your feedback will be very useful for my work. If you have requirement for
> any specific type of algorithms based on the real client interactions you
> have, I would like to know them and implement them with Siddhi and do the
> comparison.
>
> Thanks
> Lahiru
> _______________________________________________
> Architecture mailing list
> Architecture@wso2.org
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>
> _______________________________________________
> Architecture mailing list
> Architecture@wso2.org
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>
>
> _______________________________________________
> Architecture mailing list
> Architecture@wso2.org
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>


-- 
============================
Srinath Perera, Ph.D.
   http://people.apache.org/~hemapani/
   http://srinathsview.blogspot.com/
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Reply via email to