Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Mohanadarshan Vivekanandalingam Thu, 26 Jun 2014 22:23:26 -0700

On Fri, Jun 27, 2014 at 10:00 AM, Lahiru Gunathilake <lginn...@indiana.edu>
wrote:


> Hi Mohan,
>

Hi Lahiru,


>
> I wrote some samples but I can write more test-cases and provide another
> patch. Please feel free to change the naming of the windows as you like.
>

Really appreciate your contribution.. Sure, I'll start look into this..

Thanks,
Mohan


> Regards
> Lahiru
>
> On Jun 26, 2014, at 11:35 PM, Srinath Perera wrote:
>
> Hi All,
>
> Lahiru and myself had a call today morning.
>
> Plan is to
> 1) Lahiru to look at hoeffding tree and other classification algorithms
> and select one to implement. He will compare the performance against MOA or
> some other implementation.
> 2) then we will use it for a  Fraud analysis scenario as a proof of its
> validity.
>
> Then we will decide how to continue from that point.
>
> Mohan could you look at the patch? Lahiru will write test cases that you
> can use to verify.
>
> --Srinath
>
>
> On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake <lginn...@indiana.edu>
> wrote:
>
>> Hi Srinath,
>>
>> Thanks for the response.
>> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote:
>>
>> Hi Lahiru,
>>
>> Sorry for not responding earlier. I was traveling last week.
>>
>> I guess you know frequent item set !=  clustering algorithms.
>>
>> +1
>>
>>
>> Can we have a chat sometime to discuss details about the implementation.
>>
>> Yes sure, that would be very useful.
>>
>> Thanks
>> Lahiru
>>
>>
>> --Srinath
>>
>>
>> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <lginn...@indiana.edu
>> > wrote:
>>
>>> Hi All,
>>>
>>> I have implemented another frequency counting algorithm[1] which is a
>>> classic algorithm for mining frequent items in a data stream. Basically
>>> users can specify a minimum average value of frequent items with an error
>>> value.
>>>
>>> This algorithm will accept two user-specified parameters: a support
>>> threshold s [0-1] and an error parameter e [0-1] such that e << s and
>>> recommended number for e is normally s/10 or s/20. Let N denote the
>>> current length
>>> of the stream, i.e., the number of tuples seen so far. At any point of
>>> time, this algorithm can be asked to produce a list of events along
>>> with their estimated frequencies. The answers produced by this
>>> algorithm will have the
>>> following guarantees:
>>> 1. All item(set)s whose true frequency exceeds sN are outputs.
>>>  There are no false negatives .
>>> 2. No events whose true frequency is less than (s - e) N
>>> is output.
>>>  3. Estimated frequencies are less  than the true frequencies
>>> by at most eN.
>>>  .
>>> The incoming stream is conceptually divided in to buckets of width w =
>>> ceiling(1/e) transactions each. Buckets are labeled with bucket ids ,
>>> starting from 1.We denote the current bucket id  by bcurrent whose
>>> value is ceiling(N/w). For an element , we denote its true frequency in the
>>> stream seen so far by "fe"  . Note that e and w are fixed for a data
>>> stream while N, bcurrent and fe are the variables whose value changes when
>>> the stream progress.
>>> Here the data structure ,  is a set of entries are tuples with the form
>>> of (event, f, delta), where event is the actual event and "f" is the  is an
>>> integer representing its estimated frequency, and delta is the maximum
>>> possible error in "fe".
>>>
>>>  In this algorithm every new event will anyways added in to the
>>> data-structure optimistically and there is no initial condition to enter in
>>> to the data-structure but iteratively if certain events are not match to
>>> the condition provided
>>> by the user those events will be removed when N%windowSize = 0, during
>>> this I output those events as expired-events in the window. For the
>>> incoming events if event is already exists I output those as current-events
>>> and if its a new
>>> event I just add it to the data-structure and iterate through all the
>>> events available and find the matching events based on s and e values and
>>> only those events will be output.
>>>
>>> I am not sure based on the window definition this is the correct
>>> approach or may be windows aren't the best way to associate this algorithm
>>> in to siddhi. I have attached my patch to jira[2] and if you can look that
>>> would be great.
>>>
>>> Siddhi Query will looks like below,
>>> from  cseEventStream#window.lossyFrequent(0.1,0.01) " +
>>>                                                        "select symbol,
>>> price " +
>>>                                                        "insert into
>>> StockQuote;
>>>
>>>
>>> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency
>>> counts over data streams. In Proceedings of the 28th international
>>> conference on Very Large Data Bases (VLDB '02). VLDB Endowment 346-357.
>>>
>>> Regards
>>> Lahiru
>>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote:
>>>
>>> Hi All,
>>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote:
>>>
>>> Hi All,
>>>
>>> I am planning to evaluate different event stream clustering algorithms
>>> as part of my studies(I am a graduate student at indiana University). I
>>> think Siddhi is a good place to experiment this, As per my understanding
>>> based on the docs Siddhi doesn't have a stream clustering interface I can
>>> use directly to plug my own algorithm. So I am thinking of first come up an
>>> interface for different clustering algorithms and add implementation of
>>> algorithms for each event stream by invoking an operation like
>>> SiddhiManager.addQuery. Or I can make the algorithm configure as part of
>>> query language. If the second option is more consistent with current model
>>> I can wrap-up the work in that way but initially focussing on first
>>> approach will be easier for me. So each algorithm can be associated to a
>>> desired event Stream or can be associated globally. If its associated with
>>> each stream algorithm will run local to each stream otherwise it will run
>>> in global context. Based on the algorithm I can provide a way to configure
>>> it with parameters.
>>>
>>> I am sure I have confused with above implementation details, after
>>> looking in to Siddhi extension points I figured out I just have to
>>> implement a new window type. I have implemented one algorithm to keep the
>>> most frequent events
>>> came in a event stream. So queries can looks like below,
>>>
>>> from  cseEventStream#window.frequent(2) " +
>>>                                                        "select symbol,
>>> price " +
>>>                                                        "insert into
>>> StockQuote;
>>>
>>> There are multiple algorithms to keep the most frequent events in a
>>> given window size for now I just implemented a simple algorithm[1] with the
>>> processing complexity of O(1) and space complexity O(n) where n is the
>>> limit of the most frequent items. I have created a patch and attached it to
>>> jira[2].
>>>
>>> [1] Jayadev, and David Gries Misra, "Finding repeated elements," in
>>> Science of computer programming 2, no.
>>> 2 (1982): 143-152.
>>> [2]https://wso2.org/jira/browse/CEP-877
>>>
>>>
>>> Thanks
>>> Lahiru
>>>
>>> To start this I hope to implement a frequent item set mining algorithm
>>> which can be used to find out most frequent items of an event stream.
>>> Search engines use these kind of data to find out most frequent searches in
>>> a given time window and optimize the search queries. I can start with some
>>> algorithms like Misra-Gries algorithm[1] and Manku and Motwani [2] and
>>> then move towards more of data clustering algorithms. For the time being I
>>> will write the clustering results in to a file and later I think I can use
>>> more stable storage (either wso2 registry or other prefered way in wso2
>>> product stack). If Siddhi or WSO2 CEP already have the capability of
>>> frequent item mining I will start with a more classification type algorithm.
>>>
>>> Your feedback will be very useful for my work. If you have requirement
>>> for any specific type of algorithms based on the real client interactions
>>> you have, I would like to know them and implement them with Siddhi and do
>>> the comparison.
>>>
>>> Thanks
>>> Lahiru
>>> _______________________________________________
>>> Architecture mailing list
>>> Architecture@wso2.org
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>> _______________________________________________
>>> Architecture mailing list
>>> Architecture@wso2.org
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>>
>>> _______________________________________________
>>> Architecture mailing list
>>> Architecture@wso2.org
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>
>>
>> --
>> ============================
>> Srinath Perera, Ph.D.
>>    http://people.apache.org/~hemapani/
>>    http://srinathsview.blogspot.com/
>>  _______________________________________________
>> Architecture mailing list
>> Architecture@wso2.org
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>>
>> _______________________________________________
>> Architecture mailing list
>> Architecture@wso2.org
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>
>
> --
> ============================
> Srinath Perera, Ph.D.
>    http://people.apache.org/~hemapani/
>    http://srinathsview.blogspot.com/
>  _______________________________________________
> Architecture mailing list
> Architecture@wso2.org
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>
>


-- 
*V. Mohanadarshan*
*Software Engineer,*
*Data Technologies Team,*
*WSO2, Inc. http://wso2.com <http://wso2.com> *
*lean.enterprise.middleware.*

email: mo...@wso2.com
phone:(+94) 771117673

_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Reply via email to