Hi All,

Lahiru and myself had a call today morning.

Plan is to
1) Lahiru to look at hoeffding tree and other classification algorithms and
select one to implement. He will compare the performance against MOA or
some other implementation.
2) then we will use it for a  Fraud analysis scenario as a proof of its
validity.

Then we will decide how to continue from that point.

Mohan could you look at the patch? Lahiru will write test cases that you
can use to verify.

--Srinath


On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake <lginn...@indiana.edu>
wrote:

> Hi Srinath,
>
> Thanks for the response.
> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote:
>
> Hi Lahiru,
>
> Sorry for not responding earlier. I was traveling last week.
>
> I guess you know frequent item set !=  clustering algorithms.
>
> +1
>
>
> Can we have a chat sometime to discuss details about the implementation.
>
> Yes sure, that would be very useful.
>
> Thanks
> Lahiru
>
>
> --Srinath
>
>
> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <lginn...@indiana.edu>
> wrote:
>
>> Hi All,
>>
>> I have implemented another frequency counting algorithm[1] which is a
>> classic algorithm for mining frequent items in a data stream. Basically
>> users can specify a minimum average value of frequent items with an error
>> value.
>>
>> This algorithm will accept two user-specified parameters: a support
>> threshold s [0-1] and an error parameter e [0-1] such that e << s and
>> recommended number for e is normally s/10 or s/20. Let N denote the
>> current length
>> of the stream, i.e., the number of tuples seen so far. At any point of
>> time, this algorithm can be asked to produce a list of events along with
>> their estimated frequencies. The answers produced by this algorithm will
>> have the
>> following guarantees:
>> 1. All item(set)s whose true frequency exceeds sN are outputs.
>>  There are no false negatives .
>> 2. No events whose true frequency is less than (s - e) N
>> is output.
>>  3. Estimated frequencies are less  than the true frequencies
>> by at most eN.
>>  .
>> The incoming stream is conceptually divided in to buckets of width w =
>> ceiling(1/e) transactions each. Buckets are labeled with bucket ids ,
>> starting from 1.We denote the current bucket id  by bcurrent whose value
>> is ceiling(N/w). For an element , we denote its true frequency in the
>> stream seen so far by "fe"  . Note that e and w are fixed for a data
>> stream while N, bcurrent and fe are the variables whose value changes when
>> the stream progress.
>> Here the data structure ,  is a set of entries are tuples with the form
>> of (event, f, delta), where event is the actual event and "f" is the  is an
>> integer representing its estimated frequency, and delta is the maximum
>> possible error in "fe".
>>
>>  In this algorithm every new event will anyways added in to the
>> data-structure optimistically and there is no initial condition to enter in
>> to the data-structure but iteratively if certain events are not match to
>> the condition provided
>> by the user those events will be removed when N%windowSize = 0, during
>> this I output those events as expired-events in the window. For the
>> incoming events if event is already exists I output those as current-events
>> and if its a new
>> event I just add it to the data-structure and iterate through all the
>> events available and find the matching events based on s and e values and
>> only those events will be output.
>>
>> I am not sure based on the window definition this is the correct approach
>> or may be windows aren't the best way to associate this algorithm in to
>> siddhi. I have attached my patch to jira[2] and if you can look that would
>> be great.
>>
>> Siddhi Query will looks like below,
>> from  cseEventStream#window.lossyFrequent(0.1,0.01) " +
>>                                                        "select symbol,
>> price " +
>>                                                        "insert into
>> StockQuote;
>>
>>
>> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency
>> counts over data streams. In Proceedings of the 28th international
>> conference on Very Large Data Bases (VLDB '02). VLDB Endowment 346-357.
>>
>> Regards
>> Lahiru
>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote:
>>
>> Hi All,
>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote:
>>
>> Hi All,
>>
>> I am planning to evaluate different event stream clustering algorithms as
>> part of my studies(I am a graduate student at indiana University). I think
>> Siddhi is a good place to experiment this, As per my understanding based on
>> the docs Siddhi doesn't have a stream clustering interface I can use
>> directly to plug my own algorithm. So I am thinking of first come up an
>> interface for different clustering algorithms and add implementation of
>> algorithms for each event stream by invoking an operation like
>> SiddhiManager.addQuery. Or I can make the algorithm configure as part of
>> query language. If the second option is more consistent with current model
>> I can wrap-up the work in that way but initially focussing on first
>> approach will be easier for me. So each algorithm can be associated to a
>> desired event Stream or can be associated globally. If its associated with
>> each stream algorithm will run local to each stream otherwise it will run
>> in global context. Based on the algorithm I can provide a way to configure
>> it with parameters.
>>
>> I am sure I have confused with above implementation details, after
>> looking in to Siddhi extension points I figured out I just have to
>> implement a new window type. I have implemented one algorithm to keep the
>> most frequent events
>> came in a event stream. So queries can looks like below,
>>
>> from  cseEventStream#window.frequent(2) " +
>>                                                        "select symbol,
>> price " +
>>                                                        "insert into
>> StockQuote;
>>
>> There are multiple algorithms to keep the most frequent events in a given
>> window size for now I just implemented a simple algorithm[1] with the
>> processing complexity of O(1) and space complexity O(n) where n is the
>> limit of the most frequent items. I have created a patch and attached it to
>> jira[2].
>>
>> [1] Jayadev, and David Gries Misra, "Finding repeated elements," in
>> Science of computer programming 2, no.
>> 2 (1982): 143-152.
>> [2]https://wso2.org/jira/browse/CEP-877
>>
>>
>> Thanks
>> Lahiru
>>
>> To start this I hope to implement a frequent item set mining algorithm
>> which can be used to find out most frequent items of an event stream.
>> Search engines use these kind of data to find out most frequent searches in
>> a given time window and optimize the search queries. I can start with some
>> algorithms like Misra-Gries algorithm[1] and Manku and Motwani [2] and
>> then move towards more of data clustering algorithms. For the time being I
>> will write the clustering results in to a file and later I think I can use
>> more stable storage (either wso2 registry or other prefered way in wso2
>> product stack). If Siddhi or WSO2 CEP already have the capability of
>> frequent item mining I will start with a more classification type algorithm.
>>
>> Your feedback will be very useful for my work. If you have requirement
>> for any specific type of algorithms based on the real client interactions
>> you have, I would like to know them and implement them with Siddhi and do
>> the comparison.
>>
>> Thanks
>> Lahiru
>> _______________________________________________
>> Architecture mailing list
>> Architecture@wso2.org
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>> _______________________________________________
>> Architecture mailing list
>> Architecture@wso2.org
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>>
>> _______________________________________________
>> Architecture mailing list
>> Architecture@wso2.org
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>
>
> --
> ============================
> Srinath Perera, Ph.D.
>    http://people.apache.org/~hemapani/
>    http://srinathsview.blogspot.com/
>  _______________________________________________
> Architecture mailing list
> Architecture@wso2.org
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>
>
> _______________________________________________
> Architecture mailing list
> Architecture@wso2.org
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>


-- 
============================
Srinath Perera, Ph.D.
   http://people.apache.org/~hemapani/
   http://srinathsview.blogspot.com/
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Reply via email to