Re: Mahout GSoC 2010 proposal: Association Mining

Robin Anil Mon, 05 Apr 2010 15:01:55 -0700

PS: Current TopK FPGrowth is pretty tightly coupled. But it can be easily
refactored out or even a vanilla implementation of FPGrowth is not so
difficult to re-create by re-using the existing methods.


Robin


On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil <robin.a...@gmail.com> wrote:

> Hi Lukas,
>                Sorry for being late to getting back to you on this.
> Association rule mining is a great addition to FPGrowth. I am not sure I
> understand GUHA method well but then again I understood Ted's LLR after some
> deep reading. Could you put up an interesting example to help us understand
> this method. Maybe starting from a transaction of shopping cart item ? A
> great demo is big plus for a GSOC project.
>
> Robin
>
>
> On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček <lukas.vl...@gmail.com>wrote:
>
>> Hello,
>>
>> I would like to apply for Mahout GSoC 2010. My proposal is to implement
>> Association Mining algorithm utilizing existing PFPGrowth implementation (
>> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).
>>
>> As for the Assoiciation Mining I would like to implement a very general
>> algorithm based on old and established method called GUHA [1]. Short and
>> informative description of GUHA method can be found here [2]. Very
>> exhaustive theoretical foundation can be found here [3]. Since the GUHA
>> method has been developing from 60's and is very rich now and given the
>> limited time during GSoC program it would be wise to reduce scope of
>> initial
>> implementation.
>>
>> There are two main topics that I would like to focus on during GSoC:
>>
>> 1) Enhancing and generalization of PFPGworth:
>> Frequent pattern mining is usually part of each association maining task.
>> In
>> Mahout there is existing implementation of PFPGrowth algorithm. I would
>> like
>> to utilize this implementaion but it would be necessary to enhance or
>> generalize it first. The goal here would be to keep current functionality
>> and performance of PFPGrowth but allow more generalized input/output data
>> and conditions. We will need to get frequencies of very rare patterns thus
>> if it will show up that mining only top K is limiting factor then we would
>> need to allow relaxation of this approach. Also we will need to take
>> account
>> on negative features (those missing in individual transaction). Negative
>> features can be either directly supported by implementation of FP
>> algorithm
>> or it can be coded at the feature level (decision about what is better
>> approach will be part of the GSoC program). It should be noted that for
>> the
>> GSoC program we will narrow scope of association mining antecedents and
>> succedents to conjunctions of data features only.
>>
>> 2) API for custom association mining functions based on 4ft table:
>> Association mining in GUHA sense means testing hypothesis on four-fold
>> table
>> (4ft) [see 2. item 5]. There has been proposed a lot of association
>> functions for GUHA, some of them are based on statistical tests (for
>> example
>> Fischer test, Chi-square test), some are based on frequent analysis while
>> not explicitly refering to statistical tests but in both cases all
>> frequencies from four-fold table are needed. Some tests/functions do not
>> require all frequencies, however; keeping this association mining
>> implementation general means that we are targeting for all frequencies
>> from
>> 4ft. The goal here would be to provide implementation of few GUHA
>> functions
>> and design API for custom function based on 4ft table (if the custom
>> function can be expressed using 4ft table frequencies then it should be
>> very
>> easy to implement it for association mining job).
>>
>> General use case scenario:
>> Before the association mining task is executed one would have to decide
>> which features can be on the left hand side of the rule (antecedents) and
>> which on the right hand side of the rule (succedents). It may be practical
>> to limit number of features on both sides (rules with many features may
>> not
>> be very useful). Then a specific test or function for the 4ft table will
>> be
>> specified with additional custom treasholds.
>>
>> Note: The terminology used in GUHA theory is not be unified. Various
>> researches used different terminology. This may be a source of confusion.
>>
>> My background:
>> I have been working as a full-time Java developer for 9 years. Currently,
>> I
>> am working for JBoss (thus being paid for working on open source). A few
>> years ago I also started Ph.D. studies in Ostrava, Czech Republic and
>> association mining is the topic I am focusing on right now. Implementing
>> association mining in context of Mahout makes sense because it is one of
>> the
>> areas of data mining which is not yet covered in Mahout at all. MapReduce
>> implementation can be one possible way how to tackle the challenge of
>> mining
>> association rules from large data.
>>
>> Regards,
>> Lukas Vlcek
>>
>> [1]
>>
>> http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC
>> [2] http://www.cs.cas.cz/coufal/guha/index.htm
>> [3] http://www.uivt.cas.cz/~hajek/guhabook/index.html<
>> http://www.uivt.cas.cz/%7Ehajek/guhabook/index.html>
>>
>
>

Re: Mahout GSoC 2010 proposal: Association Mining

Reply via email to