PS: Current TopK FPGrowth is pretty tightly coupled. But it can be easily refactored out or even a vanilla implementation of FPGrowth is not so difficult to re-create by re-using the existing methods.
Robin On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil <robin.a...@gmail.com> wrote: > Hi Lukas, > Sorry for being late to getting back to you on this. > Association rule mining is a great addition to FPGrowth. I am not sure I > understand GUHA method well but then again I understood Ted's LLR after some > deep reading. Could you put up an interesting example to help us understand > this method. Maybe starting from a transaction of shopping cart item ? A > great demo is big plus for a GSOC project. > > Robin > > > On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček <lukas.vl...@gmail.com>wrote: > >> Hello, >> >> I would like to apply for Mahout GSoC 2010. My proposal is to implement >> Association Mining algorithm utilizing existing PFPGrowth implementation ( >> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html). >> >> As for the Assoiciation Mining I would like to implement a very general >> algorithm based on old and established method called GUHA [1]. Short and >> informative description of GUHA method can be found here [2]. Very >> exhaustive theoretical foundation can be found here [3]. Since the GUHA >> method has been developing from 60's and is very rich now and given the >> limited time during GSoC program it would be wise to reduce scope of >> initial >> implementation. >> >> There are two main topics that I would like to focus on during GSoC: >> >> 1) Enhancing and generalization of PFPGworth: >> Frequent pattern mining is usually part of each association maining task. >> In >> Mahout there is existing implementation of PFPGrowth algorithm. I would >> like >> to utilize this implementaion but it would be necessary to enhance or >> generalize it first. The goal here would be to keep current functionality >> and performance of PFPGrowth but allow more generalized input/output data >> and conditions. We will need to get frequencies of very rare patterns thus >> if it will show up that mining only top K is limiting factor then we would >> need to allow relaxation of this approach. Also we will need to take >> account >> on negative features (those missing in individual transaction). Negative >> features can be either directly supported by implementation of FP >> algorithm >> or it can be coded at the feature level (decision about what is better >> approach will be part of the GSoC program). It should be noted that for >> the >> GSoC program we will narrow scope of association mining antecedents and >> succedents to conjunctions of data features only. >> >> 2) API for custom association mining functions based on 4ft table: >> Association mining in GUHA sense means testing hypothesis on four-fold >> table >> (4ft) [see 2. item 5]. There has been proposed a lot of association >> functions for GUHA, some of them are based on statistical tests (for >> example >> Fischer test, Chi-square test), some are based on frequent analysis while >> not explicitly refering to statistical tests but in both cases all >> frequencies from four-fold table are needed. Some tests/functions do not >> require all frequencies, however; keeping this association mining >> implementation general means that we are targeting for all frequencies >> from >> 4ft. The goal here would be to provide implementation of few GUHA >> functions >> and design API for custom function based on 4ft table (if the custom >> function can be expressed using 4ft table frequencies then it should be >> very >> easy to implement it for association mining job). >> >> General use case scenario: >> Before the association mining task is executed one would have to decide >> which features can be on the left hand side of the rule (antecedents) and >> which on the right hand side of the rule (succedents). It may be practical >> to limit number of features on both sides (rules with many features may >> not >> be very useful). Then a specific test or function for the 4ft table will >> be >> specified with additional custom treasholds. >> >> Note: The terminology used in GUHA theory is not be unified. Various >> researches used different terminology. This may be a source of confusion. >> >> My background: >> I have been working as a full-time Java developer for 9 years. Currently, >> I >> am working for JBoss (thus being paid for working on open source). A few >> years ago I also started Ph.D. studies in Ostrava, Czech Republic and >> association mining is the topic I am focusing on right now. Implementing >> association mining in context of Mahout makes sense because it is one of >> the >> areas of data mining which is not yet covered in Mahout at all. MapReduce >> implementation can be one possible way how to tackle the challenge of >> mining >> association rules from large data. >> >> Regards, >> Lukas Vlcek >> >> [1] >> >> http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC >> [2] http://www.cs.cas.cz/coufal/guha/index.htm >> [3] http://www.uivt.cas.cz/~hajek/guhabook/index.html< >> http://www.uivt.cas.cz/%7Ehajek/guhabook/index.html> >> > >