Re: Mahout GSoC 2010 proposal: Association Mining

Robin Anil Fri, 09 Apr 2010 05:11:27 -0700

Hi Lukáš,
            It would have been great if you could have participated in GSOC,
there is time left. But you still have your proposal in the GSOC system.
Take your time to decide, but if you choose not participate to do remove the
application from the soc website.


Wiki page for association mining would a good start. The pattern mining
package needs to grow beyond just the FPGrowth. Resource intensive
operations are what Mahout should do best on large datasets using Hadoop. I
can help around the code as much as you like for making it more generic and
suitable for association mining.

Regards
Robin

On Fri, Apr 9, 2010 at 4:56 PM, Lukáš Vlček <[email protected]> wrote:

> Robin,
>
> I think it does not make sense for me to catch with GSoC timeline now as I
> am quite busy with other stuff. However, I will develop the proposal for
> Association Mining (or GUHA if you like) and keep this discussion going on.
> I am really interested in contributing some implementation to Mahout but as
> of now the GCoS timeline is not of any help to me.
>
> Let me look at this in detail and I will get back to mahout community with
> more details.
>
> As for the use cases for association mining there can be find a lot
> examples
> in literature. When it comes to missing or negative attributes of the data
> (of the transaction) I think there can be a lot of examples as well. One
> example would be analysis of click stream, where you can learn that those
> people visiting some negative comment on product blog never enter order
> form. Not saying this is best example but in general this is the essence of
> it. You simply need to take all possible values from the transaction into
> account even if it is missing in the market basket. I also remember that
> Simpson's paradox is often referred in connection with this issue. As for
> GUHA the power of it is that it has well developed theory background. This
> for example means that it stated formalized framework for various analysis
> that have probably origin in psychology and similar kind of soft-science
> and
> "association-like-functions" between data attributes can be expressed using
> 4ft table members and user thresholds.
>
> The biggest challenge in implementing this would be the fact that the
> analysis have to deal with all the data (not just the most frequent
> patterns) and combinations. It is very resource expensive.
>
> I am thinking of starting a wiki page for Mahout about association mining
> ... this could help, what do you think?
>
> Regards,
> Lukas
>
> On Tue, Apr 6, 2010 at 12:01 AM, Robin Anil <[email protected]> wrote:
>
> > PS: Current TopK FPGrowth is pretty tightly coupled. But it can be easily
> > refactored out or even a vanilla implementation of FPGrowth is not so
> > difficult to re-create by re-using the existing methods.
> >
> > Robin
> >
> >
> > On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil <[email protected]> wrote:
> >
> > > Hi Lukas,
> > >                Sorry for being late to getting back to you on this.
> > > Association rule mining is a great addition to FPGrowth. I am not sure
> I
> > > understand GUHA method well but then again I understood Ted's LLR after
> > some
> > > deep reading. Could you put up an interesting example to help us
> > understand
> > > this method. Maybe starting from a transaction of shopping cart item ?
> A
> > > great demo is big plus for a GSOC project.
> > >
> > > Robin
> > >
> > >
> > > On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček <[email protected]
> > >wrote:
> > >
> > >> Hello,
> > >>
> > >> I would like to apply for Mahout GSoC 2010. My proposal is to
> implement
> > >> Association Mining algorithm utilizing existing PFPGrowth
> implementation
> > (
> > >> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).
> > >>
> > >> As for the Assoiciation Mining I would like to implement a very
> general
> > >> algorithm based on old and established method called GUHA [1]. Short
> and
> > >> informative description of GUHA method can be found here [2]. Very
> > >> exhaustive theoretical foundation can be found here [3]. Since the
> GUHA
> > >> method has been developing from 60's and is very rich now and given
> the
> > >> limited time during GSoC program it would be wise to reduce scope of
> > >> initial
> > >> implementation.
> > >>
> > >> There are two main topics that I would like to focus on during GSoC:
> > >>
> > >> 1) Enhancing and generalization of PFPGworth:
> > >> Frequent pattern mining is usually part of each association maining
> > task.
> > >> In
> > >> Mahout there is existing implementation of PFPGrowth algorithm. I
> would
> > >> like
> > >> to utilize this implementaion but it would be necessary to enhance or
> > >> generalize it first. The goal here would be to keep current
> > functionality
> > >> and performance of PFPGrowth but allow more generalized input/output
> > data
> > >> and conditions. We will need to get frequencies of very rare patterns
> > thus
> > >> if it will show up that mining only top K is limiting factor then we
> > would
> > >> need to allow relaxation of this approach. Also we will need to take
> > >> account
> > >> on negative features (those missing in individual transaction).
> Negative
> > >> features can be either directly supported by implementation of FP
> > >> algorithm
> > >> or it can be coded at the feature level (decision about what is better
> > >> approach will be part of the GSoC program). It should be noted that
> for
> > >> the
> > >> GSoC program we will narrow scope of association mining antecedents
> and
> > >> succedents to conjunctions of data features only.
> > >>
> > >> 2) API for custom association mining functions based on 4ft table:
> > >> Association mining in GUHA sense means testing hypothesis on four-fold
> > >> table
> > >> (4ft) [see 2. item 5]. There has been proposed a lot of association
> > >> functions for GUHA, some of them are based on statistical tests (for
> > >> example
> > >> Fischer test, Chi-square test), some are based on frequent analysis
> > while
> > >> not explicitly refering to statistical tests but in both cases all
> > >> frequencies from four-fold table are needed. Some tests/functions do
> not
> > >> require all frequencies, however; keeping this association mining
> > >> implementation general means that we are targeting for all frequencies
> > >> from
> > >> 4ft. The goal here would be to provide implementation of few GUHA
> > >> functions
> > >> and design API for custom function based on 4ft table (if the custom
> > >> function can be expressed using 4ft table frequencies then it should
> be
> > >> very
> > >> easy to implement it for association mining job).
> > >>
> > >> General use case scenario:
> > >> Before the association mining task is executed one would have to
> decide
> > >> which features can be on the left hand side of the rule (antecedents)
> > and
> > >> which on the right hand side of the rule (succedents). It may be
> > practical
> > >> to limit number of features on both sides (rules with many features
> may
> > >> not
> > >> be very useful). Then a specific test or function for the 4ft table
> will
> > >> be
> > >> specified with additional custom treasholds.
> > >>
> > >> Note: The terminology used in GUHA theory is not be unified. Various
> > >> researches used different terminology. This may be a source of
> > confusion.
> > >>
> > >> My background:
> > >> I have been working as a full-time Java developer for 9 years.
> > Currently,
> > >> I
> > >> am working for JBoss (thus being paid for working on open source). A
> few
> > >> years ago I also started Ph.D. studies in Ostrava, Czech Republic and
> > >> association mining is the topic I am focusing on right now.
> Implementing
> > >> association mining in context of Mahout makes sense because it is one
> of
> > >> the
> > >> areas of data mining which is not yet covered in Mahout at all.
> > MapReduce
> > >> implementation can be one possible way how to tackle the challenge of
> > >> mining
> > >> association rules from large data.
> > >>
> > >> Regards,
> > >> Lukas Vlcek
> > >>
> > >> [1]
> > >>
> > >>
> >
> http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC
> > >> [2] http://www.cs.cas.cz/coufal/guha/index.htm
> > >> [3] http://www.uivt.cas.cz/~hajek/guhabook/index.html<
> http://www.uivt.cas.cz/%7Ehajek/guhabook/index.html>
> > <
> > >> http://www.uivt.cas.cz/%7Ehajek/guhabook/index.html>
> > >>
> > >
> > >
> >
>

Re: Mahout GSoC 2010 proposal: Association Mining

Reply via email to