Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
Ted,

could you give some good links to papers or other resources about the
approaches you mention? I would like to look at them after the weekend.
As far as I can see, association mining (and the GUHA method in its original
form) is not meant to be a predictive method but rather an exploratory one
(it has some predictive power, but that is not formally supported by the
theoretical background). However, comparing association mining to other
approaches could be a very interesting topic as well.

Best regards,
Lukas

On Fri, Apr 9, 2010 at 8:03 PM, Ted Dunning  wrote:

> Lukas,
>
> The strongest alternative for this kind of application (and the normal
> choice for large-scale applications) is on-line gradient descent learning
> with an L_1 or L_1 + L_2 regularization.  The typical goal is to predict
> some outcome (click or purchase or signup) from a variety of
> large-vocabulary features.  As such, association mining is usually just a
> pre-processing step before the actual learning is applied.  There is some
> indication that an efficient sparse on-line gradient descent algorithm
> applied to features and combinations could do just as well, especially if
> the learning on combinations is applied in several passes.
>
> These on-line algorithms have the virtue of being extremely fast and, with
> feature sharding, of having substantial potential for parallel
> implementation.
>
> What do you think about these two methods?  Can you compare them?
>
> On Fri, Apr 9, 2010 at 4:26 AM, Lukáš Vlček  wrote:
>
> > One example would be click-stream analysis, where you can learn that
> > people who visit a negative comment on a product blog never reach the
> > order form. I am not saying this is the best example, but in general this
> > is the essence of it. You simply need to take all possible values from
> > the transaction into account, even those missing from the market basket.
> >
> > The biggest challenge in implementing this is that the analysis has to
> > deal with all the data (not just the most frequent patterns) and all
> > combinations. It is very resource expensive.
> >
>


Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Ted Dunning
Lukas,

The strongest alternative for this kind of application (and the normal
choice for large-scale applications) is on-line gradient descent learning
with an L_1 or L_1 + L_2 regularization.  The typical goal is to predict
some outcome (click or purchase or signup) from a variety of
large-vocabulary features.  As such, association mining is usually just a
pre-processing step before the actual learning is applied.  There is some
indication that an efficient sparse on-line gradient descent algorithm
applied to features and combinations could do just as well, especially if
the learning on combinations is applied in several passes.

These on-line algorithms have the virtue of being extremely fast and, with
feature sharding, of having substantial potential for parallel
implementation.
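
The kind of on-line update described above can be sketched in a few lines (a
minimal illustration, not Mahout code: logistic loss on sparse binary
features with a per-step soft-threshold approximating L_1; the feature names
and learning-rate/penalty values are made up):

```python
import math

def sgd_l1_step(w, features, y, lr=0.1, l1=0.01):
    """One on-line update: logistic loss on sparse binary features,
    then a soft-threshold step approximating L_1 regularization."""
    # current prediction; 'features' is the set of active feature names
    z = sum(w.get(f, 0.0) for f in features)
    p = 1.0 / (1.0 + math.exp(-z))
    for f in features:
        # sparse gradient step: only the active features are touched
        w[f] = w.get(f, 0.0) + lr * (y - p)
        # soft-threshold toward zero, which drives small weights to exactly 0
        w[f] = math.copysign(max(abs(w[f]) - lr * l1, 0.0), w[f])
    return w

# toy click-prediction stream: seeing a negative review predicts "no order"
w = {}
stream = [({"visited_negative_review"}, 0),
          ({"visited_product_page"}, 1)] * 200
for features, y in stream:
    sgd_l1_step(w, features, y)
```

After a few hundred toy events the "negative review" weight goes negative and
the "product page" weight goes positive; the soft-threshold keeps small
weights at exactly zero, which is what makes the sparse, sharded
implementation cheap.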

What do you think about these two methods?  Can you compare them?

On Fri, Apr 9, 2010 at 4:26 AM, Lukáš Vlček  wrote:

> One example would be click-stream analysis, where you can learn that people
> who visit a negative comment on a product blog never reach the order form.
> I am not saying this is the best example, but in general this is the
> essence of it. You simply need to take all possible values from the
> transaction into account, even those missing from the market basket.
>
> The biggest challenge in implementing this is that the analysis has to deal
> with all the data (not just the most frequent patterns) and all
> combinations. It is very resource expensive.
>


Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Robin Anil
Hi Lukáš,
It would have been great if you could have participated in GSoC; there is
still time left, and your proposal is still in the GSoC system. Take your
time to decide, but if you choose not to participate, do remove the
application from the SoC website.

A wiki page for association mining would be a good start. The pattern mining
package needs to grow beyond just FPGrowth. Resource-intensive operations on
large datasets using Hadoop are what Mahout should do best. I can help around
the code as much as you like to make it more generic and suitable for
association mining.

Regards
Robin

On Fri, Apr 9, 2010 at 4:56 PM, Lukáš Vlček  wrote:

> Robin,
>
> I think it does not make sense for me to try to catch up with the GSoC
> timeline now, as I am quite busy with other stuff. However, I will develop
> the proposal for Association Mining (or GUHA, if you like) and keep this
> discussion going. I am really interested in contributing an implementation
> to Mahout, but as of now the GSoC timeline is not of any help to me.
>
> Let me look at this in detail and I will get back to the Mahout community
> with more details.
>
> As for use cases for association mining, many examples can be found in the
> literature. When it comes to missing or negative attributes of the data (of
> the transaction), I think there are a lot of examples as well. One example
> would be click-stream analysis, where you can learn that people who visit a
> negative comment on a product blog never reach the order form. I am not
> saying this is the best example, but in general this is the essence of it.
> You simply need to take all possible values from the transaction into
> account, even those missing from the market basket. I also remember that
> Simpson's paradox is often referred to in connection with this issue. As
> for GUHA, its power is that it has a well-developed theoretical background.
> For example, it provides a formalized framework for various analyses (which
> probably have their origin in psychology and similar soft sciences), and
> "association-like" functions between data attributes can be expressed using
> 4ft table members and user thresholds.
>
> The biggest challenge in implementing this is that the analysis has to deal
> with all the data (not just the most frequent patterns) and all
> combinations. It is very resource expensive.
>
> I am thinking of starting a wiki page for Mahout about association mining
> ... it could help. What do you think?
>
> Regards,
> Lukas
>

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
Robin,

I think it does not make sense for me to try to catch up with the GSoC
timeline now, as I am quite busy with other stuff. However, I will develop
the proposal for Association Mining (or GUHA, if you like) and keep this
discussion going. I am really interested in contributing an implementation to
Mahout, but as of now the GSoC timeline is not of any help to me.

Let me look at this in detail and I will get back to the Mahout community
with more details.

As for use cases for association mining, many examples can be found in the
literature. When it comes to missing or negative attributes of the data (of
the transaction), I think there are a lot of examples as well. One example
would be click-stream analysis, where you can learn that people who visit a
negative comment on a product blog never reach the order form. I am not
saying this is the best example, but in general this is the essence of it.
You simply need to take all possible values from the transaction into
account, even those missing from the market basket. I also remember that
Simpson's paradox is often referred to in connection with this issue. As for
GUHA, its power is that it has a well-developed theoretical background. For
example, it provides a formalized framework for various analyses (which
probably have their origin in psychology and similar soft sciences), and
"association-like" functions between data attributes can be expressed using
4ft table members and user thresholds.

The biggest challenge in implementing this is that the analysis has to deal
with all the data (not just the most frequent patterns) and all combinations.
It is very resource expensive.

I am thinking of starting a wiki page for Mahout about association mining ...
it could help. What do you think?

Regards,
Lukas

On Tue, Apr 6, 2010 at 12:01 AM, Robin Anil  wrote:

> PS: The current TopK FPGrowth is pretty tightly coupled, but it can easily
> be refactored out; even a vanilla implementation of FPGrowth is not
> difficult to re-create by re-using the existing methods.
>
> Robin
>
>

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-05 Thread Robin Anil
PS: The current TopK FPGrowth is pretty tightly coupled, but it can easily be
refactored out; even a vanilla implementation of FPGrowth is not difficult to
re-create by re-using the existing methods.

Robin


On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil  wrote:

> Hi Lukas,
>    Sorry for being late in getting back to you on this.
> Association rule mining would be a great addition to FPGrowth. I am not
> sure I understand the GUHA method well, but then again I understood Ted's
> LLR only after some deep reading. Could you put up an interesting example
> to help us understand this method, maybe starting from a transaction of
> shopping cart items? A great demo is a big plus for a GSoC project.
>
> Robin
>
>

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-05 Thread Robin Anil
Hi Lukas,
   Sorry for being late in getting back to you on this.
Association rule mining would be a great addition to FPGrowth. I am not sure
I understand the GUHA method well, but then again I understood Ted's LLR only
after some deep reading. Could you put up an interesting example to help us
understand this method, maybe starting from a transaction of shopping cart
items? A great demo is a big plus for a GSoC project.
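
For reference, the LLR mentioned above (the G^2 log-likelihood ratio) is
computed from the same four counts that a GUHA 4ft table holds; a minimal
sketch, with made-up counts:

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """G^2 = 2 * sum over cells of k * ln(k / expected), where 'expected'
    is the count predicted by row/column independence."""
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    g2 = 0.0
    for k, r, c in ((k11, row1, col1), (k12, row1, col2),
                    (k21, row2, col1), (k22, row2, col2)):
        if k > 0:
            # k * total / (r * c) is observed/expected for this cell
            g2 += k * math.log(k * total / (r * c))
    return 2.0 * g2

# two items co-occurring far above independence -> large score
strong = llr_2x2(100, 10, 10, 1000)
# counts exactly matching independence -> score 0
none = llr_2x2(10, 10, 10, 10)
```

A large score flags a surprising association; a score near zero means the two
items look independent.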

Robin



Mahout GSoC 2010 proposal: Association Mining

2010-03-28 Thread Lukáš Vlček
Hello,

I would like to apply for Mahout GSoC 2010. My proposal is to implement an
association mining algorithm utilizing the existing PFPGrowth implementation (
http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).

As for the association mining, I would like to implement a very general
algorithm based on an old and established method called GUHA [1]. A short and
informative description of the GUHA method can be found here [2], and a very
exhaustive theoretical foundation here [3]. Since the GUHA method has been
developing since the 60's and is very rich by now, and given the limited time
of the GSoC program, it would be wise to reduce the scope of the initial
implementation.

There are two main topics that I would like to focus on during GSoC:

1) Enhancement and generalization of PFPGrowth:
Frequent pattern mining is usually part of every association mining task.
Mahout has an existing implementation of the PFPGrowth algorithm. I would
like to utilize this implementation, but it would be necessary to enhance or
generalize it first. The goal here would be to keep the current functionality
and performance of PFPGrowth but allow more generalized input/output data and
conditions. We will need to get frequencies of very rare patterns, so if it
turns out that mining only the top K patterns is a limiting factor, we will
need to allow relaxation of this approach. We will also need to take into
account negative features (those missing from an individual transaction).
Negative features can either be supported directly by the implementation of
the FP algorithm, or they can be coded at the feature level (deciding which
approach is better will be part of the GSoC program). It should be noted that
for the GSoC program we will narrow the scope of association mining
antecedents and succedents to conjunctions of data features only.
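
The feature-level coding of negative features could look roughly like this
(an illustrative sketch, not the proposed Mahout API; it assumes a fixed item
vocabulary and materializes each absent item as an explicit "not:" feature):

```python
# fixed vocabulary of items that can appear in a transaction (illustrative)
VOCABULARY = {"bread", "butter", "milk"}

def encode_with_negatives(transaction):
    """Turn a market basket into features that include explicit negative
    items, so a frequent-pattern miner sees absences as well as presences."""
    present = set(transaction) & VOCABULARY
    absent = VOCABULARY - present
    # sorted output keeps the encoding deterministic
    return sorted(present) + sorted("not:" + item for item in absent)

features = encode_with_negatives({"bread", "milk"})
```

The obvious cost is visible even in this toy: every transaction now carries
one feature per vocabulary item, which illustrates why mining with negative
features is so resource-intensive.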

2) API for custom association mining functions based on the 4ft table:
Association mining in the GUHA sense means testing hypotheses on a four-fold
table (4ft) [see 2, item 5]. Many association functions have been proposed
for GUHA; some of them are based on statistical tests (for example the Fisher
test or the Chi-square test), while others are based on frequency analysis
without explicitly referring to statistical tests, but in both cases all
frequencies from the four-fold table are needed. Some tests/functions do not
require all frequencies; however, keeping this association mining
implementation general means that we are targeting all frequencies from the
4ft table. The goal here would be to provide implementations of a few GUHA
functions and to design an API for custom functions based on the 4ft table
(if a custom function can be expressed using 4ft table frequencies, then it
should be very easy to implement it for an association mining job).
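
To make the 4ft table concrete: for antecedent A and succedent S it holds the
four counts a = |A and S|, b = |A and not S|, c = |not A and S|,
d = |not A and not S|, and a GUHA quantifier is a predicate over them. A
minimal sketch of the table plus one classic quantifier (founded implication:
confidence a/(a+b) >= p with support a >= Base; the thresholds and item names
here are made up):

```python
def fourfold(transactions, antecedent, succedent):
    """4ft table for set-valued antecedent/succedent over transactions:
    a = both, b = antecedent only, c = succedent only, d = neither."""
    a = b = c = d = 0
    for t in transactions:
        has_ant = antecedent <= t   # subset test
        has_suc = succedent <= t
        if has_ant and has_suc:
            a += 1
        elif has_ant:
            b += 1
        elif has_suc:
            c += 1
        else:
            d += 1
    return a, b, c, d

def founded_implication(a, b, c, d, p=0.9, base=2):
    """GUHA-style founded implication: confidence a/(a+b) >= p and support
    a >= base (this particular quantifier happens to use only a and b)."""
    return a >= base and a / (a + b) >= p

transactions = [{"beer", "diapers"}, {"beer", "diapers"},
                {"beer", "diapers"}, {"beer"}, {"milk"}]
table = fourfold(transactions, {"beer"}, {"diapers"})
```

A custom-function API along these lines would simply accept any predicate
over (a, b, c, d).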

General use case scenario:
Before the association mining task is executed, one would have to decide
which features can appear on the left hand side of a rule (antecedents) and
which on the right hand side (succedents). It may be practical to limit the
number of features on both sides (rules with many features may not be very
useful). Then a specific test or function for the 4ft table is specified,
together with additional custom thresholds.

Note: The terminology used in GUHA theory is not unified. Various researchers
have used different terminology. This may be a source of confusion.

My background:
I have been working as a full-time Java developer for 9 years. Currently, I
am working for JBoss (thus being paid to work on open source). A few years
ago I also started Ph.D. studies in Ostrava, Czech Republic, and association
mining is the topic I am focusing on right now. Implementing association
mining in the context of Mahout makes sense because it is one of the areas of
data mining not yet covered in Mahout at all. A MapReduce implementation can
be one possible way to tackle the challenge of mining association rules from
large data.

Regards,
Lukas Vlcek

[1]
http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC
[2] http://www.cs.cas.cz/coufal/guha/index.htm
[3] http://www.uivt.cas.cz/~hajek/guhabook/index.html