Re: Mahout GSoC 2010: Association Mining

2010-04-09 Thread Ted Dunning
Neal, I think that this might well be a useful contribution to Mahout, but,
if I am not mistaken, I think that the deadline for student proposals for
GSoC has just passed.

That likely means that making this contribution an official GSoC project is
not possible.  I am sure that the Mahout community would welcome you as a
contributor even without official Google status.  If you would like to do
this, go ahead and propose what you want to do (when JIRA comes back or just
by email discussion) and you can get started.

On Fri, Apr 9, 2010 at 2:11 PM, Neal Clark  wrote:

> Hello,
>
> I just wanted to introduce myself. I am a MSc. Computer Science
> student at the University of Victoria. My research over the past year
> has been focused on developing and implementing an Apriori based
> frequent item-set mining algorithm for mining large data sets at low
> support counts.
>
>
> https://docs.google.com/Doc?docid=0ATkk_-6ZolXnZGZjeGYzNzNfOTBjcjJncGpkaA&hl=en
>
> The main finding of the above report is that support levels as low as
> 0.001% on the webdocs (1.4GB) dataset can be efficiently calculated.
> On a 100-core cluster all frequent k2 pairs can be calculated in
> approximately 6 minutes.
>
> I currently have an optimized k2 Hadoop implementation and algorithm
> for generating frequent pairs, and I am now extending my work to
> itemsets of any length. The analysis of the extended approach will be
> complete within the next two weeks.
>
> Would you be interested in moving forward with such an implementation
>  as a GSoC project? If so, any comments/feedback would be very much
> appreciated. If you are interested I can create a proposal and submit
> it to your issue tracker when it comes back online.
>
> Thanks,
>
> Neal.
>


Mahout GSoC 2010: Association Mining

2010-04-09 Thread Neal Clark
Hello,

I just wanted to introduce myself. I am a MSc. Computer Science
student at the University of Victoria. My research over the past year
has been focused on developing and implementing an Apriori based
frequent item-set mining algorithm for mining large data sets at low
support counts.

https://docs.google.com/Doc?docid=0ATkk_-6ZolXnZGZjeGYzNzNfOTBjcjJncGpkaA&hl=en

The main finding of the above report is that support levels as low as
0.001% on the webdocs (1.4GB) dataset can be efficiently calculated.
On a 100-core cluster all frequent k2 pairs can be calculated in
approximately 6 minutes.

I currently have an optimized k2 Hadoop implementation and algorithm
for generating frequent pairs, and I am now extending my work to
itemsets of any length. The analysis of the extended approach will be
complete within the next two weeks.
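
For illustration only (this is not the implementation referred to above), a k2
pair-counting pass on Hadoop could look roughly like the sketch below. It assumes
each input line is one transaction of space-separated, distinct item IDs, that
infrequent single items have already been pruned, and the "minSupport" parameter
name is made up for the example.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FrequentPairsSketch {

  public static class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text transaction, Context context)
        throws IOException, InterruptedException {
      String[] items = transaction.toString().split("\\s+");
      Arrays.sort(items);  // canonical order so each pair maps to a single key
      for (int i = 0; i < items.length; i++) {
        for (int j = i + 1; j < items.length; j++) {
          context.write(new Text(items[i] + ',' + items[j]), ONE);
        }
      }
    }
  }

  public static class PairReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int minSupport = context.getConfiguration().getInt("minSupport", 2);
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      if (sum >= minSupport) {
        context.write(pair, new IntWritable(sum));  // keep only frequent pairs
      }
    }
  }
}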

Would you be interested in moving forward with such an implementation
 as a GSoC project? If so, any comments/feedback would be very much
appreciated. If you are interested I can create a proposal and submit
it to your issue tracker when it comes back online.

Thanks,

Neal.


Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
Ted,

do you think you can give some good links to papers or other resources about
the mentioned approaches? I would like to look at them after the weekend.
As far as I can see, association mining (and the GUHA method in its
original form) is not meant to be a predictive method but rather a data
exploratory method (it has some kind of predictive power, but this is not
formally supported in the theoretical background). However, comparing
association mining to other approaches could be a very interesting topic as
well.

Best regards,
Lukas

On Fri, Apr 9, 2010 at 8:03 PM, Ted Dunning  wrote:

> Lukas,
>
> The strongest alternative for this kind of application (and the normal
> choice for large scale applications) is on-line gradient descent learning
> with an L_1 or L_1 + L_2 regularization.  The typical goal is to predict
> some outcome (click or purchase or signup) from a variety of large
> vocabulary features.  As such association mining is usually just a
> pre-processing step before actual learning is applied.  There is some
> indication that an efficient sparse on-line gradient descent algorithm
> applied to features and combinations could do just as well, especially if
> the learning on combinations is applied in several passes.
>
> These on-line algorithms have the virtue of being extremely fast and with
> feature sharding, have substantial potential for parallel implementation.
>
> What do you think about these two methods?  Can you compare them?
>
> On Fri, Apr 9, 2010 at 4:26 AM, Lukáš Vlček  wrote:
>
> >  One
> > example would be the analysis of a click stream, where you can learn that
> > people who visit a negative comment on a product blog never enter the
> > order form. I am not saying this is the best example, but in general this
> > is the essence of it. You simply need to take all possible values from the
> > transaction into account even if they are missing from the market basket.
> >
> > The biggest challenge in implementing this would be the fact that the
> > analysis has to deal with all the data (not just the most frequent
> > patterns) and combinations. It is very resource-expensive.
> >
>


[jira] Updated: (MAHOUT-371) [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop

2010-04-09 Thread Richard Simon Just (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Simon Just updated MAHOUT-371:
--

Description: 
Proposal Title: [MAHOUT-371] Proposal to implement Distributed SVD++ 
Recommender using Hadoop

Student Name: Richard Simon Just

Student E-mail:i...@richardsimonjust.co.uk

Organization/Project: Apache Mahout

Assigned Mentor:


Proposal Abstract: 
During the Netflix Prize Challenge one of the most popular forms of recommender 
algorithm was matrix factorisation, in particular Singular Value 
Decomposition (SVD). As such, this proposal looks to implement a distributed 
version of one of the most successful SVD-based recommender algorithms from the 
Netflix competition, namely the SVD++ algorithm.
SVD++ improves upon other basic SVD algorithms by incorporating implicit 
feedback [1]. That is to say, it is able to take into account not just 
explicit user preferences, but also feedback such as, in the case of a company 
like Netflix, whether a movie has been rented. Implicit feedback assumes that 
the fact of there being some correlation between the user and the item is more 
important than whether the correlation is positive or negative. Implicit 
feedback would account for an item having been rated, but not what the rating was.
The implementation will include testing, in-depth documentation and a 
demo/tutorial. If there is time, I will also look at developing the algorithm 
into the timeSVD++ algorithm [2,3]. timeSVD++ further improves the results 
of the SVD algorithm by taking into account temporal dynamics, i.e. the way 
users' preferences for items, and the way they rate items, can change over time. 
According to [2], the accuracy gains from implementing timeSVD++ are 
significantly bigger than the gains from going from SVD to SVD++.
The overall project will provide three deliverables:
1. The basic framework for a distributed SVD-based recommender
2. A distributed SVD++ implementation and demo
3. A distributed timeSVD++ implementation



Detailed Description:

The SVD++ algorithm uses the principle of categorising users and items into 
factors, combined with regularisation and implicit feedback, to predict how much 
a user is likely to match an item. Factors are abstract categories that are 
created by comparing the data presented. Factor values are grades indicating how 
much each user/item is related to that category. For example, with the Netflix 
data a factor could loosely correspond to a movie genre, a director or a story 
line. The more factors used, the more detailed the categories are likely to 
become, and thus the more accurate the predictions are likely to become.

Implicit feedback is based on the theory that a user having any sort of 
relationship to an item is more important than whether they rated it, or what 
rating they gave. The assumption is that even if a user does not like an item, 
or has not rated it, the very fact that they chose to have some interaction 
with it indicates something about their preferences. In the Netflix case this 
would be represented by whether a user has rated a movie or not; it could also 
take into account the user's rental history.

As well as the actual implementation of the code, the project has two other 
deliverable focuses: the readability, documentation and testing of the code; and 
a full tutorial and demo of the code. It is felt that without these the 
implementation is not really complete or fully usable.

The recommender consists of five main parts: the job class that runs the code, an 
input conversion section, a training section, a re-trainer section and a 
prediction section. A brief overview of these sections follows.


The SVD++  Classes:

The Recommender Job class:
This class is the foundation of the recommender and allows it to run on Hadoop 
by implementing the Tool interface through AbstractJob. This class will parse 
any user arguments and set up the jobs that will run the algorithm on 
Map/Reduce, much in the same way as Mahout's other distributed recommenders do, 
such as RecommenderJob.


Input Mapper/Reducer Classes:
The Mapper will take the input data and convert it to key/value pairs in the 
form of Hadoop Writables. The Reducer will take the Mapper's Writables and 
create sparse vectors. This is in keeping with Mahout's ToItemPrefsMapper and 
ToUserVectorReducer.

Training Mapper/Reducer Classes:
This phase creates the factor matrices that will be used to create the 
predictions later. It does this by making a prediction of a known rating using 
the SVD, comparing it against the known rating, calculating the error and 
updating the factors accordingly. The Mapper will loop through the training of 
the factor matrices. The Reducer will collect the outputs of the Mapper to 
create the dense factor matrices.
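
For illustration, the per-rating training step described above might look roughly 
like the plain SVD-style update below (a sketch only, without the implicit-feedback 
terms of SVD++; all names such as userFactors, learningRate and lambda are 
illustrative, not part of the proposed code).

// Sketch of one error-driven training step: predict, compute the error,
// then nudge the user and item factor vectors (with regularisation).
static double predict(double[] userFactors, double[] itemFactors) {
  double dot = 0.0;
  for (int f = 0; f < userFactors.length; f++) {
    dot += userFactors[f] * itemFactors[f];
  }
  return dot;
}

static void trainOnRating(double[] userFactors, double[] itemFactors,
                          double rating, double learningRate, double lambda) {
  double error = rating - predict(userFactors, itemFactors);
  for (int f = 0; f < userFactors.length; f++) {
    double pu = userFactors[f];
    double qi = itemFactors[f];
    userFactors[f] = pu + learningRate * (error * qi - lambda * pu);
    itemFactors[f] = qi + learningRate * (error * pu - lambda * qi);
  }
}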

Re-trainer Mapper/Reducer Classes:
This is a separate job that

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Ted Dunning
Lukas,

The strongest alternative for this kind of application (and the normal
choice for large scale applications) is on-line gradient descent learning
with an L_1 or L_1 + L_2 regularization.  The typical goal is to predict
some outcome (click or purchase or signup) from a variety of large
vocabulary features.  As such association mining is usually just a
pre-processing step before actual learning is applied.  There is some
indication that an efficient sparse on-line gradient descent algorithm
applied to features and combinations could do just as well, especially if
the learning on combinations is applied in several passes.

These on-line algorithms have the virtue of being extremely fast and with
feature sharding, have substantial potential for parallel implementation.
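
As a rough illustration of the kind of update meant here (not Mahout code; the 
class, the binary-feature assumption and the simple soft-thresholding L_1 step 
are all made up for the example):

import java.util.HashMap;
import java.util.Map;

class SparseOnlineLogisticSketch {

  private final Map<String, Double> weights = new HashMap<String, Double>();
  private final double learningRate;
  private final double l1;

  SparseOnlineLogisticSketch(double learningRate, double l1) {
    this.learningRate = learningRate;
    this.l1 = l1;
  }

  private double weightOf(String feature) {
    Double w = weights.get(feature);
    return w == null ? 0.0 : w;
  }

  /** Probability of the positive outcome (click/purchase/signup) for binary features. */
  double predict(Iterable<String> features) {
    double score = 0.0;
    for (String f : features) {
      score += weightOf(f);                // sparse: only touch present features
    }
    return 1.0 / (1.0 + Math.exp(-score)); // logistic link
  }

  /** One on-line gradient step; label is 0 or 1. */
  void train(Iterable<String> features, int label) {
    double gradient = label - predict(features);
    for (String f : features) {
      double w = weightOf(f) + learningRate * gradient;
      // crude L_1 step: shrink toward zero and clip (soft-thresholding)
      w = Math.signum(w) * Math.max(0.0, Math.abs(w) - learningRate * l1);
      weights.put(f, w);
    }
  }
}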

What do you think about these two methods?  Can you compare them?

On Fri, Apr 9, 2010 at 4:26 AM, Lukáš Vlček  wrote:

>  One
> example would be the analysis of a click stream, where you can learn that
> people who visit a negative comment on a product blog never enter the order
> form. I am not saying this is the best example, but in general this is the
> essence of it. You simply need to take all possible values from the
> transaction into account even if they are missing from the market basket.
>
> The biggest challenge in implementing this would be the fact that the
> analysis has to deal with all the data (not just the most frequent
> patterns) and combinations. It is very resource-expensive.
>


[jira] Created: (MAHOUT-374) GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342)

2010-04-09 Thread Yinghua Hu (JIRA)
GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342) 
-

 Key: MAHOUT-374
 URL: https://issues.apache.org/jira/browse/MAHOUT-374
 Project: Mahout
  Issue Type: New Feature
  Components: Classification
 Environment: Ubuntu 9.10
Reporter: Yinghua Hu




* Introduction

In recent years, multicore computers have become mainstream. However, the potential 
of multicore has not been fully exploited for machine learning because of the lack 
of a good programming framework for multicore. Recently, Chu et al. [CHU07] 
adapted Google's map-reduce paradigm [DEA04] and implemented 10 different machine 
learning algorithms on multicore processors. Their results show an almost double 
speedup on average with a dual processor. Their work has inspired a lot of 
interesting projects under Mahout at the Apache Software Foundation.

An artificial neural network (ANN) is a supervised learning tool inspired by 
the biological nervous system. It can capture and represent complex 
input/output relationships. A neural network has the capability to represent both 
linear and non-linear relationships by learning the relationships directly from 
the data being modeled. Artificial neural networks have been widely applied to 
classification, data processing and function approximation.

We propose to implement an artificial neural network with back-propagation 
under the map-reduce framework. Successful delivery should include a fast neural 
network customized for multicore computers.



* Methodology

In this section, we briefly introduce map-reduce and the back-propagation neural 
network.

- Map-reduce

Map-reduce [DEA04] is a programming model developed by Google. It gets its name 
from the map and reduce primitives in functional languages such as Lisp. Its 
inventors realized that most of their distributed computations involve applying a 
map operation and a reduce operation: the map operation computes a set of 
intermediate key/value pairs for each logical record in the input, and the reduce 
operation is applied to all the values that share the same key in order to 
combine the derived data appropriately. The reduce step allows users to handle 
lists of values too large to fit in memory.

Map-reduce makes it possible to have a simple interface enabling automatic 
parallelization and distribution of large-scale computations. The programming 
model can be applied on multicore personal computers as well as large 
clusters of commodity PCs.

Here is an example of map-reduce from [DEA04]. To count the number of 
occurrences of each word in a large collection of documents, the user can write 
something like the pseudocode below:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function above emits each word and its associated counts (one in this 
example). The reduce function sums together all counts emitted for a particular 
word.
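
Written against the Hadoop Java API, the same example might look roughly like 
the following (a self-contained sketch for illustration only, driver class omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);          // EmitIntermediate(w, "1")
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();                       // result += ParseInt(v)
      }
      context.write(word, new IntWritable(sum)); // Emit(AsString(result))
    }
  }
}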

The map-reduce model has been successfully applied to a large variety of 
problems, such as Google's Web search service, sorting, data mining, etc. It 
helps to reduce the amount of data sent across the network. In addition, its 
ease of use makes parallel and distributed computing achievable even for 
inexperienced programmers.

[CHU07] provides a multicore map-reduce framework that is capable of 
implementing algorithms fitting the Statistical Query Model. The neural network 
is one of the ten machine learning algorithms fitting this requirement.

- Neural Network

The neural network is one of many modern technologies inspired by biology. It is 
a model which can perform simple computations such as linear regression as well 
as very complex nonlinear calculations. Without doubt, the neural network is one 
of the most popular machine learning methodologies.

The simplest artificial neural network is a single-layer perceptron as shown in 
Figure 1.  It contains a single layer of adaptive weights.  The output is 
calculated by applying an activation function f to the weighted sum of inputs:

http://www.cs.ucf.edu/~yhu/gsoc/formula1.JPG (1)

http://www.cs.ucf.edu/~yhu/gsoc/fig1.JPG
Figure 1 - Single-layer Perceptron
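
As a minimal illustration of formula (1) (with the activation function f taken to 
be the logistic sigmoid purely for the example):

// output = f(sum_i w_i * x_i + bias), with f chosen as the logistic sigmoid here.
static double perceptronOutput(double[] weights, double[] inputs, double bias) {
  double weightedSum = bias;
  for (int i = 0; i < inputs.length; i++) {
    weightedSum += weights[i] * inputs[i];
  }
  return 1.0 / (1.0 + Math.exp(-weightedSum));  // activation function f
}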

The limitation of a single-layer perceptron is that it can only model linear 
relationships between an output and the inputs. This is overcome by multi-layer 
perceptrons. Multi-layer perceptrons contain several layers of adaptive 
weights. In Figure 2, a hidden layer is added to the single-layer perceptron 
of Figure 1 to form a two-layer perceptron. If there is an activation function 
g for the first layer of adaptive weights

GSOC Create Sql adapters proposal

2010-04-09 Thread Necati Batur
Hi,

Create adapters for MySQL and NoSQL (HBase, Cassandra) to access data for all
the algorithms to use.

Necati Batur ; necatiba...@gmail.com



Mahout / Mahout - 332 : Assigned Mentor is Robin Anil



Proposal Abstract:

It would be useful to use Thrift as the protocol for the NoSQL systems, as
opposed to their native APIs, so that a nice abstraction could be made
for all the NoSQL systems in general, with specific Thrift client
implementations added to maximize code re-use. Even if someone were to make
the port for one NoSQL client, having the demarcation would help others to
pick it up and port it.
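
As an illustration only, the abstraction might take a shape like the following
(the interface and method names are hypothetical, not an existing Mahout API):

import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

// A common adapter interface; each backend hides its own client behind it.
public interface DataAdapter extends Closeable {
  void connect(Map<String, String> connectionProperties) throws IOException;
  /** Stream rows of a table/column family as column-name -> value maps. */
  Iterator<Map<String, String>> scan(String table) throws IOException;
  void put(String table, String rowKey, Map<String, String> columns) throws IOException;
}

// Backend-specific implementations would then live behind the same interface, e.g.
// class MySqlAdapter implements DataAdapter { ... }          // JDBC underneath
// class HBaseThriftAdapter implements DataAdapter { ... }    // Thrift client underneath
// class CassandraThriftAdapter implements DataAdapter { ... }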

Detailed Description:

The data adapters for the higher-level languages will require good
capability with data structures and some knowledge of finite
mathematics, issues I am confident about. The code given in the SVN
repository also seems to need some improvements and documentation.

Briefly, I would do the following operations for this project:



   1. Understand the underlying maths for the adapters
   2. Determine the data structures that would be used for the adapters
   3. Implement the necessary methods to handle creation of these
   structures
   4. Write the test cases we will probably need to check whether our code
   covers all the issues required by data retrieval operations
   5. New iterations on the code to make the algorithms more robust
   6. Documentation of the overall project to join our particular project to
   the overall scope

Additional Information:

First of all, I am very excited to join a program like GSoC and, most
importantly, to work on a big open source project like Apache. I am looking for
good collaboration and new challenges in software development. Information
management issues in particular sound great to me. I am confident working with
new technologies. I took the Data Structures I and II courses at university,
so I am comfortable with data structures. Most importantly, I am interested in
databases. From my software engineering course experience I know how to work
on a project with iterative development and timelining.


[jira] Created: (MAHOUT-373) VectorDumper/VectorHelper doesn't dump values when dictionary is present

2010-04-09 Thread Drew Farris (JIRA)
VectorDumper/VectorHelper doesn't dump values when dictionary is present


 Key: MAHOUT-373
 URL: https://issues.apache.org/jira/browse/MAHOUT-373
 Project: Mahout
  Issue Type: Improvement
  Components: Utils
Affects Versions: 0.3
Reporter: Drew Farris
Priority: Minor


Dumping term vectors with a dictionary using:

{{mahout vectordump -s vector-output/chunk-0 -dt sequencefile -d 
dictionary-output}}

gives me output like the following with no values, just the indexes expanded 
into their dictionary entries:


{code}
Name: 0001-11055 elts: {513:discard, 7199:empty,...
{code}

While dumping the same vector without a dictionary using

{{mahout vectordump -s vector-output/chunk-0}}

gives me output that includes indexes and values:

{code}
Name: 0001-11055 elts: {513:1.0, 7199:1.0...
{code}


Would it make sense for the dictionary-based output to include values as well? 
Anyone opposed to modifying VectorHelper.vectorToString(Vector, String[]) to do 
so?
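
For illustration, a sketch of the kind of change meant (assuming the 
Vector.iterateNonZero()/Element API of this Mahout version; untested):

{code}
public static String vectorToString(Vector vector, String[] dictionary) {
  StringBuilder result = new StringBuilder("{");
  Iterator<Vector.Element> iter = vector.iterateNonZero();
  while (iter.hasNext()) {
    Vector.Element elt = iter.next();
    String term = dictionary == null ? String.valueOf(elt.index()) : dictionary[elt.index()];
    // emit both the dictionary term and the element's value
    result.append(term).append(':').append(elt.get()).append(", ");
  }
  return result.append('}').toString();
}
{code}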

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-372) Partitioning Collaborative Filtering Job into Maps and Reduces

2010-04-09 Thread Kris Jack (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855381#action_12855381
 ] 

Kris Jack commented on MAHOUT-372:
--

Thanks for your reply.  I'll run it using the command line parameters and 
hopefully get it working faster.  Thanks for letting me know also about the 
other mailing list, I'll use that in the future for such questions.

> Partitioning Collaborative Filtering Job into Maps and Reduces
> --
>
> Key: MAHOUT-372
> URL: https://issues.apache.org/jira/browse/MAHOUT-372
> Project: Mahout
>  Issue Type: Question
>  Components: Collaborative Filtering
>Affects Versions: 0.4
> Environment: Ubuntu Koala
>Reporter: Kris Jack
>Assignee: Sean Owen
> Fix For: 0.4
>
>
> I am running the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob main 
> on my hadoop cluster and it partitions the job in 2 although I have more than 
> 2 nodes available.  I was reading that the partitioning could be changed by 
> setting the JobConf's conf.setNumMapTasks(int num) and 
> conf.setNumReduceTasks(int num).
> Would I be right in assuming that this would speed up the processing by 
> increasing these, say, to 4?  Can this code be partitioned into many 
> reducers?  If so, would setting them in the protected AbstractJob::JobConf 
> prepareJobConf() function be appropriate?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAHOUT-372) Partitioning Collaborative Filtering Job into Maps and Reduces

2010-04-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved MAHOUT-372.
--

   Resolution: Fixed
Fix Version/s: 0.4
 Assignee: Sean Owen

Yes, sure there's no particular limit to the number of mappers or reducers. 

These are Hadoop params, which you can set on the command line with, for 
example:
-Dmapred.map.tasks=10 -Dmapred.reduce.tasks=10
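
A full invocation might look roughly like the following (jar name and paths are 
only placeholders, and the -D options go before the job's own arguments):

hadoop jar mahout-core-0.4-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.map.tasks=10 -Dmapred.reduce.tasks=10 \
  --input /path/to/preferences.csv --output /path/to/output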

Reopen if that doesn't quite answer the question. (We can also discuss on 
mahout-u...@apache.org, perhaps, if this isn't necessarily a bug or enhancement 
request.)

> Partitioning Collaborative Filtering Job into Maps and Reduces
> --
>
> Key: MAHOUT-372
> URL: https://issues.apache.org/jira/browse/MAHOUT-372
> Project: Mahout
>  Issue Type: Question
>  Components: Collaborative Filtering
>Affects Versions: 0.4
> Environment: Ubuntu Koala
>Reporter: Kris Jack
>Assignee: Sean Owen
> Fix For: 0.4
>
>
> I am running the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob main 
> on my hadoop cluster and it partitions the job in 2 although I have more than 
> 2 nodes available.  I was reading that the partitioning could be changed by 
> setting the JobConf's conf.setNumMapTasks(int num) and 
> conf.setNumReduceTasks(int num).
> Would I be right in assuming that this would speed up the processing by 
> increasing these, say, to 4?  Can this code be partitioned into many 
> reducers?  If so, would setting them in the protected AbstractJob::JobConf 
> prepareJobConf() function be appropriate?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Robin Anil
Hi Lukáš,
It would have been great if you could have participated in GSoC, and
there is still time left. You still have your proposal in the GSoC system.
Take your time to decide, but if you choose not to participate, do remove the
application from the SoC website.

A wiki page for association mining would be a good start. The pattern mining
package needs to grow beyond just FPGrowth. Resource-intensive operations on
large datasets, using Hadoop, are what Mahout should do best. I can help around
the code as much as you like to make it more generic and suitable for
association mining.

Regards
Robin

On Fri, Apr 9, 2010 at 4:56 PM, Lukáš Vlček  wrote:

> Robin,
>
> I think it does not make sense for me to catch up with the GSoC timeline now,
> as I am quite busy with other stuff. However, I will develop the proposal for
> Association Mining (or GUHA if you like) and keep this discussion going.
> I am really interested in contributing some implementation to Mahout, but as
> of now the GSoC timeline is not of any help to me.
>
> Let me look at this in detail and I will get back to the Mahout community with
> more details.
>
> As for the use cases for association mining, a lot of examples can be found
> in the literature. When it comes to missing or negative attributes of the data
> (of the transaction), I think there are a lot of examples as well. One
> example would be the analysis of a click stream, where you can learn that
> people who visit a negative comment on a product blog never enter the order
> form. I am not saying this is the best example, but in general this is the
> essence of it. You simply need to take all possible values from the
> transaction into account even if they are missing from the market basket. I
> also remember that Simpson's paradox is often referred to in connection with
> this issue. As for GUHA, its power is that it has a well-developed theoretical
> background. This means, for example, that it states a formalized framework for
> various analyses that probably have their origin in psychology and similar
> kinds of soft science, and "association-like functions" between data
> attributes can be expressed using 4ft table members and user thresholds.
>
> The biggest challenge in implementing this would be the fact that the
> analysis has to deal with all the data (not just the most frequent
> patterns) and combinations. It is very resource-expensive.
>
> I am thinking of starting a wiki page for Mahout about association mining
> ... this could help, what do you think?
>
> Regards,
> Lukas
>
> On Tue, Apr 6, 2010 at 12:01 AM, Robin Anil  wrote:
>
> > PS: Current TopK FPGrowth is pretty tightly coupled. But it can be easily
> > refactored out or even a vanilla implementation of FPGrowth is not so
> > difficult to re-create by re-using the existing methods.
> >
> > Robin
> >
> >
> > On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil  wrote:
> >
> > > Hi Lukas,
> > > Sorry for being late in getting back to you on this.
> > > Association rule mining is a great addition to FPGrowth. I am not sure I
> > > understand the GUHA method well, but then again I understood Ted's LLR
> > > only after some deep reading. Could you put up an interesting example to
> > > help us understand this method? Maybe starting from a transaction of
> > > shopping cart items? A great demo is a big plus for a GSoC project.
> > >
> > > Robin
> > >
> > >
> > > On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček  > >wrote:
> > >
> > >> Hello,
> > >>
> > >> I would like to apply for Mahout GSoC 2010. My proposal is to implement
> > >> an Association Mining algorithm utilizing the existing PFPGrowth
> > >> implementation (
> > >> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).
> > >>
> > >> As for the Association Mining, I would like to implement a very general
> > >> algorithm based on an old and established method called GUHA [1]. A short
> > >> and informative description of the GUHA method can be found here [2]. A
> > >> very exhaustive theoretical foundation can be found here [3]. Since the
> > >> GUHA method has been developing since the 60s and is very rich now, and
> > >> given the limited time during the GSoC program, it would be wise to
> > >> reduce the scope of the initial implementation.
> > >>
> > >> There are two main topics that I would like to focus on during GSoC:
> > >>
> > >> 1) Enhancing and generalization of PFPGrowth:
> > >> Frequent pattern mining is usually part of each association mining task.
> > >> In Mahout there is an existing implementation of the PFPGrowth algorithm.
> > >> I would like to utilize this implementation, but it would be necessary to
> > >> enhance or generalize it first. The goal here would be to keep the
> > >> current functionality and performance of PFPGrowth but allow more
> > >> generalized input/output data and conditions. We will need to get
> > >> frequencies of very rare patterns, thus if it turns out that mining only
> > >> the top K is a limiting factor then we would need to all

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
Robin,

I think it does not make sense for me to catch up with the GSoC timeline now,
as I am quite busy with other stuff. However, I will develop the proposal for
Association Mining (or GUHA if you like) and keep this discussion going.
I am really interested in contributing some implementation to Mahout, but as
of now the GSoC timeline is not of any help to me.

Let me look at this in detail and I will get back to the Mahout community with
more details.

As for the use cases for association mining, a lot of examples can be found
in the literature. When it comes to missing or negative attributes of the data
(of the transaction), I think there are a lot of examples as well. One
example would be the analysis of a click stream, where you can learn that
people who visit a negative comment on a product blog never enter the order
form. I am not saying this is the best example, but in general this is the
essence of it. You simply need to take all possible values from the
transaction into account even if they are missing from the market basket. I
also remember that Simpson's paradox is often referred to in connection with
this issue. As for GUHA, its power is that it has a well-developed theoretical
background. This means, for example, that it states a formalized framework for
various analyses that probably have their origin in psychology and similar
kinds of soft science, and "association-like functions" between data
attributes can be expressed using 4ft table members and user thresholds.
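
To make the 4ft idea concrete, here is a minimal sketch (illustrative only, not
existing GUHA tooling) of one classical quantifier, founded implication,
evaluated from the four-fold table counts; the parameter names are made up:

// a = #(A and S), b = #(A and not S), c = #(not A and S), d = #(not A and not S);
// p and base are the user thresholds mentioned above.
static boolean foundedImplication(long a, long b, long c, long d, double p, long base) {
  // c and d are unused by this particular quantifier but belong to the 4ft table.
  return a >= base && a >= p * (a + b);   // confidence a/(a+b) >= p and support a >= base
}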

The biggest challenge in implementing this would be the fact that the
analysis has to deal with all the data (not just the most frequent
patterns) and combinations. It is very resource-expensive.

I am thinking of starting a wiki page for Mahout about association mining
... this could help, what do you think?

Regards,
Lukas

On Tue, Apr 6, 2010 at 12:01 AM, Robin Anil  wrote:

> PS: Current TopK FPGrowth is pretty tightly coupled. But it can be easily
> refactored out or even a vanilla implementation of FPGrowth is not so
> difficult to re-create by re-using the existing methods.
>
> Robin
>
>
> On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil  wrote:
>
> > Hi Lukas,
> > Sorry for being late in getting back to you on this.
> > Association rule mining is a great addition to FPGrowth. I am not sure I
> > understand the GUHA method well, but then again I understood Ted's LLR only
> > after some deep reading. Could you put up an interesting example to help us
> > understand this method? Maybe starting from a transaction of shopping cart
> > items? A great demo is a big plus for a GSoC project.
> >
> > Robin
> >
> >
> > On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček  >wrote:
> >
> >> Hello,
> >>
> >> I would like to apply for Mahout GSoC 2010. My proposal is to implement
> >> an Association Mining algorithm utilizing the existing PFPGrowth
> >> implementation (
> >> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).
> >>
> >> As for the Association Mining, I would like to implement a very general
> >> algorithm based on an old and established method called GUHA [1]. A short
> >> and informative description of the GUHA method can be found here [2]. A
> >> very exhaustive theoretical foundation can be found here [3]. Since the
> >> GUHA method has been developing since the 60s and is very rich now, and
> >> given the limited time during the GSoC program, it would be wise to reduce
> >> the scope of the initial implementation.
> >>
> >> There are two main topics that I would like to focus on during GSoC:
> >>
> >> 1) Enhancing and generalization of PFPGrowth:
> >> Frequent pattern mining is usually part of each association mining task.
> >> In Mahout there is an existing implementation of the PFPGrowth algorithm.
> >> I would like to utilize this implementation, but it would be necessary to
> >> enhance or generalize it first. The goal here would be to keep the current
> >> functionality and performance of PFPGrowth but allow more generalized
> >> input/output data and conditions. We will need to get frequencies of very
> >> rare patterns, thus if it turns out that mining only the top K is a
> >> limiting factor then we would need to allow relaxation of this approach.
> >> We will also need to take into account negative features (those missing in
> >> an individual transaction). Negative features can either be directly
> >> supported by the implementation of the FP algorithm or be coded at the
> >> feature level (the decision about which approach is better will be part of
> >> the GSoC program). It should be noted that for the GSoC program we will
> >> narrow the scope of association mining antecedents and succedents to
> >> conjunctions of data features only.
> >>
> >> 2) API for custom association mining functions based on the 4ft table:
> >> Association mining in the GUHA sense means testing hypotheses on a
> >> four-fold table (4ft) [see 2, item 5]. A lot of association functions have
> >> been proposed for GUHA, some of them based on statistical tests (for
> >> example the Fisher test, C

[jira] Created: (MAHOUT-372) Partitioning Collaborative Filtering Job into Maps and Reduces

2010-04-09 Thread Kris Jack (JIRA)
Partitioning Collaborative Filtering Job into Maps and Reduces
--

 Key: MAHOUT-372
 URL: https://issues.apache.org/jira/browse/MAHOUT-372
 Project: Mahout
  Issue Type: Question
  Components: Collaborative Filtering
Affects Versions: 0.4
 Environment: Ubuntu Koala
Reporter: Kris Jack


I am running the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob main on 
my hadoop cluster and it partitions the job in 2 although I have more than 2 
nodes available.  I was reading that the partitioning could be changed by 
setting the JobConf's conf.setNumMapTasks(int num) and 
conf.setNumReduceTasks(int num).

Would I be right in assuming that this would speed up the processing by 
increasing these, say, to 4?  Can this code be partitioned into many reducers?  
If so, would setting them in the protected AbstractJob::JobConf 
prepareJobConf() function be appropriate?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [GSOC] 2010 Timelines

2010-04-09 Thread Isabel Drost

Timeline including Apache internal deadlines:

http://cwiki.apache.org/confluence/display/COMDEVxSITE/GSoC

Mentors, please also click on the ranking link to the ranking explanation [1] 
for more information on how to rank student proposals.

Isabel

[1] 
http://cwiki.apache.org/confluence/display/COMDEVxSITE/Mentee+Ranking+Process

