Re: Mahout GSoC 2010: Association Mining
Neal, I think that this might well be a useful contribution to Mahout, but, if I am not mistaken, the deadline for student proposals for GSoC has just passed. That likely means that making this contribution an official GSoC project is not possible. I am sure that the Mahout community would welcome you as a contributor even without official Google status. If you would like to do this, go ahead and propose what you want to do (when JIRA comes back, or just by email discussion) and you can get started. On Fri, Apr 9, 2010 at 2:11 PM, Neal Clark wrote: > Hello, > > I just wanted to introduce myself. I am an MSc. Computer Science > student at the University of Victoria. My research over the past year > has been focused on developing and implementing an Apriori-based > frequent item-set mining algorithm for mining large data sets at low > support counts. > > > https://docs.google.com/Doc?docid=0ATkk_-6ZolXnZGZjeGYzNzNfOTBjcjJncGpkaA&hl=en > > The main finding of the above report is that support levels as low as > 0.001% on the webdocs (1.4GB) dataset can be efficiently calculated. > On a 100-core cluster all frequent k2 pairs can be calculated in > approximately 6 minutes. > > I currently have an optimized k2 Hadoop implementation and algorithm > for generating frequent pairs, and I am currently extending my work to > items of any length. The analysis of the extended approach will be > complete within the next two weeks. > > Would you be interested in moving forward with such an implementation > as a GSoC project? If so, any comments/feedback would be very much > appreciated. If you are interested I can create a proposal and submit > it to your issue tracker when it comes back online. > > Thanks, > > Neal. >
Mahout GSoC 2010: Association Mining
Hello, I just wanted to introduce myself. I am an MSc. Computer Science student at the University of Victoria. My research over the past year has been focused on developing and implementing an Apriori-based frequent item-set mining algorithm for mining large data sets at low support counts. https://docs.google.com/Doc?docid=0ATkk_-6ZolXnZGZjeGYzNzNfOTBjcjJncGpkaA&hl=en The main finding of the above report is that support levels as low as 0.001% on the webdocs (1.4GB) dataset can be efficiently calculated. On a 100-core cluster all frequent k2 pairs can be calculated in approximately 6 minutes. I currently have an optimized k2 Hadoop implementation and algorithm for generating frequent pairs, and I am currently extending my work to items of any length. The analysis of the extended approach will be complete within the next two weeks. Would you be interested in moving forward with such an implementation as a GSoC project? If so, any comments/feedback would be very much appreciated. If you are interested I can create a proposal and submit it to your issue tracker when it comes back online. Thanks, Neal.
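Neal's k2 code is not attached to this thread. As a rough, hypothetical sketch of the kind of Hadoop job being discussed (counting co-occurring item pairs and keeping those above a minimum support count), something along these lines would work; the class names, input format and configuration key are assumptions for illustration only:

{code}
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Emits every unordered item pair in a transaction line; the reducer keeps pairs above a minimum count. */
public class PairCountSketch {

  public static class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] items = value.toString().trim().split("\\s+");
      Arrays.sort(items); // canonical order so (a,b) and (b,a) count as the same pair
      for (int i = 0; i < items.length; i++) {
        for (int j = i + 1; j < items.length; j++) {
          ctx.write(new Text(items[i] + ',' + items[j]), ONE);
        }
      }
    }
  }

  public static class PairReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private int minSupport; // absolute support count threshold (assumed config key below)
    @Override
    protected void setup(Context ctx) {
      minSupport = ctx.getConfiguration().getInt("paircount.min.support", 2);
    }
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      if (sum >= minSupport) {
        ctx.write(pair, new IntWritable(sum));
      }
    }
  }
}
{code}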
Re: Mahout GSoC 2010 proposal: Association Mining
Ted, do you think you could give some good links to papers or other resources about the mentioned approaches? I would like to look at them after the weekend. As far as I can see, association mining (and the GUHA method in its original form) is not meant to be a predictive method but rather an exploratory method (although it has some kind of predictive power, this is not formally supported in the theoretical background). However, comparing association mining to other approaches could be a very interesting topic as well. Best regards, Lukas On Fri, Apr 9, 2010 at 8:03 PM, Ted Dunning wrote: > Lukas, > > The strongest alternative for this kind of application (and the normal > choice for large scale applications) is on-line gradient descent learning > with an L_1 or L_1 + L_2 regularization. The typical goal is to predict > some outcome (click or purchase or signup) from a variety of large > vocabulary features. As such, association mining is usually just a > pre-processing step before actual learning is applied. There is some > indication that an efficient sparse on-line gradient descent algorithm > applied to features and combinations could do just as well, especially if > the learning on combinations is applied in several passes. > > These on-line algorithms have the virtue of being extremely fast and, with > feature sharding, have substantial potential for parallel implementation. > > What do you think about these two methods? Can you compare them? > > On Fri, Apr 9, 2010 at 4:26 AM, Lukáš Vlček wrote: > > > One > > example would be analysis of click streams, where you can learn that those > > people visiting some negative comment on a product blog never enter the order > > form. Not saying this is the best example, but in general this is the essence of > > it. You simply need to take all possible values from the transaction into > > account even if a value is missing in the market basket > > > > The biggest challenge in implementing this would be the fact that the > > analysis has to deal with all the data (not just the most frequent > > patterns) and combinations. It is very resource expensive. > > >
[jira] Updated: (MAHOUT-371) [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop
[ https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Simon Just updated MAHOUT-371: -- Description: Proposal Title: [MAHOUT-371] Proposal to implement Distributed SVD++ Recommender using Hadoop Student Name: Richard Simon Just Student E-mail:i...@richardsimonjust.co.uk Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: During the Netflix Prize Challenge, one of the most popular forms of recommender algorithm was matrix factorisation, in particular Singular Value Decomposition (SVD). As such, this proposal looks to implement a distributed version of one of the most successful SVD-based recommender algorithms from the Netflix competition: the SVD++ algorithm. SVD++ improves upon other basic SVD algorithms by incorporating implicit feedback [1]. That is to say, it is able to take into account not just explicit user preferences, but also feedback such as, in the case of a company like Netflix, whether a movie has been rented. Implicit feedback assumes that the fact of there being some correlation between the user and the item is more important than whether the correlation is positive or negative. Implicit feedback accounts for an item having been rated, but not what the rating was. The implementation will include testing, in-depth documentation and a demo/tutorial. If there is time, I will also look at developing the algorithm into the timeSVD++ algorithm [2,3]. The timeSVD++ further improves the results of the SVD algorithm by taking into account temporal dynamics. Temporal dynamics addresses the way user preferences for items, and the way users rate items, can change over time. According to [2], the gains in accuracy from implementing timeSVD++ are significantly bigger than the gains going from SVD to SVD++. The overall project will provide three deliverables: 1. The basic framework for a distributed SVD-based recommender 2. A distributed SVD++ implementation and demo 3. A distributed timeSVD++ implementation Detailed Description: The SVD++ algorithm uses the principle of categorising users and items into factors, combined with regularisation and implicit feedback, to predict how much a user is likely to match an item. Factors are abstract categories created by comparing the data presented. Factor values are grades indicating how much each user/item is related to that category. For example, with the Netflix data a factor could loosely correspond to a movie genre, a director or a story line. The more factors used, the more detailed the categories are likely to become, and thus the more accurate the predictions are likely to become. Implicit feedback is based on the theory that a user having any sort of relationship to an item is more important than whether they rated it, or what rating they gave. The assumption is that even if a user does not like an item, or has not rated it, the very fact that they chose to have some interaction with it indicates something about their preferences. In the Netflix case this would be represented by whether a user has rated a movie or not; it could also take into account the user's rental history. As well as the actual implementation of the code, the project has two other deliverable focuses: the readability, documentation and testing of the code; and a full tutorial and demo of the code. It is felt that without these things the implementation is not really complete or fully usable. The recommender consists of 5 main parts. 
The job class that runs the code, an input conversion section, a training section, a re-trainer section and a prediction section. A brief overview of these sections follows. The SVD++ Classes: The Recommender Job class: This class is the foundation of the recommender and allows it to run on Hadoop by implementing the Tool interface through AbstractJob. This class will parse any user arguments and set up the jobs that will run the algorithm on Map/Reduce, much in the same way Mahout's other distributed recommenders do, such as RecommenderJob. Input Mapper/Reducer Classes: The Mapper will take the input data and convert it to key/value pairs in the form of a Hadoop Writable. The Reducer will take the Mapper Writables and create sparse vectors. This is in keeping with Mahout's ToItemPrefsMapper and ToUserVectorReducer. Training Mapper/Reducer Classes: This phase creates the factor matrices that will be used to create the predictions later. It does this by making a prediction of a known rating using the SVD, comparing it against the known rating, calculating the error and updating the factors accordingly. The Mapper will loop through the training of the factor matrices. The Reducer will collect the outputs of the Mapper to create the dense factor matrices. Re-trainer Mapper/Reducer Classes: This is a separate job that
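The proposal text above is cut off in the archive. For reference, the SVD++ prediction rule the proposal builds on (from the paper cited as [1]) can be sketched in plain Java; the class and parameter names below are illustrative and are not the proposal's actual classes:

{code}
/**
 * Sketch of the SVD++ prediction rule:
 * rHat(u,i) = mu + b_u + b_i + q_i . (p_u + |N(u)|^-1/2 * sum_{j in N(u)} y_j)
 * where N(u) is the set of items the user gave implicit feedback on.
 * All names here are assumptions for illustration only.
 */
public class SvdPlusPlusPredictionSketch {

  public static double predict(double mu, double userBias, double itemBias,
                               double[] itemFactors,        // q_i
                               double[] userFactors,        // p_u
                               double[][] implicitFactors,  // y_j for each item j in N(u)
                               int numFactors) {
    // p_u plus the normalised sum of implicit-feedback factor vectors
    double norm = implicitFactors.length > 0 ? 1.0 / Math.sqrt(implicitFactors.length) : 0.0;
    double[] combined = new double[numFactors];
    for (int f = 0; f < numFactors; f++) {
      double implicitSum = 0.0;
      for (double[] y : implicitFactors) {
        implicitSum += y[f];
      }
      combined[f] = userFactors[f] + norm * implicitSum;
    }
    // dot product with the item factor vector, plus the baseline terms
    double dot = 0.0;
    for (int f = 0; f < numFactors; f++) {
      dot += itemFactors[f] * combined[f];
    }
    return mu + userBias + itemBias + dot;
  }
}
{code}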
Re: Mahout GSoC 2010 proposal: Association Mining
Lukas, The strongest alternative for this kind of application (and the normal choice for large scale applications) is on-line gradient descent learning with an L_1 or L_1 + L_2 regularization. The typical goal is to predict some outcome (click or purchase or signup) from a variety of large vocabulary features. As such, association mining is usually just a pre-processing step before actual learning is applied. There is some indication that an efficient sparse on-line gradient descent algorithm applied to features and combinations could do just as well, especially if the learning on combinations is applied in several passes. These on-line algorithms have the virtue of being extremely fast and, with feature sharding, have substantial potential for parallel implementation. What do you think about these two methods? Can you compare them? On Fri, Apr 9, 2010 at 4:26 AM, Lukáš Vlček wrote: > One > example would be analysis of click streams, where you can learn that those > people visiting some negative comment on a product blog never enter the order > form. Not saying this is the best example, but in general this is the essence of > it. You simply need to take all possible values from the transaction into > account even if a value is missing in the market basket > > The biggest challenge in implementing this would be the fact that the > analysis has to deal with all the data (not just the most frequent > patterns) and combinations. It is very resource expensive. >
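A minimal sketch of the kind of on-line learner Ted describes, i.e. logistic regression trained by stochastic gradient descent over sparse features with combined L_1 + L_2 regularization; this is an illustration only, with assumed class names and learning-rate/regularization constants, not Mahout code:

{code}
import java.util.HashMap;
import java.util.Map;

/**
 * One-pass on-line SGD for logistic regression on sparse named features,
 * with an L2 shrinkage step and an L1 soft-threshold that drives small
 * weights exactly to zero (sparsity). Constants are arbitrary examples.
 */
public class SparseSgdSketch {

  private final Map<String, Double> weights = new HashMap<>();
  private final double learningRate = 0.1;
  private final double l1 = 1e-4;
  private final double l2 = 1e-4;

  /** features: feature name -> value; label: 1.0 for click/purchase/signup, 0.0 otherwise. */
  public void train(Map<String, Double> features, double label) {
    double error = label - predict(features);
    for (Map.Entry<String, Double> e : features.entrySet()) {
      double w = weights.getOrDefault(e.getKey(), 0.0);
      // gradient step plus L2 shrinkage on the touched weight only
      w += learningRate * (error * e.getValue() - l2 * w);
      // L1 soft-threshold: small weights collapse to exactly zero
      if (Math.abs(w) <= learningRate * l1) {
        w = 0.0;
      } else {
        w -= Math.signum(w) * learningRate * l1;
      }
      weights.put(e.getKey(), w);
    }
  }

  public double predict(Map<String, Double> features) {
    double sum = 0.0;
    for (Map.Entry<String, Double> e : features.entrySet()) {
      sum += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
    }
    return 1.0 / (1.0 + Math.exp(-sum)); // logistic link
  }
}
{code}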
[jira] Created: (MAHOUT-374) GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342)
GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342) - Key: MAHOUT-374 URL: https://issues.apache.org/jira/browse/MAHOUT-374 Project: Mahout Issue Type: New Feature Components: Classification Environment: Ubuntu 9.10 Reporter: Yinghua Hu * Introduction In recent years, multicore computers have become mainstream. However, the potential of multicore has not been fully exploited for machine learning because of the lack of a good programming framework for multicore. Recently, Chu et al. [CHU07] adapted Google's map-reduce paradigm [DEA04] and implemented 10 different machine learning algorithms on a multicore processor. Their results show almost a twofold speedup on average with a dual processor. Their work has inspired a lot of interesting projects under Mahout at the Apache Software Foundation. An artificial neural network (ANN) is a supervised learning tool inspired by the biological nervous system. It can capture and represent complex input/output relationships. The neural network has the capability to represent both linear and non-linear relationships by learning the relationships directly from the data being modeled. Artificial neural networks have been widely applied for classification, data processing and function approximation. We propose to implement an artificial neural network with back-propagation under the map-reduce framework. Successful delivery should include a fast neural network customized for multicore computers. * Methodology In this section, we briefly introduce map-reduce and the back-propagation neural network. - Map-reduce Map-reduce [DEA04] is a programming model developed by Google. It gets its name from the map and reduce primitives in functional languages such as Lisp. Its inventors realized that most of their distributed computations involve applying a map operation, which computes a set of intermediate key/value pairs for each logical record in the input, followed by a reduce operation, which is applied to all the values that share the same key in order to combine the derived data appropriately. The reduce process allows users to handle value lists too large to fit in memory. Map-reduce makes it possible to have a simple interface enabling automatic parallelization and distribution of large-scale computation. The programming model can be applied on a multicore personal computer as well as on large clusters of commodity PCs. Here is an example of map-reduce from [DEA04]. To count the number of occurrences of each word in a large collection of documents, the user can write pseudocode like the following:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function above emits each word and its associated count (one in this example). The reduce function sums together all counts emitted for a particular word. The map-reduce model has been successfully applied to a large variety of problems, such as Google's web search service, sorting, data mining, etc. It helps to reduce the amount of data sent across the network. In addition, its ease of use makes parallel and distributed computing achievable even for inexperienced programmers. [CHU07] provide a multicore map-reduce framework that is capable of implementing algorithms fitting the Statistical Query Model. 
The neural network is one of the ten machine learning algorithms fitting that requirement. - Neural Network The neural network is one of many modern technologies inspired by biology. It is a model which can perform simple computations such as linear regression as well as very complex nonlinear calculations. Without doubt, the neural network is one of the most popular machine learning methodologies. The simplest artificial neural network is a single-layer perceptron, as shown in Figure 1. It contains a single layer of adaptive weights. The output is calculated by applying an activation function f to the weighted sum of the inputs: http://www.cs.ucf.edu/~yhu/gsoc/formula1.JPG (1) http://www.cs.ucf.edu/~yhu/gsoc/fig1.JPG Figure 1 - Single-layer Perceptron The limitation of a single-layer perceptron is that it can only model linear relationships between an output and the inputs. This is overcome by multi-layer perceptrons. Multi-layer perceptrons contain several layers of adaptive weights. In Figure 2, a hidden layer is added to the single-layer perceptron of Figure 1 to form a two-layer perceptron. If there is an activation function g for the first layer of adaptive weights
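The proposal text is truncated above, and the referenced formula and figures are external images. As a rough illustration of the single-layer perceptron it describes (an activation function applied to the weighted sum of inputs, trained by an error-driven update), here is a hedged sketch in plain Java; all names and the choice of sigmoid activation are assumptions:

{code}
/**
 * Single-layer perceptron sketch: output = f(w . x + bias) with a logistic
 * sigmoid activation, trained by the delta rule (which back-propagation
 * generalises to multiple layers). Illustrative only.
 */
public class PerceptronSketch {

  private final double[] weights;
  private double bias;

  public PerceptronSketch(int numInputs) {
    this.weights = new double[numInputs];
  }

  /** Activation function f: the logistic sigmoid. */
  private static double sigmoid(double a) {
    return 1.0 / (1.0 + Math.exp(-a));
  }

  /** Output is the activation applied to the weighted sum of the inputs. */
  public double output(double[] inputs) {
    double sum = bias;
    for (int i = 0; i < weights.length; i++) {
      sum += weights[i] * inputs[i];
    }
    return sigmoid(sum);
  }

  /** One error-driven weight update toward a known target value. */
  public void train(double[] inputs, double target, double learningRate) {
    double out = output(inputs);
    double delta = (target - out) * out * (1.0 - out); // sigmoid derivative is out * (1 - out)
    for (int i = 0; i < weights.length; i++) {
      weights[i] += learningRate * delta * inputs[i];
    }
    bias += learningRate * delta;
  }
}
{code}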
GSOC Create Sql adapters proposal
Hi, Create adapters for MySQL and NoSQL (HBase, Cassandra) to access data for all the algorithms to use; Necati Batur ; necatiba...@gmail.com Mahout / Mahout - 332 : Assigned Mentor is Robin Anil Proposal Abstract: It would be useful to use Thrift as the protocol with the NoSQL systems, as opposed to their native APIs, so that a nice abstraction could be made for all the NoSQL systems in general, with specific Thrift client implementations added to maximize code re-use. Even if someone were to make the port for one NoSQL client, having the demarcation would help others to pick it up and port it. Detailed Description: The data adapters for the higher level languages will require a good command of data structures and some knowledge of finite mathematics, areas I am confident in. The code given in the svn repository seems to need some improvements and also documentation. Briefly, I would do the following for this project: 1. Understand the underlying maths for the adapters 2. Determine the data structures that would be used for the adapters 3. Implement the necessary methods to handle creation of these structures 4. Write the test cases we would need to check whether our code covers everything required by data retrieval operations 5. Iterate on the code to make the algorithms more robust 6. Document the overall project to fit this particular piece into the overall scope Additional Information: First of all, I am very excited to join a program like GSoC and, most importantly, to work on a big open source project like Apache. I am looking for good collaboration and new challenges in software development. Information management issues especially sound great to me. I am confident working with new technologies. I took the Data Structures I and II courses at university, so I am comfortable with data structures. Most importantly, I am interested in databases. From my software engineering course experience I know how to work on a project with iterative development and timelining
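The proposal does not spell out the adapter abstraction, so the following is purely a hypothetical sketch of the kind of backend-neutral interface it hints at, with storage-specific implementations (a MySQL JDBC adapter, or Thrift clients for HBase and Cassandra) plugged in behind it; none of these names exist in Mahout:

{code}
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

/**
 * Hypothetical backend-neutral data source: algorithms read (user, item, value)
 * observations through this interface, while concrete adapters handle the
 * storage-specific details (JDBC for MySQL, Thrift clients for HBase/Cassandra).
 */
public interface PreferenceSource extends Closeable {

  /** A single (user, item, value) observation. */
  final class Preference {
    public final long userId;
    public final long itemId;
    public final float value;
    public Preference(long userId, long itemId, float value) {
      this.userId = userId;
      this.itemId = itemId;
      this.value = value;
    }
  }

  /** Streams all preferences so large tables never need to fit in memory. */
  Iterator<Preference> iterator() throws IOException;
}
{code}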
[jira] Created: (MAHOUT-373) VectorDumper/VectorHelper doesn't dump values when dictionary is present
VectorDumper/VectorHelper doesn't dump values when dictionary is present Key: MAHOUT-373 URL: https://issues.apache.org/jira/browse/MAHOUT-373 Project: Mahout Issue Type: Improvement Components: Utils Affects Versions: 0.3 Reporter: Drew Farris Priority: Minor Dumping term vectors with a dictionary using: {{mahout vectordump -s vector-output/chunk-0 -dt sequencefile -d dictionary-output}} gives me output like the following with no values, just the indexes expanded into their dictionary entries: {code} Name: 0001-11055 elts: {513:discard, 7199:empty,... {code} While dumping the same vector without a dictionary using {{mahout vectordump -s vector-output/chunk-0}} gives me output that includes indexes and values: {code} Name: 0001-11055 elts: {513:1.0, 7199:1.0... {code} Would it make sense for the dictionary based output to include values as well? Anyone opposed to modifying VectorHelper.vectorToString(Vector, String[]) to do so? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
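One possible shape for the change being discussed, assuming the Vector.iterateNonZero() iterator and Element accessors available in Mahout at the time; this is a sketch, not the committed fix:

{code}
import java.util.Iterator;
import org.apache.mahout.math.Vector;

/** Sketch: emit "term:value" for each non-zero element instead of only the expanded term. */
public final class VectorDumpSketch {

  private VectorDumpSketch() { }

  public static String vectorToString(Vector vector, String[] dictionary) {
    StringBuilder out = new StringBuilder("elts: {");
    Iterator<Vector.Element> it = vector.iterateNonZero();
    boolean first = true;
    while (it.hasNext()) {
      Vector.Element e = it.next();
      if (!first) {
        out.append(", ");
      }
      first = false;
      // fall back to the raw index when no dictionary is supplied
      String term = dictionary != null ? dictionary[e.index()] : String.valueOf(e.index());
      out.append(term).append(':').append(e.get());
    }
    return out.append('}').toString();
  }
}
{code}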
[jira] Commented: (MAHOUT-372) Partitioning Collaborative Filtering Job into Maps and Reduces
[ https://issues.apache.org/jira/browse/MAHOUT-372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855381#action_12855381 ] Kris Jack commented on MAHOUT-372: -- Thanks for your reply. I'll run it using the command line parameters and hopefully get it working faster. Thanks for letting me know also about the other mailing list, I'll use that in the future for such questions. > Partitioning Collaborative Filtering Job into Maps and Reduces > -- > > Key: MAHOUT-372 > URL: https://issues.apache.org/jira/browse/MAHOUT-372 > Project: Mahout > Issue Type: Question > Components: Collaborative Filtering >Affects Versions: 0.4 > Environment: Ubuntu Koala >Reporter: Kris Jack >Assignee: Sean Owen > Fix For: 0.4 > > > I am running the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob main > on my hadoop cluster and it partitions the job in 2 although I have more than > 2 nodes available. I was reading that the partitioning could be changed by > setting the JobConf's conf.setNumMapTasks(int num) and > conf.setNumReduceTasks(int num). > Would I be right in assuming that this would speed up the processing by > increasing these, say to 4)? Can this code be partitioned into many > reducers? If so, would setting them in the protected AbstractJob::JobConf > prepareJobConf() function be appropriate? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-372) Partitioning Collaborative Filtering Job into Maps and Reduces
[ https://issues.apache.org/jira/browse/MAHOUT-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-372. -- Resolution: Fixed Fix Version/s: 0.4 Assignee: Sean Owen Yes, sure there's no particular limit to the number of mappers or reducers. These are Hadoop params, which you can set on the command line with, for example: -Dmapred.map.tasks=10 -Dmapred.reduce.tasks=10 Reopen if that doesn't quite answer the question. (We can also discuss on mahout-u...@apache.org, perhaps, if this isn't necessarily a bug or enhancement request.) > Partitioning Collaborative Filtering Job into Maps and Reduces > -- > > Key: MAHOUT-372 > URL: https://issues.apache.org/jira/browse/MAHOUT-372 > Project: Mahout > Issue Type: Question > Components: Collaborative Filtering >Affects Versions: 0.4 > Environment: Ubuntu Koala >Reporter: Kris Jack >Assignee: Sean Owen > Fix For: 0.4 > > > I am running the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob main > on my hadoop cluster and it partitions the job in 2 although I have more than > 2 nodes available. I was reading that the partitioning could be changed by > setting the JobConf's conf.setNumMapTasks(int num) and > conf.setNumReduceTasks(int num). > Would I be right in assuming that this would speed up the processing by > increasing these, say to 4)? Can this code be partitioned into many > reducers? If so, would setting them in the protected AbstractJob::JobConf > prepareJobConf() function be appropriate? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
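For completeness, the programmatic equivalent of those flags looks roughly like the sketch below: ToolRunner's generic options parser is what turns -Dmapred.map.tasks and -Dmapred.reduce.tasks into Configuration properties before the job runs. The input/output arguments shown are example values only, not taken from this thread:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

/** Sketch: set the task-count properties on the Configuration before submitting the job. */
public class RunRecommenderJobSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.map.tasks", "10");    // a hint for the number of map tasks
    conf.set("mapred.reduce.tasks", "10"); // the number of reduce tasks
    int exitCode = ToolRunner.run(conf, new RecommenderJob(), new String[] {
        "--input", "input/prefs.csv",      // example paths, not from the thread
        "--output", "output"
    });
    System.exit(exitCode);
  }
}
{code}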
Re: Mahout GSoC 2010 proposal: Association Mining
Hi Lukáš, It would be great if you could participate in GSoC; there is still time left, and you still have your proposal in the GSoC system. Take your time to decide, but if you choose not to participate, please do remove the application from the SoC website. A wiki page for association mining would be a good start. The pattern mining package needs to grow beyond just FPGrowth. Resource-intensive operations on large datasets using Hadoop are what Mahout should do best. I can help around the code as much as you like to make it more generic and suitable for association mining. Regards Robin On Fri, Apr 9, 2010 at 4:56 PM, Lukáš Vlček wrote: > Robin, > > I think it does not make sense for me to catch with GSoC timeline now as I > am quite busy with other stuff. However, I will develop the proposal for > Association Mining (or GUHA if you like) and keep this discussion going on. > I am really interested in contributing some implementation to Mahout but as > of now the GCoS timeline is not of any help to me. > > Let me look at this in detail and I will get back to mahout community with > more details. > > As for the use cases for association mining there can be find a lot > examples > in literature. When it comes to missing or negative attributes of the data > (of the transaction) I think there can be a lot of examples as well. One > example would be analysis of click stream, where you can learn that those > people visiting some negative comment on product blog never enter order > form. Not saying this is best example but in general this is the essence of > it. You simply need to take all possible values from the transaction into > account even if it is missing in the market basket. I also remember that > Simpson's paradox is often referred in connection with this issue. As for > GUHA the power of it is that it has well developed theory background. This > for example means that it stated formalized framework for various analysis > that have probably origin in psychology and similar kind of soft-science > and > "association-like-functions" between data attributes can be expressed using > 4ft table members and user thresholds. > > The biggest challenge in implementing this would be the fact that the > analysis have to deal with all the data (not just the most frequent > patterns) and combinations. It is very resource expensive. > > I am thinking of starting a wiki page for Mahout about association mining > ... this could help, what do you think? > > Regards, > Lukas > > On Tue, Apr 6, 2010 at 12:01 AM, Robin Anil wrote: > > > PS: Current TopK FPGrowth is pretty tightly coupled. But it can be easily > > refactored out or even a vanilla implementation of FPGrowth is not so > > difficult to re-create by re-using the existing methods. > > > > Robin > > > > > > On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil wrote: > > > > > Hi Lukas, > > >Sorry for being late to getting back to you on this. > > > Association rule mining is a great addition to FPGrowth. I am not sure > I > > > understand GUHA method well but then again I understood Ted's LLR after > > some > > > deep reading. Could you put up an interesting example to help us > > understand > > > this method. Maybe starting from a transaction of shopping cart item ? > A > > > great demo is big plus for a GSOC project. > > > > > > Robin > > > > > > > > > On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček > >wrote: > > > > > >> Hello, > > >> > > >> I would like to apply for Mahout GSoC 2010. 
My proposal is to > implement > > >> Association Mining algorithm utilizing existing PFPGrowth > implementation > > ( > > >> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html). > > >> > > >> As for the Assoiciation Mining I would like to implement a very > general > > >> algorithm based on old and established method called GUHA [1]. Short > and > > >> informative description of GUHA method can be found here [2]. Very > > >> exhaustive theoretical foundation can be found here [3]. Since the > GUHA > > >> method has been developing from 60's and is very rich now and given > the > > >> limited time during GSoC program it would be wise to reduce scope of > > >> initial > > >> implementation. > > >> > > >> There are two main topics that I would like to focus on during GSoC: > > >> > > >> 1) Enhancing and generalization of PFPGworth: > > >> Frequent pattern mining is usually part of each association maining > > task. > > >> In > > >> Mahout there is existing implementation of PFPGrowth algorithm. I > would > > >> like > > >> to utilize this implementaion but it would be necessary to enhance or > > >> generalize it first. The goal here would be to keep current > > functionality > > >> and performance of PFPGrowth but allow more generalized input/output > > data > > >> and conditions. We will need to get frequencies of very rare patterns > > thus > > >> if it will show up that mining only top K is limiting factor then we > > would > > >> need to all
Re: Mahout GSoC 2010 proposal: Association Mining
Robin, I think it does not make sense for me to try to catch up with the GSoC timeline now, as I am quite busy with other stuff. However, I will develop the proposal for Association Mining (or GUHA if you like) and keep this discussion going. I am really interested in contributing some implementation to Mahout, but as of now the GSoC timeline is not of any help to me. Let me look at this in detail and I will get back to the Mahout community with more details. As for the use cases for association mining, many examples can be found in the literature. When it comes to missing or negative attributes of the data (of the transaction), I think there are a lot of examples as well. One example would be analysis of click streams, where you can learn that those people visiting some negative comment on a product blog never enter the order form. Not saying this is the best example, but in general this is the essence of it. You simply need to take all possible values from the transaction into account even if a value is missing in the market basket. I also remember that Simpson's paradox is often referred to in connection with this issue. As for GUHA, its power is that it has a well-developed theoretical background. This, for example, means that it states a formalized framework for various analyses that probably have their origin in psychology and similar kinds of soft science, and "association-like functions" between data attributes can be expressed using 4ft table members and user thresholds. The biggest challenge in implementing this would be the fact that the analysis has to deal with all the data (not just the most frequent patterns) and combinations. It is very resource expensive. I am thinking of starting a wiki page for Mahout about association mining ... this could help, what do you think? Regards, Lukas On Tue, Apr 6, 2010 at 12:01 AM, Robin Anil wrote: > PS: Current TopK FPGrowth is pretty tightly coupled. But it can be easily > refactored out or even a vanilla implementation of FPGrowth is not so > difficult to re-create by re-using the existing methods. > > Robin > > > On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil wrote: > > > Hi Lukas, > >Sorry for being late to getting back to you on this. > > Association rule mining is a great addition to FPGrowth. I am not sure I > > understand GUHA method well but then again I understood Ted's LLR after > some > > deep reading. Could you put up an interesting example to help us > understand > > this method. Maybe starting from a transaction of shopping cart item ? A > > great demo is big plus for a GSOC project. > > > > Robin > > > > > > On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček >wrote: > > > >> Hello, > >> > >> I would like to apply for Mahout GSoC 2010. My proposal is to implement > >> Association Mining algorithm utilizing existing PFPGrowth implementation > ( > >> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html). > >> > >> As for the Assoiciation Mining I would like to implement a very general > >> algorithm based on old and established method called GUHA [1]. Short and > >> informative description of GUHA method can be found here [2]. Very > >> exhaustive theoretical foundation can be found here [3]. Since the GUHA > >> method has been developing from 60's and is very rich now and given the > >> limited time during GSoC program it would be wise to reduce scope of > >> initial > >> implementation. 
> >> > >> There are two main topics that I would like to focus on during GSoC: > >> > >> 1) Enhancing and generalization of PFPGworth: > >> Frequent pattern mining is usually part of each association maining > task. > >> In > >> Mahout there is existing implementation of PFPGrowth algorithm. I would > >> like > >> to utilize this implementaion but it would be necessary to enhance or > >> generalize it first. The goal here would be to keep current > functionality > >> and performance of PFPGrowth but allow more generalized input/output > data > >> and conditions. We will need to get frequencies of very rare patterns > thus > >> if it will show up that mining only top K is limiting factor then we > would > >> need to allow relaxation of this approach. Also we will need to take > >> account > >> on negative features (those missing in individual transaction). Negative > >> features can be either directly supported by implementation of FP > >> algorithm > >> or it can be coded at the feature level (decision about what is better > >> approach will be part of the GSoC program). It should be noted that for > >> the > >> GSoC program we will narrow scope of association mining antecedents and > >> succedents to conjunctions of data features only. > >> > >> 2) API for custom association mining functions based on 4ft table: > >> Association mining in GUHA sense means testing hypothesis on four-fold > >> table > >> (4ft) [see 2. item 5]. There has been proposed a lot of association > >> functions for GUHA, some of them are based on statistical tests (for > >> example > >> Fischer test, C
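The quoted proposal is cut off above. To make the four-fold (4ft) table idea in it concrete, here is a small illustrative sketch with one example GUHA quantifier (founded implication); the class and method names are assumptions, not part of any existing Mahout code:

{code}
/**
 * Four-fold (4ft) contingency table for an antecedent/succedent pair, with one
 * example GUHA quantifier: founded implication, i.e. a/(a+b) >= p and a >= base.
 */
public class FourFoldTable {

  private final long a; // transactions satisfying antecedent AND succedent
  private final long b; // antecedent but NOT succedent
  private final long c; // succedent but NOT antecedent
  private final long d; // neither antecedent nor succedent

  public FourFoldTable(long a, long b, long c, long d) {
    this.a = a;
    this.b = b;
    this.c = c;
    this.d = d;
  }

  /** Confidence of antecedent -> succedent. */
  public double confidence() {
    return a + b == 0 ? 0.0 : (double) a / (a + b);
  }

  /** Example quantifier: founded implication with confidence threshold p and minimum support base. */
  public boolean foundedImplication(double p, long base) {
    return a >= base && confidence() >= p;
  }

  /** Total number of transactions covered by the table. */
  public long total() {
    return a + b + c + d;
  }
}
{code}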
[jira] Created: (MAHOUT-372) Partitioning Collaborative Filtering Job into Maps and Reduces
Partitioning Collaborative Filtering Job into Maps and Reduces -- Key: MAHOUT-372 URL: https://issues.apache.org/jira/browse/MAHOUT-372 Project: Mahout Issue Type: Question Components: Collaborative Filtering Affects Versions: 0.4 Environment: Ubuntu Koala Reporter: Kris Jack I am running the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob main on my Hadoop cluster and it partitions the job into 2 although I have more than 2 nodes available. I was reading that the partitioning could be changed by setting the JobConf's conf.setNumMapTasks(int num) and conf.setNumReduceTasks(int num). Would I be right in assuming that this would speed up the processing by increasing these (say, to 4)? Can this code be partitioned into many reducers? If so, would setting them in the protected AbstractJob::JobConf prepareJobConf() function be appropriate? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [GSOC] 2010 Timelines
Timeline including Apache internal deadlines: http://cwiki.apache.org/confluence/display/COMDEVxSITE/GSoC Mentors, please also click on the ranking link to the ranking explanation [1] for more information on how to rank student proposals. Isabel [1] http://cwiki.apache.org/confluence/display/COMDEVxSITE/Mentee+Ranking+Process