[jira] [Commented] (MAHOUT-1421) Adapter package for all mahout tools
[ https://issues.apache.org/jira/browse/MAHOUT-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909209#comment-13909209 ]

jay vyas commented on MAHOUT-1421:
----------------------------------

From the mailing list: "Great idea. Hard to do well. Would it be possible for you to try to build a picture of all the pieces that need to be connected before you start building connectors and converters?" -- ted dunning.

Sure, Ted. Will see if I can visualize all the different input types.

> Adapter package for all mahout tools
> ------------------------------------
>
>                 Key: MAHOUT-1421
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1421
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: jay vyas
>
> Hi mahout. I'd like to create an umbrella JIRA for allowing more runtime
> flexibility for reading different types of input formats for all mahout
> tasks.
> Specifically, I'd like to start with the FreeTextRecommenderAdapter, which
> typically requires:
> 1) Hashing text entries into numbers
> 2) Saving the large transformed file on disk
> 3) Feeding it into the classifier
> Instead, we could build adapters into the classifier itself, so that the user
> 1) Specifies the input file to the recommender
> 2) Specifies a transformation class which converts each record of input to the
> 3-column recommender format
> 3) Runs the internal mahout recommender directly against the data
> And thus the user could easily run mahout against existing data without
> having to munge it too much.
> This package might be called something like "org.apache.mahout.adapters", and
> would over time provide flexible adapters to the core mahout algorithm
> implementations, so that folks wouldn't have to worry so much about
> vectors/csv transformers/etc...
> Any thoughts on this? If the feedback is positive, I can submit an initial
> patch to get things started.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
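The adapter idea in the issue description above (convert each input record to the 3-column recommender format on the fly instead of writing a transformed file to disk) can be illustrated with a minimal sketch. This is not Mahout code and not the proposed patch: `IdIndex`, `adapt`, and `make_csv_transform` are hypothetical names, the `user,item,rating` record layout is an assumption, and a first-seen index is swapped in for the hashing step the issue mentions so that the IDs are easy to read.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

# (user_id, item_id, preference): the 3-column recommender format.
Triple = Tuple[int, int, float]


class IdIndex:
    """Assigns an integer ID to each distinct text key, in first-seen order.

    Hypothetical stand-in for the 'hash text entries into numbers' step.
    """

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def id_for(self, key: str) -> int:
        # setdefault returns the existing ID if the key was seen before,
        # otherwise stores and returns the next free ID.
        return self._ids.setdefault(key, len(self._ids))


def adapt(records: Iterable[str],
          transform: Callable[[str], Triple]) -> Iterator[Triple]:
    """Lazily convert raw records to triples, with no intermediate file."""
    for record in records:
        yield transform(record)


def make_csv_transform(users: IdIndex, items: IdIndex) -> Callable[[str], Triple]:
    """Build a transformation for assumed 'user,item,rating' text records."""
    def transform(record: str) -> Triple:
        user, item, rating = record.strip().split(",")
        return users.id_for(user), items.id_for(item), float(rating)
    return transform


raw = ["alice,matrix,5", "bob,matrix,3", "alice,inception,4"]
triples = list(adapt(raw, make_csv_transform(IdIndex(), IdIndex())))
print(triples)
```

Because `adapt` is a generator, a recommender could consume the triples one at a time, which is the point of the proposal: the user supplies only the input and the transformation.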
[jira] [Commented] (MAHOUT-1422) Make a version of RSJ that uses two inputs
[ https://issues.apache.org/jira/browse/MAHOUT-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908579#comment-13908579 ]

Pat Ferrel commented on MAHOUT-1422:
------------------------------------

Another job needs to be created for the cross-recommender. That job could take any number of inputs, but I believe it would use the XRSJ in pairs internally. I did a prototype that can use 3 actions on the same items by the same users. It does matrix multiply for cooccurrence similarity in pairs as described above. I haven't entered that into Jira yet.

> Make a version of RSJ that uses two inputs
> ------------------------------------------
>
>                 Key: MAHOUT-1422
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1422
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 1.0
>         Environment: mapreduce
>            Reporter: Pat Ferrel
>              Labels: recommender, similarity
>             Fix For: 1.0
>
> Currently the RowSimilarityJob uses a similarity measure to pairwise compare
> all rows in a DistributedRowMatrix.
> For many applications including a cross-action recommender we need something
> like RSJ that takes two DRMs and compares matching rows of each. The output
> would be the same form as RSJ, and ideally would allow the use of any
> similarity type already defined--especially LLR.
> There are two implementations of a Cross-Recommender, one based on the Mahout
> RecommenderJob, and another based on Solr, that can immediately benefit from
> a Cross-RSJ.
> A modification of the matrix multiply job may be a place to start since the
> current RSJ seems to rely heavily on self-similarity.
[jira] [Commented] (MAHOUT-1422) Make a version of RSJ that uses two inputs
[ https://issues.apache.org/jira/browse/MAHOUT-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908548#comment-13908548 ]

Pat Ferrel commented on MAHOUT-1422:
------------------------------------

Yes and no. What you describe makes sense but can be done with the proposed implementation. The algorithm, when applied to a recommender, works in pairs of inputs. In practice you would create a recommender for purchases, one for purchases from web-views and purchases, and one for purchases from mobile-views and purchases. Then the results/recs are combined linearly. Or, using Solr, each of the pairs creates a field that is indexed separately. The query would be purchases against the purchase self-similarity field, web-views against the web-view/purchase similarity field, and mobile-views against the mobile-view/purchase similarity field. This allows each type of history to add information to the query. In other words, you can do what you are talking about as a combination of pairs.

However, in other contexts I believe the cross-similarity can be applied across any number of inputs as long as the row space is the same, and there are some interesting applications of this. But I think you can get to the same place by chaining the pairwise jobs. So unless there is some benefit in the implementation to perform the entire chain at once...
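The "combined linearly" step in Pat Ferrel's comment above (one recommender per pair of inputs, with the results merged afterwards) can be sketched as a weighted sum of per-recommender scores. This is only an illustration: the function name `combine_linearly`, the item names, the scores, and the weights are all assumptions and do not come from the thread or from Mahout.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def combine_linearly(scored: List[Tuple[Dict[str, float], float]]) -> Dict[str, float]:
    """Weighted sum of item scores from several pairwise recommenders.

    Each entry is (scores_by_item, weight). An item missing from one
    recommender simply contributes nothing from that recommender.
    """
    combined: Dict[str, float] = defaultdict(float)
    for scores, weight in scored:
        for item, score in scores.items():
            combined[item] += weight * score
    return dict(combined)


# Hypothetical outputs of three pairwise recommenders for one user.
purchase_recs = {"item_a": 0.9, "item_b": 0.4}      # purchases vs. purchases
web_view_recs = {"item_b": 0.8, "item_c": 0.5}      # web-views vs. purchases
mobile_view_recs = {"item_c": 0.6}                  # mobile-views vs. purchases

# The weights are illustrative only; in practice they would need tuning.
combined = combine_linearly([
    (purchase_recs, 1.0),
    (web_view_recs, 0.5),
    (mobile_view_recs, 0.25),
])
ranked = sorted(combined, key=combined.get, reverse=True)
print(ranked)
```

Each additional action type only adds one more `(scores, weight)` entry, which matches the comment's point that more than two inputs can be handled as a combination of pairs.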
[jira] [Commented] (MAHOUT-1422) Make a version of RSJ that uses two inputs
[ https://issues.apache.org/jira/browse/MAHOUT-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908535#comment-13908535 ]

Andrew Musselman commented on MAHOUT-1422:
------------------------------------------

Is there a reason to limit this to two inputs if we could add more, e.g. customer by purchase, customer by page view, and customer by mobile page view?
[jira] [Updated] (MAHOUT-1422) Make a version of RSJ that uses two inputs
[ https://issues.apache.org/jira/browse/MAHOUT-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pat Ferrel updated MAHOUT-1422:
-------------------------------

Description:
Currently the RowSimilarityJob uses a similarity measure to pairwise compare all rows in a DistributedRowMatrix.
For many applications including a cross-action recommender we need something like RSJ that takes two DRMs and compares matching rows of each. The output would be the same form as RSJ, and ideally would allow the use of any similarity type already defined--especially LLR.
There are two implementations of a Cross-Recommender, one based on the Mahout RecommenderJob, and another based on Solr, that can immediately benefit from a Cross-RSJ.
A modification of the matrix multiply job may be a place to start since the current RSJ seems to rely heavily on self-similarity.

was:
Currently the RowSimilarityJob uses a similarity measure to pairwise compare all row in a DistributedRowMatrix.
For many applications including a cross-action recommender we need something like RSJ that takes two DRMs and compares matching rows of each. The output would be the same form as RSJ, and ideally would allow the use of any similarity type already defined--especially LLR.
There are two implementations of a Cross-Recommender, one based on the Mahout RecommenderJob, and another based on Solr, that can immediately benefit from a Cross-RSJ.
A modification of the matrix multiply job may be a place to start since the current RSJ seems to rely heavily on self-similarity.
[jira] [Commented] (MAHOUT-1422) Make a version of RSJ that uses two inputs
[ https://issues.apache.org/jira/browse/MAHOUT-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908487#comment-13908487 ]

Pat Ferrel commented on MAHOUT-1422:
------------------------------------

This job should recreate AB' for cooccurrence similarity. So the row space (number of dimensions) of both input matrices must be the same. There are some applications where the column spaces of the two are not the same, and this should be allowed, as it is in the matrix multiply special case. The column id spaces should not be interpreted as representing the same things, whereas the row id spaces are identical.

The options for the CrossRowSimilarityJob can be a superset of the current RSJ with the addition of a second input matrix. --input will need to be --input1 and --input2, and --numberOfColumns will need to be --numberOfColumns1 and --numberOfColumns2, or some such.

See Ted for further description of these asymmetric applications, where the two column spaces are not the same. Also note that the current job makes the assumption of a symmetric DRM as output, and this will not be the case for an XRSJ.
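The constraint in Pat Ferrel's comment above (identical row spaces, possibly different column spaces, non-symmetric output) can be made concrete with a small cooccurrence sketch. One interpretation is assumed here and is not confirmed by the thread: with shared rows and differing column counts, the product that is dimensionally defined is the transpose-times form A'B, so that is what the hypothetical `cross_cooccurrence` function computes. It produces raw counts only; no similarity measure such as LLR is applied, and the matrices are made-up data.

```python
from typing import List

Matrix = List[List[int]]


def cross_cooccurrence(a: Matrix, b: Matrix) -> Matrix:
    """Compute A'B for two non-empty matrices that share a row space.

    For 0/1 inputs, entry [i][j] counts the rows in which column i of A
    and column j of B are both set.
    """
    if len(a) != len(b):
        raise ValueError("both inputs must have the same number of rows")
    n_cols_a, n_cols_b = len(a[0]), len(b[0])
    result = [[0] * n_cols_b for _ in range(n_cols_a)]
    # Walk matching rows of the two inputs together.
    for row_a, row_b in zip(a, b):
        for i, x in enumerate(row_a):
            if x:
                for j, y in enumerate(row_b):
                    result[i][j] += x * y
    return result


# Rows are the same three users in both matrices (assumed data).
purchases = [[1, 0],      # 2 columns: purchasable items
             [1, 1],
             [0, 1]]
views = [[1, 1, 0],       # 3 columns: viewed pages
         [0, 1, 0],
         [0, 1, 1]]

cross = cross_cooccurrence(purchases, views)
print(cross)  # 2 x 3, so it cannot be symmetric
```

The 2 x 3 result shows why a job that assumes a symmetric output cannot be reused unchanged: the rows and columns of the result index different things.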
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908443#comment-13908443 ]

Gokhan Capan commented on MAHOUT-1329:
--------------------------------------

Good news: I tried that too, on a 2.2.0 cluster. seqdir, seq2sparse, and kmeans worked without a problem.

I'm gonna wait till Monday to commit this, in case folks want to verify that it works.

> Mahout for hadoop 2
> -------------------
>
>                 Key: MAHOUT-1329
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
>             Project: Mahout
>          Issue Type: Task
>          Components: build
>    Affects Versions: 0.9
>            Reporter: Sergey Svinarchuk
>            Assignee: Gokhan Capan
>              Labels: patch
>             Fix For: 1.0
>
>         Attachments: 1329-2.patch, 1329-3.patch, 1329.patch
>
> Update mahout to work with hadoop 2.X, targeting this for Mahout 1.0.
[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908126#comment-13908126 ]

Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:59 AM:
---------------------------------------------------------------

Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster? [EDIT: Sorry, I missed that you mentioned you ran the examples. Great then.]

was (Author: gokhancapan):
Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster?
[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907480#comment-13907480 ]

Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:52 AM:
---------------------------------------------------------------

Sergey, I modified your patch and produced a new version. Looking into the dependency tree, it seems it builds against the correct hadoop version. (This may seem irrelevant when looking at the patch, but I had to set argLine to -Xmx1024m so that the unit tests would not fail because of an OOM.)

for hadoop version 1.2.1: mvn clean package
for hadoop version 2.2.0: mvn clean package -Dhadoop2.version=2.2.0

I unit tested this for both versions and saw the tests pass, but I don't currently have access to a hadoop test environment, so could you guys test whether this actually works (I'll do it tomorrow anyway)? Then we can commit it.

was (Author: gokhancapan):
Sergey, I modified your patch and produced a new version. Looking into the dependency tree, it seems it builds against the correct hadoop version. (This may seem irrelevant when looking at the patch, but I had to set argLine to -Xmx1024m so that the unit tests would not fail because of an OOM.)

for hadoop version 1.2.1: mvn clean package
for hadoop version 2.2.0: mvn clean package -Dhadoop.version=2.2.0

I unit tested this for both versions and saw the tests pass, but I don't currently have access to a hadoop test environment, so could you guys test whether this actually works (I'll do it tomorrow anyway)? Then we can commit it.
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908126#comment-13908126 ]

Gokhan Capan commented on MAHOUT-1329:
--------------------------------------

Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster?
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908122#comment-13908122 ]

Sergey Svinarchuk commented on MAHOUT-1329:
-------------------------------------------

I ran the unit tests and examples with hadoop1 and hadoop2. All tests and examples passed. But to build mahout with hadoop2 I use:

mvn clean package -Dhadoop2.version=2.2.0