[jira] [Commented] (MAHOUT-1421) Adapter package for all mahout tools

2014-02-21 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909209#comment-13909209
 ] 

jay vyas commented on MAHOUT-1421:
--

From the mailing list: "Great idea. Hard to do well. Would it be possible for you 
to try to build a picture of all the pieces that need to be connected before you 
start building connectors and converters?" -- Ted Dunning

Sure, Ted. I will see if I can visualize all the different input types.

> Adapter package for all mahout tools
> ------------------------------------
>
> Key: MAHOUT-1421
> URL: https://issues.apache.org/jira/browse/MAHOUT-1421
> Project: Mahout
> Issue Type: Improvement
> Reporter: jay vyas
>
> Hi Mahout. I'd like to create an umbrella JIRA for allowing more runtime 
> flexibility in reading different types of input formats for all Mahout 
> tasks. 
> Specifically, I'd like to start with the FreeTextRecommenderAdapter, which 
> typically requires:
> 1) Hashing text entries into numbers
> 2) Saving the large transformed file on disk
> 3) Feeding it into the classifier 
> Instead, we could build adapters into the classifier itself, so that the user
> 1) Specifies the input file to the recommender
> 2) Specifies a transformation class which converts each record of input to the 
> 3-column recommender format
> 3) Runs the internal Mahout recommender directly against the data
> And thus the user could easily run Mahout against existing data without 
> having to munge it too much.
> This package might be called something like "org.apache.mahout.adapters", and 
> would over time provide flexible adapters to the core Mahout algorithm 
> implementations, so that folks wouldn't have to worry so much about 
> vectors/CSV transformers/etc. 
> Any thoughts on this? If there is positive feedback I can submit an initial 
> patch to get things started.
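
The adapter idea in the description above can be sketched roughly as follows. 
This is only an illustration in Python, not Mahout code: the names stable_id, 
csv_text_adapter and adapt are hypothetical, the "user,item,rating" record 
layout is an assumption, and CRC32 merely stands in for whatever hashing a real 
adapter would use (collisions are possible).

```python
import csv
import io
import zlib
from typing import Callable, Iterable, Iterator, Tuple

# (user_id, item_id, preference): the 3-column recommender format.
Triple = Tuple[int, int, float]


def stable_id(text: str) -> int:
    """Hash a free-text key to a non-negative integer ID (CRC32, so collisions can occur)."""
    return zlib.crc32(text.encode("utf-8"))


def csv_text_adapter(record: str) -> Triple:
    """Hypothetical transformation: 'user name,item name,rating' -> numeric triple."""
    user, item, rating = next(csv.reader([record]))
    return stable_id(user.strip()), stable_id(item.strip()), float(rating)


def adapt(records: Iterable[str], transform: Callable[[str], Triple]) -> Iterator[Triple]:
    """Convert records lazily, so no transformed copy has to be written to disk."""
    for record in records:
        if record.strip():  # skip blank lines
            yield transform(record)


if __name__ == "__main__":
    raw = io.StringIO("alice,The Matrix,5\nbob,The Matrix,3\nalice,Up,4\n")
    for triple in adapt(raw, csv_text_adapter):
        print(triple)
```

Because adapt is a generator, the recommender could consume the triples as a 
stream, which is the point of steps 1) to 3) in the proposal: the user supplies 
the input and a transformation, and the intermediate file disappears.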



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1422) Make a version of RSJ that uses two inputs

2014-02-21 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908579#comment-13908579
 ] 

Pat Ferrel commented on MAHOUT-1422:


Another job needs to be created for the cross-recommender. It could take any 
number of inputs, but I believe it would use the XRSJ in pairs internally. I did 
a prototype that can use 3 actions on the same items by the same users. It does 
a matrix multiply for cooccurrence similarity in pairs, as described above.

I haven't entered that into Jira yet.

> Make a version of RSJ that uses two inputs
> ------------------------------------------
>
> Key: MAHOUT-1422
> URL: https://issues.apache.org/jira/browse/MAHOUT-1422
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 1.0
> Environment: mapreduce
> Reporter: Pat Ferrel
> Labels: recommender, similarity
> Fix For: 1.0
>
>
> Currently the RowSimilarityJob uses a similarity measure to pairwise compare 
> all rows in a DistributedRowMatrix.
> For many applications, including a cross-action recommender, we need something 
> like RSJ that takes two DRMs and compares matching rows of each. The output 
> would be the same form as RSJ, and ideally would allow the use of any 
> similarity type already defined--especially LLR.
> There are two implementations of a Cross-Recommender, one based on the Mahout 
> RecommenderJob and another based on Solr, that can immediately benefit from 
> a Cross-RSJ. 
> A modification of the matrix multiply job may be a place to start, since the 
> current RSJ seems to rely heavily on self-similarity.
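
To make the two-input idea with LLR a little more concrete, here is a small 
in-memory sketch in Python. It is not the proposed MapReduce job: the 
dict-of-sets representation (item -> set of user ids), the function names, and 
the choice to skip pairs with no shared users are all assumptions made for 
illustration. The llr function itself is the usual log-likelihood ratio on a 
2x2 contingency table.

```python
import math
from typing import Dict, Hashable, Set, Tuple


def x_log_x(x: int) -> float:
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts: int) -> float:
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio for a 2x2 table of event counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def cross_llr(a: Dict[Hashable, Set[int]],
              b: Dict[Hashable, Set[int]],
              num_users: int) -> Dict[Tuple[Hashable, Hashable], float]:
    """Score every (item in a, item in b) pair that shares at least one user."""
    scores = {}
    for item_a, users_a in a.items():
        for item_b, users_b in b.items():
            k11 = len(users_a & users_b)      # users with both actions
            if k11 == 0:
                continue                      # assumption: ignore non-cooccurring pairs
            k12 = len(users_a) - k11          # only the action in a
            k21 = len(users_b) - k11          # only the action in b
            k22 = num_users - k11 - k12 - k21  # neither
            scores[(item_a, item_b)] = llr(k11, k12, k21, k22)
    return scores


if __name__ == "__main__":
    purchases = {"phone": {1, 2, 3}, "case": {1, 2}}
    views = {"phone-page": {1, 2, 3, 4}, "blog": {4, 5}}
    for pair, score in sorted(cross_llr(purchases, views, num_users=6).items()):
        print(pair, round(score, 3))
```

Note that the two inputs share only the user id space; the item keys on each 
side are unrelated, so the result is indexed by (item in a, item in b) pairs 
rather than by a single item space.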





[jira] [Commented] (MAHOUT-1422) Make a version of RSJ that uses two inputs

2014-02-21 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908548#comment-13908548
 ] 

Pat Ferrel commented on MAHOUT-1422:


Yes and no. What you describe makes sense, but it can be done with the proposed 
implementation.

The algorithm, when applied to a recommender, works on pairs of inputs. In 
practice you would create one recommender for purchases, one for purchases from 
web-views and purchases, and one for purchases from mobile-views and purchases. 
Then the results/recs are combined linearly. Or, using Solr, each of the pairs 
creates a field that is indexed separately. The query would then be purchases 
against the purchase self-similarity field, web-views against the 
web-view/purchase similarity field, and mobile-views against the 
mobile-view/purchase similarity field. This allows each type of history to add 
information to the query.

In other words, you can do what you are talking about as a combination of pairs.
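
As a rough illustration of "combined linearly", the following Python sketch 
merges the scores produced by each pairwise recommender with one weight per 
pair. The pair names, scores and weights are made up for the example, and 
combine_linearly is a hypothetical helper, not an existing Mahout or Solr call.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def combine_linearly(per_pair_scores: Dict[str, Dict[str, float]],
                     weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Weighted sum of item scores across the pairwise recommenders, best first."""
    combined: Dict[str, float] = defaultdict(float)
    for pair_name, scores in per_pair_scores.items():
        weight = weights.get(pair_name, 0.0)  # pairs without a weight contribute nothing
        for item, score in scores.items():
            combined[item] += weight * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    scores = {
        "purchase/purchase": {"case": 0.9, "charger": 0.4},
        "web-view/purchase": {"charger": 0.8},
        "mobile-view/purchase": {"case": 0.1, "cable": 0.5},
    }
    weights = {
        "purchase/purchase": 1.0,
        "web-view/purchase": 0.5,
        "mobile-view/purchase": 0.5,
    }
    for item, score in combine_linearly(scores, weights):
        print(item, round(score, 3))
```

Adding a fourth kind of history would then mean adding one more pairwise result 
and one more weight, without changing the pairwise job itself.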

However, in other contexts I believe the cross-similarity can be applied across 
any number of inputs as long as the row space is the same, and there are some 
interesting applications of this. But I think you can get to the same place by 
chaining the pairwise jobs. So unless there is some benefit in the 
implementation to performing the entire chain at once...





[jira] [Commented] (MAHOUT-1422) Make a version of RSJ that uses two inputs

2014-02-21 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908535#comment-13908535
 ] 

Andrew Musselman commented on MAHOUT-1422:
--

Is there a reason to limit this to two inputs if we could add more, e.g. 
customer by purchase, customer by page view, and customer by mobile page view?



[jira] [Updated] (MAHOUT-1422) Make a version of RSJ that uses two inputs

2014-02-21 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1422:
---

Description: 
Currently the RowSimilarityJob uses a similarity measure to pairwise compare all 
rows in a DistributedRowMatrix.

For many applications, including a cross-action recommender, we need something 
like RSJ that takes two DRMs and compares matching rows of each. The output 
would be the same form as RSJ, and ideally would allow the use of any 
similarity type already defined--especially LLR.

There are two implementations of a Cross-Recommender, one based on the Mahout 
RecommenderJob and another based on Solr, that can immediately benefit from a 
Cross-RSJ. 

A modification of the matrix multiply job may be a place to start, since the 
current RSJ seems to rely heavily on self-similarity.

  was:
Currently the RowSimilarityJob uses a similarity measure to pairwise compare all 
row in a DistributedRowMatrix.

For many applications, including a cross-action recommender, we need something 
like RSJ that takes two DRMs and compares matching rows of each. The output 
would be the same form as RSJ, and ideally would allow the use of any 
similarity type already defined--especially LLR.

There are two implementations of a Cross-Recommender, one based on the Mahout 
RecommenderJob and another based on Solr, that can immediately benefit from a 
Cross-RSJ. 

A modification of the matrix multiply job may be a place to start, since the 
current RSJ seems to rely heavily on self-similarity.




[jira] [Commented] (MAHOUT-1422) Make a version of RSJ that uses two inputs

2014-02-21 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908487#comment-13908487
 ] 

Pat Ferrel commented on MAHOUT-1422:


This job should recreate AB' for cooccurrence similarity, so the row space 
(number of dimensions) of both input matrices must be the same. There are some 
applications where the column spaces of the two are not the same, and this 
should be allowed, as it is in the matrix multiply special case. The column id 
spaces should not be interpreted as representing the same things, whereas the 
row id spaces are identical.

The options for the CrossRowSimilarityJob can be a superset of the current RSJ 
options, with the addition of a second input matrix: --input will need to become 
--input1 and --input2, and --numberOfColumns will need to become 
--numberOfColumns1 and --numberOfColumns2, or some such.

See Ted for further description of these asymmetric applications, where the two 
column spaces are not the same.

Also note that the current job assumes a symmetric DRM as output, 
and this will not be the case for an XRSJ. 
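
The shape constraint can be seen in a tiny pure-Python sketch. It assumes rows 
are users and columns are items/actions; under that assumption the product that 
only needs matching row counts is written A'B below, which may differ from the 
AB' notation above depending on how the matrices are oriented. The function 
names are hypothetical and nothing here is Mahout API.

```python
from typing import List

Matrix = List[List[int]]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def cross_cooccurrence(a: Matrix, b: Matrix) -> Matrix:
    """Return A'B; requires the same number of rows (same row id space) in both inputs."""
    if len(a) != len(b):
        raise ValueError("both inputs must have the same number of rows")
    return matmul(transpose(a), b)


if __name__ == "__main__":
    # 3 users x 2 items (e.g. purchases)
    a = [[1, 0],
         [1, 1],
         [0, 1]]
    # 3 users x 3 items (e.g. views); a different column space
    b = [[1, 1, 0],
         [0, 1, 0],
         [0, 0, 1]]
    print(cross_cooccurrence(a, b))  # a 2 x 3 result, so it cannot be symmetric
```

With two column spaces of different size the output is not even square, which is 
why the symmetric-output assumption of the current job would not carry over.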



[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908443#comment-13908443
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Good news: I tried that too, on a 2.2.0 cluster.
seqdir, seq2sparse, and kmeans worked without a problem.

I'm going to wait until Monday to commit this, in case folks want to verify that 
it works.



> Mahout for hadoop 2
> -------------------
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
> Issue Type: Task
> Components: build
> Affects Versions: 0.9
> Reporter: Sergey Svinarchuk
> Assignee: Gokhan Capan
> Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3.patch, 1329.patch
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.





[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908126#comment-13908126
 ] 

Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:59 AM:
---

Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster? [EDIT: Sorry, I missed that you 
mentioned you ran the examples. Great, then.]



was (Author: gokhancapan):
Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster?



[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907480#comment-13907480
 ] 

Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:52 AM:
---

Sergey, I modified your patch and produced a new version. Looking at the 
dependency tree, it seems to build against the correct hadoop version.

(This may seem irrelevant when looking at the patch, but I had to set argLine 
to -Xmx1024m so that the unit tests would not fail with an OOM.)

for hadoop version 1.2.1: mvn clean package
for hadoop version 2.2.0: mvn clean package -Dhadoop2.version=2.2.0

I unit tested this for both versions and saw the tests pass, but I don't have 
access to a hadoop test environment currently, so could you guys test whether 
this actually works (I'll do it tomorrow anyway)? 

Then we can commit it.


was (Author: gokhancapan):
Sergey, I modified your patch and produced a new version. Looking at the 
dependency tree, it seems to build against the correct hadoop version.

(This may seem irrelevant when looking at the patch, but I had to set argLine 
to -Xmx1024m so that the unit tests would not fail with an OOM.)

for hadoop version 1.2.1: mvn clean package
for hadoop version 2.2.0: mvn clean package -Dhadoop.version=2.2.0

I unit tested this for both versions and saw the tests pass, but I don't have 
access to a hadoop test environment currently, so could you guys test whether 
this actually works (I'll do it tomorrow anyway)? 

Then we can commit it.



[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908126#comment-13908126
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster?



[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Sergey Svinarchuk (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908122#comment-13908122
 ] 

Sergey Svinarchuk commented on MAHOUT-1329:
---

I ran the unit tests and examples with hadoop1 and hadoop2. All tests and 
examples passed.
But to build Mahout with hadoop2 I use: mvn clean package 
-Dhadoop2.version=2.2.0
