[jira] [Created] (MAHOUT-1515) Contact the original Frequent Pattern Mining author

2014-04-14 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1515:
--

 Summary: Contact the original Frequent Pattern Mining author
 Key: MAHOUT-1515
 URL: https://issues.apache.org/jira/browse/MAHOUT-1515
 Project: Mahout
  Issue Type: Task
Reporter: Sebastian Schelter
Priority: Critical
 Fix For: 1.0


We should contact the original FPM author to ask about maintenance of the 
algorithm. Otherwise this becomes a candidate for removal.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAHOUT-1514) Contact the original Random Forest author

2014-04-14 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1514:
--

 Summary: Contact the original Random Forest author
 Key: MAHOUT-1514
 URL: https://issues.apache.org/jira/browse/MAHOUT-1514
 Project: Mahout
  Issue Type: Task
Reporter: Sebastian Schelter
Priority: Critical
 Fix For: 1.0


We should contact the original Random Forest author to ask about maintenance of 
the implementation. Otherwise, this becomes a candidate for removal.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAHOUT-1513) Deprecate Canopy Clustering

2014-04-14 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1513:
--

 Summary: Deprecate Canopy Clustering
 Key: MAHOUT-1513
 URL: https://issues.apache.org/jira/browse/MAHOUT-1513
 Project: Mahout
  Issue Type: Task
Reporter: Sebastian Schelter
 Fix For: 1.0


citing [~smarthi] "I meant to deprecate first (and eventually remove) Canopy 
clustering. This is in line with the conversation I had with Ted and Frank at 
AMS about weaning users away from the old style Canopy->KMeans clustering to 
start using Streaming KMeans. No point in keeping Canopy once users switch to 
using Streaming KMeans."
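
For users making that switch, Streaming KMeans already has a command-line 
driver. A hedged sketch (paths and cluster counts are placeholders, and exact 
flags may vary by version; -k is the number of clusters, -km the estimated 
number of map-side clusters):

$ mahout streamingkmeans -i /path/to/input/vectors -o /path/to/output \
    -k 20 -km 200 -ow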



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAHOUT-1512) Hadoop 2 compatibility

2014-04-14 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1512:
--

 Summary: Hadoop 2 compatibility
 Key: MAHOUT-1512
 URL: https://issues.apache.org/jira/browse/MAHOUT-1512
 Project: Mahout
  Issue Type: Task
Reporter: Sebastian Schelter
Priority: Critical
 Fix For: 1.0


We must ensure that all our MR code also runs on Hadoop 2. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAHOUT-1511) Renaming core to mrlegacy

2014-04-14 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1511:
--

 Summary: Renaming core to mrlegacy
 Key: MAHOUT-1511
 URL: https://issues.apache.org/jira/browse/MAHOUT-1511
 Project: Mahout
  Issue Type: Task
Reporter: Sebastian Schelter
 Fix For: 1.0


Rename the core module to mrlegacy to reflect that we still maintain this code 
but do not add new MR algorithms. We should aim to gradually pull the items we 
really need out of this module.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAHOUT-1510) Goodbye MapReduce

2014-04-14 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1510:
--

 Summary: Goodbye MapReduce
 Key: MAHOUT-1510
 URL: https://issues.apache.org/jira/browse/MAHOUT-1510
 Project: Mahout
  Issue Type: Task
  Components: Documentation
Reporter: Sebastian Schelter
 Fix For: 1.0


We should prominently state on the website that we reject any future MR 
algorithm contributions (but still maintain and bugfix what we have so far).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Tackling the "legacy dilemma"

2014-04-14 Thread Sebastian Schelter

Hi,

From reading the thread, I have the impression that we agree on the 
following actions:


 * reject any future MR algorithm contributions, and prominently state this
on the website and in talks
 * make all existing algorithm code compatible with Hadoop 2; if no one is 
willing to make an existing algorithm compatible, remove the algorithm
 * deprecate Canopy clustering
 * email the original FPM and Random Forest authors to ask whether they 
will maintain the algorithms
 * rename core to "mr-legacy" (and gradually pull the items we really need 
out of it later)


I will create JIRA tickets for those action points. I think the biggest 
challenge here is the Hadoop 2 compatibility. Is someone volunteering to 
drive that? That would be awesome.


Best,
Sebastian


On 04/13/2014 07:19 PM, Andrew Musselman wrote:

This is a good summary of how I feel too.


On Apr 13, 2014, at 10:15 AM, Sebastian Schelter  wrote:

Unfortunately, it's not that easy to get enough volunteer work. I issued the 
third call for working on the documentation today, as there are still lots of 
open issues. That's why I'm trying to suggest a move that involves as little 
work as possible.

We should get the MR codebase into a state that we all can live with and then 
focus on new stuff like the scala DSL.

--sebastian





On 04/13/2014 07:09 PM, Giorgio Zoppi wrote:
The best thing would be to make a plan and see how much effort it needs. Then 
find volunteers to accomplish the task. I'm quite sure there are a lot of 
people out there who are willing to help out.

BR,
deneb.


2014-04-13 18:45 GMT+02:00 Sebastian Schelter :


Hi,

I took some days to let the latest discussion about the state and future
of Mahout go through my head. I think the most important thing to address
right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
are currently unmaintained, documentation is outdated and the original
authors have abandoned Mahout. For some algorithms it is hard to even get 
questions answered on the mailing list (e.g. Random Forest). I agree with 
Sean's comments that letting the code linger around is not an option and will 
continue to harm Mahout.

In the previous discussion, I suggested making a radical move and aiming to 
delete this codebase, but there were serious objections from committers and 
users that convinced me that there is still usage of, and interest in, that 
codebase.

That puts us into a "legacy dilemma". We cannot delete the code without
harming our userbase. On the other hand, I don't see anyone willing to
rework the codebase. Furthermore, the code cannot keep lingering around as it 
does now, especially when we fail to answer questions or don't provide 
documentation.

*We have to make a move*!

I suggest the following actions with regard to the MR codebase. I hope that 
they find consensus. If there are objections, please give alternatives; 
*keeping everything as-is is not an option*:

  * reject any future MR algorithm contributions, and prominently state this on
the website and in talks
  * make all existing algorithm code compatible with Hadoop 2; if no one is
willing to make an existing algorithm compatible, remove the
algorithm
  * deprecate the existing MR algorithms, yet still take bug-fix
contributions
  * remove Random Forest, as we cannot even answer questions about the
implementation on the mailing list

There are two more actions that I would like to see, but I'd be willing to
give them up if there are objections:

  * move the MR algorithms into a separate maven module
  * remove Frequent Pattern Mining again (we already aimed for that in 0.9
but had one user who shouted but never returned to us)

Let me know what you think.

--sebastian






[jira] [Commented] (MAHOUT-1504) Enable/fix thetaSummer job in TrainNaiveBayesJob

2014-04-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969223#comment-13969223
 ] 

Hudson commented on MAHOUT-1504:


SUCCESS: Integrated in Mahout-Quality #2569 (See 
[https://builds.apache.org/job/Mahout-Quality/2569/])
MAHOUT-1504: Enable/fix thetaSummer job in TrainNaiveBayesJob (smarthi: rev 
1587393)
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/BayesUtils.java
MAHOUT-1504: Enable/fix thetaSummer job in TrainNaiveBayesJob (smarthi: rev 
1587390)
* /mahout/trunk/CHANGELOG
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/BayesUtils.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/ComplementaryNaiveBayesClassifier.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/NaiveBayesModel.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/StandardNaiveBayesClassifier.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/AbstractThetaTrainer.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/ComplementaryThetaTrainer.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/StandardThetaTrainer.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/TrainNaiveBayesJob.java
* 
/mahout/trunk/core/src/test/java/org/apache/mahout/classifier/naivebayes/NaiveBayesTestBase.java


> Enable/fix thetaSummer job in TrainNaiveBayesJob
> 
>
> Key: MAHOUT-1504
> URL: https://issues.apache.org/jira/browse/MAHOUT-1504
> Project: Mahout
>  Issue Type: Task
>  Components: Classification, Examples
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1504.patch
>
>
> A new implementation of Naive Bayes was introduced in 0.7. The weight (theta) 
> normalization job was at least partially carried over but not fully 
> implemented or enabled. Weight normalization does not affect simple NB or 
> CNB; however, enabling it will allow for all NB implementations in the Rennie 
> et al. paper. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1503) TestNaiveBayesDriver fails in sequential mode

2014-04-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969224#comment-13969224
 ] 

Hudson commented on MAHOUT-1503:


SUCCESS: Integrated in Mahout-Quality #2569 (See 
[https://builds.apache.org/job/Mahout-Quality/2569/])
MAHOUT-1503: TestNaiveBayesDriver fails in sequential mode (smarthi: rev 
1587387)
* /mahout/trunk/CHANGELOG
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/test/TestNaiveBayesDriver.java


> TestNaiveBayesDriver fails in sequential mode
> -
>
> Key: MAHOUT-1503
> URL: https://issues.apache.org/jira/browse/MAHOUT-1503
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, Examples
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1503.patch
>
>
> As reported by Chandler Burgess, testnb fails in sequential mode with 
> exception:
> Exception in thread "main" java.io.FileNotFoundException: 
> /tmp/mahout-work-andy/20news-train-vectors (Is a directory)
>   at java.io.FileInputStream.open(Native Method)
>   at java.io.FileInputStream.<init>(FileInputStream.java:120)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:71)
> {...} at 
> org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:99)
> {...}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1509) Invalid URL in link from "quick start/basics" page

2014-04-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969222#comment-13969222
 ] 

Hudson commented on MAHOUT-1509:


SUCCESS: Integrated in Mahout-Quality #2569 (See 
[https://builds.apache.org/job/Mahout-Quality/2569/])
MAHOUT-1509:Invalid URL in link from "quick start/basics" page (smarthi: rev 
1587383)
* /mahout/trunk/CHANGELOG


> Invalid URL in link from "quick start/basics" page
> --
>
> Key: MAHOUT-1509
> URL: https://issues.apache.org/jira/browse/MAHOUT-1509
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
> Environment: Website
>Reporter: Nick Martin
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: documentation
> Fix For: 1.0
>
>
> From https://mahout.apache.org/users/basics/quickstart.html the "Dos and 
> Don'ts" link under "Recommendations" goes to nowhere (URL typo - 
> "ecommender") 
> https://mahout.apache.org/users/recommender/ecommender-first-timer-faq.html 
> Can't remember who's running point on the URL updates or I'd [at] them...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1445) Create an intro for item based recommender

2014-04-14 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1445:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Added to the website at 
https://mahout.apache.org/users/recommender/intro-itembased-hadoop.html . 
Thanks for the contribution!

> Create an intro for item based recommender
> --
>
> Key: MAHOUT-1445
> URL: https://issues.apache.org/jira/browse/MAHOUT-1445
> Project: Mahout
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 1.0
>Reporter: Maciej Mazur
>  Labels: documentation, recommender
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1477) Clean up website on Logistic Regression

2014-04-14 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969203#comment-13969203
 ] 

Sebastian Schelter commented on MAHOUT-1477:


[~nimartin] I think you can go ahead and take this one, as there hasn't been 
activity for three weeks from [~kanjilal]

> Clean up website on Logistic Regression
> ---
>
> Key: MAHOUT-1477
> URL: https://issues.apache.org/jira/browse/MAHOUT-1477
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
>
> The website on Logistic Regression needs a cleanup. We need to go through the 
> text, remove dead links and check whether the information is still consistent 
> with the current code. We should also link to the example created in 
> MAHOUT-1425 
> https://mahout.apache.org/users/classification/logistic-regression.html



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1439) Update talks on Mahout

2014-04-14 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969186#comment-13969186
 ] 

Ted Dunning commented on MAHOUT-1439:
-

@nimartin

That would be SOOO helpful.



> Update talks on Mahout
> --
>
> Key: MAHOUT-1439
> URL: https://issues.apache.org/jira/browse/MAHOUT-1439
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
>
> The talks listed on our homepage seem to end somewhere in 2012.
> I know that there have been tons of other talks on Mahout since then; I've 
> added mine already. It would be great if everybody who knows of additional 
> talks would paste them here, so I can add them to the website.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAHOUT-1369) Why is theta normalization for naive bayes classification commented out?

2014-04-14 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reassigned MAHOUT-1369:
-

Assignee: Suneel Marthi

> Why is theta normalization for naive bayes classification commented out?
> --
>
> Key: MAHOUT-1369
> URL: https://issues.apache.org/jira/browse/MAHOUT-1369
> Project: Mahout
>  Issue Type: Question
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
> Environment: mahout 0.8
>Reporter: utku yaman
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: features
> Fix For: 1.0
>
>
> TrainNaiveBayesJob lines 155-158 and BayesUtils lines 86-93 are commented 
> out; these lines perform the theta normalization for Bayes. What is the 
> problem with the code, and is there a plan to fix these methods?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1369) Why is theta normalization for naive bayes classification commented out?

2014-04-14 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi resolved MAHOUT-1369.
---

Resolution: Fixed

Resolved by the fix for MAHOUT-1504

> Why is theta normalization for naive bayes classification commented out?
> --
>
> Key: MAHOUT-1369
> URL: https://issues.apache.org/jira/browse/MAHOUT-1369
> Project: Mahout
>  Issue Type: Question
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
> Environment: mahout 0.8
>Reporter: utku yaman
>Priority: Minor
>  Labels: features
> Fix For: 1.0
>
>
> TrainNaiveBayesJob lines 155-158 and BayesUtils lines 86-93 are commented 
> out; these lines perform the theta normalization for Bayes. What is the 
> problem with the code, and is there a plan to fix these methods?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1504) Enable/fix thetaSummer job in TrainNaiveBayesJob

2014-04-14 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1504:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to trunk, thanks again.

> Enable/fix thetaSummer job in TrainNaiveBayesJob
> 
>
> Key: MAHOUT-1504
> URL: https://issues.apache.org/jira/browse/MAHOUT-1504
> Project: Mahout
>  Issue Type: Task
>  Components: Classification, Examples
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1504.patch
>
>
> A new implementation of Naive Bayes was introduced in 0.7. The weight (theta) 
> normalization job was at least partially carried over but not fully 
> implemented or enabled. Weight normalization does not affect simple NB or 
> CNB; however, enabling it will allow for all NB implementations in the Rennie 
> et al. paper. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAHOUT-1504) Enable/fix thetaSummer job in TrainNaiveBayesJob

2014-04-14 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reassigned MAHOUT-1504:
-

Assignee: Suneel Marthi

> Enable/fix thetaSummer job in TrainNaiveBayesJob
> 
>
> Key: MAHOUT-1504
> URL: https://issues.apache.org/jira/browse/MAHOUT-1504
> Project: Mahout
>  Issue Type: Task
>  Components: Classification, Examples
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1504.patch
>
>
> A new implementation of Naive Bayes was introduced in 0.7. The weight (theta) 
> normalization job was at least partially carried over but not fully 
> implemented or enabled. Weight normalization does not affect simple NB or 
> CNB; however, enabling it will allow for all NB implementations in the Rennie 
> et al. paper. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1503) TestNaiveBayesDriver fails in sequential mode

2014-04-14 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1503:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed the patch with minor changes. Need to add a unit test to ensure 
adequate test coverage.
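
For context, the sequential-mode failure came from opening the training-vectors 
directory as if it were a single file. The snippet below is only a hedged 
sketch of the general shape of such a fix (not the committed patch; the class 
name is made up): iterate the part files under the directory with Mahout's 
SequenceFileDirIterable instead of opening the directory directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.VectorWritable;

public class SequentialTestSketch {
  public static void main(String[] args) {
    // Iterate every part file under the directory instead of opening the
    // directory itself, which is what throws the FileNotFoundException.
    Path inputDir = new Path("/tmp/mahout-work-andy/20news-train-vectors");
    Configuration conf = new Configuration();
    for (Pair<Text, VectorWritable> record :
        new SequenceFileDirIterable<Text, VectorWritable>(
            inputDir, PathType.LIST, PathFilters.partFilter(), conf)) {
      // score record.getSecond().get() against the model here
    }
  }
}

PathFilters.partFilter() restricts the iteration to part-* files, skipping 
_SUCCESS and similar markers.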

> TestNaiveBayesDriver fails in sequential mode
> -
>
> Key: MAHOUT-1503
> URL: https://issues.apache.org/jira/browse/MAHOUT-1503
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, Examples
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1503.patch
>
>
> As reported by Chandler Burgess, testnb fails in sequential mode with 
> exception:
> Exception in thread "main" java.io.FileNotFoundException: 
> /tmp/mahout-work-andy/20news-train-vectors (Is a directory)
>   at java.io.FileInputStream.open(Native Method)
>   at java.io.FileInputStream.<init>(FileInputStream.java:120)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:71)
> {...} at 
> org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:99)
> {...}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1439) Update talks on Mahout

2014-04-14 Thread Nick Martin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969142#comment-13969142
 ] 

Nick Martin commented on MAHOUT-1439:
-

[~tdunning] I have some time this week to work on doc cleanup...if it would 
help I can scan your public slideshare and comment back a list of talks/topics 
and dates. Might save you some cycles on aggregating the talk timeline if 
you're tied up.

> Update talks on Mahout
> --
>
> Key: MAHOUT-1439
> URL: https://issues.apache.org/jira/browse/MAHOUT-1439
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
>
> The talks listed on our homepage seem to end somewhere in 2012.
> I know that there have been tons of other talks on Mahout since then; I've 
> added mine already. It would be great if everybody who knows of additional 
> talks would paste them here, so I can add them to the website.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1477) Clean up website on Logistic Regression

2014-04-14 Thread Nick Martin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969137#comment-13969137
 ] 

Nick Martin commented on MAHOUT-1477:
-

[~kanjilal] hey there - have you had a chance to start this yet? If not, I have 
some time this week and can probably knock it out, but I don't want to step on 
your work if you've started something. Let me know ASAP so I know whether to 
start or not. Thx.

> Clean up website on Logistic Regression
> ---
>
> Key: MAHOUT-1477
> URL: https://issues.apache.org/jira/browse/MAHOUT-1477
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
>
> The website on Logistic Regression needs a cleanup. We need to go through the 
> text, remove dead links and check whether the information is still consistent 
> with the current code. We should also link to the example created in 
> MAHOUT-1425 
> https://mahout.apache.org/users/classification/logistic-regression.html



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAHOUT-1509) Invalid URL in link from "quick start/basics" page

2014-04-14 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reassigned MAHOUT-1509:
-

Assignee: Suneel Marthi

> Invalid URL in link from "quick start/basics" page
> --
>
> Key: MAHOUT-1509
> URL: https://issues.apache.org/jira/browse/MAHOUT-1509
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
> Environment: Website
>Reporter: Nick Martin
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: documentation
> Fix For: 1.0
>
>
> From https://mahout.apache.org/users/basics/quickstart.html the "Dos and 
> Don'ts" link under "Recommendations" goes to nowhere (URL typo - 
> "ecommender") 
> https://mahout.apache.org/users/recommender/ecommender-first-timer-faq.html 
> Can't remember who's running point on the URL updates or I'd [at] them...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1509) Invalid URL in link from "quick start/basics" page

2014-04-14 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi resolved MAHOUT-1509.
---

   Resolution: Fixed
Fix Version/s: 1.0

Thanks for pointing this out; fixed the bad link.

> Invalid URL in link from "quick start/basics" page
> --
>
> Key: MAHOUT-1509
> URL: https://issues.apache.org/jira/browse/MAHOUT-1509
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
> Environment: Website
>Reporter: Nick Martin
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: documentation
> Fix For: 1.0
>
>
> From https://mahout.apache.org/users/basics/quickstart.html the "Dos and 
> Don'ts" link under "Recommendations" goes to nowhere (URL typo - 
> "ecommender") 
> https://mahout.apache.org/users/recommender/ecommender-first-timer-faq.html 
> Can't remember who's running point on the URL updates or I'd [at] them...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAHOUT-1509) Invalid URL in link from "quick start/basics" page

2014-04-14 Thread Nick Martin (JIRA)
Nick Martin created MAHOUT-1509:
---

 Summary: Invalid URL in link from "quick start/basics" page
 Key: MAHOUT-1509
 URL: https://issues.apache.org/jira/browse/MAHOUT-1509
 Project: Mahout
  Issue Type: Documentation
  Components: Examples
 Environment: Website
Reporter: Nick Martin
Priority: Minor


From https://mahout.apache.org/users/basics/quickstart.html the "Dos and 
Don'ts" link under "Recommendations" goes to nowhere (URL typo - "ecommender") 
https://mahout.apache.org/users/recommender/ecommender-first-timer-faq.html 

Can't remember who's running point on the URL updates or I'd [at] them...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1445) Create an intro for item based recommender

2014-04-14 Thread Nick Martin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969112#comment-13969112
 ] 

Nick Martin commented on MAHOUT-1445:
-

Item Based Recommender
Introduction

Mahout’s item-based recommender is a flexible, easily implemented algorithm 
with a diverse range of applications. The minimal structure of the primary 
input file and the availability of ancillary filtering controls make sourcing 
the required data and shaping the desired output both efficient and 
straightforward.

Typical use cases include:
• Recommend products to customers via an eCommerce platform (think: Amazon, 
Netflix, Overstock)
• Identify organic sales opportunities
• Segment users/customers based on similar item preferences

Broadly speaking, Mahout's item-based recommendation algorithm takes as input 
customer preferences by item and generates an output recommending similar items 
with a score indicating the likelihood a customer will "like" the recommended 
item.

One of the strengths of the item-based recommender is its adaptability to your 
business conditions or research interests. For example, there are many 
available approaches for capturing product preference. One such method is to 
count the total orders for a given product for each customer (e.g. Acme 
Corp has ordered Widget-A 5,678 times), while others rely on user preference 
captured via the web (e.g. Jane Doe rated a movie as five stars, or gave a 
product two thumbs up).

Additionally, a variety of methodologies can be implemented to narrow the focus 
of Mahout's recommendations, such as:
• Exclude low-volume or low-profitability products from consideration
• Group customers by segment or market rather than using user/customer-level 
data
• Exclude zero-dollar transactions, returns or other order types
• Map product substitutions into the Mahout input (e.g. if WidgetA is a 
recommended item, replace it with WidgetX)

The item-based recommender output can be easily consumed by downstream 
applications (e.g. websites, ERP systems or sales-force automation tools) and 
is configurable, so users can determine the number of item recommendations 
generated by the algorithm.


Example

Testing the item-based recommender can be a simple and potentially quite 
rewarding endeavor. Whereas the typical sample use case for collaborative 
filtering focuses on utilization of, and integration with, eCommerce platforms, 
we can instead look at a potential use case applicable to most businesses (even 
those without a web presence). Let’s look at how a company might use Mahout’s 
item-based recommender to identify new sales opportunities for an existing 
customer base. First, you’ll need to get Mahout up and running; the 
instructions can be found at 
https://mahout.apache.org/users/basics/quickstart.html. After you've ensured 
Mahout is properly installed, we’re ready to run a quick example. 

Step 1: Gather some test data
Mahout’s item-based recommender relies on three key pieces of data: userID, 
itemID and preference. The “users” could be website visitors or simply 
customers who purchase products from your business. Similarly, items could be 
products, product groups or even pages on your website – really anything you 
would want to recommend to a group of users or customers. For our example, 
let’s use customer orders as a proxy for preference. A simple count of distinct 
orders by customer, by product, will work for this example. You’ll find, as you 
explore ways to manipulate the item-based recommender, that the preference 
value can be many things (page clicks, explicit ratings, order counts, etc.). 
Once your test data is gathered, put it in a .txt file with comma-separated 
values and no column headers. 
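
For illustration, a few rows of such an input file (userID,itemID,preference 
triples; all IDs and values here are made up) might look like:

1,101,5.0
1,102,3.0
2,101,2.0
2,103,4.0
3,102,4.5
3,103,1.0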

Step 2: Pick a similarity measure
Choosing a similarity measure for use in a production environment is something 
that requires careful testing, evaluation and research. For our example 
purposes, we’ll just go with a Mahout similarity classname called 
“SIMILARITY_LOGLIKELIHOOD”.
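
For reference, other similarity classnames the distributed job accepts include 
SIMILARITY_COSINE, SIMILARITY_PEARSON_CORRELATION, 
SIMILARITY_EUCLIDEAN_DISTANCE, SIMILARITY_TANIMOTO_COEFFICIENT and 
SIMILARITY_CITY_BLOCK; running the job with --help should print the full list 
for your version.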

Step 3: Configure the Mahout command
Assuming your JAVA_HOME is appropriately set and Mahout was installed properly, 
we’re ready to configure our syntax. Enter the following command:

$ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input/file 
-o /path/to/desired/output --numRecommendations 25 

Running the command will execute a series of jobs, the final product of which 
is an output file deposited in the directory specified in the command. The 
output file will contain two columns: the userID and an array of itemIDs and 
scores. 
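
Using the same made-up data as above, a single output line for user 3 might 
look roughly like this (the userID, a tab, then itemID:score pairs):

3  [101:4.8,104:3.9]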

Step 4: Making use of the output and doing more with Mahout
The output file generated in our simple example can be transformed using your 
tool of choice and consumed by downstream applications. A variety of 
configuration options exist for Mahout’s item-based recommender to accommodate 
custom bus

[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969070#comment-13969070
 ] 

Pat Ferrel commented on MAHOUT-1464:


OK, maybe something is misconfigured in IDEA. Let's leave IDEA out of the loop.

If you could indulge me -- how do you run this from the CLI? 
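
For what it's worth, with Spark 0.9 standalone, one hedged way to launch a 
driver from the CLI (the master URL, jar location, and driver class below are 
placeholders, not the actual names in this patch) is:

$ ./bin/spark-class org.apache.spark.deploy.Client launch \
    spark://master:7077 hdfs://namenode/jars/cooccurrence-job.jar \
    com.example.CooccurrenceDriver

Alternatively, run the assembled jar as a plain Java application that sets the 
master URL itself.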

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1504) Enable/fix thetaSummer job in TrainNaiveBayesJob

2014-04-14 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1504:
---

Status: Patch Available  (was: Open)

> Enable/fix thetaSummer job in TrainNaiveBayesJob
> 
>
> Key: MAHOUT-1504
> URL: https://issues.apache.org/jira/browse/MAHOUT-1504
> Project: Mahout
>  Issue Type: Task
>  Components: Classification, Examples
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1504.patch
>
>
> A new implementation of Naive Bayes was introduced in 0.7. The weight (theta) 
> normalization job was at least partially carried over but not fully 
> implemented or enabled. Weight normalization does not affect simple NB or 
> CNB; however, enabling it will allow for all NB implementations in the Rennie 
> et al. paper. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1504) Enable/fix thetaSummer job in TrainNaiveBayesJob

2014-04-14 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1504:
---

Attachment: MAHOUT-1504.patch

This patch fixes the thetaSummer job bug. With this, CNB will run with weight 
normalization as per section 3.2 of the Rennie paper. I decided to keep it 
simple and just get the weight normalization working. This allows for the 
full algorithm as outlined in Table 4 of Rennie.
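
For reference, the weight normalization in that section of Rennie et al. (as I 
read it) rescales each class's log-weights by their L1 norm,

    w_ci <- w_ci / sum_k |w_ck|

so that classes with many heavily weighted features do not dominate the 
decision rule.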

Weight normalization is not needed for standard NB and the thetaSummer Job is 
just an added expense. Though the weight summations are all done, I've left the 
weight normalization step commented out in StandardNaiveBayesClassifier.

I am thinking maybe something like adding a -w option for weight normalization 
or only running the thetaSummer Job when the -c option is supplied might make 
sense (the former may unnecessarily complicate things).  Another (probably 
better) option would be to store the calculated weights in the model (during 
the training phase) so that they don't need to be recalculated when 
testing/classifying.  Probably questions for another JIRA.

Let me know if any changes are needed.

> Enable/fix thetaSummer job in TrainNaiveBayesJob
> 
>
> Key: MAHOUT-1504
> URL: https://issues.apache.org/jira/browse/MAHOUT-1504
> Project: Mahout
>  Issue Type: Task
>  Components: Classification, Examples
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1504.patch
>
>
> A new implementation of Naive Bayes was introduced in 0.7. The weight (theta) 
> normalization job was at least partially carried over but not fully 
> implemented or enabled. Weight normalization does not affect simple NB or 
> CNB; however, enabling it will allow for all NB implementations in the Rennie 
> et al. paper. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968894#comment-13968894
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

I have been dumping from Spark to HDFS, HBase, memory-mapped index
structures, you name it, for 2 years.

Pat, something is definitely going wrong there -- but not by design.





> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968885#comment-13968885
 ] 

Pat Ferrel commented on MAHOUT-1464:


Hmm, maybe I should ask if anyone has gotten this stuff to read/write to HDFS? 
I can get reads to work, but not writes.

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Spark setup

2014-04-14 Thread Pat Ferrel
OK, thanks but this isn’t the question. All this is working fine with and 
without the cluster.


On Apr 14, 2014, at 11:26 AM, Saikat Kanjilal  wrote:

@Pat,
In regard to your question on JIRA, this is Dmitriy's email about running 
Mahout on Spark.

Sent from my iPhone

> On Apr 11, 2014, at 7:52 PM, "Andrew Musselman"  
> wrote:
> 
> We've used Mesos at a client to run both Hadoop and Spark jobs in the same
> setup.  It's been a good experience so far.
> 
> I haven't used YARN on any projects yet but it looks like you need to
> rebuild Spark to run on it currently:
> https://spark.apache.org/docs/0.9.0/running-on-yarn.html
> 
> Why not officially support Hadoop v2 and recommend YARN for that, as well
> as supporting Mesos?
> 
> Another question is how long we will support Hadoop v1.
> 
> 
>> On Fri, Apr 11, 2014 at 1:43 PM, Ted Dunning  wrote:
>> 
>> I am pretty sure that mesos supports both map reduce and spark.
>> 
>> In general, though, the biggest design consideration in which resource
>> manager to use is to comply with local standards and traditions.
>> 
>> For playing around, stand-alone spark is fine.
>> 
>> 
>> 
>> On Thu, Apr 10, 2014 at 4:29 PM, Dmitriy Lyubimov 
>> wrote:
>> 
 On Thu, Apr 10, 2014 at 4:20 PM, Pat Ferrel 
>>> wrote:
>>> 
 Hmm, that leaves Spark and Hadoop to manage tasks independently. Not
>>> ideal
 if you are running both hadoop and spark jobs simultaneously.
>>> 
>>> I think the only resource manager that semi-officially supports both
>>> MapReduce and spark is Yarn. This sounds neat in theory, but in practice
>> i
>>> think one discovers too many hoops to jump thru. I am also inertly
>> dubious
>>> about quality and performance of Yarn compared to others.
>>> 
>>> 
 
 If you have a single user cluster or are running jobs in a pipeline I
 suppose you don't need Mesos.
 
 
 On Apr 10, 2014, at 1:00 PM, Dmitriy Lyubimov 
>> wrote:
 
 On Thu, Apr 10, 2014 at 12:00 PM, Pat Ferrel 
 wrote:
 
> What is the recommended Spark setup?
 
 Check out their docs. We don't have any special instructions for
>> mahout.
 
 The main point behind 0.9.0 release is that it now supports master HA
>>> thru
 zookeeper, so for that reason alone you probably don't want to use
>> mesos.
 
 You may want to use mesos to have pre-allocated workers per spark
>> session
 (so called "coarse grained" mode). if you shoot a lot of short-running
 queries (1sec or less), this is a significant win in QPS and response
>>> time.
 (fine grained mode will add about 3 seconds to start all the workers
>>> lazily
 to pipeline time).
 
 In our case we are dealing with stuff that runs over 3 seconds for most
 part, so assuming 0.9.0 HA is stable enough (which i haven't tried
>> yet),
 there's no reason for us to go mesos, multi-master standalone with
 zookeeper is good enough.
 
 
> 
> I imagine most of us will have HDFS configured (with either local
>> files
 or
> an actual cluster).
 
 Hadoop DFS API  is pretty much the only persistence api supported by
>>> Mahout
 Spark Bindings at this point. So yes, you would want to have hdfs-only
 cluster running 1.x or 2 doesn't matter. i use cdh 4 distros.
 
 
> Since most of Mahout is recommended to be run on Hadoop 1.x we should
>>> use
> Mesos? https://github.com/mesos/hadoop
> 
> This would mean we'd need to have at least Hadoop 1.2.1 (in mesos and
> current mahout pom). We'd use Mesos to manage hadoop and spark jobs
>> but
> HDFS would be controlled separately by hadoop itself.
 
 I think i addressed this. no we are not bound by the MR part of mahout
 since Spark runs on whatever. like i said, with 0.9.0 + Mahout combo i
 would forego mesos -- unless it turns out meaningfully faster or more
 stable.
 
 
 
> 
> Is this about right? Is there a setup doc I missed?
 
 
 i dont think one needed.
>> 



[jira] [Comment Edited] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968864#comment-13968864
 ] 

Pat Ferrel edited comment on MAHOUT-1464 at 4/14/14 9:28 PM:
-

I think IDEA forces some things to run local so it can keep track of threads or 
something. Seems to work correctly with Spark but not HDFS. There are ways to 
remote debug with it so it separates processes but I don't need you to help me 
with IDEA.

Seems easier to answer: How do I run this from the CLI? Let's get IDEA out of 
the picture. I bet it will just work.

We need a way to run these from the CLI via cron or scripts anyway, right?

Using spark-class I get no errors but no output either. It doesn't create the 
same Application name so I must be using it wrong. Will look later today.


was (Author: pferrel):
I think IDEA forces some things to run local so it can keep track of threads or 
something. Seems to work correctly with Spark but not HDFS. There are ways to 
remote debug with it so it separates processes but I don't need you to help me 
with IDEA.

Seems easier to answer: How do I run this from the CLI? Let's get IDEA out of 
the picture. I be it will just work.

We need a way to run these from the CLI via cron or scripts anyway, right?

Using spark-class I get no errors but no output either. It doesn't create the 
same Application name so I must be using it wrong. Will look later today.

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968864#comment-13968864
 ] 

Pat Ferrel commented on MAHOUT-1464:


I think IDEA forces some things to run local so it can keep track of threads or 
something. Seems to work correctly with Spark but not HDFS. There are ways to 
remote debug with it so it separates processes but I don't need you to help me 
with IDEA.

Seems easier to answer: How do I run this from the CLI? Let's get IDEA out of 
the picture. I be it will just work.

We need a way to run these from the CLI via cron or scripts anyway, right?

Using spark-class I get no errors but no output either. It doesn't create the 
same Application name so I must be using it wrong. Will look later today.

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968849#comment-13968849
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--



IDEA is the driver, but output is written by the Spark workers. Not the same
environment, and in most cases, not the same machine -- just like it happens
for MR reducers. Unless it is a "local" master URL, which I assume it was not.


This is strange. I can, was able to, and will be able to. Why wouldn't it be
able to? Unless there are network or security issues, there's nothing
fundamentally different between reading/writing HDFS from a worker process
or any other process.



No. The Spark Client is about shipping the driver and having it run somewhere
else. It is as if somebody were running a Mahout CLI command on one of the
worker nodes; that is it. It knows nothing about HDFS -- or even what the
driver program is going to do. One might use the Client code to print out
"Hello, World" and exit on some of the worker nodes; the Client wouldn't
know or care. Using a worker to run driver programs, that's all it does.
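
To make that concrete, here is a minimal generic sketch (Spark 0.9 Java API; 
the master URL and paths are placeholders, and this is independent of the 
Mahout bindings) in which the workers both read from and write to HDFS:

import org.apache.spark.api.java.JavaSparkContext;

public class HdfsRoundTrip {
  public static void main(String[] args) {
    // The driver can run anywhere; the workers do the actual HDFS I/O.
    JavaSparkContext sc = new JavaSparkContext(
        "spark://master:7077", "HdfsRoundTrip");
    sc.textFile("hdfs://namenode/user/pat/input")
      .saveAsTextFile("hdfs://namenode/user/pat/output");
    sc.stop();
  }
}

If a job like this succeeds while the real job fails only on the write, the 
difference is likely in configuration visible to the workers rather than in 
HDFS access itself.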




> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Dmitriy Lyubimov
PS: Like I said, the "Client" feature only appeared in 0.9. Nobody missed it
before that, and it was never a prerequisite to run anything.


On Mon, Apr 14, 2014 at 2:14 PM, Dmitriy Lyubimov (JIRA) wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968849#comment-13968849]
>
> Dmitriy Lyubimov commented on MAHOUT-1464:
> --
>
>
>
> IDEA is the driver, but output is written by the Spark workers. Not the same
> environment, and in most cases, not the same machine -- just like it happens
> for MR reducers. Unless it is a "local" master URL, which I assume it was
> not.
>
>
> This is strange. I can, was able to, and will be able to. Why wouldn't it be
> able to? Unless there are network or security issues, there's nothing
> fundamentally different between reading/writing HDFS from a worker process
> or any other process.
>
>
>
> No. The Spark Client is about shipping the driver and having it run somewhere
> else. It is as if somebody were running a Mahout CLI command on one of the
> worker nodes; that is it. It knows nothing about HDFS -- or even what the
> driver program is going to do. One might use the Client code to print out
> "Hello, World" and exit on some of the worker nodes; the Client wouldn't
> know or care. Using a worker to run driver programs, that's all it does.
>
>
>
>
> > Cooccurrence Analysis on Spark
> > --
> >
> > Key: MAHOUT-1464
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> > Project: Mahout
> >  Issue Type: Improvement
> >  Components: Collaborative Filtering
> > Environment: hadoop, spark
> >Reporter: Pat Ferrel
> >Assignee: Sebastian Schelter
> > Fix For: 1.0
> >
> > Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
> run-spark-xrsj.sh
> >
> >
> > Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
> that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
> a DRM can be used as input.
> > Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
> has several applications including cross-action recommendations.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


[jira] [Comment Edited] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968820#comment-13968820
 ] 

Pat Ferrel edited comment on MAHOUT-1464 at 4/14/14 9:02 PM:
-

Getting input from hdfs://occam4.local/user/pat/xrsj, the job seems able to 
complete up to the point where it tries to write the output. When running 
inside IDEA, I am unable to connect to the cluster HDFS master to write.

I've never been able to have code write to HDFS from inside IDEA. I just run it 
from a bash script where my dev machine is configured as an HDFS client.

Shouldn't using 'spark-class org.apache.spark.deploy.Client launch' give us 
this?

BTW all the computation is indeed running on the cluster.



was (Author: pferrel):
Getting input from: hdfs://occam4.local/user/pat/xrsj  the job seems able to 
complete up to the point where it tries to write the output. Then running 
inside IDEA I am unable to connect to the cluster HDFS master to write.

I've never been able to have code write to HDFS from inside IDEA. I just run it 
from a bash script where my dev machine is configured as an HDFS client.

Shouldn't using spark_client give us this?

BTW all the computation is indeed running on the cluster.


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968820#comment-13968820
 ] 

Pat Ferrel edited comment on MAHOUT-1464 at 4/14/14 8:51 PM:
-

Getting input from: hdfs://occam4.local/user/pat/xrsj  the job seems able to 
complete up to the point where it tries to write the output. Then running 
inside IDEA I am unable to connect to the cluster HDFS master to write.

I've never been able to have code write to HDFS from inside IDEA. I just run it 
from a bash script where my dev machine is configured as an HDFS client.

Shouldn't using spark_client give us this?

BTW all the computation is indeed running on the cluster.



was (Author: pferrel):
Getting input from: hdfs://occam4.local/user/pat/xrsj  the job seems able to 
complete up to the point where it tries to write the output. Then running 
inside IDEA I am unable to connect to the cluster HDFS master to write.

I've never been able to have code write to HDFS from inside IDEA. I just run it 
from a bash script where my dev machine is configured as an HDFS client.

Shouldn't using spark_client give us this?



> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968820#comment-13968820
 ] 

Pat Ferrel commented on MAHOUT-1464:


Getting input from: hdfs://occam4.local/user/pat/xrsj  the job seems able to 
complete up to the point where it tries to write the output. Then running 
inside IDEA I am unable to connect to the cluster HDFS master to write.

I've never been able to have code write to HDFS from inside IDEA. I just run it 
from a bash script where my dev machine is configured as an HDFS client.

Shouldn't using spark_client give us this?



> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968783#comment-13968783
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--



That's odd. Honestly, I don't know and have never encountered that. Maybe it is
something the program itself does, not Spark? A stacktrace or a log with some
sort of complaint would be helpful.

I know that with Mesos supervision, SPARK_HOME must be the same on all
nodes (driver included). But I think this is only specific to the Mesos setup;
the standalone backend should be able to handle differing locations.



I think I gave an explanation for this already. Mostly, because it assumes
the jar is all it takes to run the program, but it takes the entire Mahout tree to
run a distribution. And because it still doesn't pass the master to the program.
IMO there's no real advantage to doing this vs. running a standalone
application (except perhaps when you are running from a remote, slowly
connected client and want to disconnect while the task is still running).





> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968781#comment-13968781
 ] 

Pat Ferrel commented on MAHOUT-1464:


Probably don't need that much here either, but there is 16g to 8g on all machines.

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968774#comment-13968774
 ] 

Sebastian Schelter commented on MAHOUT-1464:


Do you run this with the movielens dataset? You shouldn't need that much memory 
for that. 

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968769#comment-13968769
 ] 

Pat Ferrel commented on MAHOUT-1464:


I'm running on the localhost spark://Maclaurin.local:7077 master now and 
getting out-of-heap errors. When I ran locally I just passed -Xms8000 to the 
JVM and that was fine.

I had to hack the mahoutSparkContext code; there doesn't seem to be a way to pass 
in or modify the conf? Notice the 4g:
{code}
conf.setAppName(appName).setMaster(masterUrl)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  .set("spark.executor.memory", "4g")
{code}
This worked fine.
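For reference, a sketch of the kind of hook that would avoid the hack -- a context 
builder that takes a caller-supplied tweak to the conf. The tweak parameter is 
hypothetical; nothing like it exists in the patch yet:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical variant of mahoutSparkContext: the caller passes a function
// that can override any SparkConf setting before the context is built.
def mahoutSparkContext(masterUrl: String, appName: String,
                       tweak: SparkConf => SparkConf = identity): SparkContext = {
  val conf = tweak(new SparkConf()
    .setAppName(appName)
    .setMaster(masterUrl)
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
         "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator"))
  new SparkContext(conf)
}

// The memory override then stays in user code:
// val sc = mahoutSparkContext(master, "xrsj", _.set("spark.executor.memory", "4g"))
{code}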

My dev machine is not part of the cluster and cannot participate because the 
path to scripts like start-slave.sh is different on the cluster and the dev machine 
(Mac vs Linux). If I try to launch on the dev machine but point to a cluster 
managed by another machine, it eventually tries to look in IDEA's 
WORKING_DIRECTORY/_temporary for something that is not there -- maybe on the 
Spark master?

I need a way to launch this outside IDEA on a cluster machine; why shouldn't 
the spark_client method work?

Anyway, I'll keep trying to work this out; so far local and 'pseudo-cluster' 
work.

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968661#comment-13968661
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

PS In modes other than "local" it will be looking for MAHOUT_HOME or 
-Dmahout.home= ... to point to the latest Mahout directory. This should have the 
latest binaries, including RSJ, to run in the backend; that's what it ships to the 
Spark app. (You don't need to recompile Mahout if all you changed was just the 
context hack.)
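A tiny sketch of that lookup (the precedence shown is a guess; only the two names 
come from the text above):
{code}
// Resolve the Mahout directory: the -Dmahout.home system property or the
// MAHOUT_HOME environment variable, as described above.
val mahoutHome: Option[String] =
  sys.props.get("mahout.home").orElse(sys.env.get("MAHOUT_HOME"))

require(mahoutHome.isDefined,
  "Set MAHOUT_HOME or pass -Dmahout.home= so backend binaries can be shipped")
{code}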

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Spark setup

2014-04-14 Thread Dmitriy Lyubimov
thanks, yes that's it


On Mon, Apr 14, 2014 at 11:26 AM, Saikat Kanjilal wrote:

> @Pat,
> In regards to your question on JIRA, this is Dmitry's email about running
> mahout on spark.
>
> Sent from my iPhone
>
> > On Apr 11, 2014, at 7:52 PM, "Andrew Musselman" <
> andrew.mussel...@gmail.com> wrote:
> >
> > We've used Mesos at a client to run both Hadoop and Spark jobs in the
> same
> > setup.  It's been a good experience so far.
> >
> > I haven't used YARN on any projects yet but it looks like you need to
> > rebuild Spark to run on it currently:
> > https://spark.apache.org/docs/0.9.0/running-on-yarn.html
> >
> > Why not officially support Hadoop v2 and recommend YARN for that, as well
> > as supporting Mesos?
> >
> > Another question is how long we will support Hadoop v1.
> >
> >
> >> On Fri, Apr 11, 2014 at 1:43 PM, Ted Dunning 
> wrote:
> >>
> >> I am pretty sure that mesos supports both map reduce and spark.
> >>
> >> In general, though, the biggest design consideration in which resource
> >> manager to use is to comply with local standards and traditions.
> >>
> >> For playing around, stand-alone spark is fine.
> >>
> >>
> >>
> >> On Thu, Apr 10, 2014 at 4:29 PM, Dmitriy Lyubimov 
> >> wrote:
> >>
>  On Thu, Apr 10, 2014 at 4:20 PM, Pat Ferrel 
> >>> wrote:
> >>>
>  Hmm, that leaves Spark and Hadoop to manage tasks independently. Not
> >>> ideal
>  if you are running both hadoop and spark jobs simultaneously.
> >>>
> >>> I think the only resource manager that semi-officially supports both
> >>> MapReduce and spark is Yarn. This sounds neat in theory, but in
> practice
> >> i
> >>> think one discovers too many hoops to jump thru. I am also inertly
> >> dubious
> >>> about quality and performance of Yarn compared to others.
> >>>
> >>>
> 
>  If you have a single user cluster or are running jobs in a pipeline I
>  suppose you don't need Mesos.
> 
> 
>  On Apr 10, 2014, at 1:00 PM, Dmitriy Lyubimov 
> >> wrote:
> 
>  On Thu, Apr 10, 2014 at 12:00 PM, Pat Ferrel 
>  wrote:
> 
> > What is the recommended Spark setup?
> 
>  Check out their docs. We don't have any special instructions for
> >> mahout.
> 
>  The main point behind 0.9.0 release is that it now supports master HA
> >>> thru
>  zookeeper, so for that reason alone you probably don't want to use
> >> mesos.
> 
>  You may want to use mesos to have pre-allocated workers per spark
> >> session
>  (so called "coarse grained" mode). if you shoot a lot of short-running
>  queries (1sec or less), this is a significant win in QPS and response
> >>> time.
>  (fine grained mode will add about 3 seconds to start all the workers
> >>> lazily
>  to pipeline time).
> 
>  In our case we are dealing with stuff that runs over 3 seconds for
> most
>  part, so assuming 0.9.0 HA is stable enough (which i haven't tried
> >> yet),
>  there's no reason for us to go mesos, multi-master standalone with
>  zookeeper is good enough.
> 
> 
> >
> > I imagine most of us will have HDFS configured (with either local
> >> files
>  or
> > an actual cluster).
> 
>  Hadoop DFS API  is pretty much the only persistence api supported by
> >>> Mahout
>  Spark Bindings at this point. So yes, you would want to have hdfs-only
>  cluster running 1.x or 2 doesn't matter. i use cdh 4 distros.
> 
> 
> > Since most of Mahout is recommended to be run on Hadoop 1.x we should
> >>> use
> > Mesos? https://github.com/mesos/hadoop
> >
> > This would mean we'd need to have at least Hadoop 1.2.1 (in mesos and
> > current mahout pom). We'd use Mesos to manage hadoop and spark jobs
> >> but
> > HDFS would be controlled separately by hadoop itself.
> 
>  I think i addressed this. no we are not bound by the MR part of mahout
>  since Spark runs on whatever. like i said, with 0.9.0 + Mahout combo i
>  would forego mesos -- unless it turns out meaningfully faster or more
>  stable.
> 
> 
> 
> >
> > Is this about right? Is there a setup doc I missed?
> 
> 
>  i dont think one needed.
> >>
>


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968655#comment-13968655
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

Yes, you should be able both to hack the context and to launch the driver 
successfully from IDEA, regardless of whether you are running the "local", 
"standalone/HA standalone" or "mesos/HA mesos" resource managers -- as long as 
the resource managers are up and running on your cluster.
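For reference, the master-url families meant here, as a sketch (hosts, ports and 
the ZooKeeper path are illustrative, not taken from this setup):
{code}
val local        = "local"                             // same JVM, single thread
val standalone   = "spark://occam4.local:7077"         // standalone master
val standaloneHA = "spark://master1:7077,master2:7077" // HA standalone, multiple masters
val mesos        = "mesos://occam4.local:5050"         // plain mesos
val mesosHA      = "mesos://zk://occam4.local:2181/mesos" // HA mesos via ZooKeeper
{code}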

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968649#comment-13968649
 ] 

Pat Ferrel commented on MAHOUT-1464:


OK, it runs fine in IDEA; now I need some pointers on how to launch on the cluster.

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968649#comment-13968649
 ] 

Pat Ferrel edited comment on MAHOUT-1464 at 4/14/14 6:40 PM:
-

OK, it runs fine in IDEA; now I need some pointers on how to launch on the cluster.

Should I be able to do that from IDEA as well, by changing the context? 


was (Author: pferrel):
OK, it runs fine in IDEA; now I need some pointers on how to launch on the cluster.

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Dmitriy Lyubimov
inline


On Mon, Apr 14, 2014 at 11:21 AM, Pat Ferrel (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968613#comment-13968613]
>
> Pat Ferrel commented on MAHOUT-1464:
> 
>
> @Dmitriy, no clue what email you are talking about, you have written a lot
> lately. Where is it, on a Jira?
>
no, on @dev... Basically you want to run it as a standalone application
(just like the SparkPi example). The easiest way to do that is to import the whole
mahout tree into idea and launch Sebastian's driver program directly; that
much should work -- especially since you only care about local mode in fact
(just to be clear, "local" master means same JVM, single thread -- really
useful for debugging only).

>
> I did my setup and tried launching with Hadoop and Mahout running locally
> (MAHOUT_LOCAL=true),
>
this environment variable would have no bearing on a Spark program. The only
thing that matters is the master url, per above.


> One localhost instance of Spark, passing in the 'mvn package' mahout spark
> jar from the localfs and pointing at data on the localfs.  This is per
> instructions of the Spark site. There is no firewall issue since it is
> always localhost talking to localhost.
>

You need to be a bit more specific here.

Yes, you can run Spark as a single-node cluster (just like a Hadoop single-node
cluster), but that would still be a "standalone" master, not "local".
"local" is, as I indicated, totally same-JVM, single-thread; it does not
require starting any additional Spark processes.

As long as you want "standalone" (i.e. the real thing, albeit single-node) you
need not use Client -- it won't work. Launch the program directly, just like they
do with examples such as SparkPi. This Client thing will not work for our
Mahout programs without additional considerations.
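
To make "directly" concrete, a skeleton in the SparkPi style (the object name
and the sparkbindings import are my assumptions):

    // assumes: import org.apache.mahout.sparkbindings._ for mahoutSparkContext
    object RunCooccurrenceStandalone {
      def main(args: Array[String]) {
        // SparkPi convention: the master url arrives as the first argument
        val master = if (args.nonEmpty) args(0) else "local"
        implicit val sc = mahoutSparkContext(masterUrl = master,
          appName = "CooccurrenceAnalysis",
          customJars = Traversable.empty[String])
        // ... build the DRMs and run the cooccurrence analysis here ...
      }
    }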


>
> Anyway if I could find your "running mahout on spark" email it would
> probably explain what I'm doing wrong.
>
> You did see I was using Spark 0.9.1?
>
In all likelihood this should be fine if you also change the dependency and
recompile with it in the root pom.xml. Otherwise there's no way of reliably
telling whether different versions on the client and backend may trigger
incompatibilities, other than trying (e.g. if they changed the akka or netty
version between 0.9.0 and 0.9.1).



>
> > Cooccurrence Analysis on Spark
> > --
> >
> > Key: MAHOUT-1464
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> > Project: Mahout
> >  Issue Type: Improvement
> >  Components: Collaborative Filtering
> > Environment: hadoop, spark
> >Reporter: Pat Ferrel
> >Assignee: Sebastian Schelter
> > Fix For: 1.0
> >
> > Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
> run-spark-xrsj.sh
> >
> >
> > Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
> that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
> a DRM can be used as input.
> > Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
> has several applications including cross-action recommendations.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968624#comment-13968624
 ] 

Pat Ferrel commented on MAHOUT-1464:


ok, no spark_client launch, got it.

A pointer to the email would help.

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Spark setup

2014-04-14 Thread Saikat Kanjilal
@Pat,
In regards to your question on JIRA, this is Dmitry's email about running 
mahout on spark.

Sent from my iPhone

> On Apr 11, 2014, at 7:52 PM, "Andrew Musselman"  
> wrote:
> 
> We've used Mesos at a client to run both Hadoop and Spark jobs in the same
> setup.  It's been a good experience so far.
> 
> I haven't used YARN on any projects yet but it looks like you need to
> rebuild Spark to run on it currently:
> https://spark.apache.org/docs/0.9.0/running-on-yarn.html
> 
> Why not officially support Hadoop v2 and recommend YARN for that, as well
> as supporting Mesos?
> 
> Another question is how long we will support Hadoop v1.
> 
> 
>> On Fri, Apr 11, 2014 at 1:43 PM, Ted Dunning  wrote:
>> 
>> I am pretty sure that mesos supports both map reduce and spark.
>> 
>> In general, though, the biggest design consideration in which resource
>> manager to use is to comply with local standards and traditions.
>> 
>> For playing around, stand-alone spark is fine.
>> 
>> 
>> 
>> On Thu, Apr 10, 2014 at 4:29 PM, Dmitriy Lyubimov 
>> wrote:
>> 
 On Thu, Apr 10, 2014 at 4:20 PM, Pat Ferrel 
>>> wrote:
>>> 
 Hmm, that leaves Spark and Hadoop to manage tasks independently. Not
>>> ideal
 if you are running both hadoop and spark jobs simultaneously.
>>> 
>>> I think the only resource manager that semi-officially supports both
>>> MapReduce and spark is Yarn. This sounds neat in theory, but in practice
>> i
>>> think one discovers too many hoops to jump thru. I am also inertly
>> dubious
>>> about quality and performance of Yarn compared to others.
>>> 
>>> 
 
 If you have a single user cluster or are running jobs in a pipeline I
 suppose you don't need Mesos.
 
 
 On Apr 10, 2014, at 1:00 PM, Dmitriy Lyubimov 
>> wrote:
 
 On Thu, Apr 10, 2014 at 12:00 PM, Pat Ferrel 
 wrote:
 
> What is the recommended Spark setup?
 
 Check out their docs. We don't have any special instructions for
>> mahout.
 
 The main point behind 0.9.0 release is that it now supports master HA
>>> thru
 zookeeper, so for that reason alone you probably don't want to use
>> mesos.
 
 You may want to use mesos to have pre-allocated workers per spark
>> session
 (so called "coarse grained" mode). if you shoot a lot of short-running
 queries (1sec or less), this is a significant win in QPS and response
>>> time.
 (fine grained mode will add about 3 seconds to start all the workers
>>> lazily
 to pipeline time).
 
 In our case we are dealing with stuff that runs over 3 seconds for most
 part, so assuming 0.9.0 HA is stable enough (which i haven't tried
>> yet),
 there's no reason for us to go mesos, multi-master standalone with
 zookeeper is good enough.
 
 
> 
> I imagine most of us will have HDFS configured (with either local
>> files
 or
> an actual cluster).
 
 Hadoop DFS API  is pretty much the only persistence api supported by
>>> Mahout
 Spark Bindings at this point. So yes, you would want to have hdfs-only
 cluster running 1.x or 2 doesn't matter. i use cdh 4 distros.
 
 
> Since most of Mahout is recommended to be run on Hadoop 1.x we should
>>> use
> Mesos? https://github.com/mesos/hadoop
> 
> This would mean we'd need to have at least Hadoop 1.2.1 (in mesos and
> current mahout pom). We'd use Mesos to manage hadoop and spark jobs
>> but
> HDFS would be controlled separately by hadoop itself.
 
 I think i addressed this. no we are not bound by the MR part of mahout
 since Spark runs on whatever. like i said, with 0.9.0 + Mahout combo i
 would forego mesos -- unless it turns out meaningfully faster or more
 stable.
 
 
 
> 
> Is this about right? Is there a setup doc I missed?
 
 
 i dont think one needed.
>> 


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968613#comment-13968613
 ] 

Pat Ferrel commented on MAHOUT-1464:


@Dmitriy, no clue what email you are talking about; you have written a lot 
lately. Where is it, on a Jira?

I did my setup and tried launching with Hadoop and Mahout running locally 
(MAHOUT_LOCAL=true): one localhost instance of Spark, passing in the 'mvn 
package' mahout spark jar from the localfs and pointing at data on the localfs. 
This is per the instructions on the Spark site. There is no firewall issue since 
it is always localhost talking to localhost.

Anyway, if I could find your "running mahout on spark" email it would probably 
explain what I'm doing wrong.

You did see I was using Spark 0.9.1?

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Build failed in Jenkins: Mahout-Examples-Cluster-Reuters-II #814

2014-04-14 Thread Apache Jenkins Server
See 


Changes:

[akm] MAHOUT-1483: Organize links in web site navigation bar

--
[...truncated 2168 lines...]
[INFO] Writing to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Mahout-Examples-Cluster-Reuters-II/trunk/math/target/generated-sources/mahout/org/apache/mahout/math/map/AbstractLongObjectMap.java
[... 22 more near-identical "[INFO] Writing to .../math/target/generated-sources/mahout/org/apache/mahout/math/map/Abstract*Map.java" lines; log truncated mid-line ...]

[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968603#comment-13968603
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

[~pferrel] if you look inside Sebastian's patch, you will find it is 
hardcoded to use the "local" Spark master. The master you specify to Client only 
tells it which cluster to ship the code to, not which master the application will 
use, which is why I think this Client thing is a bit of a raw idea. Either 
way, it will not work with Sebastian's app. Instead, I'd suggest you run 
Sebastian's script directly from IDEA as a first step, after hacking the master url 
in this line
{code}
implicit val sc = mahoutSparkContext(masterUrl = "local", appName = "MahoutLocalContext",
  customJars = Traversable.empty[String])
{code}
or making the script accept it from the environment or app params. 

(The convention in Spark example programs is that they accept the master url as the 
first parameter.)
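A minimal sketch of that second option (the MASTER env-var fallback is my 
assumption, not something in the patch):
{code}
// Take the master url from the first program argument, then from the
// MASTER environment variable, and only fall back to "local" for debugging.
val masterUrl = args.headOption
  .orElse(sys.env.get("MASTER"))
  .getOrElse("local")

implicit val sc = mahoutSparkContext(masterUrl = masterUrl,
  appName = "MahoutLocalContext",
  customJars = Traversable.empty[String])
{code}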

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968589#comment-13968589
 ] 

Pat Ferrel commented on MAHOUT-1464:


Not sure what you are saying -- local as in local filesystem and only a localhost 
Spark instance? How are you launching RunCrossCooccurrenceAnalysisOnEpinions?

I should get local working first; I see the context setup in the code and 
will worry about that after local works.

Is there something wrong with the script?

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968587#comment-13968587
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

Running using the Spark Client (inside the cluster) is a new thing in 0.9. Even 
assuming it is stable, it is not supported at this point, and going this way will 
hit multiple hurdles. 

For one, the mahout spark context requires MAHOUT_HOME to set up all mahout binaries 
properly. The assumption is that one needs Mahout's binaries only on the driver's 
side, but if the driver runs inside a remote cluster, this will fail. So our batches 
should really be started in one of the ways I described in an earlier email. 

Second, I don't think the driver can load classes reliably, because it includes 
Mahout dependencies such as mahout-math. That's another reason why using Client 
seems problematic to me -- it assumes one has his _entire_ application within 
that jar, which is not true here.

That said, your attempt doesn't exhibit any direct ClassNotFounds and looks 
more like akka communication issues, i.e. Spark setup issues. One thing about 
Spark is that it requires direct port connectivity not only between cluster nodes 
but also back to the client. In particular this means your client must not firewall 
incoming calls and must not be behind NAT (even port forwarding doesn't really 
solve the networking issues here). So my first bet would be on akka connectivity 
issues between the cluster and back to the client.
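If it is the callback path, one concrete thing to check is which address and port 
the driver announces to the workers; a sketch of pinning them (spark.driver.host 
and spark.driver.port exist in Spark 0.9; the values here are placeholders):
{code}
// Make the driver listen on an address and port the cluster can reach,
// so akka callbacks from the workers don't hit a dead interface.
val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.host", "192.168.0.2") // address reachable from the workers
  .set("spark.driver.port", "51000")       // open this port toward the cluster
{code}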




> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968573#comment-13968573
 ] 

Pat Ferrel commented on MAHOUT-1464:


I am running it locally, if by that you mean standalone on localhost, and 
actually running RunCrossCooccurrenceAnalysisOnEpinions.

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968564#comment-13968564
 ] 

Sebastian Schelter commented on MAHOUT-1464:


Currently, the RunCooccurrenceAnalysisOnMovielens1M script only sets up a local 
spark context and reads and writes from the local fs. Sorry for not mentioning 
this upfront. Do you want to try to change it yourself or should I update the 
patch?

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968537#comment-13968537
 ] 

Pat Ferrel edited comment on MAHOUT-1464 at 4/14/14 5:18 PM:
-

OK, I have a cluster set up but first tried locally on my laptop. I installed 
the latest Spark, 0.9.1 (not the 0.9.0 called for in the pom; assuming this is OK), 
which uses Scala 2.10. BTW, the object RunCrossCooccurrenceAnalysisOnEpinions 
has an incorrect usage println -- wrong object name. I never get 
the printlns; I assume that's because I'm not launching from the Spark shell??? 

  println("Usage: RunCooccurrenceAnalysisOnMovielens1M 
")

This leads me to believe that you launch from the Spark Scala shell?? Anyway, I 
tried the method called out in the Spark docs for CLI execution, shown below, and 
executed RunCrossCooccurrenceAnalysisOnEpinions via a bash script. Not sure 
where to look for output. The code says:

RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(0),
"/tmp/co-occurrence-on-epinions/indicators-item-item/")
RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(1),
"/tmp/co-occurrence-on-epinions/indicators-trust-item/")

Assume this is in localfs since the data came from there? I see the Spark pids 
there but no temp data.

Here's how I ran it.

Put data in localfs:
Maclaurin:mahout pat$ ls -al ~/hdfs-mirror/xrsj/
total 29320
drwxr-xr-x   4 pat  staff  136 Apr 14 09:01 .
drwxr-xr-x  10 pat  staff  340 Apr 14 09:00 ..
-rw-r--r--   1 pat  staff  8650128 Apr 14 09:01 ratings_data.txt
-rw-r--r--   1 pat  staff  6357397 Apr 14 09:01 trust_data.txt

Start up Spark on localhost; the webUI says all is well.

Run the xrsj on local data via the shell script attached.

The driver runs and creates a worker, which runs for quite a while, but the log 
says there was an ERROR.

Maclaurin:mahout pat$ cat 
/Users/pat/spark-0.9.1-bin-hadoop1/sbin/../logs/spark-pat-org.apache.spark.deploy.worker.Worker-1-
spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out
spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out.2
spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out.1  
spark-pat-org.apache.spark.deploy.worker.Worker-1-occam4.out
Maclaurin:mahout pat$ cat 
/Users/pat/spark-0.9.1-bin-hadoop1/sbin/../logs/spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out
Spark Command: 
/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java -cp 
:/Users/pat/spark-0.9.1-bin-hadoop1/conf:/Users/pat/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar
 -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://Maclaurin.local:7077


log4j:WARN No appenders could be found for logger 
(akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
14/04/14 09:26:00 INFO Worker: Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
14/04/14 09:26:00 INFO Worker: Starting Spark worker 192.168.0.2:52068 with 8 
cores, 15.0 GB RAM
14/04/14 09:26:00 INFO Worker: Spark home: /Users/pat/spark-0.9.1-bin-hadoop1
14/04/14 09:26:00 INFO WorkerWebUI: Started Worker web UI at 
http://192.168.0.2:8081
14/04/14 09:26:00 INFO Worker: Connecting to master 
spark://Maclaurin.local:7077...
14/04/14 09:26:00 INFO Worker: Successfully registered with master 
spark://Maclaurin.local:7077
14/04/14 09:26:19 INFO Worker: Asked to launch driver driver-20140414092619-
2014-04-14 09:26:19.947 java[53509:9407] Unable to load realm info from 
SCDynamicStore
14/04/14 09:26:20 INFO DriverRunner: Copying user jar 
file:/Users/pat/mahout/spark/target/mahout-spark-1.0-SNAPSHOT.jar to 
/Users/pat/spark-0.9.1-bin-hadoop1/work/driver-20140414092619-/mahout-spark-1.0-SNAPSHOT.jar
14/04/14 09:26:20 INFO DriverRunner: Launch Command: 
"/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java" 
"-cp" 
":/Users/pat/spark-0.9.1-bin-hadoop1/work/driver-20140414092619-/mahout-spark-1.0-SNAPSHOT.jar:/Users/pat/spark-0.9.1-bin-hadoop1/conf:/Users/pat/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar:/usr/local/hadoop/conf"
 "-Xms512M" "-Xmx512M" "org.apache.spark.deploy.worker.DriverWrapper" 
"akka.tcp://sparkWorker@192.168.0.2:52068/user/Worker" 
"RunCrossCooccurrenceAnalysisOnEpinions" "file://Users/pat/hdfs-mirror/xrsj"
14/04/14 09:26:21 ERROR OneForOneStrategy: FAILED (of class 
scala.Enumeration$Val)
scala.MatchError: FAILED (of class scala.Enumeration$Val)
at 
org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277)
at akka.actor.ActorCell.r

[jira] [Updated] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1464:
---

Attachment: run-spark-xrsj.sh

Script used to execute the cross-similarity code on localhost Spark and the local 
filesystem.

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968539#comment-13968539
 ] 

Pat Ferrel commented on MAHOUT-1464:


Wow, that really screwed up the shell script, so I've attached it. 

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968537#comment-13968537
 ] 

Pat Ferrel commented on MAHOUT-1464:


OK, I have a cluster set up but first tried locally on my laptop. I installed 
the latest Spark, 0.9.1 (not the 0.9.0 called for in the pom; assuming this is OK), 
which uses Scala 2.10. BTW, the object RunCrossCooccurrenceAnalysisOnEpinions 
has an incorrect usage println -- wrong object name. I never get 
the printlns; I assume that's because I'm not launching from the Spark shell??? 

  println("Usage: RunCooccurrenceAnalysisOnMovielens1M 
")

This leads me to believe that you launch from the Spark Scala shell?? Anyway, I 
tried the method called out in the Spark docs for CLI execution, shown below, and 
executed RunCrossCooccurrenceAnalysisOnEpinions via a bash script. Not sure 
where to look for output. The code says:

RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(0),
"/tmp/co-occurrence-on-epinions/indicators-item-item/")
RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(1),
"/tmp/co-occurrence-on-epinions/indicators-trust-item/")

Assume this is in localfs since the data came from there? I see the Spark pids 
there but no temp data.

Here's how I ran it.

Put data in localfs:
Maclaurin:mahout pat$ ls -al ~/hdfs-mirror/xrsj/
total 29320
drwxr-xr-x   4 pat  staff  136 Apr 14 09:01 .
drwxr-xr-x  10 pat  staff  340 Apr 14 09:00 ..
-rw-r--r--   1 pat  staff  8650128 Apr 14 09:01 ratings_data.txt
-rw-r--r--   1 pat  staff  6357397 Apr 14 09:01 trust_data.txt

Start up Spark on localhost; the webUI says all is well.

Run the xrsj on local data via the shell script:

#!/usr/bin/env bash
#./bin/spark-class org.apache.spark.deploy.Client launch \
#   [client-options] \
#   <cluster-url> <application-jar-url> <main-class> \
#   [application-options]

# cluster-url: The URL of the master node.
# application-jar-url: Path to a bundled jar including your application and all
#   dependencies. Currently, the URL must be globally visible inside of your
#   cluster, for instance, an `hdfs://` path or...
# main-class: The entry point for your application.

# Client Options:
#  --memory  (amount of memory, in MB, allocated for your driver program)
#  --cores  (number of cores allocated for your driver program)
#  --supervise (whether to automatically restart your driver on application or
#    node failure)
#  --verbose (prints increased logging output)

# RunCrossCooccurrenceAnalysisOnEpinions 
# Mahout Spark Jar from 'mvn package'
/Users/pat/spark-0.9.1-bin-hadoop1/bin/spark-class 
org.apache.spark.deploy.Client launch \
   spark://Maclaurin.local:7077 
file:///Users/pat/mahout/spark/target/mahout-spark-1.0-SNAPSHOT.jar 
RunCrossCooccurrenceAnalysisOnEpinions \
   file://Users/pat/hdfs-mirror/xrsj

The driver runs and creates a worker, which runs for quite a while, but the log 
says there was an ERROR.

Maclaurin:mahout pat$ cat 
/Users/pat/spark-0.9.1-bin-hadoop1/sbin/../logs/spark-pat-org.apache.spark.deploy.worker.Worker-1-
spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out
spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out.2
spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out.1  
spark-pat-org.apache.spark.deploy.worker.Worker-1-occam4.out
Maclaurin:mahout pat$ cat 
/Users/pat/spark-0.9.1-bin-hadoop1/sbin/../logs/spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out
Spark Command: 
/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java -cp 
:/Users/pat/spark-0.9.1-bin-hadoop1/conf:/Users/pat/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar
 -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://Maclaurin.local:7077


log4j:WARN No appenders could be found for logger 
(akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
14/04/14 09:26:00 INFO Worker: Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
14/04/14 09:26:00 INFO Worker: Starting Spark worker 192.168.0.2:52068 with 8 
cores, 15.0 GB RAM
14/04/14 09:26:00 INFO Worker: Spark home: /Users/pat/spark-0.9.1-bin-hadoop1
14/04/14 09:26:00 INFO WorkerWebUI: Started Worker web UI at 
http://192.168.0.2:8081
14/04/14 09:26:00 INFO Worker: Connecting to master 
spark://Maclaurin.local:7077...
14/04/14 09:26:00 INFO Worker: Successfully registered with master 
spark://Maclaurin.local:7077
14/04/14 09:26:19 INFO Worker: Asked to launch driver driver-20140414092619-
2014-04-14 09:26:19.947 java[53509:9407] Unable to load realm info from 
SCDynamicStore
14/04/14 09:26:20 INFO DriverRunner: Copying user jar 
file:/Users/pat/mahout/

[jira] [Commented] (MAHOUT-1421) Adapter package for all mahout tools

2014-04-14 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968523#comment-13968523
 ] 

jay vyas commented on MAHOUT-1421:
--

Great idea.  I will sketch out some more lightweight JIRAs once the docs 
improve.

> Adapter package for all mahout tools
> 
>
> Key: MAHOUT-1421
> URL: https://issues.apache.org/jira/browse/MAHOUT-1421
> Project: Mahout
>  Issue Type: Improvement
>Reporter: jay vyas
> Fix For: 1.0
>
>
> Hi mahout.  I'd like to create an umbrella JIRA for allowing more runtime 
> flexibility in reading different types of input formats for all mahout 
> tasks. 
> Specifically, I'd like to start with the FreeTextRecommenderAdapter, which 
> typically requires:
> 1) Hashing text entries into numbers
> 2) Saving the large transformed file on disk
> 3) Feeding it into the classifier 
> Instead, we could build adapters into the classifier itself, so that the user
> 1) Specifies the input file to the recommender
> 2) Specifies a transformation class which converts each record of input to the 
> 3-column recommender format
> 3) Runs the internal mahout recommender directly against the data
> And thus the user could easily run mahout against existing data without 
> having to munge it too much.
> This package might be called something like "org.apache.mahout.adapters", and 
> would over time provide flexible adapters to the core mahout algorithm 
> implementations, so that folks wouldn't have to worry so much about 
> vectors/csv transformers/etc... 
> Any thoughts on this?  If there's positive feedback I can submit an initial 
> patch to get things started.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1445) Create an intro for item based recommender

2014-04-14 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968423#comment-13968423
 ] 

Sebastian Schelter commented on MAHOUT-1445:


[~pknarayan] I don't think we should talk about user similarity in an intro to 
item-based recommenders.

[~nimartin] I really like your text. Could you add a small example that shows 
how to use an item based recommender in Mahout? You could extract something 
from this page: 
https://mahout.apache.org/users/recommender/recommender-documentation.html

> Create an intro for item based recommender
> --
>
> Key: MAHOUT-1445
> URL: https://issues.apache.org/jira/browse/MAHOUT-1445
> Project: Mahout
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 1.0
>Reporter: Maciej Mazur
>  Labels: documentation, recommender
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


RE: Documentation, Documentation, Documentation

2014-04-14 Thread Martin, Nick
Drafted a little intro to the item-based rec and dropped it in the comments for 
1445. I aimed to include some examples of the variety of things one can do with 
the algo, and hopefully enough info that someone hitting the page could get a 
feel for what they can potentially accomplish before diving directly into the 
'guts' of the workflow/config options, etc. 

Happy to take edits; I saw there was another submission a bit ahead of mine this 
morning, so I'm not sure how that gets resolved. 

Anyway, maybe this can get us closer on cleanup!

-Original Message-
From: Sebastian Schelter [mailto:s...@apache.org] 
Sent: Sunday, April 13, 2014 7:49 AM
To: u...@mahout.apache.org; dev@mahout.apache.org
Subject: Documentation, Documentation, Documentation

Hi,

this is another reminder that we still have to finish our documentation 
improvements! The website looks shiny now and there have been lots of 
discussions about new directions, but we still have some work to do in cleaning 
up webpages. We should especially make sure that the examples work.

Please help with that: anyone who is willing to sacrifice some time to go through 
a webpage and try out the steps described is of great help to the project. It 
would also be awesome to get some help in creating a few new pages, especially 
for the recommenders.

Here's the list of documentation related jira's for 1.0:

https://issues.apache.org/jira/browse/MAHOUT-1441?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Documentation%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

Best,
Sebastian


[jira] [Commented] (MAHOUT-1445) Create an intro for item based recommender

2014-04-14 Thread Nick Martin (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968365#comment-13968365
 ] 

Nick Martin commented on MAHOUT-1445:
-

Worked on this a bit over the weekend. Feel free to use some/all/none of it if 
folks find it useful as an intro. I imagine the rest of the item-based rec 
workflow would be described in greater detail below this intro piece, but 
hopefully something along these lines helps potential users get a feel for 
what's possible "above the fold" before diving into data models and similarity 
metrics, etc. 

***Proposed text below***

Item Based Recommender
Introduction

Mahout's item-based recommender is a flexible and easily implemented algorithm 
with a diverse range of applications. The minimalism of the primary input 
file's structure and the availability of ancillary filtering controls make 
sourcing the required data and shaping a desired output both efficient and 
straightforward. 

Typical use cases include:
•   Recommend products to customers via an eCommerce platform (think: 
Amazon, Netflix, Overstock)
•   Identify organic sales opportunities
•   Segment users/customers based on similar item preferences

Broadly speaking, Mahout's item-based recommendation algorithm takes customer 
preferences by item as input and generates an output recommending similar 
items, with a score indicating the likelihood that a customer will "like" the 
recommended item. 

One of the strengths of the item-based recommender is its adaptability to your 
business conditions or research interests. For example, there are many 
available approaches for providing product preference. One such method is to 
calculate the total orders for a given product for each customer (e.g., Acme 
Corp has ordered Widget-A 5,678 times), while others rely on user preference 
captured via the web (e.g., Jane Doe rated a movie five stars, or gave a 
product two thumbs up). 

Additionally, a variety of methodologies can be implemented to narrow the 
focus of Mahout's recommendations, such as:
•   Exclude low-volume or low-profitability products from consideration
•   Group customers by segment or market rather than using user/customer 
level data
•   Exclude zero-dollar transactions, returns or other order types
•   Map product substitutions into the Mahout input (e.g., if WidgetA is a 
recommended item, replace it with WidgetX)

The item-based recommender output can be easily consumed by downstream 
applications (e.g., websites, ERP systems or salesforce automation tools) and 
is configurable, so users can determine the number of item recommendations 
generated by the algorithm.
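
For the small example requested above, a minimal sketch against the Taste API 
might look like the following (the file name prefs.csv, the user ID, and the 
choice of LogLikelihoodSimilarity are illustrative assumptions, not settled 
documentation):

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemBasedExample {
  public static void main(String[] args) throws Exception {
    // prefs.csv holds one "userID,itemID,preference" triple per line
    DataModel model = new FileDataModel(new File("prefs.csv"));
    // log-likelihood similarity also works for implicit data such as order counts
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // top 5 recommendations for user 1, each with an estimated preference score
    for (RecommendedItem item : recommender.recommend(1L, 5)) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}

Swapping in another ItemSimilarity (e.g. PearsonCorrelationSimilarity) is the 
usual way to adapt the recommender to explicit ratings.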


> Create an intro for item based recommender
> --
>
> Key: MAHOUT-1445
> URL: https://issues.apache.org/jira/browse/MAHOUT-1445
> Project: Mahout
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 1.0
>Reporter: Maciej Mazur
>  Labels: documentation, recommender
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968259#comment-13968259
 ] 

Suneel Marthi commented on MAHOUT-1178:
---

I wasn't considering this to be a fix for incremental document management; 
fine with leaving this as is for now.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.
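
For illustration only, a rough sketch of constraints (a) and (d) against the 
Lucene 4 APIs of that era (the field name "text", the index path argument and 
the class name are assumptions, and the field is assumed to have been indexed 
with term vectors enabled; deleted documents and term weighting are ignored):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class LuceneIndexAsRows {
  public static void main(String[] args) throws Exception {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));

    // term -> column dictionary over the whole field, so that every document
    // row lives in the same coordinate system
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    TermsEnum allTerms = MultiFields.getTerms(reader, "text").iterator(null);
    BytesRef term;
    while ((term = allTerms.next()) != null) {
      dictionary.put(term.utf8ToString(), dictionary.size());
    }

    // one sparse row per document; NamedVector carries the doc id so a row
    // can be traced back to its document (constraint d)
    for (int docId = 0; docId < reader.maxDoc(); docId++) {
      Terms termVector = reader.getTermVector(docId, "text");
      if (termVector == null) {
        continue; // no stored term vector for this field
      }
      Vector row = new RandomAccessSparseVector(dictionary.size());
      TermsEnum it = termVector.iterator(null);
      while ((term = it.next()) != null) {
        row.set(dictionary.get(term.utf8ToString()), it.totalTermFreq());
      }
      Vector named = new NamedVector(row, String.valueOf(docId));
      // hand 'named' to downstream Mahout code here
    }
    reader.close();
  }
}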



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968254#comment-13968254
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

The thing is, it just 'loads' a Lucene index into memory as a matrix: you 
construct a matrix with the Lucene index directory location, and that's it. So 
it is not a fix for the incremental document management issue.

The alternative approach is querying the index whenever a row/column vector or 
cell is required. I, however, am not sure whether the SolrMatrix approach is 
fast enough for that.

I haven't been available lately, and now I'm reading through the changes in 
and proposals for Mahout's future, and trying to set up my perspective for 
Mahout 2. We can probably come up with a better way of document storage (still 
Lucene/Solr based). Let me leave this as is for now, and then we can discuss 
the input formats further.

Is that OK with you?

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968240#comment-13968240
 ] 

Suneel Marthi commented on MAHOUT-1178:
---

Sorry for responding late (just waking up in my part of the world). I still 
see value in having this, both for lucene2seq and if we consider moving 
entirely to Lucene as the document repository format (see the discussion in 
M-1252). 

Gokhan, please commit a patch if you think it's ready; otherwise we can close 
this as 'Won't Fix'.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1178.


Resolution: Won't Fix

Resolving as Won't Fix, as discussed.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968221#comment-13968221
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

I personally like the idea of integrating additional storage layers as matrix 
inputs, but not the implementation I did here.
After we agree on the new algorithm layers, we can later move on to the 
additional input formats. 

So my vote is also for "Won't Fix".

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968185#comment-13968185
 ] 

Sebastian Schelter commented on MAHOUT-1178:


I'd personally resolve this as Won't Fix, as we should concentrate on the 
Scala DSL in the future. Any objections?

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1468) Creating a new page for StreamingKMeans documentation on mahout website

2014-04-14 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968183#comment-13968183
 ] 

Sebastian Schelter commented on MAHOUT-1468:


Ideally, the page should give an introduction to the ideas behind streaming 
k-means, show an example of how to use it, and help people choose well-working 
parameters.
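
As a starting point for that introduction: the objective that streaming 
k-means approximates in a single pass (see the paper linked in the issue) is 
the usual k-means cost

  \min_{C,\,|C|=k} \; \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^2

where X is the set of input vectors and C the set of centroids; the streaming 
step keeps a weighted sketch of substantially more than k centroids, which is 
then clustered in memory to produce the final k.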

> Creating a new page for StreamingKMeans documentation on mahout website
> ---
>
> Key: MAHOUT-1468
> URL: https://issues.apache.org/jira/browse/MAHOUT-1468
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.0
>Reporter: Pavan Kumar N
>  Labels: Documentation
> Fix For: 1.0
>
>
> Separate page required on Streaming K Means algorithm description and 
> overview, explaining the various parameters that can be used in 
> streamingkmeans, the strategy for parallelization, and a link to this paper: 
> http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [jira] [Closed] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

2014-04-14 Thread Sebastian Schelter
No need to close stuff; we will resolve it as fixed and close it only after 
the next release.


On 04/14/2014 11:15 AM, Pavan Kumar N (JIRA) wrote:


  [ 
https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavan Kumar N closed MAHOUT-1450.
-



Cleaning up clustering documentation on mahout website
---

 Key: MAHOUT-1450
 URL: https://issues.apache.org/jira/browse/MAHOUT-1450
 Project: Mahout
  Issue Type: Documentation
  Components: Documentation
 Environment: This affects all mahout versions
Reporter: Pavan Kumar N
  Labels: documentation, newbie
 Fix For: 1.0


In canopy clustering, the strategy for parallelization seems to have some dead 
links. We need to clean them up and replace them with new links (if there are 
any). Here is the page:
http://mahout.apache.org/users/clustering/canopy-clustering.html
Here are some details of the dead links on the k-means clustering page:
On the k-Means clustering - basics page, in the first line of the Quickstart 
part of the documentation, the hyperlink "Here" is dead:
http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
In the Strategy for parallelization part of the documentation, the hyperlink 
"Cluster computing and MapReduce" in the first sentence, the hyperlink "here" 
in the second sentence, and the hyperlink 
"http://www2.chass.ncsu.edu/garson/PA765/cluster.htm" in the last sentence are 
dead:
http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
Under the page 
http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html
in the second sentence of the Pre-prep part, the hyperlink "setup mahout" is 
dead:
http://mahout.apache.org/users/clustering/users/basics/quickstart.html
The existing documentation is too ambiguous, and I recommend making the 
following changes so that new users can use it as a tutorial.
The Quickstart should be replaced with the following:
Get the data from:
wget 
http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
Place it within the examples folder under the Mahout home directory:
mahout-0.7/examples/reuters
mkdir reuters
cd reuters
mkdir reuters-out
mv reuters21578.tar.gz reuters-out
cd reuters-out
tar -xzvf reuters21578.tar.gz
cd ..
Mahout-specific commands
#1 run the org.apache.lucene.benchmark.utils.ExtractReuters class
${MAHOUT_HOME}/bin/mahout
org.apache.lucene.benchmark.utils.ExtractReuters reuters-out
reuters-text
#2 copy the files to your HDFS
bin/hadoop fs -copyFromLocal
/home/bigdata/mahout-distribution-0.7/examples/reuters-text
hdfs://localhost:54310/user/bigdata/
#3 generate the sequence file
mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
-o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
-chunk → the chunk size in MB
-c UTF-8 → the character encoding of the input
#4 check the generated sequence file
mahout-0.7$ ./bin/mahout seqdumper -i
/your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
#5 generate a vector file from the sequence file
mahout seq2sparse -i
hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
-ow → overwrite
#6 take a look at the output with this command; it should have these 7 items
bin/hadoop fs -ls reuters-vectors
reuters-vectors/df-count
reuters-vectors/dictionary.file-0
reuters-vectors/frequency.file-0
reuters-vectors/tf-vectors
reuters-vectors/tfidf-vectors
reuters-vectors/tokenized-documents
reuters-vectors/wordcount
#7 check the vector: reuters-vectors/tf-vectors/part-r-0
mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
#8 run canopy clustering to get good initial centroids for k-means
mahout canopy -i
hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o
hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm
org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000
-dm → the distance measure to be used while clustering (here the cosine 
distance measure)
#9 run the k-means clustering algorithm
mahout kmeans -i
hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c
hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o
hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters -cd 0.1 -ow
-x 20 -k 10
-i → input
-o → output
-c → initial centroids for k-means (not defining this parameter will
trigger k-means to generate random initial centroids)
-cd → convergence delta parameter
-ow → overwrite
-x → the number of k-means iterations
-k → the number of clusters
#10 export the k-means output using the Cluster Dumper tool
mahout clusterdump -dt sequence

[jira] [Closed] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

2014-04-14 Thread Pavan Kumar N (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavan Kumar N closed MAHOUT-1450.
-


> Cleaning up clustering documentation on mahout website 
> ---
>
> Key: MAHOUT-1450
> URL: https://issues.apache.org/jira/browse/MAHOUT-1450
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
> Environment: This affects all mahout versions
>Reporter: Pavan Kumar N
>  Labels: documentation, newbie
> Fix For: 1.0
>
>
> In canopy clustering, the strategy for parallelization seems to have some 
> dead links. Need to clean them and replace with new links (if there are any). 
> Here is the link:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are some details of the dead links on the k-means clustering page:
> On the k-Means clustering - basics page, in the first line of the Quickstart 
> part of the documentation, the hyperlink "Here" is dead:
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> In the Strategy for parallelization part of the documentation, the hyperlink 
> "Cluster computing and MapReduce" in the first sentence, the hyperlink "here" 
> in the second sentence, and the hyperlink 
> "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm" in the last sentence 
> are dead.
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
> Under the page: 
> http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html
> in the second sentence of Pre-prep part of this page, the hyperlink "setup 
> mahout" is dead.
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous, and I recommend making the 
> following changes so that new users can use it as a tutorial.
> The Quickstart should be replaced with the following:
> Get the data from:
> wget 
> http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it within the examples folder under the Mahout home directory:
> mahout-0.7/examples/reuters
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
> Mahout specific Commands
> #1 run the org.apache.lucene.benchmark.utils.ExtractReuters class
> ${MAHOUT_HOME}/bin/mahout
> org.apache.lucene.benchmark.utils.ExtractReuters reuters-out
> reuters-text
> #2 copy the file to your HDFS
> bin/hadoop fs -copyFromLocal
> /home/bigdata/mahout-distribution-0.7/examples/reuters-text
> hdfs://localhost:54310/user/bigdata/
> #3 generate sequence-file
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
> -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → the chunk size in MB
> -c UTF-8 → the character encoding of the input
> #4 Check the generated sequence-file
> mahout-0.7$ ./bin/mahout seqdumper -i
> /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
> #5 From sequence-file generate vector file
> mahout seq2sparse -i
> hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
> hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite
> #6 take a look at the output with this command; it should have these 7 items
> bin/hadoop fs -ls reuters-vectors
> reuters-vectors/df-count
> reuters-vectors/dictionary.file-0
> reuters-vectors/frequency.file-0
> reuters-vectors/tf-vectors
> reuters-vectors/tfidf-vectors
> reuters-vectors/tokenized-documents
> reuters-vectors/wordcount
> #7 check the vector: reuters-vectors/tf-vectors/part-r-0
> mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
> #8 Run canopy clustering to get optimal initial centroids for k-means
> mahout canopy -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000
> -dm → specifying the distance measure to be used while clustering (here it is 
> cosine distance measure)
> #9 Run k-means clustering algorithm
> mahout kmeans -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o
> hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters -cd 0.1 -ow
> -x 20 -k 10
> -i → input
> -o → output
> -c → initial centroids for k-means (not defining this parameter will
> trigger k-means to generate random initial centroids)
> -cd → convergence delta parameter
> -ow → overwrite
> -x → specifying number of k-means iterations
> -k → specifying number of clusters
> #10 Export k-means output using Cl

[jira] [Comment Edited] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

2014-04-14 Thread Pavan Kumar N (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968181#comment-13968181
 ] 

Pavan Kumar N edited comment on MAHOUT-1450 at 4/14/14 9:13 AM:


[~ssc] Yes, I'd love to work on 1468. Let's take this discussion to 1468; give 
me an outline of topics the page should have. I am closing 1450.


was (Author: pknarayan):
[~ssc] Yes, I'd love to work on 1468. Let's take this discussion to 1468; give 
me an outline of topics the page should have.

> Cleaning up clustering documentation on mahout website 
> ---
>
> Key: MAHOUT-1450
> URL: https://issues.apache.org/jira/browse/MAHOUT-1450
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
> Environment: This affects all mahout versions
>Reporter: Pavan Kumar N
>  Labels: documentation, newbie
> Fix For: 1.0
>
>
> In canopy clustering, the strategy for parallelization seems to have some 
> dead links. Need to clean them and replace with new links (if there are any). 
> Here is the link:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are some details of the dead links on the k-means clustering page:
> On the k-Means clustering - basics page, in the first line of the Quickstart 
> part of the documentation, the hyperlink "Here" is dead:
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> In the Strategy for parallelization part of the documentation, the hyperlink 
> "Cluster computing and MapReduce" in the first sentence, the hyperlink "here" 
> in the second sentence, and the hyperlink 
> "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm" in the last sentence 
> are dead.
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
> Under the page: 
> http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html
> in the second sentence of Pre-prep part of this page, the hyperlink "setup 
> mahout" is dead.
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous, and I recommend making the 
> following changes so that new users can use it as a tutorial.
> The Quickstart should be replaced with the following:
> Get the data from:
> wget 
> http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it within the examples folder under the Mahout home directory:
> mahout-0.7/examples/reuters
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
> Mahout specific Commands
> #1 run the org.apache.lucene.benchmark.utils.ExtractReuters class
> ${MAHOUT_HOME}/bin/mahout
> org.apache.lucene.benchmark.utils.ExtractReuters reuters-out
> reuters-text
> #2 copy the file to your HDFS
> bin/hadoop fs -copyFromLocal
> /home/bigdata/mahout-distribution-0.7/examples/reuters-text
> hdfs://localhost:54310/user/bigdata/
> #3 generate sequence-file
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
> -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → the chunk size in MB
> -c UTF-8 → the character encoding of the input
> #4 Check the generated sequence-file
> mahout-0.7$ ./bin/mahout seqdumper -i
> /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
> #5 From sequence-file generate vector file
> mahout seq2sparse -i
> hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
> hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite
> #6 take a look at the output with this command; it should have these 7 items
> bin/hadoop fs -ls reuters-vectors
> reuters-vectors/df-count
> reuters-vectors/dictionary.file-0
> reuters-vectors/frequency.file-0
> reuters-vectors/tf-vectors
> reuters-vectors/tfidf-vectors
> reuters-vectors/tokenized-documents
> reuters-vectors/wordcount
> #7 check the vector: reuters-vectors/tf-vectors/part-r-0
> mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
> #8 Run canopy clustering to get optimal initial centroids for k-means
> mahout canopy -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000
> -dm → specifying the distance measure to be used while clustering (here it is 
> cosine distance measure)
> #9 Run k-means clustering algorithm
> mahout kmeans -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o
> hdfs://loc

[jira] [Commented] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

2014-04-14 Thread Pavan Kumar N (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968181#comment-13968181
 ] 

Pavan Kumar N commented on MAHOUT-1450:
---

[~ssc] Yes, I'd love to work on 1468. Let's take this discussion to 1468; give 
me an outline of topics the page should have.

> Cleaning up clustering documentation on mahout website 
> ---
>
> Key: MAHOUT-1450
> URL: https://issues.apache.org/jira/browse/MAHOUT-1450
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
> Environment: This affects all mahout versions
>Reporter: Pavan Kumar N
>  Labels: documentation, newbie
> Fix For: 1.0
>
>
> In canopy clustering, the strategy for parallelization seems to have some 
> dead links. Need to clean them and replace with new links (if there are any). 
> Here is the link:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are some details of the dead links on the k-means clustering page:
> On the k-Means clustering - basics page, in the first line of the Quickstart 
> part of the documentation, the hyperlink "Here" is dead:
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> In the Strategy for parallelization part of the documentation, the hyperlink 
> "Cluster computing and MapReduce" in the first sentence, the hyperlink "here" 
> in the second sentence, and the hyperlink 
> "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm" in the last sentence 
> are dead.
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
> Under the page: 
> http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html
> in the second sentence of Pre-prep part of this page, the hyperlink "setup 
> mahout" is dead.
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous, and I recommend making the 
> following changes so that new users can use it as a tutorial.
> The Quickstart should be replaced with the following:
> Get the data from:
> wget 
> http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it within the examples folder under the Mahout home directory:
> mahout-0.7/examples/reuters
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
> Mahout specific Commands
> #1 run the org.apache.lucene.benchmark.utils.ExtractReuters class
> ${MAHOUT_HOME}/bin/mahout
> org.apache.lucene.benchmark.utils.ExtractReuters reuters-out
> reuters-text
> #2 copy the file to your HDFS
> bin/hadoop fs -copyFromLocal
> /home/bigdata/mahout-distribution-0.7/examples/reuters-text
> hdfs://localhost:54310/user/bigdata/
> #3 generate sequence-file
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
> -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → the chunk size in MB
> -c UTF-8 → the character encoding of the input
> #4 Check the generated sequence-file
> mahout-0.7$ ./bin/mahout seqdumper -i
> /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
> #5 From sequence-file generate vector file
> mahout seq2sparse -i
> hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
> hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite
> #6 take a look at the output with this command; it should have these 7 items
> bin/hadoop fs -ls reuters-vectors
> reuters-vectors/df-count
> reuters-vectors/dictionary.file-0
> reuters-vectors/frequency.file-0
> reuters-vectors/tf-vectors
> reuters-vectors/tfidf-vectors
> reuters-vectors/tokenized-documents
> reuters-vectors/wordcount
> #7 check the vector: reuters-vectors/tf-vectors/part-r-0
> mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
> #8 Run canopy clustering to get optimal initial centroids for k-means
> mahout canopy -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000
> -dm → specifying the distance measure to be used while clustering (here it is 
> cosine distance measure)
> #9 Run k-means clustering algorithm
> mahout kmeans -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o
> hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters -cd 0.1 -ow
> -x 20 -k 10
> -i → input
> -o → output
> -c → initial centroids for k-means (not defining this parameter will
> trigger k-means to generate random initial c

[jira] [Updated] (MAHOUT-1445) Create an intro for item based recommender

2014-04-14 Thread Pavan Kumar N (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavan Kumar N updated MAHOUT-1445:
--

   Labels: documentation recommender  (was: )
Affects Version/s: 1.0
   Status: Patch Available  (was: Open)

How does an item-based recommender work?
In item-based recommender engines, if a user likes an item A, then the same 
item can be recommended to other users who are similar to that user. 

How do we obtain similar users/the similarity between users?
Similar users can be obtained by using profile-based information about the 
user: for example, by clustering users based on their attributes, such as age, 
gender, geographic location, net worth, and so on. Alternatively, you can find 
similar users using a collaborative approach, by analyzing the users' actions, 
such as their historical ratings and reviews.

What is the high-level logic?
For every item i that user u has no preference for yet, and for every other 
user v that has a preference for item i: compute the similarity s between u 
and v, and incorporate v's preference for item i, weighted by s, into a 
running average.
Return the top items, ranked by weighted average.
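
Read charitably, that running average is the standard similarity-weighted 
estimate, which (as a hedged formalization of the pseudocode above, with 
sim(u,v) playing the role of the weight s) can be written as

  \hat{r}(u,i) = \frac{\sum_{v \in U_i} \mathrm{sim}(u,v) \, r(v,i)}
                      {\sum_{v \in U_i} |\mathrm{sim}(u,v)|}

where U_i is the set of users with a preference for item i and r(v,i) is v's 
preference for i.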

> Create an intro for item based recommender
> --
>
> Key: MAHOUT-1445
> URL: https://issues.apache.org/jira/browse/MAHOUT-1445
> Project: Mahout
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 1.0
>Reporter: Maciej Mazur
>  Labels: documentation, recommender
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968148#comment-13968148
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Well, I can add this, but considering the current status of the project, I 
think it is no longer of interest to people.
What do you say, [~ssc]: should we 'won't fix' it or commit?

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Tackling the "legacy dilemma"

2014-04-14 Thread Dmitriy Lyubimov
I am ready to order a t-shirt with "Go, Andy! +100" across it if it makes
any pragmatic sense.
On Apr 13, 2014 11:11 PM, "Sebastian Schelter"  wrote:

> On 04/14/2014 08:00 AM, Dmitriy Lyubimov wrote:
>
>> Unfortunately, not all things map gracefully into algebra. But hopefully
>> some of the whole can still be.
>>
>
> Yes, that's why I was asking Andy if there are enough constructs. If not,
> we might have to add more.
>
>
>> I am even a little bit worried that we may develop almost too much (is
>> there such a thing?) ML before we have a chance to crystallize the data
>> frames and perhaps the dictionary discussions. These are more tools to keep
>> abstracted.
>>
>
> I think it's a very good thing to have early ML implementations on the
> DSL, because it allows us to validate whether we are on the right path. We
> should start with providing the things that are most popular in mahout,
> like the item-based recommender from MAHOUT-1464. Having a few
> implementations on the DSL also helps with designing new abstractions,
> because for every proposed feature we can look at the existing code and see
> how helpful the new feature would be.
>
>
>> I just don't want Mahout to be yet another MLlib. I shudder every time
>> somebody says "we want to create a Spark version of (an|the) algorithm". I
>> know it will create the wrong talking points for somebody anxious to draw
>> parallels.
>>
>>
>
> Totally agree here. Looks like history repeats itself, from "I want to 
> create a Hadoop implementation" to "I want to create a Spark implementation" :)
>
>
>>
>> On Sun, Apr 13, 2014 at 10:51 PM, Sebastian Schelter 
>> wrote:
>>
>>> Andy, that would be awesome. Have you had a look at our new scala DSL [1]?
>>> Does it offer enough constructs for you to rewrite your implementation
>>> with it?
>>>
>>> --sebastian
>>>
>>>
>>> [1] https://mahout.apache.org/users/sparkbindings/home.html
>>>
>>>
>>> On 04/14/2014 07:47 AM, Andy Twigg wrote:
>>>
>>>>> +1 to removing present Random Forests. Andy Twigg had provided a
>>>>> Spark-based Streaming Random Forests impl sometime last year. It's time
>>>>> to restart that conversation and integrate that into the codebase, if
>>>>> the contributor is still willing, i.e.
>>>>
>>>> I'm happy to contribute this, but as it stands it's written against
>>>> spark, even forgetting the 'streaming' aspect. Do you have any advice
>>>> on how to proceed?



>>>
>>
>