RE: Welcome Pat Ferrel as new committer on Mahout

2014-04-24 Thread Andrew Palumbo
Congratulations Pat!

> Subject: Re: Welcome Pat Ferrel as new committer on Mahout
> From: andrew.mussel...@gmail.com
> Date: Thu, 24 Apr 2014 06:44:43 -0700
> CC: user@mahout.apache.org
> To: d...@mahout.apache.org
> 
> Great news, welcome Pat!
> 
> > On Apr 24, 2014, at 3:19 AM, Sebastian Schelter  wrote:
> > 
> > Hi,
> > 
> > this is to announce that the Project Management Committee (PMC) for Apache 
> > Mahout has asked Pat Ferrel to become committer and we are pleased to 
> > announce that he has accepted.
> > 
> > Being a committer enables easier contribution to the project since in 
> > addition to posting patches on JIRA it also gives write access to the code 
> > repository. That also means that now we have yet another person who can 
> > commit patches submitted by others to our repo *wink*
> > 
> > Pat, we look forward to working with you in the future. Welcome! It would 
> > be great if you could introduce yourself with a few words.
> > 
> > -s
  

RE: Mahout Naive Bayes CSV Classification

2014-05-04 Thread Andrew Palumbo
Hi Jossef,

I can answer your first two questions for you:
 
> 1) Are these predicted values normal?

Yes, negative scores are normal - the scores are in the log domain, so they 
will typically be negative.

> 2) For now, i'm assuming that the max value 'wins'. is that correct?

That is correct: NaiveBayes uses a winner-takes-all approach to class 
assignment based on the max score across all classes, i.e.:

> {0:-2119.616101368751,1:-2536.217343666528}

will be classified as 0. 
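
For example, a minimal sketch (assuming classifier is your trained classifier 
and instance is one of your vectors; the variable names are hypothetical):

    import org.apache.mahout.math.Vector;

    // Pick the winning label from the classifyFull() scores: the index of
    // the max (least negative) log-likelihood wins.
    Vector scores = classifier.classifyFull(instance);
    int predictedLabel = scores.maxValueIndex();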

> 3) When i call 'naiveBayesModel.numFeatures()' (line 96 in MahoutTest.java)
> it returns 40 instead of 41 features. Why is that?

This seems odd.  Is it possible that something is getting dropped in your 
vectorization process?

Could you give a little more information on how you're using this, and could 
you please clarify what you're referring to re: (line 96 in MahoutTest.java)?

Thanks,

Andy   

> From: josse...@gmail.com
> Date: Sun, 4 May 2014 23:16:48 +0300
> Subject: Re: Fwd: Mahout Naive Bayes CSV Classification
> To: user@mahout.apache.org; s...@apache.org
> 
> Hey Sebastian,
> 
> Thanks for your reply.
> 
> a link to a github gist with my java code and a small sample from the CSV
> i'm using can be found here:
> https://gist.github.com/Jossef/e6c8fc0c31f0c2bf036a
> 
> 
> 
> I wrote code to convert the CSV data (41 features + class name) to a
> RandomAccessSparseVector and append it into a sequence file
> 
> I successfully managed to create a model from the sequence file and to
> run the NaiveBayes classifier with data.
> 
> 
> My problem is that I get negative results when I call
> 'classifier.classifyFull'
> 
> e.g. :
> 
> 
> {0:-2119.616101368751,1:-2536.217343666528}
> {0:-3210.7575139461096,1:-4569.913127240827}
> {0:-2986.049040829474,1:-3473.9551320126384}
> {0:-2411.582039236549,1:-3487.8547154600456}
> {0:-25620.824856365696,1:-31625.63011412386}
> {0:-4601.922062356241,1:-5019.98413435188}
> {0:-4331.835315861215,1:-4718.881475757016}
> {0:-3568.9589306062785,1:-4132.310969149298}
> ...
> ...
> 
> 
> 
> 
> 1) Are these predicted values normal?
> 2) For now, i'm assuming that the max value 'wins'. is that correct?
> 3) When i call 'naiveBayesModel.numFeatures()' (line 96 in MahoutTest.java)
> it returns 40 instead of 41 features. Why is that?
> 
> 
> Thanks :)
> 
> 
> 
> 
> 
> On Sun, May 4, 2014 at 2:25 PM, Sebastian Schelter  wrote:
> 
> > Hi Jossef,
> >
> > You have to vectorize and normalize your data. The input for naive bayes
> > is a sequencefile containing a Text object as key (your label) and a
> > VectorWritable that holds a vector with the data.
> >
> > Instructions to run NaiveBayes can be found here:
> >
> > https://mahout.apache.org/users/classification/bayesian.html
> >
> > --sebastian
> >
> >
> >
> > On 05/03/2014 07:40 PM, Jossef Harush wrote:
> >
> >> I have these 2 CSV files:
> >>
> >> 1. train-set.csv
> >> 2. test-set.csv
> >>
> >>
> >> Both of them are in the same structure (with different content) and
> >> similar
> >> to this example (http://i.stack.imgur.com/jsckr.png) :
> >>
> >> [image: enter image description here]
> >>
> >> Each column is a feature, and the last column, class, is the name of the
> >> class to predict.
> >>
> >>
> >> *Can anyone please provide a sample code for:*
> >>
> >> 1. Initializing Naive Bayes with a CSV file (model creation, training,
> >> required pre-processing, etc...)
> >> 2. For a given CSV row - predicting a class
> >>
> >>
> >> Thanks!
> >>
> >>
> >>
> >> BTW -
> >>
> >> I'm using Mahout 0.9 and Hadoop 2.4, and I've already tried to follow these
> >> links:
> >>
> >> http://web.archiveorange.com/archive/v/y0uRZw9Q4iHdjrm4Rfsu
> >> http://chimpler.wordpress.com/2013/03/13/using-the-mahout-
> >> naive-bayes-classifier-to-automatically-classify-twitter-messages/
> >>
> >>
> >>
> >
>  
> 
> -- 
> Sincerely,

> 
> Jossef Harush.
> jossef.com 
  

RE: Mahout Naive Bayes CSV Classification

2014-05-05 Thread Andrew Palumbo
Jossef,
Does your training set have any features with a zero value for all instances?

> Date: Mon, 5 May 2014 08:33:37 +0300
> Subject: RE: Mahout Naive Bayes CSV Classification
> From: josse...@gmail.com
> To: user@mahout.apache.org
> 
> a link to a github gist with my java code and a small sample from the CSV
> i'm using can be found here:
> https://gist.github.com/Jossef/e6c8fc0c31f0c2bf036a
  

RE: Mahout Naive Bayes CSV Classification

2014-05-06 Thread Andrew Palumbo
This would lead to that term not being counted by 
NaiveBayesModel.numFeatures(), which returns the number of features (term 
counts, if this were a text classification problem) with a non-zero count 
across the entire input set.
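
As a small illustration (a hedged sketch with made-up values, not from your 
code): in a sparse vector, entries left at 0.0 are simply absent, so a feature 
that is zero in every instance never gets a non-zero count:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    // Cardinality 41 (features 0..40), but feature 40 is never set.
    Vector v = new RandomAccessSparseVector(41);
    v.set(0, 1.0);
    for (Vector.Element e : v.nonZeroes()) {
      // Prints only "0 -> 1.0"; feature 40 stays 0.0 and is skipped,
      // so it never contributes to the model's feature count.
      System.out.println(e.index() + " -> " + e.get());
    }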



> From: josse...@gmail.com
> Date: Tue, 6 May 2014 21:04:18 +0300
> Subject: Re: Mahout Naive Bayes CSV Classification
> To: user@mahout.apache.org
> 
> Yes
> 
> 
> On Mon, May 5, 2014 at 10:51 PM, Andrew Palumbo  wrote:
> 
> > Jossef,
> > Does your training set have any features with a zero value for all
> > instances?
> >

RE: Using existing model to train again

2014-05-26 Thread Andrew Palumbo
Hi Subbu,  

There is currently no way to update an already trained Naive Bayes model; 
you'd have to retrain on the full 2 million records.

You could probably hack TrainNaiveBayesJob.java [1] to meet your needs if you 
anticipate needing to do this in the future, but your new data would have to 
be vectorized in exactly the same manner as the original data to update the 
model correctly. This would limit you to pure term frequencies (no IDF 
transformation) and would not allow for anything like maxDFPercent, etc.

Andy

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/training/TrainNaiveBayesJob.java
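
If you do end up retraining from scratch, a hedged sketch of the retrain step 
(paths are placeholders; flags as used in the classify-20newsgroups.sh example):

    $ mahout trainnb -i ${WORK_DIR}/all-2m-vectors -el \
        -li ${WORK_DIR}/labelindex -o ${WORK_DIR}/model -ow -c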


> Hi team,
> I have trained a model in naive Bayes using training data of 1 million
> records. Now I have another 1 million records . Can I add this new training
> data to the existing model and train it again to get a new model instead of
> passing all the 2 million records at once to get a model.
>
> Thanks,
> Subbu
>
  

RE: Using existing model to train again

2014-05-26 Thread Andrew Palumbo
Hi Namit,

The current Naive Bayes implementation is based on MapReduce and is therefore 
dependent on Hadoop.  You can run the mahout trainnb and mahout testnb scripts 
locally by setting the environment variable MAHOUT_LOCAL=true.

This will keep everything on your local filesystem and prevent Mahout from 
attempting to run in cluster mode, but Hadoop is still required.
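
For example (a hedged sketch; the paths are placeholders):

    $ export MAHOUT_LOCAL=true
    $ mahout trainnb -i /tmp/train-vectors -el -li /tmp/labelindex -o /tmp/model -ow
    $ mahout testnb -i /tmp/test-vectors -m /tmp/model -l /tmp/labelindex -o /tmp/results -ow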

Andy


> Date: Mon, 26 May 2014 10:22:18 +0530
> Subject: Re: Using existing model to train again
> From: namitmaheshwa...@gmail.com
> To: user@mahout.apache.org
> 
> Hi Subbu,
> 
> I was too working with Naive Bayes. I wanted to know whether it is possible
> to run *Naive Bayes without Hadoop* in Mahout or is it necessary to use
> Hadoop.
> 
> Thanks
> Namit
> 
> 
> On Mon, May 26, 2014 at 10:18 AM, Kasi Subrahmanyam
> wrote:
> 
> > Hi team,
> > I have trained a model in naive Bayes using training data of 1 million
> > records. Now I have another 1 million records . Can I add this new training
> > data to the existing model and train it again to get a new model instead of
> > passing all the 2 million records at once to get a model.
> >
> > Thanks,
> > Subbu
> >
  

RE: Naive Bayes Classifier Bug ?

2014-06-21 Thread Andrew Palumbo
Hi Toyoharu, 

Mahout Naive Bayes uses Laplace smoothing (the alpha_I parameter, with 
default=1) to deal with terms not seen in the training data. See Rennie et 
al., sec. 2.3 [1].

Your modification will certainly work, and may in fact give better results for 
the problem that you're working on.

You could also look at optimizing the Laplacian [2].

[1] http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
[2] http://www.stat.yale.edu/~lc436/papers/temp/Zhang_Oles_2001.pdf


Andy

> Date: Sun, 22 Jun 2014 00:41:51 +0900
> Subject: Naive Bayes Classifier Bug ?
> From: toyoharu.ogih...@gmail.com
> To: user@mahout.apache.org
> 
> Hi Mahout,
> 
> In Naive Bayes, I think that a term that does not exist in the training data
> should not affect the score.
> What do you think?
> 
>   org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier
> 
>  Before:
>   protected double getScoreForLabelInstance(int label, Vector instance) {
> double result = 0.0;
> for (Element e : instance.nonZeroes()) {
>   result += e.get() * getScoreForLabelFeature(label, e.index());
> }
> return result;
>   }
> 
>  After:
>   protected double getScoreForLabelInstance(int label, Vector instance) {
>     double result = 0.0;
>     for (Element e : instance.nonZeroes()) {
>       int index = e.index();
>       double featureWeight = model.featureWeight(index);
>       // skip terms with zero total weight, i.e. terms not seen in training
>       if (featureWeight != 0) {
>         result += e.get() * getScoreForLabelFeature(label, index);
>       }
>     }
>     return result;
>   }
> 
> Thanks,
> Toyoharu
  

RE: any pointer to run wikipedia bayes example

2014-08-21 Thread Andrew Palumbo
Hello,

Yes, if you work off of the current trunk, you can use the classify-wiki.sh 
example.  There is currently no documentation on the Mahout site for this.

You can run this script to build and test an NB classifier for either option 
(1), 10 arbitrary countries, or option (2), 2 countries (United States and 
United Kingdom).

By default the script is set to run on a medium-sized Wikipedia XML dump.  To 
run on the full set you'll have to change the download by commenting out line 
78 and uncommenting line 80 [1].  *Be sure to clean your work directory when 
changing datasets - option (3).*


The step-by-step process for creating a Naive Bayes classifier for the 
Wikipedia XML dump is very similar to creating the 20 Newsgroups classifier.  
The only difference is that instead of running $mahout seqdirectory on the 
unzipped 20 Newsgroups file, you'll run $mahout seqwiki on the unzipped 
Wikipedia XML dump.

$ mahout seqwiki invokes WikipediaToSequenceFile.java, which accepts a text 
file of categories [2] and starts an MR job to parse each document in the XML 
file.  This process will seek to extract documents whose category matches 
(exactly, if the exactMatchOnly option is set) a line in the category file.  
If no match is found and the -all option is set, the document will be dumped 
into an "unknown" category.  The documents will then be written out as a 
sequence file of the form (K: /category/document_title, V: document).

There are 3 different example category files available in the 
/examples/src/test/resources directory: country.txt, country10.txt and 
country2.txt.

The CLI options for seqwiki are as follows:

    -input (-i)           input pathname String
    -output (-o)          the output pathname String
    -categories (-c)      the file containing the Wikipedia categories
    -exactMatchOnly (-e)  if set, the Wikipedia category must match exactly
                          instead of simply containing the category string
    -all (-all)           if set, select all categories

From there you just need to run seq2sparse, split, trainnb, and testnb as in 
the example script.

Especially for the binary classification problem, you should get better 
results using 3- or 4-grams and a low maxDF cutoff like 30.
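
For example, a hedged sketch of such a seq2sparse invocation (the paths are 
placeholders; -ng is maxNGramSize and -x is maxDFPercent):

    $ mahout seq2sparse -i ${WORK_DIR}/wikipedia-seqfiles -o ${WORK_DIR}/wikipedia-vectors \
        -wt tfidf -lnorm -nv -ow -ng 4 -x 30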

[1] https://github.com/apache/mahout/blob/master/examples/bin/classify-wiki.sh
[2] https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt


Subject: Re: any pointer to run wikipedia bayes example
To: user@mahout.apache.org
From: w...@us.ibm.com
Date: Wed, 20 Aug 2014 09:50:42 -0400

hi,

After doing a bit more searching, I found
https://issues.apache.org/jira/browse/MAHOUT-1527

The version of Mahout that I have been working on is Mahout 0.9 (from
http://mahout.apache.org/general/downloads.html), which I downloaded in April.
Albeit the latest stable release, it doesn't include the patch mentioned in
https://issues.apache.org/jira/browse/MAHOUT-1527

Then I realized that had I cloned the latest Mahout, I would have gotten the
classify-wiki.sh script, and could probably start from there.

Sorry for the spam!

Thanks,

Wei

Wei Zhang---08/19/2014 06:18:09 PM---Hi, I have been able to run the bayesian
network 20news group example provided

From:    Wei Zhang/Watson/IBM@IBMUS
To:      user@mahout.apache.org
Date:    08/19/2014 06:18 PM
Subject: any pointer to run wikipedia bayes example

Hi,

I have been able to run the bayesian network 20news group example provided at
the Mahout website.

I am interested in running the Wikipedia bayes example, as it is a much larger
dataset. From several googling attempts, I figured it is a bit different
workflow than running the 20news group example -- e.g., I would need to
provide a categories.txt file, invoke WikipediaXmlSplitter, call
wikipediaDataSetCreator, etc.

I am wondering: is there a document somewhere that describes the process of
running the Wikipedia bayes example?
https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html seems to no
longer work.

Greatly appreciated!

Wei
  

RE: any pointer to run wikipedia bayes example

2014-08-22 Thread Andrew Palumbo
[1] https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/WikipediaMapper.java


Thank you very much!

Wei

(2) Is it legit to use some other categories other than 

  

RE: any pointer to run wikipedia bayes example

2014-08-27 Thread Andrew Palumbo


Subject: RE: any pointer to run wikipedia bayes example
To: user@mahout.apache.org
From: w...@us.ibm.com
Date: Tue, 26 Aug 2014 18:12:52 -0400


Hello Andrew,

I have given NB a try on the medium-size Wikipedia dump (~1GB of data after 
decompression, roughly 1/50 of the full Wikipedia size) with two categories 
(US/UK), and examined the tf-idf vectors generated.

I have two questions:

(1) It seems there are (only) 11683 data points (i.e., documents) generated, 
albeit each data point has relatively high dimension. 10K data points seem not 
very exciting; even if I multiply by 50 (to the full extent of the Wikipedia 
dataset), the data points are not particularly many.

I suspect that many of the documents are not categorized as either US or UK, 
and thus not included in the training set. On a 20-node cluster (8 cores each, 
albeit a quite old one, 5 years old), it took 45 minutes to label/vectorize 
the dataset, but only 3 minutes to train the NB.

If you used option (2) from the classify-wiki.sh script, seq2sparse will be 
vectorizing the data using 4-grams, which take much longer and give you a much 
larger feature set.  Option (1) uses bigrams.

I am wondering, is there a way to get a larger dataset that can stress the NB 
training (instead of the label/vectorization part), either by providing a more 
inclusive category file or choosing another dataset?

You could run on the full country set:

https://github.com/apache/mahout/blob/master/examples/src/test/resources/country.txt

by editing line 101 or 107 to read:

cp $MAHOUT_HOME/examples/src/test/resources/country.txt ${WORK_DIR}/country.txt

However, on the medium dataset this only yields ~38200 documents, so it still 
probably will not be the size that you are looking for. Alternatively, you 
could create your own category.txt file and pass it to the -c argument.

As well, you could try turning on the -all option which, as we discussed 
before, will likely skew the categories into an "unknown" category, but will 
not reject any documents.

With a more inclusive category file, I can potentially get a larger dataset, 
but I don't know how to handle the case where a document has two labels in 
that category file.

Currently, the WikipediaMapper labels the document with the first matching 
category that it finds, but you can customize this however you'd like.

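A simplified, hedged sketch of that first-match logic (not the exact source; 
see the mapper itself for the real category parsing):

    import java.util.Set;

    // Returns the first category from the input set found in the document's
    // category text; a second matching label is never reached.
    static String findFirstMatch(String documentCategory,
                                 Set<String> inputCategories,
                                 boolean exactMatchOnly) {
      for (String category : inputCategories) {
        boolean match = exactMatchOnly
            ? documentCategory.equals(category)
            : documentCategory.contains(category);
        if (match) {
          return category;
        }
      }
      return null; // no match: document is rejected (or "unknown" with -all)
    }
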
(2) I am wondering: if I use the Wikipedia dataset as the input to K-means 
clustering (thus no need to label the data), then I can get a relatively large 
dataset, and both K-means and NB use the SequenceFile format.

I believe this should work. You could remove the labeling section (basically 
lines 79-85) from WikipediaMapper.java:

https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/WikipediaMapper.java

and write out something like (K=document_title, V=document) to the sequence 
file.
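A hedged sketch of that write (title and document are hypothetical variables 
holding the parsed page):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // (K: document_title, V: document) with no category label at all.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("wiki-unlabeled-seq"), Text.class, Text.class);
    try {
      writer.append(new Text(title), new Text(document));
    } finally {
      writer.close();
    }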



and then run this sequence file through seq2sparse and kmeans, as is done in 
the cluster-reuters.sh example (starting at line 109):

https://github.com/andrewpalumbo/mahout/blob/master/examples/bin/cluster-reuters.sh

It seems that I would just need to bypass the labeling part and go directly to 
the vectorization; I am not sure if that is feasible?

Thanks a lot!

Wei




RE: any pointer to run wikipedia bayes example

2014-08-27 Thread Andrew Palumbo

> 
> (2) I am wondering: if I use the Wikipedia dataset as the input to K-means
> clustering (thus no need to label the data), then I can get a relatively
> large dataset, and both K-means and NB use the SequenceFile format.
> 
> 

Thinking of this again: you could run seqwiki with the -all option set, and 
pass the output of that to seq2sparse.
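
A hedged sketch of that pipeline (paths are placeholders; seqwiki flags as 
listed earlier in the thread):

    $ mahout seqwiki -i ${WORK_DIR}/wikipedia-dump.xml -o ${WORK_DIR}/wikipedia-seq \
        -c ${WORK_DIR}/country.txt -all
    $ mahout seq2sparse -i ${WORK_DIR}/wikipedia-seq -o ${WORK_DIR}/wikipedia-vectors \
        -wt tfidf -lnorm -nv -ow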


RE: any pointer to run wikipedia bayes example

2014-09-05 Thread Andrew Palumbo
Hi Wei,

Thanks for posting your findings!

Andy

Subject: RE: any pointer to run wikipedia bayes example
To: user@mahout.apache.org
From: w...@us.ibm.com
Date: Thu, 4 Sep 2014 14:10:00 -0400


Hi Andrew,

Finally I figured out it probably doesn't have anything to do with HDFS; it 
failed because of filling up the local disk (during the phase between map and 
reduce).

It seems the collocation driver is generating too much output even though I am 
just using 2-grams (on the full Wiki dataset). 60GB per local node (22 nodes 
in total) is not enough to hold the temp data. So I am using unigrams instead; 
I hope that in this way it is also more aligned with the K-means 
vectorization. The vectorization then worked (it takes roughly 5 hours on a 
22-node cluster).

Thanks again!

Wei



Wei Zhang---08/29/2014 04:19:30 PM---Thanks a lot Andrew for the pointers! I 
also tried with a category file with 25 subjects (Art Cultur

From:    Wei Zhang/Watson/IBM@IBMUS
To:      user@mahout.apache.org
Date:    08/29/2014 04:19 PM
Subject: RE: any pointer to run wikipedia bayes example

Thanks a lot Andrew for the pointers!

I also tried a category file with 25 subjects (Art Culture Economics Education 
Event Health History Industry Sports Geography ...). On the 1GB medium 
dataset, it generated roughly 50K data points, with ~65% accuracy. If I factor 
that by 40 (i.e., the full-size dataset), that gives me 2 million data points 
with relatively high dimension, which should be fine for me.

In the past couple of days, I was trying the NB example on the full wiki 
dataset (i.e., an 11GB zip file, ~44GB unzipped).

The cluster that we own (a bit old) has 1.5TB of space (replication factor of 
3, so effectively 0.5TB of free space). The cluster has 22 nodes; each node 
has a 30GB-50GB tmp directory. But NB (on the full-size Wikipedia dump) 
repeatedly failed at 
org.apache.mahout.vectorizer.collocations.llr.CollocReducer, due to the 
complaint "No space left".

A partial exception stack looks like this:

Error: java.io.IOException: No space left on device
  at java.io.FileOutputStream.writeBytes(Native Method)
  at java.io.FileOutputStream.write(FileOutputStream.java:356)
  at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:198)
  at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:93)
  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:137)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
  at java.io.DataOutputStream.write(DataOutputStream.java:118)
  at org.apache.hadoop.mapred.IFileOutputStream.write(IFileOutputStream.java:84)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
  at java.io.DataOutputStream.write(DataOutputStream.java:118)
  at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:218)
  at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:157)
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2659)

Some other exception stack, from another JVM, looks like this:

Creation of /tmp/hadoop-xxx/mapred/local/userlogs/job_201408261641_0027/attempt_201408261641_0027_r_00_1.cleanup failed.
  at org.apache.hadoop.mapred.TaskLog.createTaskAttemptLogDir(TaskLog.java:104)
  at org.apache.hadoop.mapred.DefaultTaskController.createLogDir(DefaultTaskController.java:71)
  at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:316)
  at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:228)

It could be entirely possible that it is time for us to move to a larger 
cluster. I am just curious how much disk space we should expect to use for NB 
on the full wiki dataset?

Thanks!

Wei






RE: Any idea why h20 module compilation is crashing?

2014-09-22 Thread Andrew Palumbo
I just built with Java 1.6 and everything was successful. I tested with 1.7 
before committing it and that was successful as well.  



> Date: Mon, 22 Sep 2014 09:05:46 -0700
> Subject: Any idea why h20 module compilation is crashing?
> From: dlie...@gmail.com
> To: user@mahout.apache.org
> 
> Hello,
> 
> i checked out for the first time the h20 commit, doesn't compile per below .
> 
> scala 2.10.4,
> java 1.7
> 
> java -version
> java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> 
> scala -version
> Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
> dmitriy@Intel-KUBU:~/projects/github/mahout-commits/h2o$
> 
> 
> compilation failure:
> 
> [INFO] includes = [**/*.scala,**/*.java,]
> [INFO] excludes = []
> [INFO] /home/dmitriy/projects/github/mahout-commits/h2o/src/main/java:-1:
> info: compiling
> [INFO] /home/dmitriy/projects/github/mahout-commits/h2o/src/main/scala:-1:
> info: compiling
> [INFO] Compiling 25 source files to
> /home/dmitriy/projects/github/mahout-commits/h2o/target/classes at
> 1411401772429
> [ERROR] error: error while loading , error in opening zip file
> [ERROR] error: scala.reflect.internal.MissingRequirementError: object
> scala.runtime in compiler mirror not found.
> [ERROR] at
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> [ERROR] at
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> [INFO]  at
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> [INFO]  at
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> [INFO]  at
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> [INFO]  at
> scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:172)
> [INFO]  at
> scala.reflect.internal.Mirrors$RootsBase.getRequiredPackage(Mirrors.scala:175)
> [INFO]  at
> scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:183)
> [INFO]  at
> scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:183)
> [INFO]  at
> scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:184)
> [INFO]  at
> scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:184)
> [INFO]  at
> scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1024)
> [INFO]  at
> scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1023)
> [INFO]  at
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1153)
> [INFO]  at
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
> [INFO]  at
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
> [INFO]  at
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
> [INFO]  at
> scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
> [INFO]  at scala.tools.nsc.Global$Run.(Global.scala:1290)
> [INFO]  at scala.tools.nsc.Driver.doCompile(Driver.scala:32)
> [INFO]  at scala.tools.nsc.Main$.doCompile(Main.scala:79)
> [INFO]  at scala.tools.nsc.Driver.process(Driver.scala:54)
> [INFO]  at scala.tools.nsc.Driver.main(Driver.scala:67)
> [INFO]  at scala.tools.nsc.Main.main(Main.scala)
> [INFO]  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [INFO]  at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> [INFO]  at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [INFO]  at java.lang.reflect.Method.invoke(Method.java:606)
> [INFO]  at
> org_scala_tools_maven_executions.MainHelper.runMain(MainHelper.java:161)
> [INFO]  at
> org_scala_tools_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
> [INFO]
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 3.925s
> [INFO] Finished at: Mon Sep 22 09:02:53 PDT 2014
> [INFO] Final Memory: 12M/219M
> [INFO]
> 
> [ERROR] Failed to execute goal
> org.scala-tools:maven-scala-plugin:2.15.2:compile (scala-compile-first) on
> project mahout-h2o: wrap: org.apache.commons.exec.ExecuteException: Process
> exited with an error: 1(Exit value: 1) -> [Hel
  

RE: Any idea why h20 module compilation is crashing?

2014-09-22 Thread Andrew Palumbo

My $SCALA_HOME is actually unset for some reason.  I do have 2.10.3 on my 
machine, but it's not in my path. I'll try setting it and building again.
RE: Any idea why h20 module compilation is crashing?

2014-09-22 Thread Andrew Palumbo
I'll get 2.10.4 and try with that.


RE: Any idea why h20 module compilation is crashing?

2014-09-22 Thread Andrew Palumbo
Yeah, it's building OK. The h2o module seems to build fine here with 2.10.4. 
I'll double-check with the full build.

[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 1:17.878s
[INFO] Finished at: Mon Sep 22 12:59:10 EDT 2014
[INFO] Final Memory: 45M/373M
[INFO] 
[andy@localhost h2o]$ scala -version
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
[andy@localhost h2o]$ java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
[andy@localhost h2o]$

> From: ap@outlook.com
> To: user@mahout.apache.org
> Subject: RE: Any idea why h20 module compilation is crashing?
> Date: Mon, 22 Sep 2014 12:37:54 -0400
> 
> I'll get 2.10.4 and try with that.
> 
> From: ap@outlook.com
> To: user@mahout.apache.org
> Subject: RE: Any idea why h20 module compilation is crashing?
> Date: Mon, 22 Sep 2014 12:24:19 -0400
> 
> 
> 
> 
> 
> my $SCALA_HOME is actually unset for some reason.  I do have 2.10.3 on my 
> machine. But its not in my path. I'll try setting it and building again.  
> From: ap@outlook.com
> To: user@mahout.apache.org
> Subject: RE: Any idea why h20 module compilation is crashing?
> Date: Mon, 22 Sep 2014 12:18:15 -0400
> 
> 
> 
> 
> I just built with Java 1.6 and everything was successful. I tested with 1.7 
> before committing it and that was successful as well.  
> 
> 
> 
> > Date: Mon, 22 Sep 2014 09:05:46 -0700
> > Subject: Any idea why h20 module compilation is crashing?
> > From: dlie...@gmail.com
> > To: user@mahout.apache.org
> > 
> > Hello,
> > 
> > i checked out for the first time the h20 commit, doesn't compile per below .
> > 
> > scala 2.10.4,
> > java 1.7
> > 
> > java -version
> > java version "1.7.0_51"
> > Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> > 
> > scala -version
> > Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
> > dmitriy@Intel-KUBU:~/projects/github/mahout-commits/h2o$
> > 
> > 
> > compilation failure:
> > 
> > [INFO] includes = [**/*.scala,**/*.java,]
> > [INFO] excludes = []
> > [INFO] /home/dmitriy/projects/github/mahout-commits/h2o/src/main/java:-1:
> > info: compiling
> > [INFO] /home/dmitriy/projects/github/mahout-commits/h2o/src/main/scala:-1:
> > info: compiling
> > [INFO] Compiling 25 source files to
> > /home/dmitriy/projects/github/mahout-commits/h2o/target/classes at
> > 1411401772429
> > [ERROR] error: error while loading , error in opening zip file
> > [ERROR] error: scala.reflect.internal.MissingRequirementError: object
> > scala.runtime in compiler mirror not found.
> > [ERROR] at
> > scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> > [ERROR] at
> > scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> > [INFO]  at
> > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> > [INFO]  at
> > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> > [INFO]  at
> > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> > [INFO]  at
> > scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:172)
> > [INFO]  at
> > scala.reflect.internal.Mirrors$RootsBase.getRequiredPackage(Mirrors.scala:175)
> > [INFO]  at
> > scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:183)
> > [INFO]  at
> > scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:183)
> > [INFO]  at
> > scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:184)
> > [INFO]  at
> > scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:184)
> > [INFO]  at
> > scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1024)
> > [INFO]  at
> > scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1023)
> > [INFO]  at
> > scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1153)
> > [INFO]  at
> > scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
> > [INFO]  at
> > scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196)
> > [INFO]  at
> > scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196)
> > [INFO]  at
> > scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261)
> > [INFO]  at scala.tools.nsc.Global$Run.<init>(Global.scala:1290)
> > [INFO]  at sca

RE: Any idea why h20 module compilation is crashing?

2014-09-22 Thread Andrew Palumbo
full project built ok.. Possibly an old artifact in your maven repo from before 
it was committed?
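
If it is a stale artifact, a minimal sketch of one way to rule that out (the 
repository path below is the Maven default -- an assumption):

$ rm -rf ~/.m2/repository/org/apache/mahout   # drop cached mahout artifacts
$ mvn clean install -DskipTests               # rebuild from a clean tree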

> From: ap@outlook.com
> To: user@mahout.apache.org
> Subject: RE: Any idea why h20 module compilation is crashing?
> Date: Mon, 22 Sep 2014 13:00:50 -0400
> 
> yeah its building ok..
> the h2o module seems to build ok here with 2.10.4.  I'll double check with 
> the full build..
> 
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time: 1:17.878s
> [INFO] Finished at: Mon Sep 22 12:59:10 EDT 2014
> [INFO] Final Memory: 45M/373M
> [INFO] 
> 
> [andy@localhost h2o]$ scala -version
> Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
> [andy@localhost h2o]$ java -version
> java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> [andy@localhost h2o]$
> 
> > From: ap@outlook.com
> > To: user@mahout.apache.org
> > Subject: RE: Any idea why h20 module compilation is crashing?
> > Date: Mon, 22 Sep 2014 12:37:54 -0400
> > 
> > I'll get 2.10.4 and try with that.
> > 
> > From: ap@outlook.com
> > To: user@mahout.apache.org
> > Subject: RE: Any idea why h20 module compilation is crashing?
> > Date: Mon, 22 Sep 2014 12:24:19 -0400
> > 
> > 
> > 
> > 
> > 
> > my $SCALA_HOME is actually unset for some reason.  I do have 2.10.3 on my 
> > machine. But its not in my path. I'll try setting it and building again.  
> > From: ap@outlook.com
> > To: user@mahout.apache.org
> > Subject: RE: Any idea why h20 module compilation is crashing?
> > Date: Mon, 22 Sep 2014 12:18:15 -0400
> > 
> > 
> > 
> > 
> > I just built with Java 1.6 and everything was successful. I tested with 1.7 
> > before committing it and that was successful as well.  
> > 
> > 
> > 
> > > Date: Mon, 22 Sep 2014 09:05:46 -0700
> > > Subject: Any idea why h20 module compilation is crashing?
> > > From: dlie...@gmail.com
> > > To: user@mahout.apache.org
> > > 
> > > Hello,
> > > 
> > > i checked out for the first time the h20 commit, doesn't compile per 
> > > below .
> > > 
> > > scala 2.10.4,
> > > java 1.7
> > > 
> > > java -version
> > > java version "1.7.0_51"
> > > Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> > > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> > > 
> > > scala -version
> > > Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
> > > dmitriy@Intel-KUBU:~/projects/github/mahout-commits/h2o$
> > > 
> > > 
> > > compilation failure:
> > > 
> > > [INFO] includes = [**/*.scala,**/*.java,]
> > > [INFO] excludes = []
> > > [INFO] /home/dmitriy/projects/github/mahout-commits/h2o/src/main/java:-1:
> > > info: compiling
> > > [INFO] /home/dmitriy/projects/github/mahout-commits/h2o/src/main/scala:-1:
> > > info: compiling
> > > [INFO] Compiling 25 source files to
> > > /home/dmitriy/projects/github/mahout-commits/h2o/target/classes at
> > > 1411401772429
> > > [ERROR] error: error while loading , error in opening zip file
> > > [ERROR] error: scala.reflect.internal.MissingRequirementError: object
> > > scala.runtime in compiler mirror not found.

RE: Categorization of documents using clustering and classification

2014-10-24 Thread Andrew Palumbo

Hello Hersheeta,

Are you vectorizing the new text using the same dictionary as you used to train 
the models?  If not, this will likely severely impact the performance of the 
classifier.



> Date: Fri, 24 Oct 2014 21:28:06 +0530
> Subject: Categorization of documents using clustering and classification
> From: hersheetachandan...@gmail.com
> To: user@mahout.apache.org
> 
> Hi,
> 
> I have a collection of crawled text documents on different topics which I
> want to categorize into pre-decided categories like travel,sports,education
> etc.
> For this I've firstly clustered these documents using k-means clustering
> and then built a complimentary-naive bayes model of these clustered
> documents.
> The accuracy and reliability of the model was 83% & 63% respectively.
> Now the problem is that, on deploying the model the results recorded are
> absurd
> (eg- A sports document is categorized under business category).
> On analyzing the problem, I found that the clusters formed were not clean
> (contained unrelated documents) which may have led to creation of wrong
> dictionary file.
> 
> In order to avoid this, is there any other way to get the input data
> preprocessed and clustered ?
> or
> Is there any other alternative approach that could be used for the
> categorization?
> 
> Thanks,
> -Hersheeta
  

RE: Naive Bayes Classification

2014-11-11 Thread Andrew Palumbo
The Naive Bayes model is serialized automatically as "naiveBayesModel.bin" by 
TrainNaiveBayesJob.java [1] (the driver for $ mahout trainnb).

To give you an idea of how it's done, the serialization code can be found in 
NaiveBayesModel.java at line 135 [2].

[1] 
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/training/TrainNaiveBayesJob.java
[2] 
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/NaiveBayesModel.java
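
For illustration, a minimal sketch (Java) of reading the serialized model back; 
the path argument is assumed to be whatever was passed as -o to `$ mahout trainnb`:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;

public class LoadNaiveBayesModel {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // materialize() reads naiveBayesModel.bin back from the model
    // directory written by TrainNaiveBayesJob
    NaiveBayesModel model = NaiveBayesModel.materialize(new Path(args[0]), conf);
    model.validate();
    System.out.println("numFeatures: " + model.numFeatures());
  }
}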


> To: user@mahout.apache.org
> From: vanyasinha110...@gmail.com
> Subject: Naive Bayes Classification
> Date: Tue, 11 Nov 2014 10:59:54 +
> 
> Hi. I'm new to Mahout and am building a classification mechanism on 
> Windows, which needs me to save the trained model into a file. The 
> ModelSerializer class is only available for SGD based algorithms from the 
> looks of it. Can someone please help me with saving the Complementary 
> Naive Bayes model after training along with its conf and other helper 
> files into a given path?
> 
  

RE: Insights to Naive Bayes classifier example - 20news groups

2014-12-01 Thread Andrew Palumbo
Hi Jakub,

The step that you are missing is `$ mahout seqdirectory ...`.  In this step each file 
in each directory (where the directory is the Category) is converted into a 
sequence file of form <Text, Text>, where the Text key is /Category/doc_id.

`$ mahout seq2sparse ...` vectorizes the output of `$ mahout seqdirectory ...` into a 
sequence file of form <Text, VectorWritable>, leaving the Keys unchanged.  

`$ mahout trainnb ... -el ...` then extracts the label from the Keys of the 
training data, i.e. the "Category" from /Category/doc_id.  

please see http://mahout.apache.org/users/classification/twenty-newsgroups.html
and http://mahout.apache.org/users/classification/bayesian.html
for more information.
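
For illustration, you can peek at the keys at each stage (paths are 
placeholders; the vector indices and weights below are made up):

$ hadoop fs -text ${WORK_DIR}/20news-seq/chunk-0 | head -n 1
/rec.sport.hockey/53580  From: ... Subject: ...

$ hadoop fs -text ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000 | head -n 1
/rec.sport.hockey/53580  {1234:0.317,5678:0.221,...}

`$ mahout trainnb ... -el ...` then splits that key on "/" and takes 
"rec.sport.hockey" as the label.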

> Date: Mon, 1 Dec 2014 17:09:55 +0100
> Subject: Insights to Naive Bayes classifier example - 20news groups
> From: stransky...@gmail.com
> To: user@mahout.apache.org
> 
> Hello Mahout experts,
> 
> I am trying to follow some examples provided with Mahout and some features
> are not clear to me. It would be great if someone could clarify a bit more.
> 
> To prepare a the data (train and test) the following sequence of steps is
> perfomed (taken from mahout cookbook):
> 
> All input is merged into single dir:
> *cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all*
> 
> Converted to hadoop sequence file and then vectorized:
> *./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-**vectors
> -lnorm -nv -wt tfidf*
> 
> Devided to test and train data:
> *./mahout split*
> *-i ${WORK_DIR}/20news-vectors/tfidf-vectors*
> *--trainingOutput ${WORK_DIR}/20news-train-vectors*
> *--testOutput ${WORK_DIR}/20news-test-vectors*
> *--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential*
> 
> Model is trained:
> *./mahout trainnb*
> *-i ${WORK_DIR}/20news-train-vectors -el*
> *-o ${WORK_DIR}/model*
> *-li ${WORK_DIR}/labelindex*
> *-ow*
> 
> 
> What I am missing here and that is subject of my question is: Where is the
> category assigned to the testing data to train the categorization? What I
> would expect is that there will be vector which says that this document
> belongs to a particular category. This seems to me has been ereased by
> first step where we mixed all the data to create our corpus. I would still
> expect that this information will be somewhere retained. Instead the
> messages looks as follows:
> 
> From: y...@a.cs.okstate.edu (YEO YEK CHONG)
> Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
> Organization: Oklahoma State University
> Lines: 7
> 
> From article , by Steve Frampton <
> framp...@vicuna.ocunix.on.ca>:
> > I was wondering, is the "Kermit" package (the actual package, not a
> 
> Yes!  In the usual ftp sites.
> 
> Yek CHong
> 
> 
> There is no notion from which group this text belongs to. What's the hack!
> 
> Could someone please clarify a bit what's going on as when crosswalidation
> is performed - confusion matrix takes into consideration those categories.
> 
> Thanks a lot for helping me out
> Jakub
  

RE: Insights to Naive Bayes classifier example - 20news groups

2014-12-01 Thread Andrew Palumbo

> All input is merged into single dir:
> *cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all*
 
As well, the above line should read as follows:
$ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
see: http://mahout.apache.org/users/classification/twenty-newsgroups.html

  

RE: Insights to Naive Bayes classifier example - 20news groups

2014-12-01 Thread Andrew Palumbo



> However the sequence of steps as described in Mahout Cookbook seems to me
> incorrect as:

This is entirely possible; that book may be out of date. The end-to-end 
instructions on the website for the 20 newsgroups example are up to date though, 
as is the example script. 

You don't want to merge all of the files into one directory, but rather to merge 
the training and testing sets in 20news-bydate while maintaining their 
directory structure.  

> After data set download and extraction data are merged via command:
> *cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all*
> 
> Which essentially copies files to a single location -> 20news-all folder

This should not copy all of the *files* individually into the 20news-all folder, 
but rather the directories containing the files:

$ ls 20news-all/
alt.atheism   rec.autos   sci.space
comp.graphics rec.motorcycles soc.religion.christian
{...}
 
> *./mahout seqdirectory  -i ${WORK_DIR}/20news-all  -o
> ${WORK_DIR}/20news-seq*
> Converts to a hadoop sequence directory from 20news-all dir - where all
> files were copied and efffectively the classification to folders were lost.
> We can peek inside a created seq file via hadoop fs -text
> $WORK_DIR/20news-seq/chunck-0 | more which prints following result:
> 
> */67399* From:xxx
> Subject: Re: Imake-TeX: looking for beta testers
> Organization: CS Department, Dortmund University, Germany
> Lines: 59
> Distribution: world
> NNTP-Posting-Host: tommy.informatik.uni-dortmund.de
> In article ,
> yyy writes:
> |> As I announced at the X Technical Conference in January, I would
> like
> |> to
> |> make Imake-TeX, the Imake support for using the TeX typesetting
> system,
> |> publically available. Currently Imake-TeX is in beta test here at
> the
> |> computer science department of Dortmund University, and I am
> looking
> ...
> 
> To my understanding - number after slash in bold represents a key of
> sequence file, right?

Correct, though it should read something like:

/comp.graphics/67399 {...}

where comp.graphics is the category as well as the directory that it was read 
in from.

> Then seq2sparse is performed:
> 
> ./mahout seq2sparse  -i ${WORK_DIR}/20news-seq vectors -lnorm -nv  -wt
> tfidf -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
> 
> 
> *Conclusions which I would like to verify:*
> - sequence of steps as described is incorrect - particularly conversion to
> sequence file as the key doesn't contain folder name describing the
> category of training data, or am I still missing something in here?

Yes - it looks like you are copying the individual files rather than the 
directories into 20news-all.

> 
> - mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o
> ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow
>   What are the exact mechanics when label extraction is performed e.g.
> /category/docID as a key is resolved just to category ???

yes

> Does every time
> the last part after the slash is dropped as a category?? Or is is possible
> to define the strategy somewhere?

The hard-coded convention as of Mahout 0.9 is to extract the label as the first 
string after the key is split on "/".  This makes category organization by 
directory and sequence file conversion with seqdirectory straightforward.  The 
new scala DSL Naive Bayes which is currently in development will allow the user 
more flexibility in extracting the label.

The label extraction process can be found here: 
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/training/IndexInstancesMapper.java

and could be modified if need be.
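
For illustration, the extraction amounts to something like this simplified 
sketch (Java; not the mapper's exact code):

import java.util.regex.Pattern;

public class LabelFromKey {
  private static final Pattern SLASH = Pattern.compile("/");

  // a key of the form "/category/doc_id" yields "category"
  static String extractLabel(String key) {
    // splitting "/comp.graphics/67399" gives ["", "comp.graphics", "67399"]
    return SLASH.split(key)[1];
  }

  public static void main(String[] args) {
    System.out.println(extractLabel("/comp.graphics/67399")); // comp.graphics
  }
}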
   
> 
> Thanks
> Jakub
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On 1 December 2014 at 17:43, Andrew Palumbo  wrote:
> 
> > Hi Jakub,
> >
> > The step that you are missing is `$mahout seqdir ...`.   in this step each
> > file in each directory (where the directory is the Category) is converted
> > into a sequence file of form <Text, Text>, where the Text key is
> > /Category/doc_id.
> >
> > `$mahout seq2sparse ...` vectorizes the output of `$mahout seqdir ...`
> > into a sequence file of form <Text, VectorWritable>, leaving the Keys
> > unchanged.
> >
> > `$mahout trainnb ... -el ...` then extracts the label from the Keys of the
> > training data ie. the "Category" from /Category/doc_id.
> >
> > please see
> > http://mahout.apache.org/users/classification/twenty-newsgroups.html
> > and http://mahout.apache.org/users/classification/bayesian.html
> > for more information.
> >
> > > Date: Mon, 1 Dec 2014 17:09:55 +0100
> > > Subject: Insights

RE: Insights to Naive Bayes classifier example - 20news groups

2014-12-02 Thread Andrew Palumbo


> Date: Tue, 2 Dec 2014 14:06:44 +0100
> Subject: Re: Insights to Naive Bayes classifier example - 20news groups
> From: stransky...@gmail.com
> To: user@mahout.apache.org
> 
> Hi Andrew,
> 
> many thanks for final clarification! Now I have last question - probably
> the most obvious but I missed it somewhere probably. Because all the
> examples ends up by testing the classifier - display confusion matrix.  So
> the state is:
> We have a trained and tested model and now we would like to use the model
> to classify  unseen, unknown data - actually use the classifier. For sure
> it is clear how to prepare the input - vectorize etc. What is not clear to
> me at the moment is how do I call trained model with new vectorized data as
> an input. Or may be even the vectorization itself - because we need
> probably the dictionary used by model to produce a valid vectors. What
> about terms which we not in the training set etc.
> 
> Is there any documentation regarding this aspect?

As of Mahout 0.9 there are no CLI drivers available to vectorize and classify 
new documents.  There is a ticket open for Mahout 1.0 regarding this.  
Currently you'll have to write a utility class to vectorize and classify new 
documents.  As you mentioned, you'll need to use the same dictionary.file-0 as 
is created by seq2sparse for training.  As well, if you're using TF-IDF weights 
you'll need to use the same df-count file to compute the IDF.  Both are located 
in the directory output by seq2sparse.  You'll also want to use the same 
maxNgramSize as you used to train the model.  If you want to keep it simple by 
using unigrams, you can avoid Lucene integration and just keep a count of the 
occurrences of tokenized terms.  Terms unseen by the training set can be rejected.

Once the document is vectorized, you can use BayesUtils.readModelFromDir(..) to 
retrieve your model, BayesUtils.readLabelIndex(..) [1] to read the label index, and 
(Complementary)StandardNaiveBayesClassifier.classifyFull(...) [2] to classify your 
vector. You can also look at TestNaiveBayesDriver.analyzeResults(...) [3] to see how 
labels are assigned.

There's no documentation on the Mahout site at the moment. There is a good blog 
post here that can give you an idea of how to get started:

https://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/

[1] 
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/BayesUtils.java
[2] 
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/StandardNaiveBayesClassifier.java
[3] 
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/test/TestNaiveBayesDriver.java
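
To tie those pieces together, here is a rough sketch of such a utility class 
(Java, against the 0.9 mrlegacy APIs).  The class name, all paths, and the 
sample document are placeholders, and it uses simple whitespace unigram 
tokenization in place of a Lucene analyzer:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.TFIDF;

public class ClassifyNewDocSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // term -> vector index, from the training run of seq2sparse
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(
        new Path("20news-vectors/dictionary.file-0"), true, conf)) {
      dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
    }

    // vector index -> document frequency; key -1 holds the total doc count
    Map<Integer, Long> docFreq = new HashMap<Integer, Long>();
    for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(
        new Path("20news-vectors/df-count/part-r-00000"), true, conf)) {
      docFreq.put(pair.getFirst().get(), pair.getSecond().get());
    }
    int numDocs = docFreq.get(-1).intValue();

    // unigram term counts; terms unseen by the training set are rejected
    String doc = "is the kermit package available for windows";
    Map<Integer, Integer> termFreqs = new HashMap<Integer, Integer>();
    int docLength = 0;
    for (String term : doc.toLowerCase().split("\\W+")) {
      Integer index = dictionary.get(term);
      if (index == null) continue;
      Integer prev = termFreqs.get(index);
      termFreqs.put(index, prev == null ? 1 : prev + 1);
      docLength++;
    }

    // weight the document with the *training* df-counts
    Vector vector = new RandomAccessSparseVector(dictionary.size());
    TFIDF tfidf = new TFIDF();
    for (Map.Entry<Integer, Integer> e : termFreqs.entrySet()) {
      int df = docFreq.get(e.getKey()).intValue();
      vector.setQuick(e.getKey(), tfidf.calculate(e.getValue(), df, docLength, numDocs));
    }

    // classify and map the winning score index back to its label
    NaiveBayesModel model = BayesUtils.readModelFromDir(new Path("model"), conf);
    Vector scores = new StandardNaiveBayesClassifier(model).classifyFull(vector);
    Map<Integer, String> labels = BayesUtils.readLabelIndex(conf, new Path("labelindex"));
    System.out.println("best label: " + labels.get(scores.maxValueIndex()));
  }
}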

 
> 
> Thx
> Jakub
> 
> 
> 
> On 1 December 2014 at 21:12, Andrew Palumbo  wrote:
> 
> >
> >
> >
> > > However the sequence of steps as described in Mahout Cookbook seems to me
> > > incorrect as:
> >
> > this is entirely possible, that book may be out of date. The end to end
> > instructions on the website for the 20 newsgroups example is up to date
> > though.  As is the example script.
> >
> > You don't want to merge all of the files into one directory, rather to
> > merge the training and testing sets in 20news-bydate while maintaining
> > their directory structure.
> >
> > > After data set download and extraction data are merged via command:
> > > *cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all*
> > >
> > > Which essentially copies files to a single location -> 20news-all folder
> >
> > this should not copy all of the *files* individually into the 20news-all
> > folder rather the directories containing the files:
> >
> > $ ls 20news-all/
> > alt.atheism   rec.autos   sci.space
> > comp.graphics rec.motorcycles soc.religion.christian
> > {...}
> >
> > > *./mahout seqdirectory  -i ${WORK_DIR}/20news-all  -o
> > > ${WORK_DIR}/20news-seq*
> > > Converts to a hadoop sequence directory from 20news-all dir - where all
> > > files were copied and efffectively the classification to folders were
> > lost.
> > > We can peek inside a created seq file via hadoop fs -text
> > > $WORK_DIR/20news-seq/chunck-0 | more which prints following result:
> > >
> > > */67399* From:xxx
> > > Subject: Re: Imake-TeX: looking for beta testers
> > > Organization: CS Department, Dortmund University, Germany
> > > Lines: 59
> > > Distribution: world
> > > NNTP-Posting-Host: tommy.informatik.uni-dortmund.de
> &

RE: Providing classification labels to Naive Bayes

2014-12-16 Thread Andrew Palumbo
Hi Suman,

Attachments don't come through on the user list.  Would you mind starting a 
Jira issue for this with a small example of your data and the error that 
you're receiving? This may be a feature that was not fully implemented in the 
most recent MapReduce version of Naive Bayes.

Thanks,

Andy


> Date: Tue, 16 Dec 2014 14:38:23 -0800
> From: suman.somasun...@oracle.com
> To: user@mahout.apache.org
> CC: tell2jy...@gmail.com
> Subject: RE: Providing classification labels to Naive Bayes
> 
> Hi,
> 
> Attached is the sample dataset.
> 
> I'm using the latest version of Mahout (downloaded from git).
> 
> Thanks,
> Suman.
> 
> -Original Message-
> From: jyotiranjan panda [mailto:tell2jy...@gmail.com] 
> Sent: Sunday, December 14, 2014 7:23 PM
> To: user@mahout.apache.org
> Subject: Re: Providing classification labels to Naive Bayes
> 
> Can you give the training dataset example.
> 
> Regards
> Jyoti Ranjan Panda
> 
> On Sat, Dec 13, 2014 at 4:16 AM, Suman Somasundar < 
> suman.somasun...@oracle.com> wrote:
> >
> > Hi,
> >
> >
> >
> > I tried to run Naïve Bayes program on a digit recognition data set. If 
> > the program itself extracts the labels, it runs fine.
> >
> >
> >
> > If I provide a file which contains the labels, then the program throws 
> > an exception saying that the number of labels is 0.
> >
> >
> >
> > Why is this happening?
> >
> >
> >
> > Thanks,
> > Suman.
> >
  

RE: Providing classification labels to Naive Bayes

2014-12-18 Thread Andrew Palumbo
Thank you Suman.

 Original message 
From: Suman Somasundar 
Date: 12/18/2014 6:01 PM (GMT-05:00)
To: ap@outlook.com
Cc: user@mahout.apache.org
Subject: RE: Providing classification labels to Naive Bayes 


Hi,

I have created the JIRA MAHOUT-1635 with respect to this issue.

Thanks,
Suman.

-Original Message-
From: Andrew Palumbo [mailto:ap@outlook.com]
Sent: Tuesday, December 16, 2014 3:06 PM
To: user@mahout.apache.org
Subject: RE: Providing classification labels to Naive Bayes

Hi Suman,

Attachments don't come through on the user list.  Would you mind starting a 
Jira issue for this with a small example of your data and the error that 
you're receiving? This may be a feature that was not fully implemented in the 
most recent MapReduce version of Naive Bayes.

Thanks,

Andy


> Date: Tue, 16 Dec 2014 14:38:23 -0800
> From: suman.somasun...@oracle.com
> To: user@mahout.apache.org
> CC: tell2jy...@gmail.com
> Subject: RE: Providing classification labels to Naive Bayes
>
> Hi,
>
> Attached is the sample dataset.
>
> I'm using the latest version of Mahout (downloaded from git).
>
> Thanks,
> Suman.
>
> -Original Message-
> From: jyotiranjan panda [mailto:tell2jy...@gmail.com]
> Sent: Sunday, December 14, 2014 7:23 PM
> To: user@mahout.apache.org
> Subject: Re: Providing classification labels to Naive Bayes
>
> Can you give the training dataset example.
>
> Regards
> Jyoti Ranjan Panda
>
> On Sat, Dec 13, 2014 at 4:16 AM, Suman Somasundar < 
> suman.somasun...@oracle.com> wrote:
> >
> > Hi,
> >
> >
> >
> > I tried to run Naïve Bayes program on a digit recognition data set.
> > If the program itself extracts the labels, it runs fine.
> >
> >
> >
> > If I provide a file which contains the labels, then the program
> > throws an exception saying that the number of labels is 0.
> >
> >
> >
> > Why is this happening?
> >
> >
> >
> > Thanks,
> > Suman.
> >



Re: Importing tfidf from training set

2015-03-17 Thread Andrew Palumbo
If you vectorized your training data with seq2sparse, you'll need to use 
the df-count and dictionary from the training set.  You can then 
tokenize a new document with a Lucene analyzer and count the term 
frequencies for all terms in the dictionary.  Then use the 
TFIDF class:


https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/TFIDF.java

with the corresponding df-count for each term from the training set for 
the TF-IDF transformation.
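
For example, the weight of a single term would be computed along these lines 
(a sketch -- the counts are made-up placeholders, with tf counted in the new 
document and df/numDocs taken from the training set's df-count):

import org.apache.mahout.vectorizer.TFIDF;

public class TfIdfWeightSketch {
  public static void main(String[] args) {
    TFIDF tfidf = new TFIDF();
    int tf = 3;           // occurrences of the term in the new document
    int df = 120;         // document frequency of the term in the training set
    int docLength = 250;  // number of terms in the new document
    int numDocs = 18846;  // total training documents (df-count key -1)
    System.out.println("weight: " + tfidf.calculate(tf, df, docLength, numDocs));
  }
}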




On 03/17/2015 04:46 AM, mw wrote:

Hello,

i am running lda on a training set to create a topic model.
For calculating p(topic|document) on unseen data i need to import the 
inverse document frequency from the training set.

Is there a way to do that in mahout?

Best,
Max




Re: [VOTE] Apache Mahout 0.10.0 Release

2015-04-11 Thread Andrew Palumbo
After testing examples locally from the .tar and .zip distributions and 
testing the staged mahout-math artifact in a Java application, I am happy 
with this release.


+1 (binding)

On 04/11/2015 11:45 AM, Suneel Marthi wrote:

After checking the {source} * {tar,zip} and running a few tests locally, I
am fine with this release.

+1 (binding)

On Sat, Apr 11, 2015 at 11:43 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:


After checking the binary tarball and zip, and running through all the
examples on an EMR cluster, I am good with this release.

+1 (binding)

On Fri, Apr 10, 2015 at 9:34 PM, Ted Dunning 
wrote:


Ah... forgot this.

+1 (binding)

On Fri, Apr 10, 2015 at 11:14 PM, Ted Dunning 
wrote:


I downloaded and tested the signatures and check-sums on {binary,source} x 
{zip,tar} + pom.  All were correct.

One thing that I worry a little about is that the name of the artifact
doesn't include "apache".  Not sure that is a hard requirement, but it
seems a good thing to do.



On Fri, Apr 10, 2015 at 8:16 PM, Suneel Marthi <suneel.mar...@gmail.com> wrote:


Here's a new Mahout 0.10.0 Release Candidate at



https://repository.apache.org/content/repositories/orgapachemahout-1007/

The Voting for this ends tomorrow.  Need at least 3 PMC +1 votes for the 
release to pass.

Grant, Ted:  Would appreciate if u guys could verify the signatures.


Rest: Please test the artifacts.

Thanks to all the contributors and committers.

Regards,
Suneel

On Fri, Apr 10, 2015 at 11:45 AM, Pat Ferrel 
wrote:


Ran well but we have a packaging problem with the binary distro. Will 
require either a pom or code change I think, hold the vote.



On Apr 9, 2015, at 4:31 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:

Running on EMR now.

On Thu, Apr 9, 2015 at 3:52 PM, Pat Ferrel  wrote:

I can't run it (due to messed up dev machine) but I verified the artifacts 
building an external app with sbt using the staged repo instead of my local 
.m2 cache. This means the Scala classes were resolved correctly from the 
artifacts.

Hope someone can actually run it on a cluster


On Apr 9, 2015, at 2:42 PM, Suneel Marthi <suneel.mar...@gmail.com> wrote:

Please find the Mahout 0.10.0 release candidate at


https://repository.apache.org/content/repositories/orgapachemahout-1005/

The Voting runs till Saturday, April 11 2015; need at least 3 PMC +1 votes 
for the candidate release to pass.

Thanks again to all the committers and contributors for their hard work over 
the past few weeks.

Regards,
Suneel
On Behalf of Apache Mahout Team










RE: trainnb labelindex not found error - help requested

2015-04-27 Thread Andrew Palumbo
It looks like you have a mahout 0.9 install trying to run the mahout 0.10.0 
Naive Bayes script.  The command line options have changed slightly for mahout 
0.10.0 MapReduce trainnb.

>mahout-examples-0.9-cdh5.3.0-job.jar
15/04/27 16:41:27 WARN 

Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Erdem Sahin 
Date: 04/27/2015 7:58 PM (GMT-05:00)
To: user@mahout.apache.org
Subject: trainnb labelindex not found error - help requested 
Hi Mahout users,

I'm trying to run the classify-20-newsgroups.sh script and it fails with a 
FileNotFoundException when it gets to the "trainnb" command. All prior steps 
run successfully. I'm trying algo 1 or algo 2.

I have modified the script slightly so that it reads my input data instead
of the canonical data set. I've created a "wifidata" folder on the local FS
which has the following structure:
wifidata/havewifi
wifidata/nowifi

and within havewifi and nowifi, there exist files with text file names and 
text content. These eventually get copied to HDFS.

I'm not clear if the "labelindex" file, which cannot be found, is supposed
to be created by trainnb or by a prior step.

Please see the details of the modified script and the error below. Any help
would be appreciated.

Thanks and best regards,
Erdem Sahin

Script:

if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
  echo "This script runs SGD and Bayes classifiers over the classic 20 News
Groups."
  exit
fi

SCRIPT_PATH=${0%/*}
if [ "$0" != "$SCRIPT_PATH" ] && [ "$SCRIPT_PATH" != "" ]; then
  cd $SCRIPT_PATH
fi
START_PATH=`pwd`

# Set commands for dfs
source ${START_PATH}/set-dfs-commands.sh

WORK_DIR=/tmp/mahout-work-${USER}
algorithm=( cnaivebayes-MapReduce naivebayes-MapReduce cnaivebayes-Spark
naivebayes-Spark sgd clean)
if [ -n "$1" ]; then
  choice=$1
else
  echo "Please select a number to choose the corresponding task to run"
  echo "1. ${algorithm[0]}"
  echo "2. ${algorithm[1]}"
  echo "3. ${algorithm[2]}"
  echo "4. ${algorithm[3]}"
  echo "5. ${algorithm[4]}"
  echo "6. ${algorithm[5]}-- cleans up the work area in $WORK_DIR"
  read -p "Enter your choice : " choice
fi

echo "ok. You chose $choice and we'll use ${algorithm[$choice-1]}"
alg=${algorithm[$choice-1]}

# Spark specific check and work
if [ "x$alg" == "xnaivebayes-Spark" -o "x$alg" == "xcnaivebayes-Spark" ];
then
  if [ "$MASTER" == "" ] ; then
echo "Please set your MASTER env variable to point to your Spark Master
URL. exiting..."
exit 1
  fi
  if [ "$MAHOUT_LOCAL" != "" ] ; then
echo "Options 3 and 4 can not run in MAHOUT_LOCAL mode. exiting..."
exit 1
  fi
fi

#echo $START_PATH
cd $START_PATH
cd ../..

set -e

if  ( [ "x$alg" == "xnaivebayes-MapReduce" ] ||  [ "x$alg" ==
"xcnaivebayes-MapReduce" ] || [ "x$alg" == "xnaivebayes-Spark"  ] || [
"x$alg" == "xcnaivebayes-Spark" ] ); then
  c=""

  if [ "x$alg" == "xcnaivebayes-MapReduce" -o "x$alg" ==
"xnaivebayes-Spark" ]; then
c=" -c"
  fi

  set -x
  echo "Preparing 20newsgroups data"
  rm -rf ${WORK_DIR}/20news-all
  mkdir ${WORK_DIR}/20news-all
  cp -R $START_PATH/wifidata/* ${WORK_DIR}/20news-all


  echo "Copying 20newsgroups data to HDFS"
  set +e
  $DFSRM ${WORK_DIR}/20news-all
  $DFS -mkdir ${WORK_DIR}
  $DFS -mkdir ${WORK_DIR}/20news-all
  set -e
  if [ $HVERSION -eq "1" ] ; then
  echo "Copying 20newsgroups data to Hadoop 1 HDFS"
  $DFS -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
  elif [ $HVERSION -eq "2" ] ; then
  echo "Copying 20newsgroups data to Hadoop 2 HDFS"
  $DFS -put ${WORK_DIR}/20news-all ${WORK_DIR}/
  fi


  echo "Creating sequence files from 20newsgroups data"
  /usr/bin/mahout seqdirectory \
-i ${WORK_DIR}/20news-all \
-o ${WORK_DIR}/20news-seq -ow

  echo "Converting sequence files to vectors"
  /usr/bin/mahout seq2sparse \
-i ${WORK_DIR}/20news-seq \
-o ${WORK_DIR}/20news-vectors  -lnorm -nv  -wt tfidf

  echo "Creating training and holdout set with a random 80-20 split of the
generated vector dataset"
  /usr/bin/mahout split \
-i ${WORK_DIR}/20news-vectors/tfidf-vectors \
--trainingOutput ${WORK_DIR}/20news-train-vectors \
--testOutput ${WORK_DIR}/20news-test-vectors  \
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

if [ "x$alg" == "xnaivebayes-MapReduce"  -o  "x$alg" ==
"xcnaivebayes-MapReduce" ]; then

  echo "Training Naive Bayes model"
  /usr/bin/mahout trainnb \
-i ${WORK_DIR}/20news-train-vectors \
-o ${WORK_DIR}/model \
-li ${WORK_DIR}/labelindex \
-ow $c

  echo "Self testing on training set"

  /usr/bin/mahout testnb \
-i ${WORK_DIR}/20news-train-vectors\
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing $c

  echo "Testing on holdout set"

  /usr/bin/mahout testnb \
-i ${WORK_DIR}/20news-test-vectors\
-m ${W

Re: [VOTE] Mahout 0.10.1 Release Candidate

2015-05-31 Thread Andrew Palumbo

+1 (binding)

Ran (on Hadoop 2.4.1 + Spark 1.2.1) all examples with all options in the 
.tar.gz binary archive in pseudo-cluster mode and one with 
MAHOUT_LOCAL=true, with only the previously noted minor data issue, which 
I agree can wait for the next release.


Ran a mix and match of the .zip binary archive examples with 
MAHOUT_LOCAL=true and in pseudo-cluster mode without issue.


Tested the shell from both archives for qr and matrix display fixes.


On 05/31/2015 12:09 PM, Pat Ferrel wrote:

+1 (binding)

Verified on Spark 1.3, pseudo-clustered HDFS 2.4

There are some cleanup of example data issues that can wait for next release.


On May 30, 2015, at 8:16 PM, Suneel Marthi  wrote:

Verified local build and tests for {source} * {zip, tar}. No issues found.

+1 (binding)

On Sat, May 30, 2015 at 11:14 PM, Suneel Marthi  wrote:


Andrew Palumbo / Dmitriy:  Please also verify the various scenarios as
described in M-1693

On Sat, May 30, 2015 at 10:32 PM, Suneel Marthi 
wrote:


Here's the new 0.10.1 Release Candidate


https://repository.apache.org/content/repositories/orgapachemahout-1009/org/apache/mahout/apache-mahout-distribution/0.10.1/

The Voting ends on Sunday, May 31 2015.

Need a +1 from the PMC for each of the line items below for the release
to pass.

1. Ted/Grant:  Verify hashes and checksums - {binary,source} x {zip,tar}
+ pom

2. AKM:  Verify examples on EMR  - {binary, source} * {zip, tar}

3. Andrew Palumbo: Verify examples locally - {binary} * {zip, tar}

4. Suneel: Verify build and tests - {source} * {zip, tar}

5. Pat:  Verify examples locally - {source} * {zip, tar}

The LICENSE and NOTICE files have not been updated this time and will be
addressed in future releases.



On Sat, May 30, 2015 at 8:32 PM, Suneel Marthi 
wrote:


Please hold ur votes, will be refreshing staging with another build in
the next hour

On Sat, May 30, 2015 at 8:31 PM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:


Likewise source zip and tarballs build and pass tests.

On Sat, May 30, 2015 at 3:23 PM, Suneel Marthi 
wrote:


Verified {source} * {zip, tar} and all tests pass.

+1 (binding)

On Sat, May 30, 2015 at 5:28 PM, Suneel Marthi  wrote:

This is a call for VOTE to pass Mahout 0.10.1 release candidate that's 
available at

https://repository.apache.org/content/repositories/orgapachemahout-1008/org/apache/mahout/mahout-distribution/0.10.1/

Need at least 3 PMC +1 (binding) votes to cut the release

Below are the tasks breakdown for the PMC and committers:

Andy Palumbo & Pat Ferrel: verify the binary artifacts and run tests

Suneel & AKM:  verify the src artifacts

Ted/Grant/Drew: verify the hashes and Sigs

The LICENSE.txt and NOTICE.txt still need to be updated and will not be 
addressed as part of the 0.10.1 release.









deprecation of lucene2seq

2015-07-03 Thread Andrew Palumbo
Please note that mahout lucene2seq and all related classes will be 
deprecated as of the upcoming Mahout 0.10.2 release.


Thank You,

Andy




Re: [VOTE] Apache Mahout 0.10.2 Release Candidate

2015-08-01 Thread Andrew Palumbo

Verified source tar and zip, all tests pass.

Ran through all options of the classification and clustering examples in 
the binary tar.gz distribution in pseudo-cluster mode for MR and Spark 
without incident.


Ran through one option each in the .zip  Classification and Clustering 
examples in both pseudo-cluster and MAHOUT_LOCAL mode without incident.


Verified spark-document-classifier.mscala example from the spark-shell 
in both .zip and .tar.gz binaries.


+1 (binding)

On 08/01/2015 12:44 AM, Suneel Marthi wrote:

Verified {src} * {bin, tar} and all tests pass.

+1 (binding)



On Fri, Jul 31, 2015 at 11:56 PM, Suneel Marthi  wrote:


This is a call for Votes for Mahout 0.10.2 Release candidate available at
https://repository.apache.org/content/repositories/orgapachemahout-1011

Need at least 3 PMC +1 votes for the RC to pass. Voting runs until Sunday
Aug 2, 2015.


Please verify the following:

1. Sigs and Hashes of Release artifacts (Ted/Drew/Grant/Stevo)
2. AWS testing of {src, bin} * {tar, zip}  (Andrew ?)
3. Integration testing of {src,bin} * {tar,zip} (Suneel/AP/)
4. Run thru Examples and scripts








[ANNOUNCE] Apache Mahout 0.11.0 Release

2015-08-07 Thread Andrew Palumbo
The Apache Mahout PMC is pleased to announce the release of Mahout 0.11.0.

Mahout's goal is to create an environment for quickly creating machine learning 
applications that scale and run on the highest performance parallel computation 
engines available. Mahout comprises an interactive environment and library that 
supports generalized scalable linear algebra and includes many modern machine 
learning algorithms.


We call the Mahout math environment “Samsara”, for its symbol of universal 
renewal. It reflects a fundamental rethinking of how scalable machine learning 
algorithms are built and customized. Mahout-Samsara is here to help people 
create their own math while providing some off-the-shelf algorithm 
implementations. At its base are general linear algebra and statistical 
operations along with the data structures to support them. It’s written in 
Scala with Mahout-specific extensions, and runs most fully on Spark.


To get started with Apache Mahout 0.11.0, download the release artifacts and 
signatures from http://www.apache.org/dist/mahout/0.11.0/.


Many thanks to the contributors and committers who were part of this release. 
Please see below for the Release Highlights.



RELEASE HIGHLIGHTS


This is a minor release over Mahout 0.10.0 meant to introduce several new 
features and to fix some bugs.  Mahout 0.11.0 includes all new features and 
bugfixes released in Mahout versions 0.10.1 and 0.10.2.



Mahout 0.11.0 new features compared to Mahout 0.10.0



  1.  Spark 1.3 support.

  2.  In-core transpose view rewrites. Modifiable transpose views, e.g. (for (col 
<- a.t) col := 5).

  3.  Performance and parallelization improvements for AB', A'B, A'A spark 
physical operators. This speeds SimilarityAnalysis and its associated jobs, 
spark-itemsimilarity and spark-rowsimilarity.

  4.  Optional structural "flavor" abstraction for in-core matrices.  In-core 
matrices can now be tagged as e.g. sparse or dense.

  5.  %*% optimization based on matrix flavors.

  6.  In-core ::= sparse assignment functions.

  7.  Assign := optimization (do proper traversal based on matrix flavors, 
similarly to %*%).

  8.  Adding in-place elementwise functional assignment (e.g. mxA := exp _, mxA 
::= exp _).

  9.  Distributed and in-core version of simple elementwise analogues of 
scala.math._. for example, for log(x) the convention is dlog(drm), mlog(mx), 
vlog(vec). Unfortunately we cannot overload these functions over what is done 
in scala.math, i.e. scala would not allow log(mx) or log(drm) and log(Double) 
at the same time, mainly because they are being defined in different packages.

  10. Distributed and in-core first and second moment routines. R analogs: 
mean(), colMeans(), rowMeans(), variance(), sd(). By convention, distributed 
versions are prepended by (d) letter: colMeanVars() colMeanStdevs() 
dcolMeanVars() dcolMeanStdevs().

  11. Distance and squared distance matrix routines. R analog: dist(). Provide 
both squared and non-squared eucledian distance matrices. By convention, 
distributed versions are prepended by (d) letter: dist(x), sqDist(x), 
dsqDist(x). Also a variation for pair-wise distance matrix of two different 
inputs x and y: sqDist(x,y), dsqDist(x,y).

  12. DRM row sampling api.

  13. Distributed performance bug fixes. This relates mostly to (a) matrix 
multiplication deficiencies, and (b) handling parallelism.

  14. Distributed engine neutral allreduceBlock() operator api for Spark and 
H2O.

  15. Distributed optimizer operators for elementwise functions. Rewrites 
recognizing e.g. 1+ drmX * dexp(drmX) as a single fused elementwise physical 
operator: elementwiseFunc(f1(f2(drmX)) where f1 = 1 + x and f2 = exp(x).

  16. More cbind, rbind flavors (e.g. 1 cbind mxX, 1 cbind drmX or the other 
way around) for Spark and H2O.

  17. Added +=: and *=: operators on vectors.

  18. Closeable API for broadcast tensors.

  19. Support for conversion of any type-keyed DRM into ordinally-keyed DRM.

  20. Scala logging style.

  21. rowSumsMap() summary for non-int-keyed DRMs.

  22. elementwise power operator ^ .

  23. R-like vector concatenation operator.

  24. In-core functional assignments e.g.: mxA :={ (x) => x * x}.

  25. Straighten out behavior of Matrix.iterator() and iterateNonEmpty().

  26. New mutable transposition view for in-core matrices.  In-core matrix 
transpose view. rewrite with mostly two goals in mind: (1) enable mutability, 
e.g. for (col <- mxA.t) col := k (2) translate matrix structural flavor for 
optimizers correctly. e.g. new SparseRowMatrix.t carries on as column-major 
structure.

  27. Native support for kryo serialization of tensor types.

  28. Deprecation of MultiLayerPerceptron, ConcatenateVectorsJob and all 
related classes.

  29. Deprecation of SparseColumnMatrix.

  30. Fixes for a major memory usage bug in co-occurrence analysis used by the 
driver spark-itemsimilarity. This will now require far less memory in the 
executor.

  31. Some minor fixes.

Re: Exception in thread "main" java.lang.IllegalArgumentException: Unable to read output from "mahout -spark classpath"

2015-10-08 Thread Andrew Palumbo
The Mahout 0.11.0 Shell requires Spark 1.3. Please try with Spark 1.3.1.

On 10/08/2015 10:37 PM, go canal wrote:
> I tried Spark 1.4.1, same error. Then I saw the same error from shell 
> command. So I suspect that it is the environment configuration problem.
> I have followed this https://mahout.apache.org/general/downloads.html for 
> Mahout configuration.
> So it seems to be a Spark configuration problem, I guess, although I can run 
> spark-example without errors. Will need to figure out what are missing.
>   thanks, canal
>
>
>   On Monday, October 5, 2015 12:23 AM, Pat Ferrel 
>  wrote:
> 
>
>   Mahout 0.11.0 is built on Spark 1.4 and so 1.5.1 is a bit unknown. I think 
> the Mahout Shell does not run on 1.5.1.
>
> That may not be the error below, which is caused when Mahout tries to create 
> a set of jars to use in the Spark executors. The code runs `mahout -spark 
> classpath` to get these. So something is missing in your env in Eclipse. Does 
> `mahout -spark classpath` run in a shell, if so check to see if you env 
> matches in Eclipse.
>
> Also what are you trying to do? I have some example Spark Context creation 
> code if you are using Mahout as a Library.
>
>
> On Oct 3, 2015, at 2:14 AM, go canal  wrote:
>
> Hello,I am running a very simple Mahout application in Eclipse, but got this 
> error:
> Exception in thread "main" java.lang.IllegalArgumentException: Unable to read 
> output from "mahout -spark classpath". Is SPARK_HOME defined?
> I have SPARK_HOME defined in Eclipse as an environment variable with value of 
> /usr/local/spark-1.5.1.
> What else I need to include/set ?
>
> thanks, canal
>
>
>



Re: [VOTE] Apache Mahout 0.11.1 Release Candidate

2015-11-06 Thread Andrew Palumbo
1. Downloaded and built {src} {tar}- all tests passed.
2. Started shell from {src, bin} * {tar} distro and ran some distributed 
algebra and I/O tests- no problems.
3. Ran MR Wikipedia example.
4. Ran Spark CLI naive bayes examples.

+1 (binding)
  


From: Suneel Marthi 
Sent: Friday, November 6, 2015 3:05 PM
To: mahout; user@mahout.apache.org
Subject: Re: [VOTE] Apache Mahout 0.11.1 Release Candidate

1. Downloaded {src}* {zip,tar}
2. Ran a clean build and all tests pass
3. Spun up Mahout Spark Shell from the compiled artifacts and ran a few
Samsara queries, tests passed
4. Downloaded {bin} * {zip, tar}
5. Spun up Mahout Spark Shell from the compiled artifacts and ran a few
Samsara queries, tests passed

Here's my +1 (binding)


On Fri, Nov 6, 2015 at 2:41 PM, Suneel Marthi 
wrote:

> Please vote on releasing the following candidate as Apache Mahout version
> 0.11.1:
>
> Branch:
> release-0.11.1
> (see https://git1-us-west.apache.org/repos/asf?p=mahout.git)
>
> The release artifacts to be voted on can be found at:
>
> https://repository.apache.org/content/repositories/orgapachemahout-1018/org/apache/mahout/apache-mahout-distribution/0.11.1/
>
> The release artifacts are signed with the key with fingerprint D3541808
> found at:
> http://www.apache.org/dist/mahout/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachemahout-1018
>
> -
>
> The vote is open for the next 72 hours and passes if a majority of at
> least
> three +1 PMC votes are cast.
>
> The vote ends on Sunday November 8, 2015.
>
> [ ] +1 Release this package as Apache Mahout 0.11.1
> [ ] -1 Do not release this package because ...
>
> ===
>


Re: Mahout : 20-newsgroups Classification Example : Split command

2016-01-14 Thread Andrew Palumbo
The poor results you are seeing when testing are because you've run seq2sparse on 
each set independently.  This will create two different dictionaries, which 
serve as the vector index for each term in your vocabulary.  You must use the 
same dictionary that you trained your model on to vectorize your holdout set.  
There is an example for doing this in Scala, using the new Mahout Samsara 
environment, here: 

http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html

See the "Define a function to tokenize and vectorize new text using our current 
dictionary" section.

   


From: Alok Tanna 
Sent: Thursday, January 14, 2016 2:31 PM
To: user@mahout.apache.org
Subject: Mahout : 20-newsgroups Classification Example : Split command

Hi ,

This request is in referece to the 20-newsgroups Classification Example on
the below link
https://mahout.apache.org/users/classification/twenty-newsgroups.html

I am able to run the example and get the results as mentioned in the link,
but when I am trying to do this example without the split command the
results are not same. Also when I try to run the other test data against
the same model results are not accurate.

Can we have this example run without the split command ?

Basically I am trying to do this :

I took both the datasets for training & testing.

Run below commands on both sets:
1. seqdirectory
2. seq2sparse

Now I  have vectors generated for both datasets.
- Run trainnb command using first dataset's vectors output. So instead of
training a model on 80% of the data, I am  using the whole dataset.
- Run testnb command using second dataset's vectors output. This is not the
20% of the data, it's completely new dataset, solely used for testing.

So instead of using mahout split, we I have specified separate dataset for
testing the model.

Results for this exercise is totally different then what I get when I am
using split command to split the data .


Thanks & Regards,

Alok R. Tanna


Re: Mahout : 20-newsgroups Classification Example : Split command

2016-01-14 Thread Andrew Palumbo
Correct - actually the example will not work with another dataset after the 
seq2sparse step.  You'd have to write some vectorization methods which would 
use the dictionary.file-0 file from the output of your seq2sparse run on the 
training data to vectorize your out-of-sample text.  You could then run mahout 
testnb on that set.


The scala example I mentioned earlier has an outline for writing such a method, 
although only goes as far as a single document tokenized into unigrams.





From: Alok Tanna 
Sent: Thursday, January 14, 2016 5:00 PM
To: user@mahout.apache.org; ap@outlook.com
Subject: Re: Mahout : 20-newsgroups Classification Example : Split command

Thank you Andrew for your inputs. I will try the example in Scala .

So this example of 20-newsgroups cannot be used with other data sets to test it 
once the split is done, is that right?

Thanks,
Alok Tanna

On Thu, Jan 14, 2016 at 4:26 PM, Andrew Palumbo 
mailto:ap@outlook.com>> wrote:
The poor results you are seeing by testing are because you've run seq2sparse on 
each set independently.   This will create two different dictionaries, which 
serve as the vector index for each term in your vocabulary.  You must use the 
same dictionary that you trained your model on to vectorize your holdout set.  
There is an example for doing this in Scala, using the new Mahout Samsara 
environment here:

http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html

See the "Define a function to tokenize and vectorize new text using our current 
dictionary" section.




From: Alok Tanna mailto:tannaa...@gmail.com>>
Sent: Thursday, January 14, 2016 2:31 PM
To: user@mahout.apache.org<mailto:user@mahout.apache.org>
Subject: Mahout : 20-newsgroups Classification Example : Split command

Hi ,

This request is in referece to the 20-newsgroups Classification Example on
the below link
https://mahout.apache.org/users/classification/twenty-newsgroups.html

I am able to run the example and get the results as mentioned in the link,
but when I am trying to do this example without the split command the
results are not same. Also when I try to run the other test data against
the same model results are not accurate.

Can we have this example run without the split command ?

Basically I am trying to do this :

I took both the datasets for training & testing.

Run below commands on both sets:
1. seqdirectory
2. seq2sparse

Now I  have vectors generated for both datasets.
- Run trainnb command using first dataset's vectors output. So instead of
training a model on 80% of the data, I am  using the whole dataset.
- Run testnb command using second dataset's vectors output. This is not the
20% of the data, it's completely new dataset, solely used for testing.

So instead of using mahout split, we I have specified separate dataset for
testing the model.

Results for this exercise is totally different then what I get when I am
using split command to split the data .


Thanks & Regards,

Alok R. Tanna



--
Thanks & Regards,

Alok R. Tanna



RE: Mahout error : seq2sparse

2016-02-04 Thread Andrew Palumbo
Thank you for reporting this.  The "-el" option was removed from 'mahout 
trainnb' in v0.10.0.

Label extraction is now the default action.

That piece of documentation needs to be updated.

Andy
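
For anyone hitting this, a quick sketch of the change (paths are placeholders):

# Mahout 0.9:
$ mahout trainnb -i train-vectors -el -o model -li labelindex -ow -c

# Mahout 0.10.0+ (label extraction now happens by default, so drop -el):
$ mahout trainnb -i train-vectors -o model -li labelindex -ow -c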


 Original message 
From: Andrew Musselman 
Date: 02/04/2016 2:20 PM (GMT-05:00)
To: Alok Tanna 
Cc: user@mahout.apache.org
Subject: Re: Mahout error : seq2sparse

Great to hear!  If you're up for it you could sign up and file a bug at
https://issues.apache.org/jira/browse/MAHOUT so we can track that.

Thanks!

On Thu, Feb 4, 2016 at 11:18 AM, Alok Tanna  wrote:

> Thank you so much Andrew it did work with the latest version in local
> mode.
>
> I found one thing that with the new version in the twenty-newsgroups
> classification example(
> https://mahout.apache.org/users/classification/twenty-newsgroups.html)
> this command
> 6. Train the classifier
>
>  $ mahout trainnb
> -i ${WORK_DIR}/20news-train-vectors
> -el
> -o ${WORK_DIR}/model
> -li ${WORK_DIR}/labelindex
> -ow
> -c
>
>
>  wont work with -el parameter . Once I removed it worked fine. not sure
> why ?
>
> with this -el parameter it worked with earlier versions.
>
> Thanks,
> Alok Tanna
>
> On Thu, Feb 4, 2016 at 2:18 AM, Alok Tanna  wrote:
>
>> Will try to update it to night to the latest version and then give it a
>> try .
>>
>> Thanks,
>> Alok Tanna
>>
>> On Thu, Feb 4, 2016 at 1:48 AM, Andrew Musselman <
>> andrew.mussel...@gmail.com> wrote:
>>
>>> Would recommend updating to the latest version if you can; you're
>>> probably working with two-releases-old code.
>>>
>>>
>>> On Wednesday, February 3, 2016, Alok Tanna  wrote:
>>>
 Thank you Andrew . I was able to remove empty lines with your help and
 also run re run the process but then still I am getting the same error.

 when I just run Mahout it shows me  this
 jar /mahout-examples-1.0-SNAPSHOT-job.jar!

 I think only option I have now is to set up the cluster and run it on
 that

 Thanks,
 Alok Tanna


>>
>>
>> --
>> Thanks & Regards,
>>
>> Alok R. Tanna
>>
>>
>
>
>
> --
> Thanks & Regards,
>
> Alok R. Tanna
>
>


RE: mahout spark-itemsimilarity does not work on EMR 4.3

2016-02-23 Thread Andrew Palumbo
Please update to Mahout 0.11.1 for Spark versions > 1.3.
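
Note that you will still need SPARK_HOME set before invoking the driver, e.g. 
(the install path and master URL are examples only):

$ export SPARK_HOME=/usr/lib/spark
$ mahout spark-itemsimilarity -i input -o output --master yarn-client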

 Original message 
From: Zhun Shen 
Date: 02/23/2016 8:57 PM (GMT-05:00)
To: user@mahout.apache.org
Subject: mahout spark-itemsimilarity does not work on EMR 4.3

Hi,
mahout version: 0.11.0
EMR version: 4.3
spark version: 1.6.0

I tried to run mahout spark-itemsimilarity on AWS EMR, but it told me that 
“MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Cannot find Spark classpath. Is 'SPARK_HOME' set?”. Is this a bug in EMR, or am 
I using mahout spark in the wrong way?


New Mahout "Samsara" Book

2016-02-25 Thread Andrew Palumbo
The new book, "Apache Mahout: Beyond MapReduce" has been released. Written by 
Mahout committers Dmitriy Lyubimov and Andrew Palumbo, this book covers 
previously undocumented features of Mahout releases 0.10 and 0.11.
For more information please see the announcement page:
http://www.weatheringthroughtechdays.com/2016/02/mahout-samsara-book-is-out.html
Thank You



Re: New Mahout "Samsara" Book

2016-02-25 Thread Andrew Palumbo
As Suneel said, you can see the table of contents when you "Look Inside" the 
book on the Amazon page. The "Look Inside" feature does not seem to be 
available on some mobile browsers.  

Thanks.



From: Suneel Marthi 
Sent: Thursday, February 25, 2016 10:26 AM
To: Pavan K Narayanan; user@mahout.apache.org
Subject: Re: New Mahout "Samsara" Book

It does give u TOC when u 'Look Inside'.

On Thu, Feb 25, 2016 at 10:16 AM, Pavan K Narayanan <
pavan.naraya...@gmail.com> wrote:

> I checked both links, they have only front and back cover of the book. No
> table of contents
> On Feb 25, 2016 9:57 AM, "Suneel Marthi"  wrote:
>
>> You can see the TOC on Amazon
>>
>>
>> http://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785
>>
>>
>> On Thu, Feb 25, 2016 at 9:55 AM, Pavan K Narayanan <
>> pavan.naraya...@gmail.com> wrote:
>>
>> > Andrew, can you please attach table of contents if you don't mind.
>> > On Feb 25, 2016 8:05 AM, "Andrew Palumbo"  wrote:
>> >
>> > > The new book, "Apache Mahout: Beyond MapReduce" has been released.
>> > Written
>> > > by Mahout committers Dmitriy Lyubimov and Andrew Palumbo, this book
>> > covers
>> > > previously undocumented features of Mahout releases 0.10 and 0.11.
>> > > For more information please see the announcement page:
>> > >
>> > >
>> >
>> http://www.weatheringthroughtechdays.com/2016/02/mahout-samsara-book-is-out.html
>> > > Thank You
>> > >
>> > >
>> >
>>
>


Re: New Mahout "Samsara" Book

2016-02-25 Thread Andrew Palumbo
Thanks very much for the invite Scott.  


From: Suneel Marthi 
Sent: Thursday, February 25, 2016 10:35 AM
To: user@mahout.apache.org
Subject: Re: New Mahout "Samsara" Book

Thanks Scott for the invite.

@apalumbo @Dmitriy @PatFerrel  ???

On Thu, Feb 25, 2016 at 10:19 AM, scott cote  wrote:

> Suneel and others:
>
> Anyone of ya’ll want to come to DFW Data Science sometime this summer and
> give a talk?  You can promote the book.
> You would be following on the heels of a couple of talks regarding deep
> learning and search engines.
>
> Here is the url for the user group:
>
> http://www.meetup.com/DFW-Data-Science/
>
> Our events are always on the first Monday of every month.  We have almost
> 600 members with average attendance somewhere north of 50 per event (High
> of 110 and low of 25).
>
> Cheers,
>
> SCott
>
> > On Feb 25, 2016, at 8:56 AM, Suneel Marthi  wrote:
> >
> > You can see the TOC on Amazon
> >
> >
> http://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785
> >
> >
> > On Thu, Feb 25, 2016 at 9:55 AM, Pavan K Narayanan <
> > pavan.naraya...@gmail.com> wrote:
> >
> >> Andrew, can you please attach table of contents if you don't mind.
> >> On Feb 25, 2016 8:05 AM, "Andrew Palumbo"  wrote:
> >>
> >>> The new book, "Apache Mahout: Beyond MapReduce" has been released.
> >> Written
> >>> by Mahout committers Dmitriy Lyubimov and Andrew Palumbo, this book
> >> covers
> >>> previously undocumented features of Mahout releases 0.10 and 0.11.
> >>> For more information please see the announcement page:
> >>>
> >>>
> >>
> http://www.weatheringthroughtechdays.com/2016/02/mahout-samsara-book-is-out.html
> >>> Thank You
> >>>
> >>>
> >>
>
>


Re: [VOTE] Apache Mahout 0.11.2 Release Candidate

2016-03-11 Thread Andrew Palumbo
Built and tested src tar.  Ran through classification and clustering examples 
in the .zip and .tar binary distro covering Spark single machine, MapReduce 
pseudo-cluster and MAHOUT_LOCAL.  Ran the spark-shell with some simple 
distributed matrix multiplication and I/O tests in single machine mode.  Ran 
the spark-document-classifier.mscala script.  All without issue (aside from the 
URL changes mentioned in the release notes).
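
For anyone wanting to reproduce the matrix-multiplication smoke test, a
minimal sketch of the kind of thing meant here, typed into 'mahout
spark-shell' (the variable names are illustrative, not a canned script):

 val a = dense((1, 2, 3), (3, 4, 5))
 val drmA = drmParallelize(a, numPartitions = 2)
 val drmAtA = drmA.t %*% drmA   // distributed multiply
 println(drmAtA.collect)        // pull the small result back in-core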

+1


From: Suneel Marthi 
Sent: Friday, March 11, 2016 6:03 PM
To: user@mahout.apache.org; mahout
Subject: Re: [VOTE] Apache Mahout 0.11.2 Release Candidate

Checked {src} * {zip, tar}, ran a clean build and all tests pass.

+1

On Fri, Mar 11, 2016 at 5:17 PM, Suneel Marthi  wrote:

> This is the vote for release 0.11.2 of Apache Mahout.
>
> The vote will be going for 24 hours and will be closed on Sunday,
> March 12th, 2016.  Please download, test and vote with
>
> [ ] +1, accept RC as the official 0.11.2 release of Apache Mahout
> [ ] +0, I don't care either way,
> [ ] -1, do not accept RC as the official 0.11.2 release of Apache Mahout,
> because...
>
>
> Maven staging repo:
>
> https://repository.apache.org/content/repositories/orgapachemahout-1019/
>
> The git tag to be voted upon is release-0.11.2
>


Re: [VOTE] Apache Mahout 0.12.0 Release Candidate

2016-04-11 Thread Andrew Palumbo
Ran through the MapReduce/Spark examples from the binary .tar.gz distro. Ran
the spark-shell and tested the spark-document-classifier.mscala script.
+1  


From: Andrew Musselman 
Sent: Monday, April 11, 2016 12:43 PM
To: d...@mahout.apache.org
Cc: user@mahout.apache.org
Subject: Re: [VOTE] Apache Mahout 0.12.0 Release Candidate

Sigs and hashes are correct, running a build and examples next.

On Mon, Apr 11, 2016 at 8:38 AM, Suneel Marthi  wrote:

> Ran a complete build on  {src} * {zip, tar} and verified that all tests
> pass.
>
> Tested Spark Shell
>
> All Flink tests pass
>
> +1 (binding)
>
> On Mon, Apr 11, 2016 at 8:44 AM, Suneel Marthi  wrote:
>
> > Correction to previous message
> > --
> >
> > This is a vote for release 0.12.0 of Apache Mahout that adds Apache Flink
> > as an execution engine to the Samsara Linear Algebra framework.
> >
> > The vote will run for 24 hours and will be closed on Tuesday,
> > April 12th, 2016.  Please download, test and vote with
> >
> > [ ] +1, accept RC as the official 0.12.0 release of Apache Mahout
> > [ ] +0, I don't care either way,
> > [ ] -1, do not accept RC as the official 0.12.0 release of Apache Mahout,
> > because...
> >
> >
> > Maven staging repo:
> >
> > https://repository.apache.org/content/repositories/orgapachemahout-1022/
> >
> > The git tag to be voted upon is mahout-0.12.0
> >
> > On Mon, Apr 11, 2016 at 8:41 AM, Suneel Marthi 
> wrote:
> >
> >> This is a vote for release 0.12.0 of Apache Mahout that adds Apache
> Flink
> >> as an execution engine to the Samsara Linear Algebra framework.
> >>
> >> The vote will run for 24 hours and will be closed on Monday,
> >> April 12th, 2016.  Please download, test and vote with
> >>
> >> [ ] +1, accept RC as the official 0.12.0 release of Apache Mahout
> >> [ ] +0, I don't care either way,
> >> [ ] -1, do not accept RC as the official 0.12.0 release of Apache
> Mahout,
> >> because...
> >>
> >>
> >> Maven staging repo:
> >>
> >>
> https://repository.apache.org/content/repositories/orgapachemahout-1022/
> >>
> >> The git tag to be voted upon is mahout-0.12.0
> >>
> >
> >
>


RE: Congratulations to our new Chair

2016-04-20 Thread Andrew Palumbo
Thank you guys!

 Original message 
From: Andrew Musselman 
Date: 04/20/2016 8:14 PM (GMT-05:00)
To: d...@mahout.apache.org, user@mahout.apache.org
Subject: Re: Congratulations to our new Chair

Suneel, thanks for your great work as Chair, and thank you Andy for stepping in!

On Wed, Apr 20, 2016 at 5:00 PM, Dmitriy Lyubimov  wrote:

> congrats!
>
> On Wed, Apr 20, 2016 at 4:55 PM, Suneel Marthi  wrote:
>
> > Please join me in congratulating Andrew Palumbo on becoming our new
> Project
> > Chair.
> >
> > As for me, it was a pleasure to serve as Chair starting with the Mahout
> > 0.10.0 release and ending with the recent 0.12.0 release, and perhaps we
> > will do it again someday


Re: Negative probabilities

2016-05-11 Thread Andrew Palumbo
Hello, the elements of the vector are not actually probabilities; they are
scores.  The classification is a winner-takes-all approach, assigning the
instance to the class with the max score.

See: http://mahout.apache.org/users/algorithms/spark-naive-bayes.html for an 
overview of the algorithm.
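
In code, picking the winning label is just an argmax over that score vector.
A minimal sketch, assuming 'model' is your trained NaiveBayesModel and 'vec'
is an already-vectorized instance (those two names are illustrative):

 import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier

 val classifier = new StandardNaiveBayesClassifier(model)
 val scores = classifier.classifyFull(vec)  // per-label scores; negative is normal
 val winner = scores.maxValueIndex()        // winner-takes-all: index of max score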

Thanks


From: Nantia Makrynioti 
Sent: Wednesday, May 11, 2016 10:33:29 PM
To: user@mahout.apache.org
Subject: Negative probabilities

Hello,

I am using the classifyFullInstance method on a Naive Bayes model, but when
I print the elements of the generated vector, the probabilities are
negative. What might be the reason for this?

Thanks a lot,
Nantia


Re: [VOTE] Apache Mahout 0.12.1 Release

2016-05-18 Thread Andrew Palumbo
+1 (binding) tested a clean source build.


From: Suneel Marthi 
Sent: Wednesday, May 18, 2016 6:23:57 PM
To: mahout; user@mahout.apache.org
Subject: Re: [VOTE] Apache Mahout 0.12.1 Release

Verified {src} * {tar, zip}

Ran a clean build and tests and see no issues

+1 (binding)

On Wed, May 18, 2016 at 6:07 PM, Suneel Marthi  wrote:

> This is the vote for release 0.12.1 of Apache Mahout.
>
> The vote will be going for at least 72 hours and will be closed on
> Wednesday,
> May 21th, 2016.  Please download, test and vote with
>
> [ ] +1, accept RC as the official 0.12.1 release of Apache Mahout
> [ ] +0, I don't care either way,
> [ ] -1, do not accept RC as the official 0.12.1 release of Apache Mahout,
> because...
>
>
> Maven staging repo:
> https://repository.apache.org/content/repositories/orgapachemahout-1023
>
> The git tag to be voted upon is release-0.12.1
>


Welcome Trevor Grant as a new Mahout Committer

2016-05-23 Thread Andrew Palumbo
In recognition of Trevor Grant's contributions to the Mahout project
notably his Zeppelin Integration work, the PMC has invited and is pleased
to announce that he has accepted our invitation to join the Mahout project
as a committer.

As is customary, I will leave it to Trevor to provide a little bit of
background about himself.

Congratulations and Welcome!

-Andrew Palumbo
On Behalf of the Mahout PMC


RE: [VOTE] Mahout 0.12.2 Release Candidate 2

2016-06-10 Thread Andrew Palumbo
+1.  Ran the classify-wikipedia.sh MR script, launched the shell, and ran
spark-document-classifier.mscala in standalone cluster mode.

 Original message 
From: Andrew Musselman 
Date: 06/10/2016 9:23 PM (GMT-05:00)
To: user@mahout.apache.org
Cc: mahout 
Subject: Re: [VOTE] Mahout 0.12.2 Release Candidate 2

Signatures and hashes are correct; +1 (binding).

On Fri, Jun 10, 2016 at 6:05 PM, Suneel Marthi  wrote:

> Verified {bin} * {zip,tar} - ran tests, tests pass
>
> Verified {src} * {zip,tar} - rant tests, tests pass
>
> Here's my +1 (binding)
>
> On Fri, Jun 10, 2016 at 8:59 PM, Suneel Marthi  wrote:
>
> > This is the vote for release 0.12.2 of Apache Mahout.
> >
> > The vote will be going for at least 72 hours and will be closed on
> Sunday,
> > June 12th, 2016 or once there are at least 3 PMC +1 binding votes (which
> > ever occurs earlier).  Please download, test and vote with
> >
> > [ ] +1, accept RC as the official 0.12.2 release of Apache Mahout
> > [ ] +0, I don't care either way,
> > [ ] -1, do not accept RC as the official 0.12.2 release of Apache Mahout,
> > because...
> >
> >
> > Maven staging repo:
> >
> >
> > https://repository.apache.org/content/repositories/orgapachemahout-1025/
> >
> > The git tag to be voted upon is mahout-0.12.2
> >
>


RE: [VOTE] Mahout 0.12.2 Release Candidate 2

2016-06-10 Thread Andrew Palumbo
+1 (binding), that is.  Per my last email, tested the MR wikipedia example and
the spark document classifier without issue.

 Original message 
From: Andrew Palumbo 
Date: 06/10/2016 10:44 PM (GMT-05:00)
To: d...@mahout.apache.org, user@mahout.apache.org
Subject: RE: [VOTE] Mahout 0.12.2 Release Candidate 2

+1.  Ran the classify-wikipedia.sh MR script, launched the shell, and ran
spark-document-classifier.mscala in standalone cluster mode.

 Original message 
From: Andrew Musselman 
Date: 06/10/2016 9:23 PM (GMT-05:00)
To: user@mahout.apache.org
Cc: mahout 
Subject: Re: [VOTE] Mahout 0.12.2 Release Candidate 2

Signatures and hashes are correct; +1 (binding).

On Fri, Jun 10, 2016 at 6:05 PM, Suneel Marthi  wrote:

> Verified {bin} * {zip,tar} - ran tests, tests pass
>
>


RE: Text clustering how to?

2016-07-27 Thread Andrew Palumbo
I don't think the response was completely sarcastic.  The point is if you want 
to learn more about the subject you might do well to dig in and get your hands 
dirty updating the code.  It's the push/kick in the right direction that you 
asked for.



 Original message 
From: Raviteja Lokineni 
Date: 07/27/2016 9:21 PM (GMT-05:00)
To: user@mahout.apache.org
Subject: Re: Text clustering how to?

Thank you for the help. Nice responses. You could have just said you didn't
know the answer.

I know that they have diverged. A point of common sense, I know what reply
I received on JIRA. Since I wasn't up to the job, I reached out to user
forums for help(not the dev forums mind you).

If the users forums is consisting of sarcastic people no point in having
them. Thank you for the wonderful responses, good day/night.

On Jul 27, 2016 7:48 PM, "Suneel Marthi"  wrote:

> You did get a reply via jira, please stop spamming Mahout and OpenNLP
> mailing lists with the same question.
> The book u r looking at 'Taming Text' is from 2011-12, and both OpenNLP and
> Mahout projects have long diverged from the book.
>
> If u r following the book for ur learning, u may be better off learning on
> your own from the project.
>
> On Wed, Jul 27, 2016 at 7:33 PM, Dmitriy Lyubimov 
> wrote:
>
> > I think you have got a reply via jira.
> >
> > On Wed, Jul 27, 2016 at 10:50 AM, Raviteja Lokineni <
> > raviteja.lokin...@gmail.com> wrote:
> >
> > > Anybody?
> > >
> > > On Thu, Jul 21, 2016 at 10:42 AM, Raviteja Lokineni <
> > > raviteja.lokin...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I am pretty new to Apache Mahout. I am trying to figure out how to do
> > > text
> > > > clustering, I was following the book Taming Text (Manning). Looking
> at
> > > the
> > > > book I tried to run Mahout and stumbled upon a version
> incompatibility
> > > with
> > > > latest Lucene indexes. I therefore opened up:
> > > > https://issues.apache.org/jira/browse/MAHOUT-1876
> > > >
> > > > Looks like the code responsible for doing what I needed to do is in
> > > legacy
> > > > map reduce code. Is there any supported(which is not deprecated or
> > > legacy)
> > > > approach to achieve what I am supposed to do?
> > > >
> > > > Was wondering if someone would push / kick me in the right direction
> ☺.
> > > >
> > > > Thanks,
> > > > --
> > > > *Raviteja Lokineni* | Business Intelligence Developer
> > > > TD Ameritrade
> > > >
> > > > E: raviteja.lokin...@gmail.com
> > > >
> > > > [image: View Raviteja Lokineni's profile on LinkedIn]
> > > > 
> > > >
> > > >
> > >
> > >
> > > --
> > > *Raviteja Lokineni* | Business Intelligence Developer
> > > TD Ameritrade
> > >
> > > E: raviteja.lokin...@gmail.com
> > >
> > > [image: View Raviteja Lokineni's profile on LinkedIn]
> > > 
> > >
> >
>


RE: Text clustering how to?

2016-07-27 Thread Andrew Palumbo
Right, so per the response in the software, maybe you would be interested in 
upgrading the lucene dep?   You can begin work  in that jira if you'd like.  We 
usually do not accept patches for legacy components but the person responding 
to your jira specifically said that we would consider an upgrade of that 
particular dependency.

Open source software projects depend on the entire community at times to 
maintain, upgrade and add new components to projects.  If this is a feature 
that you need, and you are the first to file an issue for it, please consider 
making a contribution.

Thanks,

Andy




 Original message 
From: Raviteja Lokineni 
Date: 07/27/2016 9:30 PM (GMT-05:00)
To: user@mahout.apache.org
Subject: RE: Text clustering how to?

Already doing that.

On Jul 27, 2016 9:28 PM, "Andrew Palumbo"  wrote:

> I don't think the response was completely sarcastic.  The point is if you
> want to learn more about the subject you might do well to dig in and get
> your hands dirty updating the code.  It's the push/kick in the right
> direction that you asked for.
>
>
>
>  Original message 
> From: Raviteja Lokineni 
> Date: 07/27/2016 9:21 PM (GMT-05:00)
> To: user@mahout.apache.org
> Subject: Re: Text clustering how to?
>
> Thank you for the help. Nice responses. You could have just said you didn't
> know the answer.
>
> I know that they have diverged. A point of common sense, I know what reply
> I received on JIRA. Since I wasn't up to the job, I reached out to user
> forums for help(not the dev forums mind you).
>
> If the users forums is consisting of sarcastic people no point in having
> them. Thank you for the wonderful responses, good day/night.
>
> On Jul 27, 2016 7:48 PM, "Suneel Marthi"  wrote:
>
> > You did get a reply via jira, please stop spamming Mahout and OpenNLP
> > mailing lists with the same question.
> > The book u r looking at 'Taming Text' is from 2011-12, and both OpenNLP
> and
> > Mahout projects have long diverged from the book.
> >
> > If u r following the book for ur learning, u may be better off learning
> on
> > your own from the project.
> >
> > On Wed, Jul 27, 2016 at 7:33 PM, Dmitriy Lyubimov 
> > wrote:
> >
> > > I think you have got a reply via jira.
> > >
> > > On Wed, Jul 27, 2016 at 10:50 AM, Raviteja Lokineni <
> > > raviteja.lokin...@gmail.com> wrote:
> > >
> > > > Anybody?
> > > >
> > > > On Thu, Jul 21, 2016 at 10:42 AM, Raviteja Lokineni <
> > > > raviteja.lokin...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I am pretty new to Apache Mahout. I am trying to figure out how to
> do
> > > > text
> > > > > clustering, I was following the book Taming Text (Manning). Looking
> > at
> > > > the
> > > > > book I tried to run Mahout and stumbled upon a version
> > incompatibility
> > > > with
> > > > > latest Lucene indexes. I therefore opened up:
> > > > > https://issues.apache.org/jira/browse/MAHOUT-1876
> > > > >
> > > > > Looks like the code responsible for doing what I needed to do is in
> > > > legacy
> > > > > map reduce code. Is there any supported(which is not deprecated or
> > > > legacy)
> > > > > approach to achieve what I am supposed to do?
> > > > >
> > > > > Was wondering if someone would push / kick me in the right
> direction
> > ☺.
> > > > >
> > > > > Thanks,
> > > > > --
> > > > > *Raviteja Lokineni* | Business Intelligence Developer
> > > > > TD Ameritrade
> > > > >
> > > > > E: raviteja.lokin...@gmail.com
> > > > >
> > > > > [image: View Raviteja Lokineni's profile on LinkedIn]
> > > > > <http://in.linkedin.com/in/ravitejalokineni>
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > *Raviteja Lokineni* | Business Intelligence Developer
> > > > TD Ameritrade
> > > >
> > > > E: raviteja.lokin...@gmail.com
> > > >
> > > > [image: View Raviteja Lokineni's profile on LinkedIn]
> > > > <http://in.linkedin.com/in/ravitejalokineni>
> > > >
> > >
> >
>


RE: Text clustering how to?

2016-07-27 Thread Andrew Palumbo
* per the response in the jira...

 Original message 
From: Andrew Palumbo 
Date: 07/27/2016 9:40 PM (GMT-05:00)
To: user@mahout.apache.org
Subject: RE: Text clustering how to?

Right, so per the response in the software, maybe you would be interested in 
upgrading the lucene dep?   You can begin work  in that jira if you'd like.  We 
usually do not accept patches for legacy components but the person responding 
to your jira specifically said that we would consider an upgrade of that 
particular dependency.

Open source software projects depend on the entire community at times to 
maintain, upgrade and add new components to projects.  If this is a feature 
that you need, and you are the first to file an issue for it, please consider 
making a contribution.

Thanks,

Andy




 Original message 
From: Raviteja Lokineni 
Date: 07/27/2016 9:30 PM (GMT-05:00)
To: user@mahout.apache.org
Subject: RE: Text clustering how to?

Already doing that.

On Jul 27, 2016 9:28 PM, "Andrew Palumbo"  wrote:

> I don't think the response was completely sarcastic.  The point is if you
> want to learn more about the subject you might do well to dig in and get
> your hands dirty updating the code.  It's the push/kick in the right
> direction that you asked for.
>
>
>
>  Original message 
> From: Raviteja Lokineni 
> Date: 07/27/2016 9:21 PM (GMT-05:00)
> To: user@mahout.apache.org
> Subject: Re: Text clustering how to?
>
> Thank you for the help. Nice responses. You could have just said you didn't
> know the answer.
>
> I know that they have diverged. A point of common sense, I know what reply
> I received on JIRA. Since I wasn't up to the job, I reached out to user
> forums for help(not the dev forums mind you).
>
> If the users forums is consisting of sarcastic people no point in having
> them. Thank you for the wonderful responses, good day/night.
>
> On Jul 27, 2016 7:48 PM, "Suneel Marthi"  wrote:
>
> > You did get a reply via jira, please stop spamming Mahout and OpenNLP
> > mailing lists with the same question.
> > The book u r looking at 'Taming Text' is from 2011-12, and both OpenNLP
> and
> > Mahout projects have long diverged from the book.
> >
> > If u r following the book for ur learning, u may be better off learning
> on
> > your own from the project.
> >
> > On Wed, Jul 27, 2016 at 7:33 PM, Dmitriy Lyubimov 
> > wrote:
> >
> > > I think you have got a reply via jira.
> > >
> > > On Wed, Jul 27, 2016 at 10:50 AM, Raviteja Lokineni <
> > > raviteja.lokin...@gmail.com> wrote:
> > >
> > > > Anybody?
> > > >
> > > > On Thu, Jul 21, 2016 at 10:42 AM, Raviteja Lokineni <
> > > > raviteja.lokin...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I am pretty new to Apache Mahout. I am trying to figure out how to
> do
> > > > text
> > > > > clustering, I was following the book Taming Text (Manning). Looking
> > at
> > > > the
> > > > > book I tried to run Mahout and stumbled upon a version
> > incompatibility
> > > > with
> > > > > latest Lucene indexes. I therefore opened up:
> > > > > https://issues.apache.org/jira/browse/MAHOUT-1876
> > > > >
> > > > > Looks like the code responsible for doing what I needed to do is in
> > > > legacy
> > > > > map reduce code. Is there any supported(which is not deprecated or
> > > > legacy)
> > > > > approach to achieve what I am supposed to do?
> > > > >
> > > > > Was wondering if someone would push / kick me in the right
> direction
> > ☺.
> > > > >
> > > > > Thanks,
> > > > > --
> > > > > *Raviteja Lokineni* | Business Intelligence Developer
> > > > > TD Ameritrade
> > > > >
> > > > > E: raviteja.lokin...@gmail.com
> > > > >
> > > > > [image: View Raviteja Lokineni's profile on LinkedIn]
> > > > > <http://in.linkedin.com/in/ravitejalokineni>
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > *Raviteja Lokineni* | Business Intelligence Developer
> > > > TD Ameritrade
> > > >
> > > > E: raviteja.lokin...@gmail.com
> > > >
> > > > [image: View Raviteja Lokineni's profile on LinkedIn]
> > > > <http://in.linkedin.com/in/ravitejalokineni>
> > > >
> > >
> >
>


Re: Text clustering how to?

2016-08-04 Thread Andrew Palumbo
Hello Raviteja,


Could you start a JIRA issue for this, and post your output there?


Instructions are in the "Making Changes" section here:


http://mahout.apache.org/developers/how-to-contribute.html





Thanks,


Andy


From: Raviteja Lokineni 
Sent: Thursday, August 4, 2016 12:09:38 PM
To: user@mahout.apache.org
Subject: Re: Text clustering how to?

Attaching the test output with failures. Please let me know if you find 
anything relevant.

On Thu, Aug 4, 2016 at 11:51 AM, Raviteja Lokineni 
mailto:raviteja.lokin...@gmail.com>> wrote:
Alright folks, I removed all the compilation errors after creating a fork of 
github repo. FYI.

Running the tests and will see how the tests perform.

On Thu, Jul 28, 2016 at 3:14 AM, Andrew Butkus 
mailto:and...@butkus.co.uk>> wrote:
I once posted a patch to the ffmpeg mailing list and my God did it put me off
doing so again :) There definitely needs to be more front-of-shop for open
source projects to entice people to contribute more :)) But the message is good,
if not the delivery; there's nothing like getting stuck in, it's the biggest and
most rewarding learning curve.

Sent from my iPhone

> On 28 Jul 2016, at 02:21, Raviteja Lokineni 
> mailto:raviteja.lokin...@gmail.com>> wrote:
>
> Thank you for the help. Nice responses. You could have just said you didn't
> know the answer.
>
> I know that they have diverged. A point of common sense, I know what reply
> I received on JIRA. Since I wasn't up to the job, I reached out to user
> forums for help(not the dev forums mind you).
>
> If the users forums is consisting of sarcastic people no point in having
> them. Thank you for the wonderful responses, good day/night.
>
>> On Jul 27, 2016 7:48 PM, "Suneel Marthi" 
>> mailto:smar...@apache.org>> wrote:
>>
>> You did get a reply via jira, please stop spamming Mahout and OpenNLP
>> mailing lists with the same question.
>> The book u r looking at 'Taming Text' is from 2011-12, and both OpenNLP and
>> Mahout projects have long diverged from the book.
>>
>> If u r following the book for ur learning, u may be better off learning on
>> your own from the project.
>>
>> On Wed, Jul 27, 2016 at 7:33 PM, Dmitriy Lyubimov 
>> mailto:dlie...@gmail.com>>
>> wrote:
>>
>>> I think you have got a reply via jira.
>>>
>>> On Wed, Jul 27, 2016 at 10:50 AM, Raviteja Lokineni <
>>> raviteja.lokin...@gmail.com> wrote:
>>>
 Anybody?

 On Thu, Jul 21, 2016 at 10:42 AM, Raviteja Lokineni <
 raviteja.lokin...@gmail.com> wrote:

> Hi all,
>
> I am pretty new to Apache Mahout. I am trying to figure out how to do
 text
> clustering, I was following the book Taming Text (Manning). Looking
>> at
 the
> book I tried to run Mahout and stumbled upon a version
>> incompatibility
 with
> latest Lucene indexes. I therefore opened up:
> https://issues.apache.org/jira/browse/MAHOUT-1876
>
> Looks like the code responsible for doing what I needed to do is in
 legacy
> map reduce code. Is there any supported(which is not deprecated or
 legacy)
> approach to achieve what I am supposed to do?
>
> Was wondering if someone would push / kick me in the right direction
>> ☺.
>
> Thanks,
> --
> *Raviteja Lokineni* | Business Intelligence Developer
> TD Ameritrade
>
> E: raviteja.lokin...@gmail.com
>
> [image: View Raviteja Lokineni's profile on LinkedIn]
> 


 --
 *Raviteja Lokineni* | Business Intelligence Developer
 TD Ameritrade

 E: raviteja.lokin...@gmail.com

 [image: View Raviteja Lokineni's profile on LinkedIn]
 
>>



--
Raviteja Lokineni | Business Intelligence Developer
TD Ameritrade

E: raviteja.lokin...@gmail.com

[View Raviteja Lokineni's profile on 
LinkedIn]




--
Raviteja Lokineni | Business Intelligence Developer
TD Ameritrade

E: raviteja.lokin...@gmail.com

[View Raviteja Lokineni's profile on 
LinkedIn]



Re:

2016-08-04 Thread Andrew Palumbo
Raviteja,


Before opening a Jira, could you explain what changes you made on the 
d...@mahout.apache.org list, and explain the errors that you're getting?


We don't use attachments, so please include the output in your text.


Thanks,

Andy



____
From: Andrew Palumbo 
Sent: Thursday, August 4, 2016 12:22:44 PM
To: user@mahout.apache.org
Subject: Re: Text clustering how to?


Hello Raviteja,


Could you start a JIRA issue for this, and post your output there?


Instructions are in the "Making Changes" section here:


http://mahout.apache.org/developers/how-to-contribute.html





Thanks,


Andy


From: Raviteja Lokineni 
Sent: Thursday, August 4, 2016 12:09:38 PM
To: user@mahout.apache.org
Subject: Re: Text clustering how to?

Attaching the test output with failures. Please let me know if you find 
anything relevant.

On Thu, Aug 4, 2016 at 11:51 AM, Raviteja Lokineni 
mailto:raviteja.lokin...@gmail.com>> wrote:
Alright folks, I removed all the compilation errors after creating a fork of 
github repo. FYI.

Running the tests and will see how the tests perform.

On Thu, Jul 28, 2016 at 3:14 AM, Andrew Butkus 
mailto:and...@butkus.co.uk>> wrote:
I once posted a patch to the ffmpeg mailing list and my God did it put me off
doing so again :) There definitely needs to be more front-of-shop for open
source projects to entice people to contribute more :)) But the message is good,
if not the delivery; there's nothing like getting stuck in, it's the biggest and
most rewarding learning curve.

Sent from my iPhone

> On 28 Jul 2016, at 02:21, Raviteja Lokineni 
> mailto:raviteja.lokin...@gmail.com>> wrote:
>
> Thank you for the help. Nice responses. You could have just said you didn't
> know the answer.
>
> I know that they have diverged. A point of common sense, I know what reply
> I received on JIRA. Since I wasn't up to the job, I reached out to user
> forums for help(not the dev forums mind you).
>
> If the users forums is consisting of sarcastic people no point in having
> them. Thank you for the wonderful responses, good day/night.
>
>> On Jul 27, 2016 7:48 PM, "Suneel Marthi" 
>> mailto:smar...@apache.org>> wrote:
>>
>> You did get a reply via jira, please stop spamming Mahout and OpenNLP
>> mailing lists with the same question.
>> The book u r looking at 'Taming Text' is from 2011-12, and both OpenNLP and
>> Mahout projects have long diverged from the book.
>>
>> If u r following the book for ur learning, u may be better off learning on
>> your own from the project.
>>
>> On Wed, Jul 27, 2016 at 7:33 PM, Dmitriy Lyubimov 
>> mailto:dlie...@gmail.com>>
>> wrote:
>>
>>> I think you have got a reply via jira.
>>>
>>> On Wed, Jul 27, 2016 at 10:50 AM, Raviteja Lokineni <
>>> raviteja.lokin...@gmail.com<mailto:raviteja.lokin...@gmail.com>> wrote:
>>>
>>>> Anybody?
>>>>
>>>> On Thu, Jul 21, 2016 at 10:42 AM, Raviteja Lokineni <
>>>> raviteja.lokin...@gmail.com<mailto:raviteja.lokin...@gmail.com>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am pretty new to Apache Mahout. I am trying to figure out how to do
>>>> text
>>>>> clustering, I was following the book Taming Text (Manning). Looking
>> at
>>>> the
>>>>> book I tried to run Mahout and stumbled upon a version
>> incompatibility
>>>> with
>>>>> latest Lucene indexes. I therefore opened up:
>>>>> https://issues.apache.org/jira/browse/MAHOUT-1876
>>>>>
>>>>> Looks like the code responsible for doing what I needed to do is in
>>>> legacy
>>>>> map reduce code. Is there any supported(which is not deprecated or
>>>> legacy)
>>>>> approach to achieve what I am supposed to do?
>>>>>
>>>>> Was wondering if someone would push / kick me in the right direction
>> :).
>>>>>
>>>>> Thanks,
>>>>> --
>>>>> *Raviteja Lokineni* | Business Intelligence Developer
>>>>> TD Ameritrade
>>>>>
>>>>> E: raviteja.lokin...@gmail.com<mailto:raviteja.lokin...@gmail.com>
>>>>>
>>>>> [image: View Raviteja Lokin

Purging of MAHOUT_LOCAL functionality

2016-09-07 Thread Andrew Palumbo
As the MAHOUT_LOCAL functionality is a legacy component, is currently not 
working and has no maintainer, it will be removed in one of the upcoming 
releases.


Thank you,


Andy


Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-02-01 Thread Andrew Palumbo




From: Isabel Drost 
Sent: Wednesday, February 1, 2017 4:55 AM
To: Dmitriy Lyubimov
Cc: user@mahout.apache.org
Subject: Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation



On Tue, Jan 31, 2017 at 04:06:36PM -0800, Dmitriy Lyubimov wrote:
> Except for several applied
> off-the-shelves, Mahout has not (hopefully just yet) developed a
> comprehensive set of things to use.

Do you think there would be value in having that? Funding aside, would now be a
good time to develop that or do you think Samsara needs more work before
starting to work on that?

If there's value/ good timing: Do you think it would be possible to mentor
downstream users to help get this done? And a question to those still reading
this list: Would you be interested an able (time-wise) to help out here?


I'm sorry to cut in on the conversation here, but I wanted people to be aware
of the algorithm framework effort that is currently underway.

I think that https://issues.apache.org/jira/browse/MAHOUT-1856, a solid
framework for new algorithms, will go a long way towards helping new users
understand how easy it is to add algorithms.  There has been significant work
on this issue already merged to master, with a fine OLS example including
statistical tests for autocorrelation and heteroskedasticity.  Trevor G. has
been heading up the framework effort, which is still in development, and will
continue to be throughout the 0.13.x releases (and hopefully be added to in
0.14.x as well).
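
For a flavor of what that looks like at the Samsara level, here is a minimal
ordinary-least-squares sketch via the normal equations -- an illustration
only, not the framework's actual API; drmX and drmY are assumed to be
already-parallelized DRMs of features and targets:

 import org.apache.mahout.math.scalabindings._
 import org.apache.mahout.math.drm._
 import RLikeOps._
 import RLikeDrmOps._

 val drmXtX = drmX.t %*% drmX   // X'X, computed distributed; result is small
 val drmXtY = drmX.t %*% drmY   // X'y, also small
 val beta = solve(drmXtX.collect, drmXtY.collect)  // in-core solve for beta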

I believe that having the framework in place will both make Mahout more
intuitive for new users and developers writing algorithms and pipelines, and
provide a set of canned algos to those who are looking for something
off-the-shelf.

Just wanted to get that into the conversation.


> The off-the-shelves currently are cross-occurrence recommendations (which
> still require real time serving component taken from elsewhere), svd-pca,
> some algebra, and Naive/complement Bayes at scale.
>
> Most of the bigger companies i worked for never deal with completely the
> off-the-shelf open source solutions. It always requires more understanding
> of their problem. (E.g., much as COO recommender is wonderful, i don't
> think Netflix would entertain taking Mahout's COO run on it verbatim).

Makes total sense to me. Would it be possible to build a base system that
performs OK and can be extended such that it performs fantastically with a bit
of extra secret sauce?



> It is quite common that companies invest in their own specific
> understanding of their problem and requirements and a specific solution to
> their problem through iterative experimentation with different
> methodologies, most of which are either new-ish enough or proprietary
> enough that public solution does not exist.

While that does make a lot of sense, what I'm asking myself over and over is
this: Back when I was more active on this list there was a pattern in the
questions being asked. Often people were looking for recommenders, fraud
detection, event detection. Is there still such a pattern? If so it would be
interesting to think which of those problems are wide spread enough that
offering a standard package integrated from data ingestion to prediction would
make sense.


> That latter case was pretty much motivation for Samsara. If you are a
> practitioner solving numerical problems thru experimentation cycle, Mahout
> is much more useful than any of the off-the-shelf collections.

+1 This is also why I think focussing on Samsara and focussing on making that
stable and scalable makes a lot of sense.

The reason why I dug out this old thread comes from a slightly different angle:
We seem to have a solid base. But it's only really useful for a limited set of
experts. It will be hard to draw new contributors and committers from that set
of users (it will IMHO even be hard to find many users who are that skilled).
What I'm asking myself is if we should and can do something to make Mahout
useful for those who don't have that background.


> > perspective? If so, would there be interest among the Mahout committers to
> > help
> > users publicly create docs/examples/modules to support these use cases?
> >
>
> yes

Where do we start? ;)


Isabel




Fw: Starter Issues

2017-02-01 Thread Andrew Palumbo




From: Trevor Grant 
Sent: Wednesday, February 1, 2017 5:01 PM
To: d...@mahout.apache.org
Subject: Starter Issues

Hey all,

I know there are some folks on here who have been interested in getting
more involved with the project, specifically contributing code.  With the
recent additions of Native Bindings and Algorithms, there is a lot to do-
and we have begun tagging issues as 'beginner', for those who are
overwhelmed by where to start.

Click here to see the recent list:

https://issues.apache.org/jira/issues/?jql=project%20%3D%20MAHOUT%20AND%20status%20%3D%20Open%20and%20labels%20%3D%20beginner

We'll keep that updated, feel free to reach out with any issues, and as
always thanks for the support!

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo



http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


RE: [VOTE] Apache Mahout 0.13.0 Release Candidate

2017-03-01 Thread Andrew Palumbo
I will verify keys tonight.



Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Andrew Musselman 
Date: 03/01/2017 10:20 AM (GMT-08:00)
To: user@mahout.apache.org, d...@mahout.apache.org
Subject: Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

Nevermind, that was before building the src distro.

Shell works fine with src and binary distros.

On Wed, Mar 1, 2017 at 9:39 AM, Andrew Musselman  wrote:

> I'm getting this when starting the spark-shell on a Mac:
>
> Loading /Users/andrew.musselman/Downloads/mahout-testing/
> apache-mahout-distribution-0.13.0/bin/load-shell.scala...
> :36: error: object mahout is not a member of package org.apache
>import org.apache.mahout.math._
>  ^
> :19: error: object mahout is not a member of package org.apache
>import org.apache.mahout.math.scalabindings._
>  ^
> :19: error: object mahout is not a member of package org.apache
>import org.apache.mahout.math.drm._
>  ^
> :19: error: object mahout is not a member of package org.apache
>import org.apache.mahout.math.scalabindings.RLikeOps._
>  ^
> :19: error: object mahout is not a member of package org.apache
>import org.apache.mahout.math.drm.RLikeDrmOps._
>  ^
> :19: error: object mahout is not a member of package org.apache
>import org.apache.mahout.sparkbindings._
>  ^
> :21: error: object mahout is not a member of package org.apache
>implicit val sdc: 
> org.apache.mahout.sparkbindings.SparkDistributedContext
> = sc2sdc(sc)
> ^
> :21: error: not found: value sc2sdc
>implicit val sdc: 
> org.apache.mahout.sparkbindings.SparkDistributedContext
> = sc2sdc(sc)
>
> On Wed, Mar 1, 2017 at 9:21 AM, Andrew Musselman  wrote:
>
>> I've confirmed hashes and sigs; if someone other than me could confirm
>> all three sigs it'd be good, e.g.:
>>
>> `gpg --verify apache-mahout-distribution-0.13.0-src.tar.gz.asc`
>> `gpg --verify apache-mahout-distribution-0.13.0.pom.asc`
>> `gpg --verify apache-mahout-distribution-0.13.0.tar.gz.asc`
>>
>> I'll vote after running some tests.
>>
>> On Tue, Feb 28, 2017 at 10:58 PM, Andrew Musselman 
>> wrote:
>>
>>> This is the vote for release 0.13.0 of Apache Mahout.
>>>
>>> The vote will be going for at least 72 hours and will be closed on
>>> Friday,
>>> March 3rd, 2017 or once there are at least 3 PMC +1 binding votes (whichever
>>> occurs earlier).  Please download, test and vote with
>>>
>>> [ ] +1, accept RC as the official 0.13.0 release of Apache Mahout
>>> [ ] +0, I don't care either way,
>>> [ ] -1, do not accept RC as the official 0.13.0 release of Apache Mahout
>>> ,
>>> because...
>>>
>>>
>>> Maven staging repo:
>>>
>>> https://repository.apache.org/content/repositories/orgapache
>>> mahout-1033/org/apache/mahout/apache-mahout-distribution/0.13.0/
>>>
>>> The git tag to be voted upon is mahout-0.13.0
>>>
>>
>>
>


Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

2017-03-04 Thread Andrew Palumbo
This vote has been cancelled due to issues found while testing..  RC2 will be 
out soon.


From: Pat Ferrel 
Sent: Friday, March 3, 2017 4:19:51 PM
To: d...@mahout.apache.org
Subject: Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

scratch that, anyone using sbt use the following resolver:

resolvers += "Apache Staging" at
  "https://repository.apache.org/content/repositories/orgapachemahout-1034"


On Mar 3, 2017, at 10:41 AM, Pat Ferrel  wrote:

My first observation is that the typical way to release Scala libs is with 
multiple versions for the currently popular Scalas. Akka for instance is not 
stable with Scala 2.10 anymore so consider it deprecated and it is the core of 
Spark. Many libs release 2.10 and 2.11 versions of binaries with many already 
on 2.12 also. Mixing versions is a much bigger deal than with Java and the most 
common version for current development seems to be 2.11.

There appear to be ways to mix versions in your builds but it is somewhat more 
complex and not guaranteed to work. 
http://docs.scala-lang.org/overviews/core/binary-compatibility-of-scala-releases.html

On a separate note: for testing in the future, it would facilitate things if
the maven or sbt snippets for testing the staging repo were included.

I assume sbt is:


scalaVersion := "2.10.4"

val mahoutVersion = "0.13.0"

resolvers += "Apache Staging" at
  "https://repository.apache.org/content/repositories/orgapache"

// Mahout's Spark libs (wrapped in libraryDependencies so the snippet builds)
libraryDependencies ++= Seq(
  "org.apache.mahout" %% "mahout-math-scala" % mahoutVersion,
  "org.apache.mahout" %% "mahout-spark" % mahoutVersion,
  "org.apache.mahout"  % "mahout-math" % mahoutVersion,
  "org.apache.mahout"  % "mahout-hdfs" % mahoutVersion
)


BTW I’ve compiled Mahout locally with Scala 2.11, so it may just be a case of
someone having time to update the release process.

I’ll try to test today.



On Mar 3, 2017, at 9:19 AM, Andrew Palumbo mailto:ap@outlook.com>> wrote:





Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Andrew Musselman 
Date: 03/02/2017 10:17 PM (GMT-08:00)
To: d...@mahout.apache.org
Cc: user@mahout.apache.org
Subject: Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

Confirmed hashes and sigs, tested operations in the shell in src and bin
artifacts. Would like someone else to check sigs too.

+1 (binding)

On Wed, Mar 1, 2017 at 9:39 PM, Andrew Musselman  wrote:

> New RC for 0.13.0 release out; please try out the new artifacts at
> https://repository.apache.org/content/repositories/orgapachemahout-1034
>
> The vote will be going for at least 72 hours and will be closed on Friday,
> March 3rd, 2017 or once there are at least 3 PMC +1 binding votes (whichever
> occurs earlier).  Please download, test and vote with
>
> [ ] +1, accept RC as the official 0.13.0 release of Apache Mahout
> [ ] +0, I don't care either way,
> [ ] -1, do not accept RC as the official 0.13.0 release of Apache Mahout,
> because...
>
> The git tag to be voted upon is mahout-0.13.0
>
> On Wed, Mar 1, 2017 at 11:45 AM, Andrew Palumbo  <mailto:ap@outlook.com>>
> wrote:
>
>> I will verify keys tonight.
>>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>>  Original message 
>> From: Andrew Musselman > <mailto:andrew.mussel...@gmail.com>>
>> Date: 03/01/2017 10:20 AM (GMT-08:00)
>> To: user@mahout.apache.org <mailto:user@mahout.apache.org>, 
>> d...@mahout.apache.org <mailto:d...@mahout.apache.org>
>> Subject: Re: [VOTE] Apache Mahout 0.13.0 Release Candidate
>>
>> Nevermind, that was before building the src distro.
>>
>> Shell works fine with src and binary distros.
>>
>> On Wed, Mar 1, 2017 at 9:39 AM, Andrew Musselman <
>> andrew.mussel...@gmail.com <mailto:andrew.mussel...@gmail.com>
>>> wrote:
>>
>>> I'm getting this when starting the spark-shell on a Mac:
>>>
>>> Loading /Users/andrew.musselman/Downloads/mahout-testing/
>>> apache-mahout-distribution-0.13.0/bin/load-shell.scala..

RE: Marketing

2017-03-23 Thread Andrew Palumbo
+1 on revamp.



Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Trevor Grant 
Date: 03/23/2017 12:36 PM (GMT-08:00)
To: user@mahout.apache.org, d...@mahout.apache.org
Subject: Marketing

Hey user and dev,

With 0.13.0 the Apache Mahout project has added some significant updates.

The website is starting to feel 'dated'; I think it could use a reboot.

The blue person riding the elephant has less significance in
Mahout-Samsara's modular backends.

Would like to open the floor to discussion on website reboot (and who might
be willing to take on such a project), as well as new mascot.

To kick off: in an offline talk there was the idea of
a honey badger (bc honey badger don't care, just like Mahout don't care
what back end or native solvers you are using, and also bc a cobra bites a
honey badger and he takes a little nap then wakes up and finishes eating
the cobra. honey badger eats snakes, and does all the work while the other
animals pick up the scraps.
see this short documentary on the honey badger:
https://www.youtube.com/watch?v=4r7wHMg5Yjg )
^^audio not safe for work

Con: its almost tooo jokey.

Other idea: coywolves.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


RE: Marketing

2017-03-25 Thread Andrew Palumbo
That's pretty cool.



Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Ted Dunning 
Date: 03/24/2017 7:22 PM (GMT-08:00)
To: user@mahout.apache.org
Cc: Mahout Dev List 
Subject: Re: Marketing

On Fri, Mar 24, 2017 at 8:27 AM, Pat Ferrel  wrote:

> maybe we should drop the name Mahout altogether.


I have been told that there is a cool secondary interpretation of Mahout as
well.

I think that the Hebrew word is pronounced roughly like Mahout.

מַהוּת

The cool thing is that this word means "essence" or possibly "truth". So
regardless of the guy riding the elephant, Mahout still has something to be
said for it.

(I have no Hebrew, btw)
(real speakers may want to comment here)


Re: Lambda and Kappa CCO

2017-04-09 Thread Andrew Palumbo
Pat-

What can we do from the Mahout side?  Would we need any new data structures?
Trevor and I were just discussing some of the troubles of near-real-time
matrix streaming.


From: Pat Ferrel 
Sent: Monday, March 27, 2017 2:42:55 PM
To: Ted Dunning; user@mahout.apache.org
Cc: Trevor Grant; Ted Dunning; s...@apache.org
Subject: Re: Lambda and Kappa CCO

Agreed. Downsampling was ignored in several places, and with it a great deal of
input is a no-op. Without downsampling, too many things need to change.

Also everything is dependent on this rather vague sentence. “- determine if the 
new interaction element cross-occurs with A and if so calculate the llr score”, 
which needs a lot more explanation. Whether to use Mahout in-memory objects or 
reimplement some in high speed data structures is a big question.

The good thing I noticed in writing this is that model update and real time can
be arbitrarily far apart; the system degrades gracefully. So during high
load it may fall behind but as long as user behavior is up-to-date and 
persisted (it will be) we are still in pretty good shape.


On Mar 26, 2017, at 6:26 PM, Ted Dunning  wrote:


I think that this analysis omits the fact that one user interaction causes many 
cooccurrences to change.

This becomes feasible if you include the effect of down-sampling, but that has 
to be in the algorithm.


From: Pat Ferrel 
Sent: Saturday, March 25, 2017 12:01:00 PM
To: Trevor Grant; user@mahout.apache.org
Cc: Ted Dunning; s...@apache.org
Subject: Lambda and Kappa CCO

This is an overview and proposal for turning the multi-modal Correlated 
Cross-Occurrence (CCO) recommender from Lambda-style into an online streaming 
incrementally updated Kappa-style learner.

# The CCO Recommender: Lambda-style

We have largely solved the problems of calculating the multi-modal Correlated 
Cross-Occurrence models and serving recommendations in real time from real time 
user behavior. The model sits in Lucene (Elasticsearch or Solr) in a scalable
way, and the typical query to produce personalized recommendations comes from
real-time user behavior and completes with 25 ms latency.

# CCO Algorithm

A = rows are users, columns are items they have “converted” on (purchase, read, 
watch). A represents the conversion event—the interaction that you want to 
recommend.
B = rows are users, columns are items that the user has shown some preference
for, but not necessarily the same items as A. B represents a different
interaction than A. B might be a preference for some category, brand, genre, or
just a detailed item page view—or all of these in B, C, D, etc.
h_a = a particular user’s history of A type interactions, a vector of items 
that our user converted on.
h_b = a particular user’s history of B type interactions, a vector of items 
that our user had B type interactions with.

CCO says:

[A’A]h_a + [A’B]h_b + [A’C]h_c = r; where r is the weighted items from A that 
represent personalized recommendations for our particular user.

The innovation here is that A, B, C, … represent multi-modal data. Interactions 
of all types and on item-sets of arbitrary types. In other words we can look at 
virtually any action or possible indicator of user preference or taste. We 
strengthen the above raw cross-occurrence and cooccurrence formula by 
performing:

[llr(A’A)]h_a + [llr(A’B)]h_b + … = r, adding llr (log-likelihood ratio)
correlation scoring to filter out coincidental cross-occurrences.
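
(For reference, the llr score here is Dunning's G^2 statistic over the 2x2
contingency table k of the two events.  In entropy form, a sketch of the
standard formulation -- my paraphrase, not a quote from this write-up:

 LLR = 2 * N * ( H(rowSums(k)) + H(colSums(k)) - H(k) )

where N is the total interaction count and H is the Shannon entropy of the
normalized counts.)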

The model becomes [llr(A’A)], [llr(A’B)], …; each has items from A in rows and
items from A, B, … in columns. This sits in Lucene as one document per item in
A, with a field for each of the A, B, C items whose user interactions most
strongly correlate to the conversion event on the row item. Put another way,
the model is, for each item in A, the items from A, B, C… that have the most
similar user interactions.
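
In Mahout code terms, computing those indicator matrices is essentially one
call into the cooccurrence analysis.  A sketch, assuming drmA and drmB hold
the A and B interaction DRMs -- check the SimilarityAnalysis scaladoc for the
exact parameter names:

 import org.apache.mahout.math.cf.SimilarityAnalysis

 // Returns List(llr(A'A), llr(A'B)): the self- and cross-occurrence
 // indicator matrices, already LLR-filtered and downsampled.
 val indicators = SimilarityAnalysis.cooccurrences(
   drmARaw = drmA,
   randomSeed = 1234,
   maxInterestingItemsPerThing = 50,
   maxNumInteractions = 500,
   drmBs = Array(drmB))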

To calculate r we need to find the most similar items in the model to the
history or behavior of our example user. Since Lucene is basically a K-Nearest 
Neighbors engine that is particularly well tuned to work with sparse data (our 
model is typically quite sparse) all we need to do is segment the user history 
into h_a, h_b … and use it as the multi-field query on the model. This performs 
the equivalent of:

[llr(A’A)]h_a + [llr(A’B)]h_b + … = r, where we substitute cosine similarity of
h_a to every row of [llr(A’A)] for the tensor math. Further, Lucene sorts by
score and returns only the top-ranking items. Even further, we note that since
we have performed a multi-field query, it does the entire multi-field similarity
calculation and vector segment addition before doing the sort. Lucene does this
in a very performant manner, so the entire query, including fetching user
history, forming the Lucene query and executing it, will take something like 25
ms and is indefinitely scalable to any number of simultaneous queries.

Problem solved?

Well, yes and no. The above method I’ve la

Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

2017-04-12 Thread Andrew Palumbo

It looks like we're missing some jars from the binary distro:


bin
conf
derby.log
docs
examples
flink
h2o
lib
metastore_db
viennacl
viennacl-omp
LICENSE.txt
NOTICE.txt
README.md
mahout-examples-0.13.0.jar
mahout-examples-0.13.0-job.jar
mahout-hdfs-0.13.0.jar
mahout-integration-0.13.0.jar
mahout-math-0.13.0.jar
mahout-math-scala_2.10-0.13.0.jar
mahout-mr-0.13.0.jar
mahout-mr-0.13.0-job.jar
mahout-spark_2.10-0.13.0-dependency-reduced.jar
mahout-spark_2.10-0.13.0.jar


From: Andrew Musselman 
Sent: Tuesday, April 11, 2017 12:05 PM
To: user@mahout.apache.org; d...@mahout.apache.org
Subject: Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

I've checked hashes and sigs, and run a build with passing tests for
vanilla, viennacl, and viennacl-omp profiles.

The spark shell runs the SparseSparseDrmTimer.mscala example in the binary
build and all three source profiles; I saw the GPU get exercised when
running the viennacl profile from source, and saw all cores on the CPU get
exercised when running the viennacl-omp profile from source.

So far I'm +1 (binding).



On Tue, Apr 11, 2017 at 8:55 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> This is the vote for release 0.13.0 of Apache Mahout.
>
> The vote will be going for at least 72 hours and will be closed on Friday,
> April 17th, 2017 or once there are at least 3 PMC +1 binding votes (whichever
> occurs earlier).  Please download, test and vote with
>
> [ ] +1, accept RC as the official 0.13.0 release of Apache Mahout
> [ ] +0, I don't care either way,
> [ ] -1, do not accept RC as the official 0.13.0 release of Apache Mahout,
> because...
>
>
> Maven staging repo:
>
> https://repository.apache.org/content/repositories/orgapachemahout-1042/org/apache/mahout/apache-mahout-distribution/0.13.0/
>
> The git tag to be voted upon is mahout-0.13.0
>


Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

2017-04-12 Thread Andrew Palumbo
oops- sent last email accidentally without finishing..



It looks like we are missing some jars from the binary distro:


andy@micheal:~/sandbox/apache-mahout-distribution-0.13.0$ ls *.jar


mahout-integration-0.13.0.jar

mahout-math-0.13.0.jar

mahout-math-scala_2.10-0.13.0.jar

mahout-mr-0.13.0.jar

mahout-mr-0.13.0-job.jar

mahout-spark_2.10-0.13.0-dependency-reduced.jar

mahout-spark_2.10-0.13.0.jar

mahout-examples-0.13.0.jar

mahout-examples-0.13.0-job.jar

mahout-hdfs-0.13.0.jar


we are missing mahout-native-viennacl_2.10.jar and 
mahout-native-viennacl-omp.jar


I think that we need to try a different build command.


Andy



From: Andrew Palumbo 
Sent: Wednesday, April 12, 2017 10:42:30 PM
To: user@mahout.apache.org
Subject: Re: [VOTE] Apache Mahout 0.13.0 Release Candidate



It looks like we're missing some jars from the binary distro:


bin
conf
derby.log
docs
examples
flink
h2o
lib
metastore_db
viennacl
viennacl-omp
LICENSE.txt
NOTICE.txt
README.md
mahout-examples-0.13.0.jar
mahout-examples-0.13.0-job.jar
mahout-hdfs-0.13.0.jar
mahout-integration-0.13.0.jar
mahout-math-0.13.0.jar
mahout-math-scala_2.10-0.13.0.jar
mahout-mr-0.13.0.jar
mahout-mr-0.13.0-job.jar
mahout-spark_2.10-0.13.0-dependency-reduced.jar
mahout-spark_2.10-0.13.0.jar


From: Andrew Musselman 
Sent: Tuesday, April 11, 2017 12:05 PM
To: user@mahout.apache.org; d...@mahout.apache.org
Subject: Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

I've checked hashes and sigs, and run a build with passing tests for
vanilla, viennacl, and viennacl-omp profiles.

The spark shell runs the SparseSparseDrmTimer.mscala example in the binary
build and all three source profiles; I saw the GPU get exercised when
running the viennacl profile from source, and saw all cores on the CPU get
exercised when running the viennacl-omp profile from source.

So far I'm +1 (binding).



On Tue, Apr 11, 2017 at 8:55 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> This is the vote for release 0.13.0 of Apache Mahout.
>
> The vote will be going for at least 72 hours and will be closed on Friday,
> April 17th, 2017 or once there are at least 3 PMC +1 binding votes (whichever
> occurs earlier).  Please download, test and vote with
>
> [ ] +1, accept RC as the official 0.13.0 release of Apache Mahout
> [ ] +0, I don't care either way,
> [ ] -1, do not accept RC as the official 0.13.0 release of Apache Mahout,
> because...
>
>
> Maven staging repo:
>
> https://repository.apache.org/content/repositories/orgapachemahout-1042/org/apache/mahout/apache-mahout-distribution/0.13.0/
>
> The git tag to be voted upon is mahout-0.13.0
>


Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

2017-04-15 Thread Andrew Palumbo
+1 (binding)


Built and tested source distribution with both profiles -Pviennacl and 
-Pviennacl-omp.


Ran SparseSparseDrmTimer.mscala through the shell in both pseudo-cluster and
local[2] mode.

Tested with several iterations and combinations of arguments, e.g.:

timeSparseDRMMMul(1000,1000,1000,5,.2,1234L)

Mostly without issue on a consumer-grade card.

Note: in the shell, if %*% is called by a partition while the GPU is already in
use, we get a native failure exception, which is caught and allows MMul for
that partition to fall back to JVM MMul (single-threaded).  This should be
changed in 0.13.1 to fall back to OpenMP MMul instead.

Note: The binary distribution is built for Tesla GPUs. My card was not
compatible, though our target is higher-end GPUs on AWS or PowerPC (PowerPC
uses Teslas), so not a blocker IMO.

We will target a wider range of cards in the next distributions.












From: Andrew Musselman 
Sent: Saturday, April 15, 2017 2:48:17 AM
To: user@mahout.apache.org; d...@mahout.apache.org
Subject: Re: [VOTE] Apache Mahout 0.13.0 Release Candidate

Hashes and sigs confirmed, bin and src (viennacl and viennacl-omp)
artifacts run the spark shell and the sparse drm test fine, and kick off
the GPU.

+1 (binding)

On Fri, Apr 14, 2017 at 10:25 PM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> This is the vote for release 0.13.0 of Apache Mahout.
>
> The vote will be going for at least 72 hours and will be closed on
> Monday, April 17th, 2017 or once there are at least 3 PMC +1 binding votes
>  (whichever occurs earlier).  Please download, test and vote with
>
> [ ] +1, accept RC as the official 0.13.0 release of Apache Mahout
> [ ] +0, I don't care either way,
> [ ] -1, do not accept RC as the official 0.13.0 release of Apache Mahout,
> because...
>
>
> Maven staging repo:
>
> https://repository.apache.org/content/repositories/orgapachemahout-1044/org/apache/mahout/apache-mahout-distribution/0.13.0
>
> The git tag to be voted upon is mahout-0.13.0
>


Re: [RESULT] [VOTE] Apache Mahout 0.13.0 Release Candidate

2017-04-16 Thread Andrew Palumbo
Thanks very much, Andrew.  The link to the release notes is here:

https://docs.google.com/document/d/1pQGBLlBuBl5q1Xo7AmDCVp0hUFfKWpkjnJRf2x9zxGA/edit

Thanks all for putting in such hard work on this milestone release.


Andy


From: Andrew Musselman 
Sent: Sunday, April 16, 2017 11:48:46 PM
To: d...@mahout.apache.org
Cc: user@mahout.apache.org
Subject: [RESULT] [VOTE] Apache Mahout 0.13.0 Release Candidate

With three binding +1s and no -1 or 0 votes this vote PASSES.

I'll publish the artifacts and ask people to pitch in to some post-release
tasks.

Andy, do you have a link to some release notes we can help with so we can
send out an announcement?

Thanks all for the hard work and persistence in the final steps. This is a
good one, happy to see things come together.

Best
Andrew

On Sun, Apr 16, 2017 at 5:44 PM, Trevor Grant 
wrote:

> +1 (binding)
>
> Judicial Opinion:
>
> Verified Signatures
>
> Built with `mvn clean package -Phadoop2` ; `mvn clean package -Phadoop2
> -Pviennacl-omp` ; `mvn clean package -Phadoop2 -DskipTests` (see below)
> For all builds, ran the `spark item similarity` example with no issue; the `wiki`
> example fails with non-Mahout-related file-not-found errors. We need a JIRA to
> update the file path. As this is an example whose failing functionality is not
> related to Mahout, I find this non-blocking, because it is likely all of the
> prior releases are now also broken (that is, if we had released this a
> month ago, the release would still now be broken)
>
> For viennacl, the build fails on the `sparse mmul microbench test`.  I have
> two machines, both fail.  Both have older graphics cards (specs below).  On
> the older of the two, I am able to make something very similar to the test
> pass in the shell if I set `s=625` (using `timeSparseDRMMMul`). This
> implies that the functionality is sound, but my cards simply aren't new
> enough to pass the test.  I recommend opening a JIRA to tune down the
> aggressiveness of the test- e.g. making `s=500` (currently `s=1000`).
>
> This recommendation is based on 1) my cards aren't _that_ old and 2) I
> don't want to buy new graphics cards just to pass unit tests.
> (Included CPU arch for reference wrt OMP)
>
> Dev Box:
>
> NVidia GeForce GT 740
> Driver Version: 352.63
> Memory: 1021MiB
>
> ViennaCL v 1.7.0
>
> CPU: Intel Core i7-3770K @3.50 Ghz (Ivy Bridge Micro Arch- 3rd Gen)
>
> Ubuntu 14.04.3 LTS
>
>
> Laptop:
> NVidia GeForce GTX 960M
> Driver Version: 367.57
> Memory: 2002MiB
>
> CPU: Intel Core i7-5500U @ 2.40 Ghz (Broadwell-U Micro arch - 5th Gen)
>
> ViennaCL v 1.7.0
>
> Ubuntu 16.04.1 LTS
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Sat, Apr 15, 2017 at 11:35 PM, Andrew Palumbo 
> wrote:
>
> > +1 (binding)
> >
> >
> > Built and tested source distribution with both profiles -Pviennacl and
> > -Pviennacl-omp.
> >
> >
> > Ran SparseSparseDrmTimer.mscala through the shell in both pseudo cluster
> > and local[2] mode
> >
> > Tested with several iterations and combinations of arguments eg:
> >
> > timeSparseDRMMMul(1000,1000,1000,5,.2,1234L)
> >
> > Mostly with out issue in a consumer grade card.
> >
> > Note: in the shell after the %*% is called by a partition and the GPU is
> > in use, will get a native failure exception which is caught and allows
> for
> > MMul of that partition to fall back to JVM MMul (single-threaded).. this
> > should be changed in 0.13.1 to fall back to OpenMP MMul.
> >
> > Note: Binary distribution is built for Tesla GPUs. My card was not
> > compatible, though out target is higher end GPUs on AWS or PowerPC
> (PowerPC
> > uses  teslas) so not a blocker IMO.
> >
> > We will target a wider range of cards in the next distributions.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > 
> > From: Andrew Musselman 
> > Sent: Saturday, April 15, 2017 2:48:17 AM
> > To: user@mahout.apache.org; d...@mahout.apache.org
> > Subject: Re: [VOTE] Apache Mahout 0.13.0 Release Candidate
> >
> > Hashes and sigs confirmed, bin and src (viennacl and viennacl-omp)
> > artifacts run the spark shell and the sparse drm test fin

RE: Welcome our GSoC Student Aditya Sarma

2017-05-04 Thread Andrew Palumbo
Welcome!!



Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Jim Jagielski 
Date: 05/04/2017 10:33 AM (GMT-08:00)
To: d...@mahout.apache.org
Cc: priv...@mahout.apache.org, user@mahout.apache.org, Aditya 

Subject: Re: Welcome our GSoC Student Aditya Sarma

Welcome!!
> On May 4, 2017, at 1:24 PM, Trevor Grant  wrote:
>
> Hello all,
>
> I want to extend a warm welcome to Aditya Sarma, who has been accepted to
> the Mahout Project as Part of the Google Summer of Code program.
>
> Aditya will be working on "DBSCAN Clustering In Mahout", if you go back in
> the archives you can see his full proposal.
>
> We're really excited to have him, and looking forward to a great summer.
>
> Aditya, would you like to say a few words to introduce yourself?



RE: New logo

2017-05-06 Thread Andrew Palumbo
+1 :)



Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: "Scott C. Cote" 
Date: 05/06/2017 2:43 PM (GMT-08:00)
To: user@mahout.apache.org
Cc: Trevor Grant , Mahout Dev List 

Subject: Re: New logo

Will you be wearing “one of those t-shirts” on Monday  in Houston :)   ?
SCott
Scott C. Cote
scottcc...@gmail.com
972.672.6484



> On May 6, 2017, at 1:52 PM, Ted Dunning  wrote:
>
> I know where one of those t-shirts is.
>
>
>
> On Sat, May 6, 2017 at 7:13 AM, Isabel Drost-Fromm 
> wrote:
>
>> The green logo was the very first design iteration before, iirc, Robin came
>> up with the yellow one. There should be like five T-shirts worldwide with the
>> old logo printed in 2009.
>>
>>
>> Am 1. Mai 2017 20:41:43 MESZ schrieb Trevor Grant <
>> trevor.d.gr...@gmail.com>:
>>> Thanks Scott,
>>>
>>> You are correct- in fact we're going even further now, that you can do
>>> native optimization regardless of the architecture with native-solvers.
>>>
>>> Do you or anyone more familiar with the history of the website know
>>> anything about the origins/uses of this:
>>> https://mahout.apache.org/images/Mahout-logo-245x300.png
>>> It seems to be a green mahout logo.
>>>
>>> Also Scott, or anyone lurking who may be able to help.  As part of the
>>> website reboot I've included a "history" page and would really
>>> apppreciate
>>> some help capturing that from first person sources if possible. Ive put
>>> in
>>> some headers but those are only directional:
>>>
>>> https://github.com/rawkintrevo/mahout/blob/website/website/front/
>> community/history.md
>>>
>>>
>>>
>>> Trevor Grant
>>> Data Scientist
>>> https://github.com/rawkintrevo
>>> http://stackexchange.com/users/3002022/rawkintrevo
>>> http://trevorgrant.org
>>>
>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>>
>>>
>>> On Mon, May 1, 2017 at 11:18 AM, scott cote 
>>> wrote:
>>>
 Trevor et al:

 Some ideas to spur you on (and related points):

 Mahout is no longer a grab bag of algorithms and routines, but a math
 language right?  You don’t care about the under the cover
>>> implementation.
 Today its Spark with alternative implementations in Flink, etc ….

 Don’t know if that is the long term goal still  - haven’t kept up -
>>> but it
 seems like you are insulating yourself from the underlying
>>> technology.

 Math is a universal language.  Right?

 Tower of Babel is coming to mind ….

 SCott

> On Apr 27, 2017, at 10:27 PM, Trevor Grant
>>> 
 wrote:
>
> It also bugs me when I can't suggest any alternatives, yet don't
>>> like the
> ones in front of me...
>
> I became aware of a symbol a week or so ago, and it keeps coming
>>> back to
> me.
>
> The Enso.
> https://en.wikipedia.org/wiki/Ens%C5%8D
>
> Things I like about it:
> (all from wikipedia, since the only thing I knew about this symbol
>>> prior
 is
> that someone I met had a tattoo of it).
> It represents (among a few other things) enlightenment.
> ^^ This resonated with the 'alternate definition of mahout' from
>>> Hebrew-
> which may be something akin to essence or truth.
>
> It is a circle- which plays to the Samsara theme.
>
> It is very expressive, a simple one or two brush stroke circle
>>> which
> symbolizes several large concepts and things about the creator,
 expressive
> like our DSL (I feel gross comparing such a symbol to a Scala DSL,
>>> but
 I'm
> spit balling here, please forgive me- I am not so expressive).
>
> "Once the *ensō* is drawn, one does not change it. It evidences the
> character of its creator and the context of its creation in a
>>> brief,
> contiguous period of time." Which reminds me of the DRMs
>
> In closed form it represents something akin to Plato's perfection-
>>> which
 a
> little more wiki surfing tells me is the idea that no one can
>>> create a
> perfect circle because a circle is a collection of infinite points
>>> and
 how
> could ever be sure that you have arranged each one properly, yet
>>> such
> things must exist, or what blueprint would a creator of circles be
 striving
> for.  This, by-the-by reminds me of stochastic approaches to
>>> solving
> problems, and really statistics / "machine-learning" in general, in
>>> that
 we
> can't find perfect solutions, yet we believe solutions exist and
>>> serve as
> our blueprint.
>
> Finally, I like that it is simple.
>
> Things I don't like about it:
> Lucent Technologies used it back in the 90s, however they used a
>>> very
> specific red one, and this isn't a deal breaker for me.
>
> Other thoughts:
> Based on the tattoo I saw- one could make an Enso using old mahout
>>> color
> palatte if one were to dab their brush in the appropriate colors.
>>> This
> could also b

Re: New Website is Staged

2017-05-08 Thread Andrew Palumbo
Trevor, That link takes me back to the old m.a.o page.


From: Trevor Grant 
Sent: Monday, May 8, 2017 1:53:46 PM
To: Mahout Dev List; user@mahout.apache.org
Subject: New Website is Staged

Hey all,

The new website is staged. You can view it here

http://mahout.staging.apache.org/

Won't be publishing for a bit yet- there are still a few JIRAs left to do
before its ready, but you can check it out there anyway.

A couple of admin things:
1- New developer and community pages are linked from the landing site and
new navbar, the landing page isn't done yet btw (one of the last todos)

2- All linkbacks from the old site should continue to work, pages were
maintained however, they have had new skin applied to them.

3- The current website is also available in
http://mahout.staging.apache.org/docs/0.13.0/
and will be preserved for posterity.

4- new style docs, which I recommend everyone check out are available in
http://mahout.staging.apache.org/docs/0.13.1-SNAPSHOT/


We have 6 high level talks coming up in the next 2 weeks and would like to
have the shiny new website fielded if possible; working hard on getting
it ready.

If you have any updates recommendations, etc, feel free to open a PR (all
website code is contained in master now).


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


Re: New Website is Staged

2017-05-08 Thread Andrew Palumbo
I disagree with it being too bland- I find the open space and the formatting 
much easier to navigate and read docs from.



From: Khurrum Nasim 
Sent: Monday, May 8, 2017 2:36:54 PM
To: Mahout Dev List; user@mahout.apache.org; d...@mahout.apache.org
Subject: Re: New Website is Staged

Too bland looking

Thanks,

Khurrum.

On May 8, 2017, 1:53 PM -0400, Trevor Grant , wrote:
> Hey all,
>
> The new website is staged. You can view it here
>
> http://mahout.staging.apache.org/
>
> Won't be publishing for a bit yet- there are still a few JIRAs left to do
> before its ready, but you can check it out there anyway.
>
> A couple of admin things:
> 1- New developer and community pages are linked from the landing site and
> new navbar, the landing page isn't done yet btw (one of the last todos)
>
> 2- All linkbacks from the old site should continue to work, pages were
> maintained however, they have had new skin applied to them.
>
> 3- The current website is also available in
> http://mahout.staging.apache.org/docs/0.13.0/
> and will be preserved for posterity.
>
> 4- new style docs, which I recommend everyone check out are available in
> http://mahout.staging.apache.org/docs/0.13.1-SNAPSHOT/
>
>
> We have 6 high level talks coming up in the next 2 weeks and would like to
> have the shiny new website fielded if possible; working hard on getting
> it ready.
>
> If you have any updates recommendations, etc, feel free to open a PR (all
> website code is contained in master now).
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things." -Virgil*


Re: New Website is Staged

2017-05-09 Thread Andrew Palumbo
+1 to DFW talk, but I think we can all come up with a common set of bullet 
points for the front page if you want also.


From: Pat Ferrel 
Sent: Tuesday, May 9, 2017 11:31:09 AM
To: user@mahout.apache.org
Cc: Mahout Dev List
Subject: Re: New Website is Staged

Are you guys ready for serious comments on the new design or is this just a 
first running version?


On May 9, 2017, at 8:20 AM, Trevor Grant  wrote:

In the interest of getting this thing up and running, use DFW Meetup video
as a place holder for time being?

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Tue, May 9, 2017 at 10:17 AM, Andrew Palumbo  wrote:

> I think its a great idea- I'm probably more camera shy than trevor 😊.
> Maybe we can spread the fun around the PMC at lunch after GTC
>
> 
> From: Trevor Grant 
> Sent: Tuesday, May 9, 2017 12:02:39 AM
> To: Mahout Dev List
> Cc: user@mahout.apache.org
> Subject: Re: New Website is Staged
>
> I agree it could be compelling. I'm somewhat video shy so I nominate our
> PMC chair and moral compass @apalumbo
>
> :)
>
> On May 8, 2017 6:29 PM, "Kimberly Brown"  wrote:
>
>> Hey I was looking and had an idea.  What if someone recorded a youtube
>> video whiteboarding the map-reduce mahout and how it’s evolved, so the
>> “transformation” story for people discovering or rediscovering mahout.
>> Benefits:  simple to create and embed, rich content (maybe 5-10 minutes)
>> but still keeps the clean look. And everyone loves a walk-through video
>> explanation over a paragraph ☺
>>
>> --
>>
>> Kim Brown
>> Founder, CEO | Centrally Human LLC
>> k...@centrallyhuman.com
>> LinkedIn <https://www.linkedin.com/in/kim-weisensee-brown-33178011>
>>
>>
>> On 5/8/17, 6:23 PM, "Trevor Grant"  wrote:
>>
>>Khurrum,
>>
>>Thanks for the feed back, anything more specific?
>>
>>
>>
>>
>>
>>Trevor Grant
>>Data Scientist
>>https://github.com/rawkintrevo
>>http://stackexchange.com/users/3002022/rawkintrevo
>>http://trevorgrant.org
>>
>>*"Fortunate is he, who is able to know the causes of things."
> -Virgil*
>>
>>
>>On Mon, May 8, 2017 at 4:57 PM, Andrew Palumbo 
>> wrote:
>>
>>> I disagree with it being too bland- I find the open space and the
>>> formatting much easier to navigate and read docs from.
>>>
>>>
>>> 
>>> From: Khurrum Nasim 
>>> Sent: Monday, May 8, 2017 2:36:54 PM
>>> To: Mahout Dev List; user@mahout.apache.org; d...@mahout.apache.org
>>> Subject: Re: New Website is Staged
>>>
>>> Too bland looking
>>>
>>> Thanks,
>>>
>>> Khurrum.
>>>
>>> On May 8, 2017, 1:53 PM -0400, Trevor Grant <
>> trevor.d.gr...@gmail.com>,
>>> wrote:
>>>> Hey all,
>>>>
>>>> The new website is staged. You can view it here
>>>>
>>>> http://mahout.staging.apache.org/
>>>>
>>>> Won't be publishing for a bit yet- there are still a few JIRAs
>> left to do
>>>> before its ready, but you can check it out there anyway.
>>>>
>>>> A couple of admin things:
>>>> 1- New developer and community pages are linked from the landing
>> site and
>>>> new navbar, the landing page isn't done yet btw (one of the last
>> todos)
>>>>
>>>> 2- All linkbacks from the old site should continue to work, pages
>> were
>>>> maintained however, they have had new skin applied to them.
>>>>
>>>> 3- The current website is also available in
>>>> http://mahout.staging.apache.org/docs/0.13.0/
>>>> and will be preserved for posterity.
>>>>
>>>> 4- new style docs, which I recommend everyone check out are
>> available in
>>>> http://mahout.staging.apache.org/docs/0.13.1-SNAPSHOT/
>>>>
>>>>
>>>> We have 6 high level talks coming up in the next 2 weeks and would like to
>>>> have the shiny new website fielded if possible; working hard on getting
>>>> it ready.
>>>>
>>>> If you have any updates recommendations, etc, feel free to open a
>> PR (all
>>>> website code is contained in master now).
>>>>
>>>>
>>>> Trevor Grant
>>>> Data Scientist
>>>> https://github.com/rawkintrevo
>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>> http://trevorgrant.org
>>>>
>>>> *"Fortunate is he, who is able to know the causes of things."
>> -Virgil*
>>>
>>
>>
>>
>>
>



Re: UnsatisfiedLinkError: jniViennaCL

2017-05-09 Thread Andrew Palumbo
Hello Sebastian,


> - macOS Sierra 10.12.4 (MacBook Pro, Retina, 13-inch, Early 2015)


The native code in the javacpp modules does not have a build profile for MacOS.


We do currently have a jira open for this:  
https://issues.apache.org/jira/browse/MAHOUT-1908

The fix for the mac build should be relatively simple; the issue is that Clang
(LLVM) on mac does not ship with OpenMP. You can get around this by working
with gcc6.  I had it almost finished at one point, but other things took priority..


The properties files are right here:
https://github.com/apache/mahout/tree/master/viennacl

If you'd like to take a shot at creating a mac-os.properties file, it would be
great; the mac property file would just have to be swapped in here:

https://github.com/apache/mahout/blob/master/viennacl/pom.xml#L142

and the project rebuilt.
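For the build itself, this is roughly the shape of what I had been trying; a
rough sketch, assuming gcc6 from homebrew and that the hypothetical
mac-os.properties has been wired into the pom:

    brew install gcc@6
    export CC=gcc-6
    export CXX=g++-6
    mvn clean package -Pviennacl -DskipTests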


Thanks,


Andy








From: Sebastian Lehrig 
Sent: Tuesday, May 9, 2017 9:54:05 AM
To: user@mahout.apache.org
Subject: UnsatisfiedLinkError: jniViennaCL


Hi,

After installing the newest mahout release (13.0.0), I'm getting exactly
this issue:
https://issues.apache.org/jira/browse/MAHOUT-1946

How to fix that?

Some details about my setup:
- mahout 0.13.0
- spark 1.6.3 (spark-1.6.3-bin-hadoop2.6 package)
- java version "1.8.0_121"
- scala version 2.10.5
- macOS Sierra 10.12.4 (MacBook Pro, Retina, 13-inch, Early 2015)
- graphics card: Intel Iris Graphics 6100 1536 MB

Thanks and regards,
Sebastian


RE: Proposal for changing Mahout's Git branching rules

2017-06-21 Thread Andrew Palumbo
Pat - I just want to clear one point up.. Trevor volunteering to head up this
release and the git-flow plans are independent of each other.  The 0.13.1
release was originally planned as a quick follow up to 0.13.0 for each
scala/spark combo; I think this will be 6 artifacts: spark 1.6.x - 2.1.x for
scala 2.11 and scala 2.10. We'd hoped it would be straightforward and
something almost automatable.

The git-flow change idea was floated, I believe by you, around the same time
(correct me if I'm wrong.. this was all happening while I was sick).  I agree
that it should be a team decision, but it also might take some time to
transition.



Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Dmitriy Lyubimov 
Date: 06/21/2017 2:06 PM (GMT-08:00)
To: user@mahout.apache.org
Cc: Mahout Dev List 
Subject: Re: Proposal for changing Mahout's Git branching rules

So people need to make sure their PR merges to develop instead of master?
Do they need to PR against the develop branch, and if not, who is responsible
for the conflict resolution that is to arise from diffing and merging into
different targets?

On Tue, Jun 20, 2017 at 10:09 AM, Pat Ferrel  wrote:

> As I said I was sure there would be Jenkins issues but they must be small
> since it’s just renaming of target branches. Releases are still made from
> master so I don’t see the issue there at all. Only intermediate CI tasks
> are triggered on other branches. But they would have to be in your examples
> too so I don’t see the benefit of using an ad hoc method in terms of CI.
> We’ve used this method for years with Apache PredictionIO with minimal CI
> issues.
>
> No the process below is not equivalent, treating master as develop removes
> the primary (in my mind) benefit. In git flow the master is always stable
> and the reflection of the last primary/core/default release with only
> critical inter-release fixes. If someone wants to work with stable
> up-to-date source, where do they go with the current process? I would claim
> that there actually may be no place to find such a thing except by tracking
> down some working commit number. It would depend on what stage the project
> is in, in git flow there is never a question—master is always stable. Git
> flow also accounts for all the process exceptions and complexities you
> mention below but in a standardized way that is documented so anyone can
> read the rules and follow them. We/Mahout doesn’t even have to write them,
> they can just be referenced.
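> For anyone who hasn't seen git flow, the shape is roughly this (textbook
> commands as a sketch, not a Mahout-specific policy):
>
>     git checkout -b feature/my-feature develop        # day-to-day work branches off develop
>     git checkout develop && git merge feature/my-feature
>     git checkout -b release/0.13.1 develop            # stabilization happens on a release branch
>     git checkout master && git merge release/0.13.1   # master only ever receives releases and hotfixes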
>
> But we are re-arguing something I thought was already voted on and that is
> another issue. If we need to re-debate this let’s make it stick one way or
> the other.
>
> I really appreciate you being release master and the thought and work
> you’ve put into this and if we decide to stick with it, fine. But it should
> be a project decision that release masters follow, not up to each release
> master. We are now embarking on a much more complex release than before
> with multiple combinations of dependencies for binaries and so multiple
> artifacts. We need to make the effort tame the complexity somehow or it
> will just multiply.
>
> Given the short nature of the current point release I’d even suggest that
> we target putting our decision in practice after the release, which is a
> better time to make a change if we are to do so.
>
>
> On Jun 19, 2017, at 9:04 PM, Trevor Grant 
> wrote:
>
> First issue, one does not simply just start using a develop branch.  CI
> only triggers off the 'main' branch, which is master by default.  If we
> move to the way you propose, then we need to file a ticket with INFRA I
> believe.  That can be done, but its not like we just start doing it one
> day.
>
> The current method is, when we cut a release- we make a new branch of that
> release. Master is treated like dev. If you want the latest stable, you
> would check out branch-0.13.0 .  This is the way most major projects
> (citing Spark, Flink, Zeppelin), including Mahout up to version 0.10.x
> worked.  To your point, there being a lack of a recent stable- that's fair,
> but partly that's because no one created branches with the release for
> 0.10.? - 0.12.2.
>
> For all intents and purposes, we are (now once again) following what you
> propose, the only difference is we are treating master as dev, and
> "branch-0.13.0" as master (e.g. last stable).  Larger features go on their
> own branch until they are ready to merge- e.g. ATM there is just one
> feature branch CUDA.  That was the big take away from this discussion last
> time- there needed to be feature branches, as opposed to everyone running
> around either working off WIP PRs or half baked merges, etc.  To that end-
> "website" was a feature branch, and iirc there has been one other feature
> branch that has merged in the last couple of months but I forget what it
> was at the moment.
>
>
>
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> 

[NOTICE] JIRA activity email migration to iss...@mahout.apache.org

2017-07-03 Thread Andrew Palumbo
All,


Please note that, in an effort to reduce noise on d...@mahout.apache.org, so 
that we may more easily use it for planning purposes and discussion, we will be 
migrating Mahout JIRA email comments to the newly formed 
iss...@mahout.apache.org mailing list.


As many notes are made and much discussion takes place on JIRA, it will be
necessary that all committers, current contributors and all others interested
subscribe to this list via:


  issues-subscr...@mahout.apache.org


This being a holiday week in the states, I am planning to make the final
configuration changes this coming weekend.


Current committers and developers: if this is too early, again with the July 4
holiday, please do let me know.


Thanks very much,


Andy


[REMINDER] Jira emails to be moved off d...@mahout.apache.org

2017-07-06 Thread Andrew Palumbo
As a reminder:


Jira email comments will be moved to:


iss...@mahout.apache.org


all committers, devs and anyone else interested in Bug Fix, New Feature and 
issue planning comments must subscribe:


   issues-subscr...@mahout.apache.org


-- Andy


RE: [DISCUSS] Naming convention for multiple spark/scala combos

2017-07-08 Thread Andrew Palumbo
+1 if so (sbt naming, re: Pat's comment).

Also +1 on Zeppelin integration being non-trivial.



Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Pat Ferrel 
Date: 07/07/2017 10:35 PM (GMT-08:00)
To: d...@mahout.apache.org
Cc: Holden Karau , user@mahout.apache.org, Dmitriy 
Lyubimov , Andrew Palumbo 
Subject: Re: [DISCUSS] Naming convention for multiple spark/scala combos

IIRC these all fit sbt’s conventions?


On Jul 7, 2017, at 2:05 PM, Trevor Grant  wrote:

So to tie all of this together-

org.apache.mahout:mahout-spark_2.10:0.13.1_spark_1_6
org.apache.mahout:mahout-spark_2.10:0.13.1_spark_2_0
org.apache.mahout:mahout-spark_2.10:0.13.1_spark_2_1

org.apache.mahout:mahout-spark_2.11:0.13.1_spark_1_6
org.apache.mahout:mahout-spark_2.11:0.13.1_spark_2_0
org.apache.mahout:mahout-spark_2.11:0.13.1_spark_2_1

(will jars compiled with 2.1 dependencies run on 2.0? I assume not, but I
don't know) (afaik, mahout compiled for spark 1.6.x tends to work with
spark 1.6.y, anecdotal)
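For consumers, these would resolve like any other artifact; a hypothetical
Maven dependency under this scheme (coordinates illustrative):

    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-spark_2.11</artifactId>
      <version>0.13.1_spark_2_1</version>
    </dependency>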

A non-trivial motivation here, is we would like all of these available to
tighten up the Apache Zeppelin integration, where the user could have a
number of different spark/scala combos going on and we want it to 'just
work' out of the box (which means a wide array of binaries available, to
dmitriy's point).

I'm +1 on this, and as RM will begin cutting a provisional RC, just to try
to figure out how all of this will work (it's my first time as release
master, and this is a new thing we're doing).

72 hour lazy consensus. (will probably take me 72 hours to figure out
anyway ;) )

If no objections expect an RC on Monday evening.

tg

On Fri, Jul 7, 2017 at 3:24 PM, Holden Karau  wrote:

> Trevor looped me in on this since I hadn't had a chance to subscribe to
> the list yet (on now :)).
>
> Artifacts from cross spark-version building aren't super standardized (and
> there are two sort-of very different types of cross-building).
>
> For folks who just need to build for the 1.X and 2.X and branches
> appending _spark1 & _spark2 to the version string is indeed pretty common
> and the DL4J folks do something pretty similar as Trevor pointed out.
>
> The folks over at hammerlab have made some sbt specific tooling to make
> this easier to do on the publishing side (see
> https://github.com/hammerlab/sbt-parent )
>
> It is true some people build Scala 2.10 artifacts for Spark 1.X series and
> 2.11 artifacts for Spark 2.X series only and use that to differentiate (I
> don't personally like this approach since it is super opaque and someone
> could upgrade their Scala version and then accidentally be using a
> different version of Spark which would likely not go very well).
>
> For folks who need to hook into internals and cross build against
> different minor versions there is much less of a consistent pattern,
> personally spark-testing-base is released as:
>
> [artifactname]_[scalaversion]:[sparkversion]_[artifact releaseversion]
>
> But this really only makes sense when you have to cross-build for lots of
> different Spark versions (which should be avoidable for Mahout).
>
> Since you are likely not depending on the internals of different point
> releases, I'd think the _spark1 / _spark2 is probably the right way (or
> _spark_1 / _spark_2 is fine too).
>
>
> On Fri, Jul 7, 2017 at 11:43 AM, Trevor Grant 
> wrote:
>
>>
>> -- Forwarded message --
>> From: Andrew Palumbo 
>> Date: Fri, Jul 7, 2017 at 12:28 PM
>> Subject: Re: [DISCUSS] Naming convention for multiple spark/scala combos
>> To: "d...@mahout.apache.org" 
>>
>>
>> another option for artifact names (using jars for example here):
>>
>>
>> mahout-spark-2.11_2.10-0.13.1.jar
>> mahout-spark-2.11_2.11-0.13.1.jar
>> mahout-math-scala-2.11_2.10-0.13.1.jar
>>
>>
i.e. <module>-<spark-version>_<scala-version>-<mahout-version>.jar
>>
>>
not exactly pretty.. I somewhat prefer Trevor's idea of the DL4J convention.
>>
>> 
>> From: Trevor Grant 
>> Sent: Friday, July 7, 2017 11:57:53 AM
>> To: Mahout Dev List; user@mahout.apache.org
>> Subject: [DISCUSS] Naming convention for multiple spark/scala combos
>>
>> Hey all,
>>
>> Working on releasing 0.13.1 with multiple spark/scala combos.
>>
>> Afaik, there is no 'standard' for multiple spark versions (but I may be
>> wrong, I don't claim expertise here).
>>
>> One approach is simply only release binaries for:
>> Spark-1.6 + Scala 2.10
>> Spark-2.1 + Scala 2.11
>>
>> OR
>>
>> We could do like dl4j
>>
>> org.apache.mahout:mahout-spark_2.10:0.13.1_spark_1
>> org.apache.mahout:mahout-spark_2.11:0.13.1_spark_1
>>
>> org.apache.mahout:mahout-spark_2.10:0.13.1_spark_2
>> org.apache.mahout:mahout-spark_2.11:0.13.1_spark_2
>>
>> OR
>>
>> some other option I don't know of.
>>
>>
>
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
>



[ANNOUNCE] JIRA email notifications moved to iss...@mahout.apache.org

2017-07-13 Thread Andrew Palumbo
Please note:  with the exception of newly created issues for which email 
notifications will continue to be sent to d...@mahout.apache.org, all JIRA 
email notifications (comments, closing, etc.) have been moved to:


iss...@mahout.apache.org.


All committers and developers, please ensure that you are subscribed.


Thank you,


Andy


Re: New Committer: Holden Karau

2017-07-18 Thread Andrew Palumbo
Welcome again, Holden! Great to have you on board!

--andy


From: Trevor Grant 
Sent: Tuesday, July 18, 2017 12:32:07 AM
To: user@mahout.apache.org; Mahout Dev List
Subject: New Committer: Holden Karau

The Project Management Committee (PMC) for Apache Mahout
has invited Holden Karau to become a committer and we are pleased
to announce that she has accepted.

Holden brings a great deal of expertise and knowledge around the
Apache Spark project, and is working to improve the integration
between the two projects.

Being a committer enables easier contribution to the
project since there is no need to go via the patch
submission process. This should enable better productivity.

Please join me in giving Holden a very warm welcome.


Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Andrew Palumbo
We do currently have optimizations based on density analysis in use, e.g. in
AtB.


https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e53/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala#L431



+1 to PR. thanks for pointing this out.


--andy


From: Pat Ferrel 
Sent: Monday, August 21, 2017 2:26:58 PM
To: user@mahout.apache.org
Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues 
(SimilarityAnalysis.cooccurrencesIDSs)

That looks like ancient code from the old mapreduce days. If it passes unit
tests, create a PR.

Just a guess here, but there are times when this might not speed things up but
slow them down. However, for very sparse matrices like you might see in CF, this
could work quite well. Some of the GPU optimization will eventually be keyed
off the density of a matrix, or selectable from knowing its characteristics.

I use this code all the time and would be very interested in a version that 
works with CF style very sparse matrices.

Long story short, create a PR so the optimizer guys can think through the 
implications. If I can also test it I have some large real-world data where I 
can test real-world speedup.


On Aug 21, 2017, at 10:53 AM, Pat Ferrel  wrote:

Interesting indeed. What is “massive”? Does the change pass all unit tests?


On Aug 17, 2017, at 1:04 PM, Scruggs, Matt  wrote:

Thanks for the remarks guys!

I profiled the code running locally on my machine and discovered this loop is 
where these setQuick() and getQuick() calls originate (during matrix Kryo 
deserialization), and as you can see the complexity of this 2D loop can be very 
high:

https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e53/math/src/main/java/org/apache/mahout/math/AbstractMatrix.java#L240


Recall that this algorithm uses SparseRowMatrix whose rows are 
SequentialAccessSparseVector, so all this looping seems unnecessary. I created 
a new subclass of SparseRowMatrix that overrides the assign(matrix, function)
method, and instead of looping through all the columns of each row, it calls 
SequentialAccessSparseVector.iterateNonZero() so it only has to touch the cells 
with values. I also had to customize MahoutKryoRegistrator a bit with a new 
default serializer for this new matrix class. This yielded a massive 
performance boost and I verified that the results match exactly for several 
test cases and datasets. I realize this could have side-effects in some cases, 
but I'm not using any other part of Mahout, only 
SimilarityAnalysis.cooccurrencesIDSs().
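Roughly along these lines; a from-memory sketch rather than the actual patch
(the class name is made up, and it is only valid when function.apply(x, 0) == x,
so that skipping the zero cells of the other matrix cannot change the result):

import java.util.Iterator;

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SparseRowMatrix;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.DoubleDoubleFunction;

// Hypothetical subclass: assign() walks only the non-zero cells of `other`
// instead of every (row, column) pair, as AbstractMatrix.assign() does.
public class NonZeroAssignSparseRowMatrix extends SparseRowMatrix {

  public NonZeroAssignSparseRowMatrix(int rows, int columns) {
    super(rows, columns);
  }

  @Override
  public Matrix assign(Matrix other, DoubleDoubleFunction function) {
    for (int row = 0; row < rowSize(); row++) {
      // Only safe when function.apply(x, 0) == x (e.g. a plus-style merge
      // into a freshly created, all-zero matrix during deserialization).
      Iterator<Vector.Element> it = other.viewRow(row).iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        setQuick(row, e.index(), function.apply(getQuick(row, e.index()), e.get()));
      }
    }
    return this;
  }
}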

Any thoughts / comments?


Matt



On 8/16/17, 8:29 PM, "Ted Dunning"  wrote:

> It is common with large numerical codes that things run faster in memory on
> just a few cores if the communication required outweighs the parallel
> speedup.
>
> The issue is that memory bandwidth is slower than the arithmetic speed by a
> very good amount. If you just have to move stuff into the CPU and munch on
> it a bit it is one thing, but if you have to move the data to CPU and back
> to memory to distribute it around possibly multiple times, you may wind up
> with something much slower than you would have had if you were to attack
> the problem directly.
>
>
>
> On Wed, Aug 16, 2017 at 4:47 PM, Pat Ferrel  wrote:
>
>> This uses the Mahout blas optimizing solver, which I just use and do not
>> know well. Mahout virtualizes some things having to do with partitioning
>> and I’ve never quite understood how they work. There is a .par() on one of
>> the matrix classes that has a similar function to partition but in all
>> cases I’ve used .par(auto) and use normal spark repartitioning based on
>> parallelism. Mahout implements a mapBlock function, which (all things being
>> equal) looks at a partition at a time in memory. The reduce is not part of
>> the code I wrote.
>>
>> The reduce is not part of the code I wrote. Maybe someone else can explain
>> what blas is doing.
>>
>> BTW hashmap is O(log n) on average for large n—caveats apply. We use
>> fastutils for many things (I thought this was one case) because they are
>> faster than JVM implementations but feel free to dig in further. We use
>> downsampling to maintain an overall rough O(n) calculation speed where n =
>> # rows (users). As the model gets more dense there are greatly diminishing
>> returns for the density so after the elements per row threshold is reached
>> we don’t use more in the model creation math.
>>
>> Still feel free to dig into what the optimizer is doing.
>>
>>
>> On Aug 15, 2017, at 11:13 AM, Scruggs, Matt 
>> wrote:
>>
>> Thanks Pat, that's good to know!
>>
>> This is the "reduce" step (which gets its own stage in my Spark
>> jobs...this stage takes almost all the runtime) where most of the work is
>> being done, and takes longer the more shuffle partitions there are
>> (relative to # of CPUs):
>>
>> https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b4
>> 58

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Andrew Palumbo
I should mention that the density threshold is currently set quite high, and
we've been discussing a user-defined setting for this.  Something that we have
not worked in yet.



Fwd: CFP: IEEE Computer Magazine -- Special Issue on Mobile and Embedded Deep Learning

2017-08-26 Thread Andrew Palumbo

Fyi @here


Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Nic Lane 
Date: 08/26/2017 3:33 AM (GMT-08:00)
To: Nic Lane 
Subject: CFP: IEEE Computer Magazine -- Special Issue on Mobile and Embedded 
Deep Learning

Hi folks, please consider submitting to the upcoming IEEE Computer Magazine 
special issue on mobile and embedded forms of deep learning: 
https://www.computer.org/computer-magazine/2017/07/10/mobile-and-embedded-deep-learning-call-for-papers/

Below you will find the complete CFP. The submission deadline is Sept 15.
Please mail us with any questions.

Best,

Nic Lane (UCL and Nokia Bell Labs) and Pete Warden (Google Brain)
Special Issue Guest Editors -- IEEE Computer Magazine

===

Call for Papers: IEEE Computer Magazine -- Special Issue on Mobile and Embedded 
Deep Learning

DEADLINE (EXTENDED): 15 September 2017
Publication date: April 2018 (with recommended early release on 
arxiv.org)

https://www.computer.org/computer-magazine/2017/07/10/mobile-and-embedded-deep-learning-call-for-papers/

In recent years, breakthroughs from the field of deep learning have transformed 
how sensor data from cameras, microphones, and even accelerometers, LIDAR, and 
GPS can be analyzed to extract the high-level information needed by the 
increasingly commonplace examples of sensor-driven systems that range from 
smartphone apps and wearable devices to drones, robots, and autonomous cars.

Today, the state of the art in computational models that, for example, 
recognize a face in a crowd, translate one language into another, discriminate 
between a pedestrian and a stop sign, or monitor the physical activities of a 
user, are increasingly based on deep-learning principles and algorithms. 
Unfortunately, deep-learning models typically exert severe demands on local 
device resources, which typically limits their adoption in mobile and embedded 
platforms. As a result, in far too many cases, existing systems process sensor 
data with machine learning methods that were superseded by deep learning years 
ago.

Because the robustness and quality of sensory perception and reasoning is so 
critical to mobile and embedded computing, we must begin the careful work of 
addressing two core technical questions. First, to ensure that the 
sensor-inference problems that are central to this class of computing are 
adequately addressed, how should existing deep-learning techniques be applied 
and new forms of deep learning be developed? Meeting this challenge involves a 
combination of learning applications—some of which are familiar to other 
domains (such as in processing image and audio), as well as those more uniquely 
tied to wearable and mobile systems (such as activity recognition). Second, for 
the compute, memory, and energy overhead of current—and future—deep-learning 
innovations, what will be required to improve efficiency and effectively 
integrate into a variety of resource-constrained platforms? Solutions to such 
efficiency challenges will come from innovations in algorithms, systems 
software, and hardware (such as in ML-accelerators and changes to conventional 
processors).

In this special issue of Computer, the guest editors aim to consider these two 
broad themes, which drive further advances in mobile and embedded deep 
learning. More specific topics of interest include but are not limited to

= Compression of deep model architectures;
= Neural-based approaches for modeling user activities and behavior;
= Quantized and low-precision neural networks (including binary networks);
= Mobile vision supported by convolutional and deep networks;
= Optimizing commodity processors (GPUs, DSPs etc.) for deep models;
= Audio analysis and understanding through recurrent and deep architectures;
= Hardware accelerators for deep neural networks;
= Distributed deep model training approaches;
= Applications of deep neural networks with real-time requirements;
= Deep models of speech and dialog interaction or mobile devices; and
= Partitioned networks for improved cloud- and processor-offloading.

SUBMISSION DETAILS

Only submissions that describe previously unpublished, original, 
state-of-the-art research and that are not currently under review by a 
conference or journal will be considered.

There is a strict 6,000-word limit (figures and tables are equivalent to 300 
words each) for final manuscripts. Authors should be aware that Computer cannot 
accept or process papers that exceed this word limit.

Articles should be understandable by a broad audience of computer science and 
engineering professionals, avoiding a focus on theory, mathematics, jargon, and 
abstract concepts.

All manuscripts are subject to peer review on both technical merit and 
relevance to Computer’s readership. Accepted papers will be professionally 
edited for content and style. For accepted papers, authors will be required to 
provide electronic files for each figure according to the 

[REMINDER] Calendar Q3 Board report due 9/13

2017-09-02 Thread Andrew Palumbo



 A reminder - the Calendar Q3 board Meeting is 20 September.


We must submit a board report at least 1 week prior.


Please gather a record of recent talks, etc. for the report due 13 September.


I will post a Google doc in the upcoming days.


--andy


Re: [REMINDER] Calendar Q3 Board report due 9/13

2017-09-13 Thread Andrew Palumbo
My apologies, I was out of sync on our board report cycle. The Mahout board
report is due next month.  I will file the report for the October meeting.


Thanks all, and sorry for any inconvenience.


--andy




Re: Congrats Palumbo and Holden

2018-05-04 Thread Andrew Palumbo
Thanks guys!


[ANNOUNCE] Andrew Musselman, New Mahout PMC Chair

2018-07-18 Thread Andrew Palumbo
Please join me in congratulating Andrew Musselman as the new Chair of the
Apache Mahout Project Management Committee. I would like to thank Andrew
for stepping up, all of us who have worked with him over the years know his
dedication to the project to be invaluable.  I look forward to Andrew
taking the project into the future.

Thank you,

Andy


Re: Friday hangout

2018-09-03 Thread Andrew Palumbo
Thanks Andrew talk to you then!


Re: Friday hangout

2018-09-03 Thread Andrew Palumbo
FYI, @andrew.. the calendar invite reads 9 am Sept 3 for me (android).


Re: Friday hangout

2018-09-03 Thread Andrew Palumbo
Probably my calendar messed it up.
Thx
--andy

On Sep 3, 2018 10:32 AM, Andrew Musselman  wrote:

Huh, it's supposed to be
"Friday, September 7
9:00 – 10:00am"

On Mon, Sep 3, 2018 at 10:28 AM Andrew Palumbo  wrote:

> FYI, @andrew.. the calendar invite reads 9 am Sept 3 for me (android).
>



Re: Planning moved to 28th

2018-09-14 Thread Andrew Palumbo
Looking forward to it



Re: AbstractJob class not found exception

2019-01-12 Thread Andrew Palumbo
If you're still looking for the deprecated MapReduce version of AbstractJob, it
is no longer in the `core` module.  It can be found in the `mahout-mr` module.

If you still wish to leave the list please send a message to:

 user-unsubsubscr...@mahout.apache.org

Thanks,

Andy


Re: AbstractJob class not found exception

2019-01-13 Thread Andrew Palumbo
my mistake-  i had a typo in the unsubscribe:

   user-unsubscr...@mahout.apache.org

To unsubscribe, send an email there.

Thank you,

Andy

From: Andrew Palumbo 
Sent: Saturday, January 12, 2019 4:34 PM
To: user@mahout.apache.org
Subject: Re: AbstractJob class not found exception

If you're still looking for the deprecated MapReduce version of AbstractJob, it
is no longer in the `core` module.  It can be found in the `mahout-mr` module.

If you still wish to leave the list please send a message to:

 user-unsubsubscr...@mahout.apache.org

Thanks,

Andy


[MEETING NOTES] 10 AM Friday 6 Dec 2019. Google Hangouts

2019-12-06 Thread Andrew Palumbo
Mahout meeting notes  12.6.2019

==



A meeting was held today, Friday 6 Dec 2019, to discuss the current state of
the project, planned releases and a general path forward.

Joe Olson, Andrew Palumbo and Trevor Grant met via Google Hangouts at 10:15 AM.


Early discussion was based around AP and TG’s loose and quickly put together 
agenda and ideas. AP started the unofficial agenda doc <10 mins before the 
meeting start, so the agenda was quick n dirty.


An agreement was made early on by TG and AP to focus on the release, as the
build is currently working and releases are deploying artifacts for Scala 2.11
and Scala 2.12, pegged to Java 1.8 and mvn 3.3.9.  A heavy refactoring effort
was made to fix the build: by revamping some very old poms, reverting back to
the parent Apache pom's `release` goal, and adding some new information to the
release master's `.m2/settings.xml`, we are able to release and deploy
artifacts with Java 1.8, cross-compiled for Scala 2.11 and Scala 2.12.


https://repository.apache.org/#nexus-search;gav~org.apache.mahout~mahout-core_2.12kw,versionexpand

https://repository.apache.org/#nexus-search;gav~org.apache.mahout~mahout-hdfs_2.11kw,versionexpand


A release board was created:


https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=348


And some minor issues were added.


A decision was made to move the Docker files and certain planned AWS
infrastructure-as-code (Terraform) slated for the 14.1 release off the central
repository and onto Docker Hub under a newly created mahout namespace. JO will
be handling the task of creating the hub.docker.com "Mahout" organization and
moving the Docker files to that space.
[MAHOUT-2074: https://issues.apache.org/jira/browse/MAHOUT-2074]



AP will be creating a mahout-contrib repo on his personal page, to be merged in
later, with some terraform code and examples, etc., probably borrowing heavily
from Pulsar and Spark: https://github.com/apache/pulsar/tree/master/deployment.
As well, AP will begin (has begun) leveraging some off-project time on this
mahout-contrib package, or at least is keeping much of it in the
org.apache.mahout namespace.  Some work has already been done with NiFi and
MiNiFi for an SDR project under the org.apache.mahout namespace, which will be
available in the mahout-contrib package or another standalone package. [NOT
DISCUSSED at the meeting; forgot to bring it up.]


AP will fix the `change-scala-version.sh` script [MAHOUT-2080] and will bump
the scala version in master over the weekend
[MAHOUT-2082: https://issues.apache.org/jira/browse/MAHOUT-2082], at which
point we will call a code freeze, and attempt to release by next weekend
(after cutting an RC).



TG was able to get Jenkins running again, building snapshots, fixing
[MAHOUT-2073: https://issues.apache.org/jira/browse/MAHOUT-2073].


JO will look into some other projects' build chains
[MAHOUT-2076: https://issues.apache.org/jira/browse/MAHOUT-2076], and consider
scripts to cut down on RC creation and release deployment time by having a
single script with all release commands, similar to Apache Spark.  (Pulsar was
discussed as a reference, but that project is moving quickly and they've
refactored their build since I (AP) last looked; in fact, deploying Pulsar is
as simple as `mvn clean deploy`.)


https://github.com/apache/pulsar/blob/master/.test-infra/jenkins/job_pulsar_release_nightly_snapshot.groovy.


Spark and Flink should have good examples…. E.g:

Spark:

https://github.com/apache/spark/blob/master/dev/make-distribution.sh

https://github.com/apache/spark/tree/master/dev/create-release

Flink:

https://github.com/apache/flink/tree/master/tools/releasing



TG will work on zeppelin integration for some easy mahout-python-ggplot2 
examples.


There was discussion of using the The US Census Api for a data examples.


A long-running issue was resolved:
[MAHOUT-2023: https://issues.apache.org/jira/browse/MAHOUT-2023] Broken Scopt
Classes.


*ISSUE*:  we have no way of testing this; we need @pat to take a look.  With
the nightly snapshots being built, the current version in master is available
in NEXUS:


[JENKINS] Archiving 
/home/jenkins/jenkins-slave/workspace/mahout-nightly/community/spark-cli-drivers/target/mahout-spark-cli-drivers_2.11-14.1-SNAPSHOT.jar
 to 
org.apache.mahout/mahout-spark-cli-drivers_2.11/14.1-20191206.193308-1/mahout-spark-cli-drivers_2.11-14.1-20191206.193308-1.jar



We spoke quickly today, and these notes were compiled to the best of my
recollection.  If I missed anything, please let me know.





Trevor’s Agenda:

  1.  Release # Addressed

 *   Path to release # addressed

 *   Steps -> jira tickets # addressed

 *   Code freeze date # addressed Monday, 9 Dec 2019.

  2.  Other Misc..#  discussed and addressed



Andy’s agenda..


# RELE

[ANNOUNCE] Apache Mahout 0.14.1 RC-1

2019-12-14 Thread Andrew Palumbo
Below please find the first candidate release for Mahout 14.1

https://repository.apache.org/content/repositories/orgapachemahout-1058

Off the bat I can see some trivial naming issues that we need to work out.  
This is in part due to the nested module structure.. We must come to a 
consensus on the naming there.

Otherwise, please download, test, and verify the artifacts..


  1.  download and run tests on the src.tar.gz distribution
  2.  try running an app with some of the newly refactored code, i.e. something 
that needs both `hdfs` and `core`
  3.  if possible, please try to run on a cluster.  If not, a pseudo-cluster 
would be great.  We need some coverage on HDFS at least.
  4.  try running something with the `/bin/mahout` command; this may need an 
overhaul
  5.  verify artifacts (a minimal pass is sketched below)
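
For item 5, a minimal verification pass over a downloaded artifact (file names 
are illustrative, not the actual artifact names in the staging repo):

 $ gpg --verify mahout-14.1-src.tar.gz.asc mahout-14.1-src.tar.gz
 $ sha256sum mahout-14.1-src.tar.gz   # compare against the published checksum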

We need 3 binding (+1) PMC votes.

The vote will expire in 72 hours: Tuesday, Dec 17 at 7:30.

Thank you very much.  This has been a long effort, and I believe that we can 
get the ball rolling with less mundane work and regular releases.

--andy



Re: [ANNOUNCE] Apache Mahout 0.14.1 RC-1

2019-12-16 Thread Andrew Palumbo
Signatures and hashes were not generated for some artifacts.  Cancelling this 
vote; will send a link for RC2 soon.
--andy




[ANNOUNCE] Mahout 0.14 RC4

2020-02-01 Thread Andrew Palumbo

Hi All,

I've finished RC4 of Apache Mahout 14.1.  Unfortunately, right off the bat I 
can see an issue which will require a new RC: only SHA-1 checksums have been 
included, and Apache requires at least SHA-256.  So I will -1 it myself.  I'd 
drop it and fix that now, but it's late.

@mahout PMCs

Please bear with me, as this is a completely new structure for the project, 
which required almost a complete rewrite of all the poms, and this is also my 
first time acting as release engineer.

@PMC, as I've already noted, this RC is invalid, so I don't think that a vote 
is necessary; but if you all could take a look through it and let me know if 
you see any issues, from trivial to major, it would be a great help in finally 
getting this release out the door.  I'm going to take a page out of Suneel's 
book and ask for certain tasks, so that we can hopefully get RC5 down to just 
running tests.

resolver: [1] 
https://repository.apache.org/content/repositories/orgapachemahout-1061
artifacts: [2] 
https://repository.apache.org/content/repositories/orgapachemahout-1061/org/apache/mahout/

@rawkintrevo, I know that you needed binaries; could you please try to resolve 
the artifacts that you need and give them a spin, and, if you have time, check 
whether the binary distribution .tar.gz contains everything necessary?

@pat could you please see if the 2.11 artifacts 
suit your specific needs (and work)?

@akm, could you please give it a once-over (or twice, or thrice over, if you 
have time) and see if there are any glaring, obvious issues, especially with 
the binary and source release artifacts?  As well, could you please let me 
know whether the new structure (releasing 2.11 and 2.12 in the same repo) 
seems to work?

I'll have to concentrate my limited time, this week, on squashing these last 
couple of bugs.

@all please let me know if the structure seems to work.

If anyone has a quick script to verify the sigs, it would be useful, but 
that's probably best saved for RC5.
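
In the meantime, a rough sketch of such a script (it assumes a .asc signature 
and a bare-hash .sha256 file next to each artifact, which may not match how 
these were actually staged):

 for f in *.tar.gz *.jar; do
   gpg --verify "$f.asc" "$f" || echo "BAD SIGNATURE: $f"
   echo "$(cat "$f.sha256")  $f" | sha256sum -c - || echo "BAD CHECKSUM: $f"
 done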

As well, I've adopted the naming scheme used by SBT:

   ${module_name}_{scala.major.minor}-{mahout version}

Please keep an eye out for any transpositions between ${scala.major.minor} and 
${mahout version}.
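
Concretely, jars under this scheme should look like the following (names are 
illustrative, based on modules mentioned in this thread):

 mahout-core_2.11-14.1.jar
 mahout-core_2.12-14.1.jar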

Both SBT and Maven should have no issue resolving these artifacts (I tested 
only Maven, in an earlier RC).


It is late, and I've done no testing other than running the test suite before 
I cut this RC, nor have I looked closely at any of the artifacts, so please 
take note of anything that seems out of the ordinary.

There are still issues with releasing that I have to fix, i.e. the 
maven-release-plugin is not functioning correctly (I believe; I haven't fully 
tracked down the bug) and these artifacts were ...  It's a new issue since 
RC3, and there have been very few changes since then; it may be as simple as 
some spacing in the pom (per the first few answers I found when searching, the 
issue may be with the formatting of the XML itself).  I'll try to track this 
down when I get back to it.  And as I mentioned, the release:perform goal 
failed when run as:

 $ mvn clean package install -DskipTests -Papache-release release:perform
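
(For comparison, the maven-release-plugin's documented flow is two goals, and 
release:perform normally reads the release.properties file that 
release:prepare writes, so running it without a prior prepare step fails by 
design; that may be related:)

 $ mvn release:prepare   # tags the release and writes release.properties
 $ mvn release:perform   # checks out the tag and runs the deploy goals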

I will try to create Jiras for these issues (release failing and SHA-1).  I'm 
not sure how much work I'll be able to do on this next week or over the 
weekend.  The release instructions have changed slightly, and I will document 
them in the Jiras or on the PR, along with the newly required 
~/.m2/settings.xml.

I will be on and offline for the next few days, so @akm, if you want to 
release before I get back to this, I'll finish documenting the release steps 
in either the Jira or the PR that I sent to you yesterday [3].

If we run these few smoke tests and sanity checks (and as many more as anyone 
has time for), it will go a long way toward raising the probability of RC5 
being the final one.

Of course, as always, any testing and feedback is welcomed and greatly 
appreciated from anyone on the list.

Thanks very much,

Andy


BTW:

It seems that the RC was not assigned a test number for the artifacts [4], 
e.g.:


 <dependency>
   <groupId>org.apache.mahout</groupId>
   <artifactId>mahout-core_2.11</artifactId>
   <version>14.1</version>
 </dependency>


This should not be an issue, as there is no other version 14.1 in Maven 
Central or other repos, nor are there Scala 2.11 or 2.12 versions of anything 
[4].  However, if you've tried testing a 14.1 RC before, please remember to 
use the -U option when building, or clear out your ~/.m2/repository.
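
That is, either of (standard Maven/shell commands):

 $ mvn clean package -U                        # force re-download of snapshots
 $ rm -rf ~/.m2/repository/org/apache/mahout   # or drop the cached artifacts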

[1] https://repository.apache.org/content/repositories/orgapachemahout-1061
[2] 
https://repository.apache.org/content/repositories/orgapachemahout-1061/org/apache/mahout/
[3] 
https://github.com/apache/mahout/pull/384/commits/8794c42f910ed6fc3ac93d4081a88385520cc84b
[4] https://repository.apache.org/#stagingRepositories
[5] 
https://mvnrepository.com/artifact/org.apache.mahout/mahout-spark_2.10/0.13.0






[ANNOUNCE] Mahout 14.1 RC4

2020-02-01 Thread Andrew Palumbo
Fixing some typos and mistyping in the announcement above.






Re: PyMahout (incore) (alpha v0.1)

2021-01-10 Thread Andrew Palumbo
+1

From: Andrew Musselman 
Sent: Thursday, January 7, 2021 2:46 PM
To: user@mahout.apache.org 
Cc: Mahout Dev List 
Subject: Re: PyMahout (incore) (alpha v0.1)

Thanks Trevor, looking forward to trying it out.

On Wed, Jan 6, 2021 at 5:30 PM Peng Zhang  wrote:

> Well done Trevor.
>
> -peng
>
> On Thu, Jan 7, 2021 at 04:45 Trevor Grant 
> wrote:
>
> > Hey all,
> >
> > I made a branch for a thing I'm toying with. PyMahout.
> >
> > See https://github.com/rawkintrevo/pymahout/tree/trunk
> >
> > Right now, it's sort of dumb; it just makes a couple of random incore
> > matrices... but it _does_ make them.
> >
> > Next I want to show I can do something with DRMs.
> >
> > Once I know it's all possible, I'll make a batch of JIRA tickets and we can
> > start implementing a Pythonic package so that, in theory, in a PySpark
> > workbook you could
> >
> > ```jupyter
> > !pip install pymahout
> > 
> >
> > import pymahout
> >
> > # do pymahout things here... in Python.
> >
> > ```
> >
> > So if you're interested in helping/playing, reach out on here or directly;
> > if there is a bunch of interest, I can commit all of this to a branch as we
> > play with it.
> >
> > Thanks!
> > tg
> >
>