Problems with mapBlock()

2014-05-31 Thread Sebastian Schelter
I've updated the codebase to work on the cooccurrence analysis algo, but 
I always run into this error now:


error: value mapBlock is not a member of 
org.apache.mahout.math.drm.DrmLike[Int]


I have the feeling that an implicit conversion might be missing, but I 
couldn't figure out where to put it without producing even more errors.


--sebastian


[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.

2014-05-31 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014918#comment-14014918
 ] 

Sebastian Schelter commented on MAHOUT-1566:


If it's a mere showcase, could we maybe add it as an example in an example 
package rather than as a full-fledged algorithm implementation?

> Regular ALS factorizer with convergence test.
> -
>
> Key: MAHOUT-1566
> URL: https://issues.apache.org/jira/browse/MAHOUT-1566
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
>Priority: Trivial
> Fix For: 1.0
>
>
> ALS-related: let's start with an unweighted, unregularized implementation.
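
As a rough illustration of what an unweighted, unregularized ALS loop with a convergence test can look like, here is a minimal pure-Python sketch. It is not the Mahout implementation and does not use the Mahout DSL. The rank-1 restriction, the function names, and the stopping rule (stop once the RMSE improves by less than `tolerance`) are all assumptions made for this example; the issue itself does not specify them.

```python
import math

def rmse(a, u, v):
    """Root mean squared error of the rank-1 approximation u * v^T."""
    errs = [(a[i][j] - u[i] * v[j]) ** 2
            for i in range(len(u)) for j in range(len(v))]
    return math.sqrt(sum(errs) / len(errs))

def als_rank1(a, max_iterations=50, tolerance=1e-6):
    """Alternating least squares for a rank-1 factorization (hypothetical helper)."""
    rows, cols = len(a), len(a[0])
    u = [1.0] * rows
    v = [1.0] * cols
    previous = rmse(a, u, v)
    for iteration in range(1, max_iterations + 1):
        # Fix v and solve the least-squares problem for every u[i].
        vv = sum(x * x for x in v)
        u = [sum(a[i][j] * v[j] for j in range(cols)) / vv for i in range(rows)]
        # Fix u and solve the least-squares problem for every v[j].
        uu = sum(x * x for x in u)
        v = [sum(a[i][j] * u[i] for i in range(rows)) / uu for j in range(cols)]
        current = rmse(a, u, v)
        # Convergence test (assumed): stop when the error barely improves.
        if previous - current < tolerance:
            return u, v, current, iteration
        previous = current
    return u, v, previous, max_iterations

a = [[2.0, 4.0], [1.0, 2.0], [3.0, 6.0]]  # toy matrix that is exactly rank 1
u, v, error, iterations = als_rank1(a)
```

Each half-step has a closed-form solution because the other factor is held fixed, which is what makes the alternating scheme attractive; a regularized or weighted variant would change only those two update lines.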



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: mlib versus spark

2014-05-31 Thread Sebastian Schelter

Hi Saikat,

The differences are that MLlib offers a different set of algorithms 
(e.g. you won't find cooccurrence analysis or stochastic SVD there) and that 
its codebase consists of hand-tuned, Spark-specific implementations.


Mahout, on the other hand, allows algorithms to be implemented in an 
engine-agnostic, declarative way. This allows for the automatic 
optimization of our algorithms as well as for running the same code on 
multiple backends (there has been interest from H2O as well as Apache 
Flink in integrating with our DSL).


--sebastian


RE: mlib versus spark

2014-05-31 Thread Saikat Kanjilal
Actually the subject of my email should say spark->mlib versus mahout->spark :)


mlib versus spark

2014-05-31 Thread Saikat Kanjilal
OK, I'll admit I'm not seeing what the obvious differences are. I'm a bit 
confused when I think of Mahout using Spark: since Spark already includes an 
embedded machine learning library (MLlib), what would be the impetus to use 
Mahout instead? It seems like you should be able to write or add algorithms to 
MLlib and use Spark. Has someone from Mahout looked at MLlib to see whether 
there is a strong use case for using one versus the other?
http://spark.apache.org/mllib/

[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-31 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014856#comment-14014856
 ] 

Suneel Marthi commented on MAHOUT-1505:
---

Let's leave Canopy for now, then; the tests will be removed automatically once 
the Canopy driver is removed.

> structure of clusterdump's JSON output
> --
>
> Key: MAHOUT-1505
> URL: https://issues.apache.org/jira/browse/MAHOUT-1505
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Terry Blankers
>Assignee: Andrew Musselman
>  Labels: json
> Fix For: 1.0
>
> Attachments: MAHOUT-1505-canopy-removal.patch, MAHOUT-1505.patch
>
>
> Hi all, I'm working on some automated analysis of the clusterdump output 
> using '-of = JSON'. While digging into the structure of the representation of 
> the data I've noticed something that seems a little odd to me.
> In order to access the data for a particular cluster, the 'cluster', 'n', 'c' 
> & 'r' values are all in one continuous string. For example:
> {noformat}
> {"cluster":"VL-10515{n=5924 c=[action:0.023, adherence:0.223, 
> administration:0.011 r=[action:0.446, adherence:1.501, 
> administration:0.306]}"}
> {noformat}
> This is also the case for the "point":
> {noformat}
> {"point":"013FFD34580BA31AECE5D75DE65478B3D691D138 = [body:6.904, 
> harm:10.101]","vector_name":"013FFD34580BA31AECE5D75DE65478B3D691D138","weight":"1.0"}
> {noformat}
> This leads me to believe that the only way I can get to the individual data 
> in these items is by string parsing. For JSON deserialization I would have 
> expected to see something along the lines of:
> {noformat}
> {
>   "cluster": "VL-10515",
>   "n": 5924,
>   "c": [
>     {"action": 0.023},
>     {"adherence": 0.223},
>     {"administration": 0.011}
>   ],
>   "r": [
>     {"action": 0.446},
>     {"adherence": 1.501},
>     {"administration": 0.306}
>   ]
> }
> {noformat}
> and:
> {noformat}
> {
>   "point": {
>     "body": 6.904,
>     "harm": 10.101
>   },
>   "vector_name": "013FFD34580BA31AECE5D75DE65478B3D691D138",
>   "weight": 1.0
> }
> {noformat}
> Andrew Musselman replied:
> {quote}
> Looks like a bug to me as well; I would have expected something similar to
> what you were expecting except maybe something like this which puts the "c"
> and "r" values in objects rather than arrays of single-element objects:
> {noformat}
> {
>   "cluster": "VL-10515",
>   "n": 5924,
>   "c": {
>     "action": 0.023,
>     "adherence": 0.223,
>     "administration": 0.011
>   },
>   "r": {
>     "action": 0.446,
>     "adherence": 1.501,
>     "administration": 0.306
>   }
> }
> {noformat}
> {quote}
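
To make the difference concrete for a consumer of the output, here is a minimal Python sketch. It assumes the object-valued layout proposed in the quoted reply above. The `parse_packed_cluster` helper and its regular expressions are hypothetical, written only for this example (they are not part of Mahout), and the packed sample is a shortened version of the one in the report.

```python
import json
import re

# Structured layout proposed above: "c" and "r" are plain JSON objects.
proposed = '''
{
  "cluster": "VL-10515",
  "n": 5924,
  "c": {"action": 0.023, "adherence": 0.223, "administration": 0.011},
  "r": {"action": 0.446, "adherence": 1.501, "administration": 0.306}
}
'''

record = json.loads(proposed)
center = record["c"]                      # dict of term -> weight, no string parsing
top_term = max(center, key=center.get)    # term with the largest center weight

# Current layout: everything is packed into one string value, so a consumer
# has to fall back to pattern matching after the JSON decode.
current = '{"cluster":"VL-10515{n=5924 c=[action:0.023, adherence:0.223]}"}'

def parse_packed_cluster(line):
    """Hypothetical fallback parser for the packed string (center terms only)."""
    packed = json.loads(line)["cluster"]
    cluster_id, rest = packed.split("{", 1)
    n = int(re.search(r"n=(\d+)", rest).group(1))
    weights = {term: float(w) for term, w in re.findall(r"(\w+):([\d.]+)", rest)}
    return cluster_id, n, weights

cluster_id, n, weights = parse_packed_cluster(current)
```

The fallback parser is brittle by construction: it breaks as soon as a term contains a colon or a bracket, which is the practical argument for emitting real JSON objects.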





[jira] [Commented] (MAHOUT-1567) Add online sparse dictionary learning (dimensionality reduction)

2014-05-31 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014851#comment-14014851
 ] 

Ted Dunning commented on MAHOUT-1567:
-


Mairal's method is one of the absolute standards for this algorithm and sparse 
coding is an important algorithm.

Maciej,

Can you attempt to port to the new math DSL that Mahout is building?  That will 
be very instructive for the development of the DSL as well.



> Add online sparse dictionary learning (dimensionality reduction)
> 
>
> Key: MAHOUT-1567
> URL: https://issues.apache.org/jira/browse/MAHOUT-1567
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Reporter: Maciej Kula
>
> I have recently implemented a sparse online dictionary learning algorithm, 
> with an emphasis on learning very high-dimensional and very sparse 
> dictionaries. It is based on J. Mairal et al 'Online Dictionary Learning for 
> Sparse Coding' (http://www.di.ens.fr/willow/pdfs/icml09.pdf). It's an online 
> variant of low-rank matrix factorization, suitable for sparse binary matrices 
> (such as implicit feedback matrices).
> I would be very happy to bring this up to the Mahout standard and contribute 
> to the main codebase --- is this something you would in principle be 
> interested in having?
> The code (as well as some examples) is here: 
> https://github.com/maciejkula/dictionarylearning





[jira] [Updated] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-31 Thread Andrew Musselman (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Musselman updated MAHOUT-1505:
-

Attachment: MAHOUT-1505-canopy-removal.patch

Fixes suggested by Suneel above.  Note Canopy and CanopyDriver are still used 
in TestClusterEvaluation and other tests; how far do we want to go on this 
before removing Canopy entirely?



[jira] [Comment Edited] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-31 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014777#comment-14014777
 ] 

Suneel Marthi edited comment on MAHOUT-1505 at 5/31/14 9:45 PM:


While reviewing the patch I found a few places where the code needs some cleanup:

1.  Since Canopy Clustering has now been marked as @deprecated, please remove 
all references and unit tests for Canopy Clustering in TestClusterInterface, 
TestClusterDumper and TestClusterClassifier.

2.  AbstractCluster.java

Method formatVectorAsJson() - What are we doing with vectorTerms? I don't 
see it being used anywhere. If vectorTerms is not needed, we may also not 
need the static class TermIndexWeight.

3.  Cluster.java - remove the following variable, as it's never used anywhere in 
the code:
  
 String CLUSTERED_POINTS_DIR = "clusteredPoints";



was (Author: smarthi):
While reviewing the patch I found a few places where the code needs some cleanup:

1.  Since Canopy Clustering has now been marked as @deprecated, please remove 
all references and unit tests for Canopy Clustering in TestClusterInterface, 
TestClusterDumper and TestClusterClassifier.

2.  AbstractCluster.java

Method formatVectorAsJson() - What are we doing with vectorTerms? If you 
don't need vectorTerms, then do we really need the static class TermIndexWeight?



[jira] [Commented] (MAHOUT-1564) Naive Bayes Classifier for New Text Documents

2014-05-31 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014827#comment-14014827
 ] 

Andrew Palumbo commented on MAHOUT-1564:


Thanks for looking at this, [~ssc]. I was worried about potential conflicts 
with 1252, but it's really just a minor addition to the existing implementation; 
it doesn't require any changes and can always come out. I haven't had a chance 
to do much with it yet and will probably work on it later this week.

> Naive Bayes Classifier for New Text Documents
> -
>
> Key: MAHOUT-1564
> URL: https://issues.apache.org/jira/browse/MAHOUT-1564
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> MapReduce Naive Bayes implementation currently lacks the ability to classify 
> a new document (outside of the training/holdout corpus).  I've begun some 
> work on a "ClassifyNew" job which will do the following:
> 1. Vectorize a new text document using the dictionary and document 
> frequencies from the training/holdout corpus 
> - assume the original corpus was vectorized using `seq2sparse`; step (1) 
> will use all of the same parameters. 
> 2. Score and label a new document using a previously trained model.
> I think that it will be a useful addition to the NB package.  Unfortunately, 
> this is going to be mostly MR workhorse code and doesn't really introduce 
> much new logic. I will try to keep any new logic separate from MR code so 
> that it can be called from scala for MAHOUT-1493.
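
The two steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the planned MapReduce job: the plain `log(N / df)` weighting, the per-label log-weight model, the toy data, and the names `vectorize` and `classify` are all made up for this example, and Mahout's actual dictionary, document-frequency and model file formats are not modeled.

```python
import math
from collections import Counter

def vectorize(tokens, dictionary, doc_freq, num_docs):
    """Step 1: TF-IDF vector for a new document; unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in dictionary)
    return {
        dictionary[term]: tf * math.log(num_docs / doc_freq[term])
        for term, tf in counts.items()
    }

def classify(vector, log_weights):
    """Step 2: score the vector against each label and pick the best one."""
    scores = {
        label: sum(value * weights.get(index, 0.0)
                   for index, value in vector.items())
        for label, weights in log_weights.items()
    }
    return max(scores, key=scores.get), scores

# Toy stand-ins for the artifacts produced from the training corpus.
dictionary = {"goal": 0, "match": 1, "vote": 2}       # term -> vector index
doc_freq = {"goal": 2, "match": 4, "vote": 2}         # term -> document count
num_docs = 8
log_weights = {                                       # label -> index -> log weight
    "sports": {0: -0.5, 1: -0.7, 2: -3.0},
    "politics": {0: -3.0, 1: -2.0, 2: -0.4},
}

tokens = "late goal wins the match".split()
vector = vectorize(tokens, dictionary, doc_freq, num_docs)
label, scores = classify(vector, log_weights)
```

The important constraint is the one stated in the description: the new document has to be vectorized with the training dictionary and document frequencies, so that its indices and weights are comparable with what the model saw during training.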





[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-31 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014826#comment-14014826
 ] 

Andrew Musselman commented on MAHOUT-1505:
--

Good catches, thanks, will fix.



[jira] [Updated] (MAHOUT-1567) Add online sparse dictionary learning (dimensionality reduction)

2014-05-31 Thread Maciej Kula (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Kula updated MAHOUT-1567:


Summary: Add online sparse dictionary learning (dimensionality reduction)  
(was: Add online sparse dictionary learning)



[jira] [Commented] (MAHOUT-1567) Add online sparse dictionary learning

2014-05-31 Thread Maciej Kula (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014804#comment-14014804
 ] 

Maciej Kula commented on MAHOUT-1567:
-

Correct. It is effectively an online dimensionality reduction technique.

We start with a set of n vectors (datapoints, rows in a matrix), and then look 
for a smaller number of vectors m such that the original vectors can be well 
approximated by linear combinations of the m new vectors. In the parlance of 
dictionary learning, the m vectors are called dictionary atoms, and the linear 
combinations are called (sparse) codes.

The idea of learning a dictionary is that, instead of picking a predefined set 
of vectors, we fit the dictionary vectors to the data using an SGD-like process.
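
A minimal pure-Python sketch of the idea follows: a datapoint is approximated by a linear combination of atoms, and the atoms are nudged toward a better fit with a gradient step. It is a simplification for illustration only; it holds the sparse code fixed and uses a plain gradient update, so it is not the update rule from the Mairal et al. paper, and all names and numbers are made up.

```python
def reconstruct(atoms, code):
    """Linear combination of dictionary atoms weighted by a sparse code."""
    dim = len(atoms[0])
    out = [0.0] * dim
    for j, weight in code.items():        # code maps atom index -> coefficient
        for k in range(dim):
            out[k] += weight * atoms[j][k]
    return out

def squared_error(x, approx):
    return sum((a - b) ** 2 for a, b in zip(x, approx))

def sgd_step(atoms, x, code, learning_rate=0.1):
    """One gradient step on the atoms for a single datapoint (code held fixed)."""
    residual = [a - b for a, b in zip(x, reconstruct(atoms, code))]
    for j, weight in code.items():
        for k in range(len(x)):
            atoms[j][k] += learning_rate * weight * residual[k]

# m = 2 dictionary atoms in a 3-dimensional space
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
x = [2.0, 1.0, 0.5]                       # one datapoint
code = {0: 2.0, 1: 1.0}                   # its (here dense) code over the atoms

before = squared_error(x, reconstruct(atoms, code))
sgd_step(atoms, x, code)
after = squared_error(x, reconstruct(atoms, code))
```

In a full algorithm the code for each datapoint would itself be recomputed with a sparsity-inducing penalty before every dictionary update; that step is left out here to keep the sketch short.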

Shall I amend the issue title to reflect this a little better?



[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014788#comment-14014788
 ] 

Hudson commented on MAHOUT-1505:


SUCCESS: Integrated in Mahout-Quality #2625 (See 
[https://builds.apache.org/job/Mahout-Quality/2625/])
MAHOUT-1505: structure of clusterdump's JSON output (akm) (andrew.musselman: 
rev e0751eaaca4bb06adcf80108d85ad0dc6cc67074)
* mrlegacy/src/main/java/org/apache/mahout/clustering/AbstractCluster.java
* mrlegacy/src/main/java/org/apache/mahout/clustering/Cluster.java
* integration/src/main/java/org/apache/mahout/utils/clustering/JsonClusterWriter.java
* mrlegacy/src/test/java/org/apache/mahout/clustering/iterator/TestClusterClassifier.java
* mrlegacy/src/test/java/org/apache/mahout/clustering/TestClusterInterface.java
* CHANGELOG
* integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
MAHOUT-1505: structure of clusterdump's JSON output (akm) This closes #5 (akm: 
rev 1d1134ee941c6a06804fdb5fef536e3ed95ed36e)
* CHANGELOG




[jira] [Commented] (MAHOUT-1567) Add online sparse dictionary learning

2014-05-31 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014782#comment-14014782
 ] 

Ted Dunning commented on MAHOUT-1567:
-

Maciej,

Can you say a bit more about this?  I think that people might be confused here 
by the use of the word dictionary.

I think that you mean dictionary in the sense of sparse coding, which is a 
common first step for deep learning, as opposed to building a lookup table for 
converting strings to integers.

Is this correct?





[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-31 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014777#comment-14014777
 ] 

Suneel Marthi commented on MAHOUT-1505:
---

While reviewing the patch I found a few places where the code needs some cleanup:

1.  Since Canopy Clustering has now been marked as @deprecated, please remove 
all references and unit tests for Canopy Clustering in TestClusterInterface, 
TestClusterDumper and TestClusterClassifier.

2.  AbstractCluster.java

Method formatVectorAsJson() - What are we doing with vectorTerms? If you 
don't need vectorTerms, then do we really need the static class TermIndexWeight?



Re: Apache group on github

2014-05-31 Thread Andrew Musselman
Awesome, thanks.


On Sat, May 31, 2014 at 11:52 AM, Ahmet Arslan  wrote:

> Hi Andrew,
>
> Taken from Noah's e-mail :
>
> If you want to be added, edit this file:
>
> https://svn.apache.org/repos/private/committers/docs/github_team.txt
>
> Ahmet
>
>
> On Saturday, May 31, 2014 8:30 PM, Andrew Musselman <
> andrew.mussel...@gmail.com> wrote:
> Does anyone know who to ask to be added to that group?
>
>


[jira] [Comment Edited] (MAHOUT-1566) Regular ALS factorizer with convergence test.

2014-05-31 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014752#comment-14014752
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1566 at 5/31/14 6:56 PM:
---

On the implicit feedback issue: I cannot just throw it in at the moment because, 
as I said, it is implemented in bare Spark + Scala bindings, and porting it to 
drm algebra without going native to Spark seems to be a bit of a puzzle at the 
moment. I am also contemplating what additional primitives we may formalize in 
drm linalg before I resort to falling back to a Spark-coupled method.


was (Author: dlyubimov):
on implicit feedback issue, i can not just throw it in at the moment because 
like i said, it is implemented in bare Spark + scala bindings but porting it to 
drm without going native to Spark seems to be a bit of a puzzle at the moment. 
I am also contemplating what additional primitives we may formalize in drm 
linalg before I resort to falling back to a Spark-coupled method.

> Regular ALS factorizer with convergence test.
> -
>
> Key: MAHOUT-1566
> URL: https://issues.apache.org/jira/browse/MAHOUT-1566
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
>Priority: Trivial
> Fix For: 1.0
>
>
> ALS-related: let's start with unweighed, unregularized implementation.





[jira] [Comment Edited] (MAHOUT-1566) Regular ALS factorizer with convergence test.

2014-05-31 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014752#comment-14014752
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1566 at 5/31/14 6:56 PM:
---

On the implicit feedback issue, I cannot just throw it in at the moment because, 
like I said, it is implemented in bare Spark + Scala bindings, but porting it to 
drm without going native to Spark seems to be a bit of a puzzle at the moment. 
I am also contemplating what additional primitives we may formalize in drm 
linalg before I resort to falling back to a Spark-coupled method.


was (Author: dlyubimov):
on implicit feedback issue, i can just jump it up because like i said, it is 
implemented in bare Spark + scala bindings but porting it to drm without going 
native to Spark seems to be a bit of a puzzle at the moment. I am also 
contemplating what additional primitives we may formalize in drm linalg before 
I resort to falling back to a Spark-coupled method.

> Regular ALS factorizer with convergence test.
> -
>
> Key: MAHOUT-1566
> URL: https://issues.apache.org/jira/browse/MAHOUT-1566
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
>Priority: Trivial
> Fix For: 1.0
>
>
> ALS-related: let's start with unweighed, unregularized implementation.





[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.

2014-05-31 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014752#comment-14014752
 ] 

Dmitriy Lyubimov commented on MAHOUT-1566:
--

on implicit feedback issue, i can just jump it up because like i said, it is 
implemented in bare Spark + scala bindings but porting it to drm without going 
native to Spark seems to be a bit of a puzzle at the moment. I am also 
contemplating what additional primitives we may formalize in drm linalg before 
I resort to falling back to a Spark-coupled method.

> Regular ALS factorizer with convergence test.
> -
>
> Key: MAHOUT-1566
> URL: https://issues.apache.org/jira/browse/MAHOUT-1566
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
>Priority: Trivial
> Fix For: 1.0
>
>
> ALS-related: let's start with unweighed, unregularized implementation.





[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.

2014-05-31 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014747#comment-14014747
 ] 

Dmitriy Lyubimov commented on MAHOUT-1566:
--

The primary merit here is a new "showcase" of drm algebra, and ironing out as 
many bugs and issues as possible in preparation for the implicit feedback issue. 
I have had the implicit feedback issue implemented in Spark and Scala bindings 
for more than a year (long before it was added to MLlib), but it would be harder 
to address all issues at once with Mahout drm algebra in a bigger method than in 
a smaller one. 

For example, in addition to the already listed improvements and fixes in the 
Spark bindings, the RMSE computation makes it pretty clear that the elementwise 
logical op is missing a physical op for identically distributed matrices (which 
would produce, again, a matrix distributed identically to the operands). It is 
also clear that there are probably some bona fide cases when the ABt operator 
can produce a result distributed identically to the A operand if the result does 
not have a significant size skew (I am still contemplating the criteria for such 
an identical distribution estimate here).

On the more supportive side of the argument, it is obviously a false dilemma. 
Since you mentioned there's some nonzero merit, and since both methods are of 
the same family and can coexist, the merit of the two is more than the merit of 
either one, not less. Since the drm algebra for the basic one is extremely 
simple, it has practically zero maintenance cost. 
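
For readers unfamiliar with the "basic" method under discussion, here is a minimal sketch of unweighted, unregularized ALS with an RMSE-based convergence test. It is plain Python restricted to rank 1 for brevity, not the drm algebra implementation from the ticket; the names `als_rank1` and `rmse`, the starting vector, and the tolerance are all assumptions made for illustration.

```python
import math

def rmse(a, u, v):
    """Root-mean-square error between matrix a and the rank-1 product u v^T."""
    m, n = len(a), len(a[0])
    sq = sum((a[i][j] - u[i] * v[j]) ** 2 for i in range(m) for j in range(n))
    return math.sqrt(sq / (m * n))

def als_rank1(a, max_iter=50, tol=1e-9):
    """Rank-1 ALS: alternate two closed-form least-squares solves.

    Assumes a is a non-empty dense list of rows and that neither
    factor collapses to the zero vector during the iteration.
    """
    m, n = len(a), len(a[0])
    v = [1.0] * n                      # arbitrary nonzero starting point
    u = [0.0] * m
    prev = float("inf")
    err = prev
    for _ in range(max_iter):
        # Fix v, solve for u: u_i = (row_i . v) / (v . v)
        vv = sum(x * x for x in v)
        u = [sum(a[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        # Fix u, solve for v: v_j = (col_j . u) / (u . u)
        uu = sum(x * x for x in u)
        v = [sum(a[i][j] * u[i] for i in range(m)) / uu for j in range(n)]
        err = rmse(a, u, v)
        if prev - err < tol:           # convergence test: RMSE stopped improving
            break
        prev = err
    return u, v, err

# A matrix that is exactly rank 1, so the residual should vanish.
matrix = [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
u, v, err = als_rank1(matrix)
```

The RMSE step is the part the comment above refers to: it is an elementwise operation over two matrices with the same row partitioning.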


> Regular ALS factorizer with convergence test.
> -
>
> Key: MAHOUT-1566
> URL: https://issues.apache.org/jira/browse/MAHOUT-1566
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
>Priority: Trivial
> Fix For: 1.0
>
>
> ALS-related: let's start with unweighed, unregularized implementation.





Re: Apache group on github

2014-05-31 Thread Ahmet Arslan
Hi Andrew,

Taken from Noah's e-mail : 

If you want to be added, edit this file:

https://svn.apache.org/repos/private/committers/docs/github_team.txt

Ahmet


On Saturday, May 31, 2014 8:30 PM, Andrew Musselman 
 wrote:
Does anyone know who to ask to be added to that group?



[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014737#comment-14014737
 ] 

Hudson commented on MAHOUT-1505:


SUCCESS: Integrated in Mahout-Quality #2624 (See 
[https://builds.apache.org/job/Mahout-Quality/2624/])
MAHOUT-1505: structure of clusterdump's JSON output (akm) (andrew.musselman: 
rev e0751eaaca4bb06adcf80108d85ad0dc6cc67074)
* mrlegacy/src/main/java/org/apache/mahout/clustering/Cluster.java
* mrlegacy/src/main/java/org/apache/mahout/clustering/AbstractCluster.java
* CHANGELOG
* 
mrlegacy/src/test/java/org/apache/mahout/clustering/iterator/TestClusterClassifier.java
* mrlegacy/src/test/java/org/apache/mahout/clustering/TestClusterInterface.java
* integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
* 
integration/src/main/java/org/apache/mahout/utils/clustering/JsonClusterWriter.java
MAHOUT-1505: structure of clusterdump's JSON output (akm) This closes #5 (akm: 
rev 1d1134ee941c6a06804fdb5fef536e3ed95ed36e)
* CHANGELOG


> structure of clusterdump's JSON output
> --
>
> Key: MAHOUT-1505
> URL: https://issues.apache.org/jira/browse/MAHOUT-1505
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Terry Blankers
>Assignee: Andrew Musselman
>  Labels: json
> Fix For: 1.0
>
> Attachments: MAHOUT-1505.patch
>
>
> Hi all, I'm working on some automated analysis of the clusterdump output 
> using '-of = JSON'. While digging into the structure of the representation of 
> the data I've noticed something that seems a little odd to me.
> In order to access the data for a particular cluster, the 'cluster', 'n', 'c' 
> & 'r' values are all in one continuous string. For example:
> {noformat}
> {"cluster":"VL-10515{n=5924 c=[action:0.023, adherence:0.223, 
> administration:0.011 r=[action:0.446, adherence:1.501, 
> administration:0.306]}"}
> {noformat}
> This is also the case for the "point":
> {noformat}
> {"point":"013FFD34580BA31AECE5D75DE65478B3D691D138 = [body:6.904, 
> harm:10.101]","vector_name":"013FFD34580BA31AECE5D75DE65478B3D691D138","weight":"1.0"}
> {noformat}
> This leads me to believe that the only way I can get to the individual data 
> in these items is by string parsing. For JSON deserialization I would have 
> expected to see something along the lines of:
> {noformat}
> {
> "cluster":"VL-10515",
> "n":5924,
> "c":
> [
> {"action":0.023},
> {"adherence":0.223},
> {"administration":0.011}
> ],
> "r":
> [
> {"action":0.446},
> {"adherence":1.501},
> {"administration":0.306}
> ]
> }
> {noformat}
> and:
> {noformat}
> {
> "point": {
> "body": 6.904,
> "harm": 10.101
> },
> "vector_name": "013FFD34580BA31AECE5D75DE65478B3D691D138",
> "weight": 1.0
> } 
> {noformat}
> Andrew Musselman replied:
> {quote}
> Looks like a bug to me as well; I would have expected something similar to
> what you were expecting except maybe something like this which puts the "c"
> and "r" values in objects rather than arrays of single-element objects:
> {noformat}
> {
> "cluster":"VL-10515",
> "n":5924,
> "c":
> {
> "action":0.023,
> "adherence":0.223,
> "administration":0.011
> },
> "r":
> {
>"action":0.446,
>"adherence":1.501,
>"administration":0.306
> }
> }
> {noformat}
> {quote}
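
To illustrate why the object-valued layout suggested in the quoted reply is easier to consume, here is a small sketch that reads it with Python's standard `json` module. The sample document is copied from the example above; the variable names are only illustrative, and the exact output of the patched clusterdump may differ.

```python
import json

# Structured form proposed in the ticket: "c" and "r" are plain objects.
structured = """{
  "cluster": "VL-10515",
  "n": 5924,
  "c": {"action": 0.023, "adherence": 0.223, "administration": 0.011},
  "r": {"action": 0.446, "adherence": 1.501, "administration": 0.306}
}"""

cluster = json.loads(structured)
center = cluster["c"]                      # term -> centroid weight
radius = cluster["r"]                      # term -> radius
top_term = max(center, key=center.get)     # heaviest term in the centroid
```

With the original single-string form, the same lookup would need ad hoc string parsing instead of dictionary access.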





[jira] [Updated] (MAHOUT-1470) Topic dump

2014-05-31 Thread Andrew Musselman (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Musselman updated MAHOUT-1470:
-

Affects Version/s: (was: 1.0)
   0.9

> Topic dump
> --
>
> Key: MAHOUT-1470
> URL: https://issues.apache.org/jira/browse/MAHOUT-1470
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Per 
> http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCAMc_qaL2DCgbVbam2miNsLpa4qvaA9sMy1-arccF9Nz6ApcsvQ%40mail.gmail.com%3E
> > The script needs to be corrected to not call vectordump for LDA as
> > vectordump utility (or even clusterdump) are presently not capable of
> > displaying topics and relevant documents. I recall this issue was
> > previously reported by Peyman Faratin post 0.9 release.
> >
> > Mahout's missing a clusterdump utility that reads in LDA
> > topics, Document - DocumentId mapping and displays a report of the topics
> > and the documents that belong to a topic.
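
As a rough sketch of the report described above (top terms per topic plus the documents assigned to each topic), assuming the topic-term and document-topic distributions have already been loaded into plain lists. The function name `topic_report` and its arguments are hypothetical; this is not an existing Mahout utility.

```python
def topic_report(topic_term, doc_topic, terms, doc_names, top_n=2):
    """Build {topic: {"terms": [...], "docs": [...]}}.

    topic_term[t][i] is the weight of term i in topic t;
    doc_topic[d][t] is the weight of topic t in document d;
    doc_names maps a document id to its name.
    """
    report = {}
    for t, weights in enumerate(topic_term):
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        report[t] = {"terms": [terms[i] for i in ranked[:top_n]], "docs": []}
    for d, dist in enumerate(doc_topic):
        # Assign each document to its single most probable topic.
        best = max(range(len(dist)), key=lambda t: dist[t])
        report[best]["docs"].append(doc_names[d])
    return report

report = topic_report(
    topic_term=[[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]],
    doc_topic=[[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]],
    terms=["cat", "dog", "car"],
    doc_names={0: "a", 1: "b", 2: "c"},
)
```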





Apache group on github

2014-05-31 Thread Andrew Musselman
Does anyone know who to ask to be added to that group?


[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014720#comment-14014720
 ] 

ASF GitHub Bot commented on MAHOUT-1505:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/5




[jira] [Resolved] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-31 Thread Andrew Musselman (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Musselman resolved MAHOUT-1505.
--

Resolution: Fixed

Merged and pushed to master.



Re: Sketching out scala traits and 1.0 API

2014-05-31 Thread Pat Ferrel
Many if not most Mahout committers and contributors will be new to Scala and 
Spark, certainly to the Mahout Scala DSL.

I’m a complete noob to Spark and Scala so I dove into Scala as a first step. It 
is deceptively simple but you run into odd limitations and special cases 
quickly. Anyway a good starting point seems to be Scala, especially its 
functional programming features. Those plus Spark’s architecture, the Mahout 
Scala DSL, and (especially for the scientist types out there) the Mahout Shell 
will make doing new code a couple of orders of magnitude easier than 
java/hadoop/mapreduce.

There is very strong support for Scala on stackoverflow. You will see my 
simpleton questions there and I encourage everyone to take advantage because 
the volume of stuff to Google is much smaller than for Java (obviously?)

On May 30, 2014, at 3:12 PM, Andrew Palumbo  wrote:

>> IMO we should wait on core DSL functionality if it’s
>> not there but if you are doing something that is external
>> then full blown dataframes may not block you or even help you. 
>> Drms are pretty mature. You’ll have to decide that based on 
>> your own needs.

Also wanted to say I agree completely; I'm not trying to jump the gun on this.

From: ap@outlook.com
To: dev@mahout.apache.org
Subject: RE: Sketching out scala traits and 1.0 API
Date: Fri, 30 May 2014 18:04:33 -0400




Just jumping in here real quick.. not trying to derail the conversation...

I have a lot of catching up to do on the status of the Dataframe 
implementation, the DSL, and Pat's ItemSimilarity implementation so that I can 
better understand what's going on. I'm going to try to take a look at this 
stuff over the weekend.

I think I see how my thinking on this has been wrong in terms of "Translating a 
Dataframe to a DRM".  Also, I think that NB was a bad example because it's kind 
of a special-case classifier.

I guess from my end what I'm wondering, in terms of laying out traits for 
classifiers, is: are we going to try to provide a kind of Weka- or R-like 
pluggable interface? And if so, how would that look?  I guess I'm speaking 
specifically about batch-trained, supervised classification algorithms 
at this point. (I'm not sure whether anybody else is interested in these going 
forward, but I am.)

For example, I'm doing some work right now that involves comparing results from 
some off-the-shelf algorithms, working in R with a small dense dataset; 
nothing really novel.  Once my dataframe is all set up, switching classifiers 
looks basically like this:

library(randomForest)
library(rpart)

# Train a random forest
res.rf <- randomForest( formula=formula, data=d_train, nodesize=1,
   classwt=CLASSWT, sampsize=length(d_train[,1]),
   proximity=FALSE, na.action=na.roughfix, ntree=1000)
# Train an rpart tree
res.rf <- rpart( formula=formula, data=d_train, method="class",
   control=rpart.control(minsplit=2, cp=0))

I know that this is not that useful to the typical Mahout user right now.  But 
with a shell/script, a Linear Algebra DSL with a distributed back end and a 
bunch of algorithms in the library, I think that this will be useful, or will 
draw in new users.  

The reason I brought up the full NB pipeline is to ensure that if we are to lay 
out traits for new (classification) algorithms, it is done in the most 
robust way possible, and in a way that eases development from prototyping in 
the shell to deployment. 
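
One way to picture the kind of pluggable interface asked about above is the sketch below. It is written in Python purely for illustration (the actual traits would be Scala), and every name in it (`Classifier`, `fit`, `predict`, the two toy classifiers, `accuracy`) is hypothetical rather than part of any Mahout API.

```python
from abc import ABC, abstractmethod
from collections import Counter

class Classifier(ABC):
    """Hypothetical common interface: every algorithm trains and predicts the same way."""

    @abstractmethod
    def fit(self, rows, labels):
        """Train on rows (lists of floats) and their labels; return self."""

    @abstractmethod
    def predict(self, row):
        """Return the predicted label for a single row."""

class MajorityClass(Classifier):
    """Baseline: always predicts the most frequent training label."""

    def fit(self, rows, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.label

class NearestCentroid(Classifier):
    """Predicts the label whose mean training vector is closest to the row."""

    def fit(self, rows, labels):
        sums, counts = {}, Counter(labels)
        for row, label in zip(rows, labels):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, x in enumerate(row):
                acc[j] += x
        self.centroids = {l: [s / counts[l] for s in acc] for l, acc in sums.items()}
        return self

    def predict(self, row):
        def dist(centroid):
            return sum((x - y) ** 2 for x, y in zip(row, centroid))
        return min(self.centroids, key=lambda l: dist(self.centroids[l]))

def accuracy(classifier, train, train_labels, test, test_labels):
    """Swapping algorithms only changes the object passed in here."""
    model = classifier.fit(train, train_labels)
    hits = sum(model.predict(r) == y for r, y in zip(test, test_labels))
    return hits / len(test)

train = [[0, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
train_labels = ["a", "a", "b", "b", "b"]
test, test_labels = [[0, 0.5], [5.5, 5.5]], ["a", "b"]

acc_centroid = accuracy(NearestCentroid(), train, train_labels, test, test_labels)
acc_majority = accuracy(MajorityClass(), train, train_labels, test, test_labels)
```

The point of the design is the same as in the R snippet: the data setup and evaluation code stay fixed while the classifier is swapped.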





> Date: Fri, 30 May 2014 14:54:20 -0700
> Subject: Re: Sketching out scala traits and 1.0 API
> From: dlie...@gmail.com
> To: dev@mahout.apache.org
> 
> Frankly, except for columnar organization and some math summarization
> functionality,  i don't see much difference between these data frames and
> e.g. scalding tuple-based manipulations.
> 
> 
> On Fri, May 30, 2014 at 2:50 PM, Dmitriy Lyubimov  wrote:
> 
>> I am not sure I understand the question. It would be possible to save results
>> of rowSimilarityJob as a data frame. No, data frames do not support quick
>> bidirectional indexing on demand, in the sense of bringing a full
>> column or row to the front-end process very quickly (e.g. row id -> row vector,
>> or columnName -> column). They will support iterative filtering and
>> mutating just like in the dplyr package of R (I hope).
>> 
>> In general, i'd only say that data frames are called data frames because
>> the scope of functionality and intent is that of R data frames (there's no
>> other source for the term of "data frame", i.e. matlab doesn't have those i
>> think) minus quick random individual cell access which is replaced by
>> dplyr-style FP computations.
>> 
>> So really i'd say one needs to look at dplyr and R to understand the scope
>> of this at this point in my head.
>> 
>> Filtering over rows (including their labels) is implied by dplyr and R.
>> column selection pattern is a bit different, via %.% select() and %.%
>> mutate (it assumes data frames are like tables, few attributes but a lot of
>> rows). Data frames

[jira] [Created] (MAHOUT-1567) Add online sparse dictionary learning

2014-05-31 Thread Maciej Kula (JIRA)
Maciej Kula created MAHOUT-1567:
---

 Summary: Add online sparse dictionary learning
 Key: MAHOUT-1567
 URL: https://issues.apache.org/jira/browse/MAHOUT-1567
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Reporter: Maciej Kula


I have recently implemented a sparse online dictionary learning algorithm, with 
an emphasis on learning very high-dimensional and very sparse dictionaries. It 
is based on J. Mairal et al 'Online Dictionary Learning for Sparse Coding' 
(http://www.di.ens.fr/willow/pdfs/icml09.pdf). It's an online variant of 
low-rank matrix factorization, suitable for sparse binary matrices (such as 
implicit feedback matrices).

I would be very happy to bring this up to the Mahout standard and contribute to 
the main codebase --- is this something you would in principle be interested in 
having?

The code (as well as some examples) is here: 
https://github.com/maciejkula/dictionarylearning





[jira] [Comment Edited] (MAHOUT-1552) Avoid new Configuration() instantiation

2014-05-31 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014575#comment-14014575
 ] 

Suneel Marthi edited comment on MAHOUT-1552 at 5/31/14 9:03 AM:


Is this related to the MR configuration not being passed across the job 
pipeline which was definitely an issue in Mahout 0.7 and was fixed in Mahout 
0.8?  If so this can be resolved as not a problem, given that this is still 
being reported against Mahout 0.7.

I would also add that any issue that's being reported against 0.7 needs to be 
first confirmed against present trunk before accepting any patches (most likely 
to happen with existing CDH 4x distros that were packaged with Mahout 0.7).


was (Author: smarthi):
Is this related to the MR configuration not being passed across the job 
pipeline which was definitely an issue in Mahout 0.7 and was fixed in Mahout 
0.8?  If so this can be resolved as not a problem, given that this is still 
being reported against Mahout 0.7.

> Avoid new Configuration() instantiation
> ---
>
> Key: MAHOUT-1552
> URL: https://issues.apache.org/jira/browse/MAHOUT-1552
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
> Environment: CDH 4.4, CDH 4.6
>Reporter: Sergey
> Fix For: 1.0
>
>
> Hi, it's related to MAHOUT-1498
> You run into trouble when running Mahout jobs from an Oozie java action.
> {code}
> java.lang.InterruptedException: Cluster Classification Driver Job failed 
> processing /tmp/sku/tfidf/90453
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)  
> {code}





[jira] [Commented] (MAHOUT-1552) Avoid new Configuration() instantiation

2014-05-31 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014575#comment-14014575
 ] 

Suneel Marthi commented on MAHOUT-1552:
---

Is this related to the MR configuration not being passed across the job 
pipeline which was definitely an issue in Mahout 0.7 and was fixed in Mahout 
0.8?  If so this can be resolved as not a problem, given that this is still 
being reported against Mahout 0.7.

> Avoid new Configuration() instantiation
> ---
>
> Key: MAHOUT-1552
> URL: https://issues.apache.org/jira/browse/MAHOUT-1552
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
> Environment: CDH 4.4, CDH 4.6
>Reporter: Sergey
> Fix For: 1.0
>
>
> Hi, it's related to MAHOUT-1498
> You run into trouble when running Mahout jobs from an Oozie java action.
> {code}
> java.lang.InterruptedException: Cluster Classification Driver Job failed 
> processing /tmp/sku/tfidf/90453
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)  
> {code}





[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014574#comment-14014574
 ] 

ASF GitHub Bot commented on MAHOUT-1505:


Github user sscdotopen commented on the pull request:

https://github.com/apache/mahout/pull/5#issuecomment-44743038
  
looks good to me, +1 for including this




[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.

2014-05-31 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014573#comment-14014573
 ] 

Sebastian Schelter commented on MAHOUT-1566:


I'm not sure whether we should really include the "standard" ALS in the new 
codebase. It is optimized for rating prediction on Netflix-like data which 
rarely exists outside of academia. I think we should rather focus on the ALS 
version targeted for implicit data (clicks, views, etc).

> Regular ALS factorizer with convergence test.
> -
>
> Key: MAHOUT-1566
> URL: https://issues.apache.org/jira/browse/MAHOUT-1566
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
>Priority: Trivial
> Fix For: 1.0
>
>
> ALS-related: let's start with unweighed, unregularized implementation.





[jira] [Updated] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-05-31 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1565:
---

Fix Version/s: 1.0

> add MR2 options to MAHOUT_OPTS in bin/mahout
> 
>
> Key: MAHOUT-1565
> URL: https://issues.apache.org/jira/browse/MAHOUT-1565
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0, 0.9
>Reporter: Nishkam Ravi
> Fix For: 1.0
>
> Attachments: MAHOUT-1565.patch
>
>
> MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
> those options.





[jira] [Commented] (MAHOUT-1564) Naive Bayes Classifier for New Text Documents

2014-05-31 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014572#comment-14014572
 ] 

Sebastian Schelter commented on MAHOUT-1564:


I don't see any reason to veto this, as it will make stuff that we have more 
useful.

> Naive Bayes Classifier for New Text Documents
> -
>
> Key: MAHOUT-1564
> URL: https://issues.apache.org/jira/browse/MAHOUT-1564
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> The MapReduce Naive Bayes implementation currently lacks the ability to classify 
> a new document (outside of the training/holdout corpus). I've begun some 
> work on a "ClassifyNew" job which will do the following:
> 1. Vectorize a new text document using the dictionary and document 
> frequencies from the training/holdout corpus.
> - Assume the original corpus was vectorized using `seq2sparse`; step (1) 
> will use all of the same parameters.
> 2. Score and label the new document using a previously trained model.
> I think that it will be a useful addition to the NB package. Unfortunately, 
> this is going to be mostly MR workhorse code and doesn't really introduce 
> much new logic. I will try to keep any new logic separate from the MR code so 
> that it can be called from Scala for MAHOUT-1493.
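
The two steps described above can be sketched roughly as follows. This is only an illustration in plain Python, not Mahout code and not the proposed "ClassifyNew" job: the function names, the whitespace tokenizer, the IDF formula and the layout of the model (log priors plus per-label log likelihoods) are all assumptions made for the example, and `seq2sparse` may tokenize and weight terms differently.

```python
import math
from collections import Counter


def vectorize(text, dictionary, doc_freq, num_docs):
    """Build a sparse TF-IDF vector for a new document.

    dictionary maps term -> feature index and doc_freq maps feature
    index -> document frequency; both come from the training corpus.
    Terms that are not in the training dictionary are dropped.
    """
    counts = Counter(tok for tok in text.lower().split() if tok in dictionary)
    vector = {}
    for term, tf in counts.items():
        idx = dictionary[term]
        # Assumed IDF variant for illustration only.
        idf = math.log(num_docs / (1.0 + doc_freq[idx])) + 1.0
        vector[idx] = tf * idf
    return vector


def classify(vector, log_prior, log_likelihood):
    """Score a sparse vector against a trained multinomial NB model.

    log_prior maps label -> log P(label); log_likelihood maps
    label -> list of log P(term | label), indexed by feature index.
    Returns the best label together with all scores.
    """
    scores = {
        label: log_prior[label]
        + sum(w * log_likelihood[label][idx] for idx, w in vector.items())
        for label in log_prior
    }
    best = max(scores, key=scores.get)
    return best, scores
```

The point of the sketch is that the new document is mapped into the feature space fixed at training time: nothing is added to the dictionary, and the document frequencies are reused rather than recomputed.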





[jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests

2014-05-31 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014570#comment-14014570
 ] 

Sebastian Schelter commented on MAHOUT-1543:


Could you create a pull request against the current Mahout codebase?

> JSON output format for classifying with random forests
> --
>
> Key: MAHOUT-1543
> URL: https://issues.apache.org/jira/browse/MAHOUT-1543
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Affects Versions: 0.7, 0.8, 0.9
> Reporter: larryhu
> Labels: patch
> Fix For: 0.7
>
> Attachments: MAHOUT-1543.patch
>
>
> This patch adds JSON output format to build random forests, 





[jira] [Commented] (MAHOUT-1552) Avoid new Configuration() instantiation

2014-05-31 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014571#comment-14014571
 ] 

Sebastian Schelter commented on MAHOUT-1552:


Could you suggest a way to fix the bug?

> Avoid new Configuration() instantiation
> ---
>
> Key: MAHOUT-1552
> URL: https://issues.apache.org/jira/browse/MAHOUT-1552
> Project: Mahout
> Issue Type: Bug
> Components: Classification
> Affects Versions: 0.7
> Environment: CDH 4.4, CDH 4.6
> Reporter: Sergey
> Fix For: 1.0
>
>
> Hi, this is related to MAHOUT-1498.
> You run into trouble when you run Mahout jobs from an Oozie Java action.
> {code}
> java.lang.InterruptedException: Cluster Classification Driver Job failed processing /tmp/sku/tfidf/90453
>   at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>   at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>   at org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>   at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>   at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)
> {code}





[jira] [Updated] (MAHOUT-1552) Avoid new Configuration() instantiation

2014-05-31 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1552:
---

Fix Version/s: 1.0

> Avoid new Configuration() instantiation
> ---
>
> Key: MAHOUT-1552
> URL: https://issues.apache.org/jira/browse/MAHOUT-1552
> Project: Mahout
> Issue Type: Bug
> Components: Classification
> Affects Versions: 0.7
> Environment: CDH 4.4, CDH 4.6
> Reporter: Sergey
> Fix For: 1.0
>
>
> Hi, this is related to MAHOUT-1498.
> You run into trouble when you run Mahout jobs from an Oozie Java action.
> {code}
> java.lang.InterruptedException: Cluster Classification Driver Job failed processing /tmp/sku/tfidf/90453
>   at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>   at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>   at org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>   at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>   at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)
> {code}





[jira] [Updated] (MAHOUT-1551) Add document to describe how to use mlp with command line

2014-05-31 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1551:
---

Fix Version/s: 1.0

> Add document to describe how to use mlp with command line
> -
>
> Key: MAHOUT-1551
> URL: https://issues.apache.org/jira/browse/MAHOUT-1551
> Project: Mahout
> Issue Type: Documentation
> Components: Classification, CLI, Documentation
> Affects Versions: 0.9
> Reporter: Yexi Jiang
> Labels: documentation
> Fix For: 1.0
>
>
> Add documentation on using the multi-layer perceptron from the command line.





[jira] [Updated] (MAHOUT-1524) Script to auto-generate and view the Mahout website on a local machine

2014-05-31 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1524:
---

Fix Version/s: 1.0

> Script to auto-generate and view the Mahout website on a local machine 
> ---
>
> Key: MAHOUT-1524
> URL: https://issues.apache.org/jira/browse/MAHOUT-1524
> Project: Mahout
> Issue Type: New Feature
> Components: Documentation
> Reporter: Saleem Ansari
> Fix For: 1.0
>
> Attachments: mahout-website.sh
>
>
> Attached to this ticket is a script that creates a simple setup for editing 
> the Mahout website on a local machine.
> It is useful because we can edit the source and the changes are 
> automatically reflected in the generated site. All we need to do is refresh 
> the browser. No further steps are required.
> So now one can review website changes (the complete website) on a 
> developer's machine.


