Re: JIRA- legacy & scala labels

2015-03-05 Thread Andrew Musselman
Thanks AP

On Thursday, March 5, 2015, Andrew Palumbo  wrote:

> I went through all of the unresolved JIRA issues and marked each with at
> least a "legacy" or "scala" label ("scala" for lack of a better name for all
> that is not legacy). Hopefully I got them all.
>
> Some are labelled with both (math, build, and documentation issues that
> relate to both or to neither, etc.)
>
> legacy issues:
>
> https://issues.apache.org/jira/browse/MAHOUT-1522?jql=project%20%3D%20MAHOUT%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20legacy%20ORDER%20BY%20priority%20DESC
>
> "scala" issues:
>
> https://issues.apache.org/jira/browse/MAHOUT-1522?jql=project%20%3D%20MAHOUT%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20scala%20ORDER%20BY%20priority%20DESC
>
> Hopefully this will help us get started closing out some old issues. I'll
> make another pass over them tomorrow and flag the ones that can be closed.
>



[jira] [Updated] (MAHOUT-1630) Incorrect SparseColumnMatrix.numSlices() causes IndexException in toString()

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1630:
---
Labels: legacy math scala  (was: )

> Incorrect SparseColumnMatrix.numSlices() causes IndexException in toString()
> 
>
> Key: MAHOUT-1630
> URL: https://issues.apache.org/jira/browse/MAHOUT-1630
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 1.0, 0.9
>Reporter: Oleg Nitz
>  Labels: legacy, math, scala
>
> SparseColumnMatrix overrides the numSlices() method incorrectly: it returns 
> numCols() instead of numRows(). 
> As a result, AbstractMatrix.toString() for wide matrices throws an exception.
> For example, this code:
> {quote}
> Matrix matrix = new SparseColumnMatrix(1, 2);
> matrix.toString();
> {quote}
> causes
> {quote}
> org.apache.mahout.math.IndexException: Index 1 is outside allowable range of 
> [0,1)
> at 
> org.apache.mahout.math.MatrixVectorView.&lt;init&gt;(MatrixVectorView.java:42)
> at 
> org.apache.mahout.math.AbstractMatrix.viewRow(AbstractMatrix.java:290)
> at 
> org.apache.mahout.math.AbstractMatrix$1.computeNext(AbstractMatrix.java:68)
> at 
> org.apache.mahout.math.AbstractMatrix$1.computeNext(AbstractMatrix.java:59)
> at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
> at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
> at 
> org.apache.mahout.math.AbstractMatrix.toString(AbstractMatrix.java:787)
> {quote}
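
A minimal sketch of the fix the report implies (assuming, as described above, 
that toString() walks slices with viewRow(), so the slice count must be the row 
count):

{code}
// Hypothetical one-line fix in SparseColumnMatrix, per the report:
@Override
public int numSlices() {
  return numRows(); // was numCols(), which overruns viewRow() for wide matrices
}
{code}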





[jira] [Commented] (MAHOUT-1628) Propagation of Updates in DF

2015-03-05 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349826#comment-14349826
 ] 

Andrew Palumbo commented on MAHOUT-1628:


"Data Frames" being DRMs I am assuming.

> Propagation of Updates in DF
> 
>
> Key: MAHOUT-1628
> URL: https://issues.apache.org/jira/browse/MAHOUT-1628
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Suminda Dharmasena
>  Labels: DSL, scala
>
> Given a data frame:
> C = A + B
> If some cells in A or B are updated:
> 1) calculate C using the latest values in A and B when C is accessed (pull)
> 2) when A or B is updated, propagate the changes to C (push)
> You can use a different operator for = in each of the above cases.
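
To make the two semantics concrete, here is a toy Java sketch of the pull and 
push variants (all names are illustrative; nothing here is Mahout API):

{code}
interface Cell { double get(); }

// Pull: C is a view that recomputes from the latest A and B on each access.
final class LazySum implements Cell {
  private final Cell a, b;
  LazySum(Cell a, Cell b) { this.a = a; this.b = b; }
  @Override public double get() { return a.get() + b.get(); } // evaluated on read
}

// Push: writing A or B immediately propagates the new sum into C.
final class EagerSum {
  private double a, b, c;
  void setA(double v) { a = v; c = a + b; } // propagate on write
  void setB(double v) { b = v; c = a + b; }
  double getC() { return c; }
}
{code}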





[jira] [Updated] (MAHOUT-1627) Problem with ALS Factorizer MapReduce version when working with oozie because of files in distributed cache. Error: Unable to read sequence file from cache.

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1627:
---
Labels: legacy  (was: )

> Problem with ALS Factorizer MapReduce version when working with oozie because 
> of files in distributed cache. Error: Unable to read sequence file from cache.
> 
>
> Key: MAHOUT-1627
> URL: https://issues.apache.org/jira/browse/MAHOUT-1627
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 1.0
> Environment: Hadoop
>Reporter: Srinivasarao Daruna
>  Labels: legacy
>
> There is a problem with the ALS Factorizer when working in a distributed 
> environment with oozie.
> Steps:
> 1) Built Mahout 1.0 jars and picked the mahout-mrlegacy jar.
> 2) Created a Java class that calls ParallelALSFactorizationJob with the 
> respective inputs.
> 3) Submitted the job; a list of MapReduce jobs was submitted to perform the 
> factorization.
> 4) The job failed at MultithreadedSharingMapper with the error "Unable to read 
> sequence file .jar", pointing at org.apache.mahout.cf.taste.hadoop.als.ALS 
> and its readMatrixByRowsFromDistributedCache method.
> Cause: The ALS class picks up its input files, which are sequence files, from 
> the distributed cache using readMatrixByRowsFromDistributedCache. However, in 
> an oozie environment the program jar is copied to the distributed cache along 
> with the input files. Since the ALS class tries to read all the files in the 
> distributed cache, it fails when it encounters the jar.
> The remedy would be a condition that picks only the files that are not jars, 
> as sketched below.
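
A minimal sketch of that remedy (the filter condition is the suggestion; 
readMatrixRows is a hypothetical stand-in for the cache-reading code):

{code}
// needs: org.apache.hadoop.fs.Path
// Skip jar archives when scanning the distributed cache for matrix rows.
for (Path cached : cacheFiles) {
  if (cached.getName().endsWith(".jar")) {
    continue; // oozie ships the program jar here; it is not a sequence file
  }
  readMatrixRows(cached); // hypothetical: read only genuine data files
}
{code}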





[jira] [Updated] (MAHOUT-1628) Propagation of Updates in DF

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1628:
---
Labels: DSL scala  (was: legacy)

> Propagation of Updates in DF
> 
>
> Key: MAHOUT-1628
> URL: https://issues.apache.org/jira/browse/MAHOUT-1628
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Suminda Dharmasena
>  Labels: DSL, scala
>
> Given a data frame:
> C = A + B
> If some cells in A or B are updated:
> 1) calculate C using the latest values in A and B when C is accessed (pull)
> 2) when A or B is updated, propagate the changes to C (push)
> You can use a different operator for = in each of the above cases.





[jira] [Updated] (MAHOUT-1637) RecommenderJob of ALS fails in the mapper because it uses the instance of other class

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1637:
---
Labels: als collaborative-filtering legacy  (was: als 
collaborative-filtering)

> RecommenderJob of ALS fails in the mapper because it uses the instance of 
> other class
> -
>
> Key: MAHOUT-1637
> URL: https://issues.apache.org/jira/browse/MAHOUT-1637
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Cristian Galán
>  Labels: als, collaborative-filtering, legacy
> Fix For: 1.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> In the map method of PredictionMapper, when it executes the line
> Pair&lt;OpenIntObjectHashMap&lt;Vector&gt;, OpenIntObjectHashMap&lt;Vector&gt;&gt; uAndM = 
> getSharedInstance();
> the job fails because it obtains an instance of another class. This happens 
> because I launch a local job, so the instance already exists and the correct 
> new instance for ALS is never created.
> The solution that works for me is to add the line
> SharingMapper.reset();
> in the run method of RecommenderJob in the 
> org.apache.mahout.cf.taste.hadoop.als package.
> I still have to test whether it works correctly in my environment with 
> distributed mapreduce, hadoop fs, zookeeper and the others.
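
A minimal sketch of where that workaround would go (SharingMapper.reset() is 
the reporter's proposed addition, not an existing Mahout method; its placement 
at the top of run() is assumed):

{code}
// In org.apache.mahout.cf.taste.hadoop.als.RecommenderJob (sketch):
@Override
public int run(String[] args) throws Exception {
  SharingMapper.reset(); // drop any instance cached by a previous local job
  // ... existing job setup and submission continue unchanged ...
  return 0;
}
{code}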





[jira] [Updated] (MAHOUT-1628) Propagation of Updates in DF

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1628:
---
Labels: legacy  (was: )

> Propagation of Updates in DF
> 
>
> Key: MAHOUT-1628
> URL: https://issues.apache.org/jira/browse/MAHOUT-1628
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Suminda Dharmasena
>  Labels: legacy
>
> Given a data frame:
> C = A + B
> If some cells in A or B are updated:
> 1) calculate C using the latest values in A and B when C is accessed (pull)
> 2) when A or B is updated, propagate the changes to C (push)
> You can use a different operator for = in each of the above cases.





[jira] [Updated] (MAHOUT-1629) Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense when using s3 folder as --input

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1629:
---
Labels: legacy  (was: )

> Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense when using s3 folder 
> as --input
> 
>
> Key: MAHOUT-1629
> URL: https://issues.apache.org/jira/browse/MAHOUT-1629
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.9
> Environment: AWS EMR with AMI 3.2.3
>Reporter: Markus Paaso
>  Labels: legacy
>
> When running the 'mahout cvb' command on AWS EMR with an --input value like 
> s3://mybucket/input/ or s3://mybucket/input/* (7 input files in my case), 
> the content of the doc-topic output is nonsense. It seems the docIds in the 
> doc-topic output are shuffled, while the topic model output (p(term|topic) 
> for each topic) still looks fine.
> The workaround is to first copy the input files from s3 to the cluster's hdfs 
> with the command:
>  {code:none}hadoop fs -cp s3://mybucket/input /input{code}
> and then run mahout cvb with option --input /input .





[jira] [Updated] (MAHOUT-1635) Getting an exception when I provide classification labels manually for Naive Bayes

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1635:
---
Labels: legacy  (was: )

> Getting an exception when I provide classification labels manually for Naive 
> Bayes
> --
>
> Key: MAHOUT-1635
> URL: https://issues.apache.org/jira/browse/MAHOUT-1635
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 1.0
>Reporter: Suman Somasundar
>Assignee: Andrew Palumbo
>  Labels: legacy
> Attachments: zip_1
>
>
> If I let the Naive Bayes program itself extract the classification labels, 
> the program runs fine. But, I get the following error when I provide the 
> classification labels for the dataset manually.
> Error: java.lang.IllegalArgumentException: Wrong numLabels: 0. Must be > 0!
> at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
> at 
> org.apache.mahout.classifier.naivebayes.training.WeightsMapper.setup(WeightsMapper.java:45)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:169)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1640)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)





[jira] [Updated] (MAHOUT-1634) ALS doesn't work when it adds new files in Distributed Cache

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1634:
---
Labels: ALS legacy  (was: ALS)

> ALS doesn't work when it adds new files in Distributed Cache
> --
>
> Key: MAHOUT-1634
> URL: https://issues.apache.org/jira/browse/MAHOUT-1634
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.9
> Environment: Cloudera 5.1 VM, eclipse, zookeeper
>Reporter: Cristian Galán
>  Labels: ALS, legacy
> Fix For: 1.0
>
> Attachments: mahout.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The ALS algorithm uses the distributed cache for temp files, but the 
> distributed cache has other uses too, especially adding dependencies
> (http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/),
> so when a hadoop job adds a dependency library (or another file), ALS fails 
> because it reads ALL the files in the distributed cache without distinction.
> This occurs in my company's project because we need to add the Mahout 
> dependencies (mahout, lucene, ...) to a hadoop Configuration to run Mahout 
> jobs; otherwise the jobs fail because they can't find the dependencies.
> I propose two options (I think both are valid):
> 1) Eliminate all .jar files from the return of HadoopUtil.getCacheFiles
> 2) Eliminate all Path objects other than /part-* (see the sketch after this 
> message)
> I prefer the first because it's less aggressive, and I think it would resolve 
> all the problems.
> PS: Sorry if my English is wrong.
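
Option 1 is essentially the jar filter sketched under MAHOUT-1627 earlier in 
this digest; for contrast, a minimal sketch of option 2's part-* whitelist 
(HadoopUtil.getCacheFiles is the method named in the report; everything around 
it is assumed):

{code}
// needs: org.apache.hadoop.fs.Path, java.util.List, java.util.ArrayList
// Keep only MapReduce output shards (part-*) among the cached paths.
List<Path> dataFiles = new ArrayList<Path>();
for (Path cached : HadoopUtil.getCacheFiles(conf)) {
  if (cached.getName().startsWith("part-")) {
    dataFiles.add(cached); // drop jars and any other non-data cache entries
  }
}
{code}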





[jira] [Updated] (MAHOUT-1252) Add support for Finite State Transducers (FST) as a DictionaryType.

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1252:
---
Labels: scala spark  (was: )

> Add support for Finite State Transducers (FST) as a DictionaryType.
> ---
>
> Key: MAHOUT-1252
> URL: https://issues.apache.org/jira/browse/MAHOUT-1252
> Project: Mahout
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.7
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>  Labels: scala, spark
> Fix For: 1.0
>
>
> Add support for Finite State Transducers (FST) as a DictionaryType, this 
> should result in an order of magnitude speedup of seq2sparse.
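
For orientation, a minimal sketch of a term-to-id dictionary built on Lucene's 
FST classes (Lucene 4.x-era API; exact signatures vary between Lucene versions, 
so treat this as an assumption-laden illustration rather than the planned 
implementation):

{code}
// needs: org.apache.lucene.util.{BytesRef, IntsRef} and org.apache.lucene.util.fst.*
// Terms must be added in sorted order; the FST then maps term bytes to a long id.
PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
IntsRef scratch = new IntsRef();
builder.add(Util.toIntsRef(new BytesRef("apple"), scratch), 0L);
builder.add(Util.toIntsRef(new BytesRef("banana"), scratch), 1L);
FST<Long> fst = builder.finish();
Long id = Util.get(fst, new BytesRef("banana")); // -> 1
{code}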





[jira] [Updated] (MAHOUT-1540) Reuters example for spectral clustering

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1540:
---
Labels: DSL scala spark  (was: )

> Reuters example for spectral clustering
> ---
>
> Key: MAHOUT-1540
> URL: https://issues.apache.org/jira/browse/MAHOUT-1540
> Project: Mahout
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.0
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>  Labels: DSL, scala, spark
> Fix For: 1.0
>
>
> Once MAHOUT-1538 and MAHOUT-1539 are complete, create a working example of 
> spectral clustering using the Reuters dataset.





[jira] [Updated] (MAHOUT-1567) Add online sparse dictionary learning (dimensionality reduction)

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1567:
---
Labels: DSL scala  (was: )

> Add online sparse dictionary learning (dimensionality reduction)
> 
>
> Key: MAHOUT-1567
> URL: https://issues.apache.org/jira/browse/MAHOUT-1567
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Reporter: Maciej Kula
>  Labels: DSL, scala
>
> I have recently implemented a sparse online dictionary learning algorithm, 
> with an emphasis on learning very high-dimensional and very sparse 
> dictionaries. It is based on J. Mairal et al 'Online Dictionary Learning for 
> Sparse Coding' (http://www.di.ens.fr/willow/pdfs/icml09.pdf). It's an online 
> variant of low-rank matrix factorization, suitable for sparse binary matrices 
> (such as implicit feedback matrices).
> I would be very happy to bring this up to the Mahout standard and contribute 
> to the main codebase --- is this something you would in principle be 
> interested in having?
> The code (as well as some examples) is here: 
> https://github.com/maciejkula/dictionarylearning
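
For reference, the per-sample objective in the cited Mairal et al. paper is 
$\min_{D,\alpha} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$, 
optimized online by alternating a sparse coding step (fix D, solve for 
$\alpha$) with a dictionary update over the samples seen so far.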





[jira] [Updated] (MAHOUT-1570) Adding support for Stratosphere as a backend for the Mahout DSL

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1570:
---
Labels: DSL flink scala  (was: )

> Adding support for Stratosphere as a backend for the Mahout DSL
> ---
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Till Rohrmann
>  Labels: DSL, flink, scala
>
> With the finalized abstraction of logical Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL.
> I'd like to evaluate to what extent this can already be done for Stratosphere 
> and what can be done to solve any problems that arise.
> The biggest difference between Spark and Stratosphere at the moment is 
> probably the incremental rollout of plans, which is triggered by Spark's 
> actions and which is not supported by Stratosphere yet. However, the 
> Stratosphere team is working on this issue. For the moment, it should be 
> possible to circumvent the problem by writing intermediate results required 
> by an action to HDFS and reading them from there.
> Thus, this work should be considered a proof of concept rather than a highly 
> efficient implementation; its purpose is to evaluate where the logical plan 
> abstraction might be refined in order to support different backends.





[jira] [Commented] (MAHOUT-1569) Create CLI driver that supports Spark jobs

2015-03-05 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349804#comment-14349804
 ] 

Andrew Palumbo commented on MAHOUT-1569:


[~pferrel] this can be closed, right?

> Create CLI driver that supports Spark jobs
> --
>
> Key: MAHOUT-1569
> URL: https://issues.apache.org/jira/browse/MAHOUT-1569
> Project: Mahout
>  Issue Type: New Feature
>  Components: CLI
> Environment: Scala, Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>  Labels: scala, spark
>
> Create a design for CLI drivers, including an option parser and a base 
> MahoutDriver for Spark that uses the text file I/O mechanism of MAHOUT-1568.
> A version of the proposal is implemented and running for ItemSimilarity on 
> Spark (MAHOUT-1541) and is documented on the github wiki here: 
> https://github.com/pferrel/harness/wiki
> Comments are appreciated.





[jira] [Updated] (MAHOUT-1568) Build an I/O model that can replace sequence files for import/export

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1568:
---
Labels: scala spark  (was: )

> Build an I/O model that can replace sequence files for import/export
> 
>
> Key: MAHOUT-1568
> URL: https://issues.apache.org/jira/browse/MAHOUT-1568
> Project: Mahout
>  Issue Type: New Feature
>  Components: CLI
> Environment: Scala, Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>  Labels: scala, spark
>
> Implement mechanisms to read and write data from/to flexible stores. These 
> will support tuple streams and DRMs, with extensions that allow keeping 
> user-defined values for IDs. The mechanism can in some sense replace sequence 
> files for import/export and will make the operation much easier for the user, 
> in many cases directly consuming their input files.
> Start with text-delimited files for input/output in the Spark version of 
> ItemSimilarity.
> A proposal is running with ItemSimilarity on Spark and is documented on the 
> github wiki here: https://github.com/pferrel/harness/wiki
> Comments are appreciated.





[jira] [Updated] (MAHOUT-1569) Create CLI driver that supports Spark jobs

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1569:
---
Labels: scala spark  (was: )

> Create CLI driver that supports Spark jobs
> --
>
> Key: MAHOUT-1569
> URL: https://issues.apache.org/jira/browse/MAHOUT-1569
> Project: Mahout
>  Issue Type: New Feature
>  Components: CLI
> Environment: Scala, Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>  Labels: scala, spark
>
> Create a design for CLI drivers, including an option parser and a base 
> MahoutDriver for Spark that uses the text file I/O mechanism of MAHOUT-1568.
> A version of the proposal is implemented and running for ItemSimilarity on 
> Spark (MAHOUT-1541) and is documented on the github wiki here: 
> https://github.com/pferrel/harness/wiki
> Comments are appreciated.





[jira] [Updated] (MAHOUT-1626) Support for required quasi-algebraic operations and starting with aggregating rows/blocks

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1626:
---
Labels: DSL scala spark  (was: )

> Support for required quasi-algebraic operations and starting with aggregating 
> rows/blocks
> -
>
> Key: MAHOUT-1626
> URL: https://issues.apache.org/jira/browse/MAHOUT-1626
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 1.0
>Reporter: Gokhan Capan
>  Labels: DSL, scala, spark
> Fix For: 1.0
>
>






[jira] [Updated] (MAHOUT-1625) lucene2seq: failure to convert a document that does not contain a field (the field is not required)

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1625:
---
Labels: LuceneIndexHelper easyfix legacy lucene lucene2seq mahout  (was: 
LuceneIndexHelper easyfix lucene lucene2seq mahout)

> lucene2seq: failure to convert a document that does not contain a field (the 
> field is not required)
> ---
>
> Key: MAHOUT-1625
> URL: https://issues.apache.org/jira/browse/MAHOUT-1625
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 1.0
> Environment: CentOS 6.5
>Reporter: Tom Lampert
>  Labels: LuceneIndexHelper, easyfix, legacy, lucene, lucene2seq, 
> mahout
> Fix For: 1.0
>
>
> When trying to convert a lucene index in which not all fields are required 
> (and therefore in some documents the field does not exist) the following 
> exception is thrown:
> java.lang.IllegalArgumentException: Field 'MISSING_FIELDNAME' does not exist 
> in the index
>   at 
> org.apache.mahout.text.LuceneIndexHelper.fieldShouldExistInIndex(LuceneIndexHelper.java:36)
>   at 
> org.apache.mahout.text.LuceneSegmentRecordReader.initialize(LuceneSegmentRecordReader.java:63)
>   at 
> org.apache.mahout.text.LuceneSegmentInputFormat.createRecordReader(LuceneSegmentInputFormat.java:76)
>   at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.&lt;init&gt;(MapTask.java:492)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:735)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
> It would be good to either ignore missing field values by default or to have 
> an additional parameter that turns ignoring them on or off.





[jira] [Updated] (MAHOUT-1507) Support input and output using user defined ID wherever possible

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1507:
---
Labels: DSL scala spark  (was: spark)

> Support input and output using user defined ID wherever possible
> 
>
> Key: MAHOUT-1507
> URL: https://issues.apache.org/jira/browse/MAHOUT-1507
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.9
> Environment: Spark Scala, Mahout v2
>Reporter: Pat Ferrel
>  Labels: DSL, scala, spark
> Fix For: 1.0
>
>
> All users of Mahout have data which is addressed by keys or IDs of their own 
> devising. In order to use much of Mahout they must translate these IDs into 
> Mahout IDs, then run their jobs and translate back again when retrieving the 
> output. If the ID space is very large this is a difficult problem for users 
> to solve at scale.
> For many Mahout operations this would not be necessary if these external keys 
> could be maintained for vectors and dimensions, or for rows and columns of a 
> DRM.
> The reason I bring this up now is that much groundwork is being laid for 
> Mahout's future on Spark so getting this notion in early could be 
> fundamentally important and used to build on.
> If external IDs for rows and columns were maintained then RSJ, DRM Transpose 
> (and other DRM ops), vector extraction, clustering, and recommenders would 
> need no ID translation steps, a big user win.
> A partial solution might be to support external row IDs alone somewhat like 
> the NamedVector and PropertyVector in the Mahout hadoop code.
> On Apr 3, 2014, at 11:00 AM, Pat Ferrel  wrote:
> Perhaps this is best phrased as a feature request.
> On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov  wrote:
> PS.
> Sequence file keys also have special meaning if they are Ints. E.g. the A'
> physical operator requires keys to be ints, in which case it interprets
> them as row indexes that become column indexes. This of course isn't always
> the case; e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
> reality the optimizer will never choose actual transposition as a physical
> step in such a pipeline. This interpretation is consistent with that of the
> long-existing Hadoop-side DistributedRowMatrix#transpose.
> On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov  wrote:
> On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel  wrote:
> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov  wrote:
> I think this duality, names and keys, is not very healthy really, and just
> creates additional hassle. Spark drm takes care of keys automatically
> throughout, but propagating names from named vectors is solely the
> algorithm's concern as it stands.
> Not sure what you mean.
> Not what you think, it looks like.
> I mean that the Mahout DRM structure is a bag of (key -> Vector) pairs. When
> persisted, the key goes to the key of a sequence file. In particular, it means
> that there is a case of Bag[key -> NamedVector], which means an external
> anchor could be saved to either the key or the name of a row. In practice this
> causes a compatibility mess; e.g. we saw numerous cases where seq2sparse
> saves external keys (file paths) into the key, whereas clustering
> algorithms don't see them because they expect them to be the name part
> of the vector. I am just saying we have two ways to name the rows, and that
> is generally not a healthy choice for the aforementioned reason.
> In my experience Names and Properties are primarily used to store
> external keys, which is quite healthy.
> Users never have data with Mahout keys; they must constantly go back and
> forth. This is exactly what the R data frame does, no? I'm not so concerned
> with being able to address an element by the external key,
> drmB["pat"]["iPad"], like a HashMap. But it would sure be nice to have the
> external ids follow the data through any calculation that makes sense.
> I am with you on this.
> This would mean clustering, recommendations, transpose, RSJ would require
> no id transforming steps. This would make dealing with Mahout much easier.
> Data frames are a little bit of a different thing; right now we work just with
> matrices. Although, yes, our in-core matrices support row and column names
> (just like in R) and distributed matrices support row keys only. What I
> mean is that an algebraic expression, e.g.
> Aexpr %*% Bexpr, will automatically propagate _keys_ from Aexpr as implied
> above, but not necessarily named vectors, because internally algorithms
> blockify things into matrix blocks, and I am far from sure that Mahout
> in-core stuff works correctly with named vectors as part of a matrix block
> in all situations. I may be wrong. I always relied on sequence file keys to
> identify data points.
> Note that sequence file keys are bigger than

[jira] [Updated] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1541:
---
Labels: DSL scala spark  (was: )

> Create CLI Driver for Spark Cooccurrence Analysis
> -
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
>  Issue Type: New Feature
>  Components: CLI
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>  Labels: DSL, scala, spark
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MR code does, but it 
> will support reading externally defined IDs and flexible formats. Output will 
> be in the legacy format or in text files of the user's specification with 
> reattached item IDs.
> Support for legacy formats is an open question; users can always use the 
> legacy code if they want it. Internal to the IndexedDataset is a Spark DRM, 
> so pipelining can be accomplished without writing to an actual file, and the 
> legacy sequence file output may not be needed.
> Opinions?
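
A minimal sketch of the BiMap ID translation idea (Guava's HashBiMap is real; 
the dictionary-building logic around it is an assumption):

{code}
// needs: com.google.common.collect.{BiMap, HashBiMap}
// external ID -> dense internal row index, and back again for output.
BiMap<String, Integer> rowDict = HashBiMap.create();
if (!rowDict.containsKey(externalId)) {
  rowDict.put(externalId, rowDict.size()); // assign the next dense index
}
int row = rowDict.get(externalId);            // translate in
String external = rowDict.inverse().get(row); // translate back out
{code}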





[jira] [Updated] (MAHOUT-1538) Port spectral clustering to Mahout DSL

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1538:
---
Labels: DSL Spark scala  (was: )

> Port spectral clustering to Mahout DSL
> --
>
> Key: MAHOUT-1538
> URL: https://issues.apache.org/jira/browse/MAHOUT-1538
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 1.0
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>  Labels: DSL, Spark, scala
> Fix For: 1.0
>
>
> Move spectral clustering logic to Mahout DSL. Dependencies include SSVD 
> (already ported) and K-means (currently in progress, or can use Spark MLlib 
> implementation as a temporary fix).





[jira] [Updated] (MAHOUT-1539) Implement affinity matrix computation in Mahout DSL

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1539:
---
Labels: DSL scala spark  (was: )

> Implement affinity matrix computation in Mahout DSL
> ---
>
> Key: MAHOUT-1539
> URL: https://issues.apache.org/jira/browse/MAHOUT-1539
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 1.0
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>  Labels: DSL, scala, spark
> Fix For: 1.0
>
> Attachments: ComputeAffinities.scala
>
>
> This has the same goal as MAHOUT-1506, but rather than code the pairwise 
> computations in MapReduce, this will be done in the Mahout DSL.
> An orthogonal issue is the format of the raw input (vectors, text, images, 
> SequenceFiles), and how the user specifies the distance equation and any 
> associated parameters.
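
For concreteness, one common (though not mandated by this ticket) choice of 
distance-based affinity is the Gaussian kernel 
$A_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$, where $\sigma$ is one such 
user-specified parameter.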





[jira] [Updated] (MAHOUT-1524) Script to auto-generate and view the Mahout website on a local machine

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1524:
---
Labels: legacy scala  (was: )

> Script to auto-generate and view the Mahout website on a local machine 
> ---
>
> Key: MAHOUT-1524
> URL: https://issues.apache.org/jira/browse/MAHOUT-1524
> Project: Mahout
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Saleem Ansari
>  Labels: legacy, scala
> Fix For: 1.0
>
> Attachments: mahout-website.sh
>
>
> Attached to this ticket is a script that creates a simple setup for editing 
> the Mahout website on a local machine.
> It is useful in the sense that we can edit the source and the changes are 
> automatically reflected in the generated site. All we need to do is refresh 
> the browser; no further steps are required.
> So now one can review the website changes (the complete website) on a 
> developer's machine.





[jira] [Updated] (MAHOUT-1613) classifier.df.tools.Describe does not handle -D parameters

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1613:
---
Labels: legacy  (was: )

> classifier.df.tools.Describe does not handle -D parameters
> --
>
> Key: MAHOUT-1613
> URL: https://issues.apache.org/jira/browse/MAHOUT-1613
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.9
>Reporter: Haohui Mai
>  Labels: legacy
> Attachments: MAHOUT-1613.000.patch
>
>
> classifier.df.tools.Describe does not handle -D parameters:
> {noformat}
> hadoop jar mahout-examples-0.9.0.2.1.3.0-1887-job.jar 
> org.apache.mahout.classifier.df.tools.Describe -Dio.sort.factor=30 --path 
> /user/hdp/glass.data --file /user/hdp/glass2.info --descriptor I 9 N L
> {noformat}
> Output:
> {noformat}
> org.apache.commons.cli2.OptionException: Unexpected -Dio.sort.factor while 
> processing Options
> {noformat}
> Credit to Dave Wannemacher for reporting this issue.
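
A minimal sketch of the usual Hadoop-side fix (GenericOptionsParser is a real 
Hadoop class; wiring it into Describe is the assumed part):

{code}
// needs: org.apache.hadoop.conf.Configuration, org.apache.hadoop.util.GenericOptionsParser
// Strip -Dkey=value pairs into the Configuration before the commons-cli2
// parsing that Describe performs today.
Configuration conf = new Configuration();
String[] remaining = new GenericOptionsParser(conf, args).getRemainingArgs();
// `remaining` no longer contains -Dio.sort.factor=30 and can be passed to the
// existing option parser unchanged.
{code}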





[jira] [Updated] (MAHOUT-1614) Test failure and failure when converting from Sequence Files due to Permissions

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1614:
---
Labels: legacy  (was: )

> Test failure and failure when converting from Sequence Files due to Permissions
> --
>
> Key: MAHOUT-1614
> URL: https://issues.apache.org/jira/browse/MAHOUT-1614
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI, Examples, Math
>Affects Versions: 0.9
> Environment: Windows 7, 32bit, Command Line and Eclipse
>Reporter: Kian Momtahan
>  Labels: legacy
>
> I've noticed an issue with Mahout 0.9 when running on Windows 7. Attempting 
> to convert a sequence file to sparse vectors results in the error "Failed to 
> set permissions of path:  
> mahout\cote\target\mahout-SpareVectorsFromSequenceFilesTest...\hadoop...\mapred\staging\..."
> Initially I encountered this issue in an eclipse project, but on attempting 
> to build Mahout from source, the problem occurred throughout many of 
> the tests.
> Further issues were noticed during ClusterClassification, as a thread leak 
> occurs.





[jira] [Commented] (MAHOUT-1607) spark-shell:scheduler.DAGScheduler: Failed to run fold at CheckpointedDrmSpark.scala:192

2015-03-05 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349793#comment-14349793
 ] 

Andrew Palumbo commented on MAHOUT-1607:


This looks like it was probably due to the user's spark version: 1.0.2, while 
we were still on 0.9.1.

> spark-shell:scheduler.DAGScheduler: Failed to run fold at 
> CheckpointedDrmSpark.scala:192
> 
>
> Key: MAHOUT-1607
> URL: https://issues.apache.org/jira/browse/MAHOUT-1607
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 1.0
> Environment: ubuntu 13.x   x64
> jdk1.7.0_65
> scala 2.10.4
> spark 1.0.2
>Reporter: hhlin
>  Labels: DSL, scala, spark, test
> Fix For: 1.0
>
>
> Following the steps at 
> http://mahout.apache.org/users/sparkbindings/play-with-shell.html, the mahout 
> spark-shell starts up normally, but when executing "val drmX = drmData(::, 0 
> until 4);" it throws the exception below:
> 14/08/17 20:13:20 INFO scheduler.DAGScheduler: Failed to run fold at 
> CheckpointedDrmSpark.scala:192
> 14/08/17 20:13:20 INFO scheduler.TaskSetManager: Loss was due to 
> java.lang.ArrayStoreException: scala.Tuple3 [duplicate 6]
> 14/08/17 20:13:20 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, 
> whose tasks have all completed, from pool 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 
> failed 4 times, most recent failure: Exception failure in TID 6 on host 
> iZ23qefud7nZ: java.lang.ArrayStoreException: scala.Tuple3
> 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
> 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> 
> com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:34)
> 
> com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:21)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:118)
> 
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)
> 
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)
> 
> org.apache.spark.util.Utils$.deserializeViaNestedStream(Utils.scala:120)
> 
> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:80)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:606)
> 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.skipCustomData(ObjectInputStream.java:1956)
> 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1850)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(Array

[jira] [Updated] (MAHOUT-1607) spark-shell:scheduler.DAGScheduler: Failed to run fold at CheckpointedDrmSpark.scala:192

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1607:
---
Labels: DSL scala spark test  (was: test)

> spark-shell:scheduler.DAGScheduler: Failed to run fold at 
> CheckpointedDrmSpark.scala:192
> 
>
> Key: MAHOUT-1607
> URL: https://issues.apache.org/jira/browse/MAHOUT-1607
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 1.0
> Environment: ubuntu 13.x   x64
> jdk1.7.0_65
> scala 2.10.4
> spark 1.0.2
>Reporter: hhlin
>  Labels: DSL, scala, spark, test
> Fix For: 1.0
>
>
> Following the steps at 
> http://mahout.apache.org/users/sparkbindings/play-with-shell.html, the mahout 
> spark-shell starts up normally, but when executing "val drmX = drmData(::, 0 
> until 4);" it throws the exception below:
> 14/08/17 20:13:20 INFO scheduler.DAGScheduler: Failed to run fold at 
> CheckpointedDrmSpark.scala:192
> 14/08/17 20:13:20 INFO scheduler.TaskSetManager: Loss was due to 
> java.lang.ArrayStoreException: scala.Tuple3 [duplicate 6]
> 14/08/17 20:13:20 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, 
> whose tasks have all completed, from pool 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 
> failed 4 times, most recent failure: Exception failure in TID 6 on host 
> iZ23qefud7nZ: java.lang.ArrayStoreException: scala.Tuple3
> 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
> 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> 
> com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:34)
> 
> com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:21)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:118)
> 
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)
> 
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)
> 
> org.apache.spark.util.Utils$.deserializeViaNestedStream(Utils.scala:120)
> 
> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:80)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:606)
> 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.skipCustomData(ObjectInputStream.java:1956)
> 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1850)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>   a

[jira] [Updated] (MAHOUT-1612) NullPointerException happens during JSON output format for clusterdumper

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1612:
---
Labels: legacy  (was: )

> NullPointerException happens during JSON output format for clusterdumper
> 
>
> Key: MAHOUT-1612
> URL: https://issues.apache.org/jira/browse/MAHOUT-1612
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Guo Ruijing
>  Labels: legacy
>
> 1. download datafile from:
> http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
> 2. put data file on hdfs:
> hdfs dfs -mkdir testdata
> hdfs dfs -put synthetic_control.data testdata/
> 3. run a mahout clustering job:
> mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
> 4. run clusterdump with JSON format:
> mahout clusterdump -i output/clusters*-final -p output/clusteredPoints -o 
> /tmp/report -of JSON
> expected:
> clusterdump with JSON format should succeed, the same as CSV and TEXT
> actually:
> clusterdump with JSON format throws a NullPointerException





[jira] [Updated] (MAHOUT-1490) Data frame R-like bindings

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1490:
---
Labels: DSL scala  (was: )

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Grant Ingersoll
>  Labels: DSL, scala
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark





[jira] [Updated] (MAHOUT-1495) Create a website describing the distributed item-based recommender

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1495:
---
Labels: legacy  (was: )

> Create a website describing the distributed item-based recommender
> --
>
> Key: MAHOUT-1495
> URL: https://issues.apache.org/jira/browse/MAHOUT-1495
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering, Documentation
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>  Labels: legacy
> Fix For: 1.0
>
>






[jira] [Updated] (MAHOUT-1477) Clean up website on Logistic Regression

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1477:
---
Labels: legacy  (was: )

> Clean up website on Logistic Regression
> ---
>
> Key: MAHOUT-1477
> URL: https://issues.apache.org/jira/browse/MAHOUT-1477
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Sebastian Schelter
>  Labels: legacy
> Fix For: 1.0
>
>
> The website on Logistic regression needs clean up. We need to go through the 
> text, remove dead links and check whether the information is still consistent 
> with the current code. We should also link to the example created in 
> MAHOUT-1425 
> https://mahout.apache.org/users/classification/logistic-regression.html





[jira] [Updated] (MAHOUT-1469) Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1469:
---
Labels: legacy  (was: )

> Streaming KMeans fails when executed in MapReduce mode and 
> REDUCE_STREAMING_KMEANS is set to true
> -
>
> Key: MAHOUT-1469
> URL: https://issues.apache.org/jira/browse/MAHOUT-1469
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>  Labels: legacy
> Fix For: 1.0
>
>
> Centroids are not being generated when executed in MR mode with -rskm flag 
> set. 
> {Code}
> 14/03/20 02:42:12 INFO mapreduce.StreamingKMeansThread: Estimated Points: 282
> 14/03/20 02:42:12 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/20 02:42:14 INFO mapreduce.StreamingKMeansReducer: Number of Centroids: 0
> 14/03/20 02:42:14 WARN mapred.LocalJobRunner: job_local1374896815_0001
> java.lang.IllegalArgumentException: Must have nonzero number of training and 
> test vectors. Asked for %.1f %% of %d vectors for test [10.00149011612, 0]
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
>   at 
> org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
>   at 
> org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
>   at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
>   at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
>   at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> 14/03/20 02:42:14 INFO mapred.JobClient: Job complete: 
> job_local1374896815_0001
> 14/03/20 02:42:14 INFO mapred.JobClient: Counters: 16
> 14/03/20 02:42:14 INFO mapred.JobClient:   File Input Format Counters 
> 14/03/20 02:42:14 INFO mapred.JobClient: Bytes Read=17156391
> 14/03/20 02:42:14 INFO mapred.JobClient:   FileSystemCounters
> 14/03/20 02:42:14 INFO mapred.JobClient: FILE_BYTES_READ=41925624
> 14/03/20 02:42:14 INFO mapred.JobClient: FILE_BYTES_WRITTEN=25974741
> 14/03/20 02:42:14 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output materialized 
> bytes=956293
> 14/03/20 02:42:14 INFO mapred.JobClient: Map input records=21578
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce shuffle bytes=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Spilled Records=282
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output bytes=1788012
> 14/03/20 02:42:14 INFO mapred.JobClient: Total committed heap usage 
> (bytes)=217214976
> 14/03/20 02:42:14 INFO mapred.JobClient: Combine input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: SPLIT_RAW_BYTES=163
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce input groups=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Combine output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output records=282
> 14/03/20 02:42:14 INFO driver.MahoutDriver: Program took 506269 ms (Minutes: 
> 8.4378167)
> {Code}





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2015-03-05 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349780#comment-14349780
 ] 

Andrew Palumbo commented on MAHOUT-1464:


We can close this, right?

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>  Labels: DSL, scala, spark
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 





[jira] [Updated] (MAHOUT-1464) Cooccurrence Analysis on Spark

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1464:
---
Labels: DSL scala spark  (was: )

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>  Labels: DSL, scala, spark
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1485) Clean up Recommender Overview page

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1485:
---
Labels: legacy  (was: )

> Clean up Recommender Overview page
> --
>
> Key: MAHOUT-1485
> URL: https://issues.apache.org/jira/browse/MAHOUT-1485
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>  Labels: legacy
> Fix For: 1.0
>
>
> Clean up the recommender overview page, remove outdated content and make sure 
> the examples work.
> https://mahout.apache.org/users/recommender/recommender-documentation.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1425) SGD classifier example with bank marketing dataset

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1425:
---
Labels: legacy  (was: )

> SGD classifier example with bank marketing dataset
> --
>
> Key: MAHOUT-1425
> URL: https://issues.apache.org/jira/browse/MAHOUT-1425
> Project: Mahout
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.0
>Reporter: Frank Scholten
>Assignee: Frank Scholten
>  Labels: legacy
> Fix For: 1.0
>
> Attachments: MAHOUT-1425.patch
>
>
> As discussed on the mailing list a few weeks back I started working on an SGD 
> classifier example with the bank marketing dataset from UCI: 
> http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
> See https://github.com/frankscholten/mahout-sgd-bank-marketing
> Ted has also made further changes that were very useful, so I suggest adding 
> this example to Mahout.
> Ted: can you tell a bit more about the log transforms? Some of them are just 
> Math.log while others are more complex expressions. 
> What else is needed to contribute it to Mahout? Anything that could improve 
> the example?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1462) Cleaning up Random Forests documentation on Mahout website

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1462:
---
Labels: legacy  (was: )

> Cleaning up Random Forests documentation on Mahout website
> --
>
> Key: MAHOUT-1462
> URL: https://issues.apache.org/jira/browse/MAHOUT-1462
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Manoj Awasthi
>  Labels: legacy
> Fix For: 1.0
>
>
> Following are the items which need to be added or changed. 
> I think this page can be broken into two segments. The first can be the following: 
> 
> Introduction to Random Forests
> Random Forests are an ensemble machine learning technique originally proposed 
> by Leo Breiman (UCB) which uses classification and regression trees as the 
> underlying classification mechanism. The trademark for Random Forests is 
> maintained by Leo Breiman and Adele Cutler. 
> Official website for Random Forests: 
> http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
> Original paper published: http://oz.berkeley.edu/~breiman/randomforest2001.pdf
> 
> The second section can be the following: 
> 
> Classifying with random forests with Mahout
> 
> This section can be what it is right now on the website.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1443) Update "How to release page"

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1443:
---
Labels: legacy scala  (was: )

> Update "How to release page"
> 
>
> Key: MAHOUT-1443
> URL: https://issues.apache.org/jira/browse/MAHOUT-1443
> Project: Mahout
>  Issue Type: Task
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Suneel Marthi
>  Labels: legacy, release, scala
> Fix For: 1.0
>
>
> I have a favor to ask. Could you have a look at the "How To Release" 
> page and tell me if the information there is still correct? I'm asking you 
> this because you have done the latest release. After your OK, I'll go 
> and improve formatting and readability of that page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1443) Update "How to release page"

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1443:
---
Labels: legacy release scala  (was: legacy scala)

> Update "How to release page"
> 
>
> Key: MAHOUT-1443
> URL: https://issues.apache.org/jira/browse/MAHOUT-1443
> Project: Mahout
>  Issue Type: Task
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Suneel Marthi
>  Labels: legacy, release, scala
> Fix For: 1.0
>
>
> I have a favor to ask. Could you have a look at the "How To Release" 
> page and tell me if the information there is still correct? I'm asking you 
> this because you have done the latest release. After your OK, I'll go 
> and improve formatting and readability of that page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1552) Avoid new Configuration() instantiation

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1552:
---
Labels: legacy  (was: )

> Avoid new Configuration() instantiation
> ---
>
> Key: MAHOUT-1552
> URL: https://issues.apache.org/jira/browse/MAHOUT-1552
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
> Environment: CDH 4.4, CDH 4.6
>Reporter: Sergey
>  Labels: legacy
> Fix For: 1.0
>
>
> Hi, it's related to MAHOUT-1498
> You get into trouble when running mahout stuff from an oozie java action.
> {code}
> java.lang.InterruptedException: Cluster Classification Driver Job failed 
> processing /tmp/sku/tfidf/90453
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)  
> {code}
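
For context, the usual pattern that avoids this class of failure is to reuse the Configuration injected by the runner instead of constructing a fresh one. A minimal sketch follows; the class below is illustrative, not the actual Mahout driver code:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

public abstract class ReuseConfSketch extends Configured implements Tool {
  protected Configuration jobConf() {
    // getConf() returns the Configuration injected by ToolRunner (and thus
    // by an oozie java action); calling new Configuration() here instead
    // would silently drop those externally supplied settings.
    return getConf();
  }
}
{code}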



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1551) Add document to describe how to use mlp with command line

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1551:
---
Labels: documentation legacy  (was: documentation)

> Add document to describe how to use mlp with command line
> -
>
> Key: MAHOUT-1551
> URL: https://issues.apache.org/jira/browse/MAHOUT-1551
> Project: Mahout
>  Issue Type: Documentation
>  Components: Classification, CLI, Documentation
>Affects Versions: 0.9
>Reporter: Yexi Jiang
>  Labels: documentation, legacy
> Fix For: 1.0
>
> Attachments: README.md, README.md
>
>
> Add documentation about the usage of multi-layer perceptron in command line.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1586) Downloads must have hashes

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1586:
---
Labels: legacy scala  (was: )

> Downloads must have hashes
> --
>
> Key: MAHOUT-1586
> URL: https://issues.apache.org/jira/browse/MAHOUT-1586
> Project: Mahout
>  Issue Type: Bug
>  Components: collections
>Affects Versions: 1.0
> Environment: http://www.apache.org/dist/mahout/mahout-collections-1.0/
> https://dist.apache.org/repos/dist/release/mahout/mahout-collections-1.0/
>Reporter: Sebb
>Assignee: Suneel Marthi
>  Labels: legacy, scala
>
> The download archives in 
> http://www.apache.org/dist/mahout/mahout-collections-1.0/ don't have any 
> hashes. These are required; please add either MD5 or SHA hashes (or both) to 
> https://dist.apache.org/repos/dist/release/mahout/mahout-collections-1.0/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1582) Create simpler row and column aggregation API at local level

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1582:
---
Labels: legacy math scala  (was: )

> Create simpler row and column aggregation API at local level
> 
>
> Key: MAHOUT-1582
> URL: https://issues.apache.org/jira/browse/MAHOUT-1582
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ted Dunning
>  Labels: legacy, math, scala
>
> The issue is that the current row and column aggregation API makes it 
> difficult to do anything but row by row aggregation using anonymous classes.  
> There is no scope for being aware of locality, nor to use the well known 
> function definitions in Functions.  This makes lots of optimizations 
> impossible and many of these are optimizations that we want to have.  An 
> example would be adding up absolute values of values.  With the current API, 
> it would be very hard to optimize for sparse matrices and the wrong direction 
> of iteration but with a different API, this should be easy.
> What I suggest is an API of this form:
> {code}
>Vector aggregateRows(DoubleDoubleFunction combiner, DoubleFunction mapper)
> {code}
> This will produce a vector with one element per row in the original.  The 
> nice thing here is that if the matrix is row major, we can iterate over rows 
> and accumulate a value for each row using sparsity as available.  On the 
> other hand, if the matrix is column major, we can keep a vector of 
> accumulators and still use sparsity as appropriate.  
> The use of sparsity comes in because the matrix code now has control over 
> both of the loops involved and also has visibility into properties of the map 
> and combine functions.  For instance, ABS(0) == 0 so if we combine with PLUS, 
> we can use a sparse iterator.
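
To make the contrast concrete, here is a minimal sketch of computing per-row sums of absolute values, first through today's anonymous-class route, then as the proposed two-argument call. The two-argument call is hypothetical; it is the signature suggested above, not an existing method:

{code:java}
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.Functions;
import org.apache.mahout.math.function.VectorFunction;

public class RowAggregationSketch {
  public static void main(String[] args) {
    Matrix m = new DenseMatrix(new double[][] {{1, -2}, {-3, 4}});

    // Today: row-by-row aggregation through an anonymous VectorFunction;
    // the matrix cannot exploit its orientation or sparsity.
    Vector rowAbsSums = m.aggregateRows(new VectorFunction() {
      @Override
      public double apply(Vector row) {
        return row.aggregate(Functions.PLUS, Functions.ABS);
      }
    });
    System.out.println(rowAbsSums); // row 0 -> 3.0, row 1 -> 7.0

    // Proposed (hypothetical): the matrix sees the named combiner/mapper
    // pair and can choose the iteration order and sparse shortcuts itself.
    // Vector proposed = m.aggregateRows(Functions.PLUS, Functions.ABS);
  }
}
{code}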



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1543) JSON output format for classifying with random forests

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1543:
---
Labels: legacy patch  (was: patch)

> JSON output format for classifying with random forests
> --
>
> Key: MAHOUT-1543
> URL: https://issues.apache.org/jira/browse/MAHOUT-1543
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: larryhu
>  Labels: legacy, patch
> Fix For: 0.7
>
> Attachments: MAHOUT-1543.patch
>
>
> This patch adds a JSON output format for building random forests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1598) extend seq2sparse to handle multiple text blocks of same document

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1598:
---
Labels: legacy  (was: )

> extend seq2sparse to handle multiple text blocks of same document
> -
>
> Key: MAHOUT-1598
> URL: https://issues.apache.org/jira/browse/MAHOUT-1598
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0, 0.9
>Reporter: Wolfgang Buchner
>  Labels: legacy
>
> Currently the seq2sparse, or in particular the 
> org.apache.mahout.vectorizer.DictionaryVectorizer, needs as input exactly one 
> text block per document.
> I stumbled on this because I have a use case where one document 
> represents a ticket which can have several text blocks in different 
> languages. 
> So my idea was that the org.apache.mahout.vectorizer.DocumentProcessor shall 
> tokenize each text block itself, so I can use language-specific features in 
> our Lucene Analyzer.
> Unfortunately the current implementation doesn't support this.
> But with just minor changes this can be made possible.
> The only thing which has to be changed would be the 
> org.apache.mahout.vectorizer.term.TFPartialVectorReducer, to handle all values 
> of the iterable (not just the first one), as sketched below.
> An alternative would be to change this Reducer to a Mapper; I don't understand 
> why this was implemented as a reducer in the first place. Is there any benefit 
> from this?
> I will provide a PR via github.
> Please have a look at this and tell me if I am assuming anything wrong.
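
A minimal sketch of the proposed reducer change, using simplified placeholder types rather than the real TFPartialVectorReducer signature:

{code:java}
import java.util.List;

public class AllBlocksSketch {
  // Before: only values.iterator().next() was consumed, dropping every
  // additional text block of the same document.
  static void reduce(String documentId, Iterable<List<String>> tokenizedBlocks) {
    for (List<String> block : tokenizedBlocks) { // consume every block
      for (String token : block) {
        accumulate(documentId, token);
      }
    }
  }

  static void accumulate(String documentId, String token) {
    // ... add the token's count to the document's partial term-frequency vector ...
  }
}
{code}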



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1620) how to use mahout command matrixmult ?

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1620:
---
Labels: legacy  (was: )

> how to use mahout command  matrixmult ?
> ---
>
> Key: MAHOUT-1620
> URL: https://issues.apache.org/jira/browse/MAHOUT-1620
> Project: Mahout
>  Issue Type: Question
>Reporter: zuozhibin
>  Labels: legacy
>
> Dear everyone:
> This is my problem. Firstly, I create two matrix files a and b:
> a = 1 3 2    b = 3 2
>     2 3 1        1 5
>     1 4 5        2 1
> Then I put them into the filesystem. 
> I use the mahout command "mahout seqdirectory" to change a and b into seq 
> files a1 and b1.
> Then I use the mahout command "mahout matrixmult" to compute a1*b1.
> It works, but the answer is 
> Input Path: hdfs://hacluster/user/zzb/c1/part-0
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
> org.apache.mahout.math.VectorWritable
> Count: 0
> Where am I wrong?
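
For reference, the expected product of the two matrices above (assuming they are read row-wise as reconstructed) is:

{code}
          | 1 3 2 |   | 3 2 |   | 10 19 |
a * b  =  | 2 3 1 | * | 1 5 | = | 11 20 |
          | 1 4 5 |   | 2 1 |   | 17 27 |
{code}

so a correct run should have produced a 3x2 result rather than an empty one.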



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1621) k-fold cross-validation in MapReduce Random Forest example?

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1621:
---
Labels: legacy  (was: )

> k-fold cross-validation in MapReduce Random Forest example?
> ---
>
> Key: MAHOUT-1621
> URL: https://issues.apache.org/jira/browse/MAHOUT-1621
> Project: Mahout
>  Issue Type: Question
>  Components: Examples
> Environment: Ubuntu Linux 14.04
>Reporter: Tawfiq Hasanin
>  Labels: legacy
> Fix For: 1.0
>
>
> My goal is to modify the MapReduce Random Forest example by combining 
> BuildForest.java and TestForest.java into a new class called RandomForest.java.
> The main point is to input one data file which is going to be used in 
> training and testing, with k-fold cross-validation. 
> I have big data with high-dimensional features and a small number of 
> instances. 
> It seems to be a frustrating dead end. Is this process achievable? Or is it 
> against the MapReduce nature? 
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1512) Hadoop 2 compatibility

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1512:
---
Labels: legacy scala  (was: )

> Hadoop 2 compatibility
> --
>
> Key: MAHOUT-1512
> URL: https://issues.apache.org/jira/browse/MAHOUT-1512
> Project: Mahout
>  Issue Type: Task
>Reporter: Sebastian Schelter
>Assignee: Suneel Marthi
>Priority: Critical
>  Labels: legacy, scala
> Fix For: 1.0
>
>
> We must ensure that all our MR code also runs on Hadoop 2. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1522) Handle logging levels via log4j.xml

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1522:
---
Labels: legacy scala  (was: )

> Handle logging levels via log4j.xml
> ---
>
> Key: MAHOUT-1522
> URL: https://issues.apache.org/jira/browse/MAHOUT-1522
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.9
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Critical
>  Labels: legacy, scala
> Fix For: 1.0
>
>
> We don't have a properties file to tell log4j what to do, so we inherit other 
> frameworks' settings.
> Suggestion is to add a log4j.xml file in a canonical place and set up logging 
> levels, maybe separating out components for ease of setting levels during 
> debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1640:
---
Labels: legacy math scala  (was: )

> Better collections would significantly improve vector-operation speed
> -
>
> Key: MAHOUT-1640
> URL: https://issues.apache.org/jira/browse/MAHOUT-1640
> Project: Mahout
>  Issue Type: Improvement
>  Components: collections
> Environment: Darwin lithium.local 14.1.0 Darwin Kernel Version 
> 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 
> x86_64 i386 MacBookPro10,1 Darwin
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
>Reporter: Sebastiano Vigna
>  Labels: legacy, math, scala
> Attachments: fastutil.patch, speed-fastutil, speed-std
>
>
> The collections currently used by Mahout to implement sparse vectors are 
> extremely slow. The proposed patch (localized to RandomAccessSparseVector) 
> uses fastutil's maps and the speed improvements in vector benchmarks are very 
> significant. It would be interesting to see whether these improvements 
> percolate to high-level classes using sparse vectors.
> I had to patch two unit tests (an off-by-one bug and an overfitting bug; both 
> were exposed by the different order in which key/values were returned by 
> iterators).
> The included files speed-std and speed-fastutil show the speed improvement. 
> Some more speed might be gained by using everywhere the standard 
> java.util.Map.Entry interface instead of Element.
> DISCLAIMER: The "Times" set of tests has been run multiplying two identical 
> vectors. The standard tests multiply two random vectors, so in fact they just 
> test the speed of the underlying map remove() method, as almost all products 
> are zero. This is not very realistic and was heavily penalizing fastutil's 
> "true deletions". Better tests, with a typical overlap of nonzero entries, 
> would be even more realistic.
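
For readers unfamiliar with fastutil, here is a small self-contained example of the primitive-keyed maps the patch builds on, computing a sparse dot product; the indices and values are made up for illustration:

{code:java}
import it.unimi.dsi.fastutil.ints.Int2DoubleMap;
import it.unimi.dsi.fastutil.ints.Int2DoubleOpenHashMap;

public class FastutilSparseSketch {
  public static void main(String[] args) {
    Int2DoubleOpenHashMap a = new Int2DoubleOpenHashMap();
    Int2DoubleOpenHashMap b = new Int2DoubleOpenHashMap();
    a.put(3, 2.0); a.put(7, 1.5);
    b.put(3, 4.0); b.put(9, 1.0);

    double dot = 0;
    // Primitive entries avoid boxing; get() returns 0.0 for absent keys.
    for (Int2DoubleMap.Entry e : a.int2DoubleEntrySet()) {
      dot += e.getDoubleValue() * b.get(e.getIntKey());
    }
    System.out.println(dot); // 8.0 -- only index 3 overlaps
  }
}
{code}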



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1641) Add conversion from a RDD[(String, String)] to a Drm[Int]

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1641:
---
Labels: DSL scala spark  (was: )

> Add conversion from a RDD[(String, String)] to a Drm[Int]
> -
>
> Key: MAHOUT-1641
> URL: https://issues.apache.org/jira/browse/MAHOUT-1641
> Project: Mahout
>  Issue Type: Question
>  Components: spark
>Affects Versions: 1.0
>Reporter: Erlend Hamnaberg
>  Labels: DSL, scala, spark
>
> Hi.
> We are using the cooccurrence part of mahout as a library. We get our data 
> from other sources, like for instance Cassandra. We don't want to write that 
> data to disk and read it back, since we already have the data on each slave.
> I have created some conversion functions based on one of the 
> IndexedDatasetSpark readers; I can't remember which one at the moment.
> Is there interest in the community for this kind of feature? I can probably 
> clean it up and add this as a github pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1632) Please help me im stuck on using 20 newsgroups example on Windows

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1632:
---
Labels: legacy  (was: )

> Please help me im stuck on using 20 newsgroups example on Windows
> -
>
> Key: MAHOUT-1632
> URL: https://issues.apache.org/jira/browse/MAHOUT-1632
> Project: Mahout
>  Issue Type: Question
>Reporter: Mishari SH
>  Labels: legacy
>
> Hello there, I've been using hadoop & mahout on my Windows OS. I started 
> the hadoop cluster before starting mahout, in order to use the cluster for 
> it; then I started mahout to test the 20newsgroups example, but it 
> throws a "not a valid DFS filename" exception, as shown below in detail from 
> the beginning:
> Microsoft Windows [Version 6.1.7601]
> Copyright (c) 2009 Microsoft Corporation.  All rights reserved.
> C:\Users\Admin>cd\
> C:\>cd mahout
> C:\mahout>cd examples
> C:\mahout\examples>cd bin
> C:\mahout\examples\bin>classify-20newsgroups.sh
> Welcome to Git (version 1.9.4-preview20140815)
> Run 'git help git' to display the help index.
> Run 'git help ' to display help for specific commands.
> Please select a number to choose the corresponding task to run
> 1. cnaivebayes
> 2. naivebayes
> 3. sgd
> 4. clean -- cleans up the work area in /tmp/mahout-work-
> Enter your choice : 2
> ok. You chose 2 and we'll use naivebayes
> creating work directory at /tmp/mahout-work-
> + echo 'Preparing 20newsgroups data'
> Preparing 20newsgroups data
> + rm -rf /tmp/mahout-work-/20news-all
> + mkdir /tmp/mahout-work-/20news-all
> + cp -R /tmp/mahout-work-/20news-bydate/20news-bydate-test/alt.atheism 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/comp.graphics 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/comp.os.ms-windows.misc 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/comp.sys.ibm.pc.hardware 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/comp.sys.mac.hardware 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/comp.windows.x 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/misc.forsale 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/rec.autos 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/rec.motorcycles 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/rec.sport.baseball 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/rec.sport.hockey 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/sci.crypt 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/sci.electronics 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/sci.med 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/sci.space 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/soc.religion.christian 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/talk.politics.guns 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/talk.politics.mideast 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/talk.politics.misc 
> /tmp/mahout-work-/20news-bydate/20news-bydate-test/talk.religion.misc 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/alt.atheism 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/comp.graphics 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/comp.os.ms-windows.misc 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/comp.sys.ibm.pc.hardware 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/comp.sys.mac.hardware 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/comp.windows.x 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/misc.forsale 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/rec.autos 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/rec.motorcycles 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/rec.sport.baseball 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/rec.sport.hockey 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/sci.crypt 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/sci.electronics 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/sci.med 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/sci.space 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/soc.religion.christian 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/talk.politics.guns 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/talk.politics.mideast 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/talk.politics.misc 
> /tmp/mahout-work-/20news-bydate/20news-bydate-train/talk.religion.misc 
> /tmp/mahout-work-/20news-all
> + '[' 'C:\hadp' '!=' '' ']'
> + '[' '' == '' ']'
> + echo 'Copying 20newsgroups data to HDFS'
> Copying 20newsgroups data to HDFS
> + set +e
> + 'C:\hadp/bin/hadoop' dfs -rmr /tmp/mahout-work-/20news-all
> /c/hadp/etc/hadoop/hadoop-env.sh: line 103: /c/hadp/bin: 

[jira] [Updated] (MAHOUT-1633) Failure to execute query when solr index contains documents with different fields

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1633:
---
Labels: LuceneSegmentRecordReader easyfix legacy lucene 
lucene2SeqConfiguration lucene2seq mahout  (was: LuceneSegmentRecordReader 
easyfix lucene lucene2SeqConfiguration lucene2seq mahout)

> Failure to execute query when solr index contains documents with different 
> fields
> -
>
> Key: MAHOUT-1633
> URL: https://issues.apache.org/jira/browse/MAHOUT-1633
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 1.0
> Environment: CentOS 6.5
>Reporter: Tom Lampert
>  Labels: LuceneSegmentRecordReader, easyfix, legacy, lucene, 
> lucene2SeqConfiguration, lucene2seq, mahout
> Fix For: 1.0
>
>
> When using Lucene2Seq on a lucene Index that contains documents that have 
> different fields, the following error is output:
> java.lang.IllegalArgumentException: Could not create query scorer for query: 
> tableName:code
>   at 
> org.apache.mahout.text.LuceneSegmentRecordReader.initialize(LuceneSegmentRecordReader.java:69)
>   at 
> org.apache.mahout.text.LuceneSegmentInputFormat.createRecordReader(LuceneSegmentInputFormat.java:76)
>   at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.(MapTask.java:492)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:735)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
> The query that is used executes fine in Solr on the same index. If the 
> index does not contain documents having different fields (from the same 
> source), the function executes without a problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1631) Streaming Series and DataFrames

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1631:
---
Labels: scala spark  (was: )

> Streaming Series and DataFrames
> ---
>
> Key: MAHOUT-1631
> URL: https://issues.apache.org/jira/browse/MAHOUT-1631
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Suminda Dharmasena
>  Labels: scala, spark
>
> For computation using live / streaming data, add the functionality to have 
> DataFrames to which new data gets appended as it becomes available, and which 
> can be used to process streaming data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1603) Tweaks for Spark 1.0.x

2015-03-05 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov resolved MAHOUT-1603.
--
Resolution: Fixed

> Tweaks for Spark 1.0.x 
> ---
>
> Key: MAHOUT-1603
> URL: https://issues.apache.org/jira/browse/MAHOUT-1603
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
>  Labels: DSL, scala, spark
> Fix For: 1.0
>
>
> Tweaks necessary to run the current codebase on top of spark 1.0.x



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1604) Create a RowSimilarity for Spark

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1604:
---
Labels: DSL scala spark  (was: )

> Create a RowSimilarity for Spark
> 
>
> Key: MAHOUT-1604
> URL: https://issues.apache.org/jira/browse/MAHOUT-1604
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 1.0
> Environment: Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>  Labels: DSL, scala, spark
>
> Using CooccurrenceAnalysis.cooccurrence create a driver that reads a text DRM 
> or two and produces LLR similarity/cross-similarity matrices.
> This will produce the same results as ItemSimilarity but take a Drm as input 
> instead of individual cells.
> The first version will only support LLR; other similarity measures will need 
> to be in separate Jiras.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x

2015-03-05 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349754#comment-14349754
 ] 

Andrew Palumbo commented on MAHOUT-1603:


This can be closed, right?

> Tweaks for Spark 1.0.x 
> ---
>
> Key: MAHOUT-1603
> URL: https://issues.apache.org/jira/browse/MAHOUT-1603
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
>  Labels: DSL, scala, spark
> Fix For: 1.0
>
>
> Tweaks necessary to run the current codebase on top of spark 1.0.x



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1603) Tweaks for Spark 1.0.x

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1603:
---
Labels: DSL scala spark  (was: )

> Tweaks for Spark 1.0.x 
> ---
>
> Key: MAHOUT-1603
> URL: https://issues.apache.org/jira/browse/MAHOUT-1603
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
>  Labels: DSL, scala, spark
> Fix For: 1.0
>
>
> Tweaks necessary to run the current codebase on top of spark 1.0.x



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1602) Euclidean Distance Similarity Math

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1602:
---
Labels: legacy  (was: )

> Euclidean Distance Similarity Math 
> ---
>
> Key: MAHOUT-1602
> URL: https://issues.apache.org/jira/browse/MAHOUT-1602
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering, Math
>Reporter: Leonardo Fernandez Sanchez
>  Labels: legacy
>
> Within the file:
> /mrlegacy/src/main/java/org/apache/mahout/cf/taste/impl/similarity/EuclideanDistanceSimilarity.java
> A comment mentions that the implementation should be sqrt(n) / (1 + distance).
> Rearranged, that is: 
> 1 / ((1 + distance) / sqrt(n))
> Coded:
> return 1.0 / ((1.0 + Math.sqrt(sumXYdiff2)) / Math.sqrt(n));
> But what is implemented instead (missing grouping brackets) is: 
> 1 / (1 + distance / sqrt(n))
> Coded:
> return 1.0 / (1.0 + Math.sqrt(sumXYdiff2) / Math.sqrt(n));
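
A quick numeric check, with hypothetical values sumXYdiff2 = 4 and n = 3, confirms the two groupings really do disagree:

{code:java}
public class GroupingCheck {
  public static void main(String[] args) {
    double sumXYdiff2 = 4.0; // hypothetical squared distance, so distance = 2
    double n = 3.0;          // hypothetical overlap count

    // Intended: sqrt(n) / (1 + distance)
    double intended = 1.0 / ((1.0 + Math.sqrt(sumXYdiff2)) / Math.sqrt(n));
    // Currently coded: 1 / (1 + distance / sqrt(n))
    double current = 1.0 / (1.0 + Math.sqrt(sumXYdiff2) / Math.sqrt(n));

    System.out.println(intended); // ~0.577
    System.out.println(current);  // ~0.464
  }
}
{code}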



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1589) mahout.cmd has duplicated content

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1589:
---
Labels: legacy scala  (was: )

> mahout.cmd has duplicated content
> -
>
> Key: MAHOUT-1589
> URL: https://issues.apache.org/jira/browse/MAHOUT-1589
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 1.0
> Environment: Windows
>Reporter: Venkat Ranganathan
>  Labels: legacy, scala
> Fix For: 1.0
>
> Attachments: MAHOUT-1589.patch
>
>
> bin/mahout.cmd has duplicated contents. It needs to be trimmed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1590) mahout unit test failures due to guava version conflict on hadoop 2

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1590:
---
Labels: DSL scala spark  (was: )

> mahout unit test failures due to guava version conflict on hadoop 2
> ---
>
> Key: MAHOUT-1590
> URL: https://issues.apache.org/jira/browse/MAHOUT-1590
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 1.0
> Environment: Hadoop 2.x 
>Reporter: Venkat Ranganathan
>  Labels: DSL, scala, spark
> Fix For: 1.0
>
> Attachments: MAHOUT-1590.patch
>
>
> Running 
> mvn clean test -Dhadoop2.version=2.4.0 
> produces many unit test failures because of a guava version mismatch.   
> For example:
> ==
> completeJobToyExample(org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJobTest)
>   Time elapsed: 0.736 sec  <<< ERROR!
> java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
> at 
> __randomizedtesting.SeedInfo.seed([BEBBF9ACD237F984:B570D1523391FD4E]:0)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:278)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:375)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:493)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:510)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
> at 
> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob.run(ParallelALSFactorizationJob.java:172)
> at 
> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJobTest.explicitExample(ParallelALSFactorizationJobTest.java:112)
> at 
> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJobTest.completeJobToyExample(ParallelALSFactorizationJobTest.java:71)
> =
> hadoop mapreduce V2 uses guava v11.0.2 and mahout uses guava version 
> 16.0.
> After trying different versions, guava 14.0 seems to have hadoop MR V2 
> compatible jars and the classes mahout needs. 
> The unit tests ran successfully after changing the dependency in mahout to 
> v14.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1578) Optimizations in matrix serialization

2015-03-05 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349748#comment-14349748
 ] 

Andrew Palumbo commented on MAHOUT-1578:


This can be closed, right?

> Optimizations in matrix serialization
> -
>
> Key: MAHOUT-1578
> URL: https://issues.apache.org/jira/browse/MAHOUT-1578
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Reporter: Sebastian Schelter
>  Labels: legacy
> Fix For: 1.0
>
>
> MatrixWritable contains inefficient code in a few places:
>  
>  * type and size are stored with every vector, although they are the same for 
> every vector
>  * in some places vectors are added to the matrix via assign() in places 
> where we could directly use the instance
>  
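
As an illustration of the first point, a header-once layout might look like the following sketch; this is not Mahout's actual MatrixWritable format:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class HeaderOnceSketch {
  static void write(DataOutput out, double[][] rows) throws IOException {
    out.writeInt(rows.length);    // row count, written once
    out.writeInt(rows[0].length); // vector size, written once for all rows
    for (double[] row : rows) {
      for (double v : row) {
        out.writeDouble(v);       // per-row payload only, no repeated type/size
      }
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    write(new DataOutputStream(bytes), new double[][] {{1, 2}, {3, 4}});
    System.out.println(bytes.size() + " bytes"); // 2 ints + 4 doubles = 40 bytes
  }
}
{code}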



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1578) Optimizations in matrix serialization

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1578:
---
Labels: legacy  (was: )

> Optimizations in matrix serialization
> -
>
> Key: MAHOUT-1578
> URL: https://issues.apache.org/jira/browse/MAHOUT-1578
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Reporter: Sebastian Schelter
>  Labels: legacy
> Fix For: 1.0
>
>
> MatrixWritable contains inefficient code in a few places:
>  
>  * type and size are stored with every vector, although they are the same for 
> every vector
>  * in some places vectors are added to the matrix via assign() in places 
> where we could directly use the instance
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1636) Class dependencies for the spark module are put in a job.jar, which is very inefficient

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1636:
---
Labels: DSL scala spark  (was: )

> Class dependencies for the spark module are put in a job.jar, which is very 
> inefficient
> ---
>
> Key: MAHOUT-1636
> URL: https://issues.apache.org/jira/browse/MAHOUT-1636
> Project: Mahout
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 1.0-snapshot
>Reporter: Pat Ferrel
>Assignee: Ted Dunning
>  Labels: DSL, scala, spark
> Fix For: 1.0-snapshot
>
>
> Using a maven plugin and an assembly job.xml, a job.jar is created with all 
> dependencies, including transitive ones. This job.jar is in 
> mahout/spark/target and is included in the classpath when a Spark job is run. 
> This allows dependency classes to be found at runtime, but the job.jar includes 
> a great deal of unneeded things that are duplicates of classes found in the 
> main mrlegacy job.jar. If the job.jar is removed, drivers will not find 
> needed classes. A better way needs to be implemented for including class 
> dependencies.
> I'm not sure what that better way is, so I am leaving the assembly alone for 
> now. Whoever picks up this Jira will have to remove it after deciding on a 
> better method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1642) Iterator class within SimilarItems class always misses the first element

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1642:
---
Labels: legacy  (was: )

> Iterator class within SimilarItems class always misses the first element
> 
>
> Key: MAHOUT-1642
> URL: https://issues.apache.org/jira/browse/MAHOUT-1642
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Reporter: Guohua Hao
>  Labels: legacy
>
> In the next() function of the SimilarItemsIterator class within the 
> SimilarItems class, the variable 'index' is incremented before returning the 
> element at that position, so the first element is always missed when 
> iterating.
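
A self-contained sketch of the described off-by-one; the array-backed iterator below is a stand-in, not the actual SimilarItemsIterator code:

{code:java}
import java.util.Iterator;

public class OffByOneSketch {
  public static void main(String[] args) {
    final String[] items = {"a", "b", "c"};
    Iterator<String> it = new Iterator<String>() {
      private int index = 0;
      @Override public boolean hasNext() { return index < items.length - 1; }
      @Override public String next() {
        index++;             // incremented before the read...
        return items[index]; // ...so items[0] is never returned
      }
      @Override public void remove() { throw new UnsupportedOperationException(); }
    };
    while (it.hasNext()) {
      System.out.println(it.next()); // prints b, c -- "a" is skipped
    }
    // The usual fix is to read first, then advance: return items[index++];
  }
}
{code}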



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1638) H2O bindings fail at drmParallelizeWithRowLabels(...)

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1638:
---
Labels: DSL h2o scala  (was: DSL scala)

> H2O bindings fail at drmParallelizeWithRowLabels(...)
> -
>
> Key: MAHOUT-1638
> URL: https://issues.apache.org/jira/browse/MAHOUT-1638
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: Andrew Palumbo
>  Labels: DSL, h2o, scala
> Fix For: 1.0
>
>
> The H2OHelper.drmFromMatrix(...) function fails when trying to write row 
> label String keys to a water.fvec.Vec.:
> {code:java}
>  java.lang.IllegalArgumentException: Not a String
>   at water.fvec.Chunk.set_impl(Chunk.java:507)
>   at water.fvec.Chunk.set0(Chunk.java:469)
>   at water.fvec.Chunk.set(Chunk.java:371)
>   at water.fvec.Vec$Writer.set(Vec.java:803)
>   at org.apache.mahout.h2obindings.H2OHelper.drmFromMatrix(H2OHelper.java:331)
>   at 
> org.apache.mahout.h2obindings.H2OEngine$.drmParallelizeWithRowLabels(H2OEngine.scala:83)
>
>   at 
> org.apache.mahout.math.drm.package$.drmParallelizeWithRowLabels(package.scala:67)
> {code} 
> This causes an exception when calling drm.drmParallelizeWithRowLabels(...)
> To reproduce, apply [PR#72: Enable Naive Bayes Tests in h2o 
> Module|https://github.com/apache/mahout/pull/72] and run:
> {code} $ mvn test 
> {code}
> from the h2o module:
> {code:java}
> - NB Aggregator *** FAILED ***
>   java.lang.IllegalArgumentException: Not a String
>   at water.fvec.Chunk.set_impl(Chunk.java:507)
>   at water.fvec.Chunk.set0(Chunk.java:469)
>   at water.fvec.Chunk.set(Chunk.java:371)
>   at water.fvec.Vec$Writer.set(Vec.java:803)
>   at org.apache.mahout.h2obindings.H2OHelper.drmFromMatrix(H2OHelper.java:331)
>   at 
> org.apache.mahout.h2obindings.H2OEngine$.drmParallelizeWithRowLabels(H2OEngine.scala:83)
>
>   at 
> org.apache.mahout.math.drm.package$.drmParallelizeWithRowLabels(package.scala:67)
>   
>   at 
> org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply$mcV$sp(NBTestBase.scala:91)
> 
>   at 
> org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply(NBTestBase.scala:70)
>
>   at 
> org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply(NBTestBase.scala:70)
>
>   ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1643) CLI arguments are not being processed in spark-shell

2015-03-05 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349739#comment-14349739
 ] 

Andrew Palumbo commented on MAHOUT-1643:


for future reference:
{quote}
note that with MAHOUT_OPTS you have a choice. You can either set up
env or you can use inline syntax like

MAHOUT_OPTS='-Dk=n' bin/mahout spark-shell
{quote}

> CLI arguments are not being processed in spark-shell
> 
>
> Key: MAHOUT-1643
> URL: https://issues.apache.org/jira/browse/MAHOUT-1643
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI, spark
>Affects Versions: 1.0
> Environment: spark spark-shell
>Reporter: Andrew Palumbo
>  Labels: DSL, scala, spark, spark-shell
> Fix For: 1.0
>
>
> The CLI arguments are not being processed in spark-shell.  Most importantly 
> the spark options are not being passed to the spark configuration via:
> {code}
> $ mahout spark-shell -D:k=n
> {code}
> The arguments are preserved through {code}$ bin/mahout{code}. There should 
> be a relatively easy fix, either by using the MahoutOptionParser, Scopt, or by 
> simply parsing the args array. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1643) CLI arguments are not being processed in spark-shell

2015-03-05 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349710#comment-14349710
 ] 

Andrew Palumbo commented on MAHOUT-1643:


Ok, I'll leave this open.  It seems that either way we need to address the 
issue of tuning the spark configuration for spark-shell, either via CLI 
options or MAHOUT_OPTS or both. 

> CLI arguments are not being processed in spark-shell
> 
>
> Key: MAHOUT-1643
> URL: https://issues.apache.org/jira/browse/MAHOUT-1643
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI, spark
>Affects Versions: 1.0
> Environment: spark spark-shell
>Reporter: Andrew Palumbo
>  Labels: DSL, scala, spark, spark-shell
> Fix For: 1.0
>
>
> The CLI arguments are not being processed in spark-shell.  Most importantly 
> the spark options are not being passed to the spark configuration via:
> {code}
> $ mahout spark-shell -D:k=n
> {code}
> The arguments are preserved through {code}$ bin/mahout{code}. There should 
> be a relatively easy fix, either by using the MahoutOptionParser, Scopt, or by 
> simply parsing the args array. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [jira] [Comment Edited] (MAHOUT-1643) CLI arguments are not being processed in spark-shell

2015-03-05 Thread Dmitriy Lyubimov
note that with MAHOUT_OPTS you have a choice. You can either set up
env or you can use inline syntax like

MAHOUT_OPTS='-Dk=n' bin/mahout spark-shell

On Thu, Mar 5, 2015 at 4:50 PM, Andrew Palumbo (JIRA)  wrote:
>
> [ 
> https://issues.apache.org/jira/browse/MAHOUT-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349655#comment-14349655
>  ]
>
> Andrew Palumbo edited comment on MAHOUT-1643 at 3/6/15 12:50 AM:
> -
>
> Yeah- talking about the shell.  Do we want to process CLI args for the spark 
> configuration here ie:
>
> {code}
> $bin/mahout spark-shell -D:k=n
> {code}
>
> or should i just close this and we'll just go off of MAHOUT_OPTS?
>
>
>
> was (Author: andrew_palumbo):
> Yeah- talking about the shell.  Do we want to process CLI args here ie:
>
> {code}
> $bin/mahout spark-shell -D:k=n
> {code}
>
> or should i just close this and we'll just go off of MAHOUT_OPTS?
>
>
>> CLI arguments are not being processed in spark-shell
>> 
>>
>> Key: MAHOUT-1643
>> URL: https://issues.apache.org/jira/browse/MAHOUT-1643
>> Project: Mahout
>>  Issue Type: Bug
>>  Components: CLI, spark
>>Affects Versions: 1.0
>> Environment: spark spark-shell
>>Reporter: Andrew Palumbo
>>  Labels: DSL, scala, spark, spark-shell
>> Fix For: 1.0
>>
>>
>> The CLI arguments are not being processed in spark-shell.  Most importantly 
>> the spark options are not being passed to the spark configuration via:
>> {code}
>> $ mahout spark-shell -D:k=n
>> {code}
>> The arguments are preserved through {code}$ bin/mahout{code}. There should 
>> be a relatively easy fix, either by using the MahoutOptionParser, Scopt, or by 
>> simply parsing the args array.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)


[jira] [Comment Edited] (MAHOUT-1643) CLI arguments are not being processed in spark-shell

2015-03-05 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349655#comment-14349655
 ] 

Andrew Palumbo edited comment on MAHOUT-1643 at 3/6/15 12:50 AM:
-

Yeah- talking about the shell.  Do we want to process CLI args for the spark 
configuration here ie:

{code}
$bin/mahout spark-shell -D:k=n
{code}

or should i just close this and we'll just go off of MAHOUT_OPTS?



was (Author: andrew_palumbo):
Yeah- talking about the shell.  Do we want to process CLI args here ie:

{code}
$bin/mahout spark-shell -D:k=n
{code}

or should i just close this and we'll just go off of MAHOUT_OPTS?


> CLI arguments are not being processed in spark-shell
> 
>
> Key: MAHOUT-1643
> URL: https://issues.apache.org/jira/browse/MAHOUT-1643
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI, spark
>Affects Versions: 1.0
> Environment: spark spark-shell
>Reporter: Andrew Palumbo
>  Labels: DSL, scala, spark, spark-shell
> Fix For: 1.0
>
>
> The CLI arguments are not being processed in spark-shell.  Most importantly 
> the spark options are not being passed to the spark configuration via:
> {code}
> $ mahout spark-shell -D:k=n
> {code}
> The arguments are preserved through {code}$ bin/mahout{code}. There should 
> be a relatively easy fix, either by using the MahoutOptionParser, Scopt, or by 
> simply parsing the args array. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1643) CLI arguments are not being processed in spark-shell

2015-03-05 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349695#comment-14349695
 ] 

Pat Ferrel commented on MAHOUT-1643:


I used the spark shell methods to change the master to my cluster, for 
instance. This should work in the Mahout shell too and is documented. Might be 
nice to support a few special ones like MASTER or -D:k=v, but it isn't strictly 
required. 

> CLI arguments are not being processed in spark-shell
> 
>
> Key: MAHOUT-1643
> URL: https://issues.apache.org/jira/browse/MAHOUT-1643
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI, spark
>Affects Versions: 1.0
> Environment: spark spark-shell
>Reporter: Andrew Palumbo
>  Labels: DSL, scala, spark, spark-shell
> Fix For: 1.0
>
>
> The CLI arguments are not being processed in spark-shell.  Most importantly 
> the spark options are not being passed to the spark configuration via:
> {code}
> $ mahout spark-shell -D:k=n
> {code}
> The arguments are preserved through {code}$ bin/mahout{code}. There should 
> be a relatively easy fix, either by using the MahoutOptionParser, Scopt, or by 
> simply parsing the args array. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1639) streamingkmeans doesn't properly validate estimatedNumMapClusters -km

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1639:
---
Labels: legacy  (was: )

> streamingkmeans doesn't properly validate estimatedNumMapClusters -km
> -
>
> Key: MAHOUT-1639
> URL: https://issues.apache.org/jira/browse/MAHOUT-1639
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 0.9
>Reporter: Peter Sergeant
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: legacy
> Fix For: 1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The value of -km isn't checked by the CLI, which means if you don't specify 
> it, you get the rather cryptic:
> {noformat}
> Exception in thread "main" java.lang.NumberFormatException: null
>   at java.lang.Integer.parseInt(Integer.java:454)
>   at java.lang.Integer.parseInt(Integer.java:527)
>   at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.configureOptionsForWorkers(StreamingKMeansDriver.java:252)
>   at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:239)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>   at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> {noformat}
> Other required parameters give helpful error messages when they are omitted.
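
A minimal sketch of the kind of guard being asked for, failing with a readable message instead of the cryptic NumberFormatException; the method and its surroundings are hypothetical, not the actual StreamingKMeansDriver code:

{code:java}
public class KmGuardSketch {
  /** Validate and parse the -km / --estimatedNumMapClusters value. */
  static int parseEstimatedNumMapClusters(String rawValue) {
    if (rawValue == null || rawValue.isEmpty()) {
      throw new IllegalArgumentException(
          "Missing required option -km / --estimatedNumMapClusters");
    }
    return Integer.parseInt(rawValue);
  }

  public static void main(String[] args) {
    System.out.println(parseEstimatedNumMapClusters("100")); // 100
    parseEstimatedNumMapClusters(null); // throws with a helpful message
  }
}
{code}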



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1278) Improve inheritance of Apache parent pom

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1278:
---
Labels: legacy scala  (was: )

> Improve inheritance of Apache parent pom
> 
>
> Key: MAHOUT-1278
> URL: https://issues.apache.org/jira/browse/MAHOUT-1278
> Project: Mahout
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.8
>Reporter: Stevo Slavic
>Assignee: Stevo Slavic
>Priority: Minor
>  Labels: legacy, scala
> Fix For: 1.0
>
>
> We should update dependency on Apache parent pom (currently we depend on 
> version 9, while 13 is already released).
> With the upgrade we should make the most of inherited settings and plugin 
> versions from Apache parent pom, so we override only what is necessary, to 
> make Mahout POMs smaller and easier to maintain.
> Hopefully by the time this issue gets worked on, 
> maven-remote-resources-plugin with 
> [MRRESOURCES-53|http://jira.codehaus.org/browse/MRRESOURCES-53] fix will be 
> released (since we're affected by it - test jars are being resolved from 
> remote repository instead of from the current build / reactor repository), and 
> updated Apache parent pom released.
> Implementation note: Mahout parent module and mahout-buildtools module both 
> use Apache parent pom as parent, so both need to be updated. 
> mahout-buildtools module had to be separate from the mahout parent pom (not 
> inheriting it), so that buildtools module can be referenced as dependency of 
> various source quality check plugins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1277) Lose dependency on custom commons-cli

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1277:
---
Labels: legacy scala  (was: )

> Lose dependency on custom commons-cli
> -
>
> Key: MAHOUT-1277
> URL: https://issues.apache.org/jira/browse/MAHOUT-1277
> Project: Mahout
>  Issue Type: Improvement
>  Components: build, CLI
>Affects Versions: 0.8
>Reporter: Stevo Slavic
>Assignee: Stevo Slavic
>Priority: Minor
>  Labels: legacy, scala
> Fix For: 1.0
>
>
> In 0.8 we have a dependency on a custom commons-cli fork, 
> org.apache.mahout.commons:commons-cli. There are no sources for this under 
> Mahout version control. It's a risk to keep this as a dependency.
> We should either use officially released and maintained commons-cli version, 
> or if it's not sufficient for Mahout project needs, replace it completely 
> with something else (e.g. like [JCommander|http://jcommander.org/]).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1563) Clean up WARNINGs during build

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1563:
---
Labels: DSL scala  (was: )

> Clean up WARNINGs during build
> --
>
> Key: MAHOUT-1563
> URL: https://issues.apache.org/jira/browse/MAHOUT-1563
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0
>Reporter: Andrew Musselman
>Priority: Minor
>  Labels: DSL, scala
> Fix For: 0.9
>
>
> We need to clean up warnings in the maven logs.  They seem to have piled up 
> recently; some are about scala lib version conflicts, some are about 
> deprecated APIs, some are about code style.
> Some may be fine for now but extra warnings in build logs feels like bad 
> hygiene to me.
> Some examples:
> [WARNING]  Expected all dependencies to require Scala version: 2.10.3
> [WARNING]  com.twitter:chill_2.10:0.3.1 requires scala version: 2.10.0
> [WARNING] Multiple versions of scala libraries detected!
> [WARNING] 
> /home/akm/mahout/spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmBase.scala:73:
>  warning: a pure expression does nothing in statement position; you may be 
> omitting necessary parentheses
> [INFO] this
> [WARNING]  Expected all dependencies to require Scala version: 2.10.3
> [WARNING]  org.apache.mahout:mahout-math-scala:1.0-SNAPSHOT requires scala 
> version: 2.10.3
> [WARNING]  org.scalatest:scalatest_2.10:2.0 requires scala version: 2.10.0
> [WARNING] Multiple versions of scala libraries detected!
> [WARNING] 
> /home/akm/mahout/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala:132:
>  warning: non-variable type argument Double in type pattern Iterable[Double] 
> is unchecked since it is eliminated by erasure
> [INFO] case t: Iterable[Double] => t.toArray
> [WARNING] 
> /home/akm/mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java:
>  Some input files use or override a deprecated API.
> [WARNING] 
> /home/akm/mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java:
>  Recompile with -Xlint:deprecation for details.
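
The type-erasure warning above has a standard fix: match on the unparameterized
type and convert explicitly. A sketch, not the actual Mahout code (it assumes
the elements really are doubles):

{code}
// "case t: Iterable[Double]" is unchecked because Double is erased at runtime.
def toDoubleArray(x: Any): Array[Double] = x match {
  case t: Iterable[_]   => t.map(_.asInstanceOf[Double]).toArray
  case a: Array[Double] => a
  case _ => throw new IllegalArgumentException(s"cannot convert: $x")
}
{code}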



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1562) Publish Scaladocs

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1562:
---
Labels: DSL scala scaladocs spark spark-shell  (was: scaladocs)

> Publish Scaladocs
> -
>
> Key: MAHOUT-1562
> URL: https://issues.apache.org/jira/browse/MAHOUT-1562
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Priority: Minor
>  Labels: DSL, scala, scaladocs, spark, spark-shell
> Fix For: 1.0
>
>
> The poms that relate to Scala have the maven-scala-plugin, which will 
> generate Scaladocs if you navigate to the scala subproject and run
> mvn scala:doc
> They will appear in target/site/scaladocs for the subproject.
> This is not supported from the parent mahout pom.
> This should be incorporated into the process that publishes the javadocs. It
> appears the Scala code is not publicly browsable as Scaladocs. Not sure where
> this process lives, but the scala stuff should probably follow the same
> pattern/process as the javadocs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1624) Compilation errors when changing Lucene version to 4.10.1

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1624:
---
Labels: legacy lucene  (was: )

> Compilation errors when changing Lucene version to 4.10.1
> -
>
> Key: MAHOUT-1624
> URL: https://issues.apache.org/jira/browse/MAHOUT-1624
> Project: Mahout
>  Issue Type: Bug
>  Components: Integration
>Affects Versions: 1.0
> Environment: CentOS 6.5
>Reporter: Tom Lampert
>Priority: Minor
>  Labels: legacy, lucene
> Fix For: 1.0
>
>
> When changing the Lucene version to 4_10_1 in the code and 4.10.1 in pom.xml,
> the following compile errors (and warnings) are observed:
> [WARNING] COMPILATION WARNING : 
> [INFO] -
> [WARNING] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:
>  Some input files use or override a deprecated API.
> [WARNING] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:
>  Recompile with -Xlint:deprecation for details.
> [WARNING] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/mongodb/MongoDBDataModel.java:
>  Some input files use unchecked or unsafe operations.
> [WARNING] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/mongodb/MongoDBDataModel.java:
>  Recompile with -Xlint:unchecked for details.
> [INFO] 4 warnings 
> [INFO] -
> [INFO] -
> [ERROR] COMPILATION ERROR : 
> [INFO] -
> [ERROR] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java:[29,31]
>  cannot find symbol
>   symbol:   class BufferedIndexOutput
>   location: package org.apache.lucene.store
> [ERROR] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java:[305,47]
>  cannot find symbol
>   symbol:   class BufferedIndexOutput
>   location: class org.apache.mahout.text.ReadOnlyFileSystemDirectory
> [ERROR] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java:[163,12]
>  incompatible types
>   required: org.apache.lucene.store.IndexOutput
>   found:
> org.apache.mahout.text.ReadOnlyFileSystemDirectory.FileSystemIndexOutput
> [ERROR] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java:[178,23]
>   is not 
> abstract and does not override abstract method close() in 
> org.apache.lucene.store.Lock
> [ERROR] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java:[319,5]
>  method does not override or implement a method from a supertype
> [ERROR] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java:[327,14]
>  abstract method close() in org.apache.lucene.store.Directory cannot be 
> accessed directly
> [ERROR] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java:[324,5]
>  method does not override or implement a method from a supertype
> [ERROR] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java:[335,5]
>  method does not override or implement a method from a supertype
> [ERROR] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java:[340,5]
>  method does not override or implement a method from a supertype
> [ERROR] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/text/ReadOnlyFileSystemDirectory.java:[345,5]
>  method does not override or implement a method from a supertype
> [ERROR] 
> /mahout2/trunk/integration/src/main/java/org/apache/mahout/text/LuceneSegmentRecordReader.java:[67,20]
>  method scorer in class org.apache.lucene.search.Weight cannot be applied to 
> given types;
>   required: 
> org.apache.lucene.index.AtomicReaderContext,org.apache.lucene.util.Bits
>   found: 
> org.apache.lucene.index.AtomicReaderContext,boolean,boolean,
>   reason: actual and formal argument lists differ in length
> [INFO] 11 errors 
> [INFO] -
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Mahout Build Tools  SUCCESS [  1.808 s]
> [INFO] Apache Mahout . SUCCESS [  0.437 s]
> [INFO] Mahout Math ... SUCCESS [ 11.337 s]
> [INFO] Mahout MapReduce 

[jira] [Updated] (MAHOUT-1557) Add support for sparse training vectors in MLP

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1557:
---
Labels: legacy mlp  (was: mlp)

> Add support for sparse training vectors in MLP
> --
>
> Key: MAHOUT-1557
> URL: https://issues.apache.org/jira/browse/MAHOUT-1557
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Reporter: Karol Grzegorczyk
>Priority: Minor
>  Labels: legacy, mlp
> Fix For: 1.0
>
> Attachments: mlp_sparse.diff
>
>
> When the number of input units of an MLP is large, the input vectors are
> likely to be sparse. It should be possible to read input files in a sparse
> format.
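
A minimal sketch of the idea against Mahout's math API; the "label idx:value ..."
input format is an assumption for illustration, not necessarily what the
attached patch uses:

{code}
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

// Parse one "label idx:value idx:value ..." line into a sparse Mahout vector.
def parseSparseLine(line: String, cardinality: Int): (Int, Vector) = {
  val fields = line.trim.split("\\s+")
  val v: Vector = new RandomAccessSparseVector(cardinality)
  fields.tail.foreach { f =>
    val Array(idx, value) = f.split(":")
    v.setQuick(idx.toInt, value.toDouble) // only nonzeros are stored
  }
  (fields.head.toInt, v)
}
{code}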



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1470) Topic dump

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1470:
---
Labels: legacy  (was: )

> Topic dump
> --
>
> Key: MAHOUT-1470
> URL: https://issues.apache.org/jira/browse/MAHOUT-1470
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
>  Labels: legacy
> Fix For: 1.0
>
>
> Per 
> http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCAMc_qaL2DCgbVbam2miNsLpa4qvaA9sMy1-arccF9Nz6ApcsvQ%40mail.gmail.com%3E
> > The script needs to be corrected to not call vectordump for LDA as
> > vectordump utility (or even clusterdump) are presently not capable of
> > displaying topics and relevant documents. I recall this issue was
> > previously reported by Peyman Faratin post 0.9 release.
> >
> > Mahout's missing a clusterdump utility that reads in LDA
> > topics, Document - DocumentId mapping and displays a report of the topics
> > and the documents that belong to a topic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1516) run classify-20newsgroups.sh fails because /tmp/mahout-work-jpan/20news-all does not exist in HDFS.

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1516:
---
Labels: legacy patch  (was: patch)

> run classify-20newsgroups.sh fails because /tmp/mahout-work-jpan/20news-all
> does not exist in HDFS.
> --
>
> Key: MAHOUT-1516
> URL: https://issues.apache.org/jira/browse/MAHOUT-1516
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 0.9
> Environment: hadoop2.2.0 mahout0.9 ubuntu12.04 
>Reporter: Jian Pan
>Priority: Minor
>  Labels: legacy, patch
> Fix For: 1.0
>
>
> + echo 'Copying 20newsgroups data to HDFS'
> Copying 20newsgroups data to HDFS
> + set +e
> + /home/jpan/Software/hadoop-2.2.0/bin/hadoop dfs -rmr 
> /tmp/mahout-work-jpan/20news-all
> DEPRECATED: Use of this script to execute hdfs command is deprecated.
> Instead use the hdfs command for it.
> rmr: DEPRECATED: Please use 'rm -r' instead.
> 14/04/17 10:26:25 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> rmr: `/tmp/mahout-work-jpan/20news-all': No such file or directory
> + set -e
> + /home/jpan/Software/hadoop-2.2.0/bin/hadoop dfs -put 
> /tmp/mahout-work-jpan/20news-all /tmp/mahout-work-jpan/20news-all
> DEPRECATED: Use of this script to execute hdfs command is deprecated.
> Instead use the hdfs command for it.
> 14/04/17 10:26:26 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> put: `/tmp/mahout-work-jpan/20news-all': No such file or directory



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1579:
---
Labels: legacy  (was: )

> Implement a datamodel which can load data from hadoop filesystem directly
> -
>
> Key: MAHOUT-1579
> URL: https://issues.apache.org/jira/browse/MAHOUT-1579
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Xiaomeng Huang
>Priority: Minor
>  Labels: legacy
> Attachments: Mahout-1579.000.patch, Mahout-1579.001.patch
>
>
> As we all know, FileDataModel can only load data from the local filesystem,
> but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
> If we want to deal with the data in HDFS, we must run a mapred job.
> It's necessary to implement a data model which can load data from a Hadoop
> filesystem directly.
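
The core of the idea is to read through Hadoop's FileSystem API instead of
java.io; a sketch (the helper name is hypothetical):

{code}
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Open a preference file directly from HDFS; a DataModel could wrap this reader.
def openHdfsFile(uri: String): BufferedReader = {
  val fs = FileSystem.get(new java.net.URI(uri), new Configuration())
  new BufferedReader(new InputStreamReader(fs.open(new Path(uri)), "UTF-8"))
}
{code}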



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MAHOUT-1618) Cooccurrence Recommender example and documentation

2015-03-05 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel reassigned MAHOUT-1618:
--

Assignee: Pat Ferrel

> Cooccurrence Recommender example and documentation 
> ---
>
> Key: MAHOUT-1618
> URL: https://issues.apache.org/jira/browse/MAHOUT-1618
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
>Affects Versions: collections-1.0
>Reporter: Thejas Prasad
>Assignee: Pat Ferrel
>Priority: Trivial
>  Labels: DSL, scala, spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [jira] [Commented] (MAHOUT-1643) CLI arguments are not being processed in spark-shell

2015-03-05 Thread Dmitriy Lyubimov
the hack i have only takes MAHOUT_OPTS.

it normally makes more sense to set them there since spark
options are too numerous and too long to enter on the command line.

so i'd say we need to support MAHOUT_OPTS at minimum; or both.
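
for reference, parsing the -D: options out of the args array (the other half of
the ticket) could look roughly like this - a sketch, not the committed fix:

{code}
import org.apache.spark.SparkConf

// Pull "-D:key=value" pairs out of the args array and apply them to the SparkConf.
def applyCliOpts(args: Array[String], conf: SparkConf): SparkConf =
  args.filter(_.startsWith("-D:")).foldLeft(conf) { (c, arg) =>
    arg.stripPrefix("-D:").split("=", 2) match {
      case Array(k, v) => c.set(k, v)
      case _           => c // malformed option; a real parser would report it
    }
  }
{code}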

On Thu, Mar 5, 2015 at 4:04 PM, Andrew Palumbo (JIRA)  wrote:
>
> [ 
> https://issues.apache.org/jira/browse/MAHOUT-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349655#comment-14349655
>  ]
>
> Andrew Palumbo commented on MAHOUT-1643:
> 
>
> Yeah- talking about the shell.  Do we want to process CLI args here ie:
>
> {code}
> $bin/mahout spark-shell -D:k=n
> {code}
>
> or should i just close this and we'll just go off of MAHOUT_OPTS?
>
>
>> CLI arguments are not being processed in spark-shell
>> 
>>
>> Key: MAHOUT-1643
>> URL: https://issues.apache.org/jira/browse/MAHOUT-1643
>> Project: Mahout
>>  Issue Type: Bug
>>  Components: CLI, spark
>>Affects Versions: 1.0
>> Environment: spark spark-shell
>>Reporter: Andrew Palumbo
>>  Labels: DSL, scala, spark, spark-shell
>> Fix For: 1.0
>>
>>
>> The CLI arguments are not being processed in spark-shell.  Most importantly 
>> the spark options are not being passed to the spark configuration via:
>> {code}
>> $ mahout spark-shell -D:k=n
>> {code}
>> The arguments are preserved through {code}$ bin/mahout{code}. There should
>> be a relatively easy fix, either by using the MahoutOptionParser, Scopt, or by
>> simply parsing the args array.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)


[jira] [Updated] (MAHOUT-1622) MultithreadedBatchItemSimilarities outputs incorrect number of similarities.

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1622:
---
Labels: legacy  (was: )

> MultithreadedBatchItemSimilarities outputs incorrect number of similarities.
> 
>
> Key: MAHOUT-1622
> URL: https://issues.apache.org/jira/browse/MAHOUT-1622
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Jesse Daniels
>Priority: Minor
>  Labels: legacy
> Attachments: batchSimilarities.patch
>
>
> In some cases the Output class in MultithreadedBatchItemSimilarities does not 
> output all of the similarity pairs that it should. It is very possible for 
> the number of active workers to go to zero while in the while loop, in which 
> case the remaining similarities for the finished workers will not be flushed 
> to the output. This is because the while loop is only conditioned on whether 
> there are active workers or not. An easy fix is to also check to make sure 
> the results structure is not empty. This way both the number of active 
> workers must be 0 and the result set must be empty to exit the while loop.
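
A toy model of the proposed condition (the names are illustrative, not the
actual Output class):

{code}
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

// Drain until the workers are done AND everything they produced is flushed.
// Conditioning only on activeWorkers > 0 can drop results computed near the end.
def drain[T](results: ConcurrentLinkedQueue[T], activeWorkers: AtomicInteger)
            (flush: T => Unit): Unit =
  while (activeWorkers.get() > 0 || !results.isEmpty) {
    val r = results.poll()
    if (r != null) flush(r)
  }
{code}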



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1593) cluster-reuters.sh does not work complaining java.lang.IllegalStateException

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1593:
---
Labels: legacy patch  (was: patch)

> cluster-reuters.sh does not work complaining java.lang.IllegalStateException
> 
>
> Key: MAHOUT-1593
> URL: https://issues.apache.org/jira/browse/MAHOUT-1593
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 0.9
> Environment: Hadoop version: 2.4.0.2.1.1.0-385
> Git hash: 2b65475c3ab682ebd47cffdc6b502698799cd2c8 (trunk)
>Reporter: jaehoon ko
>Priority: Minor
>  Labels: legacy, patch
> Fix For: 1.0
>
> Attachments: MAHOUT-1593.patch
>
>
> When I choose "kmeans clustering" in cluster-reuters.sh, clusterdump 
> complains java.lang.IllegalStateException as follows:
> {code:borderStyle=solid}
> Exception in thread "main" java.lang.IllegalStateException: 
> /tmp/mahout-work-user/reuters-kmeans/clusters-*-final
> at 
> org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:78)
> at 
> org.apache.mahout.clustering.evaluation.ClusterEvaluator.loadClusters(ClusterEvaluator.java:93)
> at 
> org.apache.mahout.clustering.evaluation.ClusterEvaluator.(ClusterEvaluator.java:81)
> at 
> org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:208)
> at 
> org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:157)
> at 
> org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:101)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Caused by: java.io.FileNotFoundException: File 
> /tmp/mahout-work-user/reuters-kmeans/clusters-*-final does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1483)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1523)
> at 
> org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.(SequenceFileDirValueIterator.java:70)
> at 
> org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:76)
> ... 18 more
> {code}
> Other clustering options run well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1594) Example factorize-movielens-1M.sh does not use HDFS

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1594:
---
Labels: legacy newbie patch  (was: newbie patch)

> Example factorize-movielens-1M.sh does not use HDFS
> ---
>
> Key: MAHOUT-1594
> URL: https://issues.apache.org/jira/browse/MAHOUT-1594
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 0.9
> Environment: Hadoop version: 2.4.0.2.1.1.0-385
> Git hash: 2b65475c3ab682ebd47cffdc6b502698799cd2c8 (trunk)
>Reporter: jaehoon ko
>Priority: Minor
>  Labels: legacy, newbie, patch
> Fix For: 1.0
>
> Attachments: MAHOUT-1594.patch
>
>
> It seems that factorize-movielens-1M.sh does not use HDFS at all. All paths
> look like local paths, not HDFS paths. So the example crashes immediately
> because it cannot find the input data in HDFS:
> {code}
> Exception in thread "main" 
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: /tmp/mahout-work-hoseog.lee/movielens/ratings.csv
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:320)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:375)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:493)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:510)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
> at 
> org.apache.mahout.cf.taste.hadoop.als.DatasetSplitter.run(DatasetSplitter.java:94)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at 
> org.apache.mahout.cf.taste.hadoop.als.DatasetSplitter.main(DatasetSplitter.java:64)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1619) HighDFWordsPruner overwrites cache files

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1619:
---
Labels: legacy  (was: )

> HighDFWordsPruner overwrites cache files
> 
>
> Key: MAHOUT-1619
> URL: https://issues.apache.org/jira/browse/MAHOUT-1619
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 1.0, 0.9
>Reporter: Burke Webster
>Priority: Minor
>  Labels: legacy
>
> HighDFWordsPruner uses DistributedCache.setCacheFiles which will overwrite 
> any files already in the cache.  Per the fix in MAHOUT-1498 we should be 
> using addCacheFile, which will not overwrite existing cache files.
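
The difference, sketched against Hadoop's API (the cached paths are illustrative):

{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.filecache.DistributedCache

val conf = new Configuration()
// setCacheFiles replaces the whole list, clobbering anything cached earlier:
DistributedCache.setCacheFiles(Array(new URI("/tmp/dictionary.file-0")), conf)
// addCacheFile appends instead, preserving existing entries (the MAHOUT-1498 fix):
DistributedCache.addCacheFile(new URI("/tmp/frequency.file-0"), conf)
{code}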



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1592) bin/mahout's seqdirectory doesn't work when MAHOUT_LOCAL non-empty

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1592:
---
Labels: legacy  (was: )

> bin/mahout's seqdirectory doesn't work when MAHOUT_LOCAL non-empty
> -
>
> Key: MAHOUT-1592
> URL: https://issues.apache.org/jira/browse/MAHOUT-1592
> Project: Mahout
>  Issue Type: Bug
>  Components: Integration
>Affects Versions: 0.9
> Environment: Linux
>Reporter: Alex Ott
>Priority: Minor
>  Labels: legacy
>
> Trying to run seqdirectory with MAHOUT_LOCAL set to non-empty leads to the
> following error:
> {noformat}
> >mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow  
> > 13:48 0
> MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
>
> MAHOUT_LOCAL is set, running locally
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/ott/work/mahout-head/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/ott/work/mahout-head/examples/target/dependency/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 14/07/08 13:50:39 INFO common.AbstractJob: Command line arguments: 
> {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], 
> --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], 
> --input=[/home/ott/work/exps/mh/20news-all], --keyPrefix=[], 
> --method=[mapreduce], --output=[/home/ott/work/exps/mh/20news-seq], 
> --overwrite=null, --startPhase=[0], --tempDir=[temp]}
> 14/07/08 13:50:39 INFO common.HadoopUtil: Deleting 
> /home/ott/work/exps/mh/20news-seq
> Exception in thread "main" java.io.FileNotFoundException: File does not 
> exist: /home/ott/work/url-cat-exps/mh/20news-all
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:558)
> at 
> org.apache.mahout.text.SequenceFilesFromDirectory.runMapReduce(SequenceFilesFromDirectory.java:162)
> at 
> org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:91)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at 
> org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:65)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> {noformat}
> But directory exists in the specified folder:
> {noformat}
> ott@mercury:work/exps/mh\>ls -lsd 20news-all  
>   13:50 0
> 4 drwxrwxr-x 22 ott ott 4096 Jul  8 08:49 20news-all/
> {noformat}
> If I explicitly specify the {{-xm sequential}} flag, then there is no error,
> but the task isn't performed at all:
> {noformat}
> MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
> MAHOUT_LOCAL is set, running locally
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/ott/work/mahout-head/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/ott/work/mahout-head/examples/target/dependency/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 14/07/08 13:54:19 INFO common.AbstractJob: Command line arguments: 
> {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], 
> --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], 
> --input=[/home/ott/work/exps/mh/20news-all], --keyPrefix=[], 
> --method=[sequential], --output=[/home/ott/work/exps/mh/20news-seq], 
> --overwrite=null, --startPhase=[0], --tempDir=[temp]}
> 14/07/08 13:54:19 INFO driver.MahoutDriver: Program took 548 ms (Minutes: 
> 0.009134)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1588) Multiple input path support in recommendation job

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1588:
---
Labels: legacy  (was: )

> Multiple input path support in recommendation job
> -
>
> Key: MAHOUT-1588
> URL: https://issues.apache.org/jira/browse/MAHOUT-1588
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Xiaomeng Huang
>Priority: Minor
>  Labels: legacy
> Attachments: Mahout-1588.000.patch
>
>
> Currently the recommendation job can only take a single input path via
> "--input", and can't load files from different paths. Customers may put
> preference data in different paths; this is a very common scenario.
> I added an option named "--multiInput (-mi)" and kept the original input
> option; the two can be set together. The modification only touches
> PreparePreferenceMatrixJob, which loads data from the filesystem.
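
For context, Hadoop jobs already accept multiple input paths; the patch's idea
can be sketched like this (the comma-separated convention is an assumption):

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// "--multiInput /data/a,/data/b" -> register every path with the same job.
def addInputs(job: Job, multiInput: String): Unit =
  multiInput.split(",").map(_.trim).filter(_.nonEmpty)
    .foreach(p => FileInputFormat.addInputPath(job, new Path(p)))
{code}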



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1601) Add javadoc for the classes - as there is no clue what the classes are for.

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1601:
---
Labels: documentation legacy  (was: documentation)

> Add javadoc for the classes - as there is no clue what the classes are for.
> -
>
> Key: MAHOUT-1601
> URL: https://issues.apache.org/jira/browse/MAHOUT-1601
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Harish Kayarohanam
>Priority: Minor
>  Labels: documentation, legacy
>
> I found that the following classes 
> org.apache.mahout.cf.taste.impl.neighborhood.DummySimilarity
> org.apache.mahout.cf.taste.impl.similarity.GenericUserSimilarity
> org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity
> did not have Javadoc, so I was unable to tell what these classes are for.
> Shall we add Javadoc for them?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1585) Temporarily Remove or Fix links to missing Javadocs

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1585:
---
Labels: DSL legacy scala spark  (was: )

> Temporarily Remove or Fix links to missing Javadocs
> ---
>
> Key: MAHOUT-1585
> URL: https://issues.apache.org/jira/browse/MAHOUT-1585
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Andrew Palumbo
>Priority: Minor
>  Labels: DSL, legacy, scala, spark
> Fix For: 1.0
>
>
> On the "Developer Resources" page:
> http://mahout.apache.org/developers/developer-resources.html
> The links to the Javadocs for Math, Integration, MR-Legacy and Examples all
> redirect to a password-protected build page. Since MR-Legacy is currently the
> only Javadoc being published, fix that link and temporarily remove the others.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1584) Create a detailed example of how to index an arbitrary dataset and run LDA on it

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1584:
---
Labels: documentation legacy  (was: documentation)

> Create a detailed example of how to index an arbitrary dataset and run LDA on 
> it
> 
>
> Key: MAHOUT-1584
> URL: https://issues.apache.org/jira/browse/MAHOUT-1584
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 0.9
>Reporter: Nico Scherer
>Priority: Minor
>  Labels: documentation, legacy
> Fix For: 0.9
>
> Attachments: lda.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As students from Sebastian Schelter's class, we will create a detailed example
> of how to index an arbitrary dataset and run Mahout LDA on it. Also, we will
> have a look at the current dev page descriptions of LDA and see if the
> documentation is up to date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1559) Add documentation for and clean up the wikipedia classifier example

2015-03-05 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349671#comment-14349671
 ] 

Andrew Palumbo commented on MAHOUT-1559:


I have a scala script that I'd like to use as documentation for this. It's still
reliant on legacy vectorization, though.

> Add documentation for and clean up the wikipedia classifier example
> ---
>
> Key: MAHOUT-1559
> URL: https://issues.apache.org/jira/browse/MAHOUT-1559
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation, Examples
>Affects Versions: 1.0
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
>Priority: Minor
>  Labels: DSL, legacy, scala
> Fix For: 1.0
>
>
> Add documentation for the wikipedia classifier example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1618) Cooccurrence Recommender example and documentation

2015-03-05 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349670#comment-14349670
 ] 

Pat Ferrel commented on MAHOUT-1618:


If someone wants to pick this up: the best idea seems to be a client, maybe
based on Jsolr, that stores interactions in a user collection and the
cooccurrence matrix in an items collection in Solr itself. This can be done
pretty easily by updating through the Solr API. Then recs can be returned for a
user ID by first fetching the user's history and using it to create a recs
query.

Think I'll start on this myself; please ping me if you want to help.
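
A rough SolrJ sketch of that flow - the collection layout and field names are
made up, and this is an idea sketch rather than working recommender code:

{code}
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer

val users = new HttpSolrServer("http://localhost:8983/solr/users")
val items = new HttpSolrServer("http://localhost:8983/solr/items")

// 1) fetch the user's interaction history from the users collection
val history = users.query(new SolrQuery("userId:u123")).getResults
  .get(0).getFieldValue("history").toString // e.g. "item1 item9 item42"

// 2) use the history as the query against the cooccurrence field of the items
val recs = items.query(new SolrQuery(s"cooccurrence:($history)").setRows(10))
  .getResults
{code}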

> Cooccurrence Recommender example and documentation 
> ---
>
> Key: MAHOUT-1618
> URL: https://issues.apache.org/jira/browse/MAHOUT-1618
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
>Affects Versions: collections-1.0
>Reporter: Thejas Prasad
>Priority: Trivial
>  Labels: DSL, scala, spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1559) Add documentation for and clean up the wikipedia classifier example

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1559:
---
Labels: DSL legacy scala  (was: legacy)

> Add documentation for and clean up the wikipedia classifier example
> ---
>
> Key: MAHOUT-1559
> URL: https://issues.apache.org/jira/browse/MAHOUT-1559
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation, Examples
>Affects Versions: 1.0
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
>Priority: Minor
>  Labels: DSL, legacy, scala
> Fix For: 1.0
>
>
> Add documentation for the wikipedia classifier example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1605) Make VisualizerTest locale independent

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1605:
---
Labels: legacy  (was: )

> Make VisualizerTest locale independent  
> 
>
> Key: MAHOUT-1605
> URL: https://issues.apache.org/jira/browse/MAHOUT-1605
> Project: Mahout
>  Issue Type: Test
>Affects Versions: 0.9
>Reporter: Frank Rosner
>Priority: Trivial
>  Labels: legacy
>
> h5. Problem
> When trying to build Mahout on a machine with a locale that uses a different 
> decimal separator, {{org.apache.mahout.classifier.df.tools.VisualizerTest}} 
> fails because of String assertions that are locale dependent.
> Expected: {{humidity < 77.5 : yes}}
> Actual: {{humidity < 77,5 : yes}}
> h5. Solution
> Make assertions locale independent.
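
One standard way to do that is to format the expected string with an explicit
locale instead of the platform default; a sketch (the actual test fix may differ):

{code}
import java.util.Locale

// Yields "humidity < 77.5 : yes" even where ',' is the decimal separator.
val expected = String.format(Locale.ROOT, "humidity < %.1f : yes", Double.box(77.5))
{code}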



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1600) Algorithms for computing correlation and covariance

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1600:
---
Labels: DSL scala  (was: )

> Algorithms for computing correlation and covariance
> ---
>
> Key: MAHOUT-1600
> URL: https://issues.apache.org/jira/browse/MAHOUT-1600
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.9
>Reporter: Nagamallikarjuna
>Priority: Trivial
>  Labels: DSL, scala
>
> I checked the list of Mahout algorithms and didn't find algorithms for
> computing correlation and covariance. I have already written those two
> algorithms to solve real-world business problems, and I am planning to
> contribute them to Mahout.
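
For the in-core case this is a short computation in the Samsara DSL; a sketch
assuming rows are observations (sample covariance C = Xc' Xc / (n - 1)):

{code}
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

val x = dense((1.0, 2.0), (2.0, 3.0), (4.0, 7.0))
val n = x.nrow
val mu = x.colMeans()
val xc = x.cloned
for (r <- 0 until n) xc(r, ::) -= mu // center each column
val cov = (xc.t %*% xc) / (n - 1)
{code}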



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1623) MAHOUT.CMD contains duplicated code

2015-03-05 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1623:
---
Labels: build easyfix  (was: build easyfix legacy)

> MAHOUT.CMD contains duplicated code
> ---
>
> Key: MAHOUT-1623
> URL: https://issues.apache.org/jira/browse/MAHOUT-1623
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0
>Reporter: Srinivas Chamarthi
>Priority: Trivial
>  Labels: build, easyfix
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

