[jira] [Updated] (MAHOUT-1598) extend seq2sparse to handle multiple text blocks of same document

2015-03-29 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1598:
--
Assignee: Andrew Musselman

> extend seq2sparse to handle multiple text blocks of same document
> -
>
> Key: MAHOUT-1598
> URL: https://issues.apache.org/jira/browse/MAHOUT-1598
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.9
>Reporter: Wolfgang Buchner
>Assignee: Andrew Musselman
>  Labels: legacy
> Fix For: 0.10.0
>
>
> Currently seq2sparse, or more specifically 
> org.apache.mahout.vectorizer.DictionaryVectorizer, needs exactly one 
> text block per document as input.
> I stumbled on this because I have a use case where one document 
> represents a ticket which can have several text blocks in different 
> languages. 
> So my idea was that the org.apache.mahout.vectorizer.DocumentProcessor should 
> tokenize each text block separately, so I can use language-specific features in 
> our Lucene Analyzer.
> Unfortunately the current implementation doesn't support this.
> But with just minor changes this can be made possible.
> The only thing that has to be changed is 
> org.apache.mahout.vectorizer.term.TFPartialVectorReducer, to handle all values 
> of the iterable (not just the 1st one >.<)
> An alternative would be to change this Reducer to a Mapper. I don't get why 
> this is implemented as a reducer in the first place. Is there any benefit from 
> this?
> I will provide a PR via GitHub.
> Please have a look at this and tell me if I am assuming anything wrong.
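
The proposed change (consume every value of the iterable rather than only the first) can be illustrated with a small, self-contained sketch. This is not Mahout's TFPartialVectorReducer; the class `MultiBlockTermFrequency` and the method `countTerms` are hypothetical names, and plain Java collections stand in for the Hadoop and Mahout types.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the reduce step; not Mahout's TFPartialVectorReducer.
final class MultiBlockTermFrequency {

  // Counts term frequencies across ALL token blocks that share one document key.
  static Map<String, Integer> countTerms(Iterable<List<String>> tokenBlocks) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (List<String> block : tokenBlocks) {   // visit every value, not only the first
      for (String term : block) {
        counts.merge(term, 1, Integer::sum);   // add 1, or start at 1 for a new term
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // Two text blocks of the same ticket, tokenized separately (assumed input).
    List<List<String>> blocks = Arrays.asList(
        Arrays.asList("printer", "broken"),
        Arrays.asList("drucker", "printer"));
    System.out.println(countTerms(blocks));
  }
}
```

With this shape, each text block can be tokenized by its own language-specific analyzer upstream, and the reduce step only has to merge the resulting token lists.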



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1598) extend seq2sparse to handle multiple text blocks of same document

2015-03-29 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1598:
--
Affects Version/s: (was: 1.0)
Fix Version/s: 0.10.0

> extend seq2sparse to handle multiple text blocks of same document
> -
>
> Key: MAHOUT-1598
> URL: https://issues.apache.org/jira/browse/MAHOUT-1598
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.9
>Reporter: Wolfgang Buchner
>  Labels: legacy
> Fix For: 0.10.0
>
>
> Currently seq2sparse, or more specifically 
> org.apache.mahout.vectorizer.DictionaryVectorizer, needs exactly one 
> text block per document as input.
> I stumbled on this because I have a use case where one document 
> represents a ticket which can have several text blocks in different 
> languages. 
> So my idea was that the org.apache.mahout.vectorizer.DocumentProcessor should 
> tokenize each text block separately, so I can use language-specific features in 
> our Lucene Analyzer.
> Unfortunately the current implementation doesn't support this.
> But with just minor changes this can be made possible.
> The only thing that has to be changed is 
> org.apache.mahout.vectorizer.term.TFPartialVectorReducer, to handle all values 
> of the iterable (not just the 1st one >.<)
> An alternative would be to change this Reducer to a Mapper. I don't get why 
> this is implemented as a reducer in the first place. Is there any benefit from 
> this?
> I will provide a PR via GitHub.
> Please have a look at this and tell me if I am assuming anything wrong.





[jira] [Updated] (MAHOUT-1661) Remove Lanczos from the code base

2015-03-29 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1661:
--
Summary: Remove Lanczos from the code base  (was: Deprecate Lanczos from 
the code base)

> Remove Lanczos from the code base
> -
>
> Key: MAHOUT-1661
> URL: https://issues.apache.org/jira/browse/MAHOUT-1661
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Suneel Marthi
>Assignee: Shannon Quinn
>Priority: Critical
> Fix For: 0.10.0
>
>
> Lanczos has long been deprecated from the code base but the code doesn't 
> reflect that.  Now that Spectral KMeans has been refactored to use SSVD, 
> Lanczos can be purged.





[jira] [Updated] (MAHOUT-1661) Deprecate Lanczos from the code base

2015-03-29 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1661:
--
Description: 
Lanczos has long been deprecated from the code base but the code doesn't 
reflect that.  Now that Spectral KMeans has been refactored to use SSVD, 
Lanczos can be purged.


  was:
Lanczos has long been deprecated from the code base but the code doesn't 
reflect that.  It's only used now in Spectral KMeans, which needs to be 
refactored to use SSVD. 




> Deprecate Lanczos from the code base
> 
>
> Key: MAHOUT-1661
> URL: https://issues.apache.org/jira/browse/MAHOUT-1661
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Suneel Marthi
>Assignee: Shannon Quinn
>Priority: Critical
> Fix For: 0.10.0
>
>
> Lanczos has long been deprecated from the code base but the code doesn't 
> reflect that.  Now that Spectral KMeans has been refactored to use SSVD, 
> Lanczos can be purged.





[jira] [Resolved] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-29 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi resolved MAHOUT-1659.
---
Resolution: Fixed

> Remove deprecated Lanczos solver from spectral clustering in mr-legacy
> --
>
> Key: MAHOUT-1659
> URL: https://issues.apache.org/jira/browse/MAHOUT-1659
> Project: Mahout
>  Issue Type: Task
>  Components: Clustering, mrlegacy
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>Priority: Minor
> Fix For: 0.10.0
>
>
> Spectral clustering still has the option of using either SSVD or the Lanczos 
> solver for dimensionality reduction. Remove the latter entirely.





[jira] [Commented] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-29 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386272#comment-14386272
 ] 

Suneel Marthi commented on MAHOUT-1659:
---

Shannon, I committed the patch to trunk with a few other changes. Thanks again. 

> Remove deprecated Lanczos solver from spectral clustering in mr-legacy
> --
>
> Key: MAHOUT-1659
> URL: https://issues.apache.org/jira/browse/MAHOUT-1659
> Project: Mahout
>  Issue Type: Task
>  Components: Clustering, mrlegacy
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>Priority: Minor
> Fix For: 0.10.0
>
>
> Spectral clustering still has the option of using either SSVD or the Lanczos 
> solver for dimensionality reduction. Remove the latter entirely.





[jira] [Commented] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386270#comment-14386270
 ] 

ASF GitHub Bot commented on MAHOUT-1659:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/88


> Remove deprecated Lanczos solver from spectral clustering in mr-legacy
> --
>
> Key: MAHOUT-1659
> URL: https://issues.apache.org/jira/browse/MAHOUT-1659
> Project: Mahout
>  Issue Type: Task
>  Components: Clustering, mrlegacy
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>Priority: Minor
> Fix For: 0.10.0
>
>
> Spectral clustering still has the option of using either SSVD or the Lanczos 
> solver for dimensionality reduction. Remove the latter entirely.





[jira] [Commented] (MAHOUT-1612) NullPointerException happens during JSON output format for clusterdumper

2015-03-29 Thread Manoj Awasthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386257#comment-14386257
 ] 

Manoj Awasthi commented on MAHOUT-1612:
---

Ok.

> NullPointerException happens during JSON output format for clusterdumper
> 
>
> Key: MAHOUT-1612
> URL: https://issues.apache.org/jira/browse/MAHOUT-1612
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Guo Ruijing
>Assignee: Suneel Marthi
>  Labels: legacy
> Fix For: 0.10.0
>
>
> 1. download datafile from:
> http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
> 2. put data file on hdfs:
> hdfs dfs -mkdir testdata
> hdfs dfs -put synthetic_control.data testdata/
> 3. run a mahout clustering job:
> mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
> 4. run clusterdump with JSON format:
> mahout clusterdump -i output/clusters*-final -p output/clusteredPoints -o 
> /tmp/report -of JSON
> expected:
> clusterdump with JSON format should succeed, the same as with CSV and TEXT
> actually:
> clusterdump with JSON format throws a NullPointerException





[jira] [Comment Edited] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-29 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386067#comment-14386067
 ] 

Suneel Marthi edited comment on MAHOUT-1659 at 3/30/15 4:29 AM:


Github user smarthi commented on the pull request:

https://github.com/apache/mahout/pull/88#issuecomment-87507795
  
Looks good Shannon, another change that would be good to include in this PR

 Replace all Guava API calls in DisplaySpectralKMeans.java with the appropriate 
Java 7 api.
   


was (Author: githubbot):
Github user smarthi commented on the pull request:

https://github.com/apache/mahout/pull/88#issuecomment-87507795
  
Looks good Shannon, a few other changes that would be good to have in this 
PR

1.  Replace all Guava API calls in DisplaySpectralKMeans.java with the 
appropriate Java 7 api.
2.  I think we should now purge Lanczos for good. Correct? If so, please 
either create a new Jira for that or update this Jira to include the Lanczos deprecation.  
 Regardless, Lanczos needs to be marked as deprecated in the code.


> Remove deprecated Lanczos solver from spectral clustering in mr-legacy
> --
>
> Key: MAHOUT-1659
> URL: https://issues.apache.org/jira/browse/MAHOUT-1659
> Project: Mahout
>  Issue Type: Task
>  Components: Clustering, mrlegacy
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>Priority: Minor
> Fix For: 0.10.0
>
>
> Spectral clustering still has the option of using either SSVD or the Lanczos 
> solver for dimensionality reduction. Remove the latter entirely.





[jira] [Comment Edited] (MAHOUT-1638) H2O bindings fail at drmParallelizeWithRowLabels(...)

2015-03-29 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386079#comment-14386079
 ] 

Andrew Palumbo edited comment on MAHOUT-1638 at 3/30/15 1:50 AM:
-

Current status:

{code}
Map map = m.getRowLabelBindings();
if (map != null) {
  // label vector must be similarly partitioned like the Frame
  byte []typeArr = {Vec.T_STR};
  labels = frame.lastVec().makeCons(1, frame.numRows(), null , typeArr)[0];
  Vec.Writer writer = labels.open();

  Map rmap = reverseMap(map);
  for (int r = 0; r < m.rowSize(); r++){
    writer.set(r, rmap.get(r).toString());
  }
  writer.close(closer);
}

{code}
When I run the tests I can verify that: 
{code}
labels.isString() == true
labels.chunkForRow(r).getClass().getSimpleName() == "C0LChunk"
{code}

As far as I can tell, we need a {{CStrChunk}} or a {{NewChunk}} to be able to 
set a String value.

Still getting the exception:
{code}
Not a String
java.lang.IllegalArgumentException: Not a String
at water.fvec.Chunk.set_impl(Chunk.java:494)
at water.fvec.Chunk.set(Chunk.java:456)
at water.fvec.Chunk.set_abs(Chunk.java:358)
at water.fvec.Vec$Writer.set(Vec.java:821)
{code}



was (Author: andrew_palumbo):
Current status:

{code}
Map map = m.getRowLabelBindings();
if (map != null) {
  // label vector must be similarly partitioned like the Frame
  byte []typeArr = {Vec.T_STR};
  labels = frame.lastVec().makeCons(1, frame.numRows(), null , typeArr)[0];
  Vec.Writer writer = labels.open();

  Map rmap = reverseMap(map);
  for (int r = 0; r < m.rowSize(); r++){
 writer.set(r, rmap.get(r).toString());
  }
  writer.close(closer);
}

{code}
When I run tests i can verify that 
{code}
labels.isString() == true
labels.chunkForRow(r).getClass().getSimpleName() == "C0LChunk"
{code}

As far as I can tell, we need a {{CStrChunk}} or a {{NewChunk}} to be able to 
set a String Value.

still getting the exception:
{code}
Not a String
java.lang.IllegalArgumentException: Not a String
at water.fvec.Chunk.set_impl(Chunk.java:494)
at water.fvec.Chunk.set(Chunk.java:456)
at water.fvec.Chunk.set_abs(Chunk.java:358)
at water.fvec.Vec$Writer.set(Vec.java:821)
{code}


> H2O bindings fail at drmParallelizeWithRowLabels(...)
> -
>
> Key: MAHOUT-1638
> URL: https://issues.apache.org/jira/browse/MAHOUT-1638
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.10.0
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
>Priority: Blocker
>  Labels: DSL, h2o, scala
> Fix For: 0.10.0
>
>
> The H2OHelper.drmFromMatrix(...) function fails when trying to write row 
> label String keys to a water.fvec.Vec.:
> {code:java}
>  java.lang.IllegalArgumentException: Not a String
>   at water.fvec.Chunk.set_impl(Chunk.java:507)
>   at water.fvec.Chunk.set0(Chunk.java:469)
>   at water.fvec.Chunk.set(Chunk.java:371)
>   at water.fvec.Vec$Writer.set(Vec.java:803)
>   at org.apache.mahout.h2obindings.H2OHelper.drmFromMatrix(H2OHelper.java:331)
>   at 
> org.apache.mahout.h2obindings.H2OEngine$.drmParallelizeWithRowLabels(H2OEngine.scala:83)
>
>   at 
> org.apache.mahout.math.drm.package$.drmParallelizeWithRowLabels(package.scala:67)
> {code} 
> This causes an exception when calling drm.drmParallelizeWithRowLabels(...)
> To reproduce, apply [PR#72: Enable Naive Bayes Tests in h2o 
> Module|https://github.com/apache/mahout/pull/72] and run:
> {code} $ mvn test 
> {code}
> from the h2o module:
> {code:java}
> - NB Aggregator *** FAILED ***
>   java.lang.IllegalArgumentException: Not a String
>   at water.fvec.Chunk.set_impl(Chunk.java:507)
>   at water.fvec.Chunk.set0(Chunk.java:469)
>   at water.fvec.Chunk.set(Chunk.java:371)
>   at water.fvec.Vec$Writer.set(Vec.java:803)
>   at org.apache.mahout.h2obindings.H2OHelper.drmFromMatrix(H2OHelper.java:331)
>   at 
> org.apache.mahout.h2obindings.H2OEngine$.drmParallelizeWithRowLabels(H2OEngine.scala:83)
>
>   at 
> org.apache.mahout.math.drm.package$.drmParallelizeWithRowLabels(package.scala:67)
>   
>   at 
> org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply$mcV$sp(NBTestBase.scala:91)
> 
>   at 
> org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply(NBTestBase.scala:70)
>
>   at 
> org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply(NBTestBase.scala:70)
> 

[jira] [Commented] (MAHOUT-1638) H2O bindings fail at drmParallelizeWithRowLabels(...)

2015-03-29 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386079#comment-14386079
 ] 

Andrew Palumbo commented on MAHOUT-1638:


Current status:

{code}
Map map = m.getRowLabelBindings();
if (map != null) {
  // label vector must be similarly partitioned like the Frame
  byte []typeArr = {Vec.T_STR};
  labels = frame.lastVec().makeCons(1, frame.numRows(), null , typeArr)[0];
  Vec.Writer writer = labels.open();

  Map rmap = reverseMap(map);
  for (int r = 0; r < m.rowSize(); r++){
 writer.set(r, rmap.get(r).toString());
  }
  writer.close(closer);
}

{code}
When I run the tests I can verify that: 
{code}
labels.isString() == true
labels.chunkForRow(r).getClass().getSimpleName() == "C0LChunk"
{code}

As far as I can tell, we need a {{CStrChunk}} or a {{NewChunk}} to be able to 
set a String value.

Still getting the exception:
{code}
Not a String
java.lang.IllegalArgumentException: Not a String
at water.fvec.Chunk.set_impl(Chunk.java:494)
at water.fvec.Chunk.set(Chunk.java:456)
at water.fvec.Chunk.set_abs(Chunk.java:358)
at water.fvec.Vec$Writer.set(Vec.java:821)
{code}


> H2O bindings fail at drmParallelizeWithRowLabels(...)
> -
>
> Key: MAHOUT-1638
> URL: https://issues.apache.org/jira/browse/MAHOUT-1638
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.10.0
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
>Priority: Blocker
>  Labels: DSL, h2o, scala
> Fix For: 0.10.0
>
>
> The H2OHelper.drmFromMatrix(...) function fails when trying to write row 
> label String keys to a water.fvec.Vec.:
> {code:java}
>  java.lang.IllegalArgumentException: Not a String
>   at water.fvec.Chunk.set_impl(Chunk.java:507)
>   at water.fvec.Chunk.set0(Chunk.java:469)
>   at water.fvec.Chunk.set(Chunk.java:371)
>   at water.fvec.Vec$Writer.set(Vec.java:803)
>   at org.apache.mahout.h2obindings.H2OHelper.drmFromMatrix(H2OHelper.java:331)
>   at 
> org.apache.mahout.h2obindings.H2OEngine$.drmParallelizeWithRowLabels(H2OEngine.scala:83)
>
>   at 
> org.apache.mahout.math.drm.package$.drmParallelizeWithRowLabels(package.scala:67)
> {code} 
> This causes an exception when calling drm.drmParallelizeWithRowLabels(...)
> To reproduce, apply [PR#72: Enable Naive Bayes Tests in h2o 
> Module|https://github.com/apache/mahout/pull/72] and run:
> {code} $ mvn test 
> {code}
> from the h2o module:
> {code:java}
> - NB Aggregator *** FAILED ***
>   java.lang.IllegalArgumentException: Not a String
>   at water.fvec.Chunk.set_impl(Chunk.java:507)
>   at water.fvec.Chunk.set0(Chunk.java:469)
>   at water.fvec.Chunk.set(Chunk.java:371)
>   at water.fvec.Vec$Writer.set(Vec.java:803)
>   at org.apache.mahout.h2obindings.H2OHelper.drmFromMatrix(H2OHelper.java:331)
>   at 
> org.apache.mahout.h2obindings.H2OEngine$.drmParallelizeWithRowLabels(H2OEngine.scala:83)
>
>   at 
> org.apache.mahout.math.drm.package$.drmParallelizeWithRowLabels(package.scala:67)
>   
>   at 
> org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply$mcV$sp(NBTestBase.scala:91)
> 
>   at 
> org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply(NBTestBase.scala:70)
>
>   at 
> org.apache.mahout.classifier.naivebayes.NBTestBase$$anonfun$2.apply(NBTestBase.scala:70)
>
>   ...
> {code}





[jira] [Commented] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386067#comment-14386067
 ] 

ASF GitHub Bot commented on MAHOUT-1659:


Github user smarthi commented on the pull request:

https://github.com/apache/mahout/pull/88#issuecomment-87507795
  
Looks good Shannon, a few other changes that would be good to have in this 
PR

1.  Replace all Guava API calls in DisplaySpectralKMeans.java with the 
appropriate Java 7 api.
2.  I think we should now purge Lanczos for good. Correct? If so, please 
either create a new Jira for that or update this Jira to include the Lanczos deprecation.  
 Regardless, Lanczos needs to be marked as deprecated in the code.


> Remove deprecated Lanczos solver from spectral clustering in mr-legacy
> --
>
> Key: MAHOUT-1659
> URL: https://issues.apache.org/jira/browse/MAHOUT-1659
> Project: Mahout
>  Issue Type: Task
>  Components: Clustering, mrlegacy
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>Priority: Minor
> Fix For: 0.10.0
>
>
> Spectral clustering still has the option of using either SSVD or the Lanczos 
> solver for dimensionality reduction. Remove the latter entirely.





Re: [jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Suneel Marthi
Replace Iterators.skip() with Iterators.advance() to get past that error.
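
For readers unfamiliar with the call: as far as I know, Guava's `Iterators.advance(Iterator<?>, int)` takes the same arguments as the removed `skip`, moves the iterator forward by up to that many elements, and returns how many it actually skipped. The sketch below is a plain-Java stand-in that mirrors that contract without needing Guava on the classpath; `AdvanceSketch` and its `advance` method are hypothetical names, not the Guava implementation.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in mirroring the contract of Guava's Iterators.advance.
final class AdvanceSketch {

  // Advances the iterator by up to n elements; returns how many were skipped.
  static int advance(Iterator<?> iterator, int n) {
    if (n < 0) {
      throw new IllegalArgumentException("number to advance cannot be negative");
    }
    int skipped = 0;
    while (skipped < n && iterator.hasNext()) {
      iterator.next();
      skipped++;
    }
    return skipped;
  }

  public static void main(String[] args) {
    Iterator<String> it = Arrays.asList("a", "b", "c", "d").iterator();
    int skipped = advance(it, 3);          // moves past "a", "b", "c"
    System.out.println(skipped + " skipped, next is " + it.next());
  }
}
```

In the failing test the change should then be a rename at each call site, for example `Iterators.advance(iterator, 1)` where the code previously called `Iterators.skip(iterator, 1)`; the variable name and count here are illustrative only.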

On Sun, Mar 29, 2015 at 7:44 PM, Pat Ferrel (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386018#comment-14386018
> ]
>
> Pat Ferrel commented on MAHOUT-1655:
> 
>
> Merged master but now integration fails
>
> [ERROR]
> /Users/pat/mahout/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/LuceneIterableTest.java:[120,14]
> cannot find symbol
>   symbol:   method
> skip(java.util.Iterator,int)
>   location: class com.google.common.collect.Iterators
> [ERROR]
> /Users/pat/mahout/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/LuceneIterableTest.java:[160,14]
> cannot find symbol
>   symbol:   method
> skip(java.util.Iterator,int)
>   location: class com.google.common.collect.Iterators
> [ERROR]
> /Users/pat/mahout/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/LuceneIterableTest.java:[163,16]
> cannot find symbol
>   symbol:   method
> skip(java.util.Iterator,int)
>  r
> [INFO] 3 errors
> [INFO] -
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Mahout Build Tools ... SUCCESS [ 1.679 s]
> [INFO] Apache Mahout ... SUCCESS [ 0.410 s]
> [INFO] Mahout Math ... SUCCESS [01:04 min]
> [INFO] Mahout HDFS ... SUCCESS [ 2.193 s]
> [INFO] Mahout Map-Reduce ... SUCCESS [12:49 min]
> [INFO] Mahout Integration ... FAILURE [ 1.743 s]
> [INFO] Mahout Examples ... SKIPPED
> [INFO] Mahout Release Package ... SKIPPED
> [INFO] Mahout Math Scala bindings ... SKIPPED
> [INFO] Mahout Spark bindings ... SKIPPED
> [INFO] Mahout Spark bindings shell ... SKIPPED
> [INFO] Mahout H2O backend ... SKIPPED
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 14:00 min
> [INFO] Finished at: 2015-03-29T15:34:03-08:00
> [INFO] Final Memory: 58M/436M
> [INFO]
> 
>
>
> > Refactor module dependencies
> > 
> >
> > Key: MAHOUT-1655
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> > Project: Mahout
> >  Issue Type: Improvement
> >  Components: mrlegacy
> >Affects Versions: 0.9
> >Reporter: Pat Ferrel
> >Assignee: Andrew Musselman
> >Priority: Critical
> > Fix For: 0.10.0
> >
> >
> > Make a new module, call it mahout-hadoop. Move anything there that is
> currently in mrlegacy but used in math-scala or spark. Remove dependencies
> on mrlegacy altogether if possible by using other core classes.
> > The goal is to have math-scala and spark module depend on math, and a
> small module called mahout-hadoop (much smaller than mrlegacy).
>
>
>
>


[jira] [Updated] (MAHOUT-1516) run classify-20newsgroups.sh failed cause by /tmp/mahout-work-jpan/20news-all does not exists in hdfs.

2015-03-29 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1516:
---
Resolution: Not a Problem
  Assignee: Andrew Palumbo
Status: Resolved  (was: Patch Available)

We'll need to use {{hadoop dfs -rmr}} for as long as we support Hadoop 1.2.1.  

Running classify-20newsgroups.sh on Hadoop 2.2.0 in pseudo-cluster mode, I get 
no errors:
{code}
+ '[' /home/andy/apache/hadoop-2.2.0 '!=' '' ']'
+ '[' '' == '' ']'
+ echo 'Copying 20newsgroups data to HDFS'
Copying 20newsgroups data to HDFS
+ set +e
+ /home/andy/apache/hadoop-2.2.0/bin/hadoop dfs -rmr 
/tmp/mahout-work-andy/20news-all
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

rmr: DEPRECATED: Please use 'rm -r' instead.
15/03/29 20:32:06 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/03/29 20:32:07 INFO fs.TrashPolicyDefault: Namenode trash configuration: 
Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /tmp/mahout-work-andy/20news-all
+ /home/andy/apache/hadoop-2.2.0/bin/hadoop dfs -rmr 
/tmp/mahout-work-andy/spark-model
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

rmr: DEPRECATED: Please use 'rm -r' instead.
15/03/29 20:32:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/03/29 20:32:12 INFO fs.TrashPolicyDefault: Namenode trash configuration: 
Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /tmp/mahout-work-andy/spark-model
+ set -e
+ /home/andy/apache/hadoop-2.2.0/bin/hadoop dfs -put 
/tmp/mahout-work-andy/20news-all /tmp/mahout-work-andy/20news-all
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/03/29 20:32:15 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable

15/03/29 20:32:15 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
+ echo 'Creating sequence files from 20newsgroups data'
Creating sequence files from 20newsgroups data
+ ./bin/mahout seqdirectory -i /tmp/mahout-work-andy/20news-all -o 
/tmp/mahout-work-andy/20news-seq -ow
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /home/andy/apache/hadoop-2.2.0/bin/hadoop and 
HADOOP_CONF_DIR=/home/andy/apache/hadoop-2.2.0/etc/hadoop
MAHOUT-JOB: 
/home/andy/sandbox/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar

{code}




> run classify-20newsgroups.sh failed cause by /tmp/mahout-work-jpan/20news-all 
> does not exists in hdfs.
> --
>
> Key: MAHOUT-1516
> URL: https://issues.apache.org/jira/browse/MAHOUT-1516
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 0.9
> Environment: hadoop2.2.0 mahout0.9 ubuntu12.04 
>Reporter: Jian Pan
>Assignee: Andrew Palumbo
>Priority: Minor
>  Labels: legacy, patch
> Fix For: 0.10.0
>
>
> + echo 'Copying 20newsgroups data to HDFS'
> Copying 20newsgroups data to HDFS
> + set +e
> + /home/jpan/Software/hadoop-2.2.0/bin/hadoop dfs -rmr 
> /tmp/mahout-work-jpan/20news-all
> DEPRECATED: Use of this script to execute hdfs command is deprecated.
> Instead use the hdfs command for it.
> rmr: DEPRECATED: Please use 'rm -r' instead.
> 14/04/17 10:26:25 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> rmr: `/tmp/mahout-work-jpan/20news-all': No such file or directory
> + set -e
> + /home/jpan/Software/hadoop-2.2.0/bin/hadoop dfs -put 
> /tmp/mahout-work-jpan/20news-all /tmp/mahout-work-jpan/20news-all
> DEPRECATED: Use of this script to execute hdfs command is deprecated.
> Instead use the hdfs command for it.
> 14/04/17 10:26:26 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> put: `/tmp/mahout-work-jpan/20news-all': No such file or directory





[jira] [Commented] (MAHOUT-1538) Port spectral clustering to Mahout DSL

2015-03-29 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386045#comment-14386045
 ] 

Andrew Musselman commented on MAHOUT-1538:
--

Is there a ticket for k-means being moved to the DSL?

> Port spectral clustering to Mahout DSL
> --
>
> Key: MAHOUT-1538
> URL: https://issues.apache.org/jira/browse/MAHOUT-1538
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>  Labels: DSL, Spark, scala
> Fix For: 0.10.1
>
>
> Move spectral clustering logic to Mahout DSL. Dependencies include SSVD 
> (already ported) and K-means (currently in progress, or can use Spark MLlib 
> implementation as a temporary fix).





[jira] [Commented] (MAHOUT-1540) Reuters example for spectral clustering

2015-03-29 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386044#comment-14386044
 ] 

Andrew Musselman commented on MAHOUT-1540:
--

I can help you as needed; let's both assist with M-1538 and M-1539 being 
completed.

> Reuters example for spectral clustering
> ---
>
> Key: MAHOUT-1540
> URL: https://issues.apache.org/jira/browse/MAHOUT-1540
> Project: Mahout
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>  Labels: DSL, scala, spark
> Fix For: 0.10.1
>
>
> Once MAHOUT-1538 and MAHOUT-1539 are complete, create a working example of 
> spectral clustering using the Reuters dataset.





[jira] [Updated] (MAHOUT-1538) Port spectral clustering to Mahout DSL

2015-03-29 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1538:
--
Fix Version/s: (was: 0.10.0)
   0.10.1

> Port spectral clustering to Mahout DSL
> --
>
> Key: MAHOUT-1538
> URL: https://issues.apache.org/jira/browse/MAHOUT-1538
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>  Labels: DSL, Spark, scala
> Fix For: 0.10.1
>
>
> Move spectral clustering logic to Mahout DSL. Dependencies include SSVD 
> (already ported) and K-means (currently in progress, or can use Spark MLlib 
> implementation as a temporary fix).





[jira] [Updated] (MAHOUT-1539) Implement affinity matrix computation in Mahout DSL

2015-03-29 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1539:
--
Affects Version/s: (was: 1.0)
   0.9
Fix Version/s: (was: 0.10.0)
   0.10.1

> Implement affinity matrix computation in Mahout DSL
> ---
>
> Key: MAHOUT-1539
> URL: https://issues.apache.org/jira/browse/MAHOUT-1539
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>  Labels: DSL, scala, spark
> Fix For: 0.10.1
>
> Attachments: ComputeAffinities.scala
>
>
> This has the same goal as MAHOUT-1506, but rather than code the pairwise 
> computations in MapReduce, this will be done in the Mahout DSL.
> An orthogonal issue is the format of the raw input (vectors, text, images, 
> SequenceFiles), and how the user specifies the distance equation and any 
> associated parameters.
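
The pairwise computation mentioned above can be sketched in plain Java. This is only an illustration under stated assumptions: the issue does not say which distance equation is used, so a Gaussian kernel over squared Euclidean distance is assumed here, and `AffinitySketch`, `gaussianAffinities`, and `sigma` are hypothetical names that do not come from the attached ComputeAffinities.scala.

```java
/** Hypothetical sketch; not Mahout code and not the attached ComputeAffinities.scala. */
final class AffinitySketch {
  private AffinitySketch() {}

  /**
   * Builds a dense, symmetric affinity matrix with
   * a[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).
   * The Gaussian kernel is an assumption; the issue leaves the distance equation open.
   */
  static double[][] gaussianAffinities(double[][] points, double sigma) {
    int n = points.length;
    double[][] a = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = i; j < n; j++) {
        // Squared Euclidean distance between points i and j.
        double sq = 0.0;
        for (int k = 0; k < points[i].length; k++) {
          double d = points[i][k] - points[j][k];
          sq += d * d;
        }
        double v = Math.exp(-sq / (2.0 * sigma * sigma));
        a[i][j] = v;
        a[j][i] = v; // the matrix is symmetric, so fill both halves
      }
    }
    return a;
  }

  public static void main(String[] args) {
    double[][] a = gaussianAffinities(new double[][] {{0, 0}, {3, 4}}, 5.0);
    System.out.println(a[0][0] + " " + a[0][1]);
  }
}
```

A distributed version would compute the same entries block by block; the dense double loop here is only meant to make the formula concrete.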





[jira] [Commented] (MAHOUT-1540) Reuters example for spectral clustering

2015-03-29 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386041#comment-14386041
 ] 

Shannon Quinn commented on MAHOUT-1540:
---

Given that this issue has explicit dependencies on MAHOUT-1538, and Saikat is 
still working on MAHOUT-1539, I propose bumping this to 0.10.1.

Plus, I'll need some assistance from everyone in familiarizing myself with the 
process of converting the Reuters dataset to something I can compute affinities 
from to construct the similarity matrix.

> Reuters example for spectral clustering
> ---
>
> Key: MAHOUT-1540
> URL: https://issues.apache.org/jira/browse/MAHOUT-1540
> Project: Mahout
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>  Labels: DSL, scala, spark
> Fix For: 0.10.1
>
>
> Once MAHOUT-1538 and MAHOUT-1539 are complete, create a working example of 
> spectral clustering using the Reuters dataset.





[jira] [Updated] (MAHOUT-1540) Reuters example for spectral clustering

2015-03-29 Thread Shannon Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shannon Quinn updated MAHOUT-1540:
--
Affects Version/s: (was: 1.0)
   0.9
Fix Version/s: (was: 1.0)
   0.10.1

> Reuters example for spectral clustering
> ---
>
> Key: MAHOUT-1540
> URL: https://issues.apache.org/jira/browse/MAHOUT-1540
> Project: Mahout
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>  Labels: DSL, scala, spark
> Fix For: 0.10.1
>
>
> Once MAHOUT-1538 and MAHOUT-1539 are complete, create a working example of 
> spectral clustering using the Reuters dataset.





Slack

2015-03-29 Thread Pat Ferrel
We have a Mahout slack account that anyone with an n...@apache.org can join. 
It’s really nice for group IM. When we are all working on pushing a release out 
it can be a big help. Send me your _apache_ email and I’ll add you. Then you can 
invite others.

[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386018#comment-14386018
 ] 

Pat Ferrel commented on MAHOUT-1655:


Merged master but now integration fails

[ERROR] 
/Users/pat/mahout/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/LuceneIterableTest.java:[120,14]
 cannot find symbol
  symbol:   method skip(java.util.Iterator,int)
  location: class com.google.common.collect.Iterators
[ERROR] 
/Users/pat/mahout/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/LuceneIterableTest.java:[160,14]
 cannot find symbol
  symbol:   method skip(java.util.Iterator,int)
  location: class com.google.common.collect.Iterators
[ERROR] 
/Users/pat/mahout/integration/src/test/java/org/apache/mahout/utils/vectors/lucene/LuceneIterableTest.java:[163,16]
 cannot find symbol
  symbol:   method skip(java.util.Iterator,int)
 r
[INFO] 3 errors 
[INFO] -
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Mahout Build Tools  SUCCESS [  1.679 s]
[INFO] Apache Mahout . SUCCESS [  0.410 s]
[INFO] Mahout Math ... SUCCESS [01:04 min]
[INFO] Mahout HDFS ... SUCCESS [  2.193 s]
[INFO] Mahout Map-Reduce . SUCCESS [12:49 min]
[INFO] Mahout Integration  FAILURE [  1.743 s]
[INFO] Mahout Examples ... SKIPPED
[INFO] Mahout Release Package  SKIPPED
[INFO] Mahout Math Scala bindings  SKIPPED
[INFO] Mahout Spark bindings . SKIPPED
[INFO] Mahout Spark bindings shell ... SKIPPED
[INFO] Mahout H2O backend  SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 14:00 min
[INFO] Finished at: 2015-03-29T15:34:03-08:00
[INFO] Final Memory: 58M/436M
[INFO] 
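
For anyone hitting the same `cannot find symbol` errors: as far as I know, Guava renamed `Iterators.skip(Iterator, int)` to `Iterators.advance(Iterator, int)` around 13.0 and dropped the old name afterwards, which would be consistent with the integration tests no longer compiling against a newer Guava. A minimal sketch of a version-independent replacement in plain Java follows; `IteratorSkip` and its `advance` method are hypothetical names, not part of Mahout or Guava.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical helper; not part of Mahout or Guava. */
final class IteratorSkip {
  private IteratorSkip() {}

  /**
   * Advances the iterator by up to n elements and returns how many
   * elements were actually skipped (fewer than n if the iterator ran out).
   */
  static int advance(Iterator<?> it, int n) {
    if (n < 0) {
      throw new IllegalArgumentException("n must be non-negative: " + n);
    }
    int skipped = 0;
    while (skipped < n && it.hasNext()) {
      it.next();
      skipped++;
    }
    return skipped;
  }

  public static void main(String[] args) {
    Iterator<String> it = Arrays.asList("a", "b", "c").iterator();
    int skipped = advance(it, 2);
    System.out.println(skipped + " skipped, next is " + it.next());
  }
}
```

Using a helper like this in the test would avoid depending on either Guava method name, at the cost of a few extra lines.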


> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-29 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386005#comment-14386005
 ] 

Shannon Quinn commented on MAHOUT-1659:
---

Pull request created: https://github.com/apache/mahout/pull/88

> Remove deprecated Lanczos solver from spectral clustering in mr-legacy
> --
>
> Key: MAHOUT-1659
> URL: https://issues.apache.org/jira/browse/MAHOUT-1659
> Project: Mahout
>  Issue Type: Task
>  Components: Clustering, mrlegacy
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>Priority: Minor
> Fix For: 0.10.0
>
>
> Spectral clustering still has the option of using either SSVD or the Lanczos 
> solver for dimensionality reduction. Remove the latter entirely.





[jira] [Commented] (MAHOUT-1659) Remove deprecated Lanczos solver from spectral clustering in mr-legacy

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386004#comment-14386004
 ] 

ASF GitHub Bot commented on MAHOUT-1659:


GitHub user magsol opened a pull request:

https://github.com/apache/mahout/pull/88

MAHOUT-1659

Removed the dependency on the Lanczos solver from spectral clustering. Now 
exclusively uses SSVD for dimensionality reduction.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/magsol/mahout master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/88.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #88


commit 8cdcb4efd15ba2ca4622358bb93785d72caf6f38
Author: Shannon Quinn 
Date:   2015-03-29T23:05:58Z

Removed the dependency on the Lanczos solver from spectral clustering. Now 
exclusively uses SSVD for dimensionality reduction.




> Remove deprecated Lanczos solver from spectral clustering in mr-legacy
> --
>
> Key: MAHOUT-1659
> URL: https://issues.apache.org/jira/browse/MAHOUT-1659
> Project: Mahout
>  Issue Type: Task
>  Components: Clustering, mrlegacy
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
>Priority: Minor
> Fix For: 0.10.0
>
>
> Spectral clustering still has the option of using either SSVD or the Lanczos 
> solver for dimensionality reduction. Remove the latter entirely.





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Stevo Slavic (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385986#comment-14385986
 ] 

Stevo Slavic commented on MAHOUT-1655:
--

Yes. Merge or rebase onto master first, to avoid conflicts.

> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385984#comment-14385984
 ] 

Pat Ferrel commented on MAHOUT-1655:


ok so I should:
1) make 11.0.2 the version in the new mahout-mr module?
2) make 14.0.1 the version in the root pom?

The version in the Spark module is not specified so it will get the root pom 
14.0.1 and will be happy with that.

> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Resolved] (MAHOUT-1563) Clean up WARNINGs during build

2015-03-29 Thread Stevo Slavic (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stevo Slavic resolved MAHOUT-1563.
--
Resolution: Fixed

Fixed.

> Clean up WARNINGs during build
> --
>
> Key: MAHOUT-1563
> URL: https://issues.apache.org/jira/browse/MAHOUT-1563
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.9
>Reporter: Andrew Musselman
>Assignee: Stevo Slavic
>Priority: Minor
>  Labels: DSL, scala
> Fix For: 0.10.0
>
>
> We need to clean up warnings in the maven logs.  They seem to have piled up 
> recently; some are about scala lib version conflicts, some are about 
> deprecated APIs, some are about code style.
> Some may be fine for now but extra warnings in build logs feels like bad 
> hygiene to me.
> Some examples:
> [WARNING]  Expected all dependencies to require Scala version: 2.10.3
> [WARNING]  com.twitter:chill_2.10:0.3.1 requires scala version: 2.10.0
> [WARNING] Multiple versions of scala libraries detected!
> [WARNING] 
> /home/akm/mahout/spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmBase.scala:73:
>  warning: a pure expression does nothing in statement position; you may be 
> omitting necessary parentheses
> [INFO] this
> [WARNING]  Expected all dependencies to require Scala version: 2.10.3
> [WARNING]  org.apache.mahout:mahout-math-scala:1.0-SNAPSHOT requires scala 
> version: 2.10.3
> [WARNING]  org.scalatest:scalatest_2.10:2.0 requires scala version: 2.10.0
> [WARNING] Multiple versions of scala libraries detected!
> [WARNING] 
> /home/akm/mahout/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala:132:
>  warning: non-variable type argument Double in type pattern Iterable[Double] 
> is unchecked since it is eliminated by erasure
> [INFO] case t: Iterable[Double] => t.toArray
> [WARNING] 
> /home/akm/mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java:
>  Some input files use or override a deprecated API.
> [WARNING] 
> /home/akm/mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java:
>  Recompile with -Xlint:deprecation for details.





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Stevo Slavic (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385979#comment-14385979
 ] 

Stevo Slavic commented on MAHOUT-1655:
--

The legacy/mr module needs the old Guava, 11.0.2, since that's the version the 
Hadoop libraries depend on. I'm not sure about the integration module. Currently 
11.0.2 is the default, while the modules that need a newer version, like the 
Spark ones, override it. We can switch so that 14.0.1 is the default (in the 
parent pom's dependencyManagement section) and legacy/mr overrides it to the 
older version. Then we also need to clean up the overrides in the Spark modules.
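
The switch proposed above relies on how Maven picks a dependency version: a version declared in a module's own pom takes precedence over the one managed in the parent's dependencyManagement section. A minimal Java sketch that models this rule follows. It does not call Maven; `ManagedVersions` and its methods are hypothetical, and the module names are borrowed from this thread purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of parent-managed versions with per-module overrides. */
final class ManagedVersions {
  // artifact -> version managed in the parent pom
  private final Map<String, String> parentManaged = new HashMap<>();
  // module -> (artifact -> version declared explicitly in that module)
  private final Map<String, Map<String, String>> moduleOverrides = new HashMap<>();

  /** Records the default version, as the parent's dependencyManagement would. */
  void manage(String artifact, String version) {
    parentManaged.put(artifact, version);
  }

  /** Records an explicit version declared in a single module. */
  void override(String module, String artifact, String version) {
    moduleOverrides.computeIfAbsent(module, m -> new HashMap<>()).put(artifact, version);
  }

  /** The module's explicit version wins; otherwise the parent default applies. */
  String resolve(String module, String artifact) {
    Map<String, String> overrides = moduleOverrides.get(module);
    if (overrides != null && overrides.containsKey(artifact)) {
      return overrides.get(artifact);
    }
    return parentManaged.get(artifact);
  }

  public static void main(String[] args) {
    ManagedVersions pom = new ManagedVersions();
    pom.manage("com.google.guava:guava", "14.0.1");
    pom.override("mahout-mr", "com.google.guava:guava", "11.0.2");
    System.out.println("mahout-mr    -> " + pom.resolve("mahout-mr", "com.google.guava:guava"));
    System.out.println("mahout-spark -> " + pom.resolve("mahout-spark", "com.google.guava:guava"));
  }
}
```

With the default flipped this way, only the one module that needs the older version carries an override, which is why the Spark-side overrides would become redundant.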

> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385977#comment-14385977
 ] 

Pat Ferrel commented on MAHOUT-1655:


Ah, ok. Thanks, I'd never have figured that out.

Just to get this on the table: Mahout's spark-shell doesn't need it AFAIK. 
mahout-math-scala does but waits until building the mahout-spark module to 
create a mahout-spark...dependency-reduced.jar that is passed to the context 
(along with the non transitive dependency mahout jars) when it's created. This 
gets it to the workers where it is used. Does this work for running jobs on 
Spark?

For mapreduce it will go into the mr...job.jar where it will be passed to 
hadoop mapreduce.

There is so much confusion over this, in my mind anyhow. Do you think this is 
the right thing to do?

So every module will use 14.0.1?


> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1563) Clean up WARNINGs during build

2015-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385975#comment-14385975
 ] 

Hudson commented on MAHOUT-1563:


SUCCESS: Integrated in Mahout-Quality #3025 (See 
[https://builds.apache.org/job/Mahout-Quality/3025/])
MAHOUT-1563: Scala binary version classifier is now in h2o module artifact id, 
since it's a Scala module (sslavic: rev 
089275713f5f49499d939569fd9929f2f50ce1f3)
* pom.xml
* h2o/pom.xml







[jira] [Commented] (MAHOUT-1590) mahout unit test failures due to guava version conflict on hadoop 2

2015-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385968#comment-14385968
 ] 

Hudson commented on MAHOUT-1590:


SUCCESS: Integrated in Mahout-Quality #3024 (See 
[https://builds.apache.org/job/Mahout-Quality/3024/])
MAHOUT-1590 Spark expects guava 14.0.1 (sslavic: rev 
205896fde04baa774f77703c5867f09b20895e6b)
* spark-shell/pom.xml


> mahout unit test failures due to guava version conflict on hadoop 2
> ---
>
> Key: MAHOUT-1590
> URL: https://issues.apache.org/jira/browse/MAHOUT-1590
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: Hadoop 2.x 
>Reporter: Venkat Ranganathan
>Assignee: Stevo Slavic
>  Labels: DSL, scala, spark
> Fix For: 0.10.0
>
> Attachments: MAHOUT-1590.patch
>
>
> Running 
>mvn clean test -Dhadoop2.version=2.4.0 
> has many unit test failures because of a guava version mismatch.
> For example:
> ==
> completeJobToyExample(org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJobTest)
>   Time elapsed: 0.736 sec  <<< ERROR!
> java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
> at 
> __randomizedtesting.SeedInfo.seed([BEBBF9ACD237F984:B570D1523391FD4E]:0)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:278)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:375)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:493)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:510)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
> at 
> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob.run(ParallelALSFactorizationJob.java:172)
> at 
> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJobTest.explicitExample(ParallelALSFactorizationJobTest.java:112)
> at 
> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJobTest.completeJobToyExample(ParallelALSFactorizationJobTest.java:71)
> =
> hadoop mapreduce V2 is using guava v11.0.2 and mahout is using guava version 
> 16.0
> After trying different versions guava version 14.0 seems to have hadoop MR V2 
> compatible jars and mahout needed classes. 
> The unit tests ran successfully after changing the dependency in mahout to 
> v14.0
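
When a NoSuchMethodError like the one quoted above shows up, one way to see which jar actually supplied a class is to ask the JVM for its code source. A minimal sketch using only the JDK follows; `ClassOrigin` and `locationOf` are hypothetical names, and on a real classpath one would pass the class in question (for example Guava's Stopwatch) rather than the classes used in `main` here.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper; uses only JDK APIs. */
final class ClassOrigin {
  private ClassOrigin() {}

  /**
   * Returns the location (jar or directory URL) a class was loaded from,
   * or a placeholder when the JVM does not report one, as is typical
   * for classes loaded by the bootstrap loader.
   */
  static String locationOf(Class<?> type) {
    CodeSource source = type.getProtectionDomain().getCodeSource();
    if (source == null || source.getLocation() == null) {
      return "(bootstrap or unknown)";
    }
    return source.getLocation().toString();
  }

  public static void main(String[] args) {
    System.out.println("String:      " + locationOf(String.class));
    System.out.println("ClassOrigin: " + locationOf(ClassOrigin.class));
  }
}
```

If two Guava versions are on the classpath, the printed location shows which one won, which is usually the quickest way to confirm a conflict of this kind.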





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Stevo Slavic (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385957#comment-14385957
 ] 

Stevo Slavic commented on MAHOUT-1655:
--

Spark expects Guava, and specifically version 14.0.1, which is now provided.
Before the change, 11.0.2 was on the classpath, and its HashFunction is missing the 
hashInt method, which was added in Guava 12 (see 
[here|https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/HashFunction.java?name=v14.0.1#158]),
 causing the "NoSuchMethodError: com.google.common.hash.HashFunction.hashInt" 
to be thrown.
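
A quick way to check which situation a given classpath is in is to probe for the method by reflection. A minimal sketch, assuming Guava may or may not be on the classpath; `MethodProbe` and `hasMethod` are hypothetical names, and only JDK reflection calls are used.

```java
/** Hypothetical probe; uses only JDK reflection. */
final class MethodProbe {
  private MethodProbe() {}

  /**
   * Returns true if the named class is loadable and exposes a public
   * method with the given name and parameter types; false otherwise.
   */
  static boolean hasMethod(String className, String methodName, Class<?>... parameterTypes) {
    try {
      // Load without initializing, so probing has no side effects.
      Class<?> type = Class.forName(className, false, MethodProbe.class.getClassLoader());
      type.getMethod(methodName, parameterTypes);
      return true;
    } catch (ClassNotFoundException | NoSuchMethodException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    boolean present =
        hasMethod("com.google.common.hash.HashFunction", "hashInt", int.class);
    System.out.println("HashFunction.hashInt(int) present: " + present);
  }
}
```

Running this with the same classpath as the failing job should report whether the older or the newer Guava is the one actually being picked up.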

> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385954#comment-14385954
 ] 

Pat Ferrel commented on MAHOUT-1655:


Why? Can we talk before committing to master? Spark-shell doesn't need Guava, 
and Spark, though it does need it, doesn't care which version (as long as it has 
HashBiMap).

> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1563) Clean up WARNINGs during build

2015-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385952#comment-14385952
 ] 

Hudson commented on MAHOUT-1563:


SUCCESS: Integrated in Mahout-Quality #3023 (See 
[https://builds.apache.org/job/Mahout-Quality/3023/])
MAHOUT-1563: Eliminated warnings about multiple scala versions (sslavic: rev 
6cf991c60ac73dbd32b26d8ea0b773ac07d16193)
* spark/pom.xml
* spark-shell/pom.xml
* math-scala/pom.xml
* math/pom.xml
* h2o/pom.xml
* pom.xml
* CHANGELOG







[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Stevo Slavic (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385950#comment-14385950
 ] 

Stevo Slavic commented on MAHOUT-1655:
--

I've just added it, and pushed to master, so just merge/rebase.

> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Stevo Slavic (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385948#comment-14385948
 ] 

Stevo Slavic commented on MAHOUT-1655:
--

[~pferrel], just add to mahout-spark-shell module dependencies

{noformat}
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>14.0.1</version>
</dependency>
{noformat}

> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





Re: [jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Andrew Palumbo

aha .. oops..
On 03/29/2015 03:57 PM, Suneel Marthi wrote:

version# is in the parent pom, math pom references the parent :)

On Sun, Mar 29, 2015 at 3:52 PM, ASF GitHub Bot (JIRA) 
wrote:


 [
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385917#comment-14385917
]

ASF GitHub Bot commented on MAHOUT-1655:


Github user andrewpalumbo commented on the pull request:

 https://github.com/apache/mahout/pull/86#issuecomment-87462909

 no version number in the math-scala pom.



Refactor module dependencies


 Key: MAHOUT-1655
 URL: https://issues.apache.org/jira/browse/MAHOUT-1655
 Project: Mahout
  Issue Type: Improvement
  Components: mrlegacy
Affects Versions: 0.9
Reporter: Pat Ferrel
Assignee: Andrew Musselman
Priority: Critical
 Fix For: 0.10.0


Make a new module, call it mahout-hadoop. Move anything there that is

currently in mrlegacy but used in math-scala or spark. Remove dependencies
on mrlegacy altogether if possible by using other core classes.

The goal is to have math-scala and spark module depend on math, and a

small module called mahout-hadoop (much smaller than mrlegacy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Stevo Slavic (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385919#comment-14385919
 ] 

Stevo Slavic commented on MAHOUT-1655:
--

[~pferrel] I'm working on MAHOUT-1563. It's all done, just waiting for last 
build to pass, making sure build passes with hadoop1 (1.2.1) as well.

> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





Re: [jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Suneel Marthi
version# is in the parent pom, math pom references the parent :)

On Sun, Mar 29, 2015 at 3:52 PM, ASF GitHub Bot (JIRA) 
wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385917#comment-14385917
> ]
>
> ASF GitHub Bot commented on MAHOUT-1655:
> 
>
> Github user andrewpalumbo commented on the pull request:
>
> https://github.com/apache/mahout/pull/86#issuecomment-87462909
>
> no version number in the math-scala pom.
>
>
> > Refactor module dependencies
> > 
> >
> > Key: MAHOUT-1655
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> > Project: Mahout
> >  Issue Type: Improvement
> >  Components: mrlegacy
> >Affects Versions: 0.9
> >Reporter: Pat Ferrel
> >Assignee: Andrew Musselman
> >Priority: Critical
> > Fix For: 0.10.0
> >
> >
> > Make a new module, call it mahout-hadoop. Move anything there that is
> currently in mrlegacy but used in math-scala or spark. Remove dependencies
> on mrlegacy altogether if possible by using other core classes.
> > The goal is to have math-scala and spark module depend on math, and a
> small module called mahout-hadoop (much smaller than mrlegacy).
>
>
>
>


[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385917#comment-14385917
 ] 

ASF GitHub Bot commented on MAHOUT-1655:


Github user andrewpalumbo commented on the pull request:

https://github.com/apache/mahout/pull/86#issuecomment-87462909
  
no version number in the math-scala pom.


> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385914#comment-14385914
 ] 

ASF GitHub Bot commented on MAHOUT-1655:


Github user smarthi commented on the pull request:

https://github.com/apache/mahout/pull/86#issuecomment-87462164
  
It's still referenced in the parent pom with version 11.0.2. Is it possible that
the offending class is missing from this Guava version?
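
One way to answer that question empirically is to ask the running JVM where it loaded the class from and whether the method exists. The sketch below is only an illustration, not code from the pull request: the class name ClasspathProbe and its two helpers are hypothetical, and it assumes it is run with the same classpath as the failing job. As far as I know, HashFunction.hashInt(int) first appeared in Guava 12.0, which would be consistent with an 11.0.2 jar winning on the classpath, but treat that as an assumption to verify.

```java
import java.security.CodeSource;

/** Diagnostic sketch (hypothetical helper, not part of Mahout or Spark). */
class ClasspathProbe {

    /** Returns true if the named class is loadable and has a public method with this signature. */
    static boolean hasMethod(String className, String methodName, Class<?>... paramTypes) {
        try {
            Class.forName(className).getMethod(methodName, paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    /** Returns the jar or directory the class was loaded from, or a short note if unknown. */
    static String locationOf(String className) {
        try {
            CodeSource source = Class.forName(className).getProtectionDomain().getCodeSource();
            return source == null ? "(no code source reported)" : source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // The class and method named in the NoSuchMethodError from the pull request.
        String cls = "com.google.common.hash.HashFunction";
        System.out.println("loaded from: " + locationOf(cls));
        System.out.println("has hashInt(int): " + hasMethod(cls, "hashInt", int.class));
    }
}
```

If the reported location is an older Guava jar than the one Spark was built against, the conflict is on the classpath rather than in the code.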

On Sun, Mar 29, 2015 at 3:43 PM, Andrew Palumbo wrote:

> It was in the mahout-math pom last I saw.



> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385911#comment-14385911
 ] 

ASF GitHub Bot commented on MAHOUT-1655:


Github user andrewpalumbo commented on the pull request:

https://github.com/apache/mahout/pull/86#issuecomment-87461957
  
It was in the mahout-math pom last I saw.


> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385910#comment-14385910
 ] 

ASF GitHub Bot commented on MAHOUT-1655:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/86#issuecomment-87461773
  
What is the status of Guava in the rest of the project?


> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385909#comment-14385909
 ] 

Pat Ferrel commented on MAHOUT-1655:


OK, what is the status of Guava with the work [~sslavic] was doing?



> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Assigned] (MAHOUT-1661) Deprecate Lanczos from the code base

2015-03-29 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reassigned MAHOUT-1661:
-

Assignee: Shannon Quinn

> Deprecate Lanczos from the code base
> 
>
> Key: MAHOUT-1661
> URL: https://issues.apache.org/jira/browse/MAHOUT-1661
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Suneel Marthi
>Assignee: Shannon Quinn
>Priority: Critical
> Fix For: 0.10.0
>
>
> Lanczos has long been deprecated from the code base but the code doesn't
> reflect that. It's only used now in Spectral KMeans, which needs to be
> refactored to use SSVD.





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385907#comment-14385907
 ] 

Pat Ferrel commented on MAHOUT-1655:


https://github.com/apache/mahout/pull/86

Builds, passes unit tests, and launches spark-shell, but fails
spark-itemsimilarity because google.common is missing. I expect mapreduce to fail
too because of that, but maybe it's in the job jar for mapreduce.

> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1661) Deprecate Lanczos from the code base

2015-03-29 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385908#comment-14385908
 ] 

Suneel Marthi commented on MAHOUT-1661:
---

Assigning this to Shannon now; if someone else would like to make a PR for this,
please feel free to do so.

> Deprecate Lanczos from the code base
> 
>
> Key: MAHOUT-1661
> URL: https://issues.apache.org/jira/browse/MAHOUT-1661
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Suneel Marthi
>Priority: Critical
> Fix For: 0.10.0
>
>
> Lanczos has long been deprecated from the code base but the code doesn't
> reflect that. It's only used now in Spectral KMeans, which needs to be
> refactored to use SSVD.





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385905#comment-14385905
 ] 

ASF GitHub Bot commented on MAHOUT-1655:


Github user smarthi commented on the pull request:

https://github.com/apache/mahout/pull/86#issuecomment-87461277
  
It's part of Guava.


> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Created] (MAHOUT-1661) Deprecate Lanczos from the code base

2015-03-29 Thread Suneel Marthi (JIRA)
Suneel Marthi created MAHOUT-1661:
-

 Summary: Deprecate Lanczos from the code base
 Key: MAHOUT-1661
 URL: https://issues.apache.org/jira/browse/MAHOUT-1661
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.9
Reporter: Suneel Marthi
Priority: Critical
 Fix For: 0.10.0


Lanczos has long been deprecated from the code base but the code doesn't
reflect that. It's only used now in Spectral KMeans, which needs to be
refactored to use SSVD.







[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385904#comment-14385904
 ] 

ASF GitHub Bot commented on MAHOUT-1655:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/86#issuecomment-87461211
  
Not sure where google.common is supposed to come from, looking...


> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385902#comment-14385902
 ] 

ASF GitHub Bot commented on MAHOUT-1655:


GitHub user pferrel opened a pull request:

https://github.com/apache/mahout/pull/86

MAHOUT-1655

Refactors mr-legacy into mahout-hdfs and mahout-mr

Compiles and completes unit tests, and can launch spark-shell, but
spark-itemsimilarity fails with the following error:

15/03/29 12:22:12 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.0.7:52857/user/HeartbeatReceiver
15/03/29 12:22:12 WARN BlockManager: Putting block broadcast_0 failed
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
    at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
    at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
    at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
    at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
    at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:126)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:104)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pferrel/mahout MAHOUT-1655

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/86.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #86


commit c783c0a91a5f6f1f279b77de6f52ccb39292b5e9
Author: Andrew Musselman 
Date:   2015-03-26T00:38:56Z

Moving mrlegacy directory to hdfs, starting minimal mr directory, 
repointing references to mrlegacy everywhere.

commit 1dc3662ac2f548d6ecc784aababda806aa5e7578
Author: Andrew Musselman 
Date:   2015-03-26T14:24:44Z

Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/mahout 
into MAHOUT-1655

commit 943d982f9ab08c1a9478ac40b02c4938091ec095
Author: Andrew Musselman 
Date:   2015-03-26T17:00:12Z

Merge branch 'master' into MAHOUT-1655

commit 1c65f2f441e4f504cdb6d215397097a3296eac15
Author: Andrew Musselman 
Date:   2015-03-26T17:13:17Z

Moving contents of hdfs over to mr.

commit 5c8e964991c88813a4cb8abe367245c3c7838246
Author: pferrel 
Date:   2015-03-29T16:44:02Z

Merge branch 'MAHOUT-1655' of https://github.com/andrewmusselman/mahout 
into MAHOUT-1655

commit 2d940a04ad06cdeca23731355ef61f1d25d9d2d5
Author: pferrel 
Date:   2015-03-29T18:12:21Z

moved classes into mahout-hdfs and created a dependency in mahout-mr for 
the module

commit 7cda0918c5bef50f2e32b70c3439fc3656d804f8
Author: pferrel 
Date:   2015-03-29T18:13:25Z

added junits for moved classes

commit c2b18eebb9009bf0676566e8e9a30332441c2331
Author: pferrel 
Date:   2015-03-29T19:20:11Z

no need to pass mapreduce jar to Spark context now




> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 




Re: Lanczos Deprecation - No ??

2015-03-29 Thread Andrew Palumbo


On 03/29/2015 03:07 PM, Andrew Palumbo wrote:

done.

(on the wiki)


On 03/29/2015 02:59 PM, Suneel Marthi wrote:

Please mark it as deprecated immediately both in the code and on the Wiki.

Shannon, if you could refactor Spectral KMeans to use SSVD for the upcoming
release, we could purge this for good.

On Sun, Mar 29, 2015 at 2:32 PM, Andrew Palumbo wrote:

I'll update it as deprecated on the algorithms page. Let me know if you
want it removed completely.

On 03/29/2015 02:25 PM, Suneel Marthi wrote:

I was under the impression that Lanczos had been long deprecated in Mahout
now for a few years and the reason we still have it was only because it was
being used by Spectral KMeans clustering.

See
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-math/0.9/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

It's still not deprecated as of 0.9, even though the release notes for 0.8
call out that it will be deprecated in a future release. As one
who finalized the release notes for both the 0.8 and 0.9 releases, I
apologize for this slip-up.

While the community internally understands that Lanczos has long been
deprecated, without explicitly marking the code as "deprecated" and
updating the project wiki documentation to reflect the same, it wouldn't be
apparent to a user of the project, and it's possible that it's referenced in
talks for comparison with SSVD or other similar methods.

Apologies to all of those who have been using/referencing Lanczos in their
work or talks.

For the upcoming release:

1.  Refactor Spectral KMeans to use SSVD (Shannon, this is now a high
priority)
2.  Deprecate Lanczos, I prefer completely purging it from the codebase.
3.  Ensure the Release notes and Wiki documentation reflect that very
clearly.

Thanks.








Re: Lanczos Deprecation - No ??

2015-03-29 Thread Andrew Palumbo

done.
On 03/29/2015 02:59 PM, Suneel Marthi wrote:

Please mark it as deprecated immediately both in the code and on the Wiki.

Shannon, if you could refactor Spectral KMeans to use SSVD for the upcoming
release, we could purge this for good.

On Sun, Mar 29, 2015 at 2:32 PM, Andrew Palumbo  wrote:


I'll update it as deprecated on the algorithms page. Let me know if you
want it removed completely.


On 03/29/2015 02:25 PM, Suneel Marthi wrote:


I was under the impression that Lanczos had been long deprecated in Mahout
now for a few years and the reason we still have it was only because it
was
being used by Spectral KMeans clustering.

See
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-math/0.9/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

It's still not deprecated as of 0.9, even though the release notes for 0.8
call out that it will be deprecated in a future release. As one
who finalized the release notes for both the 0.8 and 0.9 releases, I
apologize for this slip-up.

While the community internally understands that Lanczos has long been
deprecated, without explicitly marking the code as "deprecated" and
updating the project wiki documentation to reflect the same, it wouldn't be
apparent to a user of the project, and it's possible that it's referenced in
talks for comparison with SSVD or other similar methods.

Apologies to all of those who have been using/referencing Lanczos in their
work or talks.

For the upcoming release:

1.  Refactor Spectral KMeans to use SSVD (Shannon, this is now a high
priority)
2.  Deprecate Lanczos, I prefer completely purging it from the codebase.
3.  Ensure the Release notes and Wiki documentation reflect that very
clearly.

Thanks.






[jira] [Commented] (MAHOUT-1477) Clean up website on Logistic Regression

2015-03-29 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385894#comment-14385894
 ] 

Andrew Palumbo commented on MAHOUT-1477:


Thanks Suneel, added references and links for Frank's blog and a link to the 
mahout bank marketing example page.  I'll close this tomorrow if I don't hear 
back from Frank or Ted. 

> Clean up website on Logistic Regression
> ---
>
> Key: MAHOUT-1477
> URL: https://issues.apache.org/jira/browse/MAHOUT-1477
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Andrew Palumbo
>  Labels: legacy
> Fix For: 0.10.0
>
>
> The website on Logistic regression needs clean up. We need to go through the 
> text, remove dead links and check whether the information is still consistent 
> with the current code. We should also link to the example created in 
> MAHOUT-1425 
> https://mahout.apache.org/users/classification/logistic-regression.html





Re: Lanczos Deprecation - No ??

2015-03-29 Thread Suneel Marthi
Please mark it as deprecated immediately both in the code and on the Wiki.

Shannon, if you could refactor Spectral KMeans to use SSVD for the upcoming
release, we could purge this for good.

On Sun, Mar 29, 2015 at 2:32 PM, Andrew Palumbo  wrote:

> I'll update it as deprecated on the algorithms page. Let me know if you
> want it removed completely.
>
>
> On 03/29/2015 02:25 PM, Suneel Marthi wrote:
>
>> I was under the impression that Lanczos had been long deprecated in Mahout
>> now for a few years and the reason we still have it was only because it
>> was
>> being used by Spectral KMeans clustering.
>>
>> See
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-math/0.9/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>
>> It's still not deprecated as of 0.9, even though the release notes for 0.8
>> call out that it will be deprecated in a future release. As one
>> who finalized the release notes for both the 0.8 and 0.9 releases, I
>> apologize for this slip-up.
>>
>> While the community internally understands that Lanczos has long been
>> deprecated, without explicitly marking the code as "deprecated" and
>> updating the project wiki documentation to reflect the same, it wouldn't be
>> apparent to a user of the project, and it's possible that it's referenced in
>> talks for comparison with SSVD or other similar methods.
>>
>> Apologies to all of those who have been using/referencing Lanczos in their
>> work or talks.
>>
>> For the upcoming release:
>>
>> 1.  Refactor Spectral KMeans to use SSVD (Shannon, this is now a high
>> priority)
>> 2.  Deprecate Lanczos, I prefer completely purging it from the codebase.
>> 3.  Ensure the Release notes and Wiki documentation reflect that very
>> clearly.
>>
>> Thanks.
>>
>>
>


Re: Lanczos Deprecation - No ??

2015-03-29 Thread Andrew Palumbo
I'll update it as deprecated on the algorithms page. Let me know if you 
want it removed completely.


On 03/29/2015 02:25 PM, Suneel Marthi wrote:

I was under the impression that Lanczos had been long deprecated in Mahout
now for a few years and the reason we still have it was only because it was
being used by Spectral KMeans clustering.

See
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-math/0.9/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

It's still not deprecated as of 0.9, even though the release notes for 0.8
call out that it will be deprecated in a future release. As one
who finalized the release notes for both the 0.8 and 0.9 releases, I
apologize for this slip-up.

While the community internally understands that Lanczos has long been
deprecated, without explicitly marking the code as "deprecated" and
updating the project wiki documentation to reflect the same, it wouldn't be
apparent to a user of the project, and it's possible that it's referenced in
talks for comparison with SSVD or other similar methods.

Apologies to all of those who have been using/referencing Lanczos in their
work or talks.

For the upcoming release:

1.  Refactor Spectral KMeans to use SSVD (Shannon, this is now a high
priority)
2.  Deprecate Lanczos, I prefer completely purging it from the codebase.
3.  Ensure the Release notes and Wiki documentation reflect that very
clearly.

Thanks.





Lanczos Deprecation - No ??

2015-03-29 Thread Suneel Marthi
I was under the impression that Lanczos had been long deprecated in Mahout
now for a few years and the reason we still have it was only because it was
being used by Spectral KMeans clustering.

See
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-math/0.9/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

It's still not deprecated as of 0.9, even though the release notes for 0.8
call out that it will be deprecated in a future release. As one
who finalized the release notes for both the 0.8 and 0.9 releases, I
apologize for this slip-up.

While the community internally understands that Lanczos has long been
deprecated, without explicitly marking the code as "deprecated" and
updating the project wiki documentation to reflect the same, it wouldn't be
apparent to a user of the project, and it's possible that it's referenced in
talks for comparison with SSVD or other similar methods.

Apologies to all of those who have been using/referencing Lanczos in their
work or talks.

For the upcoming release:

1.  Refactor Spectral KMeans to use SSVD (Shannon, this is now a high
priority)
2.  Deprecate Lanczos, I prefer completely purging it from the codebase.
3.  Ensure the Release notes and Wiki documentation reflect that very
clearly.

Thanks.
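
For readers unfamiliar with what "explicitly marking the code as deprecated" involves in Java, here is a minimal sketch. It assumes a stand-in class named LegacySolver, which is hypothetical and is not the actual Mahout LanczosSolver; the Javadoc wording is likewise only illustrative. The annotation makes the compiler warn callers, and the @deprecated Javadoc tag tells them what to use instead.

```java
/**
 * Hypothetical stand-in for a solver being retired; not the actual Mahout class.
 *
 * @deprecated superseded by a stochastic SVD implementation; scheduled for removal.
 */
@Deprecated
class LegacySolver {
    /** Placeholder method so callers have something to invoke. */
    int rank() {
        return 0;
    }
}

/** Small check that the marker is visible at runtime as well as at compile time. */
class DeprecationDemo {
    /** True if the class carries the runtime-retained @Deprecated annotation. */
    static boolean isDeprecated(Class<?> type) {
        return type.isAnnotationPresent(Deprecated.class);
    }

    public static void main(String[] args) {
        System.out.println("LegacySolver deprecated: " + isDeprecated(LegacySolver.class));
    }
}
```

Compiling callers with -Xlint:deprecation then lists each use of the deprecated class, which gives a rough inventory of what still depends on it before it is purged.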


[jira] [Work started] (MAHOUT-1563) Clean up WARNINGs during build

2015-03-29 Thread Stevo Slavic (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1563 started by Stevo Slavic.

> Clean up WARNINGs during build
> --
>
> Key: MAHOUT-1563
> URL: https://issues.apache.org/jira/browse/MAHOUT-1563
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.9
>Reporter: Andrew Musselman
>Assignee: Stevo Slavic
>Priority: Minor
>  Labels: DSL, scala
> Fix For: 0.10.0
>
>
> We need to clean up warnings in the maven logs.  They seem to have piled up 
> recently; some are about scala lib version conflicts, some are about 
> deprecated APIs, some are about code style.
> Some may be fine for now but extra warnings in build logs feel like bad
> hygiene to me.
> Some examples:
> [WARNING]  Expected all dependencies to require Scala version: 2.10.3
> [WARNING]  com.twitter:chill_2.10:0.3.1 requires scala version: 2.10.0
> [WARNING] Multiple versions of scala libraries detected!
> [WARNING] 
> /home/akm/mahout/spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmBase.scala:73:
>  warning: a pure expression does nothing in statement position; you may be 
> omitting necessary parentheses
> [INFO] this
> [WARNING]  Expected all dependencies to require Scala version: 2.10.3
> [WARNING]  org.apache.mahout:mahout-math-scala:1.0-SNAPSHOT requires scala 
> version: 2.10.3
> [WARNING]  org.scalatest:scalatest_2.10:2.0 requires scala version: 2.10.0
> [WARNING] Multiple versions of scala libraries detected!
> [WARNING] 
> /home/akm/mahout/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala:132:
>  warning: non-variable type argument Double in type pattern Iterable[Double] 
> is unchecked since it is eliminated by erasure
> [INFO] case t: Iterable[Double] => t.toArray
> [WARNING] 
> /home/akm/mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java:
>  Some input files use or override a deprecated API.
> [WARNING] 
> /home/akm/mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java:
>  Recompile with -Xlint:deprecation for details.





[jira] [Assigned] (MAHOUT-1563) Clean up WARNINGs during build

2015-03-29 Thread Stevo Slavic (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stevo Slavic reassigned MAHOUT-1563:


Assignee: Stevo Slavic  (was: Andrew Musselman)

> Clean up WARNINGs during build
> --
>
> Key: MAHOUT-1563
> URL: https://issues.apache.org/jira/browse/MAHOUT-1563
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.9
>Reporter: Andrew Musselman
>Assignee: Stevo Slavic
>Priority: Minor
>  Labels: DSL, scala
> Fix For: 0.10.0
>
>
> We need to clean up warnings in the maven logs.  They seem to have piled up 
> recently; some are about scala lib version conflicts, some are about 
> deprecated APIs, some are about code style.
> Some may be fine for now but extra warnings in build logs feel like bad
> hygiene to me.
> Some examples:
> [WARNING]  Expected all dependencies to require Scala version: 2.10.3
> [WARNING]  com.twitter:chill_2.10:0.3.1 requires scala version: 2.10.0
> [WARNING] Multiple versions of scala libraries detected!
> [WARNING] 
> /home/akm/mahout/spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmBase.scala:73:
>  warning: a pure expression does nothing in statement position; you may be 
> omitting necessary parentheses
> [INFO] this
> [WARNING]  Expected all dependencies to require Scala version: 2.10.3
> [WARNING]  org.apache.mahout:mahout-math-scala:1.0-SNAPSHOT requires scala 
> version: 2.10.3
> [WARNING]  org.scalatest:scalatest_2.10:2.0 requires scala version: 2.10.0
> [WARNING] Multiple versions of scala libraries detected!
> [WARNING] 
> /home/akm/mahout/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala:132:
>  warning: non-variable type argument Double in type pattern Iterable[Double] 
> is unchecked since it is eliminated by erasure
> [INFO] case t: Iterable[Double] => t.toArray
> [WARNING] 
> /home/akm/mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java:
>  Some input files use or override a deprecated API.
> [WARNING] 
> /home/akm/mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java:
>  Recompile with -Xlint:deprecation for details.





[jira] [Commented] (MAHOUT-1655) Refactor module dependencies

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385873#comment-14385873
 ] 

ASF GitHub Bot commented on MAHOUT-1655:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/83#issuecomment-87452443
  
intermediate branch that compiles here: 
https://github.com/pferrel/mahout/tree/MAHOUT-1655


> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 





[jira] [Commented] (MAHOUT-1477) Clean up website on Logistic Regression

2015-03-29 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385872#comment-14385872
 ] 

Suneel Marthi commented on MAHOUT-1477:
---

Frank's got a good blog post about Mahout's Logistic Regression here 
http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/.

It would be good to reference that.

> Clean up website on Logistic Regression
> ---
>
> Key: MAHOUT-1477
> URL: https://issues.apache.org/jira/browse/MAHOUT-1477
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Andrew Palumbo
>  Labels: legacy
> Fix For: 0.10.0
>
>
> The website on Logistic regression needs clean up. We need to go through the 
> text, remove dead links and check whether the information is still consistent 
> with the current code. We should also link to the example created in 
> MAHOUT-1425 
> https://mahout.apache.org/users/classification/logistic-regression.html





[jira] [Commented] (MAHOUT-1477) Clean up website on Logistic Regression

2015-03-29 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385871#comment-14385871
 ] 

Andrew Palumbo commented on MAHOUT-1477:


I'll close this out tomorrow (3/30) if I don't hear back from the original 
authors. [~frankscholten], [~tdunning] - not sure if it's either of you 
guys, but could you take a look?

> Clean up website on Logistic Regression
> ---
>
> Key: MAHOUT-1477
> URL: https://issues.apache.org/jira/browse/MAHOUT-1477
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Andrew Palumbo
>  Labels: legacy
> Fix For: 0.10.0
>
>
> The website on Logistic regression needs clean up. We need to go through the 
> text, remove dead links and check whether the information is still consistent 
> with the current code. We should also link to the example created in 
> MAHOUT-1425 
> https://mahout.apache.org/users/classification/logistic-regression.html





Re: Mahout 0.10.0 Bug bash

2015-03-29 Thread Andrew Palumbo

Sometimes it comes up and sometimes it doesn't, but it is resolved.

On 03/29/2015 01:57 PM, Suneel Marthi wrote:

yeah I noticed the weirdness with M-1609 too. Well, let's keep that out of
the daily bug bash.

On Sun, Mar 29, 2015 at 1:55 PM, Andrew Palumbo  wrote:


yeah there's something weird going on with  M-1609, but I closed it on
Friday.


On 03/29/2015 12:36 PM, Andrew Musselman wrote:


Sunday's:

Andrew Palumbo
--
M-1477: Clean up website on Logistic Regression
M-1493: Port Naive Bayes to Spark DSL(Patch available)
M-1559: Documentation and cleanup for Naive Bayes Example
M-1564: Naive Bayes classifier for new Text Documents
M-1609: NullPointerException(This bug is not showing up aside from its
title)
M-1635: Getting an exception when I provide classification labels manually
for Naive Bayes
M-1638: H2O bindings fail at drmParallelizeWithRowLabels
M-1648: Update CMS for Mahout 0.10.0

Andrew Musselman
-
M-1462: Cleaning up Random Forests documentation on Mahout website
M-1470: LDA Topic dump
M-1522: Handle logging levels via log4j.xml
M-1563: cleanup Warnings during Build
M-1655: Refactor module dependencies

Dmitriy Lyubimov
--
M-1646: Refactor out all legacy MR dependencies from scala code

Frank Scholten
-
M-1625: lucene2seq: failure to convert a document that does not contain a
field (the field is not required)
M-1649: Lucene 5 upgrade

Pat Ferrel
-
M-1589: mahout.cmd has duplicated content(Patch available)

Suneel Marthi
-
M-1469: Streaming KMeans fails when executed in MR mode and
REDUCE_STREAMING_KMEANS set to true
M-1512: Hadoop 2 compatibility
M-1585: Javadocs not hosted by Mahout-Quality
M-1586: Collections downloads must have hash signatures
M-1619: HighDFWordsPruner overwrites cache files
M-1647: The release build is incomplete
M-1652: Java 7 update
M-1656: Change SNAPSHOT version from 1.0 to 0.10
M-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf

Stevo Slavic

M-1277: Lose dependency on custom commons-cli
M-1278: Improve inheritance of apache parent pom
M-1562: Publish Scaladocs
M-1602: Euclidean Distance Similarity Math
M-1650: upgrade 3rd party jars

Shannon Quinn
---
M-1538: Port spectral clustering to Mahout DSL
M-1539: Implement affinity matrix computation in Mahout DSL
M-1659: Remove deprecated Lanczos solver from spectral clustering in
mr-legacy

Sebastian Schelter
--
M-1584: Create a detailed example of how to index an arbitrary dataset and
run LDA on it(Patch available)

Gokhan Capan
--
M-1626: Support for required quasi-algebraic operations and starting with
aggregating rows/blocks

Unassigned
--
M-1516: run classify-20newsgroups.sh failed cause by
/tmp/mahout-work-jpan/20news-all does not exists in hdfs.(Patch
available)
M-1551: Add document to describe how to use mlp with command line(Patch
available)
M-1557: Add support for sparse training vectors in MLP(Patch available)
M-1593: cluster-reuters.sh does not work complaining
java.lang.IllegalStateException(Patch available)
M-1594: Example factorize-movielens-1M.sh does not use HDFS(Patch
available)
M-1633: Failure to execute query when solr index contains documents with
different fields
M-1634: ALS don't work when it adds new files in Distributed Cache
   (Patch available)
M-1637: RecommenderJob of ALS fails in the mapper because it uses the
instance of other class






[jira] [Commented] (MAHOUT-1653) Spark 1.3

2015-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385869#comment-14385869
 ] 

ASF GitHub Bot commented on MAHOUT-1653:


Github user andrewpalumbo commented on the pull request:

https://github.com/apache/mahout/pull/82#issuecomment-87450688
  
My plan for this was to branch just before the code freeze and apply it 
then so we don't have to keep 2 branches in sync.


> Spark 1.3
> -
>
> Key: MAHOUT-1653
> URL: https://issues.apache.org/jira/browse/MAHOUT-1653
> Project: Mahout
>  Issue Type: Dependency upgrade
>Affects Versions: 0.10.1
>Reporter: Andrew Musselman
>Assignee: Andrew Palumbo
>
> Support Spark 1.3





Re: Mahout 0.10.0 Bug bash

2015-03-29 Thread Suneel Marthi
yeah I noticed the weirdness with M-1609 too. Well, let's keep that out of
the daily bug bash.

On Sun, Mar 29, 2015 at 1:55 PM, Andrew Palumbo  wrote:

> yeah there's something weird going on with  M-1609, but I closed it on
> Friday.
>
>
> On 03/29/2015 12:36 PM, Andrew Musselman wrote:
>
>> Sunday's:
>>
>> Andrew Palumbo
>> --
>> M-1477: Clean up website on Logistic Regression
>> M-1493: Port Naive Bayes to Spark DSL(Patch available)
>> M-1559: Documentation and cleanup for Naive Bayes Example
>> M-1564: Naive Bayes classifier for new Text Documents
>> M-1609: NullPointerException(This bug is not showing up aside from its
>> title)
>> M-1635: Getting an exception when I provide classification labels manually
>> for Naive Bayes
>> M-1638: H2O bindings fail at drmParallelizeWithRowLabels
>> M-1648: Update CMS for Mahout 0.10.0
>>
>> Andrew Musselman
>> -
>> M-1462: Cleaning up Random Forests documentation on Mahout website
>> M-1470: LDA Topic dump
>> M-1522: Handle logging levels via log4j.xml
>> M-1563: cleanup Warnings during Build
>> M-1655: Refactor module dependencies
>>
>> Dmitriy Lyubimov
>> --
>> M-1646: Refactor out all legacy MR dependencies from scala code
>>
>> Frank Scholten
>> -
>> M-1625: lucene2seq: failure to convert a document that does not contain a
>> field (the field is not required)
>> M-1649: Lucene 5 upgrade
>>
>> Pat Ferrel
>> -
>> M-1589: mahout.cmd has duplicated content(Patch available)
>>
>> Suneel Marthi
>> -
>> M-1469: Streaming KMeans fails when executed in MR mode and
>> REDUCE_STREAMING_KMEANS set to true
>> M-1512: Hadoop 2 compatibility
>> M-1585: Javadocs not hosted by Mahout-Quality
>> M-1586: Collections downloads must have hash signatures
>> M-1619: HighDFWordsPruner overwrites cache files
>> M-1647: The release build is incomplete
>> M-1652: Java 7 update
>> M-1656: Change SNAPSHOT version from 1.0 to 0.10
>> M-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf
>>
>> Stevo Slavic
>> 
>> M-1277: Lose dependency on custom commons-cli
>> M-1278: Improve inheritance of apache parent pom
>> M-1562: Publish Scaladocs
>> M-1602: Euclidean Distance Similarity Math
>> M-1650: upgrade 3rd party jars
>>
>> Shannon Quinn
>> ---
>> M-1538: Port spectral clustering to Mahout DSL
>> M-1539: Implement affinity matrix computation in Mahout DSL
>> M-1659: Remove deprecated Lanczos solver from spectral clustering in
>> mr-legacy
>>
>> Sebastian Schelter
>> --
>> M-1584: Create a detailed example of how to index an arbitrary dataset and
>> run LDA on it(Patch available)
>>
>> Gokhan Capan
>> --
>> M-1626: Support for required quasi-algebraic operations and starting with
>> aggregating rows/blocks
>>
>> Unassigned
>> --
>> M-1516: run classify-20newsgroups.sh failed cause by
>> /tmp/mahout-work-jpan/20news-all does not exists in hdfs.(Patch
>> available)
>> M-1551: Add document to describe how to use mlp with command line(Patch
>> available)
>> M-1557: Add support for sparse training vectors in MLP(Patch available)
>> M-1593: cluster-reuters.sh does not work complaining
>> java.lang.IllegalStateException(Patch available)
>> M-1594: Example factorize-movielens-1M.sh does not use HDFS(Patch
>> available)
>> M-1633: Failure to execute query when solr index contains documents with
>> different fields
>> M-1634: ALS don't work when it adds new files in Distributed Cache
>>   (Patch available)
>> M-1637: RecommenderJob of ALS fails in the mapper because it uses the
>> instance of other class
>>
>>
>


[jira] [Updated] (MAHOUT-1643) CLI arguments are not being processed in spark-shell

2015-03-29 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1643:
---
Issue Type: Improvement  (was: Bug)

> CLI arguments are not being processed in spark-shell
> 
>
> Key: MAHOUT-1643
> URL: https://issues.apache.org/jira/browse/MAHOUT-1643
> Project: Mahout
>  Issue Type: Improvement
>  Components: CLI, spark
>Affects Versions: 0.10.0
> Environment: spark spark-shell
>Reporter: Andrew Palumbo
>  Labels: DSL, scala, spark, spark-shell
> Fix For: 0.10.1
>
>
> The CLI arguments are not being processed in spark-shell.  Most importantly 
> the spark options are not being passed to the spark configuration via:
> {code}
> $ mahout spark-shell -D:k=n
> {code}
> The arguments are preserved through {code}$ bin/mahout{code}. There should 
> be a relatively easy fix, either by using the MahoutOptionParser or Scopt, or by 
> simply parsing the args array.
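
The last option in the description, parsing the args array directly, could look roughly like the Java sketch below. It is an illustration only, not the fix that went into Mahout: the class name `SparkShellArgs` and the method `parseSparkOptions` are hypothetical, and the only thing taken from the issue is the `-D:k=n` argument shape.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: collects "-D:key=value" arguments.
public class SparkShellArgs {

    /** Returns every "-D:key=value" argument as an ordered key/value map. */
    public static Map<String, String> parseSparkOptions(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        String prefix = "-D:";
        for (String arg : args) {
            if (!arg.startsWith(prefix)) {
                continue; // not a -D: option, leave it for other handling
            }
            String pair = arg.substring(prefix.length());
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // malformed (no key or no '='), skipped in this sketch
            }
            options.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return options;
    }

    public static void main(String[] args) {
        // Example invocation with an assumed Spark property name.
        String[] example = {"spark-shell", "-D:spark.executor.memory=2g"};
        System.out.println(parseSparkOptions(example));
    }
}
```

Each resulting pair would then presumably be copied into the Spark configuration before the shell creates its context; that wiring is left out here because it depends on how the shell builds its configuration.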





[jira] [Reopened] (MAHOUT-1643) CLI arguments are not being processed in spark-shell

2015-03-29 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo reopened MAHOUT-1643:


Reopening for 0.10.1 because we still don't have a way to set spark options for 
the shell,  a problem that I hit in my first usage of it.

> CLI arguments are not being processed in spark-shell
> 
>
> Key: MAHOUT-1643
> URL: https://issues.apache.org/jira/browse/MAHOUT-1643
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI, spark
>Affects Versions: 0.10.0
> Environment: spark spark-shell
>Reporter: Andrew Palumbo
>  Labels: DSL, scala, spark, spark-shell
> Fix For: 0.10.1
>
>
> The CLI arguments are not being processed in spark-shell.  Most importantly 
> the spark options are not being passed to the spark configuration via:
> {code}
> $ mahout spark-shell -D:k=n
> {code}
> The arguments are preserved through {code}$ bin/mahout{code}. There should 
> be a relatively easy fix, either by using the MahoutOptionParser or Scopt, or by 
> simply parsing the args array.





[jira] [Updated] (MAHOUT-1643) CLI arguments are not being processed in spark-shell

2015-03-29 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1643:
---
Affects Version/s: (was: 1.0)
   0.10.0

> CLI arguments are not being processed in spark-shell
> 
>
> Key: MAHOUT-1643
> URL: https://issues.apache.org/jira/browse/MAHOUT-1643
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI, spark
>Affects Versions: 0.10.0
> Environment: spark spark-shell
>Reporter: Andrew Palumbo
>  Labels: DSL, scala, spark, spark-shell
> Fix For: 0.10.1
>
>
> The CLI arguments are not being processed in spark-shell.  Most importantly 
> the spark options are not being passed to the spark configuration via:
> {code}
> $ mahout spark-shell -D:k=n
> {code}
> The arguments are preserved through {code}$ bin/mahout{code}. There should 
> be a relatively easy fix, either by using the MahoutOptionParser or Scopt, or by 
> simply parsing the args array.





Re: Mahout 0.10.0 Bug bash

2015-03-29 Thread Andrew Palumbo
yeah there's something weird going on with  M-1609, but I closed it on 
Friday.


On 03/29/2015 12:36 PM, Andrew Musselman wrote:

Sunday's:

Andrew Palumbo
--
M-1477: Clean up website on Logistic Regression
M-1493: Port Naive Bayes to Spark DSL(Patch available)
M-1559: Documentation and cleanup for Naive Bayes Example
M-1564: Naive Bayes classifier for new Text Documents
M-1609: NullPointerException(This bug is not showing up aside from its
title)
M-1635: Getting an exception when I provide classification labels manually
for Naive Bayes
M-1638: H2O bindings fail at drmParallelizeWithRowLabels
M-1648: Update CMS for Mahout 0.10.0

Andrew Musselman
-
M-1462: Cleaning up Random Forests documentation on Mahout website
M-1470: LDA Topic dump
M-1522: Handle logging levels via log4j.xml
M-1563: cleanup Warnings during Build
M-1655: Refactor module dependencies

Dmitriy Lyubimov
--
M-1646: Refactor out all legacy MR dependencies from scala code

Frank Scholten
-
M-1625: lucene2seq: failure to convert a document that does not contain a
field (the field is not required)
M-1649: Lucene 5 upgrade

Pat Ferrel
-
M-1589: mahout.cmd has duplicated content(Patch available)

Suneel Marthi
-
M-1469: Streaming KMeans fails when executed in MR mode and
REDUCE_STREAMING_KMEANS set to true
M-1512: Hadoop 2 compatibility
M-1585: Javadocs not hosted by Mahout-Quality
M-1586: Collections downloads must have hash signatures
M-1619: HighDFWordsPruner overwrites cache files
M-1647: The release build is incomplete
M-1652: Java 7 update
M-1656: Change SNAPSHOT version from 1.0 to 0.10
M-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf

Stevo Slavic

M-1277: Lose dependency on custom commons-cli
M-1278: Improve inheritance of apache parent pom
M-1562: Publish Scaladocs
M-1602: Euclidean Distance Similarity Math
M-1650: upgrade 3rd party jars

Shannon Quinn
---
M-1538: Port spectral clustering to Mahout DSL
M-1539: Implement affinity matrix computation in Mahout DSL
M-1659: Remove deprecated Lanczos solver from spectral clustering in
mr-legacy

Sebastian Schelter
--
M-1584: Create a detailed example of how to index an arbitrary dataset and
run LDA on it(Patch available)

Gokhan Capan
--
M-1626: Support for required quasi-algebraic operations and starting with
aggregating rows/blocks

Unassigned
--
M-1516: run classify-20newsgroups.sh failed cause by
/tmp/mahout-work-jpan/20news-all does not exists in hdfs.(Patch
available)
M-1551: Add document to describe how to use mlp with command line(Patch
available)
M-1557: Add support for sparse training vectors in MLP(Patch available)
M-1593: cluster-reuters.sh does not work complaining
java.lang.IllegalStateException(Patch available)
M-1594: Example factorize-movielens-1M.sh does not use HDFS(Patch
available)
M-1633: Failure to execute query when solr index contains documents with
different fields
M-1634: ALS don't work when it adds new files in Distributed Cache
  (Patch available)
M-1637: RecommenderJob of ALS fails in the mapper because it uses the
instance of other class





[jira] [Updated] (MAHOUT-1643) CLI arguments are not being processed in spark-shell

2015-03-29 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1643:
---
Fix Version/s: (was: 0.10.0)
   0.10.1

> CLI arguments are not being processed in spark-shell
> 
>
> Key: MAHOUT-1643
> URL: https://issues.apache.org/jira/browse/MAHOUT-1643
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI, spark
>Affects Versions: 1.0
> Environment: spark spark-shell
>Reporter: Andrew Palumbo
>  Labels: DSL, scala, spark, spark-shell
> Fix For: 0.10.1
>
>
> The CLI arguments are not being processed in spark-shell.  Most importantly 
> the spark options are not being passed to the spark configuration via:
> {code}
> $ mahout spark-shell -D:k=n
> {code}
> The arguments are preserved through {code}$ bin/mahout{code}. There should 
> be a relatively easy fix, either by using the MahoutOptionParser or Scopt, or by 
> simply parsing the args array.





Re: Mahout 0.10.0 Bug bash

2015-03-29 Thread Andrew Musselman
Yes, reminder we want to freeze/slush next Sunday.

If you won't be able to finish your bugs let's do some more triage and
split up work.

On Sunday, March 29, 2015, Suneel Marthi  wrote:

> A daily "politely harsh" reminder of the April 5 code freeze date with the
> daily bug bash would be helpful.
>
> On Sun, Mar 29, 2015 at 12:36 PM, Andrew Musselman <
> andrew.mussel...@gmail.com > wrote:
>
> > Sunday's:
> >
> > Andrew Palumbo
> > --
> > M-1477: Clean up website on Logistic Regression
> > M-1493: Port Naive Bayes to Spark DSL(Patch available)
> > M-1559: Documentation and cleanup for Naive Bayes Example
> > M-1564: Naive Bayes classifier for new Text Documents
> > M-1609: NullPointerException(This bug is not showing up aside from its
> > title)
> > M-1635: Getting an exception when I provide classification labels manually
> > for Naive Bayes
> > M-1638: H2O bindings fail at drmParallelizeWithRowLabels
> > M-1648: Update CMS for Mahout 0.10.0
> >
> > Andrew Musselman
> > -
> > M-1462: Cleaning up Random Forests documentation on Mahout website
> > M-1470: LDA Topic dump
> > M-1522: Handle logging levels via log4j.xml
> > M-1563: cleanup Warnings during Build
> > M-1655: Refactor module dependencies
> >
> > Dmitriy Lyubimov
> > --
> > M-1646: Refactor out all legacy MR dependencies from scala code
> >
> > Frank Scholten
> > -
> > M-1625: lucene2seq: failure to convert a document that does not contain a
> > field (the field is not required)
> > M-1649: Lucene 5 upgrade
> >
> > Pat Ferrel
> > -
> > M-1589: mahout.cmd has duplicated content(Patch available)
> >
> > Suneel Marthi
> > -
> > M-1469: Streaming KMeans fails when executed in MR mode and
> > REDUCE_STREAMING_KMEANS set to true
> > M-1512: Hadoop 2 compatibility
> > M-1585: Javadocs not hosted by Mahout-Quality
> > M-1586: Collections downloads must have hash signatures
> > M-1619: HighDFWordsPruner overwrites cache files
> > M-1647: The release build is incomplete
> > M-1652: Java 7 update
> > M-1656: Change SNAPSHOT version from 1.0 to 0.10
> > M-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf
> >
> > Stevo Slavic
> > 
> > M-1277: Lose dependency on custom commons-cli
> > M-1278: Improve inheritance of apache parent pom
> > M-1562: Publish Scaladocs
> > M-1602: Euclidean Distance Similarity Math
> > M-1650: upgrade 3rd party jars
> >
> > Shannon Quinn
> > ---
> > M-1538: Port spectral clustering to Mahout DSL
> > M-1539: Implement affinity matrix computation in Mahout DSL
> > M-1659: Remove deprecated Lanczos solver from spectral clustering in
> > mr-legacy
> >
> > Sebastian Schelter
> > --
> > M-1584: Create a detailed example of how to index an arbitrary dataset and
> > run LDA on it(Patch available)
> >
> > Gokhan Capan
> > --
> > M-1626: Support for required quasi-algebraic operations and starting with
> > aggregating rows/blocks
> >
> > Unassigned
> > --
> > M-1516: run classify-20newsgroups.sh failed cause by
> > /tmp/mahout-work-jpan/20news-all does not exists in hdfs.(Patch
> > available)
> > M-1551: Add document to describe how to use mlp with command line(Patch
> > available)
> > M-1557: Add support for sparse training vectors in MLP(Patch available)
> > M-1593: cluster-reuters.sh does not work complaining
> > java.lang.IllegalStateException(Patch available)
> > M-1594: Example factorize-movielens-1M.sh does not use HDFS(Patch
> > available)
> > M-1633: Failure to execute query when solr index contains documents with
> > different fields
> > M-1634: ALS don't work when it adds new files in Distributed Cache
> >  (Patch available)
> > M-1637: RecommenderJob of ALS fails in the mapper because it uses the
> > instance of other class
> >
>


Re: Mahout 0.10.0 Bug bash

2015-03-29 Thread Suneel Marthi
A daily "politely harsh" reminder of the April 5 code freeze date with the
daily bug bash would be helpful.

On Sun, Mar 29, 2015 at 12:36 PM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Sunday's:
>
> Andrew Palumbo
> --
> M-1477: Clean up website on Logistic Regression
> M-1493: Port Naive Bayes to Spark DSL(Patch available)
> M-1559: Documentation and cleanup for Naive Bayes Example
> M-1564: Naive Bayes classifier for new Text Documents
> M-1609: NullPointerException(This bug is not showing up aside from its
> title)
> M-1635: Getting an exception when I provide classification labels manually
> for Naive Bayes
> M-1638: H2O bindings fail at drmParallelizeWithRowLabels
> M-1648: Update CMS for Mahout 0.10.0
>
> Andrew Musselman
> -
> M-1462: Cleaning up Random Forests documentation on Mahout website
> M-1470: LDA Topic dump
> M-1522: Handle logging levels via log4j.xml
> M-1563: cleanup Warnings during Build
> M-1655: Refactor module dependencies
>
> Dmitriy Lyubimov
> --
> M-1646: Refactor out all legacy MR dependencies from scala code
>
> Frank Scholten
> -
> M-1625: lucene2seq: failure to convert a document that does not contain a
> field (the field is not required)
> M-1649: Lucene 5 upgrade
>
> Pat Ferrel
> -
> M-1589: mahout.cmd has duplicated content(Patch available)
>
> Suneel Marthi
> -
> M-1469: Streaming KMeans fails when executed in MR mode and
> REDUCE_STREAMING_KMEANS set to true
> M-1512: Hadoop 2 compatibility
> M-1585: Javadocs not hosted by Mahout-Quality
> M-1586: Collections downloads must have hash signatures
> M-1619: HighDFWordsPruner overwrites cache files
> M-1647: The release build is incomplete
> M-1652: Java 7 update
> M-1656: Change SNAPSHOT version from 1.0 to 0.10
> M-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf
>
> Stevo Slavic
> 
> M-1277: Lose dependency on custom commons-cli
> M-1278: Improve inheritance of apache parent pom
> M-1562: Publish Scaladocs
> M-1602: Euclidean Distance Similarity Math
> M-1650: upgrade 3rd party jars
>
> Shannon Quinn
> ---
> M-1538: Port spectral clustering to Mahout DSL
> M-1539: Implement affinity matrix computation in Mahout DSL
> M-1659: Remove deprecated Lanczos solver from spectral clustering in
> mr-legacy
>
> Sebastian Schelter
> --
> M-1584: Create a detailed example of how to index an arbitrary dataset and
> run LDA on it(Patch available)
>
> Gokhan Capan
> --
> M-1626: Support for required quasi-algebraic operations and starting with
> aggregating rows/blocks
>
> Unassigned
> --
> M-1516: run classify-20newsgroups.sh failed cause by
> /tmp/mahout-work-jpan/20news-all does not exists in hdfs.(Patch
> available)
> M-1551: Add document to describe how to use mlp with command line(Patch
> available)
> M-1557: Add support for sparse training vectors in MLP(Patch available)
> M-1593: cluster-reuters.sh does not work complaining
> java.lang.IllegalStateException(Patch available)
> M-1594: Example factorize-movielens-1M.sh does not use HDFS(Patch
> available)
> M-1633: Failure to execute query when solr index contains documents with
> different fields
> M-1634: ALS don't work when it adds new files in Distributed Cache
>  (Patch available)
> M-1637: RecommenderJob of ALS fails in the mapper because it uses the
> instance of other class
>


Re: Mahout 0.10.0 Bug bash

2015-03-29 Thread Andrew Musselman
Sunday's:

Andrew Palumbo
--
M-1477: Clean up website on Logistic Regression
M-1493: Port Naive Bayes to Spark DSL(Patch available)
M-1559: Documentation and cleanup for Naive Bayes Example
M-1564: Naive Bayes classifier for new Text Documents
M-1609: NullPointerException(This bug is not showing up aside from its
title)
M-1635: Getting an exception when I provide classification labels manually
for Naive Bayes
M-1638: H2O bindings fail at drmParallelizeWithRowLabels
M-1648: Update CMS for Mahout 0.10.0

Andrew Musselman
-
M-1462: Cleaning up Random Forests documentation on Mahout website
M-1470: LDA Topic dump
M-1522: Handle logging levels via log4j.xml
M-1563: cleanup Warnings during Build
M-1655: Refactor module dependencies

Dmitriy Lyubimov
--
M-1646: Refactor out all legacy MR dependencies from scala code

Frank Scholten
-
M-1625: lucene2seq: failure to convert a document that does not contain a
field (the field is not required)
M-1649: Lucene 5 upgrade

Pat Ferrel
-
M-1589: mahout.cmd has duplicated content(Patch available)

Suneel Marthi
-
M-1469: Streaming KMeans fails when executed in MR mode and
REDUCE_STREAMING_KMEANS set to true
M-1512: Hadoop 2 compatibility
M-1585: Javadocs not hosted by Mahout-Quality
M-1586: Collections downloads must have hash signatures
M-1619: HighDFWordsPruner overwrites cache files
M-1647: The release build is incomplete
M-1652: Java 7 update
M-1656: Change SNAPSHOT version from 1.0 to 0.10
M-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf

Stevo Slavic

M-1277: Lose dependency on custom commons-cli
M-1278: Improve inheritance of apache parent pom
M-1562: Publish Scaladocs
M-1602: Euclidean Distance Similarity Math
M-1650: upgrade 3rd party jars

Shannon Quinn
---
M-1538: Port spectral clustering to Mahout DSL
M-1539: Implement affinity matrix computation in Mahout DSL
M-1659: Remove deprecated Lanczos solver from spectral clustering in
mr-legacy

Sebastian Schelter
--
M-1584: Create a detailed example of how to index an arbitrary dataset and
run LDA on it(Patch available)

Gokhan Capan
--
M-1626: Support for required quasi-algebraic operations and starting with
aggregating rows/blocks

Unassigned
--
M-1516: run classify-20newsgroups.sh failed cause by
/tmp/mahout-work-jpan/20news-all does not exists in hdfs.(Patch
available)
M-1551: Add document to describe how to use mlp with command line(Patch
available)
M-1557: Add support for sparse training vectors in MLP(Patch available)
M-1593: cluster-reuters.sh does not work complaining
java.lang.IllegalStateException(Patch available)
M-1594: Example factorize-movielens-1M.sh does not use HDFS(Patch
available)
M-1633: Failure to execute query when solr index contains documents with
different fields
M-1634: ALS don't work when it adds new files in Distributed Cache
 (Patch available)
M-1637: RecommenderJob of ALS fails in the mapper because it uses the
instance of other class


[jira] [Updated] (MAHOUT-1634) ALS don't work when it adds new files in Distributed Cache

2015-03-29 Thread Andrew Musselman (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Musselman updated MAHOUT-1634:
-
Fix Version/s: (was: 1.0)
   0.10.0

> ALS don't work when it adds new files in Distributed Cache
> --
>
> Key: MAHOUT-1634
> URL: https://issues.apache.org/jira/browse/MAHOUT-1634
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.9
> Environment: Cloudera 5.1 VM, eclipse, zookeeper
>Reporter: Cristian Galán
>  Labels: ALS, legacy
> Fix For: 0.10.0
>
> Attachments: mahout.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The ALS algorithm uses the distributed cache for temp files, but the distributed cache 
> has other uses too, especially adding dependencies
> (http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/),
> so when we add a dependency library (or any other file) to a Hadoop job, ALS 
> fails because it reads ALL files in the distributed cache without distinction.
> This occurs in my company's project because we need to add Mahout 
> dependencies (mahout, lucene, ...) to a Hadoop Configuration to run Mahout 
> jobs; otherwise the Mahout job fails because it can't find the dependencies.
> I propose two options (I think both are valid):
> 1) Eliminate all .jar files from the return value of HadoopUtil.getCacheFiles
> 2) Eliminate all Path objects other than /part-*
> I prefer the first because it's less aggressive, and I think this solution 
> will resolve all the problems.
> PS: Sorry if my English is wrong.
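
The two filtering options proposed in the description can be sketched in plain Java as follows. This is not the attached mahout.patch: the class `CacheFileFilter` and both method names are hypothetical, and plain strings stand in for Hadoop `Path` objects so the example runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the two options; strings stand in for Hadoop Paths.
public class CacheFileFilter {

    /** Option 1: drop every cache entry whose name ends in ".jar". */
    public static List<String> withoutJars(List<String> cachePaths) {
        List<String> kept = new ArrayList<>();
        for (String path : cachePaths) {
            if (!path.toLowerCase().endsWith(".jar")) {
                kept.add(path);
            }
        }
        return kept;
    }

    /** Option 2: keep only entries whose file name starts with "part-". */
    public static List<String> onlyPartFiles(List<String> cachePaths) {
        List<String> kept = new ArrayList<>();
        for (String path : cachePaths) {
            String fileName = path.substring(path.lastIndexOf('/') + 1);
            if (fileName.startsWith("part-")) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up cache contents: one ALS temp file, one dependency jar, one marker file.
        List<String> cache = Arrays.asList(
            "/tmp/als/part-00000", "/libs/lucene-core.jar", "/tmp/als/_SUCCESS");
        System.out.println(withoutJars(cache));
        System.out.println(onlyPartFiles(cache));
    }
}
```

The sketch makes the trade-off visible: option 1 still lets through non-jar files that ALS did not write, while option 2 accepts only `part-*` files and so assumes nothing ALS needs is named differently.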





Re: Projects page out of date

2015-03-29 Thread Suneel Marthi
It wouldn't propagate unless it's pushed to the mirrors; the next release should
take care of that.

On Sun, Mar 29, 2015 at 12:59 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Yep, good call; I'll make an update and see if it propagates.
>
> On Saturday, March 28, 2015, Suneel Marthi 
> wrote:
>
> > May need to update the project DOAP ??
> >
> > On Sun, Mar 29, 2015 at 12:31 AM, Andrew Musselman <
> > andrew.mussel...@gmail.com > wrote:
> >
> > > How does this page get refreshed?
> > >
> > >
> > > https://projects.apache.org/projects/mahout.html
> > >
> >
>