[jira] [Commented] (MAHOUT-1946) ViennaCL not being picked up by JNI

2017-05-09 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003767#comment-16003767
 ] 

Dmitriy Lyubimov commented on MAHOUT-1946:
--

It seems so. A failure to load the native library should equate to the backend being reported as unavailable.


> ViennaCL not being picked up by JNI
> ---
>
> Key: MAHOUT-1946
> URL: https://issues.apache.org/jira/browse/MAHOUT-1946
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Musselman
>Assignee: Andrew Palumbo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Using the PR for MAHOUT-1938 but probably in master as well:
> scala> :load ./examples/bin/SparseSparseDrmTimer.mscala
> Loading ./examples/bin/SparseSparseDrmTimer.mscala...
> timeSparseDRMMMul: (m: Int, n: Int, s: Int, para: Int, pctDense: Double, 
> seed: Long)Long
> scala> timeSparseDRMMMul(100,100,100,1,.02,1234L)
> [INFO] Creating org.apache.mahout.viennacl.opencl.GPUMMul solver
> [INFO] Successfully created org.apache.mahout.viennacl.opencl.GPUMMul solver
> gpuRWCW
> 17/02/26 13:18:54 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
> java.lang.UnsatisfiedLinkError: no jniViennaCL in java.library.path
>   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>   at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>   at java.lang.System.loadLibrary(System.java:1122)
>   at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:726)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:501)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:434)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.loadLib(Context.scala:63)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.(Context.scala:65)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.(Context.scala)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$.org$apache$mahout$viennacl$opencl$GPUMMul$$gpuRWCW(GPUMMul.scala:171)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$$anonfun$11.apply(GPUMMul.scala:77)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$$anonfun$11.apply(GPUMMul.scala:77)
>   at org.apache.mahout.viennacl.opencl.GPUMMul$.apply(GPUMMul.scala:127)
>   at org.apache.mahout.viennacl.opencl.GPUMMul$.apply(GPUMMul.scala:33)
>   at 
> org.apache.mahout.math.scalabindings.RLikeMatrixOps.$percent$times$percent(RLikeMatrixOps.scala:37)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$.org$apache$mahout$sparkbindings$blas$ABt$$mmulFunc$1(ABt.scala:98)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$6.apply(ABt.scala:113)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$6.apply(ABt.scala:113)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$pairwiseApply$1.apply(ABt.scala:209)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$pairwiseApply$1.apply(ABt.scala:209)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 17/02/26 13:18:54 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.UnsatisfiedLinkError: no jniViennaCL in java.library.path
>   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>   at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>   at java.lang.System.loadLibrary(System.java:1122)
>   at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:726)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:501)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:434)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.loadLib(Context.scala:63)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.(Context.scala:65)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.(Context.scala)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$.org$apache$mahout$viennacl$opencl$GPUMMul$$gpuRWCW(GPUMMul.scala:171)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$$anonfun$11.apply(GPUMMul.scala:77)
>   at 
> 

[jira] [Commented] (MAHOUT-1946) ViennaCL not being picked up by JNI

2017-05-09 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003640#comment-16003640
 ] 

Dmitriy Lyubimov commented on MAHOUT-1946:
--

Native backends should be able to recover gracefully from an inability to load 
native code (just report up that the backend is unavailable, rather than 
throwing a class-loading exception) .. I guess I'm stating the obvious :) sorry
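
For illustration, a minimal Scala sketch of that idea (hypothetical object name, not the actual Context/GPUMMul code): probe the native library once and expose a flag, so solver selection can fall back instead of letting UnsatisfiedLinkError propagate.

{code}
// Hypothetical sketch: report "backend unavailable" instead of letting the
// class-loading error escape into the solver selection path.
object ViennaClProbe {
  lazy val available: Boolean =
    try {
      System.loadLibrary("jniViennaCL") // or the javacpp Loader.load(...) call
      true
    } catch {
      case _: UnsatisfiedLinkError => false
    }
}

// Solver selection can then degrade gracefully:
// val mmul = if (ViennaClProbe.available) gpuSolver else jvmSolver
{code}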

> ViennaCL not being picked up by JNI
> ---
>
> Key: MAHOUT-1946
> URL: https://issues.apache.org/jira/browse/MAHOUT-1946
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Musselman
>Assignee: Andrew Palumbo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Using the PR for MAHOUT-1938 but probably in master as well:
> scala> :load ./examples/bin/SparseSparseDrmTimer.mscala
> Loading ./examples/bin/SparseSparseDrmTimer.mscala...
> timeSparseDRMMMul: (m: Int, n: Int, s: Int, para: Int, pctDense: Double, 
> seed: Long)Long
> scala> timeSparseDRMMMul(100,100,100,1,.02,1234L)
> [INFO] Creating org.apache.mahout.viennacl.opencl.GPUMMul solver
> [INFO] Successfully created org.apache.mahout.viennacl.opencl.GPUMMul solver
> gpuRWCW
> 17/02/26 13:18:54 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
> java.lang.UnsatisfiedLinkError: no jniViennaCL in java.library.path
>   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>   at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>   at java.lang.System.loadLibrary(System.java:1122)
>   at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:726)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:501)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:434)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.loadLib(Context.scala:63)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.(Context.scala:65)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.(Context.scala)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$.org$apache$mahout$viennacl$opencl$GPUMMul$$gpuRWCW(GPUMMul.scala:171)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$$anonfun$11.apply(GPUMMul.scala:77)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$$anonfun$11.apply(GPUMMul.scala:77)
>   at org.apache.mahout.viennacl.opencl.GPUMMul$.apply(GPUMMul.scala:127)
>   at org.apache.mahout.viennacl.opencl.GPUMMul$.apply(GPUMMul.scala:33)
>   at 
> org.apache.mahout.math.scalabindings.RLikeMatrixOps.$percent$times$percent(RLikeMatrixOps.scala:37)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$.org$apache$mahout$sparkbindings$blas$ABt$$mmulFunc$1(ABt.scala:98)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$6.apply(ABt.scala:113)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$6.apply(ABt.scala:113)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$pairwiseApply$1.apply(ABt.scala:209)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$pairwiseApply$1.apply(ABt.scala:209)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 17/02/26 13:18:54 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.UnsatisfiedLinkError: no jniViennaCL in java.library.path
>   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>   at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>   at java.lang.System.loadLibrary(System.java:1122)
>   at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:726)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:501)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:434)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.loadLib(Context.scala:63)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.(Context.scala:65)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.(Context.scala)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$.org$apache$mahout$viennacl$opencl$GPUMMul$$gpuRWCW(GPUMMul.scala:171)
>   at 
> 

[jira] [Commented] (MAHOUT-1940) Provide a Java API to SimilarityAnalysis and any other needed APIs

2017-02-14 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866853#comment-15866853
 ] 

Dmitriy Lyubimov commented on MAHOUT-1940:
--

Normally, someone writing in Java does not really have to port anything 
from Scala. 
For example, Spark's Java APIs are in fact implemented in Scala. 

There are normally two ways of going about this: 
(1) write the APIs in Java and implement them in Scala (the way Spark does), or
(2) write Java-compatible traits in Scala and then implement them in Scala as 
well (which is what I do, as it saves a bit of complexity). 

For approach (2), the APIs should only use Java-compatible types. That is, 
no Scala libraries (such as collections) and no incompatible language 
constructs (such as implicits, curried functions, generic context bounds, etc.). 
Implementing the API interfaces in Java just verifies this a bit better and 
avoids a mixed build (which can sometimes be a problem due to circular 
dependencies between Java and Scala code).
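
A minimal sketch of approach (2), with an illustrative (hypothetical) trait name; the only point is that the signatures use nothing but Java-compatible types:

{code}
// Java-callable facade written in Scala: no Scala collections, implicits,
// curried parameters, or context bounds in the signature.
trait JavaSimilarityAnalysis {
  def rowSimilarity(input: Array[Array[Double]], maxSimilaritiesPerRow: Int): Array[Array[Double]]
}

// The implementation stays in Scala and can delegate to the existing
// org.apache.mahout.math.cf.SimilarityAnalysis code internally.
class JavaSimilarityAnalysisImpl extends JavaSimilarityAnalysis {
  override def rowSimilarity(input: Array[Array[Double]],
                             maxSimilaritiesPerRow: Int): Array[Array[Double]] = {
    // ... convert, call the Scala API, convert back ...
    input
  }
}
{code}

A Java caller then sees only an ordinary interface and never touches Scala-specific types.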


> Provide a Java API to  SimilarityAnalysis and any other needed APIs
> ---
>
> Key: MAHOUT-1940
> URL: https://issues.apache.org/jira/browse/MAHOUT-1940
> Project: Mahout
>  Issue Type: New Feature
>  Components: Algorithms, cooccurrence
>Reporter: James Mackey
>
> We want to port the functionality from 
> org.apache.mahout.math.cf.SimilarityAnalysis.scala to java for easy 
> integration with a java project we will be creating that derives a similarity 
> measure from the co-occurrence and cross-occurrence matrix. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MAHOUT-1939) fastutil version clash with spark distributions

2017-02-10 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862045#comment-15862045
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1939 at 2/11/17 12:33 AM:


Perhaps Mahout should include fastutil in a shaded form in mahout-math or 
mahout-math-scala, like this (in mahout-math):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.0.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <artifactSet>
          <includes>
            <include>it.unimi.dsi:fastutil</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>it.unimi.dsi.fastutil</pattern>
            <shadedPattern>shaded.it.unimi.dsi.fastutil</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>


was (Author: dlyubimov):
Perhaps Mahout should include fastutil in a shaded form in mahout-math or 
mahout-math-scala, like this (in mahout-math):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.0.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>it.unimi.dsi.fastutil</pattern>
            <shadedPattern>shaded.it.unimi.dsi.fastutil</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>

  


> fastutil version clash with spark distributions
> ---
>
> Key: MAHOUT-1939
> URL: https://issues.apache.org/jira/browse/MAHOUT-1939
> Project: Mahout
>  Issue Type: Bug
>Reporter: Dmitriy Lyubimov
>Priority: Critical
>
> A version difference in fastutil breaks sparse algebra (specifically, 
> RandomAccessSparseVector in assign(), e.g., vec *= 5).
> observed version in CDH:
> 
> file:/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.21/jars/fastutil-6.3.jar
> mahout uses 7.0.12
> java.lang.UnsupportedOperationException
> at 
> it.unimi.dsi.fastutil.ints.AbstractInt2DoubleMap$BasicEntry.setValue(AbstractInt2DoubleMap.java:146)
> at 
> org.apache.mahout.math.RandomAccessSparseVector$RandomAccessElement.set(RandomAccessSparseVector.java:235)
> at 
> org.apache.mahout.math.VectorView$DecoratorElement.set(VectorView.java:181)
> at 
> org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:536)
> at 
> org.apache.mahout.math.scalabindings.RLikeVectorOps.$div$eq(RLikeVectorOps.scala:45)
> ...



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MAHOUT-1939) fastutil version clash with spark distributions

2017-02-10 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862045#comment-15862045
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1939 at 2/11/17 12:16 AM:


Perhaps Mahout should include fastutil in a shaded form in mahout-math or 
mahout-math-scala, like this (in mahout-math):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.0.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>it.unimi.dsi.fastutil</pattern>
            <shadedPattern>shaded.it.unimi.dsi.fastutil</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>



was (Author: dlyubimov):
Perhaps Mahout should include fastutil in a shaded form in mahout-math or 
mahout-math-scala.

> fastutil version clash with spark distributions
> ---
>
> Key: MAHOUT-1939
> URL: https://issues.apache.org/jira/browse/MAHOUT-1939
> Project: Mahout
>  Issue Type: Bug
>Reporter: Dmitriy Lyubimov
>Priority: Critical
>
> A version difference in fastutil breaks sparse algebra (specifically, 
> RandomAccessSparseVector in assign(), e.g., vec *= 5).
> observed version in CDH:
> 
> file:/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.21/jars/fastutil-6.3.jar
> mahout uses 7.0.12
> java.lang.UnsupportedOperationException
> at 
> it.unimi.dsi.fastutil.ints.AbstractInt2DoubleMap$BasicEntry.setValue(AbstractInt2DoubleMap.java:146)
> at 
> org.apache.mahout.math.RandomAccessSparseVector$RandomAccessElement.set(RandomAccessSparseVector.java:235)
> at 
> org.apache.mahout.math.VectorView$DecoratorElement.set(VectorView.java:181)
> at 
> org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:536)
> at 
> org.apache.mahout.math.scalabindings.RLikeVectorOps.$div$eq(RLikeVectorOps.scala:45)
> ...



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MAHOUT-1939) fastutil version clash with spark distributions

2017-02-10 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862045#comment-15862045
 ] 

Dmitriy Lyubimov commented on MAHOUT-1939:
--

Perhaps Mahout should include fastutil in a shaded form in mahout-math or 
mahout-math-scala.

> fastutil version clash with spark distributions
> ---
>
> Key: MAHOUT-1939
> URL: https://issues.apache.org/jira/browse/MAHOUT-1939
> Project: Mahout
>  Issue Type: Bug
>Reporter: Dmitriy Lyubimov
>Priority: Critical
>
> A version difference in fastutil breaks sparse algebra (specifically, 
> RandomAccessSparseVector in assign(), e.g., vec *= 5).
> observed version in CDH:
> 
> file:/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.21/jars/fastutil-6.3.jar
> mahout uses 7.0.12
> java.lang.UnsupportedOperationException
> at 
> it.unimi.dsi.fastutil.ints.AbstractInt2DoubleMap$BasicEntry.setValue(AbstractInt2DoubleMap.java:146)
> at 
> org.apache.mahout.math.RandomAccessSparseVector$RandomAccessElement.set(RandomAccessSparseVector.java:235)
> at 
> org.apache.mahout.math.VectorView$DecoratorElement.set(VectorView.java:181)
> at 
> org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:536)
> at 
> org.apache.mahout.math.scalabindings.RLikeVectorOps.$div$eq(RLikeVectorOps.scala:45)
> ...



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MAHOUT-1939) fastutil version clash with spark distributions

2017-02-10 Thread Dmitriy Lyubimov (JIRA)
Dmitriy Lyubimov created MAHOUT-1939:


 Summary: fastutil version clash with spark distributions
 Key: MAHOUT-1939
 URL: https://issues.apache.org/jira/browse/MAHOUT-1939
 Project: Mahout
  Issue Type: Bug
Reporter: Dmitriy Lyubimov
Priority: Critical


A version difference in fastutil breaks sparse algebra (specifically, 
RandomAccessSparseVector in assign(), e.g., vec *= 5).

observed version in CDH:

file:/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.21/jars/fastutil-6.3.jar

mahout uses 7.0.12

java.lang.UnsupportedOperationException
at 
it.unimi.dsi.fastutil.ints.AbstractInt2DoubleMap$BasicEntry.setValue(AbstractInt2DoubleMap.java:146)
at 
org.apache.mahout.math.RandomAccessSparseVector$RandomAccessElement.set(RandomAccessSparseVector.java:235)
at 
org.apache.mahout.math.VectorView$DecoratorElement.set(VectorView.java:181)
at org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:536)
at 
org.apache.mahout.math.scalabindings.RLikeVectorOps.$div$eq(RLikeVectorOps.scala:45)
...
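
For reference, a hedged sketch of the kind of in-place assign that hits this path (assuming Mahout's R-like Scala bindings are on the classpath; illustrative only, not a test from the code base):

{code}
import org.apache.mahout.math.RandomAccessSparseVector
import org.apache.mahout.math.scalabindings._
import RLikeOps._

val vec = new RandomAccessSparseVector(100)
vec(3) = 1.5
// In-place scalar assign; with an older fastutil (6.x) first on the classpath this
// ends in AbstractInt2DoubleMap$BasicEntry.setValue throwing UnsupportedOperationException.
vec *= 5
{code}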



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (MAHOUT-1916) mahout bug

2017-01-26 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov resolved MAHOUT-1916.
--
Resolution: Invalid

> mahout bug
> --
>
> Key: MAHOUT-1916
> URL: https://issues.apache.org/jira/browse/MAHOUT-1916
> Project: Mahout
>  Issue Type: Bug
>Reporter: sarra sarra
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2016-12-21 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768521#comment-15768521
 ] 

Dmitriy Lyubimov commented on MAHOUT-1856:
--

One thing -- we usually squash working branches before merging a PR to master so 
that we preferably have one commit per issue. This is much easier to manage (and 
to hot-fix later if needed).

> Create a framework for new Mahout Clustering, Classification, and 
> Optimization  Algorithms
> --
>
> Key: MAHOUT-1856
> URL: https://issues.apache.org/jira/browse/MAHOUT-1856
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.1
>Reporter: Andrew Palumbo
>Assignee: Trevor Grant
>Priority: Critical
> Fix For: 0.13.0
>
>
> To ensure that Mahout does not become "a loose bag of algorithms", create 
> basic traits with functions common to each class of algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1892) Can't broadcast vector in Mahout-Shell

2016-11-15 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15668009#comment-15668009
 ] 

Dmitriy Lyubimov commented on MAHOUT-1892:
--

The shell is a mystery. Obviously it tries to drag A itself into the mapBlock 
closure, but why is escaping me.

What happens if we remove the implicit conversion (i.e., use bcastV.value explicitly 
inside the closure)? Is it still happening?
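
In other words, something along these lines (adapted from the snippet in the issue description below), pulling the Vector out of the broadcast explicitly inside the closure:

{code}
val bcastV = drmBroadcast(v)
val drm2 = A.mapBlock() {
  case (keys, block) =>
    val vLocal = bcastV.value   // explicit; no implicit Broadcast -> Vector conversion
    for (row <- 0 until block.nrow) block(row, ::) -= vLocal
    keys -> block
}
drm2.checkpoint()
{code}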

> Can't broadcast vector in Mahout-Shell
> --
>
> Key: MAHOUT-1892
> URL: https://issues.apache.org/jira/browse/MAHOUT-1892
> Project: Mahout
>  Issue Type: Bug
>Reporter: Trevor Grant
>
> When attempting to broadcast a Vector in Mahout's spark-shell with `mapBlock` 
> we get serialization errors.  **NOTE** scalars can be broadcast without issue.
> I did some testing in the "Zeppelin Shell" for lack of a better term.  See 
> https://github.com/apache/zeppelin/pull/928
> The same `mapBlock` code I ran in the spark-shell (below) also generated 
> errors.  However, wrapping a mapBlock into a function in a compiled jar 
> https://github.com/apache/mahout/pull/246/commits/ccb5da65330e394763928f6dc51d96e38debe4fb#diff-4a952e8e09ae07e0b3a7ac6a5d6b2734R25
>  and then running said function from the Mahout Shell or in the "Zeppelin 
> Shell" (using Spark or Flink as a runner) works fine.  
> Consider
> ```
> mahout> val inCoreA = dense((1, 2, 3), (3, 4, 5))
> val A = drmParallelize(inCoreA)
> val v: Vector = dvec(1,1,1)
> val bcastV = drmBroadcast(v)
> val drm2 = A.mapBlock() {
> case (keys, block) =>
> for(row <- 0 until block.nrow) block(row, ::) -= bcastV
> keys -> block
> }
> drm2.checkpoint()
> ```
> Which emits the stack trace:
> ```
> org.apache.spark.SparkException: Task not serializable
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
> at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
> at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
> at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
> at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
> at org.apache.spark.rdd.RDD.map(RDD.scala:317)
> at 
> org.apache.mahout.sparkbindings.blas.MapBlock$.exec(MapBlock.scala:33)
> at 
> org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:338)
> at 
> org.apache.mahout.sparkbindings.SparkEngine$.toPhysical(SparkEngine.scala:116)
> at 
> org.apache.mahout.math.drm.logical.CheckpointAction.checkpoint(CheckpointAction.scala:41)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:68)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:70)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:72)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:74)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:76)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:78)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:80)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:82)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:84)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:86)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:88)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:90)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:92)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:94)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:96)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:98)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:100)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:102)
> at 

[jira] [Comment Edited] (MAHOUT-1884) Allow specification of dimensions of a DRM

2016-10-04 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546663#comment-15546663
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1884 at 10/4/16 9:10 PM:
---

drmWrap is not internal in the least (which is why it is not package-private). 
It is public and intended for plugging external general sources into the input 
barrier of the optimizer.

Loading into memory would happen anyway. Caching will not necessarily happen -- but 
it is not guaranteed not to happen either; there's no such contract. 

Materially it only makes any difference if the input is larger than the available 
cluster capacity, which I have yet to encounter, since algebraic tasks are CPU- and 
IO-bound, not memory-bound. Usually we run out of IO and CPU much sooner than we 
run out of memory, which makes this situation pragmatically unrealistic. 

Note that the optimizer should -- and will -- retain control over caching. We don't 
have an explicit caching API except for checkpoint "hints", but even that is only a 
hint, not a guarantee. Giving it some heuristics about a dataset doesn't guarantee 
that it won't compute others, or won't cache or sample for some other reason, now 
or in the future. 

This situation is fine, as it is one of the functions of the optimizer, as much as 
choosing degrees of parallelization, product task sizes, or operators to execute. 
Making those choices automatically is, actually, the point. As long as the optimizer 
does right enough things, that should be OK. 

Bottom line: I don't see harm in adding _optional_ ncol and nrow to drmDfsRead 
specifically. But I do not see a tangible benefit either. There's possibly only a 
slight benefit right now (no no-cache or no-sample guarantee), which will likely only 
decrease in the future. I am fine with it as long as it's understood there's no 
"no-cache" contract anywhere.


was (Author: dlyubimov):
drmWrap is not internal in the least (which is why it is not package-private). 
It is public and intended for plugging external general sources into the input 
barrier of the optimizer.

Loading into memory would happen anyway. Caching will not necessarily happen -- but 
it is not guaranteed not to happen either; there's no such contract. 

Materially it only makes any difference if the input is larger than the available 
cluster capacity, which I have yet to encounter, since algebraic tasks are CPU- and 
IO-bound, not memory-bound. Usually we run out of IO and CPU much sooner than we 
run out of memory, which makes this situation pragmatically unrealistic. 

Note that the optimizer should -- and will -- retain control over caching. We don't 
have an explicit caching API except for checkpoint "hints", but even that is only a 
hint, not a guarantee. Giving it some heuristics about a dataset doesn't guarantee 
that it won't compute others, or won't cache or sample for some other reason, now 
or in the future. 

This situation is fine, as it is one of the functions of the optimizer, as much as 
choosing degrees of parallelization, product task sizes, or operators to execute. 
Making those choices automatically is, actually, the point. As long as the optimizer 
does right enough things, that should be OK. 

Bottom line: I don't see harm in adding _optional_ ncol and nrow to drmDfsRead 
specifically. But I do not see a tangible benefit either. There's possibly only a 
slight benefit right now (no no-cache or no-sample guarantee), which will likely only 
decrease in the future. I am fine with it as long as it's understood there's no 
"no-cache" contract anywhere.


> Allow specification of dimensions of a DRM
> --
>
> Key: MAHOUT-1884
> URL: https://issues.apache.org/jira/browse/MAHOUT-1884
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.12.2
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>Priority: Minor
>
> Currently, in many cases, a DRM must be read to compute its dimensions when a 
> user calls nrow or ncol. This also implicitly caches the corresponding DRM.
> In some cases, the user actually knows the matrix dimensions (e.g., when the 
> matrices are synthetically generated, or when some metadata about them is 
> known). In such cases, the user should be able to specify the dimensions upon 
> creating the DRM and the caching should be avoided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MAHOUT-1884) Allow specification of dimensions of a DRM

2016-10-04 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546663#comment-15546663
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1884 at 10/4/16 9:09 PM:
---

drmWrap is not internal in the least (which is why it is not package-private). 
It is public and intended for plugging external general sources into the input 
barrier of the optimizer.

Loading into memory would happen anyway. Caching will not necessarily happen -- but 
it is not guaranteed not to happen either; there's no such contract. 

Materially it only makes any difference if the input is larger than the available 
cluster capacity, which I have yet to encounter, since algebraic tasks are CPU- and 
IO-bound, not memory-bound. Usually we run out of IO and CPU much sooner than we 
run out of memory, which makes this situation pragmatically unrealistic. 

Note that the optimizer should -- and will -- retain control over caching. We don't 
have an explicit caching API except for checkpoint "hints", but even that is only a 
hint, not a guarantee. Giving it some heuristics about a dataset doesn't guarantee 
that it won't compute others, or won't cache or sample for some other reason, now 
or in the future. 

This situation is fine, as it is one of the functions of the optimizer, as much as 
choosing degrees of parallelization, product task sizes, or operators to execute. 
Making those choices automatically is, actually, the point. As long as the optimizer 
does right enough things, that should be OK. 

Bottom line: I don't see harm in adding _optional_ ncol and nrow to drmDfsRead 
specifically. But I do not see a tangible benefit either. There's possibly only a 
slight benefit right now (no no-cache or no-sample guarantee), which will likely only 
decrease in the future. I am fine with it as long as it's understood there's no 
"no-cache" contract anywhere.


was (Author: dlyubimov):
drmWrap is not internal in the least (which is why it is not package-private). 
It is public and intended for plugging external general sources into the input 
barrier of the optimizer.

Loading into memory would happen anyway. Caching will not necessarily happen -- but 
it is not guaranteed not to happen either; there's no such contract. 

Materially it only makes a difference if the input is larger than the available 
cluster capacity, which I have yet to encounter, since algebraic tasks are CPU- and 
IO-bound, not memory-bound. Usually we run out of IO and CPU much sooner than we 
run out of memory, which makes this situation pragmatically unrealistic. 

Note that the optimizer should -- and will -- retain control over caching. We don't 
have an explicit caching API except for checkpoint "hints", but even that is only a 
hint, not a guarantee. Giving it some heuristics about a dataset doesn't guarantee 
that it won't compute others, or won't cache or sample for some other reason, now 
or in the future. 

This situation is fine, as it is one of the functions of the optimizer, as much as 
choosing degrees of parallelization, product task sizes, or operators to execute. 
Making those choices automatically is, actually, the point. As long as the optimizer 
does right enough things, that should be OK. 

Bottom line: I don't see harm in adding _optional_ ncol and nrow to drmDfsRead 
specifically. But I do not see a tangible benefit either. There's possibly only a 
slight benefit right now (no no-cache or no-sample guarantee), which will likely only 
decrease in the future. I am fine with it as long as it's understood there's no 
"no-cache" contract anywhere.


> Allow specification of dimensions of a DRM
> --
>
> Key: MAHOUT-1884
> URL: https://issues.apache.org/jira/browse/MAHOUT-1884
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.12.2
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>Priority: Minor
>
> Currently, in many cases, a DRM must be read to compute its dimensions when a 
> user calls nrow or ncol. This also implicitly caches the corresponding DRM.
> In some cases, the user actually knows the matrix dimensions (e.g., when the 
> matrices are synthetically generated, or when some metadata about them is 
> known). In such cases, the user should be able to specify the dimensions upon 
> creating the DRM and the caching should be avoided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MAHOUT-1884) Allow specification of dimensions of a DRM

2016-10-04 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546663#comment-15546663
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1884 at 10/4/16 9:08 PM:
---

drmWrap is not internal in the least (which is why it is not package-private). 
It is public and intended for plugging external general sources into the input 
barrier of the optimizer.

Loading into memory would happen anyway. Caching will not necessarily happen -- but 
it is not guaranteed not to happen either; there's no such contract. 

Materially it only makes a difference if the input is larger than the available 
cluster capacity, which I have yet to encounter, since algebraic tasks are CPU- and 
IO-bound, not memory-bound. Usually we run out of IO and CPU much sooner than we 
run out of memory, which makes this situation pragmatically unrealistic. 

Note that the optimizer should -- and will -- retain control over caching. We don't 
have an explicit caching API except for checkpoint "hints", but even that is only a 
hint, not a guarantee. Giving it some heuristics about a dataset doesn't guarantee 
that it won't compute others, or won't cache or sample for some other reason, now 
or in the future. 

This situation is fine, as it is one of the functions of the optimizer, as much as 
choosing degrees of parallelization, product task sizes, or operators to execute. 
Making those choices automatically is, actually, the point. As long as the optimizer 
does right enough things, that should be OK. 

Bottom line: I don't see harm in adding _optional_ ncol and nrow to drmDfsRead 
specifically. But I do not see a tangible benefit either. There's possibly only a 
slight benefit right now (no no-cache or no-sample guarantee), which will likely only 
decrease in the future. I am fine with it as long as it's understood there's no 
"no-cache" contract anywhere.


was (Author: dlyubimov):
drmWrap is not internal in the least (which is why it is not package-private). 
It is public and intended for plugging external general sources into the input 
barrier of the optimizer.

Loading into memory would happen anyway. Caching will not necessarily happen -- but 
it is not guaranteed not to happen either; there's no such contract. 

Materially it only makes a difference if the input is larger than the available 
cluster capacity, which I have yet to encounter, since algebraic tasks are CPU- and 
IO-bound, not memory-bound. Usually we run out of IO and CPU much sooner than we 
run out of memory, which makes this situation pragmatically unrealistic. 

Note that the optimizer should -- and will -- retain control over caching. We don't 
have an explicit caching API except for checkpoint "hints", but even that is only a 
hint, not a guarantee. Giving it some heuristics about a dataset doesn't guarantee 
that it won't compute others, or won't cache or sample for some other reason, now 
or in the future. 

This situation is fine, as it is one of the functions of the optimizer, as much as 
choosing degrees of parallelization, product task sizes, or operators to execute. 
Making those choices automatically is, actually, the point. As long as the optimizer 
does right enough things, that should be OK. 

Bottom line: I don't see harm in adding ncol and nrow to drmDfsRead specifically. 
There's possibly only a slight benefit right now (no no-cache or no-sample 
guarantee), which will likely only decrease in the future. I am fine with it as long 
as it's understood there's no "no-cache" contract anywhere.


> Allow specification of dimensions of a DRM
> --
>
> Key: MAHOUT-1884
> URL: https://issues.apache.org/jira/browse/MAHOUT-1884
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.12.2
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>Priority: Minor
>
> Currently, in many cases, a DRM must be read to compute its dimensions when a 
> user calls nrow or ncol. This also implicitly caches the corresponding DRM.
> In some cases, the user actually knows the matrix dimensions (e.g., when the 
> matrices are synthetically generated, or when some metadata about them is 
> known). In such cases, the user should be able to specify the dimensions upon 
> creating the DRM and the caching should be avoided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1884) Allow specification of dimensions of a DRM

2016-10-04 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546663#comment-15546663
 ] 

Dmitriy Lyubimov commented on MAHOUT-1884:
--

drmWrap is not internal in the least (which is why it is not package-private). 
It is public and intended for plugging external general sources into the input 
barrier of the optimizer.

Loading into memory would happen anyway. Caching will not necessarily happen -- but 
it is not guaranteed not to happen either; there's no such contract. 

Materially it only makes a difference if the input is larger than the available 
cluster capacity, which I have yet to encounter, since algebraic tasks are CPU- and 
IO-bound, not memory-bound. Usually we run out of IO and CPU much sooner than we 
run out of memory, which makes this situation pragmatically unrealistic. 

Note that the optimizer should -- and will -- retain control over caching. We don't 
have an explicit caching API except for checkpoint "hints", but even that is only a 
hint, not a guarantee. Giving it some heuristics about a dataset doesn't guarantee 
that it won't compute others, or won't cache or sample for some other reason, now 
or in the future. 

This situation is fine, as it is one of the functions of the optimizer, as much as 
choosing degrees of parallelization, product task sizes, or operators to execute. 
Making those choices automatically is, actually, the point. As long as the optimizer 
does right enough things, that should be OK. 

Bottom line: I don't see harm in adding ncol and nrow to drmDfsRead specifically. 
There's possibly only a slight benefit right now (no no-cache or no-sample 
guarantee), which will likely only decrease in the future. I am fine with it as long 
as it's understood there's no "no-cache" contract anywhere.


> Allow specification of dimensions of a DRM
> --
>
> Key: MAHOUT-1884
> URL: https://issues.apache.org/jira/browse/MAHOUT-1884
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.12.2
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>Priority: Minor
>
> Currently, in many cases, a DRM must be read to compute its dimensions when a 
> user calls nrow or ncol. This also implicitly caches the corresponding DRM.
> In some cases, the user actually knows the matrix dimensions (e.g., when the 
> matrices are synthetically generated, or when some metadata about them is 
> known). In such cases, the user should be able to specify the dimensions upon 
> creating the DRM and the caching should be avoided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1884) Allow specification of dimensions of a DRM

2016-10-03 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543437#comment-15543437
 ] 

Dmitriy Lyubimov commented on MAHOUT-1884:
--



Which API is this about, specifically?

Wrapping an existing RDD (the drmWrap() API) supports this. 

Also note that for DRMs off disk, these are one-pass computations that cost no more 
than an RDD count(). Since for any dataset we call dfsRead() on, the obvious intent 
is to use it, so loading and caching does no harm, as that's what would happen anyway.

Also, matrix dimensions are the most obvious things, but not everything, that the 
optimizer may need to analyze about the dataset (lazily). There are more heuristics 
about datasets that drmWrap() accepts (and even more that it doesn't). 

If we are talking about cases where drmWrap() cannot be used for some reason, we 
probably should request metadata equivalent to what drmWrap() does, not just 
ncol and nrow.
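
For example, something along these lines, assuming drmWrap's optional dimension arguments (parameter names are from memory and may differ slightly across versions); rddOfRows is a hypothetical pre-existing DrmRdd[Int]:

{code}
import org.apache.mahout.sparkbindings._

// rddOfRows: a pre-existing row-wise RDD (hypothetical placeholder).
// Supplying nrow/ncol up front lets the optimizer answer dimension queries
// without scanning (and implicitly caching) the data.
val drmA = drmWrap(rdd = rddOfRows, nrow = 1000000L, ncol = 500)
{code}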

> Allow specification of dimensions of a DRM
> --
>
> Key: MAHOUT-1884
> URL: https://issues.apache.org/jira/browse/MAHOUT-1884
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.12.2
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>Priority: Minor
>
> Currently, in many cases, a DRM must be read to compute its dimensions when a 
> user calls nrow or ncol. This also implicitly caches the corresponding DRM.
> In some cases, the user actually knows the matrix dimensions (e.g., when the 
> matrices are synthetically generated, or when some metadata about them is 
> known). In such cases, the user should be able to specify the dimensions upon 
> creating the DRM and the caching should be avoided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MAHOUT-1791) Automatic threading for java based mmul in the front end and the backend.

2016-05-04 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271648#comment-15271648
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1791 at 5/5/16 12:03 AM:
---

Experiments show that native solvers + 2 backend threads per task create a good 
CPU saturation balance. 
Ditto all-core threads for the front end.


was (Author: dlyubimov):
Experiments show that native solvers + 2x backend threads per task create a good 
CPU saturation balance. 
Ditto all-core threads for the front end.
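
As a back-of-the-envelope sketch of that heuristic (illustrative values only, not an actual Mahout configuration knob):

{code}
// Front end (driver): use all available cores for in-core mmul.
val frontEndThreads = Runtime.getRuntime.availableProcessors()

// Backend (executors): a couple of threads per task; oversubscribing much beyond
// 2-3x mostly buys context-switching overhead rather than extra throughput.
val backendThreadsPerTask = 2
{code}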

> Automatic threading for java based mmul in the front end and the backend.
> -
>
> Key: MAHOUT-1791
> URL: https://issues.apache.org/jira/browse/MAHOUT-1791
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1, 0.12.0, 0.11.2
>Reporter: Dmitriy Lyubimov
>Assignee: Andrew Musselman
> Fix For: 0.12.1
>
>
> As we know, we are still struggling with decisions which path to take for 
> bare metal accelerations in in-core math. 
> Meanwhile, a simple no-brainer improvement though is to add decision paths 
> and apply multithreaded matrix-matrix multiplication (and maybe even others; 
> but mmul perhaps is the most prominent beneficiary here at the moment which 
> is both easy to do and to have a statistically significant improvement) 
> So multithreaded logic addition to mmul is one path. 
> Another path is automatic adjustment of multithreading. 
> In front end, we probably want to utilize all cores available. 
> in the backend, we can oversubscribe cores but probably doing so by more than 
> 2x or 3x is unadvisable because of point of diminishing returns driven by 
> growing likelihood of context switching overhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1791) Automatic threading for java based mmul in the front end and the backend.

2016-05-04 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271648#comment-15271648
 ] 

Dmitriy Lyubimov commented on MAHOUT-1791:
--

Experiments show that native solvers + 2x backend threads per task create a good 
CPU saturation balance. 
Ditto all-core threads for the front end.

> Automatic threading for java based mmul in the front end and the backend.
> -
>
> Key: MAHOUT-1791
> URL: https://issues.apache.org/jira/browse/MAHOUT-1791
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1, 0.12.0, 0.11.2
>Reporter: Dmitriy Lyubimov
>Assignee: Andrew Musselman
> Fix For: 0.12.1
>
>
> As we know, we are still struggling with decisions which path to take for 
> bare metal accelerations in in-core math. 
> Meanwhile, a simple no-brainer improvement though is to add decision paths 
> and apply multithreaded matrix-matrix multiplication (and maybe even others; 
> but mmul perhaps is the most prominent beneficiary here at the moment which 
> is both easy to do and to have a statistically significant improvement) 
> So multithreaded logic addition to mmul is one path. 
> Another path is automatic adjustment of multithreading. 
> In front end, we probably want to utilize all cores available. 
> in the backend, we can oversubscribe cores but probably doing so by more than 
> 2x or 3x is unadvisable because of point of diminishing returns driven by 
> growing likelihood of context switching overhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1788) spark-itemsimilarity integration test script cleanup

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218531#comment-15218531
 ] 

Dmitriy Lyubimov commented on MAHOUT-1788:
--

[~shashidongur]

Oh, but of course! Please do!

You may work on any issue -- this one or any other of your choice, or even any new 
issue you can think of. (For sizeable contributions it is recommended to start a 
discussion on the dev@ list first, though, to make sure you benefit from the 
experience of others. Please file any new issue in JIRA first.)


> spark-itemsimilarity integration test script cleanup
> 
>
> Key: MAHOUT-1788
> URL: https://issues.apache.org/jira/browse/MAHOUT-1788
> Project: Mahout
>  Issue Type: Improvement
>  Components: cooccurrence
>Affects Versions: 0.11.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Trivial
> Fix For: 1.0.0
>
>
> The binary release does not contain data for the itemsimilarity tests; neither the 
> binary nor the source version will run on a cluster unless data is hand-copied to HDFS.
> Clean this up so it copies the data if needed and the data is included in both versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1755) Mahout DSL for Flink: Flush intermediate results to FS

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1755.


bulk-closing resolved issues

> Mahout DSL for Flink: Flush intermediate results to FS
> --
>
> Key: MAHOUT-1755
> URL: https://issues.apache.org/jira/browse/MAHOUT-1755
> Project: Mahout
>  Issue Type: Task
>  Components: Flink
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> Now Flink (unlike Spark) doesn't keep intermediate results in memory - 
> therefore they should be flushed to a file system, and read back when 
> required. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1810) Failing test in flink-bindings: A + B Identically partitioned (mapBlock Checkpointing issue)

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1810.


bulk-closing resolved issues

> Failing test in flink-bindings: A + B Identically partitioned (mapBlock 
> Checkpointing issue)
> 
>
> Key: MAHOUT-1810
> URL: https://issues.apache.org/jira/browse/MAHOUT-1810
> Project: Mahout
>  Issue Type: Bug
>  Components: Flink
>Affects Versions: 0.11.2
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
>Priority: Blocker
> Fix For: 0.12.0
>
>
> the {{A + B, Identically Partitioned}} test in the Flink RLikeDrmOpsSuite 
> fails.  This test failure likely indicates an issue with Flink's 
> Checkpointing or mapBlock operator:
> {code}
>   test("C = A + B, identically partitioned") {
> val inCoreA = dense((1, 2, 3), (3, 4, 5), (5, 6, 7))
> val A = drmParallelize(inCoreA, numPartitions = 2)
> // Create B which would be identically partitioned to A. mapBlock() by 
> default will do the trick.
> val B = A.mapBlock() {
>   case (keys, block) =>
> val bBlock = block.like() := { (r, c, v) => util.Random.nextDouble()}
> keys -> bBlock
> }
>   // Prevent repeated computation non-determinism
>   // flink problem is here... checkpoint is not doing what it should
>   .checkpoint()
> val inCoreB = B.collect
> printf("A=\n%s\n", inCoreA)
> printf("B=\n%s\n", inCoreB)
> val C = A + B
> val inCoreC = C.collect
> printf("C=\n%s\n", inCoreC)
> // Actual
> val inCoreCControl = inCoreA + inCoreB
> (inCoreC - inCoreCControl).norm should be < 1E-10
>   }
> {code}
> The output shows clearly that the line:
> {code}
> val bBlock = block.like() := { (r, c, v) => util.Random.nextDouble()}
> {code} 
> in the {{mapBlock}} closure is being calculated more than once.
> Output:
> {code}
> A=
> {
>  0 => {0:1.0,1:2.0,2:3.0}
>  1 => {0:3.0,1:4.0,2:5.0}
>  2 => {0:5.0,1:6.0,2:7.0}
> }
> B=
> {
>  0 => {0:0.26203398262809574,1:0.22561543461472167,2:0.23229669514522655}
>  1 => {0:0.1638068194515867,1:0.18751822418846575,2:0.20586366231381614}
>  2 => {0:0.9279465706239354,1:0.2963513448240057,2:0.8866928923235948}
> }
> C=
> {
>  0 => {0:1.7883652623225594,1:2.6401297718606216,2:3.0023341959374195}
>  1 => {0:3.641411452208408,1:4.941233165480053,2:5.381282338548803}
>  2 => {0:5.707434148862531,1:6.022780876943659,2:7.149772825494352}
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1686) Create a documentattion page for ALS

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1686:
-
Sprint: Jan/Feb-2016  (was: Mar/Apr-2016)

> Create a documentattion page for ALS
> 
>
> Key: MAHOUT-1686
> URL: https://issues.apache.org/jira/browse/MAHOUT-1686
> Project: Mahout
>  Issue Type: Documentation
>Affects Versions: 0.11.0
>Reporter: Andrew Palumbo
>Assignee: Andrew Musselman
> Fix For: 0.12.0
>
>
> Following the template of the SSVD and QR pages, create a page for ALS. This 
> page would go under Algorithms -> Distributed Matrix Decomposition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1791) Automatic threading for java based mmul in the front end and the backend.

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1791:
-
Sprint: Jan/Feb-2016  (was: Mar/Apr-2016)

> Automatic threading for java based mmul in the front end and the backend.
> -
>
> Key: MAHOUT-1791
> URL: https://issues.apache.org/jira/browse/MAHOUT-1791
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1, 0.12.0, 0.11.2
>Reporter: Dmitriy Lyubimov
>Assignee: Andrew Musselman
> Fix For: 0.12.1
>
>
> As we know, we are still struggling with decisions which path to take for 
> bare metal accelerations in in-core math. 
> Meanwhile, a simple no-brainer improvement though is to add decision paths 
> and apply multithreaded matrix-matrix multiplication (and maybe even others; 
> but mmul perhaps is the most prominent beneficiary here at the moment which 
> is both easy to do and to have a statistically significant improvement) 
> So multithreaded logic addition to mmul is one path. 
> Another path is automatic adjustment of multithreading. 
> In front end, we probably want to utilize all cores available. 
> in the backend, we can oversubscribe cores but probably doing so by more than 
> 2x or 3x is unadvisable because of point of diminishing returns driven by 
> growing likelihood of context switching overhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1682) Create a documentation page for SPCA

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1682:
-
Sprint: Jan/Feb-2016  (was: Mar/Apr-2016)

> Create a documentation page for SPCA
> 
>
> Key: MAHOUT-1682
> URL: https://issues.apache.org/jira/browse/MAHOUT-1682
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
> Fix For: 0.12.0
>
>
> Following the template of the SSVD and QR pages, create a page for SPCA. This 
> page would go under Algorithms -> Distributed Matrix Decomposition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1765) Mahout DSL for Flink: Add some documentation about Flink backend

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1765:
-
Sprint: Jan/Feb-2016  (was: Mar/Apr-2016)

> Mahout DSL for Flink: Add some documentation about Flink backend
> 
>
> Key: MAHOUT-1765
> URL: https://issues.apache.org/jira/browse/MAHOUT-1765
> Project: Mahout
>  Issue Type: Task
>  Components: Flink
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Trivial
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1821) Use a mahout-flink-conf.yaml configuration file for Mahout specific Flink configuration

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1821:
-
Sprint: Jan/Feb-2016  (was: Mar/Apr-2016)

> Use a mahout-flink-conf.yaml configuration file for Mahout specific Flink 
> configuration
> ---
>
> Key: MAHOUT-1821
> URL: https://issues.apache.org/jira/browse/MAHOUT-1821
> Project: Mahout
>  Issue Type: New Feature
>  Components: Flink
>Affects Versions: 0.11.2
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
> Fix For: 0.12.0
>
>
> Mahout on Flink requires a few Mahout specific configuration parameters.  
> These configurations should be set in a 
> {{$MAHOUT_HOME/conf/mahout-flink-conf.yaml}} file.  The values in it will 
> override the user's {{flink-conf.yaml}} file if the same key is set in both.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1814) Implement drm2intKeyed in flink bindings

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1814:
-
Sprint: Jan/Feb-2016  (was: Mar/Apr-2016)

> Implement drm2intKeyed in flink bindings
> 
>
> Key: MAHOUT-1814
> URL: https://issues.apache.org/jira/browse/MAHOUT-1814
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1786) Make classes implements Serializable for Spark 1.5+

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1786:
-
Sprint: Jan/Feb-2016  (was: Mar/Apr-2016)

> Make classes implements Serializable for Spark 1.5+
> ---
>
> Key: MAHOUT-1786
> URL: https://issues.apache.org/jira/browse/MAHOUT-1786
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.11.0
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: performance
>
> Spark 1.5 comes with a new very efficient serializer that uses code 
> generation.  It is twice as fast as kryo.  When using mahout, we have to set 
> KryoSerializer because some classes aren't serializable otherwise.  
> I suggest to declare Math classes as "implements Serializable" where needed.  
> For instance, to use coocurence package in spark 1.5, we had to modify 
> AbstractMatrix, AbstractVector, DenseVector and SparseRowMatrix to make it 
> work without Kryo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1750) Mahout DSL for Flink: Implement ABt

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1750:
-
Sprint: Jan/Feb-2016  (was: Mar/Apr-2016)

> Mahout DSL for Flink: Implement ABt
> ---
>
> Key: MAHOUT-1750
> URL: https://issues.apache.org/jira/browse/MAHOUT-1750
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Andrew Palumbo
>Priority: Minor
> Fix For: 0.12.0
>
>
> ABt is currently expressed through AtB, which is not optimal; we need a 
> special implementation for ABt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1790) SparkEngine nnz overflow resultSize when reducing.

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1790:
-
Sprint: Jan/Feb-2016  (was: Mar/Apr-2016)

> SparkEngine nnz overflow resultSize when reducing.
> --
>
> Key: MAHOUT-1790
> URL: https://issues.apache.org/jira/browse/MAHOUT-1790
> Project: Mahout
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.11.1
>Reporter: Michel Lemay
>Assignee: Dmitriy Lyubimov
>Priority: Minor
> Fix For: 0.12.1
>
>
> When counting numNonZeroElementsPerColumn in the Spark engine with a large number 
> of columns, we get the following error:
> ERROR TaskSetManager: Total size of serialized results of nnn tasks (1031.7 
> MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
> and then, the call stack:
> org.apache.spark.SparkException: Job aborted due to stage failure: Total size 
> of serialized results of 267 tasks (1024.1 MB) is bigger than 
> spark.driver.maxResultSize (1024.0 MB)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1942)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1003)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)
> at 
> org.apache.mahout.sparkbindings.SparkEngine$.numNonZeroElementsPerColumn(SparkEngine.scala:86)
> at 
> org.apache.mahout.math.drm.CheckpointedOps.numNonZeroElementsPerColumn(CheckpointedOps.scala:37)
> at 
> org.apache.mahout.math.cf.SimilarityAnalysis$.sampleDownAndBinarize(SimilarityAnalysis.scala:286)
> at 
> org.apache.mahout.math.cf.SimilarityAnalysis$.cooccurrences(SimilarityAnalysis.scala:66)
> at 
> org.apache.mahout.math.cf.SimilarityAnalysis$.cooccurrencesIDSs(SimilarityAnalysis.scala:141)
> This occurs because it uses a DenseVector and Spark seemingly aggregates all 
> of them on the driver before reducing.  
> I think this could easily be prevented with a treeReduce(_ += _, depth) 
> instead of a reduce(_ += _).
> 'depth' could be computed as a function of 'n' and numberOfPartitions, 
> something along the lines of:
>   val maxResultSize = 
>   val numPartitions = drm.rdd.partitions.size
>   val n = drm.ncol
>   val bytesPerVector = n * 8 + overhead?
>   val maxVectors = maxResultSize / bytesPerVector / 2 + 1 // be safe
>   val depth = math.max(1, math.ceil(math.log(1 + numPartitions / maxVectors) 
> / math.log(2)).toInt)
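
To make that suggestion concrete, here is a minimal sketch (illustrative constants and helper name, not Mahout's actual SparkEngine code) of deriving the depth and swapping reduce for treeReduce:

{code}
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.Vector
import org.apache.mahout.math.scalabindings.RLikeOps._

// Sum per-partition column-statistics vectors on the executors instead of the driver.
def sumVectors(vectors: RDD[Vector], n: Int, maxResultSizeBytes: Long = 1L << 30): Vector = {
  val numPartitions = vectors.partitions.length
  val bytesPerVector = n.toLong * 8 + 64                         // rough dense-vector estimate
  val maxVectors = maxResultSizeBytes / bytesPerVector / 2 + 1   // stay well under the limit
  val depth = math.max(1,
    math.ceil(math.log(1.0 + numPartitions.toDouble / maxVectors) / math.log(2)).toInt)
  vectors.treeReduce(_ += _, depth)                              // instead of vectors.reduce(_ += _)
}
{code}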



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1570:
-
Sprint: Jan/Feb-2016  (was: Mar/Apr-2016)

> Adding support for Apache Flink as a backend for the Mahout DSL
> ---
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
>  Issue Type: Improvement
>  Components: Flink
>Affects Versions: 0.11.2
>Reporter: Till Rohrmann
>Assignee: Suneel Marthi
>  Labels: DSL, flink, scala
> Fix For: 0.12.0
>
>
> With the finalized abstraction of the Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL. Apache Flink would be a suitable candidate to act as a 
> good execution backend. 
> With respect to the implementation, the biggest difference between Spark and 
> Flink at the moment is probably the incremental rollout of plans, which is 
> triggered by Spark's actions and which is not supported by Flink yet. 
> However, the Flink community is working on this issue. For the moment, it 
> should be possible to circumvent this problem by writing intermediate results 
> required by an action to HDFS and reading from there.
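
For illustration only (plain Flink API with a made-up path and job name, not Mahout code), the workaround described above boils down to forcing execution at the point where an action would be needed and re-reading the materialized result:

{code}
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

val env = ExecutionEnvironment.getExecutionEnvironment
val intermediate: DataSet[String] = env.fromElements("row1", "row2", "row3")

// Materialize the intermediate result to HDFS and run the plan up to this sink.
intermediate.writeAsText("hdfs:///tmp/mahout/intermediate", WriteMode.OVERWRITE)
env.execute("materialize intermediate")

// Resume the computation from the materialized data, much like a Spark action boundary.
val resumed: DataSet[String] = env.readTextFile("hdfs:///tmp/mahout/intermediate")
{code}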



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1779) Brief overview page for the Flink engine

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1779:
-
Sprint: Jan/Feb-2016  (was: Mar/Apr-2016)

> Brief overview page for the Flink engine 
> -
>
> Key: MAHOUT-1779
> URL: https://issues.apache.org/jira/browse/MAHOUT-1779
> Project: Mahout
>  Issue Type: Documentation
>Affects Versions: 0.11.0
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> Add a brief overview page for the Flink bindings similar to the h2o bindings 
> writeup found at:
> http://mahout.apache.org/users/environment/h2o-internals.html
> The page would go under the "Engines" section in the "Mahout Environment" 
> menu on the site.
> Topics would include: a brief overview of the engine, how the Drm API is 
> backed by the Flink engine, interoperability with the engine-native 
> data structures and methods, and any other pertinent information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1786) Make classes implements Serializable for Spark 1.5+

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1786:
-
Sprint: Mar/Apr-2016  (was: Jan/Feb-2016)

> Make classes implements Serializable for Spark 1.5+
> ---
>
> Key: MAHOUT-1786
> URL: https://issues.apache.org/jira/browse/MAHOUT-1786
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.11.0
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: performance
>
> Spark 1.5 comes with a new, very efficient serializer that uses code 
> generation. It is twice as fast as Kryo. When using Mahout, we have to set 
> KryoSerializer because some classes aren't serializable otherwise. 
> I suggest declaring the Math classes as "implements Serializable" where needed. 
> For instance, to use the cooccurrence package in Spark 1.5, we had to modify 
> AbstractMatrix, AbstractVector, DenseVector and SparseRowMatrix to make it 
> work without Kryo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1686) Create a documentation page for ALS

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1686:
-
Sprint: Mar/Apr-2016  (was: Jan/Feb-2016)

> Create a documentation page for ALS
> 
>
> Key: MAHOUT-1686
> URL: https://issues.apache.org/jira/browse/MAHOUT-1686
> Project: Mahout
>  Issue Type: Documentation
>Affects Versions: 0.11.0
>Reporter: Andrew Palumbo
>Assignee: Andrew Musselman
> Fix For: 0.12.0
>
>
> Following the template of the SSVD and QR pages, create a page for ALS. This 
> page would go under Algorithms -> Distributed Matrix Decomposition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1765) Mahout DSL for Flink: Add some documentation about Flink backend

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1765:
-
Sprint: Mar/Apr-2016  (was: Jan/Feb-2016)

> Mahout DSL for Flink: Add some documentation about Flink backend
> 
>
> Key: MAHOUT-1765
> URL: https://issues.apache.org/jira/browse/MAHOUT-1765
> Project: Mahout
>  Issue Type: Task
>  Components: Flink
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Trivial
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1682) Create a documentation page for SPCA

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1682:
-
Sprint: Mar/Apr-2016  (was: Jan/Feb-2016)

> Create a documentation page for SPCA
> 
>
> Key: MAHOUT-1682
> URL: https://issues.apache.org/jira/browse/MAHOUT-1682
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
> Fix For: 0.12.0
>
>
> Following the template of the SSVD and QR pages, create a page for SPCA. This 
> page would go under Algorithms -> Distributed Matrix Decomposition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1779) Brief overview page for the Flink engine

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1779:
-
Sprint: Mar/Apr-2016  (was: Jan/Feb-2016)

> Brief overview page for the Flink engine 
> -
>
> Key: MAHOUT-1779
> URL: https://issues.apache.org/jira/browse/MAHOUT-1779
> Project: Mahout
>  Issue Type: Documentation
>Affects Versions: 0.11.0
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> Add a brief overview page for the Flink bindings similar to the h2o bindings 
> writeup found at:
> http://mahout.apache.org/users/environment/h2o-internals.html
> The page would go under the "Engines" section in the "Mahout Environment" 
> menu on the site.
> Topics would include: a brief overview of the engine, how the Drm API is 
> backed by the Flink engine, interoperability with the engine-native 
> data structures and methods, and any other pertinent information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1750) Mahout DSL for Flink: Implement ABt

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1750:
-
Sprint: Mar/Apr-2016  (was: Jan/Feb-2016)

> Mahout DSL for Flink: Implement ABt
> ---
>
> Key: MAHOUT-1750
> URL: https://issues.apache.org/jira/browse/MAHOUT-1750
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Andrew Palumbo
>Priority: Minor
> Fix For: 0.12.0
>
>
> ABt is currently expressed through AtB, which is not optimal; we need a 
> special implementation for ABt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1814) Implement drm2intKeyed in flink bindings

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1814:
-
Sprint: Mar/Apr-2016  (was: Jan/Feb-2016)

> Implement drm2intKeyed in flink bindings
> 
>
> Key: MAHOUT-1814
> URL: https://issues.apache.org/jira/browse/MAHOUT-1814
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1791) Automatic threading for java based mmul in the front end and the backend.

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1791:
-
Sprint: Mar/Apr-2016  (was: Jan/Feb-2016)

> Automatic threading for java based mmul in the front end and the backend.
> -
>
> Key: MAHOUT-1791
> URL: https://issues.apache.org/jira/browse/MAHOUT-1791
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1, 0.12.0, 0.11.2
>Reporter: Dmitriy Lyubimov
>Assignee: Andrew Musselman
> Fix For: 0.12.1
>
>
> As we know, we are still struggling with the decision of which path to take for 
> bare-metal acceleration of in-core math. 
> Meanwhile, a simple no-brainer improvement is to add decision paths 
> and apply multithreaded matrix-matrix multiplication (and maybe other operations, 
> but mmul is probably the most prominent beneficiary at the moment: it is both 
> easy to do and yields a statistically significant improvement). 
> So adding multithreaded logic to mmul is one path. 
> Another path is automatic adjustment of the degree of multithreading. 
> In the front end, we probably want to utilize all available cores. 
> In the backend, we can oversubscribe cores, but doing so by more than 
> 2x or 3x is probably inadvisable because of diminishing returns driven by the 
> growing likelihood of context-switching overhead.
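
A minimal sketch of the heuristic described above (the isFrontEnd flag and the oversubscription factor are hypothetical names, not an existing Mahout API):

{code}
object MmulThreading {
  // Front end: use every available core.
  // Backend: allow oversubscribing the physical cores, but cap the factor at roughly
  // 2x-3x to keep context-switching overhead in check.
  def numThreads(isFrontEnd: Boolean, oversubscription: Int = 2): Int = {
    val cores = Runtime.getRuntime.availableProcessors()
    if (isFrontEnd) cores else math.max(1, cores * math.min(oversubscription, 3))
  }
}
{code}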



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1570:
-
Sprint: Mar/Apr-2016  (was: Jan/Feb-2016)

> Adding support for Apache Flink as a backend for the Mahout DSL
> ---
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
>  Issue Type: Improvement
>  Components: Flink
>Affects Versions: 0.11.2
>Reporter: Till Rohrmann
>Assignee: Suneel Marthi
>  Labels: DSL, flink, scala
> Fix For: 0.12.0
>
>
> With the finalized abstraction of the Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL. Apache Flink would be a suitable candidate to act as a 
> good execution backend. 
> With respect to the implementation, the biggest difference between Spark and 
> Flink at the moment is probably the incremental rollout of plans, which is 
> triggered by Spark's actions and which is not supported by Flink yet. 
> However, the Flink community is working on this issue. For the moment, it 
> should be possible to circumvent this problem by writing intermediate results 
> required by an action to HDFS and reading from there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1821) Use a mahout-flink-conf.yaml configuration file for Mahout specific Flink configuration

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1821:
-
Sprint: Mar/Apr-2016  (was: Jan/Feb-2016)

> Use a mahout-flink-conf.yaml configuration file for Mahout specific Flink 
> configuration
> ---
>
> Key: MAHOUT-1821
> URL: https://issues.apache.org/jira/browse/MAHOUT-1821
> Project: Mahout
>  Issue Type: New Feature
>  Components: Flink
>Affects Versions: 0.11.2
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
> Fix For: 0.12.0
>
>
> Mahout on Flink requires a few Mahout-specific configuration parameters.  
> These configurations should be set in a 
> {{$MAHOUT_HOME/conf/mahout-flink-conf.yaml}} file.  The values in it will 
> override the user's {{flink-conf.yaml}} file if the same key is set in both.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1751) Mahout DSL for Flink: Implement AtA

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1751.


bulk-closing resolved issues

> Mahout DSL for Flink: Implement AtA
> ---
>
> Key: MAHOUT-1751
> URL: https://issues.apache.org/jira/browse/MAHOUT-1751
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> Now AtA is implemented via AtB



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1804) Implement drmParallelizeWithRowLabels(...) in flink

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1804.


bulk-closing resolved issues

> Implement drmParallelizeWithRowLabels(...) in flink
> ---
>
> Key: MAHOUT-1804
> URL: https://issues.apache.org/jira/browse/MAHOUT-1804
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1733) Move HDFSUtils to some common project

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1733.


bulk-closing resolved issues

> Move HDFSUtils to some common project
> -
>
> Key: MAHOUT-1733
> URL: https://issues.apache.org/jira/browse/MAHOUT-1733
> Project: Mahout
>  Issue Type: Wish
>  Components: Math
>Affects Versions: 0.10.1
>Reporter: Alexey Grigorev
>Assignee: Andrew Palumbo
>Priority: Minor
> Fix For: 0.12.0
>
>
> Hadoop1HDFSUtils doesn't seem to be Spark-specific, and in the Flink bindings we 
> need similar functionality. For now I just copied this class over to 
> the Flink bindings submodule, but it would be nice to have this in some 
> commonly used module, e.g. mahout-hdfs or the scala bindings. 
> Additionally, when the utils class is changed in the Spark bindings (e.g. in 
> MAHOUT-1660), we'll have to copy the changes over, so it would be nice to avoid that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1749) Mahout DSL for Flink: Implement Atx

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1749.


bulk-closing resolved issues

> Mahout DSL for Flink: Implement Atx
> ---
>
> Key: MAHOUT-1749
> URL: https://issues.apache.org/jira/browse/MAHOUT-1749
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Andrew Palumbo
>Priority: Minor
> Fix For: 0.12.0
>
>
> Atx is currently implemented through the At and Ax operators, which is not 
> optimal; it needs a special implementation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1703) Mahout DSL for Flink: implement cbind and rbind

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1703.


bulk-closing resolved issues

> Mahout DSL for Flink: implement cbind and rbind
> ---
>
> Key: MAHOUT-1703
> URL: https://issues.apache.org/jira/browse/MAHOUT-1703
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.11.0
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: flink
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570, implement the following operators on Flink:
> - cbind
> - cbind with scalar
> - rbind
> rbind depends on the mapBlock operator, so it also has to be implemented as 
> part of this Jira.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1702) Mahout DSL for Flink: implement element-wise operators

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1702.


bulk-closing resolved issues

> Mahout DSL for Flink: implement element-wise operators
> --
>
> Key: MAHOUT-1702
> URL: https://issues.apache.org/jira/browse/MAHOUT-1702
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.11.0
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: flink
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570, implement the following operators on Flink:
> - AewB
> - AewScalar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1709) Mahout DSL for Flink: implement slicing

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1709.


bulk-closing resolved issues

>  Mahout DSL for Flink: implement slicing
> 
>
> Key: MAHOUT-1709
> URL: https://issues.apache.org/jira/browse/MAHOUT-1709
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.1
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570, implement the OpRowRange operator for Flink.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1812) Implement drmParallelizeEmptyLong(...) in flink Bindings

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1812.


bulk-closing resolved issues

> Implement drmParallelizeEmptyLong(...) in flink Bindings
> 
>
> Key: MAHOUT-1812
> URL: https://issues.apache.org/jira/browse/MAHOUT-1812
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1805) Implement allreduceBlock(...) in flink bindings

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1805.


bulk-closing resolved issues

> Implement allreduceBlock(...) in flink bindings
> ---
>
> Key: MAHOUT-1805
> URL: https://issues.apache.org/jira/browse/MAHOUT-1805
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1747) Mahout DSL for Flink: add support for different types of indexes (String, long, etc)

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1747.


bulk-closing resolved issues

> Mahout DSL for Flink: add support for different types of indexes (String, 
> long, etc)
> 
>
> Key: MAHOUT-1747
> URL: https://issues.apache.org/jira/browse/MAHOUT-1747
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> For mahout-flink (MAHOUT-1570), rows can currently only be indexed with integers. 
> Add support for other types: Long, String, etc. 
> See FlinkEngine#toPhysical



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1711) Mahout DSL for Flink: implement broadcasting

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1711.


bulk-closing resolved issues

> Mahout DSL for Flink: implement broadcasting
> 
>
> Key: MAHOUT-1711
> URL: https://issues.apache.org/jira/browse/MAHOUT-1711
> Project: Mahout
>  Issue Type: Task
>  Components: Flink
>Affects Versions: 0.10.1
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570, implement drmBroadcast for Flink.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1817) Implement caching in Flink Bindings

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1817.


bulk-closing resolved issues

> Implement caching in Flink Bindings
> ---
>
> Key: MAHOUT-1817
> URL: https://issues.apache.org/jira/browse/MAHOUT-1817
> Project: Mahout
>  Issue Type: New Feature
>  Components: Flink
>Affects Versions: 0.11.2
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
>Priority: Blocker
> Fix For: 0.12.0
>
>
> Flink does not have in-memory caching analogous to that of Spark.  We need to 
> find a way to honour the {{checkpoint()}} contract in the Flink bindings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1748) Mahout DSL for Flink: switch to Flink Scala API

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1748.


bulk-closing resolved issues

> Mahout DSL for Flink: switch to Flink Scala API
> ---
>
> Key: MAHOUT-1748
> URL: https://issues.apache.org/jira/browse/MAHOUT-1748
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> In Flink-Mahout (MAHOUT-1570) the Flink Java API is used because the Scala API 
> caused various strange compilation problems. 
> But the Scala API handles types better than the Java API, so it's better to 
> switch to it. That may also solve MAHOUT-1747.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1701) Mahout DSL for Flink: implement AtB ABt and AtA operators

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1701.


bulk-closing resolved issues

> Mahout DSL for Flink: implement AtB ABt and AtA operators
> -
>
> Key: MAHOUT-1701
> URL: https://issues.apache.org/jira/browse/MAHOUT-1701
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.11.0
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: flink
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570, implement the following operators on Flink:
> - AtB
> - ABt
> - AtA 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1789) Remove DataSetOps.scala from mahout-flink, replace by DataSetUtils from Flink 0.10

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1789.


bulk-closing resolved issues

> Remove DataSetOps.scala from mahout-flink, replace by DataSetUtils from Flink 
> 0.10
> --
>
> Key: MAHOUT-1789
> URL: https://issues.apache.org/jira/browse/MAHOUT-1789
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.11.0
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 0.12.0
>
>
> DataSetOps#zipWithIndex is not needed anymore in light of the DataSetUtils 
> available in Flink 0.10+.  
> DataSetUtils.{java, scala} from Flink 0.10+ provides the methods 
> zipWithIndex(), zipWithUniqueId(), sample() and sampleWithSize(), all of which 
> are needed to support the Mahout-Flink backend.
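
For reference, a small sketch of what the replacement looks like with the Flink 0.10+ Scala utilities (assumes the org.apache.flink.api.scala.utils implicits; this is not Mahout code):

{code}
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.utils._   // zipWithIndex, zipWithUniqueId, sample*, ...

object ZipWithIndexSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val rows: DataSet[String] = env.fromElements("a", "b", "c")

    // Each element is paired with a consecutive Long index, as DataSetOps#zipWithIndex used to do.
    val indexed: DataSet[(Long, String)] = rows.zipWithIndex
    indexed.print()
  }
}
{code}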



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1710) Mahout DSL for Flink: implement right in-core matrix multiplication

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1710.


bulk-closing resolved issues

> Mahout DSL for Flink: implement right in-core matrix multiplication
> ---
>
> Key: MAHOUT-1710
> URL: https://issues.apache.org/jira/browse/MAHOUT-1710
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.1
>Reporter: Alexey Grigorev
>Assignee: Andrew Palumbo
>Priority: Minor
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570, implement A %*% B multiplication when B is an 
> in-core matrix. 
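
As a usage illustration (Mahout shell-style snippet where the distributed context is already implicit; the matrices are made-up examples):

{code}
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

val inCoreA = dense((1, 2), (3, 4), (5, 6))              // 3 x 2
val inCoreB = dense((1, 0), (0, 1))                      // 2 x 2, stays in-core
val drmA = drmParallelize(inCoreA, numPartitions = 2)

// Right multiplication of a DRM by an in-core matrix -- the case this issue asks Flink to support.
val drmC = drmA %*% inCoreB
val inCoreC = drmC.collect
{code}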



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1712) Mahout DSL for Flink: implement operators At, Ax, Atx

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1712.


bulk-closing resolved issues

> Mahout DSL for Flink: implement operators At, Ax, Atx
> -
>
> Key: MAHOUT-1712
> URL: https://issues.apache.org/jira/browse/MAHOUT-1712
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.1
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570, implement the following operators on Flink:
> - At
> - Ax
> - Atx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1809) Failing tests in Flink-bindings: dals and dspca

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1809.


bulk-closing resolved issues

> Failing tests in Flink-bindings: dals and dspca
> ---
>
> Key: MAHOUT-1809
> URL: https://issues.apache.org/jira/browse/MAHOUT-1809
> Project: Mahout
>  Issue Type: Bug
>  Components: Flink
>Affects Versions: 0.11.2
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
>Priority: Blocker
> Fix For: 0.12.0
>
>
> {{dspca}} and {{dals}} are failing in the Flink distributed decomposition 
> suite with numerical and OOM errors, respectively:
> {{dspca}} Failure and stack trace:
> {code}
> 54.69239412917543 was not less than 1.0E-5
> ScalaTestFailureLocation: 
> org.apache.mahout.flinkbindings.FailingTestsSuite$$anonfun$4 at 
> (FailingTestsSuite.scala:230)
> org.scalatest.exceptions.TestFailedException: 54.69239412917543 was not less 
> than 1.0E-5
>   at 
> org.scalatest.MatchersHelper$.newTestFailedException(MatchersHelper.scala:160)
>   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6231)
>   at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6265)
>   at 
> org.apache.mahout.flinkbindings.FailingTestsSuite$$anonfun$4.apply$mcV$sp(FailingTestsSuite.scala:230)
>   at 
> org.apache.mahout.flinkbindings.FailingTestsSuite$$anonfun$4.apply(FailingTestsSuite.scala:186)
>   at 
> org.apache.mahout.flinkbindings.FailingTestsSuite$$anonfun$4.apply(FailingTestsSuite.scala:186)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.mahout.flinkbindings.FailingTestsSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FailingTestsSuite.scala:48)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.mahout.flinkbindings.FailingTestsSuite.runTest(FailingTestsSuite.scala:48)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.mahout.flinkbindings.FailingTestsSuite.org$scalatest$BeforeAndAfterAllConfigMap$$super$run(FailingTestsSuite.scala:48)
>   at 
> org.scalatest.BeforeAndAfterAllConfigMap$class.liftedTree1$1(BeforeAndAfterAllConfigMap.scala:248)
>   at 
> org.scalatest.BeforeAndAfterAllConfigMap$class.run(BeforeAndAfterAllConfigMap.scala:247)
>   at 
> org.apache.mahout.flinkbindings.FailingTestsSuite.run(FailingTestsSuite.scala:48)
>   at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
>   at 
> org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
>   at 
> 

[jira] [Closed] (MAHOUT-1815) dsqDist(X,Y) and dsqDist(X) failing in flink tests.

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1815.


bulk-closing resolved issues

> dsqDist(X,Y) and dsqDist(X) failing in flink tests.
> ---
>
> Key: MAHOUT-1815
> URL: https://issues.apache.org/jira/browse/MAHOUT-1815
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
>Priority: Blocker
>
> {code}
>   test("dsqDist(X,Y)") {
> val m = 100
> val n = 300
> val d = 7
> val mxX = Matrices.symmetricUniformView(m, d, 12345).cloned -= 5
> val mxY = Matrices.symmetricUniformView(n, d, 1234).cloned += 10
> val (drmX, drmY) = (drmParallelize(mxX, 3), drmParallelize(mxY, 4))
> val mxDsq = dsqDist(drmX, drmY).collect
> val mxDsqControl = new DenseMatrix(m, n) := { (r, c, _) ⇒ (mxX(r, ::) - 
> mxY(c, ::)) ^= 2 sum }
> (mxDsq - mxDsqControl).norm should be < 1e-7
>   }
> {code}
> And 
> {code}
>  test("dsqDist(X)") {
> val m = 100
> val d = 7
> val mxX = Matrices.symmetricUniformView(m, d, 12345).cloned -= 5
> val drmX = drmParallelize(mxX, 3)
> val mxDsq = dsqDist(drmX).collect
> val mxDsqControl = sqDist(drmX)
> (mxDsq - mxDsqControl).norm should be < 1e-7
>   }
> {code}
> are both failing in Flink tests with {{ArrayIndexOutOfBoundsException}}s:
> {code}
> 03/15/2016 17:02:19   DataSink 
> (org.apache.flink.api.java.Utils$CollectHelper@568b43ab)(5/10) switched to 
> FINISHED 
> 1 [CHAIN GroupReduce (GroupReduce at 
> org.apache.mahout.flinkbindings.blas.FlinkOpAtB$.notZippable(FlinkOpAtB.scala:78))
>  -> Map (Map at 
> org.apache.mahout.flinkbindings.blas.FlinkOpMapBlock$.apply(FlinkOpMapBlock.scala:37))
>  -> FlatMap (FlatMap at 
> org.apache.mahout.flinkbindings.drm.BlockifiedFlinkDrm.asRowWise(FlinkDrm.scala:93))
>  (8/10)] ERROR org.apache.flink.runtime.operators.BatchTask  - Error in task 
> code:  CHAIN GroupReduce (GroupReduce at 
> org.apache.mahout.flinkbindings.blas.FlinkOpAtB$.notZippable(FlinkOpAtB.scala:78))
>  -> Map (Map at 
> org.apache.mahout.flinkbindings.blas.FlinkOpMapBlock$.apply(FlinkOpMapBlock.scala:37))
>  -> FlatMap (FlatMap at 
> org.apache.mahout.flinkbindings.drm.BlockifiedFlinkDrm.asRowWise(FlinkDrm.scala:93))
>  (8/10)
> java.lang.ArrayIndexOutOfBoundsException: 5
>   at 
> org.apache.mahout.math.drm.package$$anonfun$4$$anonfun$apply$3.apply(package.scala:317)
>   at 
> org.apache.mahout.math.drm.package$$anonfun$4$$anonfun$apply$3.apply(package.scala:317)
>   at 
> org.apache.mahout.math.scalabindings.MatrixOps$$anonfun$$colon$eq$3$$anonfun$apply$2.apply(MatrixOps.scala:164)
>   at 
> org.apache.mahout.math.scalabindings.MatrixOps$$anonfun$$colon$eq$3$$anonfun$apply$2.apply(MatrixOps.scala:164)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.mahout.math.scalabindings.MatrixOps$$anonfun$$colon$eq$3.apply(MatrixOps.scala:164)
>   at 
> org.apache.mahout.math.scalabindings.MatrixOps$$anonfun$$colon$eq$3.apply(MatrixOps.scala:164)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.mahout.math.scalabindings.MatrixOps.$colon$eq(MatrixOps.scala:164)
>   at 
> org.apache.mahout.math.drm.package$$anonfun$4.apply(package.scala:317)
>   at 
> org.apache.mahout.math.drm.package$$anonfun$4.apply(package.scala:311)
>   at 
> org.apache.mahout.flinkbindings.blas.FlinkOpMapBlock$$anonfun$1.apply(FlinkOpMapBlock.scala:39)
>   at 
> org.apache.mahout.flinkbindings.blas.FlinkOpMapBlock$$anonfun$1.apply(FlinkOpMapBlock.scala:38)
>   at org.apache.flink.api.scala.DataSet$$anon$1.map(DataSet.scala:297)
>   at 
> org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:78)
>   at 
> org.apache.mahout.flinkbindings.blas.FlinkOpAtB$$anon$6.reduce(FlinkOpAtB.scala:86)
>   at 
> org.apache.flink.runtime.operators.GroupReduceDriver.run(GroupReduceDriver.java:125)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:480)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1734) Mahout DSL for Flink: implement I/O

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1734.


bulk-closing resolved issues

> Mahout DSL for Flink: implement I/O
> ---
>
> Key: MAHOUT-1734
> URL: https://issues.apache.org/jira/browse/MAHOUT-1734
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Andrew Palumbo
>Priority: Minor
>  Labels: flink
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570, implement reading DRMs from the file system and 
> writing the results back



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1682) Create a documentation page for SPCA

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1682:
-
Sprint: Jan/Feb-2016  (was: Nov/Dec-2015)

> Create a documentation page for SPCA
> 
>
> Key: MAHOUT-1682
> URL: https://issues.apache.org/jira/browse/MAHOUT-1682
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
> Fix For: 0.12.0
>
>
> Following the template of the SSVD and QR pages, create a page for SPCA. This 
> page would go under Algorithms -> Distributed Matrix Decomposition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1686) Create a documentation page for ALS

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1686:
-
Sprint: Jan/Feb-2016  (was: Nov/Dec-2015)

> Create a documentation page for ALS
> 
>
> Key: MAHOUT-1686
> URL: https://issues.apache.org/jira/browse/MAHOUT-1686
> Project: Mahout
>  Issue Type: Documentation
>Affects Versions: 0.11.0
>Reporter: Andrew Palumbo
>Assignee: Andrew Musselman
> Fix For: 0.12.0
>
>
> Following the template of the SSVD and QR pages, create a page for ALS. This 
> page would go under Algorithms -> Distributed Matrix Decomposition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1570:
-
Sprint: Jan/Feb-2016  (was: Nov/Dec-2015)

> Adding support for Apache Flink as a backend for the Mahout DSL
> ---
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
>  Issue Type: Improvement
>  Components: Flink
>Affects Versions: 0.11.2
>Reporter: Till Rohrmann
>Assignee: Suneel Marthi
>  Labels: DSL, flink, scala
> Fix For: 0.12.0
>
>
> With the finalized abstraction of the Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL. Apache Flink would be a suitable candidate to act as a 
> good execution backend. 
> With respect to the implementation, the biggest difference between Spark and 
> Flink at the moment is probably the incremental rollout of plans, which is 
> triggered by Spark's actions and which is not supported by Flink yet. 
> However, the Flink community is working on this issue. For the moment, it 
> should be possible to circumvent this problem by writing intermediate results 
> required by an action to HDFS and reading from there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1765) Mahout DSL for Flink: Add some documentation about Flink backend

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1765:
-
Sprint: Jan/Feb-2016  (was: Nov/Dec-2015)

> Mahout DSL for Flink: Add some documentation about Flink backend
> 
>
> Key: MAHOUT-1765
> URL: https://issues.apache.org/jira/browse/MAHOUT-1765
> Project: Mahout
>  Issue Type: Task
>  Components: Flink
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Trivial
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1790) SparkEngine nnz overflow resultSize when reducing.

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1790:
-
Sprint: Jan/Feb-2016  (was: Nov/Dec-2015)

> SparkEngine nnz overflow resultSize when reducing.
> --
>
> Key: MAHOUT-1790
> URL: https://issues.apache.org/jira/browse/MAHOUT-1790
> Project: Mahout
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.11.1
>Reporter: Michel Lemay
>Assignee: Dmitriy Lyubimov
>Priority: Minor
> Fix For: 0.12.1
>
>
> When counting numNonZeroElementsPerColumn in the Spark engine with a large number 
> of columns, we get the following error:
> ERROR TaskSetManager: Total size of serialized results of nnn tasks (1031.7 
> MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
> and then, the call stack:
> org.apache.spark.SparkException: Job aborted due to stage failure: Total size 
> of serialized results of 267 tasks (1024.1 MB) is bigger than 
> spark.driver.maxResultSize (1024.0 MB)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1942)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1003)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)
> at 
> org.apache.mahout.sparkbindings.SparkEngine$.numNonZeroElementsPerColumn(SparkEngine.scala:86)
> at 
> org.apache.mahout.math.drm.CheckpointedOps.numNonZeroElementsPerColumn(CheckpointedOps.scala:37)
> at 
> org.apache.mahout.math.cf.SimilarityAnalysis$.sampleDownAndBinarize(SimilarityAnalysis.scala:286)
> at 
> org.apache.mahout.math.cf.SimilarityAnalysis$.cooccurrences(SimilarityAnalysis.scala:66)
> at 
> org.apache.mahout.math.cf.SimilarityAnalysis$.cooccurrencesIDSs(SimilarityAnalysis.scala:141)
> This occurs because it uses a DenseVector and Spark seemingly aggregates all 
> of them on the driver before reducing.  
> I think this could easily be prevented with a treeReduce(_ += _, depth) 
> instead of a reduce(_ += _).
> 'depth' could be computed as a function of 'n' and numberOfPartitions, 
> something along the lines of:
>   val maxResultSize = 
>   val numPartitions = drm.rdd.partitions.size
>   val n = drm.ncol
>   val bytesPerVector = n * 8 + overhead?
>   val maxVectors = maxResultSize / bytesPerVector / 2 + 1 // be safe
>   val depth = math.max(1, math.ceil(math.log(1 + numPartitions / maxVectors) 
> / math.log(2)).toInt)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1750) Mahout DSL for Flink: Implement ABt

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1750:
-
Sprint: Jan/Feb-2016  (was: Nov/Dec-2015)

> Mahout DSL for Flink: Implement ABt
> ---
>
> Key: MAHOUT-1750
> URL: https://issues.apache.org/jira/browse/MAHOUT-1750
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Andrew Palumbo
>Priority: Minor
> Fix For: 0.12.0
>
>
> ABt is currently expressed through AtB, which is not optimal; we need a 
> special implementation for ABt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1791) Automatic threading for java based mmul in the front end and the backend.

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1791:
-
Sprint: Jan/Feb-2016  (was: Nov/Dec-2015)

> Automatic threading for java based mmul in the front end and the backend.
> -
>
> Key: MAHOUT-1791
> URL: https://issues.apache.org/jira/browse/MAHOUT-1791
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1, 0.12.0, 0.11.2
>Reporter: Dmitriy Lyubimov
>Assignee: Andrew Musselman
> Fix For: 0.12.1
>
>
> As we know, we are still struggling with the decision of which path to take for 
> bare-metal acceleration of in-core math. 
> Meanwhile, a simple no-brainer improvement is to add decision paths 
> and apply multithreaded matrix-matrix multiplication (and maybe other operations, 
> but mmul is probably the most prominent beneficiary at the moment: it is both 
> easy to do and yields a statistically significant improvement). 
> So adding multithreaded logic to mmul is one path. 
> Another path is automatic adjustment of the degree of multithreading. 
> In the front end, we probably want to utilize all available cores. 
> In the backend, we can oversubscribe cores, but doing so by more than 
> 2x or 3x is probably inadvisable because of diminishing returns driven by the 
> growing likelihood of context-switching overhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1779) Brief overview page for the Flink engine

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1779:
-
Sprint: Jan/Feb-2016  (was: Nov/Dec-2015)

> Brief overview page for the Flink engine 
> -
>
> Key: MAHOUT-1779
> URL: https://issues.apache.org/jira/browse/MAHOUT-1779
> Project: Mahout
>  Issue Type: Documentation
>Affects Versions: 0.11.0
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> Add a brief overview page for the Flink bindings similar to the h2o bindings 
> writeup found at:
> http://mahout.apache.org/users/environment/h2o-internals.html
> The page would go under the "Engines" section in the "Mahout Environment" 
> menu on the site.
> Topics would include: a brief overview of the engine, how the Drm API is 
> backed by the Flink engine, interoperability with the engine-native 
> data structures and methods, and any other pertinent information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1821) Use a mahout-flink-conf.yaml configuration file for Mahout specific Flink configuration

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1821:
-
Sprint: Jan/Feb-2016  (was: Nov/Dec-2015)

> Use a mahout-flink-conf.yaml configuration file for Mahout specific Flink 
> configuration
> ---
>
> Key: MAHOUT-1821
> URL: https://issues.apache.org/jira/browse/MAHOUT-1821
> Project: Mahout
>  Issue Type: New Feature
>  Components: Flink
>Affects Versions: 0.11.2
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
> Fix For: 0.12.0
>
>
> Mahout on Flink requires a few Mahout-specific configuration parameters.  
> These configurations should be set in a 
> {{$MAHOUT_HOME/conf/mahout-flink-conf.yaml}} file.  The values in it will 
> override the user's {{flink-conf.yaml}} file if the same key is set in both.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1814) Implement drm2intKeyed in flink bindings

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1814:
-
Sprint: Jan/Feb-2016  (was: Nov/Dec-2015)

> Implement drm2intKeyed in flink bindings
> 
>
> Key: MAHOUT-1814
> URL: https://issues.apache.org/jira/browse/MAHOUT-1814
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1700) OutOfMemory Problem in ABtDenseOutJob in Distributed SSVD

2016-03-30 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218129#comment-15218129
 ] 

Dmitriy Lyubimov commented on MAHOUT-1700:
--

After reading it again, I don't see it as a big problem, on the following basis: 

(1) The deficiency can be solved by more careful input preparation and figuring 
out the input's true parsimonious dimensionality instead of declaring everything of 
Integer.MAX_VALUE dimension.

(2) The user claims that another workaround, properly adjusting working memory, 
is not possible due to lack of privileges. That is not a problem of the project 
per se.

Given the status of MR and the effort required to re-test MR cases at any 
significant volume, I'd say a NOT A PROBLEM resolution is due.

> OutOfMemory Problem in ABtDenseOutJob in Distributed SSVD
> -
>
> Key: MAHOUT-1700
> URL: https://issues.apache.org/jira/browse/MAHOUT-1700
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.9, 0.10.0
>Reporter: Ethan Yi
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 0.12.0
>
>
>  Recently, I tried Mahout's Hadoop SSVD (mahout-0.9 or mahout-1.0) job. 
> There's a Java heap space out-of-memory problem in ABtDenseOutJob. I found 
> the reason; the ABtDenseOutJob map code is as below:
> protected void map(Writable key, VectorWritable value, Context context)
>   throws IOException, InterruptedException {
>   Vector vec = value.get();
>   int vecSize = vec.size();
>   if (aCols == null) {
> aCols = new Vector[vecSize];
>   } else if (aCols.length < vecSize) {
> aCols = Arrays.copyOf(aCols, vecSize);
>   }
>   if (vec.isDense()) {
> for (int i = 0; i < vecSize; i++) {
>   extendAColIfNeeded(i, aRowCount + 1);
>   aCols[i].setQuick(aRowCount, vec.getQuick(i));
> }
>   } else if (vec.size() > 0) {
> for (Vector.Element vecEl : vec.nonZeroes()) {
>   int i = vecEl.index();
>   extendAColIfNeeded(i, aRowCount + 1);
>   aCols[i].setQuick(aRowCount, vecEl.get());
> }
>   }
>   aRowCount++;
> }
> If the input is a RandomAccessSparseVector, as is usual with big data, its 
> vec.size() is Integer.MAX_VALUE (2^31 - 1), so aCols = new Vector[vecSize] 
> will introduce the OutOfMemory problem. The fix, of course, should be to 
> enlarge every TaskTracker's maximum memory:
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx1024m</value>
> </property>
> However, if you are NOT a Hadoop administrator or ops, you have no permission 
> to modify that config. So I tried to modify the ABtDenseOutJob map code to 
> support the RandomAccessSparseVector situation: I use a HashMap to represent 
> aCols instead of the original Vector[] aCols array. The modified code is as below:
> private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();
> protected void map(Writable key, VectorWritable value, Context context)
>   throws IOException, InterruptedException {
>   Vector vec = value.get();
>   int vecSize = vec.size();
>   if (vec.isDense()) {
> for (int i = 0; i < vecSize; i++) {
>   //extendAColIfNeeded(i, aRowCount + 1);
>   if (aColsMap.get(i) == null) {
> aColsMap.put(i, new 
> RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>   }
>   aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
>   //aCols[i].setQuick(aRowCount, vec.getQuick(i));
> }
>   } else if (vec.size() > 0) {
> for (Vector.Element vecEl : vec.nonZeroes()) {
>   int i = vecEl.index();
>   //extendAColIfNeeded(i, aRowCount + 1);
>   if (aColsMap.get(i) == null) {
> aColsMap.put(i, new 
> RandomAccessSparseVector(Integer.MAX_VALUE, 100));
>   }
>   aColsMap.get(i).setQuick(aRowCount, vecEl.get());
>   //aCols[i].setQuick(aRowCount, vecEl.get());
> }
>   }
>   aRowCount++;
> }
> Then the OutofMemory problem is dismissed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MAHOUT-1790) SparkEngine nnz overflow resultSize when reducing.

2016-03-25 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212117#comment-15212117
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1790 at 3/25/16 5:25 PM:
---

[~smarthi]
(1) Actually, I don't know why this comment (of mine) got in here. This is a 
different issue altogether -- it is an issue of a Spark setting. If a large 
volume of information is collected to the front end, Spark has a setting that 
limits it to a certain size by default (1 GB), apparently in an attempt to 
maintain the robustness of the backend w.r.t. runaway processes and poor 
programming. If there is a legitimately big collection to the front end that 
exceeds 1 GB, then the solution is just to bump up this setting in Spark 
(the {{spark.driver.maxResultSize}} setting shown in the error below).

(2) I still have a question about why the invoked operation would cause such a 
big collection to the front end -- this needs to be investigated. However, this 
capability (numNonZero etc.) was added by either [~ssc] or [~pferrel], so I'd 
suggest those folks take a look at it again. I don't have first-hand knowledge 
of this logic.

If everything else fails, I'll take a look at some point -- but not soon, as it 
is not a priority for me.







was (Author: dlyubimov):
[~smarthi]
(1) Actually i don't know why this comment (of mine) got in here. This is a 
different issue altogether -- it is an issue of a spark setting. if there's a 
collection of large volume of information to the front end, spark has a setting 
to limit it at certain number by default (1Gb), apparently in attempt to 
maintain the robustness of the backend w.r.t. runaway processes and ill 
programming. If there's a legitimate big collection to the front that exceeds 
1G then the solution is just bump up this setting with spark. 
(-Dspark.max.result or something).

(2) I still have a question why would the operation invoked cause such a big 
collection to the front -- this needs to be investigated. However, this 
capability, numNonZero etc. was added by either [~ssc] or [~pferrel], so i'd 
suggest for these folks to take a look at it again. I don't know first hand 
knowledge of this logic. 

If everything else fails, i'll take a look at some point -- but not soon, as it 
is not a priority for me.






> SparkEngine nnz overflow resultSize when reducing.
> --
>
> Key: MAHOUT-1790
> URL: https://issues.apache.org/jira/browse/MAHOUT-1790
> Project: Mahout
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.11.1
>Reporter: Michel Lemay
>Assignee: Dmitriy Lyubimov
>Priority: Minor
> Fix For: 0.12.0
>
>
> When counting numNonZeroElementsPerColumn in the Spark engine with a large 
> number of columns, we get the following error:
> ERROR TaskSetManager: Total size of serialized results of nnn tasks (1031.7 
> MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
> and then, the call stack:
> org.apache.spark.SparkException: Job aborted due to stage failure: Total size 
> of serialized results of 267 tasks (1024.1 MB) is bigger than 
> spark.driver.maxResultSize (1024.0 MB)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1942)
> at 

[jira] [Commented] (MAHOUT-1790) SparkEngine nnz overflow resultSize when reducing.

2016-03-25 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212117#comment-15212117
 ] 

Dmitriy Lyubimov commented on MAHOUT-1790:
--

[~smarthi]
(1) Actually, I don't know why this comment (of mine) got in here. This is a 
different issue altogether -- it is an issue of a Spark setting. If a large 
volume of information is collected to the front end, Spark has a setting that 
limits it to a certain size by default (1 GB), apparently in an attempt to 
maintain the robustness of the backend w.r.t. runaway processes and poor 
programming. If there is a legitimately big collection to the front end that 
exceeds 1 GB, then the solution is just to bump up this setting in Spark 
(the {{spark.driver.maxResultSize}} setting shown in the error below).

(2) I still have a question about why the invoked operation would cause such a 
big collection to the front end -- this needs to be investigated. However, this 
capability (numNonZero etc.) was added by either [~ssc] or [~pferrel], so I'd 
suggest those folks take a look at it again. I don't have first-hand knowledge 
of this logic.

If everything else fails, I'll take a look at some point -- but not soon, as it 
is not a priority for me.
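
For reference, a minimal sketch of bumping the driver result-size limit referenced 
above. The application name and the 4g value are arbitrary placeholders; the only 
claim taken from Spark itself is the setting's name and its 1g default.

{code}
import org.apache.spark.SparkConf

// Raise the cap on the total size of serialized results collected to the driver.
// The default is 1g; a value of 0 removes the limit entirely.
val conf = new SparkConf()
  .setAppName("mahout-samsara-job")          // hypothetical application name
  .set("spark.driver.maxResultSize", "4g")   // illustrative value only
{code}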






> SparkEngine nnz overflow resultSize when reducing.
> --
>
> Key: MAHOUT-1790
> URL: https://issues.apache.org/jira/browse/MAHOUT-1790
> Project: Mahout
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.11.1
>Reporter: Michel Lemay
>Assignee: Dmitriy Lyubimov
>Priority: Minor
> Fix For: 0.12.0
>
>
> When counting numNonZeroElementsPerColumn in the Spark engine with a large 
> number of columns, we get the following error:
> ERROR TaskSetManager: Total size of serialized results of nnn tasks (1031.7 
> MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
> and then, the call stack:
> org.apache.spark.SparkException: Job aborted due to stage failure: Total size 
> of serialized results of 267 tasks (1024.1 MB) is bigger than 
> spark.driver.maxResultSize (1024.0 MB)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1942)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1003)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)
> at 
> org.apache.mahout.sparkbindings.SparkEngine$.numNonZeroElementsPerColumn(SparkEngine.scala:86)
> at 
> org.apache.mahout.math.drm.CheckpointedOps.numNonZeroElementsPerColumn(CheckpointedOps.scala:37)
> at 
> org.apache.mahout.math.cf.SimilarityAnalysis$.sampleDownAndBinarize(SimilarityAnalysis.scala:286)
> at 
> org.apache.mahout.math.cf.SimilarityAnalysis$.cooccurrences(SimilarityAnalysis.scala:66)
> at 
> org.apache.mahout.math.cf.SimilarityAnalysis$.cooccurrencesIDSs(SimilarityAnalysis.scala:141)
> This occurs because it uses a DenseVector and Spark seemingly aggregates all 
> of them on the driver before reducing.  
> I think this could be easily prevented with a treeReduce(_ += 

[jira] [Commented] (MAHOUT-1755) Mahout DSL for Flink: Flush intermediate results to FS

2016-03-21 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205666#comment-15205666
 ] 

Dmitriy Lyubimov commented on MAHOUT-1755:
--

Interesting. This gets back to caching checkpoints: caching checkpoints is a way 
to optimize common computational paths in DAGs.

So the only way to implement checkpoints in Flink (other than with cacheHint = 
NONE) is therefore always dumping data to DFS? If so, this is a serious 
limitation, as it prevents the most effective form of caching, i.e., keeping the 
object trees themselves. Even if an HDFS checkpoint hits the memory cache, we 
still need to spend time serializing and deserializing partitions for every new 
tiny bit of computation (such as taking a mean or a sum, or reducing and 
re-broadcasting intermediate statistics). This loop is very common for 
algorithms that run until convergence. Heavyweight scheduling inside the loop in 
these types of systems (as opposed to one-time scheduling outside the loop in 
superstep systems) is already bad enough. If on top of that we need to serialize 
and deserialize while running 50 convergence iterations, it is pretty disastrous.

So there's absolutely no way to keep datasets as object trees inside the worker 
VMs between computations?
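
To make the cost concrete, here is a minimal sketch (assuming the Samsara DSL; the 
function name, the convergence statistic, and the iteration bounds are illustrative 
only) of the convergence-loop pattern described above, where every pass re-reads the 
same checkpointed DRM. If the checkpoint can only live on DFS, each pass pays 
serialization and deserialization again.

{code}
import org.apache.mahout.math.drm._

// Convergence loop over a cached checkpoint: each iteration triggers a small
// computation (a column-means pass) over the same dataset.
def runUntilConvergence(drmA: DrmLike[Int], maxIter: Int = 50, eps: Double = 1e-6): Int = {
  // Ideally this keeps object trees in the worker JVMs rather than spilling to DFS.
  val cachedA = drmA.checkpoint(CacheHint.MEMORY_ONLY)
  var prev = Double.MaxValue
  var iter = 0
  var converged = false
  while (iter < maxIter && !converged) {
    val stat = cachedA.colMeans().zSum() / cachedA.ncol  // tiny per-iteration statistic
    converged = math.abs(stat - prev) < eps
    prev = stat
    iter += 1
  }
  iter
}
{code}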

> Mahout DSL for Flink: Flush intermediate results to FS
> --
>
> Key: MAHOUT-1755
> URL: https://issues.apache.org/jira/browse/MAHOUT-1755
> Project: Mahout
>  Issue Type: Task
>  Components: Flink
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> Now Flink (unlike Spark) doesn't keep intermediate results in memory - 
> therefore they should be flushed to a file system, and read back when 
> required. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1810) Failing test in flink-bindings: A %*% B Identically partitioned

2016-03-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199864#comment-15199864
 ] 

Dmitriy Lyubimov commented on MAHOUT-1810:
--

The assumption of identical partitioning depends on the engine. Maybe it doesn't 
hold in the case of Flink at all?

In this case (checkpoint or not), the assumption is that collection.map(x => x) 
changes neither the allocation of data to splits nor its ordering inside every 
split (aka partition). If this holds, then input and output are identically 
partitioned.

Therefore, if B = A.map(x => x ...), then A and B are identically partitioned, 
and A + B can be optimized as A.zip(B).map(_._1 + _._2). If A and B are not 
identically partitioned, then elementwise binary functions would require a 
pre-join, which is much more expensive than a zip.

This test simply provokes this optimization (in Spark), but if the engine 
doesn't support zips, or the assumption of identical partitioning does not hold, 
then the engine optimizer should rectify the situation by always executing 
join() after mapBlocks. Check back with me for more info on where to hack it if 
this is indeed the case.
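
Purely for exposition (plain Spark RDDs, not Mahout's optimizer code; names and types 
are made up), a sketch of the zip-versus-join trade-off described above:

{code}
import org.apache.spark.rdd.RDD

// Elementwise add of two keyed datasets. zip() requires the same split allocation
// and the same ordering within each split (i.e. identical partitioning) and is a
// narrow, shuffle-free operation; join() has no such requirement but is far costlier.
def elementwiseAdd(a: RDD[(Int, Double)], b: RDD[(Int, Double)],
                   identicallyPartitioned: Boolean): RDD[(Int, Double)] =
  if (identicallyPartitioned)
    a.zip(b).map { case ((k, va), (_, vb)) => (k, va + vb) }
  else
    a.join(b).map { case (k, (va, vb)) => (k, va + vb) }
{code}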


> Failing test in flink-bindings: A %*% B Identically partitioned
> ---
>
> Key: MAHOUT-1810
> URL: https://issues.apache.org/jira/browse/MAHOUT-1810
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
>Priority: Blocker
> Fix For: 0.12.0
>
>
> the {{A %*% B, Identically Partitioned}} test in the Flink RLikeDrmOpsSuite 
> fails.  This test failure likely indicates an issue with Flink's 
> Checkpointing or mapBlock operator:
> {code}
>   test("C = A + B, identically partitioned") {
> val inCoreA = dense((1, 2, 3), (3, 4, 5), (5, 6, 7))
> val A = drmParallelize(inCoreA, numPartitions = 2)
> // Create B which would be identically partitioned to A. mapBlock() by 
> default will do the trick.
> val B = A.mapBlock() {
>   case (keys, block) =>
> val bBlock = block.like() := { (r, c, v) => util.Random.nextDouble()}
> keys -> bBlock
> }
>   // Prevent repeated computation non-determinism
>   // flink problem is here... checkpoint is not doing what it should
>   .checkpoint()
> val inCoreB = B.collect
> printf("A=\n%s\n", inCoreA)
> printf("B=\n%s\n", inCoreB)
> val C = A + B
> val inCoreC = C.collect
> printf("C=\n%s\n", inCoreC)
> // Actual
> val inCoreCControl = inCoreA + inCoreB
> (inCoreC - inCoreCControl).norm should be < 1E-10
>   }
> {code}
> The output shows clearly that the line:
> {code}
> val bBlock = block.like() := { (r, c, v) => util.Random.nextDouble()}
> {code} 
> in the {{mapBlock}} closure is being calculated more than once.
> Output:
> {code}
> A=
> {
>  0 => {0:1.0,1:2.0,2:3.0}
>  1 => {0:3.0,1:4.0,2:5.0}
>  2 => {0:5.0,1:6.0,2:7.0}
> }
> B=
> {
>  0 => {0:0.26203398262809574,1:0.22561543461472167,2:0.23229669514522655}
>  1 => {0:0.1638068194515867,1:0.18751822418846575,2:0.20586366231381614}
>  2 => {0:0.9279465706239354,1:0.2963513448240057,2:0.8866928923235948}
> }
> C=
> {
>  0 => {0:1.7883652623225594,1:2.6401297718606216,2:3.0023341959374195}
>  1 => {0:3.641411452208408,1:4.941233165480053,2:5.381282338548803}
>  2 => {0:5.707434148862531,1:6.022780876943659,2:7.149772825494352}
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1810) Failing test in flink-bindings: A %*% B Identically partitioned

2016-03-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199914#comment-15199914
 ] 

Dmitriy Lyubimov commented on MAHOUT-1810:
--

Another note: checkpoint() (at least in the case of Spark) would not prevent 
computation non-determinism if a partition is lost and subsequently recomputed. 
It _may_ have an effect on double executions if the engine indeed recomputes the 
input A again as part of A + B, so maybe yes, checkpoint is not doing what it is 
supposed to do for Flink, i.e., it does not create an optimization barrier here 
and/or does not cache the intermediate result by default.

> Failing test in flink-bindings: A %*% B Identically partitioned
> ---
>
> Key: MAHOUT-1810
> URL: https://issues.apache.org/jira/browse/MAHOUT-1810
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
>Priority: Blocker
> Fix For: 0.12.0
>
>
> the {{A %*% B, Identically Partitioned}} test in the Flink RLikeDrmOpsSuite 
> fails.  This test failure likely indicates an issue with Flink's 
> Checkpointing or mapBlock operator:
> {code}
>   test("C = A + B, identically partitioned") {
> val inCoreA = dense((1, 2, 3), (3, 4, 5), (5, 6, 7))
> val A = drmParallelize(inCoreA, numPartitions = 2)
> // Create B which would be identically partitioned to A. mapBlock() by 
> default will do the trick.
> val B = A.mapBlock() {
>   case (keys, block) =>
> val bBlock = block.like() := { (r, c, v) => util.Random.nextDouble()}
> keys -> bBlock
> }
>   // Prevent repeated computation non-determinism
>   // flink problem is here... checkpoint is not doing what it should
>   .checkpoint()
> val inCoreB = B.collect
> printf("A=\n%s\n", inCoreA)
> printf("B=\n%s\n", inCoreB)
> val C = A + B
> val inCoreC = C.collect
> printf("C=\n%s\n", inCoreC)
> // Actual
> val inCoreCControl = inCoreA + inCoreB
> (inCoreC - inCoreCControl).norm should be < 1E-10
>   }
> {code}
> The output shows clearly that the line:
> {code}
> val bBlock = block.like() := { (r, c, v) => util.Random.nextDouble()}
> {code} 
> in the {{mapBlock}} closure is being calculated more than once.
> Output:
> {code}
> A=
> {
>  0 => {0:1.0,1:2.0,2:3.0}
>  1 => {0:3.0,1:4.0,2:5.0}
>  2 => {0:5.0,1:6.0,2:7.0}
> }
> B=
> {
>  0 => {0:0.26203398262809574,1:0.22561543461472167,2:0.23229669514522655}
>  1 => {0:0.1638068194515867,1:0.18751822418846575,2:0.20586366231381614}
>  2 => {0:0.9279465706239354,1:0.2963513448240057,2:0.8866928923235948}
> }
> C=
> {
>  0 => {0:1.7883652623225594,1:2.6401297718606216,2:3.0023341959374195}
>  1 => {0:3.641411452208408,1:4.941233165480053,2:5.381282338548803}
>  2 => {0:5.707434148862531,1:6.022780876943659,2:7.149772825494352}
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1810) Failing test in flink-bindings: A + B Identically partitioned (mapBlock Checkpointing issue)

2016-03-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202129#comment-15202129
 ] 

Dmitriy Lyubimov commented on MAHOUT-1810:
--

So if checkpoint doesn't cache (and caching is the intent here, to get rid of the 
non-determinism in this test), that is formal non-adherence to the contract of 
checkpoint and its caching capabilities (the CacheHint parameter).

So you are saying there's no way to cope with this?

I think, in the worst case, the solution should be to dump the intermediate 
checkpoint (for cache hints other than NONE) to DFS or to an in-memory file 
system. Various people tell me that DFS can now have a pretty sizeable local 
cache configured too, so persistence is not so bad (though not as good as 
keeping object trees in the same JVM, of course).

> Failing test in flink-bindings: A + B Identically partitioned (mapBlock 
> Checkpointing issue)
> 
>
> Key: MAHOUT-1810
> URL: https://issues.apache.org/jira/browse/MAHOUT-1810
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Palumbo
>Assignee: Andrew Palumbo
>Priority: Blocker
> Fix For: 0.12.0
>
>
> the {{A + B, Identically Partitioned}} test in the Flink RLikeDrmOpsSuite 
> fails.  This test failure likely indicates an issue with Flink's 
> Checkpointing or mapBlock operator:
> {code}
>   test("C = A + B, identically partitioned") {
> val inCoreA = dense((1, 2, 3), (3, 4, 5), (5, 6, 7))
> val A = drmParallelize(inCoreA, numPartitions = 2)
> // Create B which would be identically partitioned to A. mapBlock() by 
> default will do the trick.
> val B = A.mapBlock() {
>   case (keys, block) =>
> val bBlock = block.like() := { (r, c, v) => util.Random.nextDouble()}
> keys -> bBlock
> }
>   // Prevent repeated computation non-determinism
>   // flink problem is here... checkpoint is not doing what it should
>   .checkpoint()
> val inCoreB = B.collect
> printf("A=\n%s\n", inCoreA)
> printf("B=\n%s\n", inCoreB)
> val C = A + B
> val inCoreC = C.collect
> printf("C=\n%s\n", inCoreC)
> // Actual
> val inCoreCControl = inCoreA + inCoreB
> (inCoreC - inCoreCControl).norm should be < 1E-10
>   }
> {code}
> The output shows clearly that the line:
> {code}
> val bBlock = block.like() := { (r, c, v) => util.Random.nextDouble()}
> {code} 
> in the {{mapBlock}} closure is being calculated more than once.
> Output:
> {code}
> A=
> {
>  0 => {0:1.0,1:2.0,2:3.0}
>  1 => {0:3.0,1:4.0,2:5.0}
>  2 => {0:5.0,1:6.0,2:7.0}
> }
> B=
> {
>  0 => {0:0.26203398262809574,1:0.22561543461472167,2:0.23229669514522655}
>  1 => {0:0.1638068194515867,1:0.18751822418846575,2:0.20586366231381614}
>  2 => {0:0.9279465706239354,1:0.2963513448240057,2:0.8866928923235948}
> }
> C=
> {
>  0 => {0:1.7883652623225594,1:2.6401297718606216,2:3.0023341959374195}
>  1 => {0:3.641411452208408,1:4.941233165480053,2:5.381282338548803}
>  2 => {0:5.707434148862531,1:6.022780876943659,2:7.149772825494352}
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1763) A minor bug in Spark binding documentation

2016-03-18 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov resolved MAHOUT-1763.
--
Resolution: Fixed

> A minor bug in Spark binding documentation
> --
>
> Key: MAHOUT-1763
> URL: https://issues.apache.org/jira/browse/MAHOUT-1763
> Project: Mahout
>  Issue Type: Documentation
>  Components: Math, spark
>Reporter: Sergey Tryuber
>Assignee: Dmitriy Lyubimov
>Priority: Trivial
> Fix For: 0.12.0
>
>
> [Documentation|http://apache.github.io/mahout/doc/ScalaSparkBindings.html] 
> has the following example:
> {noformat}
> diag(10, 3.5)
> {noformat}
> But in fact it should be:
> {noformat}
> diag(3.5, 10)
> {noformat}
> See 
> [package.scala|https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala]
>  for more details.
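
For clarity, a small sketch (assuming the in-core scalabindings described in the 
referenced documentation) showing the corrected call order, value first and then size:

{code}
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

// diag(value, size): a 10 x 10 matrix with 3.5 on the diagonal and 0 elsewhere.
val d = diag(3.5, 10)
assert(d.nrow == 10 && d(0, 0) == 3.5 && d(0, 1) == 0.0)
{code}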



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1763) A minor bug in Spark binding documentation

2016-03-18 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1763.


> A minor bug in Spark binding documentation
> --
>
> Key: MAHOUT-1763
> URL: https://issues.apache.org/jira/browse/MAHOUT-1763
> Project: Mahout
>  Issue Type: Documentation
>  Components: Math, spark
>Reporter: Sergey Tryuber
>Assignee: Dmitriy Lyubimov
>Priority: Trivial
> Fix For: 0.12.0
>
>
> [Documentation|http://apache.github.io/mahout/doc/ScalaSparkBindings.html] 
> has the following example:
> {noformat}
> diag(10, 3.5)
> {noformat}
> But in fact it should be:
> {noformat}
> diag(3.5, 10)
> {noformat}
> See 
> [package.scala|https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala]
>  for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1769) Incorrect documentation for collecting DRM to HDFS

2016-03-18 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov resolved MAHOUT-1769.
--
Resolution: Fixed

> Incorrect documentation for collecting DRM to HDFS
> --
>
> Key: MAHOUT-1769
> URL: https://issues.apache.org/jira/browse/MAHOUT-1769
> Project: Mahout
>  Issue Type: Documentation
>  Components: Mahout spark shell
>Affects Versions: 0.11.0
>Reporter: Sergey Tryuber
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 0.12.0
>
>
> There is a bug in 
> [documentation|http://apache.github.io/mahout/doc/ScalaSparkBindings.html] 
> (2.3.5 Collecting to HDFS). Instead of:
> {code}
> A.writeDRM(path = hdfsPath)
> {code}
> should be
> {code}
> A.dfsWrite(path = "")
> {code}
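
For illustration, a minimal sketch of the corrected call. It assumes an implicit 
DistributedContext is available (e.g. from the Mahout Spark shell); the input matrix 
and the output path are hypothetical.

{code}
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings._

def writeExample()(implicit ctx: DistributedContext): Unit = {
  val drmA = drmParallelize(dense((1, 2), (3, 4)), numPartitions = 2)
  // dfsWrite, not writeDRM; the destination path below is a placeholder.
  drmA.dfsWrite(path = "hdfs:///tmp/drmA")
}
{code}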



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MAHOUT-1769) Incorrect documentation for collecting DRM to HDFS

2016-03-18 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov closed MAHOUT-1769.


> Incorrect documentation for collecting DRM to HDFS
> --
>
> Key: MAHOUT-1769
> URL: https://issues.apache.org/jira/browse/MAHOUT-1769
> Project: Mahout
>  Issue Type: Documentation
>  Components: Mahout spark shell
>Affects Versions: 0.11.0
>Reporter: Sergey Tryuber
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 0.12.0
>
>
> There is a bug in 
> [documentation|http://apache.github.io/mahout/doc/ScalaSparkBindings.html] 
> (2.3.5 Collecting to HDFS). Instead of:
> {code}
> A.writeDRM(path = hdfsPath)
> {code}
> should be
> {code}
> A.dfsWrite(path = "")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1800) Pare down Casstag overuse

2016-03-07 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1800:
-
Summary: Pare down Casstag overuse  (was: Pair down Casstag overuse)

> Pare down Casstag overuse
> -
>
> Key: MAHOUT-1800
> URL: https://issues.apache.org/jira/browse/MAHOUT-1800
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1
>Reporter: Andrew Palumbo
> Fix For: 0.11.2
>
>
> Currently, almost every operator requires an implicit parameter for the 
> classtag context bound of the DRM rowset key type, even for things like 
> drmA + drmB.
> In reality, though, the DAG can already infer that (similarly to how it 
> infers product geometry), because classtags are already embedded in the 
> logical plan. For example, {{classtag(drmA+drmB) == classtag(drmA) == 
> classtag(drmB)}}.
> Not only does the DAG already contain this information, but the current 
> approach also opens the door to a loss of inference, since the optimizer 
> doesn't verify that the new context bound is actually valid by retracing the 
> inference. So any operation may introduce an invalid row key type, and as a 
> consequence invalid optimization information, without any further checks. 
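
A small self-contained sketch (not Mahout code; the Drm stand-in type and method names 
are made up) of the pattern the issue describes: a redundant ClassTag context bound on 
an elementwise operator, versus recovering the tag from an operand that already 
carries it in its plan node.

{code}
import scala.reflect.ClassTag

// Stand-in for a logical-plan node that already carries its row-key classtag.
case class Drm[K: ClassTag](name: String) {
  def keyTag: ClassTag[K] = implicitly[ClassTag[K]]
}

// Current style: the caller must re-supply K's ClassTag even for drmA + drmB ...
def plusWithBound[K: ClassTag](a: Drm[K], b: Drm[K]): Drm[K] =
  Drm[K](s"(${a.name} + ${b.name})")

// ... whereas the tag is already recoverable from either operand, so the bound is
// redundant and cannot silently introduce a key type that contradicts the plan.
def plusInferred[K](a: Drm[K], b: Drm[K]): Drm[K] =
  Drm[K](s"(${a.name} + ${b.name})")(a.keyTag)
{code}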



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1800) Pair down Casstag overuse

2016-03-07 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1800:
-
Summary: Pair down Casstag overuse  (was: Pare down Casstag overuse)

> Pair down Casstag overuse
> -
>
> Key: MAHOUT-1800
> URL: https://issues.apache.org/jira/browse/MAHOUT-1800
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1
>Reporter: Andrew Palumbo
> Fix For: 0.11.2
>
>
> Currently, almost every operator requires an implicit parameter for the 
> classtag context bound of the DRM rowset key type, even for things like 
> drmA + drmB.
> In reality, though, the DAG can already infer that (similarly to how it 
> infers product geometry), because classtags are already embedded in the 
> logical plan. For example, {{classtag(drmA+drmB) == classtag(drmA) == 
> classtag(drmB)}}.
> Not only does the DAG already contain this information, but the current 
> approach also opens the door to a loss of inference, since the optimizer 
> doesn't verify that the new context bound is actually valid by retracing the 
> inference. So any operation may introduce an invalid row key type, and as a 
> consequence invalid optimization information, without any further checks. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1570:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Adding support for Apache Flink as a backend for the Mahout DSL
> ---
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Till Rohrmann
>Assignee: Suneel Marthi
>  Labels: DSL, flink, scala
> Fix For: 0.12.0
>
>
> With the finalized abstraction of the Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL. Apache Flink would be a suitable candidate to act as a 
> good execution backend. 
> With respect to the implementation, the biggest difference between Spark and 
> Flink at the moment is probably the incremental rollout of plans, which is 
> triggered by Spark's actions and which is not supported by Flink yet. 
> However, the Flink community is working on this issue. For the moment, it 
> should be possible to circumvent this problem by writing intermediate results 
> required by an action to HDFS and reading from there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1747) Mahout DSL for Flink: add support for different types of indexes (String, long, etc)

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1747:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: add support for different types of indexes (String, 
> long, etc)
> 
>
> Key: MAHOUT-1747
> URL: https://issues.apache.org/jira/browse/MAHOUT-1747
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> For mahout-flink (MAHOUT-1570) rows now can only be indexed with integers. 
> Add support for other types: Long, String, etc. 
> See FlinkEngine#toPhysical



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1701) Mahout DSL for Flink: implement AtB ABt and AtA operators

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1701:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: implement AtB ABt and AtA operators
> -
>
> Key: MAHOUT-1701
> URL: https://issues.apache.org/jira/browse/MAHOUT-1701
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.11.0
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: flink
> Fix For: 0.12.0
>
>
> as a part of MAHOUT-1570 implement the following operators on Flink:
> - AtB
> - ABt
> - AtA 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1750) Mahout DSL for Flink: Implement ABt

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1750:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: Implement ABt
> ---
>
> Key: MAHOUT-1750
> URL: https://issues.apache.org/jira/browse/MAHOUT-1750
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> Now ABt is expressed through AtB, which is not optimal, and we need to have a 
> special implementation for ABt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1710) Mahout DSL for Flink: implement right in-core matrix multiplication

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1710:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: implement right in-core matrix multiplication
> ---
>
> Key: MAHOUT-1710
> URL: https://issues.apache.org/jira/browse/MAHOUT-1710
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.1
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570 implement A %*% B multiplication when B is an 
> in-core matrix 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1765) Mahout DSL for Flink: Add some documentation about Flink backend

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1765:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: Add some documentation about Flink backend
> 
>
> Key: MAHOUT-1765
> URL: https://issues.apache.org/jira/browse/MAHOUT-1765
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Trivial
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1702) Mahout DSL for Flink: implement element-wise operators

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1702:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: implement element-wise operators
> --
>
> Key: MAHOUT-1702
> URL: https://issues.apache.org/jira/browse/MAHOUT-1702
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.11.0
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: flink
> Fix For: 0.12.0
>
>
> as a part of MAHOUT-1570 implement the following operators on Flink:
> - AewB
> - AewScalar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1703) Mahout DSL for Flink: implement cbind and rbind

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1703:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: implement cbind and rbind
> ---
>
> Key: MAHOUT-1703
> URL: https://issues.apache.org/jira/browse/MAHOUT-1703
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.11.0
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: flink
> Fix For: 0.12.0
>
>
> as a part of MAHOUT-1570 implement the following operators on Flink:
> - cbind
> - cbind with scalar
> - rbind
> rbind depends on mapBlock operator, so it also has to be implemented as a 
> part of this Jira



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1734) Mahout DSL for Flink: implement I/O

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1734:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: implement I/O
> ---
>
> Key: MAHOUT-1734
> URL: https://issues.apache.org/jira/browse/MAHOUT-1734
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
>  Labels: flink
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570, implement reading DRMs from the file system and 
> writing the results back



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1711) Mahout DSL for Flink: implement broadcasting

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1711:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: implement broadcasting
> 
>
> Key: MAHOUT-1711
> URL: https://issues.apache.org/jira/browse/MAHOUT-1711
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.1
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> as a part of MAHOUT-1570 implement drmBroadcast for flink



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1712) Mahout DSL for Flink: implement operators At, Ax, Atx

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1712:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: implement operators At, Ax, Atx
> -
>
> Key: MAHOUT-1712
> URL: https://issues.apache.org/jira/browse/MAHOUT-1712
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.1
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570, implement the following operators on Flink:
> - At
> - Ax
> - Atx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1749) Mahout DSL for Flink: Implement Atx

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1749:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: Implement Atx
> ---
>
> Key: MAHOUT-1749
> URL: https://issues.apache.org/jira/browse/MAHOUT-1749
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> Now Atx is implemented through the At and Ax operators, but that is not 
> optimal, and it needs a special implementation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1755) Mahout DSL for Flink: Flush intermediate results to FS

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1755:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: Flush intermediate results to FS
> --
>
> Key: MAHOUT-1755
> URL: https://issues.apache.org/jira/browse/MAHOUT-1755
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> Now Flink (unlike Spark) doesn't keep intermediate results in memory - 
> therefore they should be flushed to a file system, and read back when 
> required. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1748) Mahout DSL for Flink: switch to Flink Scala API

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1748:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: switch to Flink Scala API
> ---
>
> Key: MAHOUT-1748
> URL: https://issues.apache.org/jira/browse/MAHOUT-1748
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> In Flink-Mahout (MAHOUT-1570) the Flink Java API is used because the Scala 
> API caused various strange compilation problems. 
> But the Scala API handles types better than the Flink Java API, so it's 
> better to switch to the Scala API. It can also solve MAHOUT-1747.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1751) Mahout DSL for Flink: Implement AtA

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1751:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

> Mahout DSL for Flink: Implement AtA
> ---
>
> Key: MAHOUT-1751
> URL: https://issues.apache.org/jira/browse/MAHOUT-1751
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> Now AtA is implemented via AtB



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1709) Mahout DSL for Flink: implement slicing

2015-11-14 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1709:
-
Sprint: Nov/Dec-2015  (was: Sep/Oct-2015)

>  Mahout DSL for Flink: implement slicing
> 
>
> Key: MAHOUT-1709
> URL: https://issues.apache.org/jira/browse/MAHOUT-1709
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.1
>Reporter: Alexey Grigorev
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.12.0
>
>
> As a part of MAHOUT-1570, implement the OpRowRange operator for Flink



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

