Jenkins build became unstable: mahout-nightly » Mahout Spark bindings #2040

2016-03-07 Thread Apache Jenkins Server
See 




[jira] [Commented] (MAHOUT-1800) Pare down Casstag overuse

2016-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184564#comment-15184564
 ] 

Hudson commented on MAHOUT-1800:


SUCCESS: Integrated in Mahout-Quality #3310 (See 
[https://builds.apache.org/job/Mahout-Quality/3310/])
Revert "MAHOUT-1800: Pare down Casstag overuse closes apache/mahout#183" 
(apalumbo: rev fcd6b9e01aea5f92fd947c0964ef7884314f33db)
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpTimesLeftMatrix.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/ABt.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/DistributedEngine.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/package.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewScalar.scala
* math-scala/src/main/scala/org/apache/mahout/math/decompositions/DSPCA.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSparkOps.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewUnaryFunc.scala
* math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NaiveBayes.scala
* math-scala/src/main/scala/org/apache/mahout/math/decompositions/DSSVD.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpABAnyKey.scala
* math-scala/src/test/scala/org/apache/mahout/math/drm/DrmLikeOpsSuiteBase.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/CheckpointedDrm.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/AinCoreB.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewB.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/AbstractBinaryOp.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/CbindAB.scala
* spark/src/main/scala/org/apache/mahout/classifier/naivebayes/SparkNaiveBayes.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/CheckpointedOps.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtB.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/AewB.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/DrmRddOps.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewUnaryFuncFusion.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/AbstractUnaryOp.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/DrmLikeOps.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpRowRange.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtAnyKey.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtx.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAt.scala
* math-scala/src/main/scala/org/apache/mahout/math/decompositions/DQR.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/RLikeDrmOps.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/Par.scala
* math-scala/src/main/scala/org/apache/mahout/math/decompositions/ALS.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpCbind.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/package.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/RbindAB.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/DrmLike.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpMapBlock.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpCbindScalar.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtA.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpPar.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpTimesRightMatrix.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpABt.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAB.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/CheckpointAction.scala
* h2o/src/main/scala/org/apache/mahout/h2obindings/H2OEngine.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAx.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpRbind.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/Ax.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/MapBlock.scala
* spark/src/main/scala/org/apache/mahout/drivers/TrainNBDriver.scala


> Pare down Casstag overuse
> -
>
> Key: MAHOUT-1800
> URL: https://issues.apache.org/jira/browse/MAHOUT-1800
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1
>Reporter: Andrew Palumbo
> Fix For: 0.11.2
>
>
> currently, almost every 

Jenkins build is back to normal : Mahout-Quality #3310

2016-03-07 Thread Apache Jenkins Server
See 



Jenkins build became unstable: Mahout-h1-Quality #166

2016-03-07 Thread Apache Jenkins Server
See 



[jira] [Commented] (MAHOUT-1800) Pare down Casstag overuse

2016-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184432#comment-15184432
 ] 

Hudson commented on MAHOUT-1800:


FAILURE: Integrated in Mahout-Quality #3309 (See 
[https://builds.apache.org/job/Mahout-Quality/3309/])
MAHOUT-1800: Pare down Casstag overuse closes apache/mahout#183 (apalumbo: rev 
6919fd9febe1585d15e78e51aabcad8fa29235f3)
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSparkOps.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtA.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/MapBlock.scala
* spark/src/main/scala/org/apache/mahout/drivers/TrainNBDriver.scala
* math-scala/src/main/scala/org/apache/mahout/math/decompositions/DQR.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpMapBlock.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/AbstractUnaryOp.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtx.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpABAnyKey.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/CheckpointedOps.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAB.scala
* math-scala/src/main/scala/org/apache/mahout/math/decompositions/ALS.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpCbind.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpTimesLeftMatrix.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/DistributedEngine.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/CbindAB.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/Ax.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpRowRange.scala
* math-scala/src/main/scala/org/apache/mahout/classifier/naivebayes/NaiveBayes.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/DrmRddOps.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtAnyKey.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAtB.scala
* math-scala/src/main/scala/org/apache/mahout/math/decompositions/DSSVD.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpCbindScalar.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpTimesRightMatrix.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpABt.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/CheckpointAction.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/DrmLikeOps.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/RLikeDrmOps.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/Par.scala
* math-scala/src/test/scala/org/apache/mahout/math/drm/DrmLikeOpsSuiteBase.scala
* h2o/src/main/scala/org/apache/mahout/h2obindings/H2OEngine.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/AbstractBinaryOp.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAt.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewB.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpPar.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala
* math-scala/src/main/scala/org/apache/mahout/math/decompositions/DSPCA.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAx.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/package.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/CheckpointedDrm.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewScalar.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/AewB.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewUnaryFuncFusion.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpAewUnaryFunc.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/DrmLike.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/RbindAB.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/ABt.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/logical/OpRbind.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/AinCoreB.scala
* spark/src/main/scala/org/apache/mahout/classifier/naivebayes/SparkNaiveBayes.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/blas/package.scala


> Pare down Casstag overuse
> -
>
> Key: MAHOUT-1800
> URL: https://issues.apache.org/jira/browse/MAHOUT-1800
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1
>Reporter: Andrew Palumbo
> Fix For: 0.11.2
>
>
> currently, almost every operator 

Build failed in Jenkins: Mahout-Quality #3309

2016-03-07 Thread Apache Jenkins Server
See 

Changes:

[apalumbo] MAHOUT-1800: Pare down Casstag overuse closes apache/mahout#183

--
[...truncated 6509 lines...]
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$instantiateClass(ClosureCleaner.scala:330)
  at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$22.apply(ClosureCleaner.scala:268)
  at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$22.apply(ClosureCleaner.scala:262)
  at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
  ...
- ItemSimilarityDriver, two input paths
- ItemSimilarityDriver, two inputs of different dimensions
- ItemSimilarityDriver cross similarity two separate items spaces
- A.t %*% B after changing row cardinality of A
- Changing row cardinality of an IndexedDataset
- ItemSimilarityDriver cross similarity two separate items spaces, missing rows in B
- ItemSimilarityDriver cross similarity two separate items spaces, adding rows in B

[jira] [Commented] (MAHOUT-1800) Pare down Casstag overuse

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184385#comment-15184385
 ] 

ASF GitHub Bot commented on MAHOUT-1800:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/183


> Pare down Casstag overuse
> -
>
> Key: MAHOUT-1800
> URL: https://issues.apache.org/jira/browse/MAHOUT-1800
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1
>Reporter: Andrew Palumbo
> Fix For: 0.11.2
>
>
> Currently, almost every operator requires an implicit parameter for the 
> classtag context bound of the drm rowset key type, even for things like 
> drmA + drmB.
> In reality, though, the DAG can already infer it, much as it infers product 
> geometry, because classtags are already embedded in the logical plan. 
> For example, {{classtag(drmA+drmB) == classtag(drmA) == classtag(drmB)}}. 
> Not only does the DAG already contain this information, but requiring the 
> context bound also opens the door to a loss of inference, since the optimizer 
> doesn't verify that the new context bound is actually valid by retracing the 
> inference. So any operation may introduce an invalid row key type and, as a 
> consequence, invalid optimization information, without any further checks. 
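
The inference described in the issue can be sketched in Java terms. This is only an illustration under stated assumptions, not Mahout's code: `PlanNode`, `Leaf`, and `ElementwisePlus` are hypothetical names, and a `Class<?>` token stands in for the Scala classtag. The point it tries to show is that a binary operator can derive its row key type from its operands and check that they agree, instead of accepting a caller-supplied tag that nothing verifies.

```java
import java.util.Objects;

public class KeyTagInference {

    /** Hypothetical logical-plan node; keyClass() plays the role of the classtag. */
    interface PlanNode {
        Class<?> keyClass();
    }

    /** A source matrix: the only place where the key type is stated explicitly. */
    static final class Leaf implements PlanNode {
        private final Class<?> keyClass;

        Leaf(Class<?> keyClass) {
            this.keyClass = Objects.requireNonNull(keyClass);
        }

        @Override
        public Class<?> keyClass() {
            return keyClass;
        }
    }

    /** Models drmA + drmB: the key type is inferred from the operands, never passed in. */
    static final class ElementwisePlus implements PlanNode {
        private final PlanNode a;
        private final PlanNode b;

        ElementwisePlus(PlanNode a, PlanNode b) {
            // Retrace the inference: both operands must carry the same key type.
            if (!a.keyClass().equals(b.keyClass())) {
                throw new IllegalArgumentException(
                    "row key types differ: " + a.keyClass() + " vs " + b.keyClass());
            }
            this.a = a;
            this.b = b;
        }

        @Override
        public Class<?> keyClass() {
            return a.keyClass();
        }
    }

    public static void main(String[] args) {
        PlanNode drmA = new Leaf(Integer.class);
        PlanNode drmB = new Leaf(Integer.class);
        PlanNode sum = new ElementwisePlus(drmA, drmB);
        // The key type of the sum equals the key type of either operand.
        System.out.println(sum.keyClass() == drmA.keyClass());
    }
}
```

In this sketch an operator with mismatched operands fails at plan-construction time, which is the kind of check the issue says is missing when each operator takes its own unverified context bound.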



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1800) Pare down Casstag overuse

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184274#comment-15184274
 ] 

ASF GitHub Bot commented on MAHOUT-1800:


Github user smarthi commented on the pull request:

https://github.com/apache/mahout/pull/183#issuecomment-193568541
  
+1 to merge this, tests pass locally


> Pare down Casstag overuse
> -
>
> Key: MAHOUT-1800
> URL: https://issues.apache.org/jira/browse/MAHOUT-1800
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1
>Reporter: Andrew Palumbo
> Fix For: 0.11.2
>
>
> Currently, almost every operator requires an implicit parameter for the 
> classtag context bound of the drm rowset key type, even for things like 
> drmA + drmB.
> In reality, though, the DAG can already infer it, much as it infers product 
> geometry, because classtags are already embedded in the logical plan. 
> For example, {{classtag(drmA+drmB) == classtag(drmA) == classtag(drmB)}}. 
> Not only does the DAG already contain this information, but requiring the 
> context bound also opens the door to a loss of inference, since the optimizer 
> doesn't verify that the new context bound is actually valid by retracing the 
> inference. So any operation may introduce an invalid row key type and, as a 
> consequence, invalid optimization information, without any further checks. 





[jira] [Commented] (MAHOUT-1795) Release Scala 2.11 bindings

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184200#comment-15184200
 ] 

ASF GitHub Bot commented on MAHOUT-1795:


Github user mikekap commented on a diff in the pull request:

https://github.com/apache/mahout/pull/179#discussion_r55305279
  
--- Diff: pom.xml ---
@@ -804,13 +798,15 @@
 distribution
 math-scala
 spark
-spark-shell
--- End diff --

Converting spark-shell is a nontrivial task that I don't have time to 
tackle. Spark itself has distinct shell code for 2.11 & 2.10. This change just 
formalizes excluding spark-shell from the 2.11 build; in a perfect universe the 
majority of this change wouldn't be needed.


> Release Scala 2.11 bindings
> ---
>
> Key: MAHOUT-1795
> URL: https://issues.apache.org/jira/browse/MAHOUT-1795
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Mike Kaplinskiy
> Attachments: patch.diff
>
>
> It would be nice to ship scala 2.11 bindings for mahout-math/mahout-spark. 
> (I'm not sure of other users, but mahout-shell isn't nearly at the top of my 
> list here).
> It looks simple enough for those two - the attached patch is a 
> proof-of-concept to compile (and pass all tests) under scala 2.11. I'm not 
> sure what the proper way to do this is, but it doesn't look too daunting. 
> (Famous last words?)





RE: [jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread Andrew Palumbo
I'm +1 on this.

 Original message 
From: Suneel Marthi 
Date: 03/07/2016 8:09 PM (GMT-05:00)
To: mahout 
Subject: Re: [jira] [Commented] (MAHOUT-1640) Better collections would 
significantly improve vector-operation speed

If @apalumbo, @pferrel et al. vote for it now, we should merge the patch
into 0.11.2 master and 0.12.0 branch.

No need to wait for 3 days.

Again, +1 from me.

Thanks @vigna and sorry about missing this, my focus has been on 0.12.0
Flink integration.

On Mon, Mar 7, 2016 at 8:06 PM, Dmitriy Lyubimov  wrote:

> ok standard 3 days then.
>
> On Mon, Mar 7, 2016 at 5:04 PM, ASF GitHub Bot (JIRA) 
> wrote:
>
> >
> > [
> >
> https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184122#comment-15184122
> > ]
> >
> > ASF GitHub Bot commented on MAHOUT-1640:
> > 
> >
> > Github user smarthi commented on the pull request:
> >
> > https://github.com/apache/mahout/pull/81#issuecomment-193536262
> >
> > Seems like it's ASL 2.0 -
> > https://github.com/vigna/fastutil/blob/master/LICENSE-2.0
> >
> > +1 from me, good to go.
> >
> > On Mon, Mar 7, 2016 at 7:21 PM, Dmitriy Lyubimov <
> > notificati...@github.com>
> > wrote:
> >
> > > @vigna is fastutil 0.7.2 still the best
> > > version to use? I can't immediately find the license on it.
> > > @smarthi et al.: need a few votes on
> > > inclusion of fastutil as a dependency
> > >
> > >
>


[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184178#comment-15184178
 ] 

ASF GitHub Bot commented on MAHOUT-1640:


Github user smarthi commented on the pull request:

https://github.com/apache/mahout/pull/81#issuecomment-193546860
  
@vigna Please also update the LICENSE.txt file for fastutil.

https://github.com/apache/mahout/blob/master/LICENSE.txt

Thanks again for this.


> Better collections would significantly improve vector-operation speed
> -
>
> Key: MAHOUT-1640
> URL: https://issues.apache.org/jira/browse/MAHOUT-1640
> Project: Mahout
>  Issue Type: Improvement
>  Components: collections
> Environment: Darwin lithium.local 14.1.0 Darwin Kernel Version 
> 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 
> x86_64 i386 MacBookPro10,1 Darwin
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
>Reporter: Sebastiano Vigna
>Assignee: Suneel Marthi
>  Labels: legacy, math, scala
> Attachments: fastutil.patch, speed-fastutil, speed-std
>
>
> The collections currently used by Mahout to implement sparse vectors are 
> extremely slow. The proposed patch (localized to RandomAccessSparseVector) 
> uses fastutil's maps and the speed improvements in vector benchmarks are very 
> significant. It would be interesting to see whether these improvements 
> percolate to high-level classes using sparse vectors.
> I had to patch two unit tests (an off-by-one bug and an overfitting bug; both 
> were exposed by the different order in which key/values were returned by 
> iterators).
> The included files speed-std and speed-fastutil show the speed improvement. 
> Some more speed might be gained by using everywhere the standard 
> java.util.Map.Entry interface instead of Element.
> DISCLAIMER: The "Times" set of tests has been run multiplying two identical 
> vectors. The standard tests multiply two random vectors, so in fact they just 
> test the speed of the underlying map remove() method, as almost all products 
> are zero. This is not very realistic and was heavily penalizing fastutil's 
> "true deletions". Better tests, with a typical overlap of nonzero entries, 
> would be even more realistic.
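
The benchmark concern in the DISCLAIMER can be made concrete with a small sketch. It assumes nothing about Mahout's or fastutil's APIs: plain `java.util.HashMap` stands in for the sparse-vector backing map, and `times`, `vector`, and the sizes below are hypothetical choices for illustration. It builds two sparse vectors with a controlled number of shared nonzero indices, so that an elementwise product exercises real multiplications rather than mostly misses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class SparseOverlapSketch {

    /** Elementwise product of two sparse vectors stored as index -> value maps. */
    static Map<Integer, Double> times(Map<Integer, Double> x, Map<Integer, Double> y) {
        // Iterate over the smaller map and probe the larger one.
        Map<Integer, Double> small = x.size() <= y.size() ? x : y;
        Map<Integer, Double> large = small == x ? y : x;
        Map<Integer, Double> out = new HashMap<>();
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                out.put(e.getKey(), e.getValue() * other);
            }
        }
        return out;
    }

    /**
     * Builds a vector with nonZeros entries. The first `shared` entries use
     * indices 0..shared-1; the remaining ones use indices starting at
     * offset + shared, so vectors built with well-separated offsets
     * overlap only on the shared indices.
     */
    static Map<Integer, Double> vector(int shared, int nonZeros, int offset, Random rnd) {
        Map<Integer, Double> v = new HashMap<>();
        for (int i = 0; i < shared; i++) {
            v.put(i, 1.0 + rnd.nextDouble()); // values in [1, 2), never zero
        }
        for (int i = shared; i < nonZeros; i++) {
            v.put(offset + i, 1.0 + rnd.nextDouble());
        }
        return v;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        Map<Integer, Double> a = vector(250, 1000, 1_000_000, rnd);
        Map<Integer, Double> b = vector(250, 1000, 2_000_000, rnd);
        System.out.println(times(a, b).size()); // 250 shared nonzero products
        System.out.println(times(a, a).size()); // 1000: identical vectors overlap fully
    }
}
```

Varying `shared` between 0 and `nonZeros` moves the workload between the two extremes the DISCLAIMER mentions: almost all products zero, and two identical vectors.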





Re: [jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread Suneel Marthi
If @apalumbo, @pferrel et al. vote for it now, we should merge the patch
into 0.11.2 master and 0.12.0 branch.

No need to wait for 3 days.

Again, +1 from me.

Thanks @vigna and sorry about missing this, my focus has been on 0.12.0
Flink integration.

On Mon, Mar 7, 2016 at 8:06 PM, Dmitriy Lyubimov  wrote:

> ok standard 3 days then.
>
> On Mon, Mar 7, 2016 at 5:04 PM, ASF GitHub Bot (JIRA) 
> wrote:
>
> >
> > [
> >
> https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184122#comment-15184122
> > ]
> >
> > ASF GitHub Bot commented on MAHOUT-1640:
> > 
> >
> > Github user smarthi commented on the pull request:
> >
> > https://github.com/apache/mahout/pull/81#issuecomment-193536262
> >
> > Seems like it's ASL 2.0 -
> > https://github.com/vigna/fastutil/blob/master/LICENSE-2.0
> >
> > +1 from me, good to go.
> >
> > On Mon, Mar 7, 2016 at 7:21 PM, Dmitriy Lyubimov <
> > notificati...@github.com>
> > wrote:
> >
> > > @vigna is fastutil 0.7.2 still the best
> > > version to use? I can't immediately find the license on it.
> > > @smarthi et al.: need a few votes on
> > > inclusion of fastutil as a dependency
> > >
> > >
>


[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184126#comment-15184126
 ] 

ASF GitHub Bot commented on MAHOUT-1640:


Github user andrewpalumbo commented on the pull request:

https://github.com/apache/mahout/pull/81#issuecomment-193538517
  
+1 from me too



> Better collections would significantly improve vector-operation speed
> -
>
> Key: MAHOUT-1640
> URL: https://issues.apache.org/jira/browse/MAHOUT-1640
> Project: Mahout
>  Issue Type: Improvement
>  Components: collections
> Environment: Darwin lithium.local 14.1.0 Darwin Kernel Version 
> 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 
> x86_64 i386 MacBookPro10,1 Darwin
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
>Reporter: Sebastiano Vigna
>Assignee: Suneel Marthi
>  Labels: legacy, math, scala
> Attachments: fastutil.patch, speed-fastutil, speed-std
>
>
> The collections currently used by Mahout to implement sparse vectors are 
> extremely slow. The proposed patch (localized to RandomAccessSparseVector) 
> uses fastutil's maps and the speed improvements in vector benchmarks are very 
> significant. It would be interesting to see whether these improvements 
> percolate to high-level classes using sparse vectors.
> I had to patch two unit tests (an off-by-one bug and an overfitting bug; both 
> were exposed by the different order in which key/values were returned by 
> iterators).
> The included files speed-std and speed-fastutil show the speed improvement. 
> Some more speed might be gained by using everywhere the standard 
> java.util.Map.Entry interface instead of Element.
> DISCLAIMER: The "Times" set of tests has been run multiplying two identical 
> vectors. The standard tests multiply two random vectors, so in fact they just 
> test the speed of the underlying map remove() method, as almost all products 
> are zero. This is not very realistic and was heavily penalizing fastutil's 
> "true deletions". Better tests, with a typical overlap of nonzero entries, 
> would be even more realistic.





Re: [jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread Dmitriy Lyubimov
ok standard 3 days then.

On Mon, Mar 7, 2016 at 5:04 PM, ASF GitHub Bot (JIRA) 
wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184122#comment-15184122
> ]
>
> ASF GitHub Bot commented on MAHOUT-1640:
> 
>
> Github user smarthi commented on the pull request:
>
> https://github.com/apache/mahout/pull/81#issuecomment-193536262
>
> Seems like it's ASL 2.0 -
> https://github.com/vigna/fastutil/blob/master/LICENSE-2.0
>
> +1 from me, good to go.
>
> On Mon, Mar 7, 2016 at 7:21 PM, Dmitriy Lyubimov <
> notificati...@github.com>
> wrote:
>
> > @vigna is fastutil 0.7.2 still the best
> > version to use? I can't immediately find the license on it.
> > @smarthi et al.: need a few votes on
> > inclusion of fastutil as a dependency
> >
> >


[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184122#comment-15184122
 ] 

ASF GitHub Bot commented on MAHOUT-1640:


Github user smarthi commented on the pull request:

https://github.com/apache/mahout/pull/81#issuecomment-193536262
  
Seems like it's ASL 2.0 -
https://github.com/vigna/fastutil/blob/master/LICENSE-2.0

+1 from me, good to go.

On Mon, Mar 7, 2016 at 7:21 PM, Dmitriy Lyubimov 
wrote:

> @vigna is fastutil 0.7.2 still the best version to use? I can't
> immediately find the license for it.
> @smarthi et al.: need a few votes on
> inclusion of fastutil as a dependency



> Better collections would significantly improve vector-operation speed
> -
>
> Key: MAHOUT-1640
> URL: https://issues.apache.org/jira/browse/MAHOUT-1640
> Project: Mahout
>  Issue Type: Improvement
>  Components: collections
> Environment: Darwin lithium.local 14.1.0 Darwin Kernel Version 
> 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 
> x86_64 i386 MacBookPro10,1 Darwin
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
>Reporter: Sebastiano Vigna
>Assignee: Suneel Marthi
>  Labels: legacy, math, scala
> Attachments: fastutil.patch, speed-fastutil, speed-std
>
>
> The collections currently used by Mahout to implement sparse vectors are 
> extremely slow. The proposed patch (localized to RandomAccessSparseVector) 
> uses fastutil's maps and the speed improvements in vector benchmarks are very 
> significant. It would be interesting to see whether these improvements 
> percolate to high-level classes using sparse vectors.
> I had to patch two unit tests (an off-by-one bug and an overfitting bug; both 
> were exposed by the different order in which key/values were returned by 
> iterators).
> The included files speed-std and speed-fastutil show the speed improvement. 
> Some more speed might be gained by using everywhere the standard 
> java.util.Map.Entry interface instead of Element.
> DISCLAIMER: The "Times" set of tests has been run multiplying two identical 
> vectors. The standard tests multiply two random vectors, so in fact they just 
> test the speed of the underlying map remove() method, as almost all products 
> are zero. This is not very realistic and was heavily penalizing fastutil's 
> "true deletions". Better tests, with a typical overlap of nonzero entries, 
> would be even more realistic.
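
The disclaimer above can be illustrated with a small sketch. It uses plain java.util.HashMap as a stand-in for the sparse vector storage (it is not Mahout's RandomAccessSparseVector and not fastutil), and the class and method names are hypothetical. It shows why an in-place element-wise product of two random sparse vectors mostly exercises remove(), while identical operands never do.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Random;

final class SparseTimesDemo {
    /** Builds a sparse vector with `nonZeros` entries at random indices below `cardinality`. */
    static Map<Integer, Double> randomSparse(int cardinality, int nonZeros, long seed) {
        Random rnd = new Random(seed);
        Map<Integer, Double> v = new HashMap<>();
        while (v.size() < nonZeros) {
            v.put(rnd.nextInt(cardinality), 1.0 + rnd.nextDouble()); // values in [1, 2), never zero
        }
        return v;
    }

    /** In-place element-wise product a *= b; returns how many entries of a were removed as zero. */
    static int timesInPlace(Map<Integer, Double> a, Map<Integer, Double> b) {
        int removed = 0;
        Iterator<Map.Entry<Integer, Double>> it = a.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Double> e = it.next();
            Double other = b.get(e.getKey());
            if (other == null) {
                it.remove();                      // product is zero, so the entry is deleted
                removed++;
            } else {
                e.setValue(e.getValue() * other); // overlapping index: keep the product
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        // Random operands: with 1,000 nonzeros out of 1,000,000 indices, overlap is rare.
        Map<Integer, Double> a = randomSparse(1_000_000, 1_000, 1L);
        Map<Integer, Double> b = randomSparse(1_000_000, 1_000, 2L);
        System.out.println("random operands, removed: " + timesInPlace(a, b));

        // Identical operands: every index overlaps, so nothing is removed.
        Map<Integer, Double> c = randomSparse(1_000_000, 1_000, 3L);
        System.out.println("identical operands, removed: " + timesInPlace(c, new HashMap<>(c)));
    }
}
```

A benchmark with a realistic overlap would sit between these two extremes, which is what the last sentence of the disclaimer asks for.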



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184117#comment-15184117
 ] 

ASF GitHub Bot commented on MAHOUT-1640:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/81#issuecomment-193533894
  
so it's Apache 2.0 now, great.

On Mon, Mar 7, 2016 at 4:56 PM, Sebastiano Vigna 
wrote:

> @dlyubimov no, actually there are newer
> versions. I am doing a release in a few days with a lot of small glitches
> fixed thanks to the Guava test suite. Current version is 7.0.10.





[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184113#comment-15184113
 ] 

ASF GitHub Bot commented on MAHOUT-1640:


Github user vigna commented on the pull request:

https://github.com/apache/mahout/pull/81#issuecomment-193532497
  
@dlyubimov no, actually there are newer versions. I am doing a release in 
a few days with a lot of small glitches fixed thanks to the Guava test suite. 
Current version is 7.0.10.




[jira] [Updated] (MAHOUT-1800) Pare down Casstag overuse

2016-03-07 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1800:
-
Summary: Pare down Casstag overuse  (was: Pair down Casstag overuse)

> Pare down Casstag overuse
> -
>
> Key: MAHOUT-1800
> URL: https://issues.apache.org/jira/browse/MAHOUT-1800
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1
>Reporter: Andrew Palumbo
> Fix For: 0.11.2
>
>
> Currently, almost every operator requires an implicit parameter for the classtag 
> context bound of the drm rowset key type, even for things like drmA + drmB.
> In reality, though, the DAG can already infer this, much as it infers 
> product geometry, because classtags are already embedded in the logical plan. 
> For example, {{classtag(drmA+drmB) == classtag(drmA) == classtag(drmB)}}. 
> Not only does the DAG already contain this information, but requiring the 
> context bound also opens the door to a loss of inference, since the optimizer 
> doesn't verify that the new context bound is actually valid by retracing the 
> inference. So any operation may introduce an invalid row key type and, as a 
> consequence, invalid optimization information, without any further checks. 





[jira] [Updated] (MAHOUT-1800) Pair down Casstag overuse

2016-03-07 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1800:
-
Summary: Pair down Casstag overuse  (was: Pare down Casstag overuse)



[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184099#comment-15184099
 ] 

ASF GitHub Bot commented on MAHOUT-1640:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/81#issuecomment-193528217
  
ok all tests are passing for me.




[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184080#comment-15184080
 ] 

ASF GitHub Bot commented on MAHOUT-1640:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/81#issuecomment-193522992
  
@vigna is fastutil 0.7.2 still the best version to use? I can't 
immediately find the license for it.
@smarthi et al.: need a few votes on inclusion of fastutil as a dependency




[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184068#comment-15184068
 ] 

ASF GitHub Bot commented on MAHOUT-1640:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/81#issuecomment-193518798
  
@vigna I apologize that it took so long. It completely went off the radar.




[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184067#comment-15184067
 ] 

ASF GitHub Bot commented on MAHOUT-1640:


Github user vigna commented on the pull request:

https://github.com/apache/mahout/pull/81#issuecomment-193518406
  
If you merge this, I promise I'll look into the other improvements (e.g., 
not using Element). :)




[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184061#comment-15184061
 ] 

ASF GitHub Bot commented on MAHOUT-1640:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/81#issuecomment-193517651
  
OK, here are the quick sparse multiplication performance results from our unit 
tests. In short, it looks like a 30% improvement for sparse operations on 
average, which is decent. I will run the rest of the tests and report back. If 
everything is OK, this looks like a very valuable addition for the time being. 
Thank you very much, @vigna!

Fastutil PR over current master (the second numbers are pragmatically 
important):

Asr %*% Bsr: (26.0,14.334)
Asr' %*% Bsr: (1214.0,16.0)
Asr %*% Bsr': (96.0,19.0)
Asr' %*% Bsr': (1002.6,13.666)
Asr'' %*% Bsr'': (14.334,14.334)

Asm %*% Bsm: (1523.3,18.0)
Asm' %*% Bsm: (1929.0,21.0)
Asm %*% Bsm': (1428.7,20.668)
Asm' %*% Bsm': (1669.7,17.332)
Asm'' %*% Bsm'': (1541.3,14.334)

Asm %*% Bsr: (1160.0,14.0)
Asm' %*% Bsr: (1609.3,20.332)
Asm %*% Bsr': (1142.3,17.668)
Asm' %*% Bsr': (1408.7,14.334)
Asm'' %*% Bsr'': (1154.7,14.0)

Current master:

Asr %*% Bsr: (30.0,22.0)
Asr' %*% Bsr: (1918.7,26.0)
Asr %*% Bsr': (118.67,27.0)
Asr' %*% Bsr': (1551.7,21.668)
Asr'' %*% Bsr'': (22.332,21.668)

Asm %*% Bsm: (2191.0,26.332)
Asm' %*% Bsm: (2543.5,30.332)
Asm %*% Bsm': (2067.0,30.332)
Asm' %*% Bsm': (2274.0,22.668)
Asm'' %*% Bsm'': (2167.0,22.668)

Asm %*% Bsr: (1717.3,21.668)
Asm' %*% Bsr: (2254.0,32.0)
Asm %*% Bsr': (1762.3,26.668)
Asm' %*% Bsr': (2004.7,21.668)
Asm'' %*% Bsr'': (1711.0,24.332)





[jira] [Commented] (MAHOUT-1640) Better collections would significantly improve vector-operation speed

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184041#comment-15184041
 ] 

ASF GitHub Bot commented on MAHOUT-1640:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/81#issuecomment-193510119
  
still need to look at this.




[jira] [Commented] (MAHOUT-1795) Release Scala 2.11 bindings

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184015#comment-15184015
 ] 

ASF GitHub Bot commented on MAHOUT-1795:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/179#issuecomment-193505384
  
+1, preliminarily. I still don't understand why this seems to exclude 
spark-shell from the build.


> Release Scala 2.11 bindings
> ---
>
> Key: MAHOUT-1795
> URL: https://issues.apache.org/jira/browse/MAHOUT-1795
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Mike Kaplinskiy
> Attachments: patch.diff
>
>
> It would be nice to ship scala 2.11 bindings for mahout-math/mahout-spark. 
> (I'm not sure of other users, but mahout-shell isn't nearly at the top of my 
> list here).
> It looks simple enough for those two - the attached patch is a 
> proof-of-concept to compile (and pass all tests) under scala 2.11. I'm not 
> sure what the proper way to do this is, but it doesn't look too daunting. 
> (Famous last words?)





[jira] [Commented] (MAHOUT-1795) Release Scala 2.11 bindings

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184012#comment-15184012
 ] 

ASF GitHub Bot commented on MAHOUT-1795:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/179#discussion_r55292523
  
--- Diff: pom.xml ---
@@ -804,13 +798,15 @@
     <module>distribution</module>
     <module>math-scala</module>
     <module>spark</module>
-    <module>spark-shell</module>
--- End diff --

why are we dropping spark-shell from the build?




[jira] [Commented] (MAHOUT-1795) Release Scala 2.11 bindings

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184004#comment-15184004
 ] 

ASF GitHub Bot commented on MAHOUT-1795:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/179#issuecomment-193503313
  
Can somebody else look this over? We may want to include this if it 
truly enables the 2.11 build.




[jira] [Commented] (MAHOUT-1795) Release Scala 2.11 bindings

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184005#comment-15184005
 ] 

ASF GitHub Bot commented on MAHOUT-1795:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/179#discussion_r55292335
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/sparkbindings/blas/ABt.scala ---
@@ -29,8 +29,6 @@ import org.apache.spark.SparkContext._
 import org.apache.mahout.math.drm.logical.OpABt
 import org.apache.mahout.logging._
 
-import scala.tools.nsc.io.Pickler.TildeDecorator
-
 /** Contains RDD plans for ABt operator */
--- End diff --

is this really there? oh. it is. good catch +1.




[jira] [Commented] (MAHOUT-1799) Read null row vectors from file in TextDelimeterReaderWriter driver

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183954#comment-15183954
 ] 

ASF GitHub Bot commented on MAHOUT-1799:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/182#issuecomment-193496785
  
Noted, need Pat's review


> Read null row vectors from file in TextDelimeterReaderWriter driver
> ---
>
> Key: MAHOUT-1799
> URL: https://issues.apache.org/jira/browse/MAHOUT-1799
> Project: Mahout
>  Issue Type: Improvement
>  Components: spark
>Reporter: Jussi Jousimo
>Priority: Minor
>
> Since some row vectors in a sparse matrix can be null, Mahout writes them out 
> to a file with the row label only. However, Mahout cannot read these files 
> back: it throws an exception when it encounters a label-only row.
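
To illustrate the reading side of the issue above, here is a minimal sketch of a parser that tolerates label-only rows. The line format (`label<TAB>col:value col:value ...`) and the names `LabelOnlyRowSketch` and `parseRow` are assumptions made for illustration only; this is not the Mahout driver's code or its actual file format.

```scala
// Hypothetical line format: "<rowLabel>\t<col>:<value> <col>:<value> ..."
// A row whose vector is null is assumed to be written as the label alone.
object LabelOnlyRowSketch {

  def parseRow(line: String): (String, Map[String, Double]) = {
    // Limit 2 keeps everything after the first tab in one piece.
    val parts = line.split("\t", 2)
    val label = parts(0)
    val elements: Map[String, Double] =
      if (parts.length < 2 || parts(1).trim.isEmpty) {
        // Label-only row: return an empty vector instead of throwing.
        Map.empty[String, Double]
      } else {
        parts(1).trim.split("\\s+").map { element =>
          // A malformed element (no ':') fails here with a MatchError.
          val Array(col, value) = element.split(":", 2)
          col -> value.toDouble
        }.toMap
      }
    (label, elements)
  }
}
```

With this shape, `parseRow("u1")` and `parseRow("u1\t")` both yield the label `u1` with an empty element map, so downstream code can decide how to represent the empty row.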





[jira] [Commented] (MAHOUT-1800) Pare down Classtag overuse

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183897#comment-15183897
 ] 

ASF GitHub Bot commented on MAHOUT-1800:


GitHub user andrewpalumbo opened a pull request:

https://github.com/apache/mahout/pull/183

MAHOUT-1800: Pare down Classtag overuse

Currently, almost every operator requires an implicit parameter for the 
ClassTag context bound of the DRM rowset key type, even for things like 
drmA + drmB.

In reality, though, the DAG can already infer the key ClassTag, much as it 
infers product geometry, because ClassTags are already embedded in the logical 
plan.

For example, `classtag(drmA+drmB) == classtag(drmA) == classtag(drmB)`.

The context bound is not only redundant, since the DAG already contains this 
information; it also opens the door to a loss of inference, because the 
optimizer doesn't verify that the new context bound is actually valid by 
retracing the inference. So any operation may introduce an invalid row key 
type and, as a consequence, invalid optimization information, without any 
further checks.

This patch does the following:
(1) eliminates the ClassTag[K] context bound in the majority of operations;
(2) adds a keyClassTag: ClassTag[K] property getter to the DrmLike[K] trait 
itself;
(3) ensures lazy inference of the returned key parameter ClassTag via DAG 
inference.
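
As a rough illustration of the three points above, here is a minimal, self-contained sketch. It is not the code in this pull request: `LeafDrm` and `OpAPlusB` are hypothetical stand-ins for a checkpointed matrix and an elementwise operator, and the `DrmLike` trait is reduced to the single `keyClassTag` getter named in (2).

```scala
import scala.reflect.{ClassTag, classTag}

// Simplified stand-in for the DrmLike[K] trait: only the getter from item (2).
trait DrmLike[K] {
  def keyClassTag: ClassTag[K]
}

// Hypothetical leaf of the logical plan. This is the one place where the
// ClassTag context bound is still required, because the key type enters here.
final class LeafDrm[K: ClassTag] extends DrmLike[K] {
  val keyClassTag: ClassTag[K] = classTag[K]
}

// Hypothetical elementwise A + B operator. It has no context bound: row keys
// are unchanged, so the tag is inferred lazily from operand A (item (3)).
final class OpAPlusB[K](a: DrmLike[K], b: DrmLike[K]) extends DrmLike[K] {
  lazy val keyClassTag: ClassTag[K] = a.keyClassTag
}

object ClassTagSketch {
  def main(args: Array[String]): Unit = {
    val drmA = new LeafDrm[Int]
    val drmB = new LeafDrm[Int]
    val sum = new OpAPlusB(drmA, drmB)
    // classtag(drmA+drmB) == classtag(drmA) == classtag(drmB)
    assert(sum.keyClassTag == drmA.keyClassTag)
    assert(drmA.keyClassTag == drmB.keyClassTag)
    println(sum.keyClassTag)
  }
}
```

Because the operator never accepts a tag from its caller, it cannot be handed a key type that disagrees with its operands, which is the loss-of-inference risk described above.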

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1800

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/183.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #183


commit e4a358d8adeb8878bc67e7bdf11e9c59f6003365
Author: Andrew Palumbo 
Date:   2016-03-07T22:32:14Z

Pare down Classtag overuse




> Pare down Classtag overuse
> -
>
> Key: MAHOUT-1800
> URL: https://issues.apache.org/jira/browse/MAHOUT-1800
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1
>Reporter: Andrew Palumbo
> Fix For: 0.11.2
>
>
> Currently, almost every operator requires an implicit parameter for the 
> ClassTag context bound of the DRM rowset key type, even for things like 
> drmA + drmB.
> In reality, though, the DAG can already infer the key ClassTag, much as it 
> infers product geometry, because ClassTags are already embedded in the 
> logical plan. 
> For example, {{classtag(drmA+drmB) == classtag(drmA) == classtag(drmB)}}. 
> The context bound is not only redundant, since the DAG already contains this 
> information; it also opens the door to a loss of inference, because the 
> optimizer doesn't verify that the new context bound is actually valid by 
> retracing the inference. So any operation may introduce an invalid row key 
> type and, as a consequence, invalid optimization information, without any 
> further checks. 





[jira] [Created] (MAHOUT-1800) Pare down Classtag overuse

2016-03-07 Thread Andrew Palumbo (JIRA)
Andrew Palumbo created MAHOUT-1800:
--

 Summary: Pare down Classtag overuse
 Key: MAHOUT-1800
 URL: https://issues.apache.org/jira/browse/MAHOUT-1800
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.11.1
Reporter: Andrew Palumbo
 Fix For: 0.11.2


Currently, almost every operator requires an implicit parameter for the 
ClassTag context bound of the DRM rowset key type, even for things like 
drmA + drmB.

In reality, though, the DAG can already infer the key ClassTag, much as it 
infers product geometry, because ClassTags are already embedded in the logical 
plan. 

For example, {{classtag(drmA+drmB) == classtag(drmA) == classtag(drmB)}}. 

The context bound is not only redundant, since the DAG already contains this 
information; it also opens the door to a loss of inference, because the 
optimizer doesn't verify that the new context bound is actually valid by 
retracing the inference. So any operation may introduce an invalid row key 
type and, as a consequence, invalid optimization information, without any 
further checks. 


