Re: elementwise operator improvements experiments

2014-11-17 Thread Ted Dunning
Isn't it true that sparse iteration should always be used for m := f iff

1) the matrix argument is sparse

AND

2) f(0) == 0

?

Why the need for syntactic notation at all?  This property is much easier
to test than commutativity.
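
Concretely, the rule amounts to something like this (just a sketch against a
plain in-memory Vector to make the test explicit; assignFunc is a made-up name,
while isDense / nonZeroes / all are the standard mahout-math Vector methods):

import org.apache.mahout.math.Vector
import scala.collection.JavaConversions._

// Sketch only: pick sparse or dense iteration for v := f based on the two
// conditions above.
def assignFunc(v: Vector, f: Double => Double): Vector = {
  val sparseIteration = !v.isDense && f(0.0) == 0.0
  // nonZeroes() visits only the stored cells; all() visits every cell
  val cells = if (sparseIteration) v.nonZeroes() else v.all()
  for (el <- cells) el.set(f(el.get))
  v
}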


On Sun, Nov 16, 2014 at 7:42 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 another thing is that the optimizer isn't capable of figuring out all
 elementwise fusions in an elementwise expression, e.g. it is not seeing
 commutativity rewrites such as

 A * B * A

 should optimally be computed as sqr(A) * B (it will do it as two pairwise
 operators (A*B)*A).

 Bummer.

 To do it truly right, it needs to fuse entire elementwise expressions first
 and then optimize them separately.

 Ok that's probably too much for now. I am quite ok with writing something
 like -0.5 * (a * a ) for now.

 On Sat, Nov 15, 2014 at 10:14 AM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  PS
 
  actually applying an exponent function in place will require an additional
  underscore, it looks. It doesn't want to treat a function name as a function
  type in this context for some reason (although it does not require the
  partial-application syntax when used as an argument inside parentheses):
 
  m := exp _
 
  Scala is quirky this way i guess
 
 
  On Sat, Nov 15, 2014 at 10:02 AM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
  So i did quick experimentation with elementwise operator improvements:
 
  (1) stuff like 1 + exp (M):
 
  (1a): this requires a generalization in the optimizer for elementwise unary
  operators. I've added things like a notion of whether an operator requires
  non-zero iteration only or not.
 
  (1b): added fusion of elementwise operators, i.e.
  ew(1+, ew(exp, A)) is rewritten as ew(1+exp, A) for performance
  reasons. It still uses an application of a fold over a functional monoid, but
  i think it should be a fairly ok performance/DSL trade-off here. To get it
  even better, we may add functional assignment syntax to distributed
  operands similar to in-memory types as described further down.
 
  (1c): notion that self-elementwise things such as expr1 * expr1 (which
  is a surprisingly common occurrence, e.g. in Torgerson MDS) are rewritten as
  ew(A, square) etc.
 
  So that much works. (Note that this also obsoletes the dedicated
  scalar/matrix elementwise operators that currently exist.) Good.
 
 
  The problem here is that (of course!) the semantics of the Scala language
  has a problem importing something like exp(Double):Double alongside
  exp(DRM):DRM, apparently because it doesn't adhere to the overloading rules
  (different result types), so in practice, even though it is allowed, one
  import overshadows the other.
 
  Which means, for the sake of the DSL, we can't have exp(matrix); we have to
  name it something else. Unless you see a better solution.
 
  So ... elementwise naming options:
 
  Matrix: mexp(m), msqrt(m), msignum(m)
  Vector: vexp(v), vsqrt(v)...
  DRM: dexp(drm), dsqrt(drm)  ?
 
  Let me know what you think.
 
  (2) Another problem is that actually doing something like 1 + exp(m) on
  Matrix or Vector types is pretty impractical since, unlike in R (which can
  count the number of variables bound to an object), the semantics requires
  creating a clone of m for something like exp(m) to guarantee no side
  effects on m itself.
 
  That is, the expression 1 + exp(m) for Matrix or Vector types causes 2
  clone-copies of the original argument.
 
  actually that's why i use in-place syntax for in-memory types quite
  often, something like

  1 +=: (x *= x)  instead of the more natural-looking 1 + x * x.
 
 
  But unlike with simple elementwise operators (+=), there's no in-place
  modification syntax for a function. We could put an additional
 parameter,
  something like
 
  mexp(m, inPlace=true) but i don't like it too much.
 
  What i like much more is functional assignment (we already have
  assignment from a function (row, col, x) => Double, but we can add
  elementwise function assignment) so that it really looks like
 
  m := exp
 
  That is pretty cool.
  Except there's a problem of optimality of assignment.
 
  There are functions here (e.g. abs, sqrt) that don't require full
  iteration but rather non-zero iteration only. By default, the notation
 
  m := func
 
  implies dense iteration.
 
  So what i suggest here is to add a new syntax for sparse-iteration
  functional assignments:
 
  m ::= abs
 
  I actually like it (a lot) because it is short and because it allows for
  more complex formulas in the same traversal, e.g. the proverbial R
  exp(m) + 1 in-place will look like
 
  m := (1 + exp(_))
 
  So not terrible.
 
  What it lacks, though, is automatic determination of whether a composite
  function needs to apply to all elements vs. non-zeros only for in-memory
  types (for distributed types the optimizer tracks this automatically).
 
  i.e.
 
  m := abs
 
  is not optimal (because abs doesn't affect 0s) and
 
  m ::= (abs(_) + 1)
 
  is probably also not what one wants (when we have a composition of dense-
  and sparse-affecting functions, the result is dense

Re: elementwise operator improvements experiments

2014-11-17 Thread Dmitriy Lyubimov
This is an astute observation.

There are a few things that make it difficult to accept, however:

(1) The Occam's-razor assumption is that f will mostly be the simplest kind of
implementation, namely deterministic and stateless. Note that something like

new SparseMatrix() := { if (rnd.next < 0.5) 0 else 1 }

will render our test for sparsity assignment also non-deterministic, which
is not good.
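
Just to spell out why that breaks an automatic probe: with such an f, even a
test as cheap as f(0.0) == 0.0 gives a different answer from call to call
(sketch; rnd stands in for any Random instance):

import scala.util.Random

val rnd = new Random()
val f: Double => Double = x => if (rnd.nextDouble < 0.5) 0.0 else 1.0

// the sparsity test itself is now a coin flip:
f(0.0) == 0.0   // true on some calls, false on others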

Now, solely for the sake of self-irony, we can play Sheldon Cooper for a
second and say that this implementation is pointless, and that in general
any non-deterministic implementation is pointless, and therefore
implausible.

But being implausible is not the same as impossible, so then we have a
classic conflict between performance and correctness here. The problem is
exacerbated by the fact that implausible problems are the hardest to find,
and the fact that this is now user-facing code, not just a @mahout-dev-facing
problem where we could have tackled it simply by documenting conventions.

Now, the big downside is that of course @public will likely write a lot of
dense elementwise things, aMx := f, instead of aMx ::= f, but this will only
cause performance problems, not correctness errors.

And i generally side with correctness and explicit overrides in cases like
this (and a proper tutorial).

(2) The call f(0) may be illegal due to function domain issues and cause an
interruption. Now, again, not a very plausible thing, but the same arguments
apply (non-fool-proof things are hard to find).

(3) The `matrix argument is sparse` condition is problematic because it assumes
that there's nothing to gain by skipping zeros in a dense matrix. This relies
heavily on the assumption that the cost of computing f is about the same as the
cost of traversal (and in cases like f = exp(x) it probably is). But in cases
where the cost of computing f >> the cost of dense traversal, we may find that
denseMatrix ::= f actually performs non-trivially better than
denseMatrix := f.
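
E.g. something like this (a sketch assuming the proposed ::= form; heavy and
denseButMostlyZero are made-up names purely for illustration):

// an artificially expensive per-element function, with heavy(0) == 0
val heavy: Double => Double =
  x => (1 to 1000).foldLeft(x)((acc, _) => math.sin(acc))

denseButMostlyZero := heavy    // evaluates heavy at every cell, zeros included
denseButMostlyZero ::= heavy   // proposed form: evaluates heavy at non-zeros only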

(4) Finally, in the current code there's only one version of assign, which is

mx := f(row, col, x)

and the test code is just adding the version

mx := f(x)

but the former form of functional assignment is still pretty much in
demand. This former form is the particular problem, since we cannot
efficiently test f(row, col, 0) == 0 over the entire domain of { (row, col) }.
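
For contrast, the two forms side by side (a sketch; the (row, col, x) form is
the one already in the code, the scalar form is what the experiment adds):

// scalar form: a single probe f(0.0) == 0.0 is at least meaningful
mx := ((x: Double) => math.abs(x))

// positional form: f(row, col, 0) == 0 would have to hold for every (row, col),
// which we cannot test cheaply
mx := ((row: Int, col: Int, x: Double) => if (row == col) x + 1.0 else x)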



Finally, unfortunately it looks like there will always be some performance
tricks like that. E.g. for in-memory stuff, as it stands, one needs to
realize that stuff like

2 *=: (A *= A)

(i.e. a simple 2*A^2 computation)

is going to do 2 traversals of the same matrix structure whereas

 A ::= { (x) => 2 * x * x }

is going to have only one, and therefore is likely to be almost twice as
efficient (not to mention sparsity-assignment optimality issues). So efficiency
will require some black magic for the foreseeable future in cases beyond just
prototyping, especially for in-memory stuff, anyway.
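
Incidentally, that is also how the -0.5 * (a * a) case from earlier in the
thread would collapse into a single pass (a sketch, assuming the scalar
functional-assignment form):

// one traversal, no clones; and since f(0) == 0 the sparse ::= form applies
a ::= { (x: Double) => -0.5 * x * x }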



On Mon, Nov 17, 2014 at 2:16 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Isn't it true that sparse iteration should always be used for m := f iff

 1) the matrix argument is sparse

 AND

 2) f(0) == 0

 ?

 Why the need for syntactic notation at all?  This property is much easier
 to test than commutativity.


 On Sun, Nov 16, 2014 at 7:42 AM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  another thing is that optimizer isn't capable of figuring out all
  elementwise fusions in an elementwise expression, e.g. it is not seing
  commutativity rewrites such as
 
  A * B * A
 
  should optimally be computed as sqr(A) * B (it will do it as two pairwise
  operators (A*B)*A).
 
  Bummer.
 
  To do it truly right, it needs to fuse entire elementwise expressions
 first
  and then optmize them separately.
 
  Ok that's probably too much for now. I am quite ok with writing something
  like -0.5 * (a * a ) for now.
 
  On Sat, Nov 15, 2014 at 10:14 AM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
   PS
  
   actually applying an exponent funciton in place will require addtional
   underscore it looks. It doesn't want to treat function name as function
   type in this context for some reason (although it does not require
  partial
   syntax when used in arguments inside parenthesis):
  
   m := exp _
  
   Scala is quirky this way i guess
  
  
   On Sat, Nov 15, 2014 at 10:02 AM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
  
   So i did quick experimentation with elementwise operator improvements:
  
   (1) stuff like 1 + exp (M):
  
   (1a): this requires generalization in optimizer for elementwise unary
   operators. I've added things like notion if operators require non-zero
   iteration only or not.
  
   (1b): added fusion of elemntwise operators, i.e.
  ew(1+, ew(exp, A)) is rewritten as ew (1+exp, A) for performance
   reasons. It still uses an application of a fold over functional
 monoid,
  but
   i think it should be fairly ok performance/DSL trade-off here. to get
 it
   even better, we may add functional assignment syntax to 

Re: elementwise operator improvements experiments

2014-11-17 Thread Dmitriy Lyubimov
Actually, one thing that could be automated here for in-memory functional
assignments is perhaps the order of traversal. This also has a pretty
intimate connection to issues like cost-based algorithm selection for
matrix-matrix multiplication.

I.e. right now, incidentally, it would be more efficient to write

A ::= f

if A is a SparseRowMatrix or a DenseMatrix, and

A.t ::= f

if A is a SparseColumnMatrix (or a transposed view of any row-backed
implementation), since the ::= operator always does row-wise traversal.

That's one instance of `black magic` that i'd like to do away with (which
is also relatively easy to do by exposing some standard matrix-structure cost
patterns, as was done with functions and vectors).
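
Something along these lines, i.e. pick the traversal orientation from the
backing structure instead of asking the user to (a sketch; isRowMajor and
assignSparse are illustrative names, and ::= is the operator proposed earlier,
not existing API):

import org.apache.mahout.math.Matrix

// Sketch only: orient the functional assignment by the matrix's backing layout.
def assignSparse(a: Matrix, f: Double => Double, isRowMajor: Matrix => Boolean): Matrix = {
  if (isRowMajor(a)) a ::= f    // row-backed: row-wise traversal is cheap
  else a.t ::= f                // column-backed or transposed view: traverse the view
  a
}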



On Mon, Nov 17, 2014 at 12:18 PM, Dmitriy Lyubimov dlie...@gmail.com
wrote:

 This is an astute observation.

 There are few things that make it difficult though to accept however:

 (1) Occam razor assumption that f will be sought as most simple
 implementations. Namely, that they would be deterministic and stateless.
 Note that something like

 new SparseMatrix() := { if (rnd.next < 0.5) 0 else 1 }

 will render our test for sparsity assignment also non-deterministic, which
 is not good.

 Now, solely for the sake of self-irony, we can play Sheldon Cooper for a
 second and say that this implementation is pointless, and that in general
 any non-deterministic implementation is pointless, and therefore
 implausible.

 But being implausible is not the same as impossible, so then we have a
 classic conflict between performance and correctness here. The problem is
 being exacerbated by the fact that implausible problems hardest to find,
 and the fact that this is now user-facing code, not just @mahout-dev facing
 problem where we could have tackled it simply by documenting conventions.

 Now big downside is that of course @public will likely write a lot of
 dense elementwise things aMx := f instead of aMx ::= f, but this will only
 cause performance errors, but not correctness errors.

 And i generally side with correctness and explicit overrides in case like
 this. (and proper tutorial)

 (2) call f(0) may be illegal due to function domain issues and cause
 interruption. Now, again, not very plausible thing, but same argumentations
 apply (non fool-proof things are hard to find).

 (3) `matrix argument is sparse` condition is problematic because it
 assumes that there's nothing to gain by skipping zeros in dense matrix.
 This relies heavily on assumption that cost of computing f is about the
 same as cost of traversal (and in cases like  f=exp(x) it probably is). But
 in cases when cost of computing f >> cost of dense traversal, we may find
 that denseMatrix ::= f is actually non-trivially better performant than
 denseMatrix := f.

 (4) finally, in current code there's only one version of assign, which is

 mx := f(row, col, x)

 and test code is just adding version

 mx := f(x)

 but the former form of functional assignment is still pretty much in
 demand. This former form is what is a particular problem, since we cannot
 efficiently test  f(0) = 0 for the entire domain of { (row,col) }.



 Finally, unfortunately it looks like there will always be some performance
 tricks like that. e.g. for in-memory stuff, as it stands, one needs to
 realize that stuff like

 2 *=: (A *= A)

 (i.e. simple 2*A^2  computation)

 is going to do 2 traversals of the same matrix structure whereas

  A ::= { (x) => 2 * x * x }

 is going to have only one and therefore is likely to be perhaps almost
 twice as efficient (not to mention sparsity assignment optimality issues),
 so efficiency will require some black magic for foreseeable future in cases
 beyond just prototyping, especially for in-memory stuff, anyway.



 On Mon, Nov 17, 2014 at 2:16 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:

 Isn't it true that sparse iteration should always be used for m := f iff

 1) the matrix argument is sparse

 AND

 2) f(0) == 0

 ?

 Why the need for syntactic notation at all?  This property is much easier
 to test than commutativity.


 On Sun, Nov 16, 2014 at 7:42 AM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  another thing is that optimizer isn't capable of figuring out all
  elementwise fusions in an elementwise expression, e.g. it is not seing
  commutativity rewrites such as
 
  A * B * A
 
  should optimally be computed as sqr(A) * B (it will do it as two
 pairwise
  operators (A*B)*A).
 
  Bummer.
 
  To do it truly right, it needs to fuse entire elementwise expressions
 first
  and then optmize them separately.
 
  Ok that's probably too much for now. I am quite ok with writing
 something
  like -0.5 * (a * a ) for now.
 
  On Sat, Nov 15, 2014 at 10:14 AM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
   PS
  
   actually applying an exponent funciton in place will require addtional
   underscore it looks. It doesn't want to treat function name as
 function
   type in this context for some reason (although it does not 

AtA error

2014-11-17 Thread Pat Ferrel
For a matrix with about 4600 rows and somewhere around 27790 columns (not sure
of the exact dimensions), when executing the following line from AtA

 /** The version of A'A that does not use GraphX */
 def at_a_nongraph(op: OpAtA[_], srcRdd: DrmRdd[_]): DrmRdd[Int] = {

a vector is created whose size causes the error. How could I have
constructed a drm that would cause this error? If the column IDs were
non-contiguous, would that yield this error?

==

14/11/12 17:56:03 ERROR executor.Executor: Exception in task 5.0 in stage 18.0 
(TID 66169)
org.apache.mahout.math.IndexException: Index 27792 is outside allowable range 
of [0,27789)
at 
org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
at 
org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
at 
org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
at 
org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at 
scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at 
scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at 
scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
at 
scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)



Re: AtA error

2014-11-17 Thread Dmitriy Lyubimov
On Mon, Nov 17, 2014 at 3:46 PM, Pat Ferrel p...@occamsmachete.com wrote:

 A matrix with about 4600 rows and somewhere around 27790 columns when
 executing the following line from AtA (not sure of the exact dimensions)

  /** The version of A'A that does not use GraphX */
  def at_a_nongraph(op: OpAtA[_], srcRdd: DrmRdd[_]): DrmRdd[Int] = {

 a vector is created whose size is causes the error. How could I have
 constructed a drm that would cause this error? If the column IDs were
 non-contiguous would that yield this error?


what did you do specifically to build matrix A?


 ==

 14/11/12 17:56:03 ERROR executor.Executor: Exception in task 5.0 in stage
 18.0 (TID 66169)
 org.apache.mahout.math.IndexException: Index 27792 is outside allowable
 range of [0,27789)
 at
 org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
 at
 org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
 at
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
 at
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
 at
 scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
 at
 scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
 at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
 at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
 at
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
 at
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
 at
 scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
 at
 scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
 at
 scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
 at
 org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:695)




Re: AtA error

2014-11-17 Thread Dmitriy Lyubimov
So this is not a problem of the A'A computation -- the input is obviously
invalid.

The question is what you did before you got the A handle -- did you read it
from a file? parallelize it from an in-core matrix (drmParallelize)? get it as
a result of another computation (if yes, then what)? wrap it around a manually
crafted RDD (drmWrap)?

I don't understand the question about non-contiguous ids. You are referring
to some context of your computation assuming I am in that context (but i am
unfortunately not).

On Mon, Nov 17, 2014 at 4:55 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:



 On Mon, Nov 17, 2014 at 3:46 PM, Pat Ferrel p...@occamsmachete.com wrote:

 A matrix with about 4600 rows and somewhere around 27790 columns when
 executing the following line from AtA (not sure of the exact dimensions)

  /** The version of A'A that does not use GraphX */
  def at_a_nongraph(op: OpAtA[_], srcRdd: DrmRdd[_]): DrmRdd[Int] = {

 a vector is created whose size is causes the error. How could I have
 constructed a drm that would cause this error? If the column IDs were
 non-contiguous would that yield this error?


 what did you do specifically to build matrix A?


 ==

 14/11/12 17:56:03 ERROR executor.Executor: Exception in task 5.0 in stage
 18.0 (TID 66169)
 org.apache.mahout.math.IndexException: Index 27792 is outside allowable
 range of [0,27789)
 at
 org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
 at
 org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
 at
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
 at
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
 at
 scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
 at
 scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
 at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
 at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
 at
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
 at
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
 at
 scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
 at
 scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
 at
 scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
 at
 org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:695)





Re: AtA error

2014-11-17 Thread Pat Ferrel
It’s in spark-itemsimilarity. This job reads elements and assigns them to one
of two RDD-backed drms.

I assumed it was a badly formed drm, but it’s a 140MB dataset and a bit hard to
nail down—just looking for a clue. I read this to say that an ID for an element
in a row vector was larger than drm.ncol, correct?
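
For what it’s worth, the same bound check can be tripped directly with plain
mahout-math (a sketch, not the exact code path inside AtA):

import org.apache.mahout.math.RandomAccessSparseVector

// a row vector declared with cardinality 27789 ...
val row = new RandomAccessSparseVector(27789)

// ... fed an element id beyond that cardinality:
row.set(27792, 1.0)
// -> org.apache.mahout.math.IndexException: Index 27792 is outside allowable range of [0,27789)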


On Nov 17, 2014, at 4:58 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

So this is not a problem of A'A computation -- the input is obviously
invalid.

Question is what you did before you got a A handle -- read it from file?
parallelized it from in-core matrix (drmParallelize)? as a result of other
computation (if yes than what)? wrapped around manually crafted RDD
(drmWrap)?

I don't understand the question about non-continuous ids. You are referring
to some context of your computation assuming I am in context (but i am
unfortunately not)

On Mon, Nov 17, 2014 at 4:55 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 
 
 On Mon, Nov 17, 2014 at 3:46 PM, Pat Ferrel p...@occamsmachete.com wrote:
 
 A matrix with about 4600 rows and somewhere around 27790 columns when
 executing the following line from AtA (not sure of the exact dimensions)
 
 /** The version of A'A that does not use GraphX */
 def at_a_nongraph(op: OpAtA[_], srcRdd: DrmRdd[_]): DrmRdd[Int] = {
 
 a vector is created whose size is causes the error. How could I have
 constructed a drm that would cause this error? If the column IDs were
 non-contiguous would that yield this error?
 
 
 what did you do specifically to build matrix A?
 
 
 ==
 
 14/11/12 17:56:03 ERROR executor.Executor: Exception in task 5.0 in stage
 18.0 (TID 66169)
 org.apache.mahout.math.IndexException: Index 27792 is outside allowable
 range of [0,27789)
at
 org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
at
 org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
at
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
at
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
at
 scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at
 scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at
 scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
at
 scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
at
 scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
at
 org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
 
 
 



Re: AtA error

2014-11-17 Thread Dmitriy Lyubimov
On Mon, Nov 17, 2014 at 5:16 PM, Pat Ferrel p...@occamsmachete.com wrote:

 It’s in spark-itemsimilarity. This job reads elements and assigns them to
 one of two RDD backed drms.

 I assumed it was a badly formed drm but it’s a 140MB dataset and a bit
 hard to nail down—just looking for a clue. I read this to say that an ID
 for an element in a row vector was larger than drm.ncol, correct?


yes.

and then it again comes back to the question of how the matrix was
constructed. In general, computation of the dimensions (ncol, nrow) is
automatic and lazy, meaning if you have not specified the dimensions anywhere
explicitly, it will lazily compute them for you. But if you did volunteer
them anywhere (such as in a drmWrap() call), they have got to be correct.
Otherwise you see things like this.
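
I.e. something like this is where it usually goes wrong (a sketch; the rdd is
assumed built upstream, the numbers just mirror your log, and drmWrap's
optional nrow/ncol arguments are assumed here):

// volunteering dimensions to drmWrap: if any row vector carries a column id
// >= ncol (27792 vs. 27789 here), any operator that trusts ncol -- A'A in
// this case -- will eventually hit the out-of-range index
val drmA = drmWrap(rowVectorRdd, nrow = 4600, ncol = 27789)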



 On Nov 17, 2014, at 4:58 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 So this is not a problem of A'A computation -- the input is obviously
 invalid.

 Question is what you did before you got a A handle -- read it from file?
 parallelized it from in-core matrix (drmParallelize)? as a result of other
 computation (if yes than what)? wrapped around manually crafted RDD
 (drmWrap)?

 I don't understand the question about non-continuous ids. You are referring
 to some context of your computation assuming I am in context (but i am
 unfortunately not)

 On Mon, Nov 17, 2014 at 4:55 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

 
 
  On Mon, Nov 17, 2014 at 3:46 PM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
  A matrix with about 4600 rows and somewhere around 27790 columns when
  executing the following line from AtA (not sure of the exact dimensions)
 
  /** The version of A'A that does not use GraphX */
  def at_a_nongraph(op: OpAtA[_], srcRdd: DrmRdd[_]): DrmRdd[Int] = {
 
  a vector is created whose size is causes the error. How could I have
  constructed a drm that would cause this error? If the column IDs were
  non-contiguous would that yield this error?
 
 
  what did you do specifically to build matrix A?
 
 
  ==
 
  14/11/12 17:56:03 ERROR executor.Executor: Exception in task 5.0 in
 stage
  18.0 (TID 66169)
  org.apache.mahout.math.IndexException: Index 27792 is outside allowable
  range of [0,27789)
 at
  org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
 at
  org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
 at
 
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
 at
 
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
 at
  scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
 at
  scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
 at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
 at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
 at
 
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
 at
 
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
 at
 
 scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
 at
  scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
 at
  scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
 at
  org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 at
 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
 at
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:695)
 
 
 




Re: AtA error

2014-11-17 Thread Pat Ferrel
I do use drmWrap so I’ll check there, thanks

On Nov 17, 2014, at 5:22 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

On Mon, Nov 17, 2014 at 5:16 PM, Pat Ferrel p...@occamsmachete.com wrote:

 It’s in spark-itemsimilarity. This job reads elements and assigns them to
 one of two RDD backed drms.
 
 I assumed it was a badly formed drm but it’s a 140MB dataset and a bit
 hard to nail down—just looking for a clue. I read this to say that an ID
 for an element in a row vector was larger than drm.ncol, correct?
 

yes.

and then it again comes back to the question how the matrix was
constructed. General construction of dimensions (ncol, nrow) is
automatic-lazy, meaning if you have not specified dimensions anywhere
explicitly, it will lazily compute it for you. But if you did volunteer
them anywhere (such as to drmWrap() call) they got to be good. Or you see
things like this.

 
 
 On Nov 17, 2014, at 4:58 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 
 So this is not a problem of A'A computation -- the input is obviously
 invalid.
 
 Question is what you did before you got a A handle -- read it from file?
 parallelized it from in-core matrix (drmParallelize)? as a result of other
 computation (if yes than what)? wrapped around manually crafted RDD
 (drmWrap)?
 
 I don't understand the question about non-continuous ids. You are referring
 to some context of your computation assuming I am in context (but i am
 unfortunately not)
 
 On Mon, Nov 17, 2014 at 4:55 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:
 
 
 
 On Mon, Nov 17, 2014 at 3:46 PM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
 A matrix with about 4600 rows and somewhere around 27790 columns when
 executing the following line from AtA (not sure of the exact dimensions)
 
/** The version of A'A that does not use GraphX */
def at_a_nongraph(op: OpAtA[_], srcRdd: DrmRdd[_]): DrmRdd[Int] = {
 
 a vector is created whose size is causes the error. How could I have
 constructed a drm that would cause this error? If the column IDs were
 non-contiguous would that yield this error?
 
 
 what did you do specifically to build matrix A?
 
 
 ==
 
 14/11/12 17:56:03 ERROR executor.Executor: Exception in task 5.0 in
 stage
 18.0 (TID 66169)
 org.apache.mahout.math.IndexException: Index 27792 is outside allowable
 range of [0,27789)
   at
 org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
   at
 org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
   at
 
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
   at
 
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
   at
 scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
   at
 scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
   at
 
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
   at
 
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
   at
 
 scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
   at
 scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
   at
 scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
   at
 org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
   at
 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
   at
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
   at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
   at java.lang.Thread.run(Thread.java:695)
 
 
 
 
 



Re: AtA error

2014-11-17 Thread Dmitriy Lyubimov
Also, technically, all vectors should be (or are expected to be) of the same
length in a valid matrix (which doesn't mean they actually have to have all
elements present -- or even all vectors, of course). So if needed, just run a
simple validation map before drmWrap to validate or to clean this up,
whichever is suitable.
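
E.g. something as dumb as this before the drmWrap call (a sketch; ncol here is
whatever dimension you are about to volunteer to drmWrap):

import org.apache.mahout.sparkbindings._

// Sketch: fail fast if any row vector has a cardinality other than ncol,
// instead of letting A'A trip over it later.
def validated[K](rdd: DrmRdd[K], ncol: Int): DrmRdd[K] = rdd.map { case (key, v) =>
  require(v.size == ncol, s"row $key has cardinality ${v.size}, expected $ncol")
  (key, v)
}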



On Mon, Nov 17, 2014 at 5:24 PM, Pat Ferrel p...@occamsmachete.com wrote:

 I do use drmWrap so I’ll check there, thanks

 On Nov 17, 2014, at 5:22 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 On Mon, Nov 17, 2014 at 5:16 PM, Pat Ferrel p...@occamsmachete.com wrote:

  It’s in spark-itemsimilarity. This job reads elements and assigns them to
  one of two RDD backed drms.
 
  I assumed it was a badly formed drm but it’s a 140MB dataset and a bit
  hard to nail down—just looking for a clue. I read this to say that an ID
  for an element in a row vector was larger than drm.ncol, correct?
 

 yes.

 and then it again comes back to the question how the matrix was
 constructed. General construction of dimensions (ncol, nrow) is
 automatic-lazy, meaning if you have not specified dimensions anywhere
 explicitly, it will lazily compute it for you. But if you did volunteer
 them anywhere (such as to drmWrap() call) they got to be good. Or you see
 things like this.

 
 
  On Nov 17, 2014, at 4:58 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 
  So this is not a problem of A'A computation -- the input is obviously
  invalid.
 
  Question is what you did before you got a A handle -- read it from file?
  parallelized it from in-core matrix (drmParallelize)? as a result of
 other
  computation (if yes than what)? wrapped around manually crafted RDD
  (drmWrap)?
 
  I don't understand the question about non-continuous ids. You are
 referring
  to some context of your computation assuming I am in context (but i am
  unfortunately not)
 
  On Mon, Nov 17, 2014 at 4:55 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
 
 
  On Mon, Nov 17, 2014 at 3:46 PM, Pat Ferrel p...@occamsmachete.com
  wrote:
 
  A matrix with about 4600 rows and somewhere around 27790 columns when
  executing the following line from AtA (not sure of the exact
 dimensions)
 
 /** The version of A'A that does not use GraphX */
 def at_a_nongraph(op: OpAtA[_], srcRdd: DrmRdd[_]): DrmRdd[Int] = {
 
  a vector is created whose size is causes the error. How could I have
  constructed a drm that would cause this error? If the column IDs were
  non-contiguous would that yield this error?
 
 
  what did you do specifically to build matrix A?
 
 
  ==
 
  14/11/12 17:56:03 ERROR executor.Executor: Exception in task 5.0 in
  stage
  18.0 (TID 66169)
  org.apache.mahout.math.IndexException: Index 27792 is outside allowable
  range of [0,27789)
at
  org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
at
 
 org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
at
 
 
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
at
 
 
 org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
at
 
 scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at
 
 scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at
 
 
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at
 
 
 scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
at
 
 
 scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
at
  scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
at
  scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
 
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
at
  org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at
 
 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at
 
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
 
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)