Re: elementwise operator improvements experiments
Isn't it true that sparse iteration should always be used for m := f iff 1) the matrix argument is sparse AND 2) f(0) == 0? Why the need for syntactic notation at all? This property is much easier to test than commutativity.

On Sun, Nov 16, 2014 at 7:42 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

Another thing is that the optimizer isn't capable of figuring out all elementwise fusions in an elementwise expression. E.g. it is not seeing commutativity rewrites: A * B * A should optimally be computed as sqr(A) * B, but it will compute it as two pairwise operators, (A * B) * A. Bummer. To do it truly right, it needs to fuse entire elementwise expressions first and then optimize them separately. Ok, that's probably too much for now. I am quite ok with writing something like -0.5 * (a * a) for now.

On Sat, Nov 15, 2014 at 10:14 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

PS: actually, applying an exponent function in place will require an additional underscore, it looks. Scala doesn't want to treat a function name as a function value in this context for some reason (although it does not require the partial-application syntax when the function is used as an argument inside parentheses):

    m := exp _

Scala is quirky this way, I guess.

On Sat, Nov 15, 2014 at 10:02 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

So I did a quick experiment with elementwise operator improvements:

(1) Stuff like 1 + exp(M):

(1a) This requires a generalization in the optimizer for elementwise unary operators. I've added things like the notion of whether an operator requires non-zero-only iteration or not.

(1b) Added fusion of elementwise operators, i.e. ew(1+, ew(exp, A)) is rewritten as ew(1 + exp, A) for performance reasons. It still uses an application of a fold over a functional monoid, but I think it should be a fairly ok performance/DSL trade-off here. To get it even better, we may add functional assignment syntax to distributed operands, similar to the in-memory types, as described further down.

(1c) The notion that self-elementwise things such as expr1 * expr1 (which is a surprisingly frequent occurrence, e.g. in Torgerson MDS) are rewritten as ew(square, A) etc.

So that much works. (Note that this also obsoletes the dedicated scalar/matrix elementwise operators that currently exist.) Good.

The problem here is that (of course!) the semantics of the Scala language has trouble importing something like exp(Double): Double alongside exp(DRM): DRM, apparently because the pair doesn't adhere to overloading rules (different result types), so in practice, even though it is allowed, one import overshadows the other. Which means that, for the sake of the DSL, we can't have exp(matrix); we have to name it something else. Unless you see a better solution. So... elementwise naming options:

Matrix: mexp(m), msqrt(m), msignum(m)
Vector: vexp(v), vsqrt(v)...
DRM: dexp(drm), dsqrt(drm)?

Let me know what you think.

(2) Another problem is that actually doing something like 1 + exp(m) on Matrix or Vector types is pretty impractical since, unlike R (which can count the number of variables bound to an object), our semantics require creating a clone of m for something like exp(m) to guarantee no side effects on m itself. That is, the expression 1 + exp(m) for Matrix or Vector types causes 2 clone-copies of the original argument. That's actually why I quite often use in-place syntax for in-memory types, something like 1 +=: (x *= x) instead of the more natural-looking 1 + x * x. But unlike with simple elementwise operators (+=), there's no in-place modification syntax for a function.
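To make the cloning cost concrete, here is a minimal sketch over a toy array-backed stand-in (plain Scala; these are not the Mahout Matrix/Vector types):

    // Side-effect-free elementwise ops must clone their argument, so
    // 1 + exp(m) costs two traversals and two fresh copies:
    def exp(m: Array[Double]): Array[Double]   = m.map(math.exp)  // clone #1
    def plus1(m: Array[Double]): Array[Double] = m.map(1.0 + _)   // clone #2
    // val r = plus1(exp(m))

    // The in-place style (1 +=: (x *= x) in the DSL) mutates one buffer
    // in a single pass instead:
    def inPlace(m: Array[Double], f: Double => Double): Unit = {
      var i = 0
      while (i < m.length) { m(i) = f(m(i)); i += 1 }
    }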
We could add an additional parameter, something like mexp(m, inPlace = true), but I don't like it too much. What I like much more is functional assignment (we already have assignment to a function (row, col, x) => Double, but we can add elementwise function assignment) so that it really looks like:

    m := exp

That is pretty cool. Except there's a problem with the optimality of the assignment. There are functions here (e.g. abs, sqrt) that don't require full iteration but rather non-zero iteration only, and by default the notation m := func implies dense iteration. So what I suggest here is to add a new syntax for sparse-iteration functional assignments:

    m ::= abs

I actually like it (a lot) because it is short and because it allows more complex formulas in the same traversal; e.g. the proverbial R exp(m) + 1, computed in place, will look like m := (1 + exp(_)). So, not terrible. What it lacks, though, is automatic determination of whether a composite function needs to apply to all elements vs. non-zeros only for in-memory types (for distributed types the optimizer tracks this automatically). I.e. m := abs is not optimal (because abs doesn't affect 0s), and m ::= (abs(_) + 1) is probably also not what one wants (when we have a composition of dense- and sparse-affecting functions, the result is dense).
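To make the := vs. ::= distinction concrete, a sketch of the two traversal semantics over a toy sparse row stored as index -> value (hypothetical code, not Mahout's implementation):

    import scala.collection.mutable

    // Dense functional assignment (:=): visits every cell, zeros included.
    def assignDense(row: mutable.Map[Int, Double], ncol: Int, f: Double => Double): Unit =
      for (i <- 0 until ncol) {
        val y = f(row.getOrElse(i, 0.0))
        if (y == 0.0) row.remove(i) else row(i) = y
      }

    // Sparse functional assignment (::=): visits stored non-zeros only.
    // Correct only when f(0) == 0; otherwise zeros that should have become
    // non-zero are silently skipped.
    def assignSparse(row: mutable.Map[Int, Double], f: Double => Double): Unit =
      for ((i, x) <- row.toSeq) row(i) = f(x)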
Re: elementwise operator improvements experiments
This is an astute observation. There are a few things, however, that make it difficult to accept:

(1) Occam's razor suggests assuming that implementations of f will be the simplest possible, namely deterministic and stateless. Note that something like

    new SparseMatrix() := { if (rnd.next < 0.5) 0 else 1 }

will render our test for sparse assignment non-deterministic as well, which is not good. Now, solely for the sake of self-irony, we can play Sheldon Cooper for a second and say that this implementation is pointless, that in general any non-deterministic implementation is pointless, and therefore implausible. But being implausible is not the same as impossible, so we then have a classic conflict between performance and correctness here. The problem is exacerbated by the fact that implausible problems are the hardest to find, and the fact that this is now user-facing code, not just a @mahout-dev-facing problem where we could have tackled it simply by documenting conventions. Now, the big downside is of course that @public will likely write a lot of dense elementwise things as aMx := f instead of aMx ::= f, but this will only cause performance errors, not correctness errors. And I generally side with correctness and explicit overrides in cases like this (and a proper tutorial).

(2) The call f(0) may be illegal due to function domain issues and cause an interruption. Again, not a very plausible thing, but the same argumentation applies (non-fool-proof things are hard to find).

(3) The `matrix argument is sparse` condition is problematic because it assumes there's nothing to gain by skipping zeros in a dense matrix. This relies heavily on the assumption that the cost of computing f is about the same as the cost of traversal (and in cases like f = exp(x) it probably is). But in cases when the cost of computing f >> the cost of dense traversal, we may find that denseMatrix ::= f actually performs non-trivially better than denseMatrix := f.

(4) Finally, in the current code there's only one version of assign, which is mx := f(row, col, x); the test code is just adding the version mx := f(x), but the former form of functional assignment is still pretty much in demand. That former form is the particular problem here, since we cannot efficiently test f(row, col, 0) == 0 over the entire domain of { (row, col) }.

Finally, unfortunately, it looks like there will always be some performance tricks like that. E.g. for in-memory stuff, as it stands, one needs to realize that something like 2 *=: (A *= A) (i.e. a simple 2*A^2 computation) is going to do 2 traversals of the same matrix structure, whereas A ::= { x => 2 * x * x } is going to do only one, and is therefore likely to be perhaps almost twice as efficient (not to mention the sparse-assignment optimality issues). So efficiency will require some black magic for the foreseeable future in cases beyond just prototyping, especially for in-memory stuff, anyway.

On Mon, Nov 17, 2014 at 2:16 AM, Ted Dunning ted.dunn...@gmail.com wrote:

Isn't it true that sparse iteration should always be used for m := f iff 1) the matrix argument is sparse AND 2) f(0) == 0? Why the need for syntactic notation at all? This property is much easier to test than commutativity.
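Point (1) in code form, a hedged sketch (plain Scala) of why a one-shot f(0) == 0 probe cannot be trusted:

    // A legal, if pointless, non-deterministic elementwise function:
    val rnd = new scala.util.Random()
    val f: Double => Double = x => if (rnd.nextDouble() < 0.5) 0.0 else 1.0

    // A probe-based dispatcher samples f(0) once to decide on sparse
    // iteration, but the probe need not match what f returns during the
    // actual traversal, so zeros that should become 1 may silently stay 0.
    val sparseLooksSafe = f(0.0) == 0.0   // true about half the time; meaningless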
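And point (4) in code form (sketch; Scala function types): the coordinate form of assign admits no cheap probe at all.

    // Value-only form: a single probe f1(0) == 0 is at least well-defined.
    val f1: Double => Double = x => 2 * x * x

    // Coordinate form: zero-behavior can differ per cell, so there is no
    // probe short of evaluating f2(row, col, 0) for every (row, col).
    val f2: (Int, Int, Double) => Double =
      (row, col, x) => if (row == col) x + 1 else x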
Re: elementwise operator improvements experiments
Actually, one thing that could be automated here for in-memory functional assignments is the order of traversal. This actually also has a pretty intimate connection to issues like cost-based algorithm selection for matrix-matrix multiplication. I.e., right now it incidentally would be more efficient to write A ::= f if A is a SparseRowMatrix or a DenseMatrix, and A.t ::= f if A is a SparseColumnMatrix (or a transposed view of any row-backed implementation), since the ::= operator always does row-wise traversal. That's one instance of the `black magic` that I'd like to do away with (which is also relatively easy, by exposing some standard matrix-structure cost patterns, as was done with functions and vectors).

On Mon, Nov 17, 2014 at 12:18 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

This is an astute observation. There are a few things, however, that make it difficult to accept: [...]
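A toy model of the layout sensitivity being described (hypothetical types, not Mahout structures): a structure-aware ::= would pick the traversal orientation from the backing layout, so the caller never has to write A.t ::= f by hand.

    sealed trait Backing
    final case class RowMajor(rows: Vector[Map[Int, Double]]) extends Backing
    final case class ColMajor(cols: Vector[Map[Int, Double]]) extends Backing

    // Walk whichever orientation is sequential for the physical layout:
    def mapNonZeros(m: Backing, f: Double => Double): Backing = m match {
      case RowMajor(rows) => RowMajor(rows.map(_.map { case (j, x) => j -> f(x) }))
      case ColMajor(cols) => ColMajor(cols.map(_.map { case (i, x) => i -> f(x) }))
    }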
AtA error
A matrix with about 4600 rows and somewhere around 27790 columns (not sure of the exact dimensions) causes the error when executing the following line from AtA:

    /** The version of A'A that does not use GraphX */
    def at_a_nongraph(op: OpAtA[_], srcRdd: DrmRdd[_]): DrmRdd[Int] = {

A vector is created in there whose size causes the error. How could I have constructed a drm that would cause this error? If the column IDs were non-contiguous, would that yield this error?

==

14/11/12 17:56:03 ERROR executor.Executor: Exception in task 5.0 in stage 18.0 (TID 66169)
org.apache.mahout.math.IndexException: Index 27792 is outside allowable range of [0,27789)
  at org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
  at org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
  at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
  at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
  at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
  at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
  at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
  at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
  at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
  at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
  at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
  at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
  at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
  at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
  at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:54)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
  at java.lang.Thread.run(Thread.java:695)
Re: AtA error
On Mon, Nov 17, 2014 at 3:46 PM, Pat Ferrel p...@occamsmachete.com wrote:

How could I have constructed a drm that would cause this error? If the column IDs were non-contiguous, would that yield this error?

What did you do specifically to build matrix A?
Re: AtA error
So this is not a problem of the A'A computation -- the input is obviously invalid. The question is what you did before you got a handle to A: read it from a file? Parallelized it from an in-core matrix (drmParallelize)? Got it as the result of another computation (if yes, then which)? Wrapped it around a manually crafted RDD (drmWrap)?

I don't understand the question about non-contiguous IDs. You are referring to some context of your computation, assuming I am in that context (but unfortunately I am not).

On Mon, Nov 17, 2014 at 4:55 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

What did you do specifically to build matrix A? [...]
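For reference, the four construction paths being asked about, as they look in the Mahout Spark bindings (a hedged sketch; the paths and variable names are made up):

    // import org.apache.mahout.math.drm._
    // import org.apache.mahout.sparkbindings._
    //
    // val A1 = drmDfsRead("hdfs://.../matrix")   // read from DFS
    // val A2 = drmParallelize(inCoreMatrix)      // from an in-core matrix
    // val A3 = B.t %*% B                         // result of another computation
    // val A4 = drmWrap(rdd = myRdd, ncol = n)    // wrap a hand-crafted RDD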
Re: AtA error
It’s in spark-itemsimilarity. This job reads elements and assigns them to one of two RDD-backed DRMs. I assumed it was a badly formed DRM, but it’s a 140MB dataset and a bit hard to nail down -- just looking for a clue. I read this to say that an ID for an element in a row vector was larger than drm.ncol, correct?

On Nov 17, 2014, at 4:58 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

So this is not a problem of the A'A computation -- the input is obviously invalid. [...]
Re: AtA error
On Mon, Nov 17, 2014 at 5:16 PM, Pat Ferrel p...@occamsmachete.com wrote:

I read this to say that an ID for an element in a row vector was larger than drm.ncol, correct?

Yes. And then it again comes back to the question of how the matrix was constructed. In general, construction of the dimensions (ncol, nrow) is automatic and lazy, meaning that if you have not specified the dimensions explicitly anywhere, it will lazily compute them for you. But if you did volunteer them anywhere (such as in a drmWrap() call), they have got to be good. Or you see things like this.
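A hedged sketch of that failure mode (drmWrap is the real entry point; the ncol value and RDD here are invented for illustration):

    // If drmWrap is told ncol = 27789 but some row vector carries an
    // element at index 27792, any physical operator that trusts ncol --
    // A'A slices row vectors into [0, ncol) blocks, for instance -- will
    // eventually throw, exactly as in the trace above.
    //
    // val A   = drmWrap(rdd = myRdd, ncol = 27789) // declared width too small
    // val ata = A.t %*% A                          // IndexException: 27792 outside [0,27789)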
Re: AtA error
I do use drmWrap, so I’ll check there. Thanks.

On Nov 17, 2014, at 5:22 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

Yes. And then it again comes back to the question of how the matrix was constructed. [...]
Re: AtA error
Also, technically, all vectors should be (or are expected to be) of the same length in a valid matrix (that doesn't mean they actually have to contain all elements -- or even all vectors, of course). So, if needed, just run a simple validation map before drmWrap to validate or to clean this up, whichever is suitable.

On Mon, Nov 17, 2014 at 5:24 PM, Pat Ferrel p...@occamsmachete.com wrote:

I do use drmWrap, so I’ll check there. Thanks.
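A minimal sketch of such a validation map (hedged: expectedNcol and the failure policy are made up; in the Spark bindings a DrmRdd row is a (key, Vector) pair):

    // Fail fast on any row whose declared size disagrees with the intended
    // matrix width before handing the RDD to drmWrap:
    // val checked = rdd.map { case (key, v) =>
    //   require(v.size == expectedNcol,
    //     s"row $key has size ${v.size}, expected $expectedNcol")
    //   key -> v
    // }
    // val A = drmWrap(checked, ncol = expectedNcol)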