Also, technically, all row vectors should be (or are expected to be) of the same length in a valid matrix. That doesn't mean they actually have to have all elements present -- or even all vectors, of course, since the representation is sparse. So if needed, just run a simple validation map over the RDD before drmWrap, either to validate or to clean this up, whichever is suitable; rough sketches of both follow, one right below and one after the quoted thread.
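Something like this would do as a minimal validation sketch (untested; the names validated, rowsRdd, and nCol are made up here, and it assumes a DrmRdd[K] is an RDD of (key, Vector) tuples as in the Spark bindings):

import scala.collection.JavaConverters._

import org.apache.mahout.math.Vector
import org.apache.mahout.sparkbindings._

// Fail fast on any element index outside [0, nCol). A sparse row vector may
// carry few (or no) non-zero elements, but none of them may point past the
// declared column count, or downstream ops like viewPart() will throw
// IndexException, as in the trace below.
def validated(rowsRdd: DrmRdd[Int], nCol: Int): DrmRdd[Int] =
  rowsRdd.map { case (key, v: Vector) =>
    v.nonZeroes.asScala.foreach { e =>
      require(e.index < nCol, s"row $key: element index ${e.index} >= ncol $nCol")
    }
    key -> v
  }

// Then wrap with the same dimension you validated against, e.g.:
// val drmA = drmWrap(validated(rowsRdd, nCol), ncol = nCol)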
On Mon, Nov 17, 2014 at 5:24 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I do use drmWrap so I'll check there, thanks
>
> On Nov 17, 2014, at 5:22 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> On Mon, Nov 17, 2014 at 5:16 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> > It's in spark-itemsimilarity. This job reads elements and assigns them
> > to one of two RDD-backed drms.
> >
> > I assumed it was a badly formed drm, but it's a 140MB dataset and a bit
> > hard to nail down -- just looking for a clue. I read this to say that
> > an ID for an element in a row vector was larger than drm.ncol, correct?
>
> yes.
>
> And then it again comes back to the question of how the matrix was
> constructed. Computation of the dimensions (ncol, nrow) is automatic-lazy,
> meaning that if you have not specified dimensions anywhere explicitly,
> they will be lazily computed for you. But if you did volunteer them
> anywhere (such as in a drmWrap() call), they have got to be correct.
> Otherwise you see things like this.
>
> > On Nov 17, 2014, at 4:58 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> > So this is not a problem of the A'A computation -- the input is
> > obviously invalid.
> >
> > The question is what you did before you got the A handle -- did you read
> > it from a file? parallelize it from an in-core matrix (drmParallelize)?
> > get it as the result of another computation (if yes, then which one)?
> > wrap it around a manually crafted RDD (drmWrap)?
> >
> > I don't understand the question about non-contiguous ids. You are
> > referring to some context of your computation assuming I am in context
> > (but I am unfortunately not).
> >
> > On Mon, Nov 17, 2014 at 4:55 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> >> On Mon, Nov 17, 2014 at 3:46 PM, Pat Ferrel <p...@occamsmachete.com>
> >> wrote:
> >>
> >>> With a matrix of about 4600 rows and somewhere around 27790 columns
> >>> (not sure of the exact dimensions), when executing the following line
> >>> from AtA:
> >>>
> >>> /** The version of A'A that does not use GraphX */
> >>> def at_a_nongraph(op: OpAtA[_], srcRdd: DrmRdd[_]): DrmRdd[Int] = {
> >>>
> >>> a vector is created whose size causes the error. How could I have
> >>> constructed a drm that would cause this error? If the column IDs were
> >>> non-contiguous, would that yield this error?
> >>
> >> What did you do specifically to build matrix A?
> >>
> >>> ==================
> >>>
> >>> 14/11/12 17:56:03 ERROR executor.Executor: Exception in task 5.0 in stage 18.0 (TID 66169)
> >>> org.apache.mahout.math.IndexException: Index 27792 is outside allowable range of [0,27789)
> >>> at org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
> >>> at org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
> >>> at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
> >>> at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
> >>> at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> >>> at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> >>> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
> >>> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
> >>> at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> >>> at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> >>> at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
> >>> at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
> >>> at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
> >>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> >>> at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
> >>> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> >>> at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> >>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> >>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> >>> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> >>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> >>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> >>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> >>> at java.lang.Thread.run(Thread.java:695)
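And for the clean-up route mentioned at the top -- since dimensions are computed automatic-lazily unless you volunteer them -- one option is to infer ncol from the data instead of guessing it. A rough sketch under the same assumptions as the validation example above (untested; rowsRdd and wrapWithInferredNCol are again made-up names):

import scala.collection.JavaConverters._

import org.apache.mahout.math.Vector
import org.apache.mahout.sparkbindings._

// The smallest ncol that fits the data is one past the largest non-zero
// element index found in any row vector.
def wrapWithInferredNCol(rowsRdd: DrmRdd[Int]) = {
  val maxIndex = rowsRdd
    .map { case (_, v: Vector) =>
      v.nonZeroes.asScala.foldLeft(-1)((m, e) => math.max(m, e.index))
    }
    .fold(-1)(math.max)
  // Volunteer the inferred column count; nrow stays automatic-lazy.
  drmWrap(rowsRdd, ncol = maxIndex + 1)
}

Alternatively, don't volunteer ncol at all and let it be computed lazily on first use.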