For the record, parallelizeEmpty does not create a partition-less rdd -- it does create empty rows. The reason is that partitions are not just data; they are task embodiment as well. So it is a way, e.g., to generate a random matrix in a distributed way.
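A minimal sketch of that usage against the Spark bindings DSL (not from the thread; the imports and the implicit distributed context are assumed to be set up as usual for the era's package layout, m and n are placeholders, and the element-by-element fill inside mapBlock is just one way to do it):

    import scala.util.Random
    import org.apache.mahout.math.scalabindings._
    import RLikeOps._
    import org.apache.mahout.sparkbindings._

    val (m, n) = (1000, 100)

    // drmParallelizeEmpty gives an m x n DRM of all-zero rows, already split
    // into partitions, so each partition is a unit of work for the fill below.
    val drmZeros = drmParallelizeEmpty(m, n, numPartitions = 4)

    // Fill every block with uniform random values in parallel; geometry is unchanged.
    val drmRandom = drmZeros.mapBlock(ncol = n) { case (keys, block) =>
      val rnd = new Random()
      for (r <- 0 until block.nrow; c <- 0 until block.ncol)
        block(r, c) = rnd.nextDouble()
      keys -> block
    }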
I am also not 100% positive that the lack of rows will not present a problem. I know that empty partitions present a problem -- and if any techniques imply that row-less partitions result, this may be a problem (in some situations they are explicitly filtered out post-op).

On Mon, Jul 21, 2014 at 2:08 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> I think we are straight on this finally.
>
> DRMs and rdds don’t need to embody every row, at least when using sequential Int keys; they are not corrupt if some rows are missing.
>
> Therefore rbind of drmParallelizeEmpty will work since it will only create a CheckpointedDrm where nrow is modified. It will not modify the rdd.
>
> If we had to modify the rdd, rbind would not work since the missing keys are interspersed throughout the matrices, not all at the end. So the hypothetically created rdd elements would have had the wrong Int keys. But no need to worry about this now--no need to modify the rdd.
>
>
> On Jul 21, 2014, at 1:05 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> agree, rbind and cbind are the ways to tweak geometry
>
>
> On Mon, Jul 21, 2014 at 12:24 PM, Anand Avati <av...@gluster.org> wrote:
>
> > The summary of the discussion is:
> >
> > Pat encountered a scenario where matrix multiplication was erroring because of mismatching A's rows and B's cols. His solution was to fixup/fudge A's nrow value to force the multiplication to happen. I think such a fixup of rows is better done through an rbind()-like operator (with an empty B matrix) instead of "editing" the nrow member. However, the problem seems to be that A's rows are fewer than desired because they have missing rows (i.e., the int key sequence has holes). I think such an object is corrupted to begin with. And even if you were to fudge nrow, OpAewScalar gives math errors (as demonstrated in the code example), and AewB, CbindAB are giving runtime exceptions on the cogroup() RDD api. I guess Pat still feels these errors/exceptions must be fixed by filing a Jira.
> >
> > On Mon, Jul 21, 2014 at 11:49 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> >> Sorry I did not bear with all the discussion, but this change doesn't make sense to me.
> >>
> >> It is not algebraic, it is not R, and it also creates an algebraically incorrect object.
> >>
> >> On the topic of the "empty" rows, remember they are not really empty; they are matrices with 0.0 elements, and "emptiness" is just a compaction scheme that also happens to have some optimization meaning to various algebraic operations.
> >>
> >> So an "empty" matrix is really an absolutely valid matrix. It may cause various mathematical exceptions since it is rank-deficient, but there are no "mechanical" errors with that representation, so I am not sure what this discussion was all about (but then again, I had no time to read it all).
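The rbind()-with-an-empty-matrix approach discussed above might look roughly like the following against the Spark bindings DSL. This is only an illustration: it assumes the rbind operator on Int-keyed DRMs that the thread refers to, and drmA, drmB are placeholder names for matrices that should end up with matching row counts before A' %*% B.

    // Pad A's row cardinality with an all-zero DRM instead of editing nrow.
    // Assumes drmA and drmB are Int-keyed DRMs and B already has the desired row count.
    val padRows = (drmB.nrow - drmA.nrow).toInt
    val drmAPadded =
      if (padRows > 0) drmA rbind drmParallelizeEmpty(padRows, drmA.ncol)
      else drmA

    // A' %*% B now sees matching row counts on both operands.
    val drmAtB = drmAPadded.t %*% drmB

Note that this only tacks zero rows onto the end of A; it does not address interior holes in the Int key sequence, which is the separate issue discussed below.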
> >> On Mon, Jul 21, 2014 at 11:06 AM, ASF GitHub Bot (JIRA) <j...@apache.org> wrote:
> >>
> >>> [ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068936#comment-14068936 ]
> >>>
> >>> ASF GitHub Bot commented on MAHOUT-1541:
> >>> ----------------------------------------
> >>>
> >>> Github user pferrel commented on a diff in the pull request:
> >>>
> >>>     https://github.com/apache/mahout/pull/31#discussion_r15184775
> >>>
> >>>     --- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala ---
> >>>     @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
> >>>        private var cached: Boolean = false
> >>>        override val context: DistributedContext = rdd.context
> >>>
> >>>     +  /**
> >>>     +   * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
> >>>     +   * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]] value.
> >>>     +   * No physical changes are made to the underlying rdd; no blank rows are added as would be done with rbind(blankRows).
> >>>     +   * @param n number to increase row cardinality by
> >>>     +   * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
> >>>     +   *       results.
> >>>     +   */
> >>>     +  override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
> >>>     +    assert(n > -1)
> >>>     +    new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
> >>>     +  }
> >>>     --- End diff --
> >>>
> >>>     I see no fundamental reason for these not to work, but it may not be part of the DRM contract. So maybe I'll make a feature request Jira to support this.
> >>>
> >>>     In the meantime rbind will not solve this because A will have missing rows at the end but B may have them throughout--let alone some future C. So I think reading in all data into one drm with one row and column id space, then chopping into two or more drms based on column ranges, should give us empty rows where they are needed (I certainly hope so or I'm in trouble). Will have to keep track of which column ids go in which slice but that's doable.
> >>>
> >>>
> >>>> Create CLI Driver for Spark Cooccurrence Analysis
> >>>> -------------------------------------------------
> >>>>
> >>>>                 Key: MAHOUT-1541
> >>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> >>>>             Project: Mahout
> >>>>          Issue Type: New Feature
> >>>>          Components: CLI
> >>>>            Reporter: Pat Ferrel
> >>>>            Assignee: Pat Ferrel
> >>>>
> >>>> Create a CLI driver to import data in a flexible manner, create an IndexedDataset with BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate params, then write output with external IDs optionally reattached.
> >>>> Ultimately it should be able to read input as the legacy mr does but will support reading externally defined IDs and flexible formats. Output will be of the legacy format or text files of the user's specification with reattached Item IDs.
> >>>> Support for legacy formats is a question; users can always use the legacy code if they want this. Internal to the IndexedDataset is a Spark DRM so pipelining can be accomplished without any writing to an actual file, so the legacy sequence file output may not be needed.
> >>>> Opinions?
> >>>
> >>>
> >>> --
> >>> This message was sent by Atlassian JIRA
> >>> (v6.2#6252)
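To make the "read everything into one row and column id space, then chop into DRMs by column range" plan from Pat's GitHub comment above concrete, here is a hedged sketch. The tiny matrix, splitCol, and all names are invented for illustration, and mapBlock plus an in-core column-range view is just one possible way to do the chop; keeping the same Int row keys in every slice is what preserves row alignment, including the "empty" rows.

    import org.apache.mahout.math.scalabindings._
    import RLikeOps._
    import org.apache.mahout.sparkbindings._

    // A tiny stand-in for "all data in one row and column id space":
    // 4 rows, 5 columns; columns 0..2 belong to A, columns 3..4 to B.
    val inCoreAll = dense(
      (1, 0, 0, 1, 0),
      (0, 1, 0, 0, 1),
      (0, 0, 1, 1, 0),
      (1, 1, 0, 0, 0))
    val drmAll = drmParallelize(inCoreAll, numPartitions = 2)

    val splitCol = 3
    val nAll = inCoreAll.ncol

    // Chop into two DRMs by column range; every slice keeps all the original
    // row keys, so rows that end up all-zero stay "empty" where they are needed.
    val drmA = drmAll.mapBlock(ncol = splitCol) { case (keys, block) =>
      keys -> block(::, 0 until splitCol).cloned   // clone so we don't ship a view
    }
    val drmB = drmAll.mapBlock(ncol = nAll - splitCol) { case (keys, block) =>
      keys -> block(::, splitCol until nAll).cloned
    }

The bookkeeping Pat mentions (which external column ids go in which slice) would live alongside splitCol, e.g. in the IndexedDataset's BiMap dictionaries.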