Agree, rbind() and cbind() are the ways to tweak geometry.
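For concreteness, here is a minimal in-core sketch of what an rbind()-style row padding could look like, using the math-scala R-like bindings. The padRows helper is hypothetical (this thread exists precisely because no such operator is part of the DRM contract yet); it grows geometry by allocating the target shape and copying rows across, rather than editing the nrow member:

import org.apache.mahout.math.{DenseMatrix, Matrix}
import org.apache.mahout.math.scalabindings._
import RLikeOps._

// Hypothetical helper: pad a matrix with `extra` trailing zero rows by
// allocating the target geometry and copying the existing rows across.
def padRows(m: Matrix, extra: Int): Matrix = {
  require(extra >= 0)
  val out = new DenseMatrix(m.nrow + extra, m.ncol)
  for (r <- 0 until m.nrow) out(r, ::) := m(r, ::)
  out // trailing rows are real 0.0 rows, so the result stays algebraically valid
}

val a = dense((1, 2), (3, 4)) // 2 x 2
val aPadded = padRows(a, 3)   // 5 x 2; downstream ops see consistent geometry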
On Mon, Jul 21, 2014 at 12:24 PM, Anand Avati <av...@gluster.org> wrote:

> The summary of the discussion is:
>
> Pat encountered a scenario where matrix multiplication was erroring out
> because A's rows and B's cols were mismatched. His solution was to
> fix up/fudge A's nrow value to force the multiplication to happen. I think
> such a fixup of rows is better done through an rbind()-like operator (with
> an empty B matrix) instead of "editing" the nrow member. However, the
> problem seems to be that A has fewer rows than desired because rows are
> missing (i.e., the int key sequence has holes). I think such an object is
> corrupted to begin with. And even if you were to fudge nrow, OpAewScalar
> gives math errors (as demonstrated in the code example), and AewB and
> CbindAB give runtime exceptions on the cogroup() RDD API. I guess Pat
> still feels these errors/exceptions must be fixed by filing a JIRA.
>
>
> On Mon, Jul 21, 2014 at 11:49 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
>> Sorry, I did not bear with all the discussion, but this change doesn't
>> make sense to me.
>>
>> It is not algebraic, it is not R, and it also creates an algebraically
>> incorrect object.
>>
>> On the topic of the "empty" rows, remember they are not really empty;
>> they are matrices with 0.0 elements, and "emptiness" is just a compaction
>> scheme that also happens to have some optimization meaning for various
>> algebraic operations.
>>
>> So an "empty" matrix is really an absolutely valid matrix. It may cause
>> various mathematical exceptions since it is rank-deficient, but there are
>> no "mechanical" errors with that representation, so I am not sure what
>> this discussion was all about (but then again, I had no time to read it
>> all).
>>
>>
>> On Mon, Jul 21, 2014 at 11:06 AM, ASF GitHub Bot (JIRA) <j...@apache.org>
>> wrote:
>>
>> > [ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068936#comment-14068936 ]
>> >
>> > ASF GitHub Bot commented on MAHOUT-1541:
>> > ----------------------------------------
>> >
>> > Github user pferrel commented on a diff in the pull request:
>> >
>> >     https://github.com/apache/mahout/pull/31#discussion_r15184775
>> >
>> >     --- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala ---
>> >     @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
>> >        private var cached: Boolean = false
>> >        override val context: DistributedContext = rdd.context
>> >
>> >     +  /**
>> >     +   * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
>> >     +   * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]] value.
>> >     +   * No physical changes are made to the underlying rdd; no blank rows are added as would be done with rbind(blankRows).
>> >     +   * @param n number to increase row cardinality by
>> >     +   * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get
>> >     +   *       unpredictable results.
>> >     +   */
>> >     +  override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
>> >     +    assert(n > -1)
>> >     +    new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
>> >     +  }
>> >     --- End diff --
>> >
>> >     I see no fundamental reason for these not to work, but it may not be
>> >     part of the DRM contract. So maybe I'll make a feature request JIRA
>> >     to support this.
>> >
>> >     In the meantime, rbind will not solve this, because A will have
>> >     missing rows at the end but B may have them throughout, let alone
>> >     some future C. So I think reading all the data into one DRM with a
>> >     single row and column id space, then chopping it into two or more
>> >     DRMs based on column ranges, should give us empty rows where they
>> >     are needed (I certainly hope so, or I'm in trouble). We will have to
>> >     keep track of which column ids go in which slice, but that's doable.
>> >
>> >
>> > > Create CLI Driver for Spark Cooccurrence Analysis
>> > > -------------------------------------------------
>> > >
>> > >                 Key: MAHOUT-1541
>> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>> > >             Project: Mahout
>> > >          Issue Type: New Feature
>> > >          Components: CLI
>> > >            Reporter: Pat Ferrel
>> > >            Assignee: Pat Ferrel
>> > >
>> > > Create a CLI driver to import data in a flexible manner, create an
>> > > IndexedDataset with BiMap ID translation dictionaries, call the Spark
>> > > CooccurrenceAnalysis with the appropriate params, then write output
>> > > with external IDs optionally reattached.
>> > > Ultimately it should be able to read input as the legacy MR code
>> > > does, but it will also support reading externally defined IDs and
>> > > flexible formats. Output will be in the legacy format or in text
>> > > files of the user's specification, with item IDs reattached.
>> > > Support for legacy formats is an open question; users can always use
>> > > the legacy code if they want it. Internal to the IndexedDataset is a
>> > > Spark DRM, so pipelining can be accomplished without writing to an
>> > > actual file, and the legacy sequence file output may not be needed.
>> > > Opinions?
>> >
>> >
>> > --
>> > This message was sent by Atlassian JIRA
>> > (v6.2#6252)
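For the record, a hedged sketch of the slicing approach Pat describes above: load everything into one DRM over a unified row/column id space, then carve per-matrix slices out by column range with mapBlock. The sliceColumns helper and the nA/nB ranges are assumptions standing in for whatever the driver's ID dictionaries produce:

import scala.reflect.ClassTag
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings._
import RLikeOps._

// Keep every row key while slicing a column range out of the combined DRM,
// so rows that are all-zero within the range survive as logical 0.0 rows;
// these are the "empty rows where they are needed" mentioned above.
def sliceColumns[K: ClassTag](all: DrmLike[K], cols: Range): DrmLike[K] =
  all.mapBlock(ncol = cols.size) { case (keys, block) =>
    keys -> block(::, cols).cloned
  }

// e.g. if columns [0, nA) came from A's id space and [nA, nA + nB) from B's:
// val drmA = sliceColumns(drmAll, 0 until nA)
// val drmB = sliceColumns(drmAll, nA until nA + nB)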