Agree, rbind and cbind are the ways to tweak geometry.
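
For in-core matrices that amounts to something like the following sketch
(padRows is a hypothetical helper, not an existing API):

    import org.apache.mahout.math.{Matrix, SparseRowMatrix}
    import org.apache.mahout.math.scalabindings._
    import RLikeOps._

    // rbind(a, empty r x ncol): append r all-zero rows by copying a into
    // the top of a taller sparse matrix
    def padRows(a: Matrix, r: Int): Matrix = {
      val padded = new SparseRowMatrix(a.nrow + r, a.ncol)
      padded(0 until a.nrow, ::) := a
      padded
    }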

On Mon, Jul 21, 2014 at 12:24 PM, Anand Avati <av...@gluster.org> wrote:

> The summary of the discussion is:
>
> Pat encountered a scenario where matrix multiplication failed because A's
> rows did not match B's cols. His fix was to fudge A's nrow value to force
> the multiplication to happen. I think such a fixup of rows is better done
> through an rbind()-like operator, with an empty B matrix, instead of
> "editing" the nrow member; see the sketch below. However, the problem
> seems to be that A has fewer rows than desired because rows are missing
> (i.e., the int key sequence has holes). I think such an object is
> corrupted to begin with. And even if you were to fudge nrow, OpAewScalar
> gives math errors (as demonstrated in the code example), and AewB and
> CbindAB throw runtime exceptions in the cogroup() RDD API. I guess Pat
> still feels these errors/exceptions must be fixed, by filing a Jira.
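>
> For concreteness, the rbind()-like fixup I have in mind would pad the RDD
> with physically present zero rows, roughly like this sketch (drmWrap
> exists in the spark bindings; the other names, signatures and member
> accessibility here are assumptions):
>
>     import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
>     import org.apache.mahout.sparkbindings._
>
>     // add n physically present all-zero rows, keyed past the current
>     // max key, instead of only bumping the nrow metadata
>     def padWithEmptyRows(drm: CheckpointedDrmSpark[Int], n: Int) = {
>       val start = drm.nrow.toInt
>       val pad = drm.rdd.sparkContext.parallelize(
>         (start until start + n).map(i =>
>           i -> (new RandomAccessSparseVector(drm.ncol): Vector)))
>       drmWrap(drm.rdd union pad, start + n, drm.ncol)
>     }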
>
>
>
>
> On Mon, Jul 21, 2014 at 11:49 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
>> Sorry, I did not follow the whole discussion, but this change doesn't
>> make sense to me.
>>
>> It is not algebraic, it is not R, and it also creates an algebraically
>> incorrect object.
>>
>> On the topic of the "empty" rows: remember that they are not really
>> empty, they are rows with 0.0 elements, and the "emptiness" is just a
>> compaction scheme that also happens to have optimization meaning for
>> various algebraic operations.
>>
>> So an "empty" matrix is an absolutely valid matrix. It may cause various
>> mathematical exceptions since it is rank-deficient, but there are no
>> "mechanical" errors with that representation, so I am not sure what this
>> discussion was all about (but then again, I had no time to read it all).
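>>
>> For instance, in the in-core bindings (a minimal sketch):
>>
>>     import org.apache.mahout.math.SparseRowMatrix
>>     import org.apache.mahout.math.scalabindings._
>>     import RLikeOps._
>>
>>     // 3 x 2 sparse matrix; rows 1 and 2 have no stored cells. They are
>>     // "empty" only in the compaction sense: mathematically they are
>>     // rows of 0.0, and the matrix is perfectly valid.
>>     val a = new SparseRowMatrix(3, 2)
>>     a(0, 0) = 1.0
>>     a(0, 1) = 2.0
>>
>>     val b = dense((1.0, 0.0), (0.0, 1.0))
>>     val c = a %*% b  // mechanically fine
>>
>>     // rank deficiency is a different story: a.t %*% a is singular, so
>>     // something like solve(a.t %*% a) would fail mathematically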
>>
>>
>> On Mon, Jul 21, 2014 at 11:06 AM, ASF GitHub Bot (JIRA) <j...@apache.org>
>> wrote:
>>
>> >
>> >     [
>> >
>> https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068936#comment-14068936
>> > ]
>> >
>> > ASF GitHub Bot commented on MAHOUT-1541:
>> > ----------------------------------------
>> >
>> > Github user pferrel commented on a diff in the pull request:
>> >
>> >     https://github.com/apache/mahout/pull/31#discussion_r15184775
>> >
>> >     --- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala ---
>> >     @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
>> >        private var cached: Boolean = false
>> >        override val context: DistributedContext = rdd.context
>> >
>> >     +  /**
>> >     +   * Adds the equivalent of blank rows to the sparse CheckpointedDrm,
>> >     +   * which only changes the
>> >     +   * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]]
>> >     +   * value. No physical changes are made to the underlying rdd; no
>> >     +   * blank rows are added as would be done with rbind(blankRows).
>> >     +   * @param n number to increase the row cardinality by
>> >     +   * @note should be done before any BLAS optimizer actions are
>> >     +   *       performed on the matrix or you'll get unpredictable results.
>> >     +   */
>> >     +  override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
>> >     +    assert(n > -1)
>> >     +    new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
>> >     +  }
>> >     --- End diff --
>> >
>> >     I see no fundamental reason for these not to work, but it may not
>> > be part of the DRM contract, so maybe I'll file a feature-request Jira
>> > to support this.
>> >
>> >     In the meantime rbind will not solve this, because A will have
>> > missing rows only at the end while B may have them throughout (let
>> > alone some future C). So I think reading all the data into one drm with
>> > a single row and column id space, then chopping it into two or more
>> > drms based on column ranges, should give us empty rows where they are
>> > needed (I certainly hope so, or I'm in trouble). We will have to keep
>> > track of which column ids go in which slice, but that's doable.
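>> >
>> >     Roughly what I have in mind (sliceColumns is a hypothetical helper
>> > built on mapBlock; drmAll and splitCol are placeholders):
>> >
>> >         import org.apache.mahout.math.drm._
>> >         import org.apache.mahout.math.drm.RLikeDrmOps._
>> >         import org.apache.mahout.math.scalabindings._
>> >         import RLikeOps._
>> >
>> >         // carve a column range out of one DRM that spans the shared
>> >         // row/column id space; every slice keeps identical row keys,
>> >         // so "empty" rows line up across A, B and any future C
>> >         def sliceColumns(all: DrmLike[Int], start: Int, end: Int): DrmLike[Int] =
>> >           all.mapBlock(ncol = end - start) { case (keys, block) =>
>> >             keys -> block(::, start until end).cloned
>> >           }
>> >
>> >         val drmA = sliceColumns(drmAll, 0, splitCol)
>> >         val drmB = sliceColumns(drmAll, splitCol, drmAll.ncol)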
>> >
>> >
>> > > Create CLI Driver for Spark Cooccurrence Analysis
>> > > -------------------------------------------------
>> > >
>> > >                 Key: MAHOUT-1541
>> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>> > >             Project: Mahout
>> > >          Issue Type: New Feature
>> > >          Components: CLI
>> > >            Reporter: Pat Ferrel
>> > >            Assignee: Pat Ferrel
>> > >
>> > > Create a CLI driver to import data in a flexible manner, create an
>> > > IndexedDataset with BiMap ID translation dictionaries, call the Spark
>> > > CooccurrenceAnalysis with the appropriate params, then write output
>> > > with external IDs optionally reattached.
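>> > > The ID translation piece could look roughly like this (a sketch only;
>> > > the names are placeholders, with Guava's HashBiMap standing in for
>> > > the BiMap dictionaries):
>> > >
>> > >     import com.google.common.collect.HashBiMap
>> > >
>> > >     // external string ID -> contiguous int key; bidirectional so the
>> > >     // writer can reattach the original IDs on output
>> > >     val itemDict = HashBiMap.create[String, Integer]()
>> > >
>> > >     def itemIndex(id: String): Int = {
>> > >       if (!itemDict.containsKey(id)) itemDict.put(id, itemDict.size)
>> > >       itemDict.get(id)
>> > >     }
>> > >
>> > >     // build the int-keyed DRM from translated IDs, run the
>> > >     // cooccurrence analysis, then use itemDict.inverse.get(key)
>> > >     // when writing rows back out with external IDs
>> > >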
>> > > Ultimately it should be able to read input as the legacy mr code
>> > > does, but it will also support reading externally defined IDs and
>> > > flexible formats. Output will be in the legacy format or in text
>> > > files of the user's specification with the Item IDs reattached.
>> > > Support for legacy formats is an open question; users can always use
>> > > the legacy code if they want them. Internal to the IndexedDataset is
>> > > a Spark DRM, so pipelining can be accomplished without writing to an
>> > > actual file, and the legacy sequence-file output may not be needed.
>> > > Opinions?
>> >
>> >
>> >
>> > --
>> > This message was sent by Atlassian JIRA
>> > (v6.2#6252)
>> >
>>
>
>
