Sorry, I did not follow all of the discussion, but this change doesn't make sense to me.
It is not algebraic, it is not R-like, and it also creates an algebraically incorrect object.

On the topic of the "empty" rows: remember they are not really empty; they are rows of 0.0 elements, and "emptiness" is just a compaction scheme that also happens to have some optimization meaning for various algebraic operations. So an "empty" matrix is really an absolutely valid matrix. It may cause various mathematical exceptions since it is rank-deficient, but there are no "mechanical" errors with that representation, so I am not sure what this discussion was all about (but then again, I had no time to read it all).

On Mon, Jul 21, 2014 at 11:06 AM, ASF GitHub Bot (JIRA) <j...@apache.org> wrote:

>
>     [ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068936#comment-14068936 ]
>
> ASF GitHub Bot commented on MAHOUT-1541:
> ----------------------------------------
>
> Github user pferrel commented on a diff in the pull request:
>
>     https://github.com/apache/mahout/pull/31#discussion_r15184775
>
>     --- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala ---
>     @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
>          private var cached: Boolean = false
>          override val context: DistributedContext = rdd.context
>
>     +    /**
>     +     * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
>     +     * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]] value.
>     +     * No physical changes are made to the underlying rdd; no blank rows are added as would be done with rbind(blankRows).
>     +     * @param n number to increase row cardinality by
>     +     * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
>     +     *       results.
>     +     */
>     +    override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
>     +      assert(n > -1)
>     +      new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
>     +    }
>     --- End diff --
>
>     I see no fundamental reason for these not to work, but it may not be part of the DRM contract. So maybe I'll make a feature request Jira to support this.
>
>     In the meantime rbind will not solve this because A will have missing rows at the end but B may have them throughout--let alone some future C. So I think reading in all data into one drm with one row and column id space, then chopping into two or more drms based on column ranges, should give us empty rows where they are needed (I certainly hope so, or I'm in trouble). Will have to keep track of which column ids go in which slice, but that's doable.
>
>
> > Create CLI Driver for Spark Cooccurrence Analysis
> > -------------------------------------------------
> >
> >                 Key: MAHOUT-1541
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> >             Project: Mahout
> >          Issue Type: New Feature
> >          Components: CLI
> >            Reporter: Pat Ferrel
> >            Assignee: Pat Ferrel
> >
> > Create a CLI driver to import data in a flexible manner, create an IndexedDataset with BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate params, then write output with external IDs optionally reattached.
> > Ultimately it should be able to read input as the legacy mr does but will support reading externally defined IDs and flexible formats. Output will be of the legacy format or text files of the user's specification with reattached Item IDs.
> > Support for legacy formats is a question; users can always use the legacy code if they want this. Internal to the IndexedDataset is a Spark DRM, so pipelining can be accomplished without any writing to an actual file, and the legacy sequence file output may not be needed.
> > Opinions?
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
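
To make the "empty rows are just rows of 0.0" point concrete, here is a toy sketch in plain Scala. It is my own illustration for this thread, not the Mahout DSL and not the actual CheckpointedDrmSpark code; the names SparseRowMatrix and addRows are made up for the example.

// Toy sparse row-major matrix: "empty" rows are simply not stored, but they
// are still perfectly valid rows of 0.0 elements. Growing nrow is therefore
// pure metadata, just like rbind()-ing blank rows onto a sparse layout.
case class SparseRowMatrix(nrow: Int, ncol: Int, rows: Map[Int, Map[Int, Double]]) {

  // Read any cell; rows and cells that are not stored are 0.0 by definition.
  def apply(i: Int, j: Int): Double = {
    require(i < nrow && j < ncol, "index out of bounds")
    rows.getOrElse(i, Map.empty[Int, Double]).getOrElse(j, 0.0)
  }

  // Analogue of the proposed addToRowCardinality(n): only the nrow metadata
  // changes; nothing physical happens to the stored rows.
  def addRows(n: Int): SparseRowMatrix = {
    require(n >= 0)
    copy(nrow = nrow + n)
  }

  // Matrix-vector product; the non-materialized rows contribute plain zeros.
  def times(x: Vector[Double]): Vector[Double] =
    Vector.tabulate(nrow) { i =>
      rows.getOrElse(i, Map.empty[Int, Double]).map { case (j, v) => v * x(j) }.sum
    }
}

object Demo extends App {
  val a = SparseRowMatrix(nrow = 2, ncol = 3, rows = Map(0 -> Map(0 -> 1.0, 2 -> 2.0)))
  val grown = a.addRows(2)                    // now 4 x 3, same stored data
  println(grown(3, 1))                        // 0.0 -- a "blank" row is a real row of zeros
  println(grown.times(Vector(1.0, 1.0, 1.0))) // Vector(3.0, 0.0, 0.0, 0.0)
}

The "blank" rows take part in the algebra like any other rows; the only way they can bite is mathematically (rank deficiency), not mechanically.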