For the record, parallelizeEmpty does not create a partition-less rdd -- it does create empty rows. The reason is that partitions are not just data; they are task embodiment as well. So it is a way, e.g., to generate a random matrix in a distributed way.
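A minimal sketch of that usage against the Spark bindings DSL (not from the thread; the imports and the implicit distributed context are assumed to be set up as usual for the era's package layout, m and n are placeholders, and the element-by-element fill inside mapBlock is just one way to do it):

    import scala.util.Random
    import org.apache.mahout.math.scalabindings._
    import RLikeOps._
    import org.apache.mahout.sparkbindings._

    val (m, n) = (1000, 100)

    // drmParallelizeEmpty gives an m x n DRM of all-zero rows, already split
    // into partitions, so each partition is a unit of work for the fill below.
    val drmZeros = drmParallelizeEmpty(m, n, numPartitions = 4)

    // Fill every block with uniform random values in parallel; geometry is unchanged.
    val drmRandom = drmZeros.mapBlock(ncol = n) { case (keys, block) =>
      val rnd = new Random()
      for (r <- 0 until block.nrow; c <- 0 until block.ncol)
        block(r, c) = rnd.nextDouble()
      keys -> block
    }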
I am also not 100% positive that the lack of rows will not present a problem. I know that empty partitions present a problem -- and if any techniques imply that row-less partitions result, this may be a problem (in some situations they are explicitly filtered out post-op).

On Mon, Jul 21, 2014 at 2:08 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> I think we are straight on this finally.
>
> DRMs and rdds don’t need to embody every row, at least when using sequential Int keys; they are not corrupt if some rows are missing.
>
> Therefore rbind of drmParallelizeEmpty will work since it will only create a CheckpointedDrm where nrow is modified. It will not modify the rdd.
>
> If we had to modify the rdd, rbind would not work since the missing keys are interspersed throughout the matrices, not all at the end. So the hypothetically created rdd elements would have had the wrong Int keys. But no need to worry about this now--no need to modify the rdd.
>
>
> On Jul 21, 2014, at 1:05 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> agree, rbind and cbind are the ways to tweak geometry
>
>
> On Mon, Jul 21, 2014 at 12:24 PM, Anand Avati <av...@gluster.org> wrote:
>
> > The summary of the discussion is:
> >
> > Pat encountered a scenario where matrix multiplication was erroring because of mismatching A's rows and B's cols. His solution was to fixup/fudge A's nrow value to force the multiplication to happen. I think such a fixup of rows is better done through an rbind()-like operator (with an empty B matrix) instead of "editing" the nrow member. However, the problem seems to be that A's rows are fewer than desired because they have missing rows (i.e., the int key sequence has holes). I think such an object is corrupted to begin with. And even if you were to fudge nrow, OpAewScalar gives math errors (as demonstrated in the code example), and AewB, CbindAB are giving runtime exceptions on the cogroup() RDD api. I guess Pat still feels these errors/exceptions must be fixed by filing a Jira.
> >
> > On Mon, Jul 21, 2014 at 11:49 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> >> Sorry I did not bear with all the discussion, but this change doesn't make sense to me.
> >>
> >> It is not algebraic, it is not R, and it also creates an algebraically incorrect object.
> >>
> >> On the topic of the "empty" rows, remember they are not really empty; they are matrices with 0.0 elements, and "emptiness" is just a compaction scheme that also happens to have some optimization meaning to various algebraic operations.
> >>
> >> So an "empty" matrix is really an absolutely valid matrix. It may cause various mathematical exceptions since it is rank-deficient, but there are no "mechanical" errors with that representation, so I am not sure what this discussion was all about (but then again, I had no time to read it all).
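The rbind()-with-an-empty-matrix approach discussed above might look roughly like the following against the Spark bindings DSL. This is only an illustration: it assumes the rbind operator on Int-keyed DRMs that the thread refers to, and drmA, drmB are placeholder names for matrices that should end up with matching row counts before A' %*% B.

    // Pad A's row cardinality with an all-zero DRM instead of editing nrow.
    // Assumes drmA and drmB are Int-keyed DRMs and B already has the desired row count.
    val padRows = (drmB.nrow - drmA.nrow).toInt
    val drmAPadded =
      if (padRows > 0) drmA rbind drmParallelizeEmpty(padRows, drmA.ncol)
      else drmA

    // A' %*% B now sees matching row counts on both operands.
    val drmAtB = drmAPadded.t %*% drmB

Note that this only tacks zero rows onto the end of A; it does not address interior holes in the Int key sequence, which is the separate issue discussed below.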
> >> On Mon, Jul 21, 2014 at 11:06 AM, ASF GitHub Bot (JIRA) <j...@apache.org> wrote:
> >>
> >>> [ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068936#comment-14068936 ]
> >>>
> >>> ASF GitHub Bot commented on MAHOUT-1541:
> >>> ----------------------------------------
> >>>
> >>> Github user pferrel commented on a diff in the pull request:
> >>>
> >>>     https://github.com/apache/mahout/pull/31#discussion_r15184775
> >>>
> >>>     --- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala ---
> >>>     @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
> >>>        private var cached: Boolean = false
> >>>        override val context: DistributedContext = rdd.context
> >>>
> >>>     +  /**
> >>>     +   * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
> >>>     +   * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]] value.
> >>>     +   * No physical changes are made to the underlying rdd; no blank rows are added as would be done with rbind(blankRows).
> >>>     +   * @param n number to increase row cardinality by
> >>>     +   * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
> >>>     +   *       results.
> >>>     +   */
> >>>     +  override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
> >>>     +    assert(n > -1)
> >>>     +    new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
> >>>     +  }
> >>>     --- End diff --
> >>>
> >>>     I see no fundamental reason for these not to work, but it may not be part of the DRM contract. So maybe I'll make a feature request Jira to support this.
> >>>
> >>>     In the meantime rbind will not solve this because A will have missing rows at the end but B may have them throughout--let alone some future C. So I think reading in all data into one drm with one row and column id space, then chopping into two or more drms based on column ranges, should give us empty rows where they are needed (I certainly hope so or I'm in trouble). Will have to keep track of which column ids go in which slice but that's doable.
> >>>
> >>>
> >>>> Create CLI Driver for Spark Cooccurrence Analysis
> >>>> -------------------------------------------------
> >>>>
> >>>>                 Key: MAHOUT-1541
> >>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> >>>>             Project: Mahout
> >>>>          Issue Type: New Feature
> >>>>          Components: CLI
> >>>>            Reporter: Pat Ferrel
> >>>>            Assignee: Pat Ferrel
> >>>>
> >>>> Create a CLI driver to import data in a flexible manner, create an IndexedDataset with BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate params, then write output with external IDs optionally reattached.
> >>>> Ultimately it should be able to read input as the legacy mr does but will support reading externally defined IDs and flexible formats. Output will be of the legacy format or text files of the user's specification with reattached Item IDs.
> >>>> Support for legacy formats is a question; users can always use the legacy code if they want this. Internal to the IndexedDataset is a Spark DRM so pipelining can be accomplished without any writing to an actual file, so the legacy sequence file output may not be needed.
> >>>> Opinions?
> >>>
> >>>
> >>> --
> >>> This message was sent by Atlassian JIRA
> >>> (v6.2#6252)
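To make the "read everything into one row and column id space, then chop into DRMs by column range" plan from Pat's GitHub comment above concrete, here is a hedged sketch. The tiny matrix, splitCol, and all names are invented for illustration, and mapBlock plus an in-core column-range view is just one possible way to do the chop; keeping the same Int row keys in every slice is what preserves row alignment, including the "empty" rows.

    import org.apache.mahout.math.scalabindings._
    import RLikeOps._
    import org.apache.mahout.sparkbindings._

    // A tiny stand-in for "all data in one row and column id space":
    // 4 rows, 5 columns; columns 0..2 belong to A, columns 3..4 to B.
    val inCoreAll = dense(
      (1, 0, 0, 1, 0),
      (0, 1, 0, 0, 1),
      (0, 0, 1, 1, 0),
      (1, 1, 0, 0, 0))
    val drmAll = drmParallelize(inCoreAll, numPartitions = 2)

    val splitCol = 3
    val nAll = inCoreAll.ncol

    // Chop into two DRMs by column range; every slice keeps all the original
    // row keys, so rows that end up all-zero stay "empty" where they are needed.
    val drmA = drmAll.mapBlock(ncol = splitCol) { case (keys, block) =>
      keys -> block(::, 0 until splitCol).cloned   // clone so we don't ship a view
    }
    val drmB = drmAll.mapBlock(ncol = nAll - splitCol) { case (keys, block) =>
      keys -> block(::, splitCol until nAll).cloned
    }

The bookkeeping Pat mentions (which external column ids go in which slice) would live alongside splitCol, e.g. in the IndexedDataset's BiMap dictionaries.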