because the cardinality is not known until all matrices are read in. For 
cooccurrence the rows are users, and the row space must be the same for all 
matrices if cross-cooccurrence A’B is to work. A’s row cardinality must match 
B’s even though some rows in either may be missing.

I’d like to just read them in and tell each its row cardinality after the fact.

Another way to do this is to read them all at once into a giant unified matrix 
and try to figure out how to slice it up into A, B, … with column ranges.
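
Roughly, the first approach would look like this (a sketch only -- rowCardinality 
is the prototype method quoted below, not part of the DRM API yet, and drmA, drmB, 
and numUsers are placeholders):

  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._

  // drmA, drmB: Int-keyed drms read in separately, possibly with missing rows.
  // numUsers: the full user cardinality, known only after both reads.
  // rowCardinality only resets nrow; the underlying rdds are untouched.
  val drmAFixed = drmA.rowCardinality(numUsers)
  val drmBFixed = drmB.rowCardinality(numUsers)

  // cross-cooccurrence A'B is now conformable on the row (user) dimension
  val drmAtB = drmAFixed.t %*% drmBFixed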


On Jul 21, 2014, at 2:33 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

so let me see if i am following.

you create an int-keyed matrix with gaps in the int sequence, and you feel
somehow this is a problem and are trying to insert the missing rows? is that it?

if yes, why do you believe you need to insert missing rows ?


On Mon, Jul 21, 2014 at 2:26 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> rbind appends another matrix, which seems wrong in this case anyway since
> the missing rows can be anywhere in the Int key sequence.
> 
> However, the existing CheckpointedDrm already has an rdd and non-empty
> partitions, so it should work for %*% and .t or there’s a bug, right?
> 
> This is why I wanted to create an op or method that only changes nrow. The
> current prototype/hack implementation in CheckpointedDrmSpark does this:
> 
>  override def rowCardinality(n: Int): CheckpointedDrm[K] = {
>    assert(n > -1)
>    new CheckpointedDrmSpark[K](rdd, n, ncol, _cacheStorageLevel )
>  }
> 
> Maybe it should be called something else or be an op and maybe it should
> only work for Int keys. Given all that doing the above should work or
> there’s a bug in the math, right?
> 
> 
> On Jul 21, 2014, at 2:12 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> 
> for the record, parallelizeEmpty does not create a partition-less rdd -- it
> does create empty rows. The reason is that partitions are not just data,
> they are task embodiment as well. So it is a way, e.g., to generate a
> random matrix in a distributed way.
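> 
> For example, something along these lines (a sketch only, assuming the usual
> drm DSL imports, an implicit DistributedContext, and the drmParallelizeEmpty
> and mapBlock signatures in the bindings):
> 
>   import org.apache.mahout.math.drm._
>   import org.apache.mahout.math.drm.RLikeDrmOps._
> 
>   // 1000 x 100 of logical zeros spread over 10 real partitions, each of
>   // which carries a task we can use
>   val drmRand = drmParallelizeEmpty(1000, 100, 10).mapBlock() {
>     case (keys, block) =>
>       // fill each block with random values, one task per partition
>       for (r <- 0 until block.rowSize; c <- 0 until block.columnSize)
>         block.setQuick(r, c, scala.util.Random.nextDouble())
>       keys -> block
>   }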
> 
> I am also not 100% positive lack of rows will not present a problem.
> 
> I know that empty partitions present a problem -- and if any techniques
> result in row-less partitions, this may be a problem (in some situations
> they are explicitly filtered out post-op).
> 
> 
> 
> On Mon, Jul 21, 2014 at 2:08 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> 
>> I think we are straight on this finally.
>> 
>> DRMs and rdds don’t need to embody every row, at least when using
>> sequential Int keys; they are not corrupt if some rows are missing.
>> 
>> Therefore rbind of drmParallelizeEmpty will work since it will only create
>> a CheckpointedDrm where nrow is modified. It will not modify the rdd.
>> 
>> If we had to modify the rdd, rbind would not work since the missing keys
>> are interspersed throughout the matrices, not all at the end. So the
>> hypothetically created rdd elements would have had the wrong Int keys.
>> But no need to worry about this now--no need to modify the rdd.
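>> 
>> Roughly this, I mean (a sketch, assuming an rbind operator on drms as
>> discussed, with targetNrow standing in for the full user cardinality):
>> 
>>   // pad A's nrow by rbind-ing an "empty" drm that makes up the difference;
>>   // the data rows already in drmA's rdd are untouched
>>   val padRows = (targetNrow - drmA.nrow).toInt
>>   val drmAPadded = drmA rbind drmParallelizeEmpty(padRows, drmA.ncol)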
>> 
>> 
>> On Jul 21, 2014, at 1:05 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>> 
>> agree, rbind and cbind are the ways to tweak geometry
>> 
>> 
>> On Mon, Jul 21, 2014 at 12:24 PM, Anand Avati <av...@gluster.org> wrote:
>> 
>>> The summary of the discussion is:
>>> 
>>> Pat encountered a scenario where matrix multiplication was erroring
>>> because of mismatching A's rows and B's cols. His solution was to
>>> fixup/fudge A's nrow value to force the multiplication to happen. I think
>>> such a fixup of rows is better done through an rbind() like operator (with
>>> an empty B matrix) instead of "editing" the nrow member. However the
>>> problem seems to be that A's rows are fewer than desired because they have
>>> missing rows (i.e., the int keys sequence has holes). I think such an
>>> object is corrupted to begin with. And even if you were to fudge nrow,
>>> OpAewScalar gives math errors (as demonstrated in the code example), and
>>> AewB, CbindAB are giving runtime exceptions on the cogroup() RDD api. I
>>> guess Pat still feels these errors/exceptions must be fixed by filing a
>>> Jira.
>>> 
>>> 
>>> 
>>> 
>>> On Mon, Jul 21, 2014 at 11:49 AM, Dmitriy Lyubimov <dlie...@gmail.com>
>>> wrote:
>>> 
>>>> Sorry i did not bear with all the discussion, but this change doesn't
>>>> make sense to me.
>>>> 
>>>> It is not algebraic, it is not R, and it also creates an algebraically
>>>> incorrect object.
>>>> 
>>>> On the topic of the "empty" rows, remember they are not really empty,
>>>> they are rows of 0.0 elements, and "emptiness" is just a compaction
>>>> scheme that also happens to have some optimization meaning to various
>>>> algebraic operations.
>>>> 
>>>> So an "empty" matrix is really an absolutely valid matrix. It may cause
>>>> various mathematical exceptions since it is rank-deficient, but there
>>>> are no "mechanical" errors with that representation, so i am not sure
>>>> what this discussion was all about (but then again, i had no time to
>>>> read it all).
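>>>> 
>>>> (An in-core analogy of the same point -- a sketch assuming the usual
>>>> scalabindings imports: a row that is never assigned in a sparse matrix
>>>> is still, algebraically, a row of 0.0.)
>>>> 
>>>>   import org.apache.mahout.math.SparseRowMatrix
>>>>   import org.apache.mahout.math.scalabindings._
>>>>   import RLikeOps._
>>>> 
>>>>   val m = new SparseRowMatrix(3, 2)   // 3 x 2; no stored elements yet
>>>>   m(0, ::) := dvec(1, 2)
>>>>   m(2, ::) := dvec(3, 4)
>>>>   // row 1 was never touched; it is "empty" only as a compaction, so
>>>>   // (m %*% dvec(1, 1))(1) evaluates to 0.0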
>>>> 
>>>> 
>>>> On Mon, Jul 21, 2014 at 11:06 AM, ASF GitHub Bot (JIRA) <j...@apache.org> wrote:
>>>> 
>>>>> 
>>>>>  [ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068936#comment-14068936 ]
>>>>> 
>>>>> ASF GitHub Bot commented on MAHOUT-1541:
>>>>> ----------------------------------------
>>>>> 
>>>>> Github user pferrel commented on a diff in the pull request:
>>>>> 
>>>>>  https://github.com/apache/mahout/pull/31#discussion_r15184775
>>>>> 
>>>>>  --- Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala ---
>>>>>  @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
>>>>>     private var cached: Boolean = false
>>>>>     override val context: DistributedContext = rdd.context
>>>>> 
>>>>>  +  /**
>>>>>  +   * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
>>>>>  +   * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]] value.
>>>>>  +   * No physical changes are made to the underlying rdd; the effect is as if blank rows were added with rbind(blankRows).
>>>>>  +   * @param n number to increase row cardinality by
>>>>>  +   * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
>>>>>  +   *       results.
>>>>>  +   */
>>>>>  +  override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
>>>>>  +    assert(n > -1)
>>>>>  +    new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
>>>>>  +  }
>>>>>  --- End diff --
>>>>> 
>>>>>  I see no fundamental reason for these not to work but it may not be
>>>>> part of the DRM contract. So maybe I'll make a feature request Jira to
>>>>> support this.
>>>>> 
>>>>>  In the meantime rbind will not solve this because A will have missing
>>>>> rows at the end but B may have them throughout--let alone some future C.
>>>>> So I think reading all data into one drm with one row and column id space,
>>>>> then chopping it into two or more drms based on column ranges, should give
>>>>> us empty rows where they are needed (I certainly hope so or I'm in trouble).
>>>>> Will have to keep track of which column ids go in which slice but that's
>>>>> doable.
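>>>>> 
>>>>>  Something like this (a sketch; drmAll is the unified Int-keyed drm, nA is
>>>>> the number of columns belonging to A, and I'm assuming the drm row/column
>>>>> range slicing operator):
>>>>> 
>>>>>   // slice the single unified drm into A, B, ... by column ranges; users
>>>>>   // missing from either source simply stay as empty rows in every slice
>>>>>   val drmA = drmAll(::, 0 until nA)
>>>>>   val drmB = drmAll(::, nA until drmAll.ncol)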
>>>>> 
>>>>> 
>>>>>> Create CLI Driver for Spark Cooccurrence Analysis
>>>>>> -------------------------------------------------
>>>>>> 
>>>>>>              Key: MAHOUT-1541
>>>>>>              URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>>>>>>          Project: Mahout
>>>>>>       Issue Type: New Feature
>>>>>>       Components: CLI
>>>>>>         Reporter: Pat Ferrel
>>>>>>         Assignee: Pat Ferrel
>>>>>> 
>>>>>> Create a CLI driver to import data in a flexible manner, create an
>>>>>> IndexedDataset with BiMap ID translation dictionaries, call the Spark
>>>>>> CooccurrenceAnalysis with the appropriate params, then write output with
>>>>>> external IDs optionally reattached.
>>>>>> Ultimately it should be able to read input as the legacy mr does but
>>>>>> will support reading externally defined IDs and flexible formats. Output
>>>>>> will be of the legacy format or text files of the user's specification
>>>>>> with reattached Item IDs.
>>>>>> Support for legacy formats is a question; users can always use the
>>>>>> legacy code if they want this. Internal to the IndexedDataset is a Spark
>>>>>> DRM, so pipelining can be accomplished without writing to an actual file,
>>>>>> and the legacy sequence file output may not be needed.
>>>>>> Opinions?
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> This message was sent by Atlassian JIRA
>>>>> (v6.2#6252)
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
