[jira] [Updated] (MAHOUT-1507) Support input and output using user defined ID wherever possible

Andrew Palumbo (JIRA) Thu, 05 Mar 2015 18:13:01 -0800

     [ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andrew Palumbo updated MAHOUT-1507:
-----------------------------------
    Labels: DSL scala spark  (was: spark)

> Support input and output using user defined ID wherever possible
> ----------------------------------------------------------------
>
>                 Key: MAHOUT-1507
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1507
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.9
>         Environment: Spark Scala, Mahout v2
>            Reporter: Pat Ferrel
>              Labels: DSL, scala, spark
>             Fix For: 1.0
>
>
> All users of Mahout have data which is addressed by keys or IDs of their own
> devising. In order to use much of Mahout they must translate these IDs into
> Mahout IDs, then run their jobs and translate back again when retrieving the 
> output. If the ID space is very large this is a difficult problem for users 
> to solve at scale.
> For many Mahout operations this would not be necessary if these external keys 
> could be maintained for vectors and dimensions, or for rows and columns of a 
> DRM.
> The reason I bring this up now is that much groundwork is being laid for 
> Mahout's future on Spark so getting this notion in early could be 
> fundamentally important and used to build on.
> If external IDs for rows and columns were maintained then RSJ, DRM Transpose 
> (and other DRM ops), vector extraction, clustering, and recommenders would 
> need no ID translation steps, a big user win.
> A partial solution might be to support external row IDs alone somewhat like 
> the NamedVector and PropertyVector in the Mahout hadoop code.
> On Apr 3, 2014, at 11:00 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Perhaps this is best phrased as a feature request.
> On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> PS.
> sequence file keys also have special meaning if they are Ints. E.g. the A'
> physical operator requires keys to be ints, in which case it interprets
> them as row indexes that become column indexes. This of course isn't always
> the case, e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
> reality optimizer will never choose actual transposition as a physical step
> in such pipeline. This interpretation is consistent with interpretation of
> long-existing Hadoop-side DistributedRowMatrix#transpose.
> On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> I think this duality, names and keys, is not very healthy really, and just
> creates additional hassle. Spark drm takes care of keys automatically
> throughout, but propagating names from named vectors is solely an algorithm
> concern as it stands.
> Not sure what you mean.
> Not what you think, it looks like.
> I mean that Mahout DRM structure is a bag of (key -> Vector) pairs. When
> persisted, key goes to the key of a sequence file. In particular, it means
> that there is a case of Bag[ key -> NamedVector]. Which means, external
> anchor could be saved to either key or name of a row. In practice it causes
> compatibility mess, e.g. we saw those numerous cases where e.g. seq2sparse
> saves external keys (file paths) into the key, whereas e.g. clustering
> algorithms are not seeing them because they expect them to be the name part
> of the vector. I am just saying we have two ways to name the rows, and it
> is generally not a healthy choice for the aforementioned reason.
> In my experience Names and Properties are primarily used to store
> external keys, which are quite healthy.
> Users never have data with Mahout keys, they must constantly go back and
> forth. This is exactly what the R data frame does, no? I'm not so concerned
> with being able to address an element by the external key
> drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have the
> external ids follow the data through any calculation that makes sense.
> I am with you on this.
> This would mean clustering, recommendations, transpose, RSJ would require
> no id transforming steps. This would make dealing with Mahout much easier.
> Data frames is a little bit a different thing, right now we work just with
> matrices. Although, yes, our in-core matrices support row and column names
> (just like in R) and distributed matrices support row keys only. What I
> mean is that algebraic expression e.g.
> Aexpr %*% Bexpr will automatically propagate _keys_ from Aexpr as implied
> above, but not necessarily named vectors, because internally algorithms
> blockify things into matrix blocks, and i am far from sure that Mahout
> in-core stuff works correctly with named vectors as part of a matrix block
> in all situations. I may be wrong. I always relied on sequence file keys to
> identify data points.
> Note that sequence file keys are bigger than just a name, it is anything
> Writable. I.e. you could save a data structure there, as long as you have a
> Writable for it.
> On Apr 2, 2014 1:08 PM, "Pat Ferrel" <p...@occamsmachete.com> wrote:
> Are the Spark efforts supporting all Mahout Vector types? Named, Property
> Vectors? It occurred to me that data frames in R are a related but more
> general solution. If all rows and columns of a DRM and their corresponding
> Vectors (row or column vectors) were to support arbitrary properties
> attached to them in such a way that they are preserved during transpose,
> Vector extraction, and any other operations that make sense, there would be
> a huge benefit for users.
> One of the constant problems with input to Mahout is translation of IDs.
> External to Mahout going in, Mahout to external coming out. Most of this
> would be unneeded if Mahout supported data frames, some would be avoided by
> supporting named or property vectors universally.
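
The ID translation round trip described in the issue can be sketched roughly as follows. This is a plain Python illustration under stated assumptions, not Mahout code: the `IdDictionary` class and the `encode`, `transpose`, and `decode` helpers are hypothetical names introduced here, and the "matrix" is just a list of (row, column, value) triples. It shows the bookkeeping a user has to carry today: build dictionaries going in, run the job on integer indices, and remember that after a transpose the row and column dictionaries swap roles when translating back.

```python
from typing import Dict, Hashable, List, Tuple

Triple = Tuple[Hashable, Hashable, float]


class IdDictionary:
    """Hypothetical two-way map between external IDs and dense int indices."""

    def __init__(self) -> None:
        self._to_index: Dict[Hashable, int] = {}
        self._to_external: List[Hashable] = []

    def index_of(self, external_id: Hashable) -> int:
        # Assign the next free index the first time an external ID is seen.
        if external_id not in self._to_index:
            self._to_index[external_id] = len(self._to_external)
            self._to_external.append(external_id)
        return self._to_index[external_id]

    def external_of(self, index: int) -> Hashable:
        return self._to_external[index]


def encode(triples: List[Triple], rows: IdDictionary, cols: IdDictionary):
    """External IDs -> integer indices (the step done before running a job)."""
    return [(rows.index_of(r), cols.index_of(c), v) for r, c, v in triples]


def transpose(triples):
    """Stand-in for a job: row indices become column indices."""
    return [(c, r, v) for r, c, v in triples]


def decode(triples, rows: IdDictionary, cols: IdDictionary):
    """Integer indices -> external IDs (the step done on the output)."""
    return [(rows.external_of(r), cols.external_of(c), v) for r, c, v in triples]


users, items = IdDictionary(), IdDictionary()
interactions = [("pat", "iPad", 1.0), ("pat", "nexus", 1.0), ("dmitriy", "iPad", 1.0)]

encoded = encode(interactions, users, items)
# After the transpose, items index the rows and users index the columns,
# so the dictionaries must be passed in swapped order.
decoded = decode(transpose(encoded), rows=items, cols=users)
print(decoded)
```

If external row and column IDs were carried with the data, as the issue proposes, the `encode` and `decode` steps and the dictionary swap would not be the user's responsibility.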



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
