Re: Data frames

Ted Dunning Thu, 03 Apr 2014 20:51:06 -0700

And the feature request should be phrased in terms of code with desired
behavior.





On Thu, Apr 3, 2014 at 8:00 PM, Pat Ferrel <[email protected]> wrote:

> Perhaps this is best phrased as a feature request.
>
> On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> PS.
>
> Sequence file keys also have special meaning if they are Ints. E.g. the A'
> physical operator requires keys to be ints, in which case it interprets
> them as row indexes that become column indexes. This of course isn't always
> the case, e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
> reality the optimizer will never choose actual transposition as a physical
> step in such a pipeline. This interpretation is consistent with that of the
> long-existing Hadoop-side DistributedRowMatrix#transpose.
>
>
> On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> >
> >
> >
> > On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <[email protected]> wrote:
> >
> >>
> >>> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>
> >>> I think this duality, names and keys, is not very healthy really, and
> >>> just creates additional hassle. Spark drm takes care of keys
> >>> automatically throughout, but propagating names from named vectors is
> >>> solely an algorithm concern as it stands.
> >>
> >> Not sure what you mean.
> >
> > Not what you think, it looks like.
> >
> > I mean that the Mahout DRM structure is a bag of (key -> Vector) pairs.
> > When persisted, the key goes to the key of a sequence file. In
> > particular, it means that there is a case of Bag[key -> NamedVector],
> > which means an external anchor could be saved to either the key or the
> > name of a row. In practice it causes a compatibility mess: we saw
> > numerous cases where e.g. seq2sparse saves external keys (file paths)
> > into the key, whereas e.g. clustering algorithms do not see them because
> > they expect them to be the name part of the vector. I am just saying we
> > have two ways to name the rows, and it is generally not a healthy choice
> > for the aforementioned reason.
> >
> >
> >> In my experience Names and Properties are primarily used to store
> >> external keys, which are quite healthy.
> >>
> >> Users never have data with Mahout keys, they must constantly go back and
> >> forth. This is exactly what the R data frame does, no? I'm not so
> >> concerned with being able to address an element by the external key
> >> drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have
> >> the external ids follow the data through any calculation that makes
> >> sense.
> >>
> >
> > I am with you on this.
> >
> >
> >> This would mean clustering, recommendations, transpose, RSJ would
> >> require no id transforming steps. This would make dealing with Mahout
> >> much easier.
> >>
> >
> > Data frames are a slightly different thing; right now we work just with
> > matrices. Although, yes, our in-core matrices support row and column
> > names (just like in R) and distributed matrices support row keys only.
> > What I mean is that an algebraic expression, e.g.
> >
> > Aexpr %*% Bexpr
> >
> > will automatically propagate _keys_ from Aexpr as implied above, but not
> > necessarily named vectors, because internally algorithms blockify things
> > into matrix blocks, and I am far from sure that Mahout in-core stuff
> > works correctly with named vectors as part of a matrix block in all
> > situations. I may be wrong. I always relied on sequence file keys to
> > identify data points.
> >
> > Note that sequence file keys are bigger than just a name: a key can be
> > anything Writable. I.e. you could save a data structure there, as long
> > as you have a Writable for it.
> >
> >
> >>> On Apr 2, 2014 1:08 PM, "Pat Ferrel" <[email protected]> wrote:
> >>>
> >>>> Are the Spark efforts supporting all Mahout Vector types? Named,
> >>>> Property Vectors? It occurred to me that data frames in R are a
> >>>> related but more general solution. If all rows and columns of a DRM
> >>>> and their corresponding Vectors (row or column vectors) were to
> >>>> support arbitrary properties attached to them in such a way that they
> >>>> are preserved during transpose, Vector extraction, and any other
> >>>> operations that make sense, there would be a huge benefit for users.
> >>>>
> >>>> One of the constant problems with input to Mahout is translation of
> >>>> IDs: external to Mahout going in, Mahout to external coming out. Most
> >>>> of this would be unneeded if Mahout supported data frames; some would
> >>>> be avoided by supporting named or property vectors universally.
> >>>>
> >>>>
> >>>
> >>
> >
> >
>
>
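
To make the key/name duality discussed in the quoted thread concrete, here
is a minimal plain-Python sketch. It is not Mahout code: Vec, times,
transpose and the two row_ids_* helpers are hypothetical stand-ins, and the
DRM is modeled simply as a list of (key, vector) pairs. It only illustrates
the three behaviors described above: an id stored in the key is invisible to
a consumer that reads the vector name, a product carries keys (not names)
over from its left operand, and a physical transpose needs int keys.

```python
from dataclasses import dataclass
from typing import Hashable, List, Optional, Tuple


@dataclass
class Vec:
    """Hypothetical stand-in for a vector; `name` mimics a NamedVector name."""
    values: List[float]
    name: Optional[str] = None


# Toy model of a DRM: a bag of (key -> vector) pairs.
Drm = List[Tuple[Hashable, Vec]]


def row_ids_from_keys(drm: Drm) -> list:
    """What a consumer sees if it treats the sequence-file key as the row id."""
    return [key for key, _ in drm]


def row_ids_from_names(drm: Drm) -> list:
    """What a consumer sees if it expects the id in the vector name."""
    return [vec.name for _, vec in drm]


def times(a: Drm, b: List[List[float]]) -> Drm:
    """Right-multiply every row of `a` by the in-core matrix `b`.

    Keys are carried over from `a`; names are dropped, mirroring the
    statement that only keys are propagated automatically.
    """
    n_cols = len(b[0])
    out = []
    for key, vec in a:
        row = [sum(x * b[i][j] for i, x in enumerate(vec.values))
               for j in range(n_cols)]
        out.append((key, Vec(row)))
    return out


def transpose(a: Drm) -> Drm:
    """Physical transpose: int row keys become column indexes."""
    if not all(isinstance(key, int) for key, _ in a):
        raise TypeError("transpose requires int row keys")
    n_rows = max(key for key, _ in a) + 1
    n_cols = len(a[0][1].values)
    cols = [[0.0] * n_rows for _ in range(n_cols)]
    for key, vec in a:
        for j, x in enumerate(vec.values):
            cols[j][key] = x
    return [(j, Vec(col)) for j, col in enumerate(cols)]


if __name__ == "__main__":
    # Ids stored in the key only, as in the seq2sparse case described above.
    docs = [("doc1.txt", Vec([1.0, 2.0])), ("doc2.txt", Vec([3.0, 4.0]))]
    print(row_ids_from_keys(docs))
    print(row_ids_from_names(docs))
```

In this toy model, a name-based consumer gets no ids at all from data whose
ids were written to the key, which is the compatibility problem the thread
describes.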
