And the feature request should be phrased in terms of code with desired behavior.
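
As a rough illustration of what "code with desired behavior" could look like, here is a toy model in plain Python of the two behaviors described in the thread quoted below: A %*% B keeps the row keys of A, and A' needs int row keys because they become column indexes. The names `times` and `transpose` and the list-of-pairs representation are assumptions made for this sketch; none of it is the Mahout or Spark bindings API.

```python
# Toy stand-in for a distributed row matrix: a list of (key, row) pairs.
# Hypothetical helpers for illustration only, not Mahout's API.

def times(a, b):
    """A %*% B: each product row keeps the row key of A."""
    # Rows of b are used in stored order as the inner dimension.
    b_rows = [row for _, row in b]
    n_cols = len(b_rows[0])
    out = []
    for key, row in a:
        prod = [sum(row[k] * b_rows[k][j] for k in range(len(b_rows)))
                for j in range(n_cols)]
        out.append((key, prod))
    return out


def transpose(a):
    """A': row keys become column indexes, so they must be ints."""
    if not all(isinstance(k, int) for k, _ in a):
        raise TypeError("transpose needs int row keys")
    n_rows = max(k for k, _ in a) + 1
    n_cols = len(a[0][1])
    cols = [[0.0] * n_rows for _ in range(n_cols)]
    for key, row in a:
        for j, v in enumerate(row):
            cols[j][key] = v
    return list(enumerate(cols))


# External ids used as row keys follow the data through the product.
a = [("pat", [1.0, 2.0]), ("dmitriy", [3.0, 4.0])]
identity = [(0, [1.0, 0.0]), (1, [0.0, 1.0])]
product = times(a, identity)

# Int keys are interpreted as positions, whatever the stored order.
m = [(1, [1.0, 2.0]), (0, [3.0, 4.0])]
mt = transpose(m)
```

A request written this way would state the expectation directly: the keys of `product` are those of `a`, and `transpose(a)` is rejected because "pat" cannot serve as a column index.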
On Thu, Apr 3, 2014 at 8:00 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Perhaps this is best phrased as a feature request.
>
> On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> PS.
>
> Sequence file keys also have a special meaning if they are Ints. E.g. the
> A' physical operator requires keys to be ints, in which case it interprets
> them as row indexes that become column indexes. This of course isn't always
> the case, e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
> reality the optimizer will never choose actual transposition as a physical
> step in such a pipeline. This interpretation is consistent with the
> interpretation of the long-existing Hadoop-side
> DistributedRowMatrix#transpose.
>
> On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> > On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >
> >>> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >>>
> >>> I think this duality, names and keys, is not very healthy really, and
> >>> just creates additional hassle. Spark drm takes care of keys
> >>> automatically throughout, but propagating names from named vectors is
> >>> solely the algorithm's concern as it stands.
> >>
> >> Not sure what you mean.
> >
> > Not what you think, it looks like.
> >
> > I mean that the Mahout DRM structure is a bag of (key -> Vector) pairs.
> > When persisted, the key goes to the key of a sequence file. In particular,
> > it means that there is a case of Bag[key -> NamedVector]. Which means an
> > external anchor could be saved to either the key or the name of a row. In
> > practice this causes a compatibility mess: we saw numerous cases where
> > e.g. seq2sparse saves external keys (file paths) into the key, whereas
> > e.g. clustering algorithms are not seeing them because they expect them
> > to be the name part of the vector. I am just saying we have two ways to
> > name the rows, and it is generally not a healthy choice for the
> > aforementioned reason.
> >
> >> In my experience Names and Properties are primarily used to store
> >> external keys, which are quite healthy.
> >>
> >> Users never have data with Mahout keys, they must constantly go back and
> >> forth. This is exactly what the R data frame does, no? I'm not so
> >> concerned with being able to address an element by the external key
> >> drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have the
> >> external ids follow the data through any calculation that makes sense.
> >
> > I am with you on this.
> >
> >> This would mean clustering, recommendations, transpose, RSJ would require
> >> no id transforming steps. This would make dealing with Mahout much
> >> easier.
> >
> > Data frames are a slightly different thing; right now we work just with
> > matrices. Although, yes, our in-core matrices support row and column names
> > (just like in R) and distributed matrices support row keys only. What I
> > mean is that an algebraic expression, e.g.
> >
> > Aexpr %*% Bexpr will automatically propagate _keys_ from Aexpr as implied
> > above, but not necessarily named vectors, because internally algorithms
> > blockify things into matrix blocks, and I am far from sure that Mahout
> > in-core stuff works correctly with named vectors as part of a matrix block
> > in all situations. I may be wrong. I always relied on sequence file keys
> > to identify data points.
> >
> > Note that sequence file keys are bigger than just a name: a key can be
> > anything Writable. I.e. you could save a data structure there, as long as
> > you have a Writable for it.
> >
> >>> On Apr 2, 2014 1:08 PM, "Pat Ferrel" <p...@occamsmachete.com> wrote:
> >>>
> >>>> Are the Spark efforts supporting all Mahout Vector types? Named,
> >>>> Property Vectors? It occurred to me that data frames in R are a related
> >>>> but more general solution. If all rows and columns of a DRM and their
> >>>> corresponding Vectors (row or column vectors) were to support arbitrary
> >>>> properties attached to them in such a way that they are preserved
> >>>> during transpose, Vector extraction, and any other operations that make
> >>>> sense, there would be a huge benefit for users.
> >>>>
> >>>> One of the constant problems with input to Mahout is translation of
> >>>> IDs. External to Mahout going in, Mahout to external coming out. Most
> >>>> of this would be unneeded if Mahout supported data frames, some would
> >>>> be avoided by supporting named or property vectors universally.
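
The key-versus-name mismatch described in the quoted thread (one tool writes the external id into the sequence file key, another looks for it in the vector name) can be shown with a small toy model. This is only a sketch in plain Python: the `Row` class and the two helper functions are hypothetical stand-ins, not Mahout classes or the actual behavior of seq2sparse or the clustering jobs.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Row:
    key: Any                    # plays the role of the sequence file key
    values: List[float]
    name: Optional[str] = None  # plays the role of a NamedVector name


def write_id_into_key(ext_id: str, values: List[float]) -> Row:
    """Producer that stores the external id in the key slot."""
    return Row(key=ext_id, values=values)


def read_id_from_name(row: Row) -> Optional[str]:
    """Consumer that expects the external id in the name slot."""
    return row.name


row = write_id_into_key("/docs/a.txt", [0.5, 0.0, 1.0])
lost = read_id_from_name(row)  # None: the id went into the other slot
```

With two slots that can each carry the id, producer and consumer have to agree on one of them out of band, which is the hassle the thread describes.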