[ https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Palumbo updated MAHOUT-1507:
-----------------------------------
    Labels: DSL scala spark  (was: spark)

> Support input and output using user defined ID wherever possible
> ----------------------------------------------------------------
>
>                 Key: MAHOUT-1507
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1507
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.9
>         Environment: Spark Scala, Mahout v2
>            Reporter: Pat Ferrel
>              Labels: DSL, scala, spark
>             Fix For: 1.0
>
>
> All users of Mahout have data which is addressed by keys or IDs of their own
> devising. In order to use much of Mahout they must translate these IDs into
> Mahout IDs, then run their jobs and translate back again when retrieving the
> output. If the ID space is very large this is a difficult problem for users
> to solve at scale.
>
> For many Mahout operations this would not be necessary if these external keys
> could be maintained for vectors and dimensions, or for rows and columns of a
> DRM.
>
> The reason I bring this up now is that much groundwork is being laid for
> Mahout's future on Spark, so getting this notion in early could be
> fundamentally important and used to build on.
>
> If external IDs for rows and columns were maintained then RSJ, DRM Transpose
> (and other DRM ops), vector extraction, clustering, and recommenders would
> need no ID translation steps, a big user win.
>
> A partial solution might be to support external row IDs alone, somewhat like
> the NamedVector and PropertyVector in the Mahout Hadoop code.
>
> On Apr 3, 2014, at 11:00 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Perhaps this is best phrased as a feature request.
>
> On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> PS.
> Sequence file keys also have a special meaning if they are Ints. E.g. the A'
> physical operator requires keys to be ints, in which case it interprets
> them as row indexes that become column indexes. This of course isn't always
> the case, e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
> reality the optimizer will never choose actual transposition as a physical
> step in such a pipeline. This interpretation is consistent with the
> interpretation of the long-existing Hadoop-side
> DistributedRowMatrix#transpose.
>
> On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> I think this duality, names and keys, is not very healthy really, and just
> creates additional hassle. Spark DRM takes care of keys automatically
> throughout, but propagating names from named vectors is solely the
> algorithm's concern as it stands.
>
> Not sure what you mean.
>
> Not what you think, it looks like.
> I mean that the Mahout DRM structure is a bag of (key -> Vector) pairs. When
> persisted, the key goes to the key of a sequence file. In particular, it
> means that there is a case of Bag[key -> NamedVector], which means an
> external anchor could be saved to either the key or the name of a row. In
> practice it causes a compatibility mess, e.g. we saw those numerous cases
> where seq2sparse saves external keys (file paths) into the key, whereas
> clustering algorithms do not see them because they expect them to be the
> name part of the vector. I am just saying we have two ways to name the rows,
> and it is generally not a healthy choice for the aforementioned reason.
>
> In my experience Names and Properties are primarily used to store
> external keys, which are quite healthy.
> Users never have data with Mahout keys, they must constantly go back and
> forth. This is exactly what the R data frame does, no? I'm not so concerned
> with being able to address an element by the external key
> drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have the
> external ids follow the data through any calculation that makes sense.
>
> I am with you on this.
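
Editorial illustration, not part of the original thread: the translate-in, run, translate-back round trip that the issue describes can be sketched in plain Python. This is only a sketch under stated assumptions. Every name here (`build_dictionary`, `row_sums_by_index`, `row_sums_keyed`) is hypothetical and none of it is Mahout API; the "job" is a trivial row sum standing in for any operation whose output rows correspond one-to-one to input rows.

```python
# Hypothetical sketch: none of these names are Mahout APIs.

def build_dictionary(external_ids):
    """Assign a dense int index to each distinct external ID, in first-seen order."""
    to_index = {}
    for ext in external_ids:
        if ext not in to_index:
            to_index[ext] = len(to_index)
    to_external = {idx: ext for ext, idx in to_index.items()}
    return to_index, to_external


def row_sums_by_index(rows):
    """Stand-in for a job that only understands int row indexes."""
    return {idx: sum(vec) for idx, vec in rows.items()}


def row_sums_keyed(rows):
    """The same computation when the row key is carried through unchanged."""
    return {key: sum(vec) for key, vec in rows.items()}


# User data addressed by the user's own keys.
user_rows = {"pat": [1.0, 0.0, 2.0], "dmitriy": [0.0, 3.0, 1.0]}

# Round trip: translate in, run the job, translate back out.
to_index, to_external = build_dictionary(user_rows)
indexed = {to_index[key]: vec for key, vec in user_rows.items()}
translated_back = {
    to_external[idx]: total for idx, total in row_sums_by_index(indexed).items()
}

# With keys preserved there is no dictionary and no extra pass over the data.
direct = row_sums_keyed(user_rows)

assert translated_back == direct
```

The dictionary is the part that becomes hard at scale: it has to be built, stored, and joined against both the input and the output, which is the cost the issue proposes to avoid.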
> This would mean clustering, recommendations, transpose, RSJ would require
> no id transforming steps. This would make dealing with Mahout much easier.
>
> Data frames are a slightly different thing; right now we work just with
> matrices. Although, yes, our in-core matrices support row and column names
> (just like in R) and distributed matrices support row keys only. What I
> mean is that an algebraic expression, e.g.
> Aexpr %*% Bexpr
> will automatically propagate _keys_ from Aexpr as implied above, but not
> necessarily named vectors, because internally algorithms blockify things
> into matrix blocks, and I am far from sure that Mahout in-core stuff works
> correctly with named vectors as part of a matrix block in all situations.
> I may be wrong. I always relied on sequence file keys to identify data
> points.
>
> Note that sequence file keys are bigger than just a name: a key can be
> anything Writable. I.e. you could save a data structure there, as long as
> you have a Writable for it.
>
> On Apr 2, 2014 1:08 PM, "Pat Ferrel" <p...@occamsmachete.com> wrote:
> Are the Spark efforts supporting all Mahout Vector types? Named, Property
> Vectors? It occurred to me that data frames in R are a related but more
> general solution. If all rows and columns of a DRM and their corresponding
> Vectors (row or column vectors) were to support arbitrary properties
> attached to them in such a way that they are preserved during transpose,
> Vector extraction, and any other operations that make sense, there would be
> a huge benefit for users.
>
> One of the constant problems with input to Mahout is translation of IDs:
> external to Mahout going in, Mahout to external coming out. Most of this
> would be unneeded if Mahout supported data frames; some would be avoided by
> supporting named or property vectors universally.
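
Editorial illustration, not part of the original thread: two points made above are that a product such as `Aexpr %*% Bexpr` carries the row keys of `Aexpr` through, while a physical transpose needs int row keys because they become column indexes. The plain-Python sketch below mimics both behaviours on a bag of (key -> row vector) pairs. It is an assumption-laden toy: `times` and `transpose` are hypothetical helpers, the right-hand operand is taken to be a small in-memory matrix, and nothing here reflects how Mahout's optimizer or physical operators are actually implemented.

```python
# Hypothetical sketch: a "DRM" here is just a dict of key -> row vector.

def times(a_rows, b):
    """Multiply keyed rows A by an in-memory matrix B (a list of rows).

    Each output row is computed from exactly one input row, so the
    input row's key is reused as the output row's key.
    """
    n_cols = len(b[0])
    return {
        key: [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(n_cols)]
        for key, row in a_rows.items()
    }


def transpose(a_rows, n_cols):
    """Physical transpose: row keys become column indexes, so they must be ints."""
    if not all(isinstance(key, int) for key in a_rows):
        raise TypeError("transpose needs int row keys")
    n_rows = max(a_rows, default=-1) + 1
    out = {j: [0.0] * n_rows for j in range(n_cols)}
    for i, row in a_rows.items():
        for j, value in enumerate(row):
            out[j][i] = value
    return out


a = {"pat": [1.0, 2.0], "dmitriy": [0.0, 1.0]}
b = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

product = times(a, b)              # keys "pat" and "dmitriy" survive
int_keyed = {0: [1.0, 2.0], 1: [3.0, 4.0]}
flipped = transpose(int_keyed, 2)  # works: keys are row indexes
```

Calling `transpose(a, 2)` in this sketch raises `TypeError`, because a string key such as `"pat"` has no meaning as a column index.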