IDs for Vectors and Matrices

Pat Ferrel (JIRA) Sun, 06 Apr 2014 08:32:25 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961447#comment-13961447
 ]

Pat Ferrel commented on MAHOUT-1507:
------------------------------------

Well, we know that PropertyVectors (or some new thing like ForeignKeyVectors) 
are tenable for certain cases. If they became part of the contract with Mahout 
users as far as it makes sense (for example, clustering), and there were some 
way to hand a collection of external ids to a Mahout job and get back a 
dictionary, this would make the user's job much easier and would guarantee 
scalability. Too bad users can't ignore internal ids, but that is the least of 
the issues.

To make this even easier for users, handing a DRM of PropertyVectors to any 
Matrix operation could automatically create the associated dictionary 
(optionally disabled).
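
As a rough illustration of the "hand over external ids, get back a dictionary" 
idea, here is a minimal in-memory sketch in Python. The function name 
build_dictionary and the sample ids are made up for illustration; nothing here 
is an existing Mahout API, and a scalable version would need a distributed 
implementation.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def build_dictionary(
    external_ids: Iterable[Hashable],
) -> Tuple[Dict[Hashable, int], List[Hashable]]:
    """Assign a dense 0-based internal index to each distinct external id.

    Hypothetical helper: returns (external -> internal, internal -> external).
    Indexes are handed out in first-seen order.
    """
    to_internal: Dict[Hashable, int] = {}
    to_external: List[Hashable] = []
    for ext_id in external_ids:
        if ext_id not in to_internal:
            to_internal[ext_id] = len(to_external)
            to_external.append(ext_id)
    return to_internal, to_external


# Duplicate ids map to the same internal index.
to_internal, to_external = build_dictionary(["pat", "dmitriy", "pat"])
```

The reverse list is what lets output be translated back to the user's own ids 
after the job has run on internal indexes.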

Or, when handed a DRM of some new form (call it a Foreign Key DRM for now) with 
IDs for rows and columns, the operation would create two dictionaries. These 
could then be used for output of factorization, multiplication, transpose, etc.

Haven't really thought this through, but maybe there could be an output phase 
where you hand a Job both the output DRM and the dictionaries to create a 
Foreign Key DRM as output, since the user will have to do this if Mahout 
doesn't.

In effect this is what the Solr Recommender pipeline does: foreign row and 
column keys are in the input and output because Solr needs to index them. The 
input has foreign keys for user id, item id, and action id in triples 
representing preferences. These are put into a DRM with two dictionaries. Then 
I run the recommender (really only need RSJ) and the cross-recommender, running 
the output (preference DRMs and similarity matrix DRMs) through the output 
translation phase. The output is a delimited text file but could just as easily 
be one of these Foreign Key DRMs.
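
The input side of that pipeline can be sketched roughly as follows. This is an 
in-memory Python illustration only, not the actual Solr Recommender code: the 
function triples_to_drm, the dict-of-dicts matrix layout, and the sample 
triples are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (user id, item id, action id), all external


def triples_to_drm(triples: Iterable[Triple], action: str):
    """Build an int-keyed sparse matrix plus row and column dictionaries.

    Hypothetical helper: keeps only triples for one action, maps external
    user ids to row indexes and external item ids to column indexes.
    """
    row_dict: Dict[str, int] = {}
    col_dict: Dict[str, int] = {}
    rows = defaultdict(dict)
    for user, item, act in triples:
        if act != action:
            continue
        # setdefault returns the existing index or assigns the next free one.
        r = row_dict.setdefault(user, len(row_dict))
        c = col_dict.setdefault(item, len(col_dict))
        rows[r][c] = 1.0
    return dict(rows), row_dict, col_dict


# Made-up sample data.
triples = [
    ("pat", "iPad", "purchase"),
    ("pat", "nexus", "view"),
    ("dmitriy", "iPad", "purchase"),
    ("pat", "galaxy", "purchase"),
]
rows, row_dict, col_dict = triples_to_drm(triples, "purchase")
```

One matrix per action would be built this way, which is where the two 
dictionaries (rows and columns) come from.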

Not sure if anyone reading this has done dictionary creation and use in Hadoop 
MR. The Pig implementation was ridiculously slow as I recall. Maybe the impl 
can choose between in-memory and some scalable Spark algo based on the size of 
the input.

The output to foreign keys actually was a show stopper in one contract I did, 
since the input was too large for the in-memory solution and there was great 
resistance to putting Pig into production. The problem killed the project 
because they were unwilling to invest the time to create a scalable Java MR 
implementation.

All this still raises the question: how are the dictionaries used? Either they 
are put into a DB (really just a distributed hashmap), into memory, or there is 
a job to create a Foreign Key DRM from a DRM and two dictionaries. Ideally the 
latter, since this would be easily digestible by the user. If the Foreign Key 
DRM has a text output option, it would allow integration with virtually any 
other tools or languages.
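
The last option (a job that creates a Foreign Key DRM from a DRM and two 
dictionaries, with text output) might look something like this in miniature. 
Again a hedged Python sketch: to_foreign_key_rows, to_text, the tab-delimited 
"key<TAB>col:value ..." layout, and the sample similarity values are 
assumptions for illustration, not an existing format or API.

```python
from typing import Dict, Iterator, Tuple

SparseRows = Dict[int, Dict[int, float]]


def to_foreign_key_rows(
    rows: SparseRows, row_dict: Dict[str, int], col_dict: Dict[str, int]
) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Replace internal row/column indexes with the users' external keys."""
    row_names = {idx: key for key, idx in row_dict.items()}
    col_names = {idx: key for key, idx in col_dict.items()}
    for r in sorted(rows):
        yield row_names[r], {col_names[c]: v for c, v in sorted(rows[r].items())}


def to_text(fk_rows, sep: str = "\t") -> str:
    """Render each row as: external row key, separator, 'col:value' pairs."""
    lines = []
    for row_key, vec in fk_rows:
        elems = " ".join(f"{col}:{val}" for col, val in vec.items())
        lines.append(f"{row_key}{sep}{elems}")
    return "\n".join(lines)


# Made-up item-item similarity output keyed by internal ids.
item_dict = {"iPad": 0, "galaxy": 1}
similarity = {0: {1: 0.8}, 1: {0: 0.8}}
text = to_text(to_foreign_key_rows(similarity, item_dict, item_dict))
```

For a square item-item matrix the same dictionary serves for rows and columns; 
a user-item matrix would pass two different ones.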

I know this sounds like a pretty thick layer to add to Mahout, but I believe it, 
or at least part of it, is being created over and over by users each time Mahout 
is chosen for a project. It is the first barrier to use that Mahout faces.

> Support External/Foreign Keys/IDs for Vectors and Matrices
> ----------------------------------------------------------
>
>                 Key: MAHOUT-1507
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1507
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.9
>         Environment: Spark Scala
>            Reporter: Pat Ferrel
>              Labels: spark
>             Fix For: 1.0
>
>
> All users of Mahout have data which is addressed by keys or IDs of their own 
> devising. In order to use much of Mahout they must translate these IDs into 
> Mahout IDs, then run their jobs and translate back again when retrieving the 
> output. If the ID space is very large this is a difficult problem for users 
> to solve at scale.
> For many Mahout operations this would not be necessary if these external keys 
> could be maintained for vectors and dimensions, or for rows and columns of a 
> DRM.
> The reason I bring this up now is that much groundwork is being laid for 
> Mahout's future on Spark so getting this notion in early could be 
> fundamentally important and used to build on.
> If external IDs for rows and columns were maintained then RSJ, DRM Transpose 
> (and other DRM ops), vector extraction, clustering, and recommenders would 
> need no ID translation steps, a big user win.
> A partial solution might be to support external row IDs alone somewhat like 
> the NamedVector and PropertyVector in the Mahout hadoop code.
> On Apr 3, 2014, at 11:00 AM, Pat Ferrel <[email protected]> wrote:
> Perhaps this is best phrased as a feature request.
> On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov <[email protected]> wrote:
> PS.
> sequence file keys also have special meaning if they are Ints. E.g. A'
> physical operator requires keys to be ints, in which case it interprets
> them as row indexes that become column indexes. This of course isn't always
> the case, e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
> reality optimizer will never choose actual transposition as a physical step
> in such pipeline. This interpretation is consistent with interpretation of
> long-existing Hadoop-side DistributedRowMatrix#transpose.
> On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov <[email protected]> wrote:
> On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <[email protected]> wrote:
> On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <[email protected]> wrote:
> I think this duality, names and keys, is not very healthy really, and just
> creates additional hassle. Spark drm takes care of keys automatically
> throughout, but propagating names from named vectors is solely the
> algorithm's concern as it stands.
> Not sure what you mean.
> Not what you think, it looks like.
> I mean that Mahout DRM structure is a bag of (key -> Vector) pairs. When
> persisted, key goes to the key of a sequence file. In particular, it means
> that there is a case of Bag[ key -> NamedVector]. Which means, external
> anchor could be saved to either key or name of a row. In practice it causes
> compatibility mess, e.g. we saw those numerous cases where e.g. seq2sparse
> saves external keys (file paths) into the key, whereas e.g. clustering
> algorithms are not seeing them because they expect them to be the name part
> of the vector. I am just saying we have two ways to name the rows, and it
> is generally not a healthy choice for the aforementioned reason.
> In my experience Names and Properties are primarily used to store
> external keys, which are quite healthy.
> Users never have data with Mahout keys, they must constantly go back and
> forth. This is exactly what the R data frame does, no? I'm not so concerned
> with being able to address an element by the external key
> drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have the
> external ids follow the data through any calculation that makes sense.
> I am with you on this.
> This would mean clustering, recommendations, transpose, RSJ would require
> no id transforming steps. This would make dealing with Mahout much easier.
> Data frames are a slightly different thing; right now we work just with
> matrices. Although, yes, our in-core matrices support row and column names
> (just like in R) and distributed matrices support row keys only. What I
> mean is that an algebraic expression, e.g.
> Aexpr %*% Bexpr will automatically propagate _keys_ from Aexpr as implied
> above, but not necessarily named vectors, because internally algorithms
> blockify things into matrix blocks, and i am far from sure that Mahout
> in-core stuff works correctly with named vectors as part of a matrix block
> in all situations. I may be wrong. I always relied on sequence file keys to
> identify data points.
> Note that sequence file keys are bigger than just a name, it is anything
> Writable. I.e. you could save a data structure there, as long as you have a
> Writable for it.
> On Apr 2, 2014 1:08 PM, "Pat Ferrel" <[email protected]> wrote:
> Are the Spark efforts supporting all Mahout Vector types? Named, Property
> Vectors? It occurred to me that data frames in R is a related but more
> general solution. If all rows and columns of a DRM and their corresponding
> Vectors (row or column vectors) were to support arbitrary properties
> attached to them in such a way that they are preserved during transpose,
> Vector extraction, and any other operations that make sense, there would be
> a huge benefit for users.
> One of the constant problems with input to Mahout is translation of IDs.
> External to Mahout going in, Mahout to external coming out. Most of this
> would be unneeded if Mahout supported data frames, some would be avoided by
> supporting named or property vectors universally.
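
The point in the quoted thread about transposition needing Int keys (row keys 
become column indexes) can be illustrated with a toy Python sketch. The 
dict-of-dicts "bag of (key -> vector) pairs" and the transpose function are 
stand-ins for illustration only, not Mahout's DRM or its physical operators.

```python
from typing import Dict, Hashable

Bag = Dict[Hashable, Dict[int, float]]  # row key -> sparse vector


def transpose(rows: Bag) -> Dict[int, Dict[int, float]]:
    """Transpose a bag of (key -> sparse vector) pairs.

    Each row key becomes a column index in the result, so it must be an int;
    an arbitrary external key has no position in a vector.
    """
    out: Dict[int, Dict[int, float]] = {}
    for row_key, vec in rows.items():
        if not isinstance(row_key, int):
            raise TypeError("transpose needs int row keys; got %r" % (row_key,))
        for col, val in vec.items():
            out.setdefault(col, {})[row_key] = val
    return out


a = {0: {0: 1.0, 2: 3.0}, 1: {2: 4.0}}
a_t = transpose(a)
```

With external keys on the rows, the same call raises, which is the gap a 
dictionary (or a Foreign Key DRM) would have to fill.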

--
This message was sent by Atlassian JIRA
(v6.2#6252)
