Re: Generating a Document Similarity Matrix

Jake Mannix Wed, 09 Jun 2010 11:15:29 -0700

Does the ItemSimilarityJob actually use implementations of the Vector
class hierarchy?  I think that's the issue - if the on-disk and in-mapper
representations are never Vectors, then they won't interoperate with
any of the matrix operations...

And yeah, keying on ints is necessary for now, unless we want to
make a new matrix type (at least for distributed matrices) which
keys on longs (which actually might be a good idea: now that
we're using VInt and VLong, the disk space and network usage
should not be adversely affected - just the in-memory
representation).
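
To make the disk-space point concrete, here is a minimal sketch of a
variable-length integer encoding. It is a plain base-128 varint written
for illustration only: the class and method names are hypothetical, and
Hadoop's own VInt/VLong wire format differs in detail. The point it shows
is that the encoded size tracks the magnitude of the value rather than
the declared width of the type, so a small key costs the same number of
bytes whether it is declared as an int or a long.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative base-128 varint; NOT Hadoop's exact VInt/VLong format. */
final class VarLongSketch {

    /** Encodes a non-negative long, 7 payload bits per byte, low bits first. */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch handles non-negative keys only");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // While more than 7 bits remain, emit 7 bits and set the continuation flag.
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value); // final byte, continuation flag clear
        return out.toByteArray();
    }

    /** Decodes bytes produced by encode(). */
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] keys = {5L, 300L, 1_000_000L, 5_000_000_000L};
        for (long k : keys) {
            System.out.println(k + " -> " + encode(k).length + " byte(s)");
        }
    }
}
```

With this scheme the keys 5, 300, 1,000,000 and 5,000,000,000 take 1, 2,
3 and 5 bytes respectively, so only keys that genuinely exceed the int
range pay for the extra width on disk or on the wire.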

In fact, the more I play with this, the more I see that the
distributed matrices really are different beasts than their
in-memory baby cousins (some operations just don't make
sense, and others are way inefficient, and yet others have
sneaky tricks which need to be represented differently).

If DistributedRowMatrix (and relatives) is really going to
be generalizable and useful, we're going to need to allow
the types to be configurable - key on ints or longs, have
values be vectors keyed on ints or longs, and even have
entries be either float / double / boolean.
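
One way to picture that kind of configurability is with type parameters
for the row key, the column key and the entry type. The sketch below is
purely hypothetical: none of these classes exist in Mahout, and it uses
boxed types and in-memory maps for brevity, which a real distributed
implementation would not do. It only shows the shape of the idea.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a matrix with configurable key and entry types. */
final class ConfigurableMatrixSketch {

    /** Sparse row: column key type C and entry type E are both configurable. */
    static final class SparseRow<C extends Number & Comparable<C>, E> {
        private final Map<C, E> entries = new TreeMap<>();

        void set(C column, E value) { entries.put(column, value); }

        E get(C column) { return entries.get(column); }
    }

    /** Matrix stored as a map from row key R to a sparse row. */
    static final class RowMatrix<R extends Number & Comparable<R>,
                                 C extends Number & Comparable<C>, E> {
        private final Map<R, SparseRow<C, E>> rows = new TreeMap<>();

        void set(R row, C column, E value) {
            rows.computeIfAbsent(row, r -> new SparseRow<>()).set(column, value);
        }

        /** Returns the stored entry, or null if the cell was never set. */
        E get(R row, C column) {
            SparseRow<C, E> r = rows.get(row);
            return r == null ? null : r.get(column);
        }

        int numRows() { return rows.size(); }
    }

    public static void main(String[] args) {
        // Rows keyed on longs, columns on ints, double entries.
        RowMatrix<Long, Integer, Double> weights = new RowMatrix<>();
        weights.set(5_000_000_000L, 3, 0.75);

        // Rows and columns keyed on ints, boolean entries.
        RowMatrix<Integer, Integer, Boolean> flags = new RowMatrix<>();
        flags.set(1, 2, true);

        System.out.println(weights.get(5_000_000_000L, 3));
        System.out.println(flags.get(1, 2));
    }
}
```

Generics over boxed Long / Integer / Double cost memory and speed compared
with primitive arrays, so in practice this would more likely mean a small
set of primitive-specialized implementations behind a common interface.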

  -jake

On Wed, Jun 9, 2010 at 10:58 AM, Sean Owen <[email protected]> wrote:

> Well I'm not sure they're unique, they're just vectors. Would that not
> be the best neutral representation for things like this?
>
> What was the comment about keying by ints vs longs earlier? If
> unifying that helps bring things closer together I can look at it, if
> I can understand the issue.
>
> On Wed, Jun 9, 2010 at 6:56 PM, Sebastian Schelter
> <[email protected]> wrote:
> > The ItemSimilarityJob cannot be used directly, as it's not working on a
> > DistributedRowMatrix but on data structures unique to collaborative
> > filtering, so if you ask me I'd say that a separate job would be
> > required.
> >
>
