Does the ItemSimilarityJob actually use implementations of the Vector class hierarchy? I think that's the issue: if the on-disk and in-mapper representations are never Vectors, then they won't interoperate with any of the matrix operations.

And yeah, keying on ints is necessary for now, unless we want to make a new matrix type (at least for distributed matrices) which keys on longs. That actually might be a good idea: now that we're using VInt and VLong, the disk space and network usage should not be adversely affected, just the in-memory representation.

In fact, the more I play with this, the more I see that the distributed matrices really are different beasts than their in-memory baby cousins: some operations just don't make sense, others are way inefficient, and yet others have sneaky tricks which need to be represented differently.

If DistributedRowMatrix (and relatives) is really going to be generalizable and useful, we're going to need to allow the types to be configurable: key on ints or longs, have values be vectors keyed on ints or longs, and even have entries be float, double, or boolean.

  -jake

On Wed, Jun 9, 2010 at 10:58 AM, Sean Owen <[email protected]> wrote:

> Well I'm not sure they're unique, they're just vectors. Would that not
> be the best neutral representation for things like this?
>
> What was the comment about keying by ints vs longs earlier? If
> unifying that helps bring things closer together I can look at it, if
> I can understand the issue.
>
> On Wed, Jun 9, 2010 at 6:56 PM, Sebastian Schelter
> <[email protected]> wrote:
> > The ItemSimilarityJob cannot be directly used as it's not working on a
> > DistributedRowMatrix but on data structures unique to collaborative
> > filtering, so if you ask me I'd say that a separate job would be
> > required.
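
P.S. On the VInt/VLong point: here is a minimal sketch of why keying on longs need not cost extra bytes on disk or over the wire. It assumes a generic base-128 variable-length encoding (7 payload bits per byte), which is not Hadoop's exact VInt/VLong wire format; the class name VarLongSketch and its methods are hypothetical and exist only for illustration. The idea is simply that the encoded size tracks the magnitude of the value, not the declared width of the type.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; not Hadoop's VLong format.
public class VarLongSketch {

    // Encode a long using 7 payload bits per byte; the high bit of each
    // byte signals that more bytes follow.
    public static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Decode a byte array produced by encode().
    public static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        // The first three keys fit in an int; the last one needs a long.
        long[] keys = {5L, 300L, 70_000L, 5_000_000_000L};
        for (long k : keys) {
            System.out.println(k + " -> " + encode(k).length + " byte(s)");
        }
    }
}
```

Under this scheme a long-typed key whose value fits in an int serializes to the same number of bytes as it would as an int, so the cost of wider keys shows up only in the in-memory representation.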
