I suppose I'm referring to the collection of algorithms as implemented
in Mahout. The online / non-distributed algorithms, as a group, share
assumptions and requirements that don't fit something like
Cassandra.

You're describing a perfectly valid variant which is partly offline.
You can support this with a JDBCItemSimilarity or
GenericItemSimilarity after pre-computing item-item similarity
somehow, offline. (This is what Sebastian has been doing -- that part
is distributable and is already implemented in Mahout as a Hadoop
job.)
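
To make the offline step concrete, here is a minimal in-memory sketch of
pre-computing item-item similarity from user histories. It is not the
Mahout Hadoop job mentioned above; the class name OfflineItemSimilarity,
the "a:b" pair-key format, and the choice of Jaccard similarity are all
assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical offline step: same idea as a batch job, but in memory.
final class OfflineItemSimilarity {

    /** Key for an unordered item pair, smaller ID first (illustrative format). */
    static String pairKey(long a, long b) {
        return Math.min(a, b) + ":" + Math.max(a, b);
    }

    /** Jaccard similarity for every item pair that co-occurs in some user's history. */
    static Map<String, Double> compute(Map<Long, Set<Long>> itemsByUser) {
        // Invert the input: item -> set of users who have that item.
        Map<Long, Set<Long>> usersByItem = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> e : itemsByUser.entrySet()) {
            for (Long item : e.getValue()) {
                usersByItem.computeIfAbsent(item, k -> new HashSet<>()).add(e.getKey());
            }
        }
        // Count, per item pair, how many users have both items.
        Map<String, Integer> both = new HashMap<>();
        for (Set<Long> items : itemsByUser.values()) {
            for (Long a : items) {
                for (Long b : items) {
                    if (a < b) {
                        both.merge(pairKey(a, b), 1, Integer::sum);
                    }
                }
            }
        }
        // Jaccard = |users with both| / |users with either|.
        Map<String, Double> similarity = new HashMap<>();
        for (Map.Entry<String, Integer> e : both.entrySet()) {
            String[] ids = e.getKey().split(":");
            int countA = usersByItem.get(Long.parseLong(ids[0])).size();
            int countB = usersByItem.get(Long.parseLong(ids[1])).size();
            int inter = e.getValue();
            similarity.put(e.getKey(), inter / (double) (countA + countB - inter));
        }
        return similarity;
    }
}
```

The resulting map is what would be written to the data store once per
batch run; the online recommender then only reads it.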

This works fine then with slow-ish data stores like an RDBMS or
something like Cassandra. But this trick won't work for all
algorithms, in general.

So in that sense, yes, you can easily shim in Cassandra here by
implementing an ItemSimilarity. OK yes, that'd be a particularly
non-trivial and interesting integration, and it would be lovely if
someone wrote it, supported it, and perhaps wanted to take it to its
logical conclusion.
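
A rough sketch of what such a shim could look like, assuming the
similarities have already been pre-computed and stored in both
directions. SimilarityStore and its multiGet method are hypothetical
stand-ins for a Cassandra (or RDBMS) client, not a real driver API, and
StoreBackedSimilarity only loosely mirrors the shape of an
ItemSimilarity; it does not implement the Mahout interface.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a data store client; not a real driver API. */
interface SimilarityStore {
    /** One round trip: for each requested item, its pre-computed neighbors and similarities. */
    Map<Long, Map<Long, Double>> multiGet(Collection<Long> itemIDs);
}

final class StoreBackedSimilarity {
    private final SimilarityStore store;

    StoreBackedSimilarity(SimilarityStore store) {
        this.store = store;
    }

    /** Pairwise lookup; NaN (a choice for this sketch) means the pair was not pre-computed. */
    double itemSimilarity(long a, long b) {
        Map<Long, Double> neighbors = store.multiGet(Collections.singletonList(a)).get(a);
        Double s = (neighbors == null) ? null : neighbors.get(b);
        return (s == null) ? Double.NaN : s;
    }

    /** Ranks unseen items by summed similarity to the history, using a single multi-fetch. */
    List<Long> recommend(Collection<Long> history, int howMany) {
        Map<Long, Double> scores = new HashMap<>();
        for (Map<Long, Double> neighbors : store.multiGet(history).values()) {
            for (Map.Entry<Long, Double> e : neighbors.entrySet()) {
                // Skip items the user already has.
                if (!history.contains(e.getKey())) {
                    scores.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        List<Long> ranked = new ArrayList<>(scores.keySet());
        // Highest score first.
        ranked.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return ranked.subList(0, Math.min(howMany, ranked.size()));
    }
}
```

The point of the design is that the online path never scans user
profiles: it issues one multi-fetch for the few items in the history and
does the rest in memory.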

On Mon, May 31, 2010 at 5:48 PM, Ted Dunning <[email protected]> wrote:
> Hmm....
>
> I have used Lucene very effectively in item-based recommendation settings
> before and user-based recommendations were marginally acceptable.
>
> All that the data store has to do is fetch the lists of related items for
> the most important few items in the person's history.  With a multi-fetch
> operation (which Cassandra may or may not have), this is one server
> round-trip.  It is definitely much faster to keep lots of items in memory,
> though.
>
> The off-line processing to build the item-item relationships, however, would
> require a scan of all user profiles which may be a bit intense, especially
> if the NOSQL store is being used at the same time to service user requests.
> I have found it preferable to grovel some form of log file in HDFS to get
> this information in the past.
>
> On Mon, May 31, 2010 at 9:00 AM, Sean Owen <[email protected]> wrote:
>
>> So, any such data store is way too slow to use with a real-time
>> recommender.
>>
>> But a distributed algorithm? Sure. As you say, the distributed version
>> runs on Hadoop, and you can transfer between HDFS and Cassandra. Not
>> sure whether to call that integration -- there's nothing the project
>> would meaningfully do here, since it reads off HDFS.
>>
>
