+10

Love the academics but I agree with this. Recently saw a VP from Netflix plead 
with the audience (mostly academics) to move past RMSE--focus on maximizing 
correct ranking, not rating prediction. 

Anyway, I have a pipeline that does the following:

1. Ingests logs, either TSV or CSV of arbitrary column ordering, and picks
   out the actions by position and string.
2. Replaces PreparePreferenceMatrixJob to create n matrices, depending on
   the number of actions you are splitting out. This job also creates
   external <-> internal item and user ID BiHashMaps for going back and
   forth between the log's IDs and Mahout's internal IDs (a minimal sketch
   of the mapping follows this list). By creating a single ID space from all
   actions it guarantees a uniform item and user ID space and consistent
   sparse matrix ranks. Not completely scalable yet, since it is not done in
   m/r though it uses HDFS; I have a plan to m/r the process and get rid of
   the hashmaps.
3. Performs the RowSimilarityJob on the primary matrix "B" and does B'A to
   create a cooccurrence matrix for the primary and secondary actions.
4. Uses the rest of the Mahout pipeline on B to get recs and does [B'A]H_v
   to calculate all cross-recommendations (see the scoring sketch after
   this list).
5. Stores all recs from all models in a NoSQL DB.
6. At rec request time, does a linear combination of rec and cross-rec
   scores and returns the highest-scored ones. The stored IDs are external,
   so they are ready for display.
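
For step 2, here is roughly what the external <-> internal ID mapping looks
like, sketched with Guava's HashBiMap (the class and method names below are
just illustrative, not the actual job code):

    import com.google.common.collect.BiMap;
    import com.google.common.collect.HashBiMap;

    // Illustrative sketch only -- maps external (log-file) string IDs to
    // dense Mahout-internal int indexes and back.
    public class IdDictionary {
      private final BiMap<String, Integer> ids = HashBiMap.create();

      // Returns the internal index for an external ID, assigning the next
      // dense index if the ID has not been seen before.
      public int toInternal(String externalId) {
        Integer internal = ids.get(externalId);
        if (internal == null) {
          internal = ids.size();
          ids.put(externalId, internal);
        }
        return internal;
      }

      // Reverse lookup, used when writing recs so the stored IDs are the
      // external ones and ready for display.
      public String toExternal(int internalId) {
        return ids.inverse().get(internalId);
      }
    }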
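
And for steps 4 and 6, a rough in-memory sketch of [B'A]H_v plus the
request-time blend. The dense arrays and the weight w are purely for
illustration; the real pipeline keeps everything sparse and distributed:

    // Illustrative sketch of cross-recommendation scoring and blending.
    public class CrossRecScoring {

      // crossMatrix is [B'A]; hV (H_v) is one user's secondary-action
      // history vector. Returns a score per item.
      static double[] crossRecScores(double[][] crossMatrix, double[] hV) {
        double[] scores = new double[crossMatrix.length];
        for (int item = 0; item < crossMatrix.length; item++) {
          double sum = 0.0;
          for (int j = 0; j < hV.length; j++) {
            sum += crossMatrix[item][j] * hV[j];
          }
          scores[item] = sum;
        }
        return scores;
      }

      // Request-time linear combination: w * rec + (1 - w) * crossRec.
      // The weight w is an assumption here, not something from the pipeline.
      static double[] blend(double[] recScores, double[] crossScores, double w) {
        double[] blended = new double[recScores.length];
        for (int i = 0; i < recScores.length; i++) {
          blended[i] = w * recScores[i] + (1.0 - w) * crossScores[i];
        }
        return blended;
      }
    }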

Do steps 1-3 fit the first part of 'offline to Solr'? The IDs can be written
to Solr as the original external IDs from the log files, which were strings.
This allows them to be treated as terms by Solr.

My understanding of the Solr proposal is that B's row similarity matrix is
stored as a vector per item. That means each row is turned into "terms" =
external IDs; I'm not sure how the weight of each term is encoded. So the
cross-recommender would just put the cross-action similarity matrix in other
field(s) on the same itemID/docID, right?
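
If that reading is right, indexing could look something like this SolrJ
(4.x era) sketch, one document per item with the B'B row and the B'A row
flattened into space-separated external IDs. The field names and the
decision to drop the weights entirely are assumptions on my part:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Illustrative sketch: one Solr doc per item, indicator rows as terms.
    public class IndexItemRow {

      // similarIds: row of B'B as external IDs; crossIds: row of B'A.
      public static void indexItem(SolrServer solr, String externalItemId,
          Iterable<String> similarIds, Iterable<String> crossIds)
          throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", externalItemId);
        // Hypothetical field names; per-term weights are simply discarded.
        doc.addField("b_b_indicators", String.join(" ", similarIds));
        doc.addField("b_a_indicators", String.join(" ", crossIds));
        solr.add(doc);
      }

      public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
        // ... call indexItem() for each item row, then commit
        solr.commit();
      }
    }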

Then the straight out recommender queries on the B'B field(s) and the 
cross-recommender queries on the B'A field(s). I suppose to keep it simple the 
cross-action similarity matrix could be put in a separate index.  Is this about 
right?
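
And a query-time sketch under the same assumptions: the user's recent
primary-action items go against the B'B field and the secondary-action items
against the B'A field, so Solr's scoring effectively does the combination.
Field names and boosts are hypothetical:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    // Illustrative sketch: recommend by querying the user's history against
    // the indicator fields.
    public class QueryRecs {

      // primaryHistory/secondaryHistory: space-separated external item IDs.
      public static QueryResponse recommend(SolrServer solr,
          String primaryHistory, String secondaryHistory) throws Exception {
        SolrQuery query = new SolrQuery();
        // Hypothetical field names; boosts weight primary vs. cross evidence.
        query.setQuery("b_b_indicators:(" + primaryHistory + ")^1.0 "
            + "b_a_indicators:(" + secondaryHistory + ")^0.5");
        query.setRows(10);
        return solr.query(query);
      }
    }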

On Jul 21, 2013, at 5:30 PM, Sebastian Schelter <s...@apache.org> wrote:

At the moment, the down sampling is done by PreparePreferenceMatrixJob
for the collaborative filtering functionality. We just want to move it
down to RowSimilarityJob to enable standalone usage.

I think that the CrossRecommender should be the next thing on our
agenda, after we have the deployment infrastructure. I especially like
that it is capable of including different kinds of interactions, as opposed
to most other (academically motivated) recommenders that focus on a
single interaction type like a rating.

--sebastian

On 22.07.2013 02:14, Ted Dunning wrote:
> The row similarity downsampling is just a matter of dropping elements at
> random from rows that have more data than we want.
> 
> If the join that puts the row together can handle two kinds of input, then
> RowSimilarity can be easily modified to be CrossRowSimilarity.  Likewise,
> if we have two DRMs with the same row IDs in the same order, we can do a
> map-side merge.  Such a merge can be very efficient on a system like MapR
> where you can control files to live on the same nodes.
> 
> 
> On Sun, Jul 21, 2013 at 4:43 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> 
>> RowSimilarity downsampling? Are you referring to a mod of the matrix
>> multiply to do cross similarity with LLR for the cross recommendations? So
>> similarity of rows of B with rows of A?
>> 
>> Sounds like you are proposing not only putting a recommender in Solr but
>> also a cross-recommender? This is why getting a real data set is
>> problematic?
>> 
>> On Jul 21, 2013, at 3:40 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> 
>> Pat,
>> 
>> Yes.  The first part probably just is the RowSimilarity job, especially
>> after Sebastian puts in the down-sampling.
>> 
>> The new part is exactly as you say, storing the DRM into Solr indexes.
>> 
>> There is no reason to not use a real data set.  There is a strong reason to
>> use a synthetic dataset, however, in that it can be trivially scaled up and
>> down both in items and users.  Also, the synthetic dataset doesn't require
>> that the real data be found and downloaded.
>> 
>> 
>> 
>> On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>> 
>>> Read the paper, and the preso.
>>> 
>>> As to the 'offline to Solr' part. It sounds like you are suggesting an
>>> item item similarity matrix be stored and indexed in Solr. One would have
>>> to create the action matrix from user profile data (preference history),
>> do
>>> a rowsimiarity job on it (using LLR similarity) and move the result to
>>> Solr. The first part of this is nearly identical to the current
>> recommender
>>> job workflow and could pretty easily be created from it I think. The new
>>> part is taking the DistributedRowMatrix and storing it in a particular
>> way
>>> in Solr, right?
>>> 
>>> BTW Is there some reason not to use an existing real data set?
>>> 
>>> On Jul 19, 2013, at 3:45 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> 
>>> OK.  I think the crux here is the off-line to Solr part so let's see who
>>> else pops up.
>>> 
>>> Having a solr maven could be very helpful.
>>> 
>>> 
>>> 
>> 
>> 
> 

