Hi Sean, Myrrix does look interesting! I'll keep an eye on it.
What I'd like to do is recommend items to users, yes. I looked at the IDRescorer and it did the job perfectly (pre-filtering).

I was a little misleading in regard to the size of the data. The raw data files are around 1GB, but after the interesting data is extracted -- session-id, item-id and type-of-event (product image clicked, product description viewed, etc.) -- the data file comes out to about 10MB. Not so bad.

Btw, just bought the Mahout in Action book!

- Matt

On Tue, Jul 3, 2012 at 10:40 AM, Sean Owen <sro...@gmail.com> wrote:
> I'm not sure if Mridul's suggestion does what you want. Do you want to
> recommend items to users? Then no, you do not start with item IDs and
> recommend to them.
>
> It sounds like your question is how to compute similarity data. The
> first answer is that you do not use Hadoop unless you must use Hadoop.
>
> You don't compute it yourself; you let the framework do it with
> LogLikelihoodSimilarity. It just happens automatically. You can use
> caching, and you can use precomputation, but that comes after you decide
> that you have too much data to do it all in real time.
>
> 1GB of input data suggests you have a lot of data. Is that tens of
> millions of user-item associations? Then yes, you are not in simple
> non-Hadoop land anymore and you need to look at RecommenderJob /
> Hadoop. This doesn't have anything to do with FileDataModel or the
> non-distributed bits.
>
> To your second point -- this is really what Rescorer does for you:
> it lets you filter or boost certain results at query time. But this is
> part of the non-distributed code. You could try stitching together
> some offline similarities from the Hadoop job and loading them
> selectively in memory as part of the real-time Recommender, but it's
> going to be a bit dicey to get it to work fast.
>
> I don't mind mentioning that this is exactly the kind of problem I'm
> working on in Myrrix (myrrix.com).
> It does the offline model building
> on Hadoop and still lets you do real-time recommendations, with
> Rescorer objects if you want. The whole point is to fix up the
> "dicey" hard part mentioned above. Might be worth a look.
>
> On Tue, Jul 3, 2012 at 3:15 PM, Matt Mitchell <goodie...@gmail.com> wrote:
>> Thanks Mridul, I'll try this out. Does getItemIDs return every item ID
>> from the file in your example?
>>
>> This kind of leads me to another, related question... I want my
>> recommender engine to recommend items to a user, but the items should
>> come from a known set of item IDs. For example, if a user is searching
>> for "gaming system", I only want recommendations for "gaming system"
>> items. I was thinking I could feed the recommendation engine a set of
>> item IDs that are known to be "gaming systems" as a candidate set
>> *when executing that actual recommendation*. Does this make sense?
>> If so, do you know how I can do this? I basically want to constrain
>> the recommendations to a set of known item IDs at recommendation time.
>>
>> Thanks again!
>>
>> - Matt
>>
>> On Tue, Jul 3, 2012 at 8:01 AM, Mridul Kapoor <mridulkap...@gmail.com> wrote:
>>>> I'm thinking the session ID (in the cookie) would be used as the user ID.
>>>> The events are tied to product IDs, so these would be used in generating
>>>> the preferences.
>>>
>>> I guess that works if you consider product preference on a per-session
>>> basis (i.e. only items for which a user expresses a preference in a
>>> single session are similar to each other, in some way or another). This
>>> way you would be treating the session IDs as dummy user IDs, which I
>>> think should be fine.
>>>
>>>> I'd like to eventually run this on Hadoop, but it'd also be nice to know
>>>> if there is a way to do this locally, while developing the app, maybe
>>>> using a smaller dataset?
>>>
>>> Yes, just writing a small offline recommender (made to run on a local
>>> machine) should do. You could take a subset of the data, use a
>>> FileDataModel, then do something like
>>>
>>>     LongPrimitiveIterator itemIDs = dataModel.getItemIDs();
>>>
>>> and iterate over these, getting _n_ recommended items for each and
>>> storing them somewhere (and maybe use this to evaluate the
>>> recommender somehow).
>>>
>>> Best,
>>> Mridul
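[Editor's note: the candidate-set filtering discussed above can be sketched as an `IDRescorer`. The two-method interface below mirrors Mahout's `org.apache.mahout.cf.taste.recommender.IDRescorer`, but this sketch is self-contained so it runs without the Mahout jars; the `KnownItemsRescorer` class name and the candidate set are hypothetical, not from the thread.]

```java
import java.util.Set;

// Mirrors the shape of Mahout's IDRescorer interface (rescore + isFiltered).
interface IDRescorer {
  double rescore(long id, double originalScore);
  boolean isFiltered(long id);
}

// Restricts recommendations to a known set of item IDs, e.g. the items
// tagged "gaming system". Anything outside the set is filtered out at
// query time; allowed items keep their original scores.
class KnownItemsRescorer implements IDRescorer {

  private final Set<Long> candidateItemIDs;

  KnownItemsRescorer(Set<Long> candidateItemIDs) {
    this.candidateItemIDs = candidateItemIDs;
  }

  @Override
  public double rescore(long id, double originalScore) {
    // No boosting here; this rescorer only filters.
    return originalScore;
  }

  @Override
  public boolean isFiltered(long id) {
    // true means "exclude this item from the results".
    return !candidateItemIDs.contains(id);
  }
}
```

With the real Mahout classes, this would plug in roughly as `recommender.recommend(userID, 10, rescorer)`, using the `recommend` overload that takes an `IDRescorer`, on top of a `FileDataModel` and `LogLikelihoodSimilarity` as discussed above, so the filter is applied when that particular recommendation executes.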