Re: Persisting trained models in Mahout

Sebastian Schelter Thu, 08 Dec 2011 06:20:25 -0800

A model for item-based collaborative filtering simply consists of the
precomputed item similarities.


We currently support such a precomputation only as a Hadoop job, but it
should take about an hour to write a class that precalculates the
item similarities sequentially using an ItemBasedRecommender.
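
As a rough illustration of what such a sequential precomputation involves,
here is a plain-Java sketch that does not use Mahout at all. The class name
SequentialItemSimilarities, the nested-map input layout and the choice of
cosine similarity over co-rating users are all assumptions made for the
example; a real version would ask an ItemBasedRecommender or an
ItemSimilarity for the values instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: computes all item-item
// similarities in a single sequential pass over the item pairs.
final class SequentialItemSimilarities {

  /** Cosine similarity over users who rated both items; 0.0 if none did. */
  static double cosine(Map<Long, Double> ratingsA, Map<Long, Double> ratingsB) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (Map.Entry<Long, Double> e : ratingsA.entrySet()) {
      Double other = ratingsB.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * other;
        normA += e.getValue() * e.getValue();
        normB += other * other;
      }
    }
    return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
  }

  /**
   * Input: itemID -> (userID -> preference).
   * Output: itemA -> (itemB -> similarity) for itemA < itemB, nonzero values only.
   */
  static Map<Long, Map<Long, Double>> precompute(Map<Long, Map<Long, Double>> ratingsByItem) {
    List<Long> items = new ArrayList<>(ratingsByItem.keySet());
    Collections.sort(items);
    Map<Long, Map<Long, Double>> similarities = new TreeMap<>();
    for (int i = 0; i < items.size(); i++) {
      for (int j = i + 1; j < items.size(); j++) {
        long a = items.get(i);
        long b = items.get(j);
        double sim = cosine(ratingsByItem.get(a), ratingsByItem.get(b));
        if (sim != 0.0) {
          similarities.computeIfAbsent(a, k -> new TreeMap<>()).put(b, sim);
        }
      }
    }
    return similarities;
  }
}
```

The loop is quadratic in the number of items, so this only makes sense for
catalogs small enough to fit the pairwise pass on one machine.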

You can either store these similarities in the database and load them
via MySQLJDBCInMemoryItemSimilarity/SQL92JDBCInMemoryItemSimilarity or
you can write them to a .csv file and load them via FileItemSimilarity.
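
A minimal sketch of the .csv route, in plain Java without any Mahout
classes. It assumes one "itemID1,itemID2,similarity" triple per line, which
is my understanding of what FileItemSimilarity reads; please check the
javadoc of your Mahout version before relying on that layout. The class
ItemSimilarityCsv and its Entry type are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: converts precomputed
// similarities to and from comma-separated lines.
final class ItemSimilarityCsv {

  /** One precomputed similarity between two item IDs. */
  static final class Entry {
    final long itemA;
    final long itemB;
    final double value;

    Entry(long itemA, long itemB, double value) {
      this.itemA = itemA;
      this.itemB = itemB;
      this.value = value;
    }
  }

  /** Formats each entry as "itemA,itemB,value". */
  static List<String> toLines(List<Entry> entries) {
    List<String> lines = new ArrayList<>();
    for (Entry e : entries) {
      lines.add(e.itemA + "," + e.itemB + "," + e.value);
    }
    return lines;
  }

  /** Parses lines produced by toLines, skipping blank lines. */
  static List<Entry> fromLines(List<String> lines) {
    List<Entry> entries = new ArrayList<>();
    for (String line : lines) {
      if (line.trim().isEmpty()) {
        continue;
      }
      String[] parts = line.split(",");
      entries.add(new Entry(
          Long.parseLong(parts[0].trim()),
          Long.parseLong(parts[1].trim()),
          Double.parseDouble(parts[2].trim())));
    }
    return entries;
  }
}
```

The lines can be written with java.nio.file.Files.write(path, lines) and
read back with Files.readAllLines(path); the resulting file is what you
would then hand to FileItemSimilarity.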

A model for recommenders that use matrix factorization consists of the
user and item feature vectors. You can use a FilePersistenceStrategy
with any SVDRecommender to read and write these.
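
To make "the model is just the feature vectors" concrete, here is a
plain-Java sketch that saves and restores two maps of feature vectors and
estimates a preference as their dot product. It is not Mahout's
FilePersistenceStrategy and does not use its file format; the class name
FactorizationSnapshot and the byte layout are assumptions made for the
example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class, not part of Mahout: holds user and item feature
// vectors and round-trips them through a simple binary layout.
final class FactorizationSnapshot {
  final Map<Long, double[]> userFeatures;
  final Map<Long, double[]> itemFeatures;

  FactorizationSnapshot(Map<Long, double[]> userFeatures, Map<Long, double[]> itemFeatures) {
    this.userFeatures = userFeatures;
    this.itemFeatures = itemFeatures;
  }

  /** Serializes both maps: count, then (id, length, values...) per vector. */
  byte[] toBytes() {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(buffer)) {
        writeVectors(out, userFeatures);
        writeVectors(out, itemFeatures);
      }
      return buffer.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Restores a snapshot written by toBytes. */
  static FactorizationSnapshot fromBytes(byte[] bytes) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
      Map<Long, double[]> users = readVectors(in);
      Map<Long, double[]> items = readVectors(in);
      return new FactorizationSnapshot(users, items);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Estimated preference: dot product of the user and item vectors. */
  double estimate(long userID, long itemID) {
    double[] u = userFeatures.get(userID);
    double[] v = itemFeatures.get(itemID);
    double sum = 0.0;
    for (int k = 0; k < u.length; k++) {
      sum += u[k] * v[k];
    }
    return sum;
  }

  private static void writeVectors(DataOutputStream out, Map<Long, double[]> vectors)
      throws IOException {
    out.writeInt(vectors.size());
    for (Map.Entry<Long, double[]> e : vectors.entrySet()) {
      out.writeLong(e.getKey());
      out.writeInt(e.getValue().length);
      for (double value : e.getValue()) {
        out.writeDouble(value);
      }
    }
  }

  private static Map<Long, double[]> readVectors(DataInputStream in) throws IOException {
    int count = in.readInt();
    Map<Long, double[]> vectors = new HashMap<>();
    for (int i = 0; i < count; i++) {
      long id = in.readLong();
      double[] features = new double[in.readInt()];
      for (int k = 0; k < features.length; k++) {
        features[k] = in.readDouble();
      }
      vectors.put(id, features);
    }
    return vectors;
  }
}
```

Writing the byte array to disk after training and reading it back at
startup avoids recomputing the factorization on every restart.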

In the future we could also support loading the results of
ParallelALSFactorizationJob into an SVDRecommender.

--sebastian



On 08.12.2011 14:49, Sean Owen wrote:
> That's right, you could get this effect by computing and saving off all the
> user-user similarities, then reading them back in, putting them in a
> GenericUserSimilarity, and proceeding as below. Those similarities are the
> closest thing to a model here.
> 
> It's going to take a while to compute all those pairs, and most will be
> unused, and so reloading them is going to take a lot of time and memory.
> You could prune the small ones I suppose. It might be faster to recompute!
> 
> On Thu, Dec 8, 2011 at 1:46 PM, Vinod <[email protected]> wrote:
> 
>> I'll use the first example from Chapter 2 of your book to clarify what I
>> mean by training:
>>
>> The following code trains the recommender:
>>    DataModel model = new FileDataModel(new File("intro.csv"));
>>
>>    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
>>    UserNeighborhood neighborhood =
>>      new NearestNUserNeighborhood(2, similarity, model);
>>
>>    Recommender recommender = new GenericUserBasedRecommender(
>>        model, neighborhood, similarity);
>>
>> At this point, the recommender is trained on the preferences of users 1 to
>> 5 in intro.csv.
>>
>> We should now be able to serialize this recommender instance into a file,
>> say "Movie Recommender.model", using the steps mentioned here (
>> http://java.sun.com/developer/technicalArticles/Programming/serialization/
>> )
>>
>> All we need to do now is deploy "Movie Recommender.model" to production.
>>
>> If I understand the behavior correctly, this model should now be able to
>> produce recommendations for a new user.
>>
>> As an example, let's assume that production has a different user base. If
>> the recommender instance is loaded from the "Movie Recommender.model" file
>> and asked to provide recommendations for user '7', who has rated 101 and
>> 102 as 4 and 3 respectively, it should be able to produce recommendations
>> for 7, right?
>>
>> regards,
>> Vinod
>>
>>
>>
>>
>> On Thu, Dec 8, 2011 at 6:49 PM, Sean Owen <[email protected]> wrote:
>>
>>> Yes, I mean you need to write it and read it in your own code.
>>>
>>> What do you mean by training a model? Computing similarities? I don't know
>>> if there's such a thing here as "training" on one data set and running on
>>> another. The implementations always use all currently available info. Is
>>> this a cold-start issue?
>>>
>>> OutOfMemoryError has nothing to do with this; on such a small data set it
>>> indicates you didn't set your JVM heap size above the default.
>>>
>>>
>>> On Thu, Dec 8, 2011 at 1:02 PM, Vinod <[email protected]> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> Neither Recommender nor any of its parent interfaces extends Serializable,
>>>> so there is no way that I'd be able to serialize it.
>>>>
>>>> I agree that the implementations may not have startup overhead. However,
>>>> training a model on millions of rows is a CPU-, memory- and time-consuming
>>>> activity. For example, when the data set is changed from 100K to 1M in
>>>> chapter 4, the program crashes with OutOfMemoryError after a significant
>>>> amount of time.
>>>>
>>>> I feel that training should be done in development only. Once a developer
>>>> is OK with the test results, he should be able to save an instance of the
>>>> trained and tested model (e.g. a recommender or classifier).
>>>>
>>>> Only these saved instances of trained and tested models should be deployed
>>>> to production.
>>>>
>>>> Thoughts?
>>>>
>>>> regards,
>>>> Vinod
>>>>
>>>>
>>>>
>>>> On Thu, Dec 8, 2011 at 6:00 PM, Sean Owen <[email protected]> wrote:
>>>>
>>>>> Ah right. No, there's still not a provision for this. You would just have
>>>>> to serialize it yourself if you like.
>>>>> Most of the implementations don't have a great deal of startup overhead, so
>>>>> don't really need this. The exception is perhaps slope-one, but there you
>>>>> can actually save and supply pre-computed diffs.
>>>>> Still it would be valid to store and re-supply user-user similarities or
>>>>> something. You can do this, manually, by querying for user-user
>>>>> similarities, saving them, then loading them and supplying them via
>>>>> GenericUserSimilarity for instance.
>>>>>
>>>>> On Thu, Dec 8, 2011 at 12:27 PM, Vinod <[email protected]> wrote:
>>>>>
>>>>>> Hi Sean,
>>>>>>
>>>>>> Thanks for the quick response.
>>>>>>
>>>>>> By model, I am not referring to the data model but to a "trained"
>>>>>> recommender instance.
>>>>>>
>>>>>> Weka, for example, has the ability to save and load models:
>>>>>> http://weka.wikispaces.com/Serialization
>>>>>> http://weka.wikispaces.com/Saving+and+loading+models
>>>>>>
>>>>>> This avoids the need to train the model (recommender) every time a server
>>>>>> is bounced or the program is restarted.
>>>>>>
>>>>>> regards,
>>>>>> Vinod
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 8, 2011 at 5:43 PM, Sean Owen <[email protected]> wrote:
>>>>>>
>>>>>>> The classes aren't Serializable, no. In the case of DataModel, it's assumed
>>>>>>> that you already have some persisted model somewhere, in a DB or file or
>>>>>>> something, so this would be redundant.
>>>>>>>
>>>>>>> On Thu, Dec 8, 2011 at 12:07 PM, Vinod <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This is my first day of experimentation with Mahout. I am following the
>>>>>>>> "Mahout in Action" book and, looking at the sample code provided, it seems
>>>>>>>> that models, e.g. a recommender, need to be trained at the start of the
>>>>>>>> program (start/restart). The Recommender interface extends Refreshable,
>>>>>>>> which doesn't extend Serializable. So, I am wondering if Mahout provides
>>>>>>>> an alternate mechanism to persist trained models (the recommender
>>>>>>>> instance in this case).
>>>>>>>>
>>>>>>>> Apologies if this is a very silly question.
>>>>>>>>
>>>>>>>> Thanks & regards,
>>>>>>>> Vinod
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
