[ https://issues.apache.org/jira/browse/MAHOUT-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176111#comment-13176111 ]

Anatoliy Kats commented on MAHOUT-906:
--------------------------------------

You're right, old data should be excluded from the test corpus.  The point of a 
time-based algorithm is to treat older data differently from newer data.  So, 
if there is data the algorithm chooses to disregard, that should be done at the 
recommender level, not the evaluator level.  However, we still need a test 
start and end date: the entire idea is that we use only preferences prior to 
the "current" time to make a recommendation.  In production, that's always 
the case.  In a test environment we need to approximate it using a sliding 
window approach: train on days 1-20, test on day 21, then train on days 1-21 
and test on day 22, and so on.  A minimal sketch of such a split follows.
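Here is a rough sketch of the timestamp-based split against the Taste 
DataModel API.  The class and method names (TimeBasedSplitter, splitByTime) 
are mine, not anything in trunk, and the sketch assumes timestamps are 
available through DataModel.getPreferenceTime():

{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;

final class TimeBasedSplitter {

  /**
   * Splits preferences at a cutoff: everything strictly before cutoffTime
   * goes into the training model, everything at or after it goes into the
   * test model; preferences with no timestamp stay in training.
   * Returns {training, test}.
   */
  static DataModel[] splitByTime(DataModel model, long cutoffTime) throws TasteException {
    FastByIDMap<PreferenceArray> train = new FastByIDMap<PreferenceArray>();
    FastByIDMap<PreferenceArray> test = new FastByIDMap<PreferenceArray>();
    LongPrimitiveIterator userIDs = model.getUserIDs();
    while (userIDs.hasNext()) {
      long userID = userIDs.nextLong();
      List<Preference> trainPrefs = new ArrayList<Preference>();
      List<Preference> testPrefs = new ArrayList<Preference>();
      for (Preference pref : model.getPreferencesFromUser(userID)) {
        Long time = model.getPreferenceTime(userID, pref.getItemID());
        boolean isTest = time != null && time >= cutoffTime;
        (isTest ? testPrefs : trainPrefs)
            .add(new GenericPreference(userID, pref.getItemID(), pref.getValue()));
      }
      if (!trainPrefs.isEmpty()) {
        train.put(userID, new GenericUserPreferenceArray(trainPrefs));
      }
      if (!testPrefs.isEmpty()) {
        test.put(userID, new GenericUserPreferenceArray(testPrefs));
      }
    }
    return new DataModel[] { new GenericDataModel(train), new GenericDataModel(test) };
  }
}
{code}

Splitting at the preference level, rather than the item level, keeps both of 
my use cases (expired items and the day k+1 simulation) in one code path.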

I don't mind using my local modifications of the time-based splitter, so long 
as the trunk maintains the hook.  I also see the motivation for putting the 
hook in the abstract difference recommender evaluator.  If I get around to 
using a preference-value recommender, I'll look into it.  One possible shape 
for the hook is sketched below.
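For reference, here is one way the hook could look.  The interface name and 
signature are illustrative only, not Mahout's actual API:

{code}
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;

/**
 * Decides, per preference, whether it is eligible for the test set.
 * An evaluator would consult this before moving a preference out of the
 * training data; a default implementation that accepts everything would
 * preserve today's random-split behavior.
 */
interface DataSplitter {
  boolean isTestCandidate(long userID, Preference pref, DataModel model)
      throws TasteException;
}
{code}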
                
> Allow collaborative filtering evaluators to use custom logic in splitting 
> data set
> ----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-906
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-906
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Anatoliy Kats
>            Priority: Minor
>              Labels: features
>         Attachments: MAHOUT-906.patch, MAHOUT-906.patch, MAHOUT-906.patch, 
> MAHOUT-906.patch, MAHOUT-906.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> I want to start a discussion about factoring out the logic used in splitting 
> the data set into training and testing.  Here is how things stand: there are 
> two independent evaluator-based classes.  
> AbstractDifferenceRecommenderEvaluator splits all the preferences randomly 
> into a training and a testing set.  GenericRecommenderIRStatsEvaluator takes 
> one user at a time, removes their top AT preferences, and counts how many of 
> them the system recommends back.
> I have two use cases that both deal with temporal dynamics.  In one case, 
> there may be expired items that can be used for building a training model, 
> but not a test model.  In the other, I may want to simulate the behavior of a 
> real system by building a preference matrix on days 1-k and testing on the 
> ratings the user generated on day k+1.  In this case, it is not items but 
> preferences (user, item, rating triplets) that may belong only to the 
> training set.  Before we discuss appropriate design, are there any other use 
> cases we need to keep in mind?
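To illustrate the day-k/day-(k+1) simulation from the description above, here 
is a sketch of the driving loop, reusing the hypothetical TimeBasedSplitter 
from my comment; recommender construction and scoring are elided:

{code}
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.DataModel;

final class SlidingWindowDriver {

  private static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

  /** Train on days 1..k, test on day k+1, then advance the window one day. */
  static void run(DataModel model, long startTime, long endTime) throws TasteException {
    for (long cutoff = startTime + 20 * DAY_MILLIS; cutoff < endTime; cutoff += DAY_MILLIS) {
      DataModel[] split = TimeBasedSplitter.splitByTime(model, cutoff);
      DataModel training = split[0];  // everything strictly before the cutoff
      DataModel test = split[1];      // everything at or after it; a real harness
                                      // would further restrict this to one day
      // build a recommender on `training` and score it against `test` here
    }
  }
}
{code}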
