[ 
https://issues.apache.org/jira/browse/MAHOUT-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175941#comment-13175941
 ] 

Sean Owen commented on MAHOUT-906:
----------------------------------

Old data can and should simply be excluded from the test corpus, full stop. Or, 
to put it another way: that need is not specific to a time-based split, and in 
general it is already met earlier in the pipeline, so to speak. Why is an end 
date needed? It seems best to just use all the recent data you have.

I understand the point about including score in addition to time in the 
definition of "relevant". The problem I'm having is that the "time+score" 
implementation seems to be of nearly the same value as the current 
implementation, which is just "score". The time part seems secondary, and is 
partly used as a simple filter on the input. So I'm struggling a bit to like 
this change on its own; the new implementation is mostly a copy of the existing 
one. It's on the borderline of being something you may just want to use locally 
for your own purposes.
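
To illustrate why the time part looks like a filter on the input, here is a 
minimal sketch in plain Java. It does not use Mahout classes: TimedPref, 
relevantByScore and relevantByTimeAndScore are hypothetical names made up for 
this example, and the threshold and cutoff values are arbitrary assumptions. 
The "time+score" version simply drops older preferences and then delegates to 
the "score" version.

```java
import java.util.ArrayList;
import java.util.List;

final class RelevanceSketch {

  /** Hypothetical (item, rating, timestamp) tuple for one user; not a Mahout type. */
  static final class TimedPref {
    final long itemID;
    final float value;
    final long timestamp;

    TimedPref(long itemID, float value, long timestamp) {
      this.itemID = itemID;
      this.value = value;
      this.timestamp = timestamp;
    }
  }

  /** "score" only: an item is relevant if its rating reaches the threshold. */
  static List<Long> relevantByScore(List<TimedPref> userPrefs, float threshold) {
    List<Long> relevant = new ArrayList<>();
    for (TimedPref p : userPrefs) {
      if (p.value >= threshold) {
        relevant.add(p.itemID);
      }
    }
    return relevant;
  }

  /** "time+score": filter the input by time, then reuse the score-only logic. */
  static List<Long> relevantByTimeAndScore(List<TimedPref> userPrefs,
                                           float threshold,
                                           long cutoff) {
    List<TimedPref> recent = new ArrayList<>();
    for (TimedPref p : userPrefs) {
      if (p.timestamp >= cutoff) {
        recent.add(p);
      }
    }
    return relevantByScore(recent, threshold);
  }
}
```

Under these assumptions the only new logic is the timestamp comparison, which 
could equally be applied to the data before it reaches the evaluator.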

Any other thoughts from anyone else here?

I suppose the time-based split makes a lot more sense to me for the 
estimation-based test, and I can clearly see the use of a hook and a second 
implementation there. No question about that.
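
As a rough sketch of what such a hook could look like for the estimation-based 
test, assuming nothing about the attached patch or the existing Mahout API: 
SplitRule, TimedPref, Split and before() below are hypothetical names invented 
for this example. A random split and a time-based split would then just be two 
implementations of the same rule.

```java
import java.util.ArrayList;
import java.util.List;

final class TimeSplitSketch {

  /** Hypothetical (user, item, rating, timestamp) tuple; not a Mahout type. */
  static final class TimedPref {
    final long userID;
    final long itemID;
    final float value;
    final long timestamp;

    TimedPref(long userID, long itemID, float value, long timestamp) {
      this.userID = userID;
      this.itemID = itemID;
      this.value = value;
      this.timestamp = timestamp;
    }
  }

  /** Hypothetical hook: decides whether a preference goes to the training set. */
  interface SplitRule {
    boolean isTraining(TimedPref pref);
  }

  /** Result holder for the two partitions. */
  static final class Split {
    final List<TimedPref> training = new ArrayList<>();
    final List<TimedPref> test = new ArrayList<>();
  }

  /** Time-based rule: everything strictly before the cutoff is training data. */
  static SplitRule before(final long cutoff) {
    return pref -> pref.timestamp < cutoff;
  }

  /** Partitions the preferences according to the supplied rule. */
  static Split split(List<TimedPref> prefs, SplitRule rule) {
    Split result = new Split();
    for (TimedPref p : prefs) {
      if (rule.isTraining(p)) {
        result.training.add(p);
      } else {
        result.test.add(p);
      }
    }
    return result;
  }
}
```

With a rule like before(startOfDayKPlusOne), preferences from days 1-k would 
land in the training set and those from day k+1 in the test set; the cutoff 
value itself is left to the caller.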
                
> Allow collaborative filtering evaluators to use custom logic in splitting 
> data set
> ----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-906
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-906
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Anatoliy Kats
>            Priority: Minor
>              Labels: features
>         Attachments: MAHOUT-906.patch, MAHOUT-906.patch, MAHOUT-906.patch, 
> MAHOUT-906.patch, MAHOUT-906.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> I want to start a discussion about factoring out the logic used in splitting 
> the data set into training and testing.  Here is how things stand:  There are 
> two independent evaluator-based classes:  
> AbstractDifferenceRecommenderEvaluator splits all the preferences randomly 
> into a training and testing set.  GenericRecommenderIRStatsEvaluator takes 
> one user at a time, removes their top AT preferences, and counts how many of 
> them the system recommends back.
> I have two use cases that both deal with temporal dynamics.  In one case, 
> there may be expired items that can be used for building a training model, 
> but not a test model.  In the other, I may want to simulate the behavior of a 
> real system by building a preference matrix on days 1-k, and testing on the 
> ratings the user generated on day k+1.  In this case, it's not items, but 
> preferences (user, item, rating triplets) which may belong only to the 
> training set.  Before we discuss appropriate design, are there any other use 
> cases we need to keep in mind?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
