[ https://issues.apache.org/jira/browse/MAHOUT-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169241#comment-13169241 ]

Sean Owen commented on MAHOUT-906:
----------------------------------

Yes, the lightest-touch approach is to pull them out into an array of 
Preference-plus-time objects and sort. DataModel does not need a sort-by-time 
method. At most you could add getTime() to Preference, returning 0 by default 
in current implementations, and then create a new subclass of 
GenericPreference that carries a timestamp. That subclass is then sortable via 
Comparable, or you could add sortByTime() to PreferenceArray.
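A minimal sketch of that idea, assuming a standalone Preference-plus-time class rather than Mahout's actual Preference/GenericPreference API (field and class names here are illustrative):

```java
import java.util.Arrays;

// Illustrative sketch: a preference record carrying a timestamp,
// sortable by time via Comparable. Not Mahout's actual API.
class TimestampedPreference implements Comparable<TimestampedPreference> {
    final long userID;
    final long itemID;
    final float value;
    final long time;   // e.g. epoch millis; 0 when unknown

    TimestampedPreference(long userID, long itemID, float value, long time) {
        this.userID = userID;
        this.itemID = itemID;
        this.value = value;
        this.time = time;
    }

    @Override
    public int compareTo(TimestampedPreference other) {
        return Long.compare(time, other.time);
    }
}

public class SortByTimeDemo {
    public static void main(String[] args) {
        TimestampedPreference[] prefs = {
            new TimestampedPreference(1, 101, 4.0f, 300),
            new TimestampedPreference(1, 102, 3.5f, 100),
            new TimestampedPreference(1, 103, 5.0f, 200),
        };
        Arrays.sort(prefs);   // sorts oldest-first via compareTo
        for (TimestampedPreference p : prefs) {
            System.out.println(p.itemID + " @ " + p.time);
        }
    }
}
```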

O(prefs^2) is not a problem when a user has maybe 100 prefs, especially inside 
an eval framework. But if the task is just to split the preferences into those 
before some time and those after it, you would not be doing anything with 
single prefs at all. Instead of sorting, using the TopN class to select the 
top, say, 5% by time is not hard, and makes only O(prefs) calls to 
getPreferenceTime(), which is efficient. No wrapper objects and such needed.
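The selection idea can be sketched with a bounded min-heap, one pass over the timestamps, O(n log k) overall; this stands in for Mahout's TopN helper, whose actual interface may differ:

```java
import java.util.PriorityQueue;

// Illustrative sketch of top-N selection by time: keep the k most
// recent timestamps in a bounded min-heap instead of sorting everything.
// Stands in for Mahout's TopN helper; not its actual API.
public class NewestByTime {
    static long[] newestK(long[] times, int k) {
        PriorityQueue<Long> heap = new PriorityQueue<>(k); // min-heap on time
        for (long t : times) {               // one pass over all timestamps
            if (heap.size() < k) {
                heap.offer(t);
            } else if (t > heap.peek()) {    // newer than the oldest kept
                heap.poll();
                heap.offer(t);
            }
        }
        long[] result = new long[heap.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = heap.poll();         // drained in ascending time order
        }
        return result;
    }

    public static void main(String[] args) {
        long[] times = {50, 400, 120, 310, 90, 275};
        long[] newest = newestK(times, 2);   // the 2 most recent timestamps
        System.out.println(java.util.Arrays.toString(newest)); // [310, 400]
    }
}
```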

Without explicit time info, you won't necessarily know whether the data is 
sorted. For example, preferences don't necessarily appear in time order in a 
file, and that ordering is ignored at read time anyway.
                
> Allow collaborative filtering evaluators to use custom logic in splitting 
> data set
> ----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-906
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-906
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Anatoliy Kats
>            Priority: Minor
>              Labels: features
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> I want to start a discussion about factoring out the logic used to split the 
> data set into training and testing sets.  Here is how things stand: there are 
> two independent evaluator base classes.  
> AbstractDifferenceRecommenderEvaluator splits all the preferences randomly 
> into a training and a testing set.  GenericRecommenderIRStatsEvaluator takes 
> one user at a time, removes their top AT preferences, and counts how many of 
> them the system recommends back.
> I have two use cases, both dealing with temporal dynamics.  In one case, 
> there may be expired items that can be used for building a training model, 
> but not a test model.  In the other, I may want to simulate the behavior of a 
> real system by building a preference matrix on days 1 through k, and testing 
> on the ratings the user generated on day k+1.  In this case, it's not items 
> but preferences (user, item, rating triplets) that may belong only to the 
> training set.  Before we discuss an appropriate design, are there any other 
> use cases we need to keep in mind?
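The second use case above, splitting on a cutoff time so that days 1 through k train and day k+1 tests, can be sketched as a simple partition of (user, item, rating, time) tuples; the Pref record and splitAt helper here are illustrative, not Mahout's API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a temporal train/test split around a cutoff
// timestamp. Pref and splitAt are hypothetical, not Mahout classes.
public class TemporalSplit {
    record Pref(long userID, long itemID, float value, long time) {}

    @SuppressWarnings("unchecked")
    static List<Pref>[] splitAt(List<Pref> prefs, long cutoff) {
        List<Pref> train = new ArrayList<>();
        List<Pref> test = new ArrayList<>();
        for (Pref p : prefs) {
            // at or before the cutoff -> training; after it -> testing
            (p.time() <= cutoff ? train : test).add(p);
        }
        return new List[] { train, test };
    }

    public static void main(String[] args) {
        List<Pref> prefs = List.of(
            new Pref(1, 10, 4.0f, 1),   // day 1
            new Pref(1, 11, 3.0f, 2),   // day 2
            new Pref(1, 12, 5.0f, 3));  // day 3
        List<Pref>[] split = splitAt(prefs, 2); // train on days 1-2
        System.out.println(split[0].size() + " train, " + split[1].size() + " test");
    }
}
```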

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira