[ https://issues.apache.org/jira/browse/MAHOUT-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169182#comment-13169182 ]

Anatoliy Kats commented on MAHOUT-906:
--------------------------------------

I am beginning to write an evaluator that takes the time of a preference into 
account.  It seems that the first thing I need is to sort all the preferences 
by their timestamps, and I don't quite see how to do that without tearing into 
Mahout's logic.  The DataModel interface only makes a contract that we can 
pull one preference time at a time, with getPreferenceTime.  I see three 
options.  First, I can pull the preferences out one at a time into a sortable 
data structure inside the evaluator class; a sketch of that follows below.  
Second, I could add a sortByTimePreference method directly to the DataModel 
interface.  The latter is a lot more involved, and I can only do it in close 
collaboration with you.
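
To make the first option concrete, here is a minimal sketch, assuming the 0.5 
Taste API (DataModel, PreferenceArray, getPreferenceTime); the TimedPreference 
holder and TimeSortingHelper are names I made up, not existing Mahout classes:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;

/** Hypothetical holder pairing a preference with its timestamp. */
final class TimedPreference {
  final long userID;
  final long itemID;
  final float value;
  final long time;

  TimedPreference(long userID, long itemID, float value, long time) {
    this.userID = userID;
    this.itemID = itemID;
    this.value = value;
    this.time = time;
  }
}

final class TimeSortingHelper {
  /** Pulls every preference out of the model and sorts by timestamp. */
  static List<TimedPreference> sortedByTime(DataModel model) throws TasteException {
    List<TimedPreference> all = new ArrayList<TimedPreference>();
    LongPrimitiveIterator userIDs = model.getUserIDs();
    while (userIDs.hasNext()) {
      long userID = userIDs.nextLong();
      PreferenceArray prefs = model.getPreferencesFromUser(userID);
      for (Preference p : prefs) {
        Long t = model.getPreferenceTime(userID, p.getItemID());
        // Preferences with no timestamp sort to the front.
        all.add(new TimedPreference(userID, p.getItemID(), p.getValue(),
                                    t == null ? Long.MIN_VALUE : t));
      }
    }
    Collections.sort(all, new Comparator<TimedPreference>() {
      public int compare(TimedPreference a, TimedPreference b) {
        return a.time < b.time ? -1 : (a.time > b.time ? 1 : 0);
      }
    });
    return all;
  }
}
{code}

With the list sorted once, the evaluator can walk it to build cumulative 
training sets without touching the DataModel again.
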
The third option is to avoid sorting altogether.  Under this option, for every 
test user, and for every one of their preferences, loop over all the other 
preferences and add to the training set only those that were made earlier.  
This option is fastest to code but slowest to run, at O((#prefs)^2).  For the 
testing alone it works for me, so that's probably what I'll do; see the sketch 
at the end of this comment.

Here is another compromise option: data is normally accumulated by the servers 
in the order the preferences were made, so the input is already approximately 
sorted.  If Mahout can preserve the order in which it read the data off the 
disk, we have sorted data without ever having to sort.  In fact, maybe it 
already does, and all we have to do is make that order part of the contract.  
Is this kind of sorting necessary for a time-based algorithm itself, or only 
for the evaluation?  If it's the former, perhaps we can look into implementing 
this.
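
A rough sketch of the third option, assuming the evaluator rebuilds a training 
model once per test preference; EarlierOnlySplitter and buildTrainingModel are 
hypothetical names of mine, not Mahout API:

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;

final class EarlierOnlySplitter {
  /**
   * Builds a training model from every preference made strictly before
   * testTime (the timestamp of the test preference under evaluation).
   * Preferences without a timestamp are left out of the training set.
   */
  static DataModel buildTrainingModel(DataModel model, long testTime)
      throws TasteException {
    FastByIDMap<PreferenceArray> training = new FastByIDMap<PreferenceArray>();
    LongPrimitiveIterator userIDs = model.getUserIDs();
    while (userIDs.hasNext()) {
      long userID = userIDs.nextLong();
      List<Preference> kept = new ArrayList<Preference>();
      for (Preference p : model.getPreferencesFromUser(userID)) {
        Long t = model.getPreferenceTime(userID, p.getItemID());
        if (t != null && t < testTime) {  // strictly earlier only
          kept.add(new GenericPreference(userID, p.getItemID(), p.getValue()));
        }
      }
      if (!kept.isEmpty()) {
        training.put(userID, new GenericUserPreferenceArray(kept));
      }
    }
    return new GenericDataModel(training);
  }
}
{code}

Calling this once per test preference is what makes the whole evaluation 
O((#prefs)^2); it trades speed for leaving the DataModel contract untouched.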
                
> Allow collaborative filtering evaluators to use custom logic in splitting 
> data set
> ----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-906
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-906
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Anatoliy Kats
>            Priority: Minor
>              Labels: features
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> I want to start a discussion about factoring out the logic used in splitting 
> the data set into training and testing sets.  Here is how things stand: there 
> are two independent evaluator classes.  AbstractDifferenceRecommenderEvaluator 
> splits all the preferences randomly into a training and a testing set.  
> GenericRecommenderIRStatsEvaluator takes one user at a time, removes their 
> top AT preferences, and counts how many of them the system recommends back.
> I have two use cases that both deal with temporal dynamics.  In one case, 
> there may be expired items that can be used for building a training model, 
> but not a test model.  In the other, I may want to simulate the behavior of a 
> real system by building a preference matrix on days 1 through k and testing 
> on the ratings the user generated on day k+1.  In this case, it's not items 
> but preferences (user, item, rating triplets) that may belong only to the 
> training set.  Before we discuss appropriate design, are there any other use 
> cases we need to keep in mind?
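
To make the second use case quoted above concrete, one possible shape for the 
factored-out split logic might look like the following; every name here 
(PreferenceSplitter, DayKPlusOneSplitter) is a hypothetical design sketch, not 
existing Mahout API:

{code:java}
import org.apache.mahout.cf.taste.model.Preference;

/** Hypothetical strategy interface for pluggable split logic. */
interface PreferenceSplitter {
  boolean isTrainingCandidate(long userID, Preference pref, Long timestamp);
  boolean isTestingCandidate(long userID, Preference pref, Long timestamp);
}

/** Example: train on everything before the cutoff, test on the following day. */
final class DayKPlusOneSplitter implements PreferenceSplitter {
  private static final long DAY_MILLIS = 24L * 60 * 60 * 1000;
  private final long cutoffMillis;  // end of day k

  DayKPlusOneSplitter(long cutoffMillis) {
    this.cutoffMillis = cutoffMillis;
  }

  public boolean isTrainingCandidate(long userID, Preference pref, Long t) {
    return t != null && t < cutoffMillis;
  }

  public boolean isTestingCandidate(long userID, Preference pref, Long t) {
    return t != null && t >= cutoffMillis && t < cutoffMillis + DAY_MILLIS;
  }
}
{code}

An evaluator would consult the splitter per preference instead of splitting 
randomly; the expired-items use case would then be just another implementation 
that checks item expiry instead of timestamps.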


