Thanks Davide. I have several comments that may help explain what
is going on.

First, the precision and recall tests involve random selection of
'relevant' preferences, so you will see some variation in the result from
run to run even with the same parameters. This may explain some of what you
are seeing.

Second, if you are computing precision at N, and the framework also picks N
prefs to be considered "relevant", then precision will equal recall
exactly. It is normal for these two values to be equal in that case.
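
To spell it out: both metrics share the numerator |hits|, and here both
denominators are N, so

  precision@N = |hits| / N = recall@N

whenever the relevant set and the recommendation list have the same size N.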

Why would precision go down as recall goes up? I don't believe this is
necessarily true; it is not true here at least.

I think it would be easier to understand and evaluate your proposed changes
as a patch. Could you open a JIRA issue at
https://issues.apache.org/jira/browse/MAHOUT with a proposal?


That may help me understand what you mean by holding out 30% of the data.
30% of a user's preferences may be fewer than "@"; this would mean the test
could never score 100% precision, which seems wrong. That's why there is
this "2*@" limit in the code.
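
For example (with made-up numbers): at @10, a user with 25 preferences
would contribute only floor(0.3 * 25) = 7 held-out items, so even a perfect
recommender could score at most 7/10 = 0.7 precision for that user.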

If you mean you want to hold out 30% of all data and then run these tests,
I think it is just giving you a less precise test result than running it
user-by-user.


But maybe I am misunderstanding your proposal.


On Wed, Feb 1, 2012 at 6:18 PM, Davide Romito <[email protected]> wrote:

> Hi all,
>
> this is the first time I’m writing to this mailing list. I’m Davide Romito,
> and I hold an MS in Computer Engineering.
>
> In the last few weeks I’ve been using Mahout to create a Boolean User-Based
> Collaborative Recommender System, but I’m experiencing some problems.
>
> I applied the suggestions from the book “Mahout in Action” to create a
> Boolean Collaborative Recommender System and to calculate precision and
> recall using the Mahout evaluator (pages 38-39).
>
> I used the Movielens dataset to evaluate precision and recall, as in the
> book. The standard code of my evaluator is:
>
>
>
>     public static void main(String[] args) throws TasteException, IOException {
>
>         RandomUtils.useTestSeed();
>
>         int at = 10;
>
>         DataModel model = new GenericBooleanPrefDataModel(
>             GenericBooleanPrefDataModel.toDataMap(
>                 new FileDataModel(new File("ua.base"))));
>
>         GenericRecommenderIRStatsEvaluator evaluator =
>             new GenericRecommenderIRStatsEvaluator();
>
>         RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
>             public Recommender buildRecommender(DataModel model)
>                     throws TasteException {
>                 UserSimilarity similarity = new TanimotoCoefficientSimilarity(model);
>                 UserNeighborhood neighborhood = new NearestNUserNeighborhood(
>                     model.getNumUsers(), similarity, model);
>                 return new GenericBooleanPrefUserBasedRecommender(model,
>                     neighborhood, similarity);
>             }
>         };
>
>         DataModelBuilder modelBuilder = new DataModelBuilder() {
>             public DataModel buildDataModel(FastByIDMap<PreferenceArray> trainingData) {
>                 return new GenericBooleanPrefDataModel(
>                     GenericBooleanPrefDataModel.toDataMap(trainingData));
>             }
>         };
>
>         IRStatistics stats = null;
>         try {
>             stats = evaluator.evaluate(recommenderBuilder, modelBuilder, model,
>                 null, at, GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
>         } catch (TasteException e) {
>             e.printStackTrace();
>         }
>
>         System.out.println(at + "," + stats.getPrecision() + "," + stats.getRecall());
>     }
>
>
>
> To evaluate the recommender, I changed just the value of "at" from 1 to 10.
> The results are:
>
> @1 -    Precision: 0.2258748674443268       Recall: 0.2258748674443268
>
> @2 -    Precision: 0.24231177094379663     Recall: 0.24231177094379663
>
> @3 -    Precision: 0.20749381406857542     Recall: 0.20749381406857542
>
> @4 -    Precision: 0.19406150583244947     Recall: 0.19406150583244947
>
> @5 -    Precision: 0.20084835630965006     Recall: 0.20084835630965006
>
> @6 -    Precision: 0.2101449275362318       Recall: 0.2101449275362318
>
> @7 -    Precision: 0.21330101499772755     Recall: 0.21330101499772755
>
> @8 -    Precision: 0.21487274655355282     Recall: 0.21487274655355282
>
> @9 -    Precision: 0.21809826793920137     Recall: 0.21809826793920137
>
> @10 -  Precision: 0.2218451749734886        Recall: 0.2218451749734886
>
>
>
>
>
> From classical theory (Dietmar Jannach, Markus Zanker, Alexander Felfernig,
> Gerhard Friedrich, “Recommender Systems: An Introduction”) we know that
> Precision and Recall are computed as fractions of hits_u, the number of
> correctly recommended relevant items for user u. Moreover, Precision
> should monotonically decrease for increasing values of Recall. If you try
> to plot the values shown above, you will not obtain any such function.
> Because of this I think that the way Precision and Recall are computed (at
> least for the Boolean case) is not correct.
>
>
>
> The Precision (P) relates the number of hits to the total number of
> recommended items (|recset_u|):
>
> P_u = |hits_u| / |recset_u|
>
>
>
> In contrast, the Recall (R) computes the ratio of hits to the theoretical
> maximum number of hits given the test set size (|testset_u|):
>
> R_u = |hits_u| / |testset_u|
>
>
>
>
>
> In the previous example, we can see that Precision and Recall are always
> the same for a given @. According to theory this can hold only if the
> denominators of the two ratios (i.e., |recset_u| and |testset_u|) are
> equal.
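>
> As a quick numeric check (made-up numbers): if the recommender returns 10
> items of which 3 are hits, while the user's test set holds 50 items, then
> P_u = 3/10 = 0.30 but R_u = 3/50 = 0.06; the two agree only when the
> denominators match.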
>
>
>
> I’d like to propose some changes/improvements to the existing code.
>
> First of all, Mahout uses the value “at” to calculate the size of
> relevantItemIDs in the class GenericRecommenderIRStatsEvaluator (line 121).
>
> In the recommendation, all the users who have rated at least 2*@ items
> (line 115) are used. Then, the dataset is split into a training set and a
> test set.
>
> The test set always has a size of @.
>
> Usually 70% of the dataset is used for training and 30% for testing. I
> think we should allow the user to set which percentage of the dataset is
> used for training and which for testing.
>
>
>
> For this, the first change is in the interface RecommenderIRStatsEvaluator,
> where I’ve added an overloaded “evaluate” method (note the additional
> double testSetPercentage parameter):
>
>
>
> IRStatistics evaluate(RecommenderBuilder recommenderBuilder,
>                       DataModelBuilder dataModelBuilder,
>                       DataModel dataModel,
>                       IDRescorer rescorer,
>                       int at,
>                       double relevanceThreshold,
>                       double evaluationPercentage,
>                       double testSetPercentage)
>                       throws TasteException;
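>
> A call using the new overload would then look like the original one, with
> the test-set fraction appended (here 0.3, i.e. the 30% used for the
> results below):
>
>     IRStatistics stats = evaluator.evaluate(recommenderBuilder, modelBuilder,
>         model, null, at,
>         GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0, 0.3);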
>
>
>
> Then, in the class GenericRecommenderIRStatsEvaluator I made the following
> changes (please note the MODIFICATIONS comments):
>
>
>
>
>
>     LongPrimitiveIterator it = dataModel.getUserIDs();
>     while (it.hasNext()) {
>
>         long userID = it.nextLong();
>
>         if (random.nextDouble() >= evaluationPercentage) {
>             // Skipped
>             continue;
>         }
>
>         long start = System.currentTimeMillis();
>
>         PreferenceArray prefs = dataModel.getPreferencesFromUser(userID);
>         int size = prefs.length();
>         if (size < 2 * at) {
>             // Really not enough prefs to meaningfully evaluate this user
>             continue;
>         }
>
>         // MODIFICATIONS
>         // if-else added in case the test-set size is declared; if it is
>         // not expressed, use the at size
>         int testSetSize;
>         if (Double.isNaN(testSetPercentage)) {
>             testSetSize = at;
>         } else {
>             testSetSize = (int) (prefs.length() * testSetPercentage);
>         }
>
>         // MODIFICATIONS - using the testSetSize value to determine the
>         // size of relevantItemIDs
>         FastIDSet relevantItemIDs = new FastIDSet(testSetSize);
>
>         // List some most-preferred items that would count as (most)
>         // "relevant" results
>         double theRelevanceThreshold = Double.isNaN(relevanceThreshold)
>             ? computeThreshold(prefs)
>             : relevanceThreshold;
>
>         prefs.sortByValueReversed();
>
>         // MODIFICATIONS - using the testSetSize value to bound the loop
>         // that fills relevantItemIDs
>         for (int i = 0; i < testSetSize; i++) {
>             if (prefs.getValue(i) >= theRelevanceThreshold) {
>                 relevantItemIDs.add(prefs.getItemID(i));
>             }
>         }
>
>         int numRelevantItems = relevantItemIDs.size();
>         if (numRelevantItems <= 0) {
>             continue;
>         }
>
>         FastByIDMap<PreferenceArray> trainingUsers =
>             new FastByIDMap<PreferenceArray>(dataModel.getNumUsers());
>         LongPrimitiveIterator it2 = dataModel.getUserIDs();
>         while (it2.hasNext()) {
>             processOtherUser(userID, relevantItemIDs, trainingUsers,
>                 it2.nextLong(), dataModel);
>         }
>
>         DataModel trainingModel = dataModelBuilder == null
>             ? new GenericDataModel(trainingUsers)
>             : dataModelBuilder.buildDataModel(trainingUsers);
>         Recommender recommender = recommenderBuilder.buildRecommender(trainingModel);
>
>         try {
>             trainingModel.getPreferencesFromUser(userID);
>         } catch (NoSuchUserException nsee) {
>             continue; // Oops we excluded all prefs for the user -- just move on
>         }
>
>         int intersectionSize = 0;
>         List<RecommendedItem> recommendedItems =
>             recommender.recommend(userID, at, rescorer);
>         for (RecommendedItem recommendedItem : recommendedItems) {
>             if (relevantItemIDs.contains(recommendedItem.getItemID())) {
>                 intersectionSize++;
>             }
>         }
>
>         int numRecommendedItems = recommendedItems.size();
>
>         // Precision
>         ...
>
>
>
> Using 30% of the dataset as the test set, the values of precision and
> recall are:
>
> @1 -    Precision: 0.41887592788971406  - Recall: 0.02203446639760426
>
> @2 -    Precision: 0.36585365853658586  - Recall: 0.038498761823094016
>
> @3 -    Precision: 0.35348179568752225  - Recall: 0.054509155919394
>
> @4 -    Precision: 0.34172852598091197  - Recall: 0.07081668395014626
>
> @5 -    Precision: 0.3346765641569464  -    Recall: 0.08682432326362303
>
> @6 -    Precision: 0.34554678692220986  - Recall: 0.09602074799164206
>
> @7 -    Precision: 0.3492628368073204  -    Recall: 0.10585290948384189
>
> @8 -    Precision: 0.35034119106699724  - Recall: 0.1117366914783968
>
> @9 -    Precision: 0.3519239083441425  -    Recall: 0.11677870191013658
>
> @10 - Precision: 0.3528225806451612  -     Recall: 0.12537365741292747
>
>
> In this case, precision and recall differ for a given @, and recall
> increases as expected. Moreover, this time you can plot Precision vs.
> Recall.
>
>
>
> Another modification concerns the maximum allowed value for @.
>
> For example, in the Movielens dataset every user has at least 20 ratings,
> so the evaluation will be done on all the users in the dataset (943 users)
> up to @10 (i.e. 20/2). However, if we try to evaluate precision and recall
> @11, there will be only 887 users, so precision and recall will be
> calculated on a subset of the dataset, and the values could be very
> unexpected. In other words, after a monotonic trend of Precision vs. Recall
> up to @10, the values start to oscillate due to the different subsets of
> users considered in the dataset.
>
>
>
> In order to avoid calculating precision and recall over different subsets
> of users, I’ve inserted a new block where the maximum allowed @ is
> determined, and the evaluation isn’t computed for values of @ greater than
> that maximum:
>
>     // Block for determining the max allowed value of at
>     int atMax = Integer.MAX_VALUE;
>     LongPrimitiveIterator itMaxItem = dataModel.getUserIDs();
>     while (itMaxItem.hasNext()) {
>         long user = itMaxItem.nextLong();
>         if (atMax > dataModel.getItemIDsFromUser(user).size()) {
>             atMax = dataModel.getItemIDsFromUser(user).size();
>         }
>     }
>
>     // if at is greater than atMax / 2, return precision, recall,
>     // fall-out and nDCG as Double.NaN
>     if (at > atMax / 2) {
>         return new IRStatisticsImpl();
>     } else {
>         LongPrimitiveIterator it = dataModel.getUserIDs();
>         ...
>
>
> I’ve also added a new no-argument constructor in the class IRStatisticsImpl:
>
>     IRStatisticsImpl() {
>         this.precision = Double.NaN;
>         this.recall = Double.NaN;
>         this.fallOut = Double.NaN;
>         this.ndcg = Double.NaN;
>     }
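>
> A caller can then recognize the out-of-range case by testing the returned
> statistics for NaN, along these lines:
>
>     if (Double.isNaN(stats.getPrecision())) {
>         // at was greater than atMax / 2 -- evaluation was skipped
>     }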
>
> The reason for the
>
>     if (at > atMax / 2)
>
> check is the same as the one behind line 113 of
> GenericRecommenderIRStatsEvaluator:
>
>     PreferenceArray prefs = dataModel.getPreferencesFromUser(userID);
>     int size = prefs.length();
>     if (size < 2 * at) {
>         // Really not enough prefs to meaningfully evaluate this user
>         continue;
>     }
>
>
>
> In this case the evaluation is done only for the users who have at least
> 2*at preferences.
>
> So if atMax is 20, it means that all the users have at least 20
> preferences, and therefore the maximum @ is 10.
>
>
>
> I would really appreciate your comments and opinions about the
> above-mentioned issues and the proposed solutions.
>
>
>
> Thanks,
>
> Davide Romito
>
