Re: How to map UUID to userId in Preference class to use mahout recommender?
You can use the low-order bits, or have a look at what the UUID class does to hash itself to 32 bits in hashCode() and emulate that for 64 bits. Collisions in a 64-bit space are very, very rare -- rare enough not to care about here by a wide margin. A collision only means you confuse prefs from two users; it still mostly works anyway. Yes, keys were originally Comparable. It was just too much memory / performance overhead. Instead, you can use a mapping to/from 64-bit values. See IDMigrator for instance. On Mon, Apr 8, 2013 at 3:51 AM, Phoenix Bai baizh...@gmail.com wrote: Hi All, the input format required for the Mahout recommender is: *userId (long), itemId (long), rating (optional)* while, currently, my input format is: *userId (UUID, which is 128 bits long), itemId (long), boolean* so, my question is, how could I convert a userId in UUID format to the long datatype? e.g. how to map a value like *550e8400-e29b-41d4-a716-44665544* to a long? My current solution is to convert it to a java UUID instance, extract the least significant bits, and use that as the long userId. But I am worried about collisions, which are not supposed to exist with UUIDs. I am wondering two things: 1) if the collision risk is low, could I use the above approach? What are the possible pros and cons? 2) is it possible to change or extend the Preference class to make userId a String datatype? Is that feasible? thanks
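The XOR-folding idea suggested above can be sketched as follows. This is a sketch only -- Mahout itself is Java, and the helper name here is made up; it simply folds the 128-bit UUID to 64 bits the way java.util.UUID.hashCode() folds to 32:

```python
import uuid

def uuid_to_long(u):
    # XOR-fold the 128-bit UUID down to 64 bits, analogous to what
    # java.util.UUID.hashCode() does for 32 bits.
    folded = (u.int >> 64) ^ (u.int & 0xFFFFFFFFFFFFFFFF)
    # Map into the signed 64-bit range, since Mahout user IDs are Java longs
    return folded - (1 << 64) if folded >= (1 << 63) else folded

# Illustrative UUID (the one in the original mail is truncated, so a
# deterministic name-based UUID is used here instead)
uid = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")
print(uuid_to_long(uid))
```

Using all 128 bits rather than only the low 64 spreads structure from version/variant fields across the result, though as noted, even plain truncation collides only with vanishing probability.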
Re: Detecting rank-deficiency, or worse, via QR decomposition
For example, here's Y:

Y =
 -0.278098 -0.256438  0.127559 -0.045869 -0.769172
 -0.255599  0.150450 -0.436548  0.209881 -0.526238
  0.613175 -0.600739 -0.291662 -1.142282  0.277204
 -0.296846 -0.175122  0.031656 -0.202138 -0.254480
 -0.187816 -0.889571  0.052191 -0.304053  0.498097
 -0.049822 -0.972282 -0.240532  0.155711 -0.627668
 -0.065179 -0.055424  0.977480  0.104342  0.594501
  0.033205 -0.896222 -0.345715 -0.371288 -0.489602
 -0.434807 -0.403650  0.264583 -0.110285 -1.318951
 -0.452470  0.274445 -0.755704  0.313150 -0.903234

and R from the QR decomposition of Y' * Y:

R =
 2.56259 -1.35164 -2.43837  1.27844 -0.17692 -0.30514  1.09366 -0.84664  0.58601  1.06875
 0.0      1.03316  2.61600 -0.46070 -1.46785 -0.10841  0.24828 -2.32186 -2.00163 -0.71470
 0.0      0.0      2.11507  1.15523  1.10757  0.36407 -0.31567  2.77361  0.77367 -0.84055
 0.0      0.0      0.0      0.54242 -0.01545  0.21761  0.26630  0.13972  0.44089  0.02783
 (rows 5-10: all entries 0.0)

Separately I tried avoiding the inverse altogether here and just using the QR decomposition to solve a system where necessary. Probably a better move anyway. But, same result. I think I'm not really quantifying the problem properly, but it's not really a matter of condition number or machine precision. Condition numbers are greater than 1 in these cases, but not that large. On Sun, Apr 7, 2013 at 12:19 AM, Koobas koo...@gmail.com wrote: I don't see why the inverse of Y'*Y does not exist. What Y do you end up with?
Re: Detecting rank-deficiency, or worse, via QR decomposition
(On this aside -- the Commons Math version uses Householder reflections but operates on a transposed representation for just this reason.) On Thu, Apr 4, 2013 at 11:11 PM, Ted Dunning ted.dunn...@gmail.com wrote: But then I started trying to build an HH version using vector ops and realized that the likely reason for the speed is actually just due to the fact that the matrix is stored in row-major form. The operations in my GS implementation are very much row oriented. The operations in the old HH implementation were very column oriented. It is hard to frame HH in a row-major fashion. I might be able to figure out a Givens rotation method that is row oriented. The payoff is that doing HH well (or Givens) should give about another 2x speedup. The downside is that nobody has time to fix stuff that isn't broken.
Re: Detecting rank-deficiency, or worse, via QR decomposition
OK, yes, you're on to something here. I should clarify. Koobas, you are right that the ALS algorithm itself is fine here, as far as my knowledge takes me. The thing it inverts to solve for a row of X is something like (Y' * Cu * Y + lambda * I). No problem there, and indeed I see why the regularization term is part of that. I'm talking about a later step, after the factorization. You get a new row in A and want to solve A = X*Y' for X, given the current Y. (And vice versa.) I'm using a QR decomposition for that, but not to directly solve the system (and this may be the issue); instead I compute and save off (Y' * Y)^-1 so that we can figure A * Y * (Y'*Y)^-1 very fast at runtime. That is to say, the problem centers around the inverse of Y'*Y, and in this example, it does not even exist. I am not sure it's just a numerical precision thing, since using an SVD to get the inverse gives the same result. But I certainly have examples where the data (A) is of rank k and still get this bad behavior -- for example, when lambda is very *high*. On Fri, Apr 5, 2013 at 6:57 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Fri, Apr 5, 2013 at 2:40 AM, Koobas koo...@gmail.com wrote: Anyway, I saw no particular reason for the method to fail with k approaching or exceeding m and n. It does if there is no regularization. But with regularization in place, k can be pretty much anything. Ahh... this is an important point and it should handle all of the issues of poor conditioning. The regularizer takes the rank-deficient A and makes it reasonably well conditioned. How well conditioned depends on the choice of lambda, the regularizing scale constant.
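The effect of the regularizer on the inverse discussed above can be sketched in NumPy (toy sizes and an arbitrary lambda, not Mahout code). Even when Y'Y is singular, (Y'Y + lambda*I) is positive definite for any lambda > 0, so the inverse exists:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, k = 6, 10, 5
Y = rng.normal(size=(n, k))
Y[:, 4] = Y[:, 3]                   # make Y rank-deficient, so Y'Y is singular
A = rng.normal(size=(m, n))         # toy dense stand-in for the sparse input

lam = 0.1                           # arbitrary regularization strength
gram = Y.T @ Y + lam * np.eye(k)    # PSD + lam*I => eigenvalues >= lam > 0
X = A @ Y @ np.linalg.inv(gram)     # the solve from the thread, regularized
print(X.shape)
```

Without the lam term, np.linalg.inv(Y.T @ Y) on this Y would produce garbage or raise, since the Gram matrix has rank 4, not 5.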
Detecting rank-deficiency, or worse, via QR decomposition
This is more of a linear algebra question, but I thought it worth posing to the group -- As part of a process like ALS, you solve a system like A = X * Y' for X or for Y, given the other two. A is sparse (m x n); X and Y are tall and skinny (m x k and n x k, where k << m,n). For example, to solve for X, just: X = A * Y * (Y' * Y)^-1 This fails if the k x k matrix Y' * Y is not invertible, of course. This can happen if the data is tiny and k is actually large relative to m,n. It also goes badly if it is nearly not invertible. The solution for X can become very large, for example, for a small A, which is obviously wrong. You can -- often -- detect this by looking at the diagonal of R in a QR decomposition, looking for near-zero values. However I find similar behavior even when the rank k seems intuitively fine (easily low enough given the data), but when, for example, the regularization term is way too high. X and Y are so constrained that the inverse above becomes a badly behaved operator too. I think I understand the reasons for this intuitively. The goal isn't to create a valid solution, since there is none here; the goal is to define and detect this bad situation reliably and suggest a fix to parameters if possible. I have had better success looking at the operator norm of (Y' * Y)^-1 (its largest singular value) to get a sense of when it is going to potentially scale its input greatly, since that's a sign it's bad, but I feel like I'm missing a rigorous understanding of what to do with that info. I'm looking for a way to think about a cutoff or threshold for that singular value that will make it be rejected (1?) but think I have some unknown-unknowns in this space. Any insights or pointers to the next concept that's required here are appreciated. Sean
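The "look at the diagonal of R" detection described above can be sketched in NumPy (toy data; the threshold is an assumption, not an established rule):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 5
Y = rng.normal(size=(n, k))
Y[:, 4] = 2.0 * Y[:, 2]          # deliberate rank deficiency in Y

# QR of the k x k Gram matrix; near-zero diagonal entries of R
# flag (near-)singularity, as described above
_, R = np.linalg.qr(Y.T @ Y)
diag = np.abs(np.diag(R))
tol = 1e-10 * diag.max()         # a hypothetical relative threshold
print(diag < tol)                # at least one True entry
```

Note this NumPy qr does no column pivoting; with pivoting (e.g. scipy.linalg.qr with pivoting=True) the small diagonal entries are pushed to the end, which makes the rank estimate more reliable.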
Re: Detecting rank-deficiency, or worse, via QR decomposition
I think that's what I'm saying, yes. Small rows X shouldn't become large rows of A -- and similarly small changes in X shouldn't mean large changes in A. Not quite the same thing but both are relevant. I see that this is just the ratio of largest and smallest singular values. Is there established procedure for evaluating the ill-conditioned-ness of matrices -- like a principled choice of threshold above which you say it's ill-conditioned, based on k, etc.? On Thu, Apr 4, 2013 at 3:19 PM, Koobas koo...@gmail.com wrote: So, the problem is that the kxk matrix is ill-conditioned, or is there more to it?
Re: Detecting rank-deficiency, or worse, via QR decomposition
Does it complete without problems? It may complete without error, but the result may be garbage. Due to round-off, the matrix that's inverted is probably not going to be exactly singular. But even so, you may find that the resulting vectors are infinite or very large. In particular I at least had to make the singularity threshold a lot larger than Double.MIN_VALUE in the QR decomposition. Try some simple dummy data like below, with maybe k=10. If it completes without error, that's a problem! 0,0,1 0,1,4 0,2,3 1,2,3 2,1,4 2,3,3 2,4,2 3,0,5 3,2,2 3,4,3 4,3,5 5,0,2 5,1,4 On Thu, Apr 4, 2013 at 7:05 PM, Koobas koo...@gmail.com wrote: I took the MovieLens 100K data without ratings and ran non-weighted ALS in Matlab. I set the number of features to k=2000, which is larger than the input matrix (1000 x 1700). I used QR to do the inversion. It runs without problems. Can you share your data? On Thu, Apr 4, 2013 at 1:10 PM, Koobas koo...@gmail.com wrote: Just to throw in another bit. Just like Ted was saying. If you take the largest singular value over the smallest singular value, you get your condition number. If it turns out to be 10^16, then you're losing all the digits of double precision accuracy, meaning that your solver is nothing more than a random number generator. On Thu, Apr 4, 2013 at 12:21 PM, Dan Filimon dangeorge.fili...@gmail.com wrote: For what it's worth, here's what I remember from my Numerical Analysis course. The thing we were taught to use to figure out whether a matrix is ill-conditioned is the condition number of the matrix (k(A) = norm(A) * norm(A^-1)). Here's a nice explanation of it [1]. Suppose you want to solve Ax = b. How much worse will the results be if you're not really solving Ax = b but A(x + delta) = b + epsilon (x is still a solution for Ax = b)? So, by perturbing the b vector by epsilon, how much worse is delta going to be?
There's a short proof [1, page 4] but the inequality you get is: norm(delta) / norm(x) <= k(A) * norm(epsilon) / norm(b) The rule of thumb is that if m = log10(k(A)), you lose m decimal digits of accuracy. So, equivalently, if m' = log2(k(A)) you lose m' bits of accuracy. Since floats are 32 bits, you could decide that, say, at most 2 bits may be lost, and therefore any k(A) > 4 is not acceptable. Anyway, there are lots of possible norms and you need to look at ways of actually interpreting the condition number, but from what I learned this is probably the thing you want to be looking at. Good luck! [1] http://www.math.ufl.edu/~kees/ConditionNumber.pdf [2] http://www.rejonesconsulting.com/CS210_lect07.pdf On Thu, Apr 4, 2013 at 5:26 PM, Sean Owen sro...@gmail.com wrote: I think that's what I'm saying, yes. Small rows of X shouldn't become large rows of A -- and similarly small changes in X shouldn't mean large changes in A. Not quite the same thing, but both are relevant. I see that this is just the ratio of the largest and smallest singular values. Is there an established procedure for evaluating the ill-conditioned-ness of matrices -- like a principled choice of threshold above which you say it's ill-conditioned, based on k, etc.? On Thu, Apr 4, 2013 at 3:19 PM, Koobas koo...@gmail.com wrote: So, the problem is that the kxk matrix is ill-conditioned, or is there more to it?
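Both points in this exchange -- "completes without error but the result is garbage" and the condition-number rule of thumb -- can be seen on a tiny example (illustrative values only):

```python
import numpy as np

# A nearly singular Gram matrix: inversion "completes without error",
# but the result is garbage, and the condition number says why.
eps = 1e-13
G = np.array([[1.0, 1.0],
              [1.0, 1.0 + eps]])

G_inv = np.linalg.inv(G)       # no exception is raised
print(np.abs(G_inv).max())     # entries on the order of 1/eps

c = np.linalg.cond(G)          # sigma_max / sigma_min
print(np.log10(c))             # roughly 13+ decimal digits lost
```

The solver here really is close to a random number generator: the inverse's entries are dominated by round-off in eps.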
Re: Detecting rank-deficiency, or worse, via QR decomposition
It might make a difference that you're just running 1 iteration. Normally it's run to 'convergence' -- or here let's say, 10+ iterations to be safe. This is the QR factorization of Y' * Y at the finish? This seems like it can't be right... Y has only 5 vectors in 10 dimensions and Y' * Y is certainly not invertible. I get:

 1.20857 -0.20462  0.08707 -0.16972  0.17038  0.00342  0.24459 -0.23287  0.51142 -0.06083
 0.0      1.13242  0.23155  0.24354  0.32995  0.47781 -0.02832  0.43071 -0.24968  0.41470
 0.0      0.0      0.91070  0.37732  0.05296  0.39886 -0.62426  0.07809  0.53891  0.24877
 0.0      0.0      0.0      0.69369 -0.21648 -0.10501  0.09706 -0.03683 -0.10512  0.02849
 0.0      0.0      0.0      0.0      0.60165  0.37106 -0.00193 -0.23392  0.10109 -0.09897
 (rows 6-10: all entries 0.0)

I think there are some other differences here, but probably not meaningful in this context. For example I was doing implicit-feedback ALS. (But the result above is from an Octave implementation of regular ALS like what you're running.) There are a bunch of useful thoughts here; I am going to both read up and explore these as conditions. On Thu, Apr 4, 2013 at 8:54 PM, Koobas koo...@gmail.com wrote: BTW, my initialization of X and Y is simply random: X = rand(m,k); Y = rand(k,n); On Thu, Apr 4, 2013 at 3:51 PM, Koobas koo...@gmail.com wrote: It's done in one iteration. This is the R from the QR factorization:

 5.0663 5.8122 4.9704 4.3987  6.3400  4.5970  5.0334  4.2581  3.3808  5.3250
 0      2.4036 1.1722 2.3296  1.6580  0.4575  1.1706  2.1040  1.6738  1.4839
 0      0      1.5085 0.0966  1.2581  0.5236  0.4712 -0.0411  0.3143  0.5957
 0      0      0      1.8682  0.1834 -0.3244 -0.0073  0.3817  1.1673  0.4783
 0      0      0      0       1.9569  0.8666  0.3201 -0.4167  0.0732  0.3114
 0      0      0      0       0       1.3520  0.2326 -0.1156 -0.2793  0.0103
 0      0      0      0       0       0       1.1689  0.3151  0.0590  0.0435
 0      0      0      0       0       0       0       1.6296 -0.3494 -0.0024
 0      0      0      0       0       0       0       0       1.4307  0.1803
 0      0      0      0       0       0       0       0       0       1.1404
Re: Parallel GenericRecommenderIRStatsEvaluator?
No, just was never written, I suppose, back in the day. The way it is structured now, it creates a test split for each user, which is also slow, and may be challenging given memory limitations, as that's N data models in memory. You could take a crack at a patch. When I rewrote this aspect in a separate project it was certainly threaded and relied on a single test split. It's much faster indeed. On Mon, Apr 1, 2013 at 11:26 AM, Gabor Bernat ber...@primeranks.net wrote: Hello, Is there any good reason why the *GenericRecommenderIRStatsEvaluator* does not support parallel (multi-CPU) evaluation? Today it is quite common to have CPUs with more than one core, and IR evaluation on any reasonably sized data set takes forever to finish. I'm thinking that if we could parallelize the evaluation, by breaking down the input into subsets and accumulating the measurements of each subset at the end, the evaluation time could be heavily improved. For example, I have a data set with 2+ million ratings, and evaluating IR with even 10% of this with a simple recommender takes more than 3 hours, with just a single core of my CPU being kept busy... So? Bernát GÁBOR
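The "break into subsets, accumulate at the end" scheme proposed above is embarrassingly parallel, since per-user evaluations are independent. A sketch (the per-user function here is a made-up stand-in, not Mahout's evaluator):

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def evaluate_user(user_id):
    # Hypothetical per-user IR evaluation: a real version would hold out some
    # of this user's preferences and score the recommender's precision/recall.
    return (user_id % 10) / 10.0   # stand-in precision score

user_ids = range(1000)
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate_user, user_ids))

overall = mean(scores)             # accumulate the per-user results at the end
print(overall)
```

In the Java evaluator the same shape would be an ExecutorService over users; the caveat from the reply stands -- a single shared test split avoids holding N data models in memory.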
Re: Reproducibility, and Recommender Algorithms in Mahout
You should be able to get reproducible random seed values by calling RandomUtils.useTestSeed() at the very start of your program. But if your goal is to get an unbiased view of the quality of results, you want to run several times and take the average, yes. On Sat, Mar 30, 2013 at 3:57 PM, Reinhard Denis Najogie najo...@gmail.com wrote: Dear all, I am doing experiments as part of my final project. I'm comparing the performance of Mahout's implementations of recommender algorithms on some public datasets (so far BookCrossing and GroupLens). I want to ask 2 questions: 1. The score (RMSE) results vary quite a bit each time I run an algorithm (sometimes +- 0.5 difference on some algorithms). Is there any way that I can make it produce the same result on each run? Maybe by setting a seed somewhere in the code? Or should I just do like 10 runs and take the average score? 2. Where can I see the list of all recommender algorithms already implemented by Mahout? From what I read in the Mahout in Action book, there are 6 algorithms: UserBased, ItemBased, Slope One, SVD, KnnItemBased, and TreeClustering. Are there new algorithms since then? Oh, and I found both KnnItem and TreeClustering are deprecated in the newest version of Mahout (0.8-SNAPSHOT)? Why is this the case? -- Regards, Reinhard Denis Najogie
Re: Setting preferences in GenericDataModel.
Yes, it's OK. You need to take care of thread safety though, which will be hard. The other problem is that changing the underlying data doesn't necessarily invalidate caches above it. You'll have to consider that part as well. I suppose this is part of why it was conceived as a model where the data is only periodically re-read -- you gain speed from immutability and cacheability. But you lose, of course, real-time updates. On Fri, Mar 29, 2013 at 5:46 PM, Ceyhun Can ÜLKER ceyhunc...@gmail.com wrote: Hello, I checked the implementation of GenericDataModel for adding and removing preferences after instantiation. Those methods (setPreference(long, long, float) and removePreference(long, long)) throw UnsupportedOperationExceptions. I'd like to know whether there is an important reason for not altering the content of a GenericDataModel, since in our application the data can fit into memory and we want our data to be up to date. The DataModel interface has those methods, and GenericDataModel is just an in-memory implementation of it. Would it be OK if I write an implementation of DataModel like GenericDataModel, but with setPreference and removePreference methods that do not throw exceptions? Thanks, Ceyhun Can ULKER
Re: Number of Clustering MR-Jobs
This is really a Hadoop-level thing. I am not sure I have ever successfully induced M/R to run multiple mappers on less than one block of data, even with a low max split size. Reducers you can control. On Thu, Mar 28, 2013 at 9:04 AM, Sebastian Briesemeister sebastian.briesemeis...@unister-gmbh.de wrote: Thank you. Splitting the files leads to multiple MR tasks! Only changing the MR settings of Hadoop did not help. In the future it would be nice if the drivers would scale themselves and would split the data according to the dataset size and the number of available MR slots.
Re: sql data model w/where clause
Modify the existing code to change the SQL -- it's just a matter of copying a class that only specifies SQL and making new SQL statements. I think there's a version that even reads from a Properties object. On Mon, Mar 25, 2013 at 12:11 AM, Matt Mitchell goodie...@gmail.com wrote: Hi, I have a table of user preferences with the following columns: user_id item_id tag I want to build a data model in mahout, but not use the entire table. I'd like to add a where clause like where tag = 'A' when building the model instance. Is this possible? If not, any way around this besides creating a view or new table? Thanks, Matt
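The "copy the class and change the SQL" idea amounts to swapping in a statement with the extra WHERE clause. A language-neutral sketch of the effect, using an in-memory SQLite table with the same schema (table and column names from the question; this is not Mahout's JDBC model, just an illustration of the query change):

```python
import sqlite3

# In-memory table mirroring the user_id / item_id / tag schema from the question
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prefs (user_id INTEGER, item_id INTEGER, tag TEXT)")
conn.executemany("INSERT INTO prefs VALUES (?, ?, ?)",
                 [(1, 10, 'A'), (1, 11, 'B'), (2, 10, 'A'), (2, 12, 'A')])

# The equivalent of overriding the model's SQL: select only tag = 'A' rows
rows = conn.execute(
    "SELECT user_id, item_id FROM prefs WHERE tag = 'A'").fetchall()
print(rows)
```

In Mahout terms, the JDBC data models take their SQL statements as constructor arguments (or, as noted, from a Properties object in one variant), so only the statements need to change -- not the model logic.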
Re: Mathematical background of ALS recommenders
Points from across several e-mails -- The initial item-feature matrix can be just random unit vectors too. I have slightly better results with that. You are finding the least-squares solution of A = U M' for U, given A and M. Yes, you can derive that analytically as the zero of the derivative of the error function. With m users and n items, and k features, where k=n, I suppose you don't need any iterations at all, since there is a trivial solution: U = A, M = I(n) (the nxn identity matrix). You would not find this on the first iteration, however, if you followed the algorithm, because you would be starting from some random starting point. But if you initialized M to the identity matrix, yes, you'd find the exact solution immediately. Yes, it is an iterative algorithm and you have to define a convergence criterion. I use the average absolute difference in (U M') entries from one iteration to the next. (Well, a sample.) It's possible that you reach your criterion in 1 iteration, or not. It depends on the criterion. Usually when you restart ALS on updated data, you use the previous M matrix as a starting point. If the change in data is small, one iteration should usually leave you still converged, actually. But from a random starting point -- not typical. ALS is similar to the SVD only in broad terms. The SVD is not always used to make a low-rank factorization, and its outputs carry more guarantees -- they are orthonormal bases, because it has factored out the scaling into the third matrix Sigma. I think of ALS as more simplistic and therefore possibly faster. With k features I believe (?) the SVD is necessarily a k-iteration process at least, whereas ALS might be of use after 1 iteration. The SVD is not a shortcut for ALS. If you go to the trouble of a full SVD, you can certainly use that factorization as-is though. You need regularization. It should be pointed out that the ALS often spoken of here is not one that factors the input matrix A.
There's a variant that I have had good results with, for 'implicit' feedback. There, you are actually factoring the matrix P = (1 where A != 0, 0 where A == 0), and using the values in A as weights in the loss function. You're reconstructing "interacted or not" and using the input value as a confidence measure. This works for ratings, although the interpretation in this variant doesn't line up with the nature of ratings. It works quite nicely for things like clicks, etc. (Much more can be said on this point.) On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner cont...@dhuebner.com wrote: It's quite hard for me to get the mathematical concepts of the ALS recommenders. It would be great if someone could help me figure out the details. This is my current status: 1. The item-feature matrix (M) is initialized using the average ratings and random values (explicit case) 2. The user-feature matrix (U) is solved using the partial derivative of the error function with respect to u_i (the row-vectors of U) Suppose we use as many features as there are items and the error function does not use any regularization. Would U be solved within the first iteration? If not, I do not understand why more than one iteration is needed. Furthermore, I believe I have understood that using fewer features than items, and also applying regularization, does not allow solving U in a way that the stopping criterion can be met after only one iteration. Thus, iteration is required to gradually converge to the stopping criterion. I hope I have pointed out my problems clearly enough.
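The alternating scheme and convergence criterion described above can be sketched in NumPy. This is the simplest explicit, all-entries-known variant with toy sizes and an arbitrary lambda -- not Mahout's implementation, and a real A would be sparse with the loss taken over known entries only:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n, k, lam = 8, 6, 3, 0.1
A = rng.random((m, n))        # dense toy "ratings"
M = rng.random((n, k))        # random initial item-feature matrix

def solve_factor(P, F, lam):
    # Least-squares solve for one factor given the other: the zero of the
    # derivative of the regularized squared error, as described above
    k = F.shape[1]
    return P @ F @ np.linalg.inv(F.T @ F + lam * np.eye(k))

prev = None
for it in range(100):
    U = solve_factor(A, M, lam)       # fix M, solve for U
    M = solve_factor(A.T, U, lam)     # fix U, solve for M
    recon = U @ M.T
    # convergence criterion from the thread: average absolute change
    # in the reconstruction from one iteration to the next
    if prev is not None and np.abs(recon - prev).mean() < 1e-8:
        break
    prev = recon

print(np.abs(A - U @ M.T).mean())     # final reconstruction error
```

With M initialized to the identity (and k = n, lam = 0), the first U-solve would indeed return U = A exactly, matching the "trivial solution" remark.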
Re: Mathematical background of ALS recommenders
OK, the 'k iterations' happen inline in one job? I thought the Lanczos algorithm found the k eigenvalues/vectors one after the other. Yeah, I suppose that doesn't literally mean k map/reduce jobs. Yes, the broader idea was whether or not you might get something useful out of ALS earlier. On Mon, Mar 25, 2013 at 11:06 AM, Ted Dunning ted.dunn...@gmail.com wrote: SVD need not be iterative at all. The SSVD code uses roughly 5 map-reduces to give you a high-quality SVD approximation. There is the option to add 0, 1 or more extra iterations, but it is rare to need more than 1. ALS could well be of use after less work. This is especially true for incremental solutions.
Re: Mathematical background of ALS recommenders
On Mon, Mar 25, 2013 at 11:25 AM, Sebastian Schelter s...@apache.org wrote: Well, in LSI it is OK to do that, as a missing entry means that the document contains zero occurrences of a given term, which is totally fine. In collaborative filtering with explicit feedback, a missing rating is not automatically a rating of zero; it is simply unknown what the user would give as a rating. For implicit data (number of interactions), a missing entry is indeed zero, but in most cases you might not have the same confidence in that observation as if you had observed an interaction. Koren's ALS paper discusses this and introduces constructs to handle it, by putting more weight on minimizing the loss over observed interactions. In matrix factorization for CF, the factorization usually has to minimize the regularized loss over the known entries only. If all unknown entries were simply considered zero, I'd assume that the resulting factorization would not generalize very well to unseen data. Some researchers refer to matrix factorization for CF as "matrix completion", which IMHO better describes the problem. Yes, it's just that you shouldn't if inputs are rating-like, not that you literally couldn't. If your input is ratings on a scale of 1-5, then reconstructing a 0 everywhere else means you assume everything not viewed is hated, which doesn't work at all. You can subtract the mean from observed ratings, and then you assume everything unobserved has an average rating. But the assumption works nicely for click-like data. Better still when you can weakly prefer to reconstruct the 0 for missing observations and much more strongly prefer to reconstruct the 1 for observed data.
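The confidence-weighted solve that Koren's paper introduces, as described above, can be sketched for a single user (toy values; alpha and lambda are arbitrary choices, and a real implementation exploits sparsity rather than forming the diagonal matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, lam, alpha = 6, 3, 0.1, 40.0
Y = rng.normal(size=(n, k))                    # current item factors (toy)
a = np.array([0.0, 3.0, 0.0, 1.0, 0.0, 5.0])   # one user's interaction counts

p = (a > 0).astype(float)          # P: 1 if interacted, 0 otherwise
C = np.diag(1.0 + alpha * a)       # confidence: observed entries weigh more

# Weighted, regularized normal equations for this user's factor vector:
# (Y' C Y + lambda I) x = Y' C p
x = np.linalg.solve(Y.T @ C @ Y + lam * np.eye(k), Y.T @ C @ p)
print(x)
```

The unobserved entries still appear in the system (with weight 1), while observed ones are weighted 1 + alpha * count -- the "weakly prefer the 0, strongly prefer the 1" idea in equation form.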
Re: postgres recommendation adapter
Are you using the 'integration' artifact? This is not in 'core'. On Mon, Mar 25, 2013 at 12:43 PM, Matt Mitchell goodie...@gmail.com wrote: Yeah, sorry. I'm attempting to load this class: org.apache.mahout.cf.taste.impl.model.jdbc.PostgreSQLBooleanPrefJDBCDataModel but getting a ClassNotFoundException. I'm using version 0.7 of mahout-core and mahout-math, and version 0.5 of mahout-utils. - Matt On Mon, Mar 25, 2013 at 6:21 AM, Sean Owen sro...@gmail.com wrote: I think you'd have to define "not working" first. On Mon, Mar 25, 2013 at 1:32 AM, Matt Mitchell goodie...@gmail.com wrote: Hi, I've seen references to a postgres user pref class via Google searches, but can't seem to get this to work using mahout-core version 0.7. Could someone describe how to get postgres working with Mahout CF?
Re: Mathematical background of ALS recommenders
(The unobserved entries are still in the loss function, just with low weight. They are also in the system of equations you are solving.) On Mon, Mar 25, 2013 at 1:38 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Classic ALS-WR bypasses the underlearning problem by cutting unrated entries out of the linear equations system. It also still has a well-defined regularization technique which allows finding an optimal fit in theory (but still not in Mahout, not without at least some additional sweat, I heard).
Re: Mathematical background of ALS recommenders
On Mon, Mar 25, 2013 at 1:41 PM, Koobas koo...@gmail.com wrote: But the assumption works nicely for click-like data. Better still when you can weakly prefer to reconstruct the 0 for missing observations and much more strongly prefer to reconstruct the 1 for observed data. This does seem intuitive. How does the benefit manifest itself? In lowering the RMSE of reconstructing the interaction matrix? Are there any indicators that it results in better recommendations? Koobas In this approach you are no longer reconstructing the interaction matrix, so there is no RMSE vs the interaction matrix. You're reconstructing a matrix of 0 and 1. Because entries are weighted differently, you're not even minimizing RMSE over that matrix -- the point is to take some errors more seriously than others. You're minimizing a *weighted* RMSE, yes. Yes, of course the goal is better recommendations. This broader idea is harder to measure. You can use mean average precision to measure the tendency to predict back interactions that were held out. Is it better? Depends on better than *what*. Applying algorithms that treat input like ratings doesn't work as well on click-like data. The main problem is that these will tend to pay too much attention to large values. For example, if an item was clicked 1000 times, and you are trying to actually reconstruct that 1000, then a 10% error costs (0.1*1000)^2 = 10,000. But a 10% error in reconstructing an item that was clicked once costs (0.1*1)^2 = 0.01. The former is considered a million times more important error-wise than the latter, even though the intuition is that it's just 1000 times more important. Better than algorithms that ignore the weight entirely -- yes, probably, if only because you are using more information. But as in all things it depends.
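The cost asymmetry described above is just arithmetic, and is easy to check:

```python
# Squared-error cost of a 10% relative error when reconstructing raw counts
big = (0.1 * 1000) ** 2    # item clicked 1000 times
small = (0.1 * 1) ** 2     # item clicked once
print(big, small, big / small)   # ratio is a million, not a thousand
```

Reconstructing 0/1 with confidence weights instead makes the two errors differ by the weight ratio (roughly a factor of 1000 here), which matches the intuition better.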
Re: Mathematical background of ALS recommenders
If your input is clicks, carts, etc., yes, you ought to get generally good results from something meant to consume implicit feedback, like ALS (for implicit feedback -- yes, there are at least two main variants). I think you are talking about the implicit version since you mention 0/1. lambda is the regularization parameter. It is defined a bit differently in the various papers though. Test a few values if you can. But you said no weights in the regularization... what do you mean? You don't want to disable regularization entirely. On Mon, Mar 25, 2013 at 2:14 PM, Koobas koo...@gmail.com wrote: On Mon, Mar 25, 2013 at 9:52 AM, Sean Owen sro...@gmail.com wrote: On Mon, Mar 25, 2013 at 1:41 PM, Koobas koo...@gmail.com wrote: But the assumption works nicely for click-like data. Better still when you can weakly prefer to reconstruct the 0 for missing observations and much more strongly prefer to reconstruct the 1 for observed data. This does seem intuitive. How does the benefit manifest itself? In lowering the RMSE of reconstructing the interaction matrix? Are there any indicators that it results in better recommendations? Koobas In this approach you are no longer reconstructing the interaction matrix, so there is no RMSE vs the interaction matrix. You're reconstructing a matrix of 0 and 1. Because entries are weighted differently, you're not even minimizing RMSE over that matrix -- the point is to take some errors more seriously than others. You're minimizing a *weighted* RMSE, yes. Yes, of course the goal is better recommendations. This broader idea is harder to measure. You can use mean average precision to measure the tendency to predict back interactions that were held out. Is it better? Depends on better than *what*. Applying algorithms that treat input like ratings doesn't work as well on click-like data. The main problem is that these will tend to pay too much attention to large values.
For example, if an item was clicked 1000 times, and you are trying to actually reconstruct that 1000, then a 10% error costs (0.1*1000)^2 = 10,000. But a 10% error in reconstructing an item that was clicked once costs (0.1*1)^2 = 0.01. The former is considered a million times more important error-wise than the latter, even though the intuition is that it's just 1000 times more important. Better than algorithms that ignore the weight entirely -- yes, probably, if only because you are using more information. But as in all things it depends. Let's say the following. Classic market basket. Implicit feedback. Ones and zeros in the input matrix, no weights in the regularization, lambda=1. What I will get is: A) a reasonable recommender, B) a joke of a recommender.
Re: Boosting User-Based with the user's attributes
You would have to make up the similarity metric separately, since it depends entirely on how you want to define it. The part of the book you are talking about concerns rescoring, which is not the same thing. Combine the similarity metrics, I mean, not make two recommenders. Make a metric that is the product of two other metrics. Normalize both of those metrics to the range [0,1]. Sean On Mon, Mar 18, 2013 at 6:51 AM, Agata Filiana a.filian...@gmail.com wrote: Hi, Thanks Sean for the response. I like the idea of multiplying the similarity metric based on user properties with the one based on CF data. I understand that I have to create a separate similarity metric -- can I do this with the help of Mahout, or does this have to be done separately, as in I have to implement my own similarity measure? It would be great if there is some clue on how to get this started. Is this somehow similar to the subject of *injecting domain-specific information* in the book Mahout in Action (with the example of the gender-based item similarity metric)? And also, how can I multiply the two results -- will this affect the result of the evaluation of the recommender system? Or should it be normalized in a way? Thank you, and sorry for the basic questions. Regards, Agata Filiana
It can also work to treat age and gender and other features as categorical features, and then model them as 'items' that the user interacts with. They would not have much of an effect here given how many items there are. In other models like ALS-WR you can weight these pseudo-items much more highly and get the desired effect to a degree. On Fri, Mar 15, 2013 at 4:37 PM, Agata Filiana a.filian...@gmail.com wrote: Hi, I'm fairly new to Mahout. Right now I am experimenting Mahout by trying to build a simple recommendation system. What I have is just a boolean data set, with only the userID and itemID. I understand that for this case I have to use GenericBooleanPrefUserBasedRecommender - which I have and works fine. Apart from the userID and itemID data, I also have the user's attributes (their age, gender, list of interests). I would like to combine this into the recommendation system to increase the performance of the recommender. Is this possible to do or am I trying something that does not make sense? It would be great if you can give me any inputs or ideas for this. (Or any good read based on this matter) Thank you! Regards, *Agata Filiana* Erasmus Mundus Student -- *Agata Filiana *
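The product-of-metrics idea in this thread can be sketched as follows (Python with made-up data, for brevity; the names here are illustrative and not Mahout's Java UserSimilarity API):

```python
# Hypothetical sketch of Sean's suggestion: a combined user similarity that
# is the product of a CF-based metric and an attribute-based metric, each
# already normalized to [0, 1]. All names and data here are made up.

def jaccard(a, b):
    """Set similarity; always lands in [0, 1]."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Made-up data: items each user touched, and their profile attributes.
items = {"u1": {"i1", "i2", "i3"}, "u2": {"i2", "i3", "i4"}}
hobbies = {"u1": {"chess", "hiking"}, "u2": {"hiking", "piano"}}

def combined_similarity(u, v):
    # Product of two [0, 1] metrics is itself in [0, 1].
    return jaccard(items[u], items[v]) * jaccard(hobbies[u], hobbies[v])

print(round(combined_similarity("u1", "u2"), 3))  # 0.167
```

In Mahout itself the same shape would be a UserSimilarity implementation that delegates to two inner metrics and multiplies their (normalized) results.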
Re: Boosting User-Based with the user's attributes
There is a difference between the recommender and the similarity metric it uses. My suggestion was to either use your item data with the recommender and hobby data with the similarity metric, or, use both in the similarity metric by making a combined metric. On Mon, Mar 18, 2013 at 9:44 AM, Agata Filiana a.filian...@gmail.comwrote: I understand how it works logically. However I am having problem understanding about the implementation of it and how to get the final outcome. Say the user's attribute is Hobbies: hobby1,hobby2,hobby3 So I would make the similarity metric of the users and hobbies. Then for the CF, using Mahout's GenericBooleanPrefUserBasedRecommender with the boolean data set (userID and itemID). Then somehow combine the two? But at the end, my goal is to recommend the items in the second data set (the itemID, not recommend the hobbies) - does this make sense? Or am I confusing myself? Agata On 18 March 2013 14:23, Sean Owen sro...@gmail.com wrote: You would have to make up the similarity metric separately since it depends entirely on how you want to define it. The part of the book you are talking about concerns rescoring, which is not the same thing. Combine the similarity metrics, I mean, not make two recommenders. Make a metric that is the product of two other metrics. Normalize both of those metrics to the range [0,1]. Sean On Mon, Mar 18, 2013 at 6:51 AM, Agata Filiana a.filian...@gmail.com wrote: Hi, Thank Sean for the response. I like the idea of multiplying the similarity metric based on user properties with the one based on CF data. I understand that I have to create a seperate similarity metric - can I do this with the help of Mahout or does this have to be done seperately, as in I have to implement my own similarity measure? It would be great if there is some clue on how I get this started. 
Is this somehow similar to the subject of *Injecting domain-specific information* in the book Mahout in Action (with the example of the gender-based item similarity metric)? And also how can I multiply the two results - will this affect the result of the evaluation of the recommender system? Or it should be normalized in a way? Thank you and sorry for the basic questions. Regards, Agata Filiana On 16 March 2013 13:41, Sean Owen sro...@gmail.com wrote: There are many ways to think about combining these two types of data. If you can make some similarity metric based on age, gender and interests, then you can use it as the similarity metric in GenericBooleanPrefUserBasedRecommender. You would be using both data sets in some way. Of course this means learning a whole different similarity metric somehow. A variant on this is to make a similarity metric based on user properties, and also use one based on CF data, and multiply them together to make a new combined similarity metric for this approach. This might work OK. It can also work to treat age and gender and other features as categorical features, and then model them as 'items' that the user interacts with. They would not have much of an effect here given how many items there are. In other models like ALS-WR you can weight these pseudo-items much more highly and get the desired effect to a degree. On Fri, Mar 15, 2013 at 4:37 PM, Agata Filiana a.filian...@gmail.com wrote: Hi, I'm fairly new to Mahout. Right now I am experimenting Mahout by trying to build a simple recommendation system. What I have is just a boolean data set, with only the userID and itemID. I understand that for this case I have to use GenericBooleanPrefUserBasedRecommender - which I have and works fine. Apart from the userID and itemID data, I also have the user's attributes (their age, gender, list of interests). I would like to combine this into the recommendation system to increase the performance of the recommender. 
Is this possible to do or am I trying something that does not make sense? It would be great if you can give me any inputs or ideas for this. (Or any good read based on this matter) Thank you! Regards, *Agata Filiana* Erasmus Mundus Student
Re: ALS-WR on Million Song dataset
One word of caution is that there are at least two papers on ALS and they define lambda differently. I think you are talking about Collaborative Filtering for Implicit Feedback Datasets. I've been working with some folks who point out that alpha=40 seems to be too high for most data sets. After running some tests on common data sets, alpha=1 looks much better. YMMV. In the end you have to evaluate these two parameters, and the # of features, across a range to determine what's best. Is this data set not a bunch of audio features? I am not sure it works for ALS, not naturally at least. On Mon, Mar 18, 2013 at 12:39 PM, Han JU ju.han.fe...@gmail.com wrote: Hi, I'm wondering whether anyone has tried the ParallelALS implicit feedback job on the Million Song dataset? Some pointers on alpha and lambda? In the paper alpha is 40 and lambda is 150, but I don't know what their r values are in the matrix. They said it is based on time units that users have watched the show, so maybe they're big. Many thanks! -- *JU Han* UTC - Université de Technologie de Compiègne * **GI06 - Fouille de Données et Décisionnel* +33 061960
Re: ALS-WR on Million Song dataset
Yes that's fine input then. Large alpha should go with small R values, not large R. Really alpha controls how much observed input (R != 0) is weighted towards 1 versus how much unobserved input (R=0) is weighted to 0. I scale lambda by alpha to complete this effect. On Mon, Mar 18, 2013 at 1:06 PM, Han JU ju.han.fe...@gmail.com wrote: Thanks for the quick responses. Yes it's that dataset. What I'm using is triplets of user_id song_id play_times, of ~ 1m users. No audio things, just plain text triples. It seems to me that the paper about implicit feedback matches this dataset well: no explicit ratings, but counts of listens to a song. Thank you Sean for the alpha value, I think they use big numbers because their values in the R matrix are big. 2013/3/18 Sebastian Schelter ssc.o...@googlemail.com JU, are you referring to this dataset? http://labrosa.ee.columbia.edu/millionsong/tasteprofile On 18.03.2013 17:47, Sean Owen wrote: One word of caution is that there are at least two papers on ALS and they define lambda differently. I think you are talking about Collaborative Filtering for Implicit Feedback Datasets. I've been working with some folks who point out that alpha=40 seems to be too high for most data sets. After running some tests on common data sets, alpha=1 looks much better. YMMV. In the end you have to evaluate these two parameters, and the # of features, across a range to determine what's best. Is this data set not a bunch of audio features? I am not sure it works for ALS, not naturally at least. On Mon, Mar 18, 2013 at 12:39 PM, Han JU ju.han.fe...@gmail.com wrote: Hi, I'm wondering whether anyone has tried the ParallelALS implicit feedback job on the Million Song dataset? Some pointers on alpha and lambda? In the paper alpha is 40 and lambda is 150, but I don't know what their r values are in the matrix. They said it is based on time units that users have watched the show, so maybe they're big. Many thanks!
-- *JU Han* UTC - Université de Technologie de Compiègne * **GI06 - Fouille de Données et Décisionnel* +33 061960 -- *JU Han* Software Engineer Intern @ KXEN Inc. UTC - Université de Technologie de Compiègne * **GI06 - Fouille de Données et Décisionnel* +33 061960
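The preference/confidence split discussed here comes from the implicit-feedback ALS paper (Hu, Koren, Volinsky): raw counts r are not reconstructed directly, they become confidence weights on binary preferences. A minimal sketch of that weighting:

```python
# Sketch of the weighting in "Collaborative Filtering for Implicit Feedback
# Datasets": a count r becomes a binary preference p plus a confidence c.
# alpha=40 is the paper's value; alpha=1 is the smaller value suggested above.

def preference(r):
    """p = 1 if the user interacted at all, else 0."""
    return 1.0 if r > 0 else 0.0

def confidence(r, alpha):
    """c = 1 + alpha*r: observed input is weighted up, unobserved stays at 1."""
    return 1.0 + alpha * r

for plays in [0, 1, 5, 100]:   # made-up play counts
    print(plays, preference(plays), confidence(plays, alpha=1.0))
```

This is why large alpha suits small R values: with big raw counts, even a modest alpha already produces enormous confidences.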
Re: Boosting User-Based with the user's attributes
I'm not sure what you mean. The only thing I am suggesting to combine are two similarity metrics, not data or recommendations. You combine metrics by multiplying their values. On Mon, Mar 18, 2013 at 12:54 PM, Agata Filiana a.filian...@gmail.comwrote: In this case, would be correct if I somehow loop through the item data and the hobby data and then combine the score for a pair of users? I am having trouble in how to combine both similarity into one metric, could you possibly point me out a clue? Thank you On 18 March 2013 14:54, Sean Owen sro...@gmail.com wrote: There is a difference between the recommender and the similarity metric it uses. My suggestion was to either use your item data with the recommender and hobby data with the similarity metric, or, use both in the similarity metric by making a combined metric. On Mon, Mar 18, 2013 at 9:44 AM, Agata Filiana a.filian...@gmail.com wrote: I understand how it works logically. However I am having problem understanding about the implementation of it and how to get the final outcome. Say the user's attribute is Hobbies: hobby1,hobby2,hobby3 So I would make the similarity metric of the users and hobbies. Then for the CF, using Mahout's GenericBooleanPrefUserBasedRecommender with the boolean data set (userID and itemID). Then somehow combine the two? But at the end, my goal is to recommend the items in the second data set (the itemID, not recommend the hobbies) - does this make sense? Or am I confusing myself? Agata On 18 March 2013 14:23, Sean Owen sro...@gmail.com wrote: You would have to make up the similarity metric separately since it depends entirely on how you want to define it. The part of the book you are talking about concerns rescoring, which is not the same thing. Combine the similarity metrics, I mean, not make two recommenders. Make a metric that is the product of two other metrics. Normalize both of those metrics to the range [0,1]. 
Re: reproducibility
What's your question? ALS has a random starting point which changes the results a bit. Not sure about KNN though. On Sun, Mar 17, 2013 at 3:03 AM, Koobas koo...@gmail.com wrote: Can anybody shed any light on the issue of reproducibility in Mahout, with and without Hadoop, specifically in the context of kNN and ALS recommenders?
Re: reproducibility
If an algorithm has a stochastic/random element, no, it won't necessarily produce the same result, by design. If you can fix the seed of the random number generator, you should get the same result. Except that if the process is multi-threaded or distributed, even that doesn't guarantee it -- the RNG could be accessed in a different order. Even if you can control your code it can be hard to control the RNGs in third-party libraries. Even in a deterministic single-threaded program Java's floating point results are not guaranteed to be the same across platforms (unless you use strictfp). ALS definitely has a random starting point, so reproducibility is not guaranteed even from the top. If you fix the random seed in the context of this project's unit tests, you *should* get the same result since I think it manages to use no third-party RNGs and runs a test from a fixed starting point in 1 thread. KNN does not have a stochastic element. I think you would get the same results on one platform, unless I'm missing something. I don't think exact reproducibility is an issue. Especially at scale, where the entire computation is distributed over such a complex cluster environment. Most ML is about guessing at what's not known anyway. As long as very small differences make only very small differences in the outcome, differing FP behavior will make no or vanishingly small difference. The only place where I think FP reproducibility matters -- of the sort that numerical libraries care about -- is in under/overflow issues. But that is solved by moving into a log space or something. You would never want to depend on the nth significant digit of a float mattering.
I am mostly revolving in the space of numerical libraries, where reproducibility is, sort of, a big deal. Maybe it's not much of a concern in machine learning. I am just curious. On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen sro...@gmail.com wrote: What's your question? ALS has a random starting point which changes the results a bit. Not sure about KNN though. On Sun, Mar 17, 2013 at 3:03 AM, Koobas koo...@gmail.com wrote: Can anybody shed any light on the issue of reproducibility in Mahout, with and without Hadoop, specifically in the context of kNN and ALS recommenders?
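The fixed-seed point above, as a minimal illustration (Python for brevity; the same idea applies to Java's Random, and `random_init` is a hypothetical stand-in, not Mahout code):

```python
# A fixed seed makes a single-threaded stochastic computation repeat exactly;
# an unseeded run generally will not. random_init stands in for something
# like ALS's random starting factors (hypothetical, not Mahout code).
import random

def random_init(n, seed=None):
    rng = random.Random(seed)   # a private, seeded generator
    return [rng.random() for _ in range(n)]

print(random_init(5, seed=42) == random_init(5, seed=42))  # True
```

As the reply notes, this guarantee evaporates once multiple threads or machines pull from the generator in nondeterministic order.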
Re: Boosting User-Based with the user's attributes
There are many ways to think about combining these two types of data. If you can make some similarity metric based on age, gender and interests, then you can use it as the similarity metric in GenericBooleanPrefUserBasedRecommender. You would be using both data sets in some way. Of course this means learning a whole different similarity metric somehow. A variant on this is to make a similarity metric based on user properties, and also use one based on CF data, and multiply them together to make a new combined similarity metric for this approach. This might work OK. It can also work to treat age and gender and other features as categorical features, and then model them as 'items' that the user interacts with. They would not have much of an effect here given how many items there are. In other models like ALS-WR you can weight these pseudo-items much more highly and get the desired effect to a degree. On Fri, Mar 15, 2013 at 4:37 PM, Agata Filiana a.filian...@gmail.comwrote: Hi, I'm fairly new to Mahout. Right now I am experimenting Mahout by trying to build a simple recommendation system. What I have is just a boolean data set, with only the userID and itemID. I understand that for this case I have to use GenericBooleanPrefUserBasedRecommender - which I have and works fine. Apart from the userID and itemID data, I also have the user's attributes (their age, gender, list of interests). I would like to combine this into the recommendation system to increase the performance of the recommender. Is this possible to do or am I trying something that does not make sense? It would be great if you can give me any inputs or ideas for this. (Or any good read based on this matter) Thank you! Regards, *Agata Filiana* Erasmus Mundus Student
Re: QR decomposition in ALS-WR code
I think you are referring to the same step? QR decomposition is how you solve for u_i, which I imagine is the same step you have in mind.
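For illustration, here is how a QR decomposition solves one user's least-squares system in ALS (a NumPy sketch of the usual normal-equations formulation, not the Mahout code itself; all sizes and data are made up):

```python
# With item factors M fixed, each user vector u solves the regularized
# normal equations (M'M + lam*I) u = M'p. A QR decomposition solves this
# without ever forming an explicit inverse.
import numpy as np

rng = np.random.default_rng(0)
k, n_items, lam = 3, 8, 0.1
M = rng.standard_normal((n_items, k))   # fixed item-factor matrix
p = rng.standard_normal(n_items)        # this user's preference vector

A = M.T @ M + lam * np.eye(k)           # left-hand side of the normal equations
b = M.T @ p                             # right-hand side

Q, R = np.linalg.qr(A)                  # A = QR, R upper triangular
u = np.linalg.solve(R, Q.T @ b)         # back-substitute R u = Q'b

print(np.allclose(A @ u, b))   # True: u solves the system
```

The lam*I term also keeps A well-conditioned, which is what makes the solve (or an inverse) safe in the first place.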
Re: Mahout and Hadoop 2
I think someone submitted a different build profile that changes the dependencies for you. I believe the issue is using hadoop-common and not hadoop-core as well as changing versions. I think the rest is compile compatible and probably runtime compatible. But I've not tried. On Wed, Mar 13, 2013 at 7:58 PM, Jian Fang jian.fang.subscr...@gmail.com wrote: Hi, Is there any way to make Mahout 0.7 or 0.8 work with Hadoop 2.0.2-alpha? It seems Mahout builds against Hadoop 1.x by default in the pom.xml and it also requires hadoop-core.jar, which only exists in Hadoop 1.x if I remember correctly. Thanks, Jian
Re: Top-N recommendations from SVD
Yeah that's right, he said 20 features, oops. And yes he says he's talking about the recs only too, so that's not right either. That seems way too long relative to factorization. And the factorization seems quite fast; how many machines, and how many iterations? I thought the shape of the computation was to cache B' (yes whose columns are B rows) and multiply against the rows of A. There again probably wrong given the latest timing info. On Wed, Mar 6, 2013 at 10:25 AM, Josh Devins h...@joshdevins.com wrote: So the 80 hour estimate is _only_ for the U*M', top-n calculation and not the factorization. Factorization is on the order of 2-hours. For the interested, here's the pertinent code from the ALS `RecommenderJob`: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148 I'm sure this can be optimised, but by an order of magnitude? Something to try out, I'll report back if I find anything concrete. On 6 March 2013 11:13, Ted Dunning ted.dunn...@gmail.com wrote: Well, it would definitely not be the first time I counted incorrectly. Anytime I do arithmetic the result should be considered suspect. I do think my numbers are correct, but then again, I always do. But the OP did say 20 dimensions which gives me back 5x. Inclusion of learning time is a good suspect. On the other side of the ledger, if the multiply is doing any column-wise access it is a likely performance bug. The computation is AB'. Perhaps you refer to rows of B which are the columns of B'. Sent from my sleepy thumbs set to typing on my iPhone. On Mar 6, 2013, at 4:16 AM, Sean Owen sro...@gmail.com wrote: If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728 Tflops -- I think you're missing an M, and the features by an order of magnitude. That's still 1 day on an 8-core machine by this rule of thumb. The 80 hours is the model building time too (right?), not the time to multiply U*M'.
This is dominated by iterations when building from scratch, and I expect took 75% of that 80 hours. So if the multiply was 20 hours -- on 10 machines -- on Hadoop, then that's still slow but not out of the question for Hadoop, given it's usually a 3-6x slowdown over a parallel in-core implementation. I'm pretty sure what exists in Mahout here can be optimized further at the Hadoop level; I don't know that it's doing the multiply badly though. In fact I'm pretty sure it's caching cols in memory, which is a bit of 'cheating' to speed up by taking a lot of memory. On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning ted.dunn...@gmail.com wrote: Hmm... each user's recommendations seem to be about 2.8M x 20 = ~60M Flops. You should get about a Gflop per core in Java so this should take about 60 ms. You can make this faster with more cores or by using ATLAS. Are you expecting 3 million unique people every 80 hours? If no, then it is probably more efficient to compute the recommendations on the fly. How many recommendations per second are you expecting? If you have 1 million uniques per day (just for grins) and we assume 20,000 s/day to allow for peak loading, you have to do 50 queries per second peak. This seems to require 3 cores. Use 16 to be safe. Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50 hours. I think that your map-reduce is underperforming by about a factor of 10. This is quite plausible with bad arrangement of the inner loops. I think that you would have highest performance computing the recommendations for a few thousand items by a few thousand users at a time. It might be just about as fast to do all items against a few users at a time. The reason for this is that dense matrix multiply requires c n x k + m x k memory ops, but n x k x m arithmetic ops. If you can re-use data many times, you can balance memory channel bandwidth against CPU speed. Typically you need 20 or more re-uses to really make this fly.
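The arithmetic in this thread can be checked mechanically (pure back-of-envelope, using only the figures quoted above):

```python
# Back-of-envelope for the full U*M' top-N pass, using the numbers quoted
# in this thread: 2.6M users, 2.8M items, 20 features, and Ted's rough
# ~1 Gflop/s per core for Java.
users, items, features = 2.6e6, 2.8e6, 20

flops_per_user = items * features        # one k-dim dot product per item
total_flops = users * flops_per_user
seconds_one_core = total_flops / 1e9     # at ~1 Gflop/s

print(flops_per_user / 1e6)               # 56.0 Mflops/user, Ted's "~60M"
print(round(seconds_one_core / 3600, 1))  # single-core hours, roughly 40
```

So one fast core should finish in roughly 40-50 hours, which is why an 80-hour estimate across 10 mappers looks like roughly a 10x shortfall.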
Re: Top-N recommendations from SVD
OK and he mentioned that 10 mappers were running, when it ought to be able to use several per machine. The # of mappers is a function of the input size really, so he probably needs to turn down the max file split size to induce more mappers? On Wed, Mar 6, 2013 at 11:16 AM, Sebastian Schelter ssc.o...@googlemail.com wrote: Btw, all important jobs in ALS are map-only, so it's the number of map slots that counts.
Re: Top-N recommendations from SVD
That too, even better. Isn't that already done? Could be in one place but not another. IIRC there were also cases where it was a lot easier to pass around an object internally and mutability solved the performance issue, without much risk since it was only internal. You can (nay, must) always copy the objects before being returned. On Wed, Mar 6, 2013 at 4:01 PM, Ted Dunning ted.dunn...@gmail.com wrote: I would recommend against a mutable object on maintenance grounds. Better is to keep the threshold that a new score must meet and only construct the object on need. That cuts the allocation down to negligible levels. On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen sro...@gmail.com wrote: OK, that's reasonable on 35 machines. (You can turn up to 70 reducers, probably, as most machines can handle 2 reducers at once). I think the recommendation step loads one whole matrix into memory. You're not running out of memory but if you're turning up the heap size to accommodate, you might be hitting swapping, yes. I think (?) the conventional wisdom is to turn off swap for Hadoop. Sebastian yes that is probably a good optimization; I've had good results reusing a mutable object in this context. On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins h...@joshdevins.com wrote: The factorization at 2-hours is kind of a non-issue (certainly fast enough). It was run with (if I recall correctly) 30 reducers across a 35 node cluster, with 10 iterations. I was a bit shocked at how long the recommendation step took and will throw some timing debug in to see where the problem lies exactly. There were no other jobs running on the cluster during these attempts, but it's certainly possible that something is swapping or the like. I'll be looking more closely today before I start to consider other options for calculating the recommendations.
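Ted's threshold trick can be sketched like this (illustrative Python, not Mahout's implementation): keep only the current top-n cutoff, and do any real work for a candidate only when its score beats it.

```python
# Top-N selection that touches most candidates with a single comparison
# against the current cutoff; real work happens only for winners.
import heapq

def top_n(scored_items, n):
    """scored_items yields (item, score); returns the n best, highest first."""
    heap = []  # (score, item) pairs; heap[0][0] is the current cutoff
    for item, score in scored_items:
        if len(heap) < n:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            # Only now allocate/construct anything for this candidate.
            heapq.heapreplace(heap, (score, item))
    return sorted(heap, reverse=True)

candidates = [("i%d" % i, (i * 37) % 100) for i in range(1000)]  # made-up scores
print(top_n(candidates, 3))
```

Compared with building a result object per candidate, this keeps allocation (and GC pressure) negligible, which is the maintenance-friendly alternative to a reused mutable object.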
Re: Top-N recommendations from SVD
Without any tricks, yes you have to do this much work to really know which are the largest values in UM' for every row. There's not an obvious twist that speeds it up. (Do you really want to compute all user recommendations? how many of the 2.6M are likely to be active soon, or ever?) First, usually it's only a subset of all items that are recommendable anyway. You don't want them out of the model but don't need to consider them. This is domain specific of course, but, if 90% of the items are out of stock or something, of course you can not bother to score them in the first place. Yes, LSH is exactly what I do as well. You hash the item feature vectors into buckets and then only iterate over nearby buckets to find candidates. You can avoid looking at 90+% of candidates this way without much if any impact on top N. Pruning is indeed third on the list but usually you get the problem to a pretty good size from the points above. On Tue, Mar 5, 2013 at 9:15 PM, Josh Devins h...@joshdevins.com wrote: Hi all, I have a conceptually simple problem. A user-item matrix, A, whose dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS with 20 features reduces this in the usual way to A ≈ UM'. Trying to generate top-n (where n=100) recommendations for all users in U is quite a long process though. Essentially, for every user, it's generating a prediction for all unrated items in M then taking the top-n (all in-memory). I'm using the standard ALS `RecommenderJob` for this. Considering that there are ~2.6M users and ~2.8M items, this is a really, really time-consuming way to find the top-n recommendations for all users in U. I feel like there could be a tricky way to avoid having to compute all item predictions of a user though. I can't find any reference in papers about improving this but at the moment, the estimate (with 10 mappers running the `RecommenderJob`) is ~80 hours.
When I think about this problem I wonder if applying kNN or locality-sensitive hashing (min-hashing) would somehow help me. Basically find the nearest neighbours directly and calculate predictions on those items only and not every item in M. On the flip side, I could start to reduce the item space, since it's quite large, basically start removing items that have low in-degrees since these probably don't contribute too much to the final recommendations. I don't like this so much though as it could remove some of the long-tail recommendations. At least, that is my intuition :) Thoughts anyone? Thanks in advance, Josh
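The LSH scheme described above ("hash the item feature vectors into buckets and then only iterate over nearby buckets") can be sketched with random hyperplanes, a common signed-random-projection variant. This is an illustration with made-up data, not the poster's implementation:

```python
# Random-hyperplane LSH sketch: each vector hashes to one bit per plane
# (the sign of the dot product); candidate generation scans one bucket
# instead of all items.
import random

def make_hyperplanes(n_planes, dim, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def lsh_key(vec, planes):
    bits = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

planes = make_hyperplanes(n_planes=4, dim=3)
item_vecs = {"i%d" % i: [random.Random(i).gauss(0, 1) for _ in range(3)]
             for i in range(100)}          # made-up item feature vectors

buckets = {}
for item, vec in item_vecs.items():
    buckets.setdefault(lsh_key(vec, planes), []).append(item)

user_vec = item_vecs["i7"]
candidates = buckets[lsh_key(user_vec, planes)]   # only this bucket is scored
print("i7" in candidates, len(candidates), "of", len(item_vecs))
```

With more planes the buckets shrink (fewer candidates, more recall risk); in practice one also probes neighboring buckets, as the reply above describes.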
Re: Top-N recommendations from SVD
Ah OK, so this is quite a big problem. Still, it is quite useful to be able to make recommendations in real-time, or near-real-time. It saves the relatively quite large cost of precomputing, and lets you respond immediately to new data. If the site has a lot of occasional or new users, that can make a huge difference -- if I visit once, or once a month, precomputing recommendations every day from tomorrow doesn't help much. Of course, that can be difficult to reconcile with 100ms response times, but with some tricks like LSH and some reasonable hardware I think you'd find it possible at this scale. It does take a lot of engineering. On Tue, Mar 5, 2013 at 9:43 PM, Josh Devins h...@joshdevins.com wrote: Thanks Sean, at least I know I'm mostly on the right track ;) So in our case (a large, social, consumer website), this is already a small subset of all users (and items for that matter) and is really only the active users. In fact, in future iterations, the number of users will likely grow by around 3x (or at least, that's my optimistic target). So it's not very likely to be able to calculate recommendations for fewer users, but I like the idea of leaving all items in the matrix but not computing preference predictions for all of them. I will think on this and see if it fits for our domain (probably will work), and maybe a pull request to Mahout if I can make this generic in some way! LSH was my instinctual approach also but wasn't totally sure if this was sane! I'll have a look into this as well if needed. Thanks for the advice! Josh On 5 March 2013 22:23, Sean Owen sro...@gmail.com wrote: Without any tricks, yes you have to do this much work to really know which are the largest values in UM' for every row. There's not an obvious twist that speeds it up. (Do you really want to compute all user recommendations? how many of the 2.6M are likely to be active soon, or, ever?) First, usually it's only a subset of all items that are recommendable anyway. 
You don't want them out of the model but don't need to consider them. This is domain specific of course, but, if 90% of the items are out of stock or something, of course you can not bother to score them in the first place Yes, LSH is exactly what I do as well. You hash the item feature vectors into buckets and then only iterate over nearby buckets to find candidates. You can avoid looking at 90+% of candidates this way without much if any impact on top N. Pruning is indeed third on the list but usually you get the problem to a pretty good size from the points above. On Tue, Mar 5, 2013 at 9:15 PM, Josh Devins h...@joshdevins.com wrote: Hi all, I have a conceptually simple problem. A user-item matrix, A, whose dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS with 20 features reduces this in the usual way to A = UM'. Trying to generate top-n (where n=100) recommendations for all users in U is quite a long process though. Essentially, for every user, it's generating a prediction for all unrated items in M then taking the top-n (all in-memory). I'm using the standard ALS `RecommenderJob` for this. Considering that there are ~2.6M users and ~2.8M items, this is a really, really, time consuming way to find the top-n recommendations for all users in U. I feel like there could be a tricky way to avoid having to compute all item predictions of a user though. I can't find any reference in papers about improving this but at the moment, the estimate (with 10 mappers running the `RecommenderJob`) is ~80 hours. When I think about this problem I wonder if applying kNN or local sensitive min-hashing would somehow help me. Basically find the nearest neighbours directly and calculate predictions on those items only and not every item in M. On the flip side, I could start to reduce the item space, since it's quite large, basically start removing items that have low in-degrees since these probably don't contribute too much to the final recommendations. 
I don't like this so much though as it could remove some of the long-tail recommendations. At least, that is my intuition :) Thoughts anyone? Thanks in advance, Josh
Re: FileDataModel
That's true, it does now. Depending on the implementation, you may still need to rebuild things to reflect the changes. Also note that this wouldn't invalidate caches you put on top. On Sun, Mar 3, 2013 at 7:55 AM, Nadia Najjar ned...@gmail.com wrote: Thanks, Sean! The remove/setPreference methods throw an UnsupportedOperationException. I read in an old thread that you had updated these methods to work. I'm not sure what I'm missing here. Can you point me in the right direction? On Mar 2, 2013, at 6:42 AM, Sean Owen wrote: Yes to integrate any new data everything must be reloaded. On Mar 2, 2013 6:34 AM, Nadia Najjar ned...@gmail.com wrote: I am using a FileDataModel and remove and insert preferences before estimating preferences. Do I need to rebuild the recommender after these methods are called for it to be reflected in the prediction?
Re: FileDataModel
Yes to integrate any new data everything must be reloaded. On Mar 2, 2013 6:34 AM, Nadia Najjar ned...@gmail.com wrote: I am using a FileDataModel and remove and insert preferences before estimating preferences. Do I need to rebuild the recommender after these methods are called for it to be reflected in the prediction?
Re: Hadoop version compatibility
Although I don't know of any specific incompatibility, I would not be surprised. 0.18 is pretty old. As you can see in pom.xml, it currently works against the latest stable version, 1.1.1. On Sat, Mar 2, 2013 at 6:16 PM, MARCOS UBIRAJARA marcosubiraj...@ig.com.br wrote: Dear Gentlemen, First of all, many thanks for this active and vibrant community, and for the Mahout creators as well. I'm taking the first steps with mahout and hadoop, so that I can go ahead with my research. I'm facing some problems with mahout 0.7 and hadoop 0.18. Please let me know if both are compatible, and if not, what hadoop version is compatible with mahout 0.7? Thanks in advance for your help; for sure it will be very helpful. Marcos, Manaus, Amazon - Brasil
Re: How to remove popular items?
It's true, although many of the algorithms will by nature not emphasize popular items. There is an old and semi-deprecated class in the project called InverseUserFrequency, which you can use to manually de-emphasize popular items internally. I wouldn't really recommend it. You can always use IDRescorer, yes. If you have business rules that dictate some things must be filtered, that's the right way to go. As purely a tool to demote popular items... it's a bit heavy-handed and not the ideal way to solve it. On Wed, Feb 27, 2013 at 1:39 PM, Aleksei Udatšnõi a.udac...@gmail.com wrote: Consider using IDRescorer to penalize or skip items. On Mon, Feb 4, 2013 at 6:54 PM, Zia mel ziad.kame...@gmail.com wrote: Hi, is there a current way to remove the popular items in the recommendations? Something like STOP words. Thanks!
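The IDRescorer approach boils down to two methods, isFiltered and rescore. Here is a standalone sketch of a popularity penalizer in that shape; the popularity counts, hard limit, and log-damping formula are all illustrative assumptions, and the Mahout interface itself is deliberately not imported:

```java
import java.util.Map;

// Standalone sketch of the IDRescorer idea: veto or demote popular items.
// The damping formula and hard limit are illustrative, not from Mahout.
public class PopularityRescorer {
    private final Map<Long, Integer> popularity; // itemID -> # of preferences
    private final int hardLimit;                 // veto items above this count

    public PopularityRescorer(Map<Long, Integer> popularity, int hardLimit) {
        this.popularity = popularity;
        this.hardLimit = hardLimit;
    }

    // Mirrors IDRescorer.isFiltered: true means "never recommend this item"
    public boolean isFiltered(long itemID) {
        return popularity.getOrDefault(itemID, 0) > hardLimit;
    }

    // Mirrors IDRescorer.rescore: damp the score by the log of popularity
    public double rescore(long itemID, double originalScore) {
        int count = popularity.getOrDefault(itemID, 0);
        return originalScore / (1.0 + Math.log(1.0 + count));
    }
}
```

Passed to `recommender.recommend(userID, n, rescorer)`, something of this shape demotes the head of the popularity distribution without removing it outright.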
Re: Vector distance within a cluster
A common measure of cluster coherence is the mean distance or mean squared difference between the members and the cluster centroid. It sounds like this is the kind of thing you're measuring with these all-pairs distances. That could be a measure too; I've usually seen that done by taking the maximum such intracluster distance, the 'diameter'. To answer Ted's question -- you're measuring internal consistency. You're not trying to find clusters that match some external standard that says these 100 docs should cluster together, etc. I'm speaking off the cuff, but I think the idea was that L1/Manhattan distance may give you clusters that tend to spread out over fewer rather than more dimensions, and so that may make them more interpretable -- because they will tend to be nearly identical in the other several dimensions, and those homogeneous dimensions tell you what they're about. The reason is that L1 is indifferent across dimensions -- moving a unit in any dimension makes you a unit further/closer from another point -- while in L2, moving along a dimension where you are already close does little. On Wed, Feb 27, 2013 at 3:23 PM, Chris Harrington ch...@heystaks.com wrote: Hmmm, you may have to dumb things down for me here. I don't have much of a background in the area of ML and I'm just piecing things together and learning as I go. So I don't really understand what you mean by "coherence against an external standard", "internal consistency/homogeneity", or "one thought along these lines is to add L_1 regularization to the k-means algorithm". Is L_1 regularization the same as Manhattan distance? That aside, I'm outputting a file with the top terms and the text of 20 random documents that ended up in that cluster and eyeballing that; not very high-tech or efficient, but it was the only way I knew to make a relevance judgment on a cluster topic. 
For example, if the majority of the samples are sport-related and 82.6% of the vector distances in my cluster are quite similar, I'm happy to call that cluster sport.
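The two coherence measures mentioned above -- mean member-to-centroid distance and the maximum pairwise distance, the 'diameter' -- are straightforward to compute. A minimal standalone sketch:

```java
// Two intracluster coherence measures: mean member-to-centroid distance,
// and the cluster 'diameter' (maximum pairwise member distance).
public class ClusterCoherence {
    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(s);
    }

    static double meanDistToCentroid(double[][] members) {
        int dim = members[0].length;
        double[] centroid = new double[dim];
        for (double[] m : members) {
            for (int i = 0; i < dim; i++) {
                centroid[i] += m[i] / members.length;
            }
        }
        double total = 0.0;
        for (double[] m : members) {
            total += dist(m, centroid);
        }
        return total / members.length;
    }

    static double diameter(double[][] members) {
        double max = 0.0;
        for (int i = 0; i < members.length; i++) {
            for (int j = i + 1; j < members.length; j++) {
                max = Math.max(max, dist(members[i], members[j]));
            }
        }
        return max;
    }
}
```

The diameter loop is O(n^2) in cluster size, which is exactly why the mean-to-centroid version is the usual choice for large clusters.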
Re: Cross recommendation
I may not be 100% following the thread, but: Similarity metrics won't care whether some items are really actions and some items are items. The math is the same. The problem which you may be alluding to is the one I mentioned earlier -- there is no connection between item and item-action in the model, when there plainly is in real life. The upside is what Ted mentions: you get to treat actions like views separately from purchases, and yes, it's also certain those aren't the same thing in real life. YMMV. The piece of code you're playing with has nothing to do with latent factor models and won't learn weights. It's going to assume by default that all items (+actions) are equal. (user+action, item) doesn't make sense. You compute item-item similarity from (user, item+action) data. Some of the results are really item-action similarities or action-action. It may be useful, maybe not, to know these things too, but you can just look at item-item if you want. On Sun, Feb 24, 2013 at 4:39 PM, Pat Ferrel pat.fer...@gmail.com wrote: Yes, I understand that you need (user, item+action) input for user-based recs returned from recommender.recommend(userID, n). But can you expect item similarity to work with the same input? I am fuzzy about how item similarity is calculated in cf/taste. I was expecting to train one recommender with (user, item+action) and call recommender1.recommend(userID, n) to get recs, but also train another recommender with (user+action, item) to get recommender2.mostSimilarItems(itemID, n). I realize it's a hack, but that aside, is this second recommender required? I'd expect it to return items that use all actions to calculate similarity and therefore will use view information to improve the similarity calculation. No? On Feb 23, 2013, at 10:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: No. It is uniformly better to have (item+action, user). In fact, I would prefer to have it the other way around when describing it, to match the matrix row x column convention. 
(user, item+action), where action is binary, leads to A = [ A_1 | A_2 ], a user-by-(2 x item) matrix. The alternative of (user+action, item) leads to A = [ A_1 ; A_2 ], i.e. A_1 stacked on top of A_2, a (2 x user)-by-item matrix. This last form doesn't have a uniform set of users to connect the items together. When you compute the cooccurrence matrix you get A_1' A_1 + A_2' A_2, which gives you recommendations from 1->1 and from 2->2, but no recommendations 1->2 or 2->1. Thus, no cross recommendations. On Sat, Feb 23, 2013 at 10:39 AM, Pat Ferrel pat.fer...@gmail.com wrote: But the discussion below led me to realize that cf/taste is doing something in addition to [B'B] h_p, which returns user-history-based recs. I'm getting better results currently from item-similarity-based recs, which I blend with user-history-based recs. To get item-similarity-based recs, cf/taste is using a similarity metric, and I'd guess that it uses the input matrix to get these results (something like the dot product for cosine). For item similarity, should I create a training set of (item, user+action)?
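Ted's block-matrix argument can be written out explicitly. With A_1 and A_2 the per-action user-item matrices, the side-by-side form keeps the cross-cooccurrence blocks while the stacked form loses them:

```latex
% Side-by-side placement: users connect everything, and the cooccurrence
% matrix contains the cross blocks A_1^T A_2 (the cross recommendations).
A = \begin{bmatrix} A_1 & A_2 \end{bmatrix}
\qquad\Rightarrow\qquad
A^{\top}A =
\begin{bmatrix}
  A_1^{\top}A_1 & A_1^{\top}A_2 \\
  A_2^{\top}A_1 & A_2^{\top}A_2
\end{bmatrix}

% Stacked placement: the cross blocks never appear, so no cross recommendations.
A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}
\qquad\Rightarrow\qquad
A^{\top}A = A_1^{\top}A_1 + A_2^{\top}A_2
```

The off-diagonal blocks A_1'A_2 are exactly the item-to-item-action cooccurrences; only the (user, item+action) layout produces them.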
Re: GenericUserBasedRecommender vs GenericItemBasedRecommender
It's also valid, yes. The difference is partly due to asymmetry, but also just historical (i.e. no great reason). The item-item system uses a different strategy for picking candidates based on CandidateItemsStrategy. On Thu, Feb 21, 2013 at 2:37 PM, Koobas koo...@gmail.com wrote: In the GenericUserBasedRecommender, the concept of a neighborhood seems to be fundamental. I.e., it is a classic implementation of the kNN algorithm. But that is not the case with the GenericItemBasedRecommender. I understand that the two approaches are not meant to be completely symmetric, but still, wouldn't it make sense, from a performance perspective, to compute items' neighborhoods first, and then use them to compute recommendations? If kNN were run on items first, then every item-item similarity would be computed once. It looks like in the GenericItemBasedRecommender each item-item similarity will be computed multiple times. (How much depends on the data, but still.) I am wondering if anybody has any thoughts on the validity of doing item-item kNN in the context of: 1) performance, 2) quality of recommendations.
Re: Precision used by mahout
I think all of the code uses double-precision floats. I imagine much of it could work as well with single-precision floats. MapReduce and a GPU are very different things though, and I'm not sure how you would use both together effectively. On Wed, Feb 20, 2013 at 7:10 AM, shruti ranade shrutiranad...@gmail.com wrote: Hi, I am a beginner in mahout. I am working on a k-means MR implementation and trying to run it on a GPGPU. *I wanted to know if mahout computations are all double precision or single precision.* Suggest me any documentation that I need to refer to. Thanks, Shruti
Re: Precision used by mahout
I think this is quite possible too. I just think there's little point in matching this up with Hadoop. They represent entirely different architectures for large-scale computation. I mean, you can probably write an M/R job that uses GPUs on workers, but I imagine it would be an artificial marriage of technologies, probably Hadoop being used simply to distribute data. If you want to use a GPU, and want to use it properly, most of your work is to create an effective in-core parallel implementation, not one distributed across computers and distributed file systems. You use JNI or CUDA bindings in Java to push computations into hardware from Java. This is an exercise in a) modifying a matrix/vector library to use native hardware, then b) writing algorithms that use that library. I think your best starting point in Java may be something more general like Commons Math. On Wed, Feb 20, 2013 at 10:22 AM, 万代豊 20525entrad...@gmail.com wrote: This is an agenda that I'm interested in too. I believe Item-Based Recommendation in Mahout (not only in Mahout, though) should spend some time doing multiplication of the cooccurrence matrix and the user preference vector. If we could off-load this multiplication task to a GPGPU, that would be a great acceleration. What I'm not really clear on is how a double-precision multiplication task inside the Java Virtual Machine can take advantage of the HW accelerator. (I mean, how can you make the GPGPU visible to Mahout through the JVM?) If we could get over this, in addition to what Ted Dunning presented the other day on Solr involvement in building/loading the cooccurrence matrix for Mahout recommendation, it would be a big leap in innovating Mahout recommendation. Am I missing something or just dreaming? Regards, Y.Mandai 2013/2/20 Sean Owen sro...@gmail.com I think all of the code uses double-precision floats. I imagine much of it could work as well with single-precision floats. 
MapReduce and a GPU are very different things though, and I'm not sure how you would use both together effectively. On Wed, Feb 20, 2013 at 7:10 AM, shruti ranade shrutiranad...@gmail.com wrote: Hi, I am a beginner in mahout. I am working on k-means MR implementation and trying to run it on a GPGPU.* I wanted to know if mahout computations are all double precision or single precision. * Suggest me any documentation that I need to refer to. Thanks, Shruti
Re: Problems with Mahout's RecommenderIRStatsEvaluator
I agree with that explanation. Is it why it's unsupervised... well, I think of recommendation in the context of things like dimension reduction, which are just structure-finding exercises. Often the input has no positive or negative label (a click); everything is 'positive'. If you're predicting anything, it's not one target, but many targets, one per item, as if you have many small supervised problems. Whatever that is called -- I was just saying that it's not a simple supervised problem, and so it's not necessarily true that the things you do when testing that kind of thing apply here. Viewed through the supervised lens, I suppose you could say that this process only ever predicts the positive class, and that's different. In fact it is not classifying given test examples at all... it's like it is telling you which of many classifiers (items) would be most likely to return the positive class. On Sun, Feb 17, 2013 at 11:56 AM, Osman Başkaya osman.bask...@computer.org wrote: I am sorry to extend the unsupervised/supervised discussion, which is not the main question here, but I need to ask. Sean, I don't understand your last answer. Let's assume our rating scale is from 1 to 5. We can say that those movies which a particular user rates as 5 are relevant for him/her. 5 is just a number; we can use a *relevance threshold* like you did, and we can follow the method described in Cremonesi et al., Performance of Recommender Algorithms on Top-N Recommendation Tasks (http://goo.gl/pejO7) (*2. Testing Methodology, p.2*). Are you saying that this job is unsupervised since no user can rate all of the movies? For this reason, we won't be sure that our predicted top-N list contains no relevant item, because it is possible that our top-N recommendation list has relevant movie(s) which haven't been rated by the user *yet* as relevant. By using this evaluation procedure we miss them. 
In short, the following assumption can be problematic: We randomly select 1000 additional items unrated by user u. We may assume that most of them will not be of interest to user u. Although bigger N values mostly overcome this problem, it still does not seem totally supervised. On Sun, Feb 17, 2013 at 1:49 AM, Sean Owen sro...@gmail.com wrote: The very question at hand is how to label the data as relevant and not-relevant results. The question exists because this is not given, which is why I would not call this a supervised problem. That may just be semantics, but the point I wanted to make is that the reasons choosing a random training set is correct for a supervised learning problem are not reasons to determine the labels randomly from among the given data. It is a good idea if you're doing, say, logistic regression. It's not the best way here. This also seems to reflect the difference between whatever you want to call this and your garden-variety supervised learning problem. On Sat, Feb 16, 2013 at 11:15 PM, Ted Dunning ted.dunn...@gmail.com wrote: Sean, I think it is still a supervised learning problem in that there is a labelled training data set and an unlabeled test data set. Learning a ranking doesn't change the basic dichotomy between supervised and unsupervised. It just changes the desired figure of merit. -- Osman Başkaya, Koc University, MS Student | Computer Science and Engineering
Re: Problems with Mahout's RecommenderIRStatsEvaluator
No, this is not a problem. Yes, it builds a model for each user, which takes a long time. It's accurate, but time-consuming. It's meant for small data. You could rewrite your own test to hold out data for all test users at once. That's what I did when I rewrote a lot of this, just because it was more useful to have larger tests. There are several ways to choose the test data. One common way is by time, but there is no time information here by default. The problem is that, for example, recent ratings may be low -- or at least not high ratings. But the evaluation is of course asking the recommender for items that are predicted to be highly rated. Random selection has the same problem. Choosing by rating at least makes the test coherent. It does bias the training set, but the test set is supposed to be small. There is no way to actually know, a priori, what the top recommendations are. You have no information to evaluate most recommendations. This makes a precision/recall test fairly uninformative in practice. Still, it's better than nothing and commonly understood. Because of this, precision/recall won't be high on tests like this -- but I don't get values this low for movielens data on any normal algo. You may, though, if choosing an algorithm or parameters that don't work well. On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz ahmetyilmazefe...@yahoo.com wrote: Hi, I have looked at the internals of Mahout's RecommenderIRStatsEvaluator code. I think that there are two important problems here. According to my understanding, the experimental protocol used in this code is something like this: It takes away a certain percentage of users as test users. For each test user it builds a training set consisting of the ratings given by all other users + the ratings of the test user which are below the relevanceThreshold. 
It then builds a model and makes a recommendation to the test user, and finds the intersection between this recommendation list and the items which are rated above the relevanceThreshold by the test user. It then calculates the precision and recall in the usual way. Problems: 1. (mild) It builds a model for every test user, which can take a lot of time. 2. (severe) Only the ratings (of the test user) which are below the relevanceThreshold are put into the training set. This means that the algorithm only knows the preferences of the test user about the items which s/he doesn't like. This is not a good representation of user ratings. Moreover, when I ran this evaluator on the movielens 1m data, the precision and recall turned out to be, respectively, 0.011534185658699288 and 0.007905982905982885, and the run took about 13 minutes on my intel core i3. (I used user-based recommendation with k=2.) Although I know that it is not ok to judge the performance of a recommendation algorithm by looking at these absolute precision and recall values, still these numbers seem too low to me, which might be the result of the second problem I mentioned above. Am I missing something? Thanks, Ahmet
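Once a top-N list and a held-out set of relevant items are in hand, the precision/recall computation described above is just this. A standalone sketch, not the Mahout evaluator code itself:

```java
import java.util.List;
import java.util.Set;

// Precision@N and recall@N, given a recommended top-N list and the held-out
// set of items the test user rated above the relevance threshold.
public class PrecisionRecallAtN {
    static double precision(List<Long> topN, Set<Long> relevant) {
        long hits = topN.stream().filter(relevant::contains).count();
        return topN.isEmpty() ? 0.0 : (double) hits / topN.size();
    }

    static double recall(List<Long> topN, Set<Long> relevant) {
        long hits = topN.stream().filter(relevant::contains).count();
        return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
    }
}
```

The evaluator averages these per-user values over all test users, which is why a handful of users with tiny relevant sets can drag the averages toward zero.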
Re: Problems with Mahout's RecommenderIRStatsEvaluator
Yes. But: the test sample is small. Using 40% of your data to test is probably too much. My point is that it may be the least-bad thing to do. What test are you proposing instead, and why is it coherent with what you're testing? On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Ylmaz ahmetyilmazefe...@yahoo.com wrote: But modeling a user only by his/her low ratings can be problematic, since people generally are more precise (I believe) in their high ratings. Another problem is that recommender algorithms in general first mean-normalize the ratings for each user. Suppose that we have the following ratings of 3 people (A, B, and C) on 5 items.
A's ratings: 1 2 3 4 5
B's ratings: 1 3 5 2 4
C's ratings: 1 2 3 4 5
Suppose that A is the test user. Now if we put only the low ratings of A (1, 2, and 3) into the training set and mean-normalize the ratings, then A will be more similar to B than to C, which is not true. From: Sean Owen sro...@gmail.com To: Mahout User List user@mahout.apache.org; Ahmet Ylmaz ahmetyilmazefe...@yahoo.com Sent: Saturday, February 16, 2013 8:41 PM Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator No, this is not a problem. Yes, it builds a model for each user, which takes a long time. It's accurate, but time-consuming. It's meant for small data. You could rewrite your own test to hold out data for all test users at once. That's what I did when I rewrote a lot of this, just because it was more useful to have larger tests. There are several ways to choose the test data. One common way is by time, but there is no time information here by default. The problem is that, for example, recent ratings may be low -- or at least not high ratings. But the evaluation is of course asking the recommender for items that are predicted to be highly rated. Random selection has the same problem. Choosing by rating at least makes the test coherent. It does bias the training set, but the test set is supposed to be small. 
There is no way to actually know, a priori, what the top recommendations are. You have no information to evaluate most recommendations. This makes a precision/recall test fairly uninformative in practice. Still, it's better than nothing and commonly understood. Because of this, precision/recall won't be high on tests like this -- but I don't get values this low for movielens data on any normal algo. You may, though, if choosing an algorithm or parameters that don't work well. On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz ahmetyilmazefe...@yahoo.com wrote: Hi, I have looked at the internals of Mahout's RecommenderIRStatsEvaluator code. I think that there are two important problems here. According to my understanding, the experimental protocol used in this code is something like this: It takes away a certain percentage of users as test users. For each test user it builds a training set consisting of the ratings given by all other users + the ratings of the test user which are below the relevanceThreshold. It then builds a model and makes a recommendation to the test user, and finds the intersection between this recommendation list and the items which are rated above the relevanceThreshold by the test user. It then calculates the precision and recall in the usual way. Problems: 1. (mild) It builds a model for every test user, which can take a lot of time. 2. (severe) Only the ratings (of the test user) which are below the relevanceThreshold are put into the training set. This means that the algorithm only knows the preferences of the test user about the items which s/he doesn't like. This is not a good representation of user ratings. Moreover, when I ran this evaluator on the movielens 1m data, the precision and recall turned out to be, respectively, 0.011534185658699288 and 0.007905982905982885, and the run took about 13 minutes on my intel core i3. 
(I used user-based recommendation with k=2.) Although I know that it is not ok to judge the performance of a recommendation algorithm by looking at these absolute precision and recall values, still these numbers seem too low to me, which might be the result of the second problem I mentioned above. Am I missing something? Thanks, Ahmet
Re: Problems with Mahout's RecommenderIRStatsEvaluator
This is a good answer for evaluation of supervised ML, but this is unsupervised. Choosing randomly is choosing the 'right answers' randomly, and that's plainly problematic. On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: I think it is better to choose the ratings of the test user in a random fashion. On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen sro...@gmail.com wrote: Yes. But: the test sample is small. Using 40% of your data to test is probably too much. My point is that it may be the least-bad thing to do. What test are you proposing instead, and why is it coherent with what you're testing?
Re: Problems with Mahout's RecommenderIRStatsEvaluator
Sure, if you were predicting ratings for one movie given a set of ratings for that movie and the ratings for many other movies. That isn't what the recommender problem is. Here, the problem is to list the N movies most likely to be top-rated. The precision-recall test is, in turn, a test of top-N results, not a test of prediction accuracy. We aren't talking about RMSE here, or even any particular means of generating top-N recommendations. You don't even have to predict ratings to make a top-N list. On Sat, Feb 16, 2013 at 9:28 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: No, rating prediction is clearly a supervised ML problem. On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen sro...@gmail.com wrote: This is a good answer for evaluation of supervised ML, but this is unsupervised. Choosing randomly is choosing the 'right answers' randomly, and that's plainly problematic. On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: I think it is better to choose the ratings of the test user in a random fashion. On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen sro...@gmail.com wrote: Yes. But: the test sample is small. Using 40% of your data to test is probably too much. My point is that it may be the least-bad thing to do. What test are you proposing instead, and why is it coherent with what you're testing?
Re: Problems with Mahout's RecommenderIRStatsEvaluator
If you're suggesting that you hold out only high-rated items, and then sample them, then that's what is done already in the code, except without the sampling. The sampling doesn't buy anything that I can see. If you're suggesting holding out a random subset and then throwing away the held-out items with low ratings, then it's also the same idea, except you're randomly throwing away some lower-rated data from both test and train. I don't see how that helps either. On Sat, Feb 16, 2013 at 9:41 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: What I mean is you can choose ratings randomly and try to recommend the ones above the threshold.
Re: Problems with Mahout's RecommenderIRStatsEvaluator
I understand the idea, but this boils down to the current implementation, plus going back and throwing out some additional training data that is lower-rated -- it ends up in neither test nor training. Anything's possible, but I do not imagine this is a helpful practice in general. On Sat, Feb 16, 2013 at 10:29 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: I'm suggesting the second one. In that way the test user's ratings in the training set will be composed of both low- and high-rated items, which prevents the problem pointed out by Ahmet. On Sat, Feb 16, 2013 at 11:19 PM, Sean Owen sro...@gmail.com wrote: If you're suggesting that you hold out only high-rated items, and then sample them, then that's what is done already in the code, except without the sampling. The sampling doesn't buy anything that I can see. If you're suggesting holding out a random subset and then throwing away the held-out items with low ratings, then it's also the same idea, except you're randomly throwing away some lower-rated data from both test and train. I don't see how that helps either. On Sat, Feb 16, 2013 at 9:41 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: What I mean is you can choose ratings randomly and try to recommend the ones above the threshold.
Re: Problems with Mahout's RecommenderIRStatsEvaluator
The very question at hand is how to label the data as relevant and not relevant results. The question exists because this is not given, which is why I would not call this a supervised problem. That may just be semantics, but the point I wanted to make is that the reasons choosing a random training set are correct for a supervised learning problem are not reasons to determine the labels randomly from among the given data. It is a good idea if you're doing, say, logistic regression. It's not the best way here. This also seems to reflect the difference between whatever you want to call this and your garden variety supervised learning problem. On Sat, Feb 16, 2013 at 11:15 PM, Ted Dunning ted.dunn...@gmail.com wrote: Sean I think it is still a supervised learning problem in that there is a labelled training data set and an unlabeled test data set. Learning a ranking doesn't change the basic dichotomy between supervised and unsupervised. It just changes the desired figure of merit.
Re: Improving quality of item similarities?
Yes, I don't know if removing that data would improve results. It might mean you can compute things faster, at little or no observable loss in quality of the results. I'm not sure, but you probably have repeat purchases of the same item, and items of different value. Working that data in may help here, since you have relatively few items. On Thu, Feb 14, 2013 at 10:25 AM, Julian Ortega jorte...@gmail.com wrote: Hi everyone. I have a data set that looks like this:
Number of users: 198651
Number of items: 9972
Statistics of purchases from users: mean 3.3, stdDev 3.5, min 1, max 176, median 2
Statistics of purchased items (times bought): mean 65.1, stdDev 120.7, min 1, max 3278, median 25
I'm using a GenericItemBasedRecommender with LogLikelihoodSimilarity to generate a list of similar items. However, I've been wondering how I should pre-process the data before passing it to the recommender to improve the quality. Some things I have considered are: - Removing all users that have 5 or fewer purchases - Removing all items that have been purchased 5 or fewer times. In general terms, would that make sense? Presumably it will make the matrix less sparse and also avoid weak associations, although if I'm not mistaken LogLikelihood accounts for low numbers of occurrences. Any thoughts? Thanks, Julian
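For reference, the log-likelihood ratio behind LogLikelihoodSimilarity can be computed from a 2x2 cooccurrence table using the standard entropy formulation (k11 = users with both items, k12/k21 = one but not the other, k22 = neither). This is a standalone sketch of that statistic, not Mahout's own class:

```java
// Log-likelihood ratio from a 2x2 contingency table, via unnormalized
// entropies. Scores near 0 mean "cooccurrence explained by chance";
// large scores mean a significant association even at low counts.
public class Llr {
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    static double entropy(long... counts) {
        long sum = 0;
        double elementEntropy = 0.0;
        for (long c : counts) {
            sum += c;
            elementEntropy += xLogX(c);
        }
        return xLogX(sum) - elementEntropy;
    }

    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + colEntropy - matEntropy);
    }
}
```

This is why aggressive pruning of rare items may be unnecessary for quality: an item bought a handful of times only scores highly against another item when the overlap is larger than chance predicts.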
Re: Shopping cart
This sounds like a job for frequent item set mining, which is kind of a special case of the ideas you've mentioned here. Given N items in a cart, which next item most frequently occurs in a purchased cart? On Thu, Feb 14, 2013 at 6:30 PM, Pat Ferrel pat.fer...@gmail.com wrote: I thought you might say that but we don't have the add-to-cart action. We have to calculate cart purchases by matching cart IDs or session IDs. So we only have cart purchases with items. If we had the add-to-cart and the purchase we could use your cross-action method for getting recs by training only on those two actions. Still without the add-to-cart the method below should work, right? The main problem being finding a similar cart in the training set quickly. Are there other problems? On Feb 14, 2013, at 9:19 AM, Ted Dunning ted.dunn...@gmail.com wrote: I think that this is an excellent use case for cross recommendation from cart contents (items) to cart purchases (items). The cross aspect is that the recommendation is from two different kinds of actions, not two kinds of things. The first action is insertion into a cart and the second is purchase of an item. On Thu, Feb 14, 2013 at 9:53 AM, Pat Ferrel pat.fer...@gmail.com wrote: There are several methods for recommending things given a shopping cart contents. At the risk of using the same tool for every problem I was thinking about a recommender's use here. I'd do something like train on shopping cart purchases so row = cartID, column = itemID. Given cart contents I could find the most similar cart in the training set by using a similarity measure then get recs for this closest matched cart. The search for similar carts may be slow if I have to check for pairwise similarity so I could cluster and find the best cluster then search it for the best cart. I could create a decision tree on all trained carts and walk as far as I can down the tree to find the cart with the most cooccurrences. 
There may be other cooccurrence-based methods in Mahout? With the id of the cart I can then get recs from the training set. I could also fold the new cart contents into the training set and ask for recs based on it (this seems like it would take a long time to compute). This last would also pollute the trained matrix with partial carts over time. This seems like another place where Lucene might help, but are there other Mahout methods to look at before diving into Lucene?
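Sean's frequent-itemset framing above ("given N items in a cart, which next item most frequently occurs in a purchased cart?") can be sketched as a simple count over purchased carts. Purely illustrative, not a Mahout API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Count, over historical purchased carts that contain the current basket,
// how often each other item appears. Ranking candidates by this count is a
// crude but direct form of frequent-itemset "next item" recommendation.
public class NextItemCounter {
    static Map<String, Integer> nextItemCounts(List<Set<String>> purchasedCarts,
                                               Set<String> basket) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> cart : purchasedCarts) {
            if (!cart.containsAll(basket)) {
                continue; // only carts that contain the whole basket count
            }
            for (String item : cart) {
                if (!basket.contains(item)) {
                    counts.merge(item, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

A real apriori-style miner would relax the exact-containment requirement to subsets of the basket and apply support/confidence thresholds, but the counting core is the same.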
Re: Shopping cart
I don't think it's necessarily slow; this is how item-based recommenders work. The only thing stopping you from using Mahout directly is that I don't think there's an easy way to say 'recommend to this collection of items'. But that's what is happening inside when you recommend for a user. You can just roll your own version of it. Yes, you are computing similarity for the k carted items by all N items, but is N so large? Hundreds of thousands of products? This is still likely pretty fast, even if the similarity is over millions of carts. Some smart precomputation and caching go a long way too. On Thu, Feb 14, 2013 at 7:10 PM, Pat Ferrel pat.fer...@gmail.com wrote: Yes, one time-tested way to do this is the apriori algo, which looks at frequent item sets and creates rules. I was looking for a shortcut using a recommender, which would be super easy to try. The rule builder is a little harder to implement, but we can also test precision on that and compare the two. The recommender method below should be reasonable AFAICT, except for the method(s) of retrieving recs, which seem likely to be slow. On Feb 14, 2013, at 9:45 AM, Sean Owen sro...@gmail.com wrote: This sounds like a job for frequent item set mining, which is kind of a special case of the ideas you've mentioned here. Given N items in a cart, which next item most frequently occurs in a purchased cart? On Thu, Feb 14, 2013 at 6:30 PM, Pat Ferrel pat.fer...@gmail.com wrote: I thought you might say that, but we don't have the add-to-cart action. We have to calculate cart purchases by matching cart IDs or session IDs. So we only have cart purchases with items. If we had the add-to-cart and the purchase, we could use your cross-action method for getting recs by training only on those two actions. Still, without the add-to-cart, the method below should work, right? The main problem being finding a similar cart in the training set quickly. Are there other problems? 
On Feb 14, 2013, at 9:19 AM, Ted Dunning ted.dunn...@gmail.com wrote: I think that this is an excellent use case for cross recommendation from cart contents (items) to cart purchases (items). The cross aspect is that the recommendation is from two different kinds of actions, not two kinds of things. The first action is insertion into a cart and the second is purchase of an item. On Thu, Feb 14, 2013 at 9:53 AM, Pat Ferrel pat.fer...@gmail.com wrote: There are several methods for recommending things given shopping cart contents. At the risk of using the same tool for every problem, I was thinking about a recommender's use here. I'd do something like train on shopping cart purchases so row = cartID, column = itemID. Given cart contents I could find the most similar cart in the training set by using a similarity measure, then get recs for this closest matched cart. The search for similar carts may be slow if I have to check for pairwise similarity, so I could cluster, find the best cluster, then search it for the best cart. I could create a decision tree on all trained carts and walk as far as I can down the tree to find the cart with the most cooccurrences. There may be other cooccurrence-based methods in Mahout??? With the id of the cart I can then get recs from the training set. I could also fold in the new cart contents to the training set and ask for recs based on it (this seems like it would take a long time to compute). This last would also pollute the trained matrix with partial carts over time. This seems like another place where Lucene might help, but are there other Mahout methods to look at before diving into Lucene?
Re: Shopping cart
Yes your only issue there, which I think you had touched on, was that you have to put your current cart (which hasn't been purchased) into the model in order to get an answer out of a recommender. I think we've talked about the recommend-to-anonymous function in the context of another system, which is exactly what you need here. Yes, all you have to do then is reproduce the recommender computation. But I understand that you were hoping to avoid rewriting it. It's really just a loop though, so not much work to reproduce. 100K items x a few items in a cart is a few hundred thousand similarities. This isn't trivial but not going to take seconds, I think. Yes this gets much faster if you can precompute item-item similarity. Computing NxN pairs is going to take a long time though when N=100,000. So yes something like clustering is the nice way to scale that. Then your clusters greatly limit the number of candidates to consider because you can round every other inter-cluster similarity to 0. By this point... I imagine it's about as hard to whip up a frequent itemset implementation! or crib one and adapt it. This is in mahout. That's probably the right tool for the job. On Thu, Feb 14, 2013 at 8:19 PM, Pat Ferrel pat.fer...@gmail.com wrote: I'm creating a matrix of cart ids and items ids so cart x items in cart. The 'preference' then is cartID, itemID. This will create the correct matrix I think. For any cart id I would get a ranked list of recommended items that was calculated from other carts. This seems like what is needed in a SC recommender. So doing this should give a recommend to this collection of items, right? The only issue is finding the best cart to get the recs. I would be doing a pair-wise similarity comparison for N carts to the current cart contents and the result would have to come back in a very short amount of time, on the order of the time to get recs for 3M users and 100K items. 
Not sure what N is yet but the # of items is the same as in the purchase matrix. So finding the best cart to get recs for will be N similarity comparisons--worst case. Each cart is likely to have only a few items in it and I imagine this speeds the similarity calc. I guess I'll try it as described and optimize for speed if the precision is good compared to the apriori algo. On Feb 14, 2013, at 10:57 AM, Sean Owen sro...@gmail.com wrote: I don't think it's necessarily slow; this is how item-based recommenders work. The only thing stopping you from using Mahout directly is that I don't think there's an easy way to say "recommend to this collection of items". But that's what is happening inside when you recommend for a user. You can just roll your own version of it. Yes you are computing similarity for k carted items by all N items, but is N so large? hundreds of thousands of products? this is still likely pretty fast even if the similarity is over millions of carts. Some smart precomputation and caching goes a long way too. On Thu, Feb 14, 2013 at 7:10 PM, Pat Ferrel pat.fer...@gmail.com wrote: Yes, one time-tested way to do this is the apriori algo, which looks at frequent item sets and creates rules. I was looking for a shortcut using a recommender, which would be super easy to try. The rule builder is a little harder to implement but we can also test precision on that and compare the two. The recommender method below should be reasonable AFAICT except for the method(s) of retrieving recs, which seem likely to be slow. On Feb 14, 2013, at 9:45 AM, Sean Owen sro...@gmail.com wrote: This sounds like a job for frequent item set mining, which is kind of a special case of the ideas you've mentioned here. Given N items in a cart, which next item most frequently occurs in a purchased cart? On Thu, Feb 14, 2013 at 6:30 PM, Pat Ferrel pat.fer...@gmail.com wrote: I thought you might say that but we don't have the add-to-cart action.
We have to calculate cart purchases by matching cart IDs or session IDs. So we only have cart purchases with items. If we had the add-to-cart and the purchase we could use your cross-action method for getting recs by training only on those two actions. Still without the add-to-cart the method below should work, right? The main problem being finding a similar cart in the training set quickly. Are there other problems? On Feb 14, 2013, at 9:19 AM, Ted Dunning ted.dunn...@gmail.com wrote: I think that this is an excellent use case for cross recommendation from cart contents (items) to cart purchases (items). The cross aspect is that the recommendation is from two different kinds of actions, not two kinds of things. The first action is insertion into a cart and the second is purchase of an item. On Thu, Feb 14, 2013 at 9:53 AM, Pat Ferrel pat.fer...@gmail.com wrote: There are several methods for recommending things given a shopping cart contents. At the risk
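The "roll your own" loop Sean describes -- score every candidate item by its summed similarity to the items already in the cart, then rank -- can be sketched in plain Java. Everything here (the class, the precomputed similarity table, the method names) is a hypothetical illustration, not a Mahout API:

```java
import java.util.*;

/** Sketch of "recommend to an anonymous cart": score each candidate item by
 *  summing its similarity to the items already in the cart, then rank.
 *  The similarity table is a stand-in for whatever precomputed item-item
 *  similarity a real system would cache. */
public class CartScorer {

    // sim.get(candidate).get(carted) -> similarity; absent means 0
    private final Map<Long, Map<Long, Double>> sim;

    public CartScorer(Map<Long, Map<Long, Double>> sim) { this.sim = sim; }

    private double similarity(long a, long b) {
        Map<Long, Double> row = sim.get(a);
        Double s = (row == null) ? null : row.get(b);
        return s == null ? 0.0 : s;
    }

    /** Rank all candidate items (excluding current cart contents) by summed similarity. */
    public List<Long> recommend(Set<Long> cart, Set<Long> allItems, int howMany) {
        Map<Long, Double> score = new HashMap<>();
        for (long candidate : allItems) {
            if (cart.contains(candidate)) continue;   // don't recommend what's already carted
            double s = 0.0;
            for (long carted : cart) s += similarity(candidate, carted);
            if (s > 0.0) score.put(candidate, s);
        }
        List<Long> ranked = new ArrayList<>(score.keySet());
        ranked.sort((x, y) -> Double.compare(score.get(y), score.get(x)));
        return ranked.subList(0, Math.min(howMany, ranked.size()));
    }
}
```

With precomputed item-item similarities this is k lookups per candidate, i.e. the k-by-N cost Sean estimates above.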
Re: Implicit preferences
I think you'd have to hack the code to not exclude previously-seen items, or at least, not of the type you wish to consider. Yes you would also have to hack it to add rather than replace existing values. Or for test purposes, just do the adding yourself before inputting the data. My hunch is that it will hurt non-trivially to treat different interaction types as different items. You probably want to predict that someone who viewed a product over and over is likely to buy it, but this would only weakly tend to occur if the bought-item is not the same thing as the viewed-item. You'd learn they go together but not as strongly as ought to be obvious from the fact that they're the same. Still, interesting thought. There ought to be some 'signal' in this data, just a question of how much vs noise. A purchase means much more than a page view of course; it's not as subject to noise. Finding a means to use that info is probably going to help. On Sat, Feb 9, 2013 at 7:50 PM, Pat Ferrel pat.fer...@gmail.com wrote: I'd like to experiment with using several types of implicit preference values with recommenders. I have purchases as an implicit pref of high strength. I'd like to see if add-to-cart, view-product-details, impressions-seen, etc. can increase offline precision in holdout tests. These less-than-obvious implicit prefs will get a much lower value than purchase and I'll experiment with different mixes. The problem is that some of these prefs will indicate that the user, for whom I'm getting recs, has expressed a preference. Using these implicit prefs seems reasonable in finding similarity of taste between users but presents several problems. 1) how to encode the prefs, each impression-seen will increase the strength of preference of a user for an item but the recommender framework replaces the preference value for items preferred more than once, doesn't it?
2) AFAIK the current recommender framework will return recs only for items that the user in question has expressed no preference for. If I use something like view-product-details or impressions-seen, I will be removing anything the user has seen from the recs, which is not what I want in this experiment. Has anyone tried something like this? I'm not convinced that these other implicit preferences will add anything to the recommender, just trying to find out.
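Sean's "do the adding yourself before inputting the data" suggestion amounts to a pre-aggregation pass over the raw event log. A minimal sketch, where the event names and weights are purely illustrative assumptions, not anything prescribed by Mahout:

```java
import java.util.*;

/** Sketch of pre-aggregating implicit events into one preference value per
 *  (user, item) pair before handing data to a recommender, since a standard
 *  DataModel keeps a single value per pair rather than summing repeats. */
public class ImplicitPrefs {

    // relative strengths: a purchase says far more than an impression
    private static final Map<String, Double> WEIGHT = new HashMap<>();
    static {
        WEIGHT.put("purchase", 4.0);
        WEIGHT.put("add-to-cart", 2.0);
        WEIGHT.put("view-product-details", 0.5);
        WEIGHT.put("impression-seen", 0.1);
    }

    /** Sum weighted events into one value keyed by "userId,itemId". */
    public static Map<String, Double> aggregate(List<String[]> events) {
        Map<String, Double> prefs = new HashMap<>();
        for (String[] e : events) {                // e = {userId, itemId, eventType}
            String key = e[0] + "," + e[1];
            prefs.merge(key, WEIGHT.getOrDefault(e[2], 0.0), Double::sum);
        }
        return prefs;
    }
}
```

The output is exactly the user,item,value triples the recommender expects, with repeated weak signals accumulated instead of overwritten.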
Re: Implicit preferences
Yeah I bet it does actually work well... but aren't you basically spending an extra step to make the item-item matrix, to relearn that bought-X and viewed-X go together? yeah you learn a lot more along the way, as this is item-based recommendation at heart. It seems like you could add back that knowledge. On Sun, Feb 10, 2013 at 5:36 PM, Ted Dunning ted.dunn...@gmail.com wrote: Actually treating the different interactions separately can lead to very good recommendations. The only issue is that the interactions are no longer dyadic. If you think about it, having two different kinds of interactions is like adjoining interaction matrices for the two different kinds of interaction. Suppose that you have user x views in matrix A and you have user x purchases in matrix B. The complete interaction matrix of user x (views + purchases) is [A | B]. When you compute cooccurrence in this matrix, you get

    [A | B]' [A | B] = [ A' ] [A | B] = [ A'A  A'B ]
                       [ B' ]           [ B'A  B'B ]

This matrix is (view + purchase) x (view + purchase). But we don't care about predicting views so we only really need a matrix that is purchase x (view + purchase). This is just the bottom part of the matrix above, or [ B'A | B'B ]. When you produce purchase recommendations r_p by multiplying by a mixed view and purchase history vector h which has a view part h_v and a purchase part h_p, you get

    r_p = [ B'A | B'B ] h = B'A h_v + B'B h_p

That is a prediction of purchases based on past views and past purchases. Note that this general form applies for both decomposition techniques such as SVD, ALS and LLL as well as for sparsification techniques such as the LLR sparsification. All that changes is the mechanics of how you do the multiplications. Weighting of components works the same as well. What is very different here is that we have a component of cross recommendation. That is the B'A in the formula above.
This is very different from a normal recommendation and cannot be computed with the simple self-join that we normally have in Mahout cooccurrence computation and also very different from the decompositions that we normally do. It isn't hard to adapt the Mahout computations, however. When implementing a recommendation using a search engine as the base, these same techniques also work extremely well in my experience. What happens is that for each item that you would like to recommend, you would have one field that has components of B'A and one field that has components of B'B. It is handy to simply use the binary values of the sparsified versions of these and let the search engine handle the weighting of different components at query time. Having these components separated into different fields in the search index seems to help quite a lot, which makes a fair bit of sense. On Sun, Feb 10, 2013 at 9:55 AM, Sean Owen sro...@gmail.com wrote: I think you'd have to hack the code to not exclude previously-seen items, or at least, not of the type you wish to consider. Yes you would also have to hack it to add rather than replace existing values. Or for test purposes, just do the adding yourself before inputting the data. My hunch is that it will hurt non-trivially to treat different interaction types as different items. You probably want to predict that someone who viewed a product over and over is likely to buy it, but this would only weakly tend to occur if the bought-item is not the same thing as the viewed-item. You'd learn they go together but not as strongly as ought to be obvious from the fact that they're the same. Still, interesting thought. There ought to be some 'signal' in this data, just a question of how much vs noise. A purchase means much more than a page view of course; it's not as subject to noise. Finding a means to use that info is probably going to help. 
On Sat, Feb 9, 2013 at 7:50 PM, Pat Ferrel pat.fer...@gmail.com wrote: I'd like to experiment with using several types of implicit preference values with recommenders. I have purchases as an implicit pref of high strength. I'd like to see if add-to-cart, view-product-details, impressions-seen, etc. can increase offline precision in holdout tests. These less than obvious implicit prefs will get a much lower value than purchase and i'll experiment with different mixes. The problem is that some of these prefs will indicate that the user, for whom I'm getting recs, has expressed a preference. Using these implicit prefs seems reasonable in finding similarity of taste between users but presents several problems. 1) how to encode the prefs, each impression-seen will increase the strength of preference of a user for an item but the recommender framework replaces the preference value for items preferred more than once, doesn't
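Ted's formula r_p = B'A h_v + B'B h_p can be checked with a tiny dense-matrix sketch. This illustrates only the algebra -- real interaction matrices would be huge and sparse, and none of this is Mahout code:

```java
/** Tiny dense sketch of the cross-recommendation formula
 *  r_p = B'A h_v + B'B h_p, where A is user x views, B is user x purchases,
 *  and h_v / h_p are a new user's view and purchase history vectors. */
public class CrossRec {

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++) t[j][i] = m[i][j];
        return t;
    }

    static double[] multiply(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++) r[i] += m[i][j] * v[j];
        return r;
    }

    static double[][] multiply(double[][] a, double[][] b) {
        double[][] r = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int k = 0; k < b.length; k++)
                for (int j = 0; j < b[0].length; j++) r[i][j] += a[i][k] * b[k][j];
        return r;
    }

    /** r_p = (B'A) h_v + (B'B) h_p */
    public static double[] recommendPurchases(double[][] A, double[][] B,
                                              double[] hv, double[] hp) {
        double[][] Bt = transpose(B);
        double[] fromViews = multiply(multiply(Bt, A), hv);       // cross part B'A h_v
        double[] fromPurchases = multiply(multiply(Bt, B), hp);   // ordinary part B'B h_p
        double[] r = new double[fromViews.length];
        for (int i = 0; i < r.length; i++) r[i] = fromViews[i] + fromPurchases[i];
        return r;
    }
}
```

The first term is the cross-recommendation component Ted highlights; setting h_v to zero reduces it to ordinary purchase-cooccurrence recommendation.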
Re: Rating scale
You don't have to fix a scale. But your data needs to be consistent. It wouldn't work to have users rate on a 1-5 scale one day, and 1-100 tomorrow (unless you go back and normalize the old data to 1-100). On Mon, Feb 4, 2013 at 3:56 PM, Zia mel ziad.kame...@gmail.com wrote: Hi, is there a necessity to have a fixed rating scale while running recommendations, or can it be dynamic based on the users' data? Many Thanks
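If old data was collected on a different scale, a linear rescaling is enough to make everything consistent. A trivial sketch (the 1-100 target is just an example):

```java
/** Sketch of normalizing ratings collected on different scales onto one
 *  consistent scale before mixing them in a single data model. */
public class RatingScale {

    /** Linearly map a rating from [oldMin, oldMax] to [newMin, newMax]. */
    public static double rescale(double rating, double oldMin, double oldMax,
                                 double newMin, double newMax) {
        return newMin + (rating - oldMin) * (newMax - newMin) / (oldMax - oldMin);
    }
}
```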
Re: Failed to execute goal Surefire plugin -- any ideas?
You can -DskipTests to skip tests, since that's what it is complaining about. There aren't any current failures in trunk so could be something specific to your setup. Or a flaky test. It may still be something to fix. On Mon, Feb 4, 2013 at 3:37 PM, jellyman colm_r...@hotmail.com wrote: Hi everyone, Can you help me please? I'm new to Mahout and am trying to get it running on my local windows box on Eclipse IDE but I'm stuck. Here is what I have done so far: 1. Pulled down latest source from:- http://svn.apache.org/repos/asf/mahout/trunk 2. Following the instructions here: https://cwiki.apache.org/MAHOUT/buildingmahout.html 3. mvn compile from inside the core directory -- result is good 4. mvn install from inside the core directory -- I get an error message like: BUILD FAILED. Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project mahout-core. There are test failures. 5. I then run: mvn -X install for more information. Error message is: org.apache.maven.lifecycle.LifecycleExecutionException: failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.13:test I'm running Maven v3.0.4 and Eclipse 4.2. I have C:\Program Files\Java\jdk1.7.0_06\bin in the environment path etc... Just wondering can anyone help me? Any ideas/suggestions that you would like to share? Thanks a mill in advance, jelly. -- View this message in context: http://lucene.472066.n3.nabble.com/Failed-to-execute-goal-Surefire-plugin-any-ideas-tp4038361.html Sent from the Mahout User List mailing list archive at Nabble.com.
Re: Threshold-based neighborhood and getReach
You are asking for a smaller and smaller neighborhood around a user. At some point the neighborhood includes no users, for some people -- or, the neighborhood includes no new items. Nothing can be recommended, and so recall drops. Precision and recall tend to go in opposite directions for similar reasons. On Mon, Feb 4, 2013 at 3:54 PM, Zia mel ziad.kame...@gmail.com wrote: Hi, when selecting a threshold-based neighborhood, as the threshold increases the precision increases, which makes sense. However, getReach at max provides recommendations for 0.2 of users and decreases to 0.0002 -- is that normal? The recall also drops. When using a fixed-size neighborhood, getReach provides much higher results. //=== Code used UserNeighborhood neighborhood = new ThresholdUserNeighborhood(thresholdValue, similarity, model); return new GenericUserBasedRecommender(model, neighborhood, similarity); IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 10, GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0); stats.getReach() //=== Thanks
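For reference, "reach" is just the fraction of users for whom the recommender can produce anything at all, which is why it collapses as neighborhoods empty out. A sketch of the idea (not the GenericRecommenderIRStatsEvaluator internals):

```java
import java.util.*;

/** Sketch of the reach metric: the fraction of users who receive at least
 *  one recommendation. As a similarity threshold tightens, more users end
 *  up with empty neighborhoods and reach falls toward zero. */
public class Reach {

    /** recsPerUser maps userId -> number of recommendations produced for that user. */
    public static double reach(Map<Long, Integer> recsPerUser) {
        if (recsPerUser.isEmpty()) return 0.0;
        long reached = recsPerUser.values().stream().filter(n -> n > 0).count();
        return (double) reached / recsPerUser.size();
    }
}
```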
Re: Server sizing Hadoop + Mahout
The problem with this POV is that it assumes it's obvious what the right outcome is. With a transaction test or a disk write test or big sort, it's obvious and you can make a benchmark. With ML, it's not even close. For example, I can make you a recommender that is literally as fast as you like by picking any random set of items. Classifiers can likewise do so by randomly picking a class. Specifying even a desired answer isn't useful, since then you are just selecting a process that picks a particular answer on a particular data set. I don't think that works, since the classic idea of benchmark is not well-defined here, but you're welcome to go create and run whatever tests you like. On Sat, Feb 2, 2013 at 3:19 PM, jordi jord...@gmail.com wrote: Hi Sean! First of all, thanks for your reply! I do agree that it's very complicated to do the sizing of an environment since there are many variables that should be considered. You have mentioned some of them: the algorithm, the distribution of data, the amount of data, type of hardware, etc. But I don't agree that it's impossible to give a baseline. Maybe it would be a great idea for the Mahout+Hadoop community to take a look at these guys (Standard Performance Evaluation Corporation, http://www.spec.org/). They run the same benchmark on different types of architectures, establishing empirically a baseline that can be used as a good starting point to do a capacity planning. They have a lot of benchmarks depending on CPU, Java Client Server, etc. Obviously, that's only a starting point: before your software goes live to production mode, it's desirable to benchmark again your software running a load-test, adapting your infrastructure depending on performance results.
Re: (near) real time recommender/predictor
It's a good question. I think you can achieve a partial solution in Mahout. Real-time suggests that you won't be able to make use of Hadoop-based implementations, since they are by nature big batch processes. All of the implementations accept the same input -- user,item,value. That's OK; you can probably just reduce all of your user-thing interactions to tuples like this. Any reasonable mapping should be OK. Tags can be items too. I don't think any of the implementations take advantage of time. The non-Hadoop implementations are not-quite-realtime. The model is loading data into memory from backing store, computing and maybe caching partial results, and serving results as quickly as possible. New input can't be immediately used, no. It comes into play when the model is reloaded only. I think you have very sparse input -- a high number of users and items (tags, likes), but relatively few interactions. Matrix factorization / latent factor models work well here. The ones in Mahout that are not Hadoop-based may work for you, like SVDRecommender. It's worth a try. (Advertisement: the new recommender product I am commercializing, Myrrix, does the real-time and matrix factorization thing just fine. It's easy enough to start with that I would encourage you to experiment with the open source system also: http://myrrix.com/download/) On Thu, Jan 31, 2013 at 7:02 PM, Frederik Kraus frederik.kr...@gmail.com wrote: Hi Guys, I'm rather new to the whole Mahout ecosystem, so please excuse if the questions I have are rather dumb ;) Our problem basically boils down to this: we want to match users with either the content they're interested in and/or the content they could contribute to. To do this matching we have several dimensions both of users and content items (things like: contribution history, tags, browsing history, diggs, likes, ….).
As interest of users can change over time some kind of CF algorithm including temporal effects would obviously be best, but for the time being those effects could probably be neglected. Now my questions: - what algorithm from the mahout toolkit would best fit our case? - How can we get this near realtime, i.e. not having to recalculate the entire model when user dimensions change and/or new content is being added to the system (or updated) - how would we model the user and item vectors (especially things like tags)? - any hints on where to start? ;) Thanks a lot! Fred.
Re: Using setPreference() to update recommendations in DataModel in Memory
It throws an exception except in a few implementations, mostly the ones based on a database. It isn't something that's really used -- you instead update the backing store indirectly. Yes, the model is batch re-reads of data once in a while. Updates are not in real time in this model. On Wed, Jan 30, 2013 at 8:21 AM, Henning Kuich hku...@gmail.com wrote: So what does the method do instead? And basically the conclusion is: To update your recommender with new preference values, you need to reload the data model and everything that follows? Thanks, Henning On Tue, Jan 29, 2013 at 7:30 PM, Sean Owen sro...@gmail.com wrote: It doesn't really work this way. The model is predicated on loading the data from backing store periodically. In the short term it is read only. This method is misleading in a sense. On Jan 29, 2013 3:31 PM, Henning Kuich hku...@gmail.com wrote: Dear All, I would like to be able to update recommendations in the DataModel, and I understand that this can be done with the setPreference() method. So this can be used to create a new user-item-preference entry into the data model, or update an already existing one. My question is the following: I run my recommender.recommend, and get a recommendation for user1. As it happens, user1 now rates 5 other items, and I use the setPreference() method to place those 5 new ratings into my DataModel. If I now re-run the recommender.recommend, does the recommender automatically incorporate the 5 new ratings that have just been made, or do I need to update the recommender in between? And if so, how do I do this? I hope this question makes sense, and many thanks in advance. Henning -- P. Henning J. L. Kuich email: hku...@gmail.com twitter: @hkuich http://twitter.com/hkuich facebook: henning.kuich G+: hkuich Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.
Re: Using setPreference() to update recommendations in DataModel in Memory
It doesn't really work this way. The model is predicated on loading the data from backing store periodically. In the short term it is read only. This method is misleading in a sense. On Jan 29, 2013 3:31 PM, Henning Kuich hku...@gmail.com wrote: Dear All, I would like to be able to update recommendations in the DataModel, and I understand that this can be done with the setPreference() method. So this can be used to create a new user-item-preference entry into the data model, or update an already existing one. My question is the following: I run my recommender.recommend, and get a recommendation for user1. As it happens, user1 now rates 5 other items, and I use the setPreference() method to place those 5 new ratings into my DataModel. If I now re-run the recommender.recommend, does the recommender automatically incorporate the 5 new ratings that have just been made, or do I need to update the recommender in between? And if so, how do I do this? I hope this question makes sense, and many thanks in advance. Henning
Re: Question about server/computer architecture...
This is quite small and certainly doesn't require Hadoop. That's the good news. Any reasonable server will do well for you. You won't be memory bound. More cores will let you serve more QPS. Your pain points will be elsewhere like tuning for best quality and real time updates. See my separate email for a possible different solution. Sean On Jan 29, 2013 5:21 PM, Henning Kuich hku...@gmail.com wrote: Thanks for the quick answer Ted. I want to build a User-based recommender for an e-commerce start-up. The 1M ratings dataset from grouplens is about what we are expecting in the nearer future. the data will be preferences either from 1-5 or 1-3... I hope this makes my question a bit more complete.. sorry about that! On Tue, Jan 29, 2013 at 5:47 PM, Ted Dunning ted.dunn...@gmail.com wrote: Depends on what you want to do with Mahout. What is that? How much data? What kind of data? On Tue, Jan 29, 2013 at 7:14 AM, Henning Kuich hku...@gmail.com wrote: Dear All, is there a preferred computer architecture for Mahout? for example, do multicore processors help? is there anything else in terms of server hardware that one should know about, or anything that might be particularly favorable to implement Mahout? Thanks in advance, Henning
Re: QRDecomposition performance
Is it worth simply using the Commons Math implementation? On Mon, Jan 28, 2013 at 8:04 AM, Sebastian Schelter s...@apache.org wrote: This is great news and will automatically boost the performance of all our ALS-based recommenders, which are all using QRDecomposition internally. On 28.01.2013 04:02, Ted Dunning wrote: Did that. You are right. The QRD in Mahout is abysmally slow. I wrote a new version on the airplane that seems to be about 10x faster and still just about as accurate (and vastly simpler). I will put up some tests and a patch in the next week or so.
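For anyone following along, here is a compact classical Gram-Schmidt sketch of what a QR decomposition computes. This is NOT the Mahout or Commons Math implementation (production code uses more numerically stable approaches such as Householder reflections), and it has no guard for rank-deficient input:

```java
/** Classical Gram-Schmidt QR sketch: factor A = Q R where Q has orthonormal
 *  columns and R is upper triangular. Illustrative only; less stable than
 *  the Householder-based decompositions used in real libraries, and it
 *  divides by r[j][j], so rank-deficient input would break it. */
public class SimpleQR {

    /** Returns {Q, R} with A = Q R, for an m x n matrix A with m >= n. */
    public static double[][][] decompose(double[][] a) {
        int m = a.length, n = a[0].length;
        double[][] q = new double[m][n];
        double[][] r = new double[n][n];
        for (int j = 0; j < n; j++) {
            double[] v = new double[m];
            for (int i = 0; i < m; i++) v[i] = a[i][j];
            for (int k = 0; k < j; k++) {             // subtract projections onto earlier columns
                for (int i = 0; i < m; i++) r[k][j] += q[i][k] * a[i][j];
                for (int i = 0; i < m; i++) v[i] -= r[k][j] * q[i][k];
            }
            double norm = 0;
            for (int i = 0; i < m; i++) norm += v[i] * v[i];
            r[j][j] = Math.sqrt(norm);
            for (int i = 0; i < m; i++) q[i][j] = v[i] / r[j][j];
        }
        return new double[][][] {q, r};
    }
}
```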
Re: MatrixMultiplicationJob runs with 1 mapper only ?
These are settings to Hadoop, not Mahout. You may need to set them in your cluster config. They are still only suggestions. The question still remains why you think you need several mappers. Why? On Mon, Jan 28, 2013 at 1:28 PM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi, I would like to again consolidate all the steps which I performed. Issue : MatrixMultiplication example is getting executed with only 1 map task. Steps : 1. I created a file with size 104MB which is divided into 11 blocks with size 10MB each. The file contains a 200x10 matrix. 2. I exported $MAHOUT_OPTS to the following $ echo $MAHOUT_OPTS -Dmapred.min.split.size=10485760 -Dmapred.map.tasks=7 3. Tried to execute the matrix multiplication example using the command line: mahout matrixmult --inputPathA /test/points/matrixA --numRowsA 200 --numColsA 10 --inputPathB /test/points/matrixA --numRowsB 200 --numColsB 10 --tempDir /test/temp When I check the Jobtracker UI, it shows me the following for the running job: Running Map Tasks : 1 Occupied Map Slots: 1 How can I distribute the map task to different mappers for the MatrixMultiplication Job dynamically? Is it even possible that MatrixMultiplication can run distributed across multiple mappers, as it internally uses CompositeInputFormat? Please Suggest Thanks Stuti -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Wednesday, January 23, 2013 6:42 PM To: Mahout User List Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ? Mappers are usually extremely fast since they start themselves on top of the data and their job is usually just parsing and emitting key value pairs. Hadoop's choices are usually fine. If not it is usually because the mapper is emitting far more data than it ingests. Are you computing some kind of Cartesian product of input? That's slow no matter what. More mappers may increase parallelism but it's still a lot of I/O. Avoid it if you can by sampling or pruning unimportant values.
Otherwise, try to implement a Combiner. On Jan 23, 2013 12:04 PM, Jonas Grote jfgr...@gmail.com wrote: I'd play with the mapred.map.tasks option. Setting it to something bigger than 1 gave me performance improvements for various hadoop jobs on my cluster. 2013/1/16 Ashish paliwalash...@gmail.com I am afraid I don't know the answer. Need to experiment a bit more. I have not used CompositeInputFormat so cannot comment. Probably, someone else on the ML (Mailing List) would be able to guide here. On Wed, Jan 16, 2013 at 6:01 PM, Stuti Awasthi stutiawas...@hcl.com wrote: Thanks Ashish, So according to the link, if one is using CompositeInputFormat then it will take the entire file as input to a mapper without considering InputSplits/blocksize. If I am understanding it correctly then it is asking to break [Original Input File] -> [file1, file2, ...]. So if my file is [/test/MatrixA] -> [/test/smallfiles/file1, /test/smallfiles/file2, /test/smallfiles/file3, ...], now will the input path in MatrixMultiplicationJob be the directory path /test/smallfiles? Will breaking the file in such a manner cause problems in the algorithmic execution of the MR job? I'm not sure if the output will be correct. -Original Message- From: Ashish [mailto:paliwalash...@gmail.com] Sent: Wednesday, January 16, 2013 5:44 PM To: user@mahout.apache.org Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ? MatrixMultiplicationJob internally sets InputFormat as CompositeInputFormat JobConf conf = new JobConf(initialConf, MatrixMultiplicationJob.class); conf.setInputFormat(CompositeInputFormat.class); and AFAIK, CompositeInputFormat ignores the splits. See this http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join Unfortunately, I don't know any other alternative as of now. On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi stutiawas...@hcl.com wrote: The issue is that currently my matrix is of dimension (100x100k), Later it can be (1Mx10M) or bigger.
Even now my job is running with a single mapper for (100x100k) and it is not able to complete the job. As I mentioned, the map task just proceeds to 0.99% and starts spilling the map output. Hence I wanted to tune my job so that Mahout is able to complete the job and I can utilize my cluster resources. As MatrixMultiplicationJob is an MR job, it should be able to handle parallel map tasks. I am not sure if there are any algorithmic constraints due to which it runs only with a single mapper? I have taken the reference of this thread so that I can set the Configuration myself rather than getting it with getConf(), but did not get any success http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers
Re: Precision question
Impossible to say. More data means a more reliable estimate, all else equal. That's about it. On Jan 28, 2013 5:17 PM, Zia mel ziad.kame...@gmail.com wrote: Any thoughts on this? On Sat, Jan 26, 2013 at 10:55 AM, Zia mel ziad.kame...@gmail.com wrote: OK, for precision, when we reduce the size of the sample to 0.1 or 0.05, would the results be related to what we'd see when we check with all the data? For example, if we have data1 and data2 and test them using 0.1 and found that data1 is producing better results, would the same thing stand when we check with all the data? IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 10, GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 0.05); Many thanks On Fri, Jan 25, 2013 at 12:26 PM, Sean Owen sro...@gmail.com wrote: No, it takes a fixed 'at' value. You can modify it to do whatever you want. You will see it doesn't bother with users with little data, like 2*at data points. On Fri, Jan 25, 2013 at 6:23 PM, Zia mel ziad.kame...@gmail.com wrote: Interesting. Using IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 5, GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0); Can it be adjusted to each user? In other words, is there a way to select a threshold instead of using 5? Hmm, something like selecting y sets, each set having a minimum of z users? On Fri, Jan 25, 2013 at 12:09 PM, Sean Owen sro...@gmail.com wrote: The way I do it is to set x differently for each user, to the number of items in the user's test set -- you ask for x recommendations. This makes precision == recall, note. It dodges this problem though. Otherwise, if you fix x, the condition you need is stronger, really: each user needs >= x *test set* items in addition to training set items to make this test fair. On Fri, Jan 25, 2013 at 4:10 PM, Zia mel ziad.kame...@gmail.com wrote: When selecting precision at x, let's say 5, should I check that all users have 5 items or more?
For example, if a user has 3 items and they were removed as top items, then how can the recommender suggest items, since there are no items to learn from? Thanks !
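The point made above (ask each user for exactly as many recommendations as they have test items, and precision equals recall) can be sketched outside Mahout. This is a self-contained illustration with made-up item IDs, not Mahout code:

```java
import java.util.*;

// Conceptual sketch: precision@k divides hits by k, recall divides hits by
// the test-set size. When k == |test set| the denominators coincide, so the
// two metrics are equal for every user.
public class PrecisionRecallDemo {
    static double precisionAt(List<Integer> recs, Set<Integer> test, int k) {
        long hits = recs.subList(0, Math.min(k, recs.size())).stream()
                        .filter(test::contains).count();
        return hits / (double) k;
    }

    static double recall(List<Integer> recs, Set<Integer> test, int k) {
        long hits = recs.subList(0, Math.min(k, recs.size())).stream()
                        .filter(test::contains).count();
        return hits / (double) test.size();
    }

    public static void main(String[] args) {
        List<Integer> recs = Arrays.asList(10, 20, 30, 40, 50);       // ranked recommendations
        Set<Integer> test = new HashSet<>(Arrays.asList(20, 40, 99)); // held-out items
        int k = test.size(); // ask for exactly |test set| recommendations
        System.out.println(precisionAt(recs, test, k)); // equal by construction...
        System.out.println(recall(recs, test, k));      // ...to this
    }
}
```

With a fixed k instead, the two denominators differ and the fairness condition quoted above (at least k test-set items per user) kicks in.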
Re: Precision question
Yes several independent samples of all the data will, together, give you a better estimate of the real metric value than any individual one. On Mon, Jan 28, 2013 at 5:41 PM, Zia mel ziad.kame...@gmail.com wrote: What about running several tests on small data , can't that give an indicator of how big data will perform ? Thanks
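The effect described here can be seen with a toy simulation (plain Java, nothing Mahout-specific; the data distribution and seeds are made up): each 10% sample gives a noisy estimate of the full-data mean, and averaging several independent samples tightens the estimate.

```java
import java.util.Random;

// Conceptual sketch: each 10% sample yields a noisy estimate of the
// full-data mean; the average of several independent samples is a better
// estimate than any single one.
public class SampleAverageDemo {
    static double mean(double[] data, double fraction, Random rnd) {
        double sum = 0;
        int n = 0;
        for (double d : data) {
            if (rnd.nextDouble() < fraction) { sum += d; n++; } // keep ~fraction of points
        }
        return sum / n;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[] data = new double[10000];
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextGaussian() + 3.0;

        double fullMean = mean(data, 1.0, new Random(1));   // the "real" metric value
        double oneSample = mean(data, 0.1, new Random(2));  // a single 10% evaluation
        double averaged = 0;
        for (int run = 0; run < 20; run++) {                // 20 independent 10% samples
            averaged += mean(data, 0.1, new Random(100 + run)) / 20;
        }
        System.out.printf("full=%.4f single=%.4f averaged=%.4f%n",
                          fullMean, oneSample, averaged);
    }
}
```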
Re: Precision question
The way I do it is to set x different for each user, to the number of items in the user's test set -- you ask for x recommendations. This makes precision == recall, note. It dodges this problem though. Otherwise, if you fix x, the condition you need is stronger, really: each user needs = x *test set* items in addition to training set items to make this test fair. On Fri, Jan 25, 2013 at 4:10 PM, Zia mel ziad.kame...@gmail.com wrote: When selecting precision at x let's say 5 , should I check that all users have 5 items or more? For example, if a user have 3 items and they were removed as top items, then how can the recommender suggest items since there are no items to learn from? Thanks !
Re: Precision question
No, it takes a fixed at value. You can modify it to do whatever you want. You will see it doesn't bother with users with little data, like 2*at data points. On Fri, Jan 25, 2013 at 6:23 PM, Zia mel ziad.kame...@gmail.com wrote: Interesting. Using IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 5, GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0); Can it be adjusted to each user ? In other words, is there a way to select a threshold instead of using 5 ? mm Something like selecting y set , each set have a min of z user ? On Fri, Jan 25, 2013 at 12:09 PM, Sean Owen sro...@gmail.com wrote: The way I do it is to set x different for each user, to the number of items in the user's test set -- you ask for x recommendations. This makes precision == recall, note. It dodges this problem though. Otherwise, if you fix x, the condition you need is stronger, really: each user needs = x *test set* items in addition to training set items to make this test fair. On Fri, Jan 25, 2013 at 4:10 PM, Zia mel ziad.kame...@gmail.com wrote: When selecting precision at x let's say 5 , should I check that all users have 5 items or more? For example, if a user have 3 items and they were removed as top items, then how can the recommender suggest items since there are no items to learn from? Thanks !
Re: EMR setup for seq2sparse
In my experience, using many small instances hurts since there is more data transferred (less data is local to any given computation) and the instances have lower I/O performance. On the high end, super-big instances become counter-productive because they are not as cheap on the spot market -- and you should be using the spot market for everything but your master for sure. m1.xlarge is a good default. EMR's default config says that each can handle 3 reducers. So set your parallelism to at least 3 times the number of workers you run. If you can get away with computing on one machine, without Hadoop, do so. Distributing via Hadoop tends to cost 5x as much computing resource or more. And, you can rent amazingly huge machines in the cloud. There's still a point past which you can't fit on one machine, or it's not economical -- the huge EC2 instances are expensive and not on the spot market. But it may be big enough for a lot of problems. On Thu, Jan 24, 2013 at 2:01 PM, Matti Kokkola matti.kokk...@iki.fi wrote: Hi, I'm using Mahout to vectorize and cluster data consisting of short texts. So far I have done vectorizing on a single multi-core machine and been quite happy with the results. However, now we are doing a lot of small adjustments to increase the quality of results and thus would like to tighten the feedback loop, i.e. get vectors more quickly. Does anyone have a good reference setup for Amazon EMR configuration for such a task? I tried with 6 m1.small instances, but terminated the job after 24 hrs, because I thought there was something wrong with the setup. I pretty much followed the guides in the Mahout wiki for the basic setup. In the test case, my seq file size was 50MB and previous seq2sparse runs have resulted in around 400k vectors from that data. The rest of the configuration was as follows: - mahout v0.7 - 6 instances, instance type default (m1.small) - numReducers 6 - maxNGramSize 2 Does this sound right (24 hrs and more to come...) for the given data size?
How much improvement should I expect if I use m1.large instances instead? Any other recommendations?-) br, Matti
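The rule of thumb above (roughly three reducer slots per m1.xlarge under EMR's defaults, so at least 3 × workers reducers) would look something like this on the command line. This is a hypothetical invocation: the bucket paths are placeholders, and the exact option names should be checked against your Mahout version.

```shell
# Hypothetical seq2sparse run: 6 m1.xlarge core nodes, ~3 reducer slots each
WORKERS=6
mahout seq2sparse \
  -i s3://my-bucket/seqfiles \
  -o s3://my-bucket/vectors \
  --maxNGramSize 2 \
  --numReducers $((3 * WORKERS))
```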
Re: Boolean preferences and evaluation
Not quite, the evaluation considers every item in the test set to be good, but you would and should fix the test set size across evaluations for this reason. You are right that there is a big assumption there -- that everything in the test set is good. You have to believe your test split process supports that assumption. On Thu, Jan 24, 2013 at 6:37 PM, Zia mel ziad.kame...@gmail.com wrote: In general a boolean recommender will get higher precision than using a recommender with preferences, since the boolean one considers every item as good, which is not true! So is there a way to make a realistic measure from boolean? For example, does dividing the precision by 2 make sense, since we get high precision using boolean? Thanks On Wed, Jan 23, 2013 at 3:49 PM, Ted Dunning ted.dunn...@gmail.com wrote: LLR should not be used to indicate proximity, but rather simply as a value to compare to a threshold. On Thu, Jan 24, 2013 at 1:45 AM, Zia mel ziad.kame...@gmail.com wrote: OK. The TanimotoCoefficientSimilarity and LogLikelihoodSimilarity used in MIA pages 54 and 55 provide a score, so it seems they were not using a boolean recommender, something like code 1 maybe? Thanks On Tue, Jan 22, 2013 at 10:42 AM, Sean Owen sro...@gmail.com wrote: Yes, any metric that concerns estimated value vs real value can't be used since all values are 1. Yes, when you use the non-boolean version with boolean data you always get 1. When you use the boolean version with boolean data you will get nonsense since the output of this recommender is not an estimated rating at all. On Tue, Jan 22, 2013 at 4:40 PM, Zia mel ziad.kame...@gmail.com wrote: I got 0 when I used GenericUserBasedRecommender in code 2, but when using GenericBooleanPrefUserBasedRecommender the score was not 0. I repeated the test with different data and again I got some results. Moreover, when I use DataModel model = new FileDataModel(new File("ua.base")); in code 2, the MAE score was higher.
When you say RMSE can't be used with boolean data, I assume MAE also can't be used? Thanks ! On Tue, Jan 22, 2013 at 10:08 AM, Sean Owen sro...@gmail.com wrote: RMSE can't be used with boolean data.
Re: Boolean preferences and evaluation
Well, if you are throwing away rating data, you are throwing away rating data. They are no longer 100% different but 100% the same. If that's not a good thing to do, don't do it. It's possible that using ratings gets better precision, and it's possible that it doesn't. It depends on whether the ratings data are useful or noise, and whether you use them or not. On Thu, Jan 24, 2013 at 7:52 PM, Zia mel ziad.kame...@gmail.com wrote: There should be something to solve this :) . For example, 2 users having the same items could rate them 100% differently, but using the boolean model their items will be recommended to each other. Is there a chance that using preferences would get higher precision than boolean? If so, when is that case?
Re: Boolean preferences and evaluation
Yes, but the similarities are no longer weights, because there is nothing to weight. They are used to compute a score directly, which is not a weighted average but a function of the similarities themselves. While it is true that more distant neighbors have less effect in general, when the similarities *are* used as weights, it's not true that a small bad contribution can't hurt. A small bad contribution can still be bad. On Thu, Jan 24, 2013 at 7:58 PM, Koobas koo...@gmail.com wrote: A naive question: Boolean recommender means that we are ignoring ratings, but aren't recommendations still weighted by user-user similarities or item-item similarities? Which would also mean that increasing the neighborhood will not deteriorate the results, because bad contributions from farther neighbors are attenuated by their lower similarities.
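The distinction drawn here (a score computed from the similarities themselves versus a similarity-weighted average of ratings) can be sketched in plain Java. This mirrors the idea only, not Mahout's actual implementation, and the numbers are invented:

```java
// Conceptual sketch: with ratings, the estimate is a similarity-weighted
// average of neighbors' ratings; with boolean data there are no ratings to
// weight, so the score is just a function of the similarities of the
// neighbors who have the item (here, their sum).
public class BooleanScoreDemo {
    // rating case: similarity-weighted average of neighbor ratings
    static double weightedAverage(double[] sims, double[] ratings) {
        double num = 0, den = 0;
        for (int i = 0; i < sims.length; i++) {
            num += sims[i] * ratings[i];
            den += Math.abs(sims[i]);
        }
        return num / den;
    }

    // boolean case: score = sum of similarities; a rank score, not a rating
    static double similaritySum(double[] sims) {
        double sum = 0;
        for (double s : sims) sum += s;
        return sum;
    }

    public static void main(String[] args) {
        double[] sims = {0.9, 0.5, 0.2};    // neighbor similarities
        double[] ratings = {4.0, 3.0, 5.0}; // their ratings of the item
        System.out.println(weightedAverage(sims, ratings)); // an estimated rating
        System.out.println(similaritySum(sims));            // only orders items
    }
}
```

Note how a distant neighbor contributes little in both cases, but in the weighted average a neighbor with a bad rating still shifts the estimate, which is the "small bad contribution can still be bad" point above.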
Re: MatrixMultiplicationJob runs with 1 mapper only ?
Mappers are usually extremely fast since they start themselves on top of the data and their job is usually just parsing and emitting key-value pairs. Hadoop's choices are usually fine. If not, it is usually because the mapper is emitting far more data than it ingests. Are you computing some kind of Cartesian product of the input? That's slow no matter what. More mappers may increase parallelism but it's still a lot of I/O. Avoid it if you can by sampling or pruning unimportant values. Otherwise, try to implement a Combiner. On Jan 23, 2013 12:04 PM, Jonas Grote jfgr...@gmail.com wrote: I'd play with the mapred.map.tasks option. Setting it to something bigger than 1 gave me performance improvements for various Hadoop jobs on my cluster. 2013/1/16 Ashish paliwalash...@gmail.com I am afraid I don't know the answer. Need to experiment a bit more. I have not used CompositeInputFormat so cannot comment. Probably, someone else on the ML (mailing list) would be able to guide here. On Wed, Jan 16, 2013 at 6:01 PM, Stuti Awasthi stutiawas...@hcl.com wrote: Thanks Ashish, So according to the link, if one is using CompositeInputFormat then it will take the entire file as input to a mapper without considering InputSplits/blocksize. If I am understanding it correctly, then it is asking to break [Original Input File] -> [file1, file2, ...]. So if my file is [/test/MatrixA] --> [/test/smallfiles/file1, /test/smallfiles/file2, /test/smallfiles/file3, ...] Now will the input path in MatrixMultiplicationJob be the directory path /test/smallfiles ?? Will breaking the file in such a manner cause problems in the algorithmic execution of the MR job? I'm not sure if the output will be correct. -Original Message- From: Ashish [mailto:paliwalash...@gmail.com] Sent: Wednesday, January 16, 2013 5:44 PM To: user@mahout.apache.org Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
MatrixMultiplicationJob internally sets InputFormat as CompositeInputFormat: JobConf conf = new JobConf(initialConf, MatrixMultiplicationJob.class); conf.setInputFormat(CompositeInputFormat.class); and AFAIK, CompositeInputFormat ignores the splits. See this http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join Unfortunately, I don't know any other alternative as of now. On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi stutiawas...@hcl.com wrote: The issue is that currently my matrix is of dimension (100x100k); later it can be (1Mx10M) or bigger. Even now my job is running with a single mapper for (100x100k) and it is not able to complete the job. As I mentioned, the map task just proceeds to 0.99% and starts spilling the map output. Hence I wanted to tune my job so that Mahout is able to complete the job and I can utilize my cluster resources. As MatrixMultiplicationJob is an MR job, it should be able to handle parallel map tasks. I am not sure if there are any algorithmic constraints due to which it runs only with a single mapper? I have taken the reference of this thread so that I can set the Configuration myself rather than getting it with getConf(), but did not get any success http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reducers-in-DistributedRowMatrix-Jobs-td888980.html Stuti -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Wednesday, January 16, 2013 4:46 PM To: Mahout User List Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ? Why do you need multiple mappers? Is one too slow? Many are not necessarily faster for small input. On Jan 16, 2013 10:46 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi, I tried to call it programmatically also but am facing the same issue: only a single map task is running, and that too spilling the map output continuously. Hence I'm not able to generate the output for large matrix multiplication.
Code Snippet : DistributedRowMatrix a = new DistributedRowMatrix(new Path("/test/points/matrixA"), new Path("/test/temp"), Integer.parseInt("100"), Integer.parseInt("10")); DistributedRowMatrix b = new DistributedRowMatrix(new Path("/test/points/matrixA"), new Path("tempDir"), Integer.parseInt("100"), Integer.parseInt("10")); Configuration conf = new Configuration(); conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818"); conf.set("mapred.child.java.opts", "-Xmx2048m"); conf.set("mapred.max.split.size", "10485760"); a.setConf(conf); b.setConf(conf); a.times(b); Where am I going wrong? Any idea? Thanks Stuti -Original Message- From: Stuti Awasthi Sent: Wednesday, January 16, 2013 2:55 PM To: Mahout User List Subject: RE: MatrixMultiplicationJob
Re: Finding best NearestNUserNeighborhood size
The stochastic nature of the evaluation means your results will vary randomly from run to run. This looks to my eyeballs like most of the variation you see. You probably want to average over many runs. You will probably find that accuracy peaks around some neighborhood size: adding more useful neighbors helps, but at some point the next nearest isn't so similar and the additional data harms the result more than helps. On Jan 23, 2013 1:01 PM, Zia mel ziad.kame...@gmail.com wrote: Hi, I used NearestNUserNeighborhood with RMSE in a user recommender that uses PearsonCorrelationSimilarity. I found that changing the neighborhood size has no clear pattern or effect: sometimes the error increases, other times it decreases. Using the neighborhood size with precision has a better pattern. Any reason? Another point is that the RMSE changes for every run since it chooses a different sample, so would running the code 10 or 20 times and taking the average be a good idea, or is there a better thing to do? //-- RUN 1 2, 0.5523623146152608 3, 0.5425283201773704 4, 0.669846658662311 5, 0.5956616542334392 6, 0.6033911039809353 7, 0.6135206544496685 8, 0.5740444208649034 9, 0.642798288443049 10, 0.626653651472 //-- RUN 2 2, 0.5415411343523825 3, 0.6784589323396696 4, 0.6347069968141124 5, 0.6968820296725008 6, 0.5953849874479478 7, 0.6791828191904128 8, 0.6072462830257853 9, 0.6461346217476011 10, 0.6043919119341171 Thanks !
Re: Finding best NearestNUserNeighborhood size
That is good for making a test repeatable because you are picking the same random sample repeatedly. For evaluation purposes here that's not a good thing, and you do want several actually different samples of the result. On Jan 23, 2013 1:19 PM, Stevo Slavić ssla...@gmail.com wrote: When evaluating a recommender, before running the evaluator call RandomUtils.useTestSeed(); to make the splitting of the data set consistent; don't use it in production, just for evaluation. This is all explained more thoroughly in the Mahout in Action book. Kind regards, Stevo Slavic. On Wed, Jan 23, 2013 at 2:01 PM, Zia mel ziad.kame...@gmail.com wrote: Hi, I used NearestNUserNeighborhood with RMSE in a user recommender that uses PearsonCorrelationSimilarity. I found that changing the neighborhood size has no clear pattern or effect: sometimes the error increases, other times it decreases. Using the neighborhood size with precision has a better pattern. Any reason? Another point is that the RMSE changes for every run since it chooses a different sample, so would running the code 10 or 20 times and taking the average be a good idea, or is there a better thing to do? //-- RUN 1 2, 0.5523623146152608 3, 0.5425283201773704 4, 0.669846658662311 5, 0.5956616542334392 6, 0.6033911039809353 7, 0.6135206544496685 8, 0.5740444208649034 9, 0.642798288443049 10, 0.626653651472 //-- RUN 2 2, 0.5415411343523825 3, 0.6784589323396696 4, 0.6347069968141124 5, 0.6968820296725008 6, 0.5953849874479478 7, 0.6791828191904128 8, 0.6072462830257853 9, 0.6461346217476011 10, 0.6043919119341171 Thanks !
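What a fixed test seed buys (repeatability) and costs (a single sample) can be illustrated with java.util.Random directly. The toy split function below is a stand-in for the evaluator's random train/test sampling, not Mahout code:

```java
import java.util.Arrays;
import java.util.Random;

// Toy stand-in for the evaluator's random train/test split: a fixed seed
// always reproduces the same split (repeatable but a single sample), while
// a different seed gives a genuinely different sample.
public class SeedDemo {
    static boolean[] split(long seed, int n, double testFraction) {
        Random rnd = new Random(seed);
        boolean[] inTest = new boolean[n];
        for (int i = 0; i < n; i++) inTest[i] = rnd.nextDouble() < testFraction;
        return inTest;
    }

    public static void main(String[] args) {
        boolean[] a = split(42, 1000, 0.3);
        boolean[] b = split(42, 1000, 0.3); // same seed -> identical split
        boolean[] c = split(43, 1000, 0.3); // different seed -> a different sample
        System.out.println(Arrays.equals(a, b)); // true
        System.out.println(Arrays.equals(a, c)); // false
    }
}
```

So use a fixed seed to compare two recommenders on identical data, and different seeds (averaged over runs) to estimate the metric itself.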
Re: Boolean preferences and evaluation
These can use non-boolean data, as the value will just be ignored. The opposite is what does not work. On Jan 23, 2013 4:45 PM, Zia mel ziad.kame...@gmail.com wrote: OK. The TanimotoCoefficientSimilarity and LogLikelihoodSimilarity used in MIA pages 54 and 55 provide a score, so it seems they were not using a boolean recommender, something like code 1 maybe? Thanks On Tue, Jan 22, 2013 at 10:42 AM, Sean Owen sro...@gmail.com wrote: Yes, any metric that concerns estimated value vs real value can't be used since all values are 1. Yes, when you use the non-boolean version with boolean data you always get 1. When you use the boolean version with boolean data you will get nonsense since the output of this recommender is not an estimated rating at all. On Tue, Jan 22, 2013 at 4:40 PM, Zia mel ziad.kame...@gmail.com wrote: I got 0 when I used GenericUserBasedRecommender in code 2, but when using GenericBooleanPrefUserBasedRecommender the score was not 0. I repeated the test with different data and again I got some results. Moreover, when I use DataModel model = new FileDataModel(new File("ua.base")); in code 2, the MAE score was higher. When you say RMSE can't be used with boolean data, I assume MAE also can't be used? Thanks ! On Tue, Jan 22, 2013 at 10:08 AM, Sean Owen sro...@gmail.com wrote: RMSE can't be used with boolean data.
Re: ItemBased and data size
It's hard to make such a generalization, but all else equal, I'd expect more data to improve results and decrease error, yes. On Wed, Jan 23, 2013 at 8:02 PM, Zia mel ziad.kame...@gmail.com wrote: Is there a relation between ItemBased and data size? I found that when I increase the data size the MAE decreases. Does that indicate anything? Many thanks
Re: Boolean preferences and evaluation
That sounds reversed. Are you sure? Without pref values, you should get 0. With values, you almost certainly won't get 0 RMSE. RMSE can't be used with boolean data. Code #4 needs to use the boolean user-based recommender, or else you will get 1 for all estimates and results are randomly ordered then. On Tue, Jan 22, 2013 at 4:04 PM, Zia mel ziad.kame...@gmail.com wrote: Thanks Sean. - When I used GenericUserBasedRecommender in code 2 I got 0, but when using GenericBooleanPrefUserBasedRecommender both MAE and RMSE in case 2 gave me scores, so is only RMSE not useful, or MAE too? - If I want to compare between recommenders that use preferences and those that don't, does using code 3 and 4 below with GenericRecommenderIRStatsEvaluator make sense? Since using code 2 with GenericBooleanPrefUserBasedRecommender creates a different recommender that uses weights. //--- Code 3 --- DataModel model = new FileDataModel(new File("ua.base")); RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator(); RecommenderBuilder recommenderBuilder = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) throws TasteException { UserSimilarity similarity = new PearsonCorrelationSimilarity(model); UserNeighborhood neighborhood = new NearestNUserNeighborhood(k, similarity, model); return new GenericUserBasedRecommender(model, neighborhood, similarity); }}; //--- Code 4 --- DataModel model = new GenericBooleanPrefDataModel( GenericBooleanPrefDataModel.toDataMap( new FileDataModel(new File("ua.base")))); RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator(); RecommenderBuilder recommenderBuilder = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) throws TasteException { UserSimilarity similarity = new LogLikelihoodSimilarity(model); UserNeighborhood neighborhood = new NearestNUserNeighborhood(k, similarity, model); return new GenericUserBasedRecommender(model, neighborhood, similarity); }}; On
Tue, Jan 22, 2013 at 1:58 AM, Sean Owen sro...@gmail.com wrote: No, it's really #2, since the first still has data that is not true/false. I am not sure what eval you are running, but an RMSE test wouldn't be useful in case #2. It would always be 0 since there is only one value in the universe: 1. No value can ever be different from the right value. On Tue, Jan 22, 2013 at 4:34 AM, Zia mel ziad.kame...@gmail.com wrote: Hi ! Can we say that both code 1 and 2 below are using a boolean recommender, since they both use LogLikelihoodSimilarity? Which code is used by default when no preferences are available? When using GenericUserBasedRecommender in code 1 it gave a score during evaluation, but when using it in code 2 it gave 0. Is the score given by code 1 correct, since the MIA book, page 23, said "In the case of Boolean preference data, only a precision-recall test is available anyway."? //-- Code 1 -- DataModel model = new GroupLensDataModel(new File("ratings.dat")); RecommenderBuilder recommenderBuilder = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) throws TasteException { UserSimilarity similarity = new LogLikelihoodSimilarity(model); UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); return new GenericUserBasedRecommender(model, neighborhood, similarity); }}; //--- Code 2 --- DataModel model = new GenericBooleanPrefDataModel( GenericBooleanPrefDataModel.toDataMap( new FileDataModel(new File("ua.base")))); RecommenderBuilder recommenderBuilder = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) throws TasteException { UserSimilarity similarity = new LogLikelihoodSimilarity(model); UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); return new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity); }}; Many Thanks !
Re: Boolean preferences and evaluation
Yes, any metric that concerns estimated value vs real value can't be used since all values are 1. Yes, when you use the non-boolean version with boolean data you always get 1. When you use the boolean version with boolean data you will get nonsense since the output of this recommender is not an estimated rating at all. On Tue, Jan 22, 2013 at 4:40 PM, Zia mel ziad.kame...@gmail.com wrote: I got 0 when I used GenericUserBasedRecommender in code 2, but when using GenericBooleanPrefUserBasedRecommender the score was not 0. I repeated the test with different data and again I got some results. Moreover, when I use DataModel model = new FileDataModel(new File("ua.base")); in code 2, the MAE score was higher. When you say RMSE can't be used with boolean data, I assume MAE also can't be used? Thanks ! On Tue, Jan 22, 2013 at 10:08 AM, Sean Owen sro...@gmail.com wrote: RMSE can't be used with boolean data.
Re: Question - Mahout Taste - User-Based Recommendations...
Yes that's right. Look at UserBasedRecommender.mostSimilarUserIDs() and Recommender.estimatePreference(). These do what you are interested in, and yes they are easy since they are just steps in the recommendation process anyway. On Tue, Jan 22, 2013 at 6:38 PM, Henning Kuich hku...@gmail.com wrote: Dear All, I am wondering if I understand the user-based recommendation algorithm correctly. I need to be able to answer the following questions, given users and ratings: 1) which users are closest to a given user, and 2) given a user and a product, predict the preference for the product, apart from the standard return-top-N recommendations. But as I understand it, the above two questions are just subquestions of the top-N problem, correct? Because the algorithm determines the closest users since it's a user-based recommender, and since it calculates all potential user-item associations, the second question should also be taken care of. Do I understand this correctly? I would greatly appreciate any help, Henning Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.
Re: Question - Mahout Taste - User-Based Recommendations...
That's a question of using item-item similarity. For that you need to use something based on an ItemSimilarity, which is not user-based but instead the item-based implementation. Or you can just use ItemSimilarity directly to iterate over the possibilities and find the most similar, but the recommender would do it for you. On Tue, Jan 22, 2013 at 7:50 PM, Henning Kuich hku...@gmail.com wrote: Oh, I forgot one thing: Is it just as simple using the user-based recommendation to find similar products, or is this only possible using item-based recommendations? So basically, if a given user rated a certain product with x stars, to figure out what item is most like the one he has just rated, but using only user-based recommendation algorithms? HK On Tue, Jan 22, 2013 at 7:44 PM, Henning Kuich hku...@gmail.com wrote: That's what I thought. I just wanted to make sure! Thanks so much for the quick reply! HK On Tue, Jan 22, 2013 at 7:40 PM, Sean Owen sro...@gmail.com wrote: Yes that's right. Look at UserBasedRecommender.mostSimilarUserIDs() and Recommender.estimatePreference(). These do what you are interested in, and yes they are easy since they are just steps in the recommendation process anyway. On Tue, Jan 22, 2013 at 6:38 PM, Henning Kuich hku...@gmail.com wrote: Dear All, I am wondering if I understand the user-based recommendation algorithm correctly. I need to be able to answer the following questions, given users and ratings: 1) which users are closest to a given user, and 2) given a user and a product, predict the preference for the product, apart from the standard return-top-N recommendations. But as I understand it, the above two questions are just subquestions of the top-N problem, correct? Because the algorithm determines the closest users since it's a user-based recommender, and since it calculates all potential user-item associations, the second question should also be taken care of. Do I understand this correctly?
I would greatly appreciate any help, Henning
Re: Changing in-memory DataModel to a DB dependent only DataModel after building recommender
You would have to write this yourself, yes. If you're not keeping the data in memory, you're not updating the results in real time. So there's no real need to keep any DataModel around at all. Just pre-compute and store recommendations and update them periodically. Nothing has to be on-line then. On Mon, Jan 21, 2013 at 7:54 PM, Ceyhun Can ÜLKER ceyhunc...@gmail.com wrote: Hello, In our application we are using ReloadFromJDBCDataModel for its speed advantage of in-memory representation and being able to update periodically to pull in new data from a database source. However, once the recommender is built we do not want to keep the ratings data in memory (we would like to query the database when rating data is needed). We want to replace the ReloadFromJDBCDataModel with a MySQLJDBCDataModel after the build. But there is no setter method for it; furthermore, the field that keeps the DataModel is in AbstractRecommender (superclass of SVDRecommender) and it is declared final. We thought we could write a new class that derives from DataModel, which initially keeps a Reload model instance (let's call this delegateModel), has a setter method for it, and delegates all DataModel methods, so that we could set this delegateModel field to another instance, say a MySQLJDBCDataModel instance. Is this a good method for removing the in-memory representation dependency after the build process? How can we achieve this change? Or is there an alternative and better way to achieve this? Thanks Ceyhun Can Ulker
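The suggestion above (compute everything offline, store it, keep nothing on-line) reduces to a batch job plus a lookup table. A minimal self-contained sketch, with a made-up stand-in for Recommender.recommend() and an in-memory map standing in for a database table:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the batch approach: compute top-N recommendations for every
// user offline, persist them (a plain map here, a database table in
// practice), and serve requests as simple lookups with no DataModel or
// recommender in memory.
public class PrecomputeDemo {
    // stand-in for Recommender.recommend(userId, howMany); the item IDs it
    // produces are fabricated for the example
    static List<Long> recommend(long userId, int howMany) {
        List<Long> items = new ArrayList<>();
        for (int i = 1; i <= howMany; i++) items.add(userId * 100 + i);
        return items;
    }

    public static void main(String[] args) {
        // offline phase: run periodically (e.g. nightly), then store results
        Map<Long, List<Long>> store = new HashMap<>();
        for (long userId = 1; userId <= 3; userId++) {
            store.put(userId, recommend(userId, 2));
        }
        // online phase: serving is a plain lookup; nothing stays on-line
        System.out.println(store.get(2L)); // [201, 202]
    }
}
```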
Re: Changing in-memory DataModel to a DB dependent only DataModel after building recommender
If you don't have the data in memory you can't compute anything. The recommender itself doesn't do anything without data. That's why it seemed like you really just wanted to compute everything offline first, in which case the simplest solution is to store it however you like and fetch that result however you like. On Mon, Jan 21, 2013 at 8:22 PM, Ceyhun Can ÜLKER ceyhunc...@gmail.com wrote: Hi again, Thank you for your quick reply, Sean. I couldn't understand one point. What do you mean by pre-compute and store recommendations? Doesn't it mean having a dense (rather, filled?) rating matrix? So it would make memory usage much worse, even if it is possible. Wouldn't it be better to keep the model and compute whenever necessary? Thanks Ceyhun Can Ulker On Mon, Jan 21, 2013 at 9:58 PM, Sean Owen sro...@gmail.com wrote: You would have to write this yourself, yes. If you're not keeping the data in memory, you're not updating the results in real time. So there's no real need to keep any DataModel around at all. Just pre-compute and store recommendations and update them periodically. Nothing has to be on-line then. On Mon, Jan 21, 2013 at 7:54 PM, Ceyhun Can ÜLKER ceyhunc...@gmail.com wrote: Hello, In our application we are using ReloadFromJDBCDataModel for its speed advantage of in-memory representation and being able to update periodically to pull in new data from a database source. However, once the recommender is built we do not want to keep the ratings data in memory (we would like to query the database when rating data is needed). We want to replace the ReloadFromJDBCDataModel with a MySQLJDBCDataModel after the build. But there is no setter method for it; furthermore, the field that keeps the DataModel is in AbstractRecommender (superclass of SVDRecommender) and it is declared final.
We thought we could write a new class that implements DataModel, which initially keeps a Reload model instance (let's call this delegateModel), has a setter method for it, and delegates all DataModel methods, so that we could set this delegateModel field to another instance, say a MySqlJDBCDataModel instance. Is this a good method for removing the in-memory representation dependency after the build process? How can we achieve this change? Or is there an alternative and better way to achieve this? Thanks Ceyhun Can Ulker
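The delegate-and-swap idea discussed in this thread can be sketched in plain Java. Note this uses a minimal stand-in interface rather than Mahout's real DataModel (which has many more methods to forward); every name here is illustrative, not actual Mahout API:

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal stand-in for a DataModel-like interface (illustrative only).
interface SimpleDataModel {
    float getPreference(long userId, long itemId);
}

// Wrapper that forwards every call to a swappable delegate. Hand this
// wrapper to the recommender once; later, swap the in-memory delegate
// for a database-backed one without the recommender noticing.
class SwappableDataModel implements SimpleDataModel {
    private final AtomicReference<SimpleDataModel> delegate;

    SwappableDataModel(SimpleDataModel initial) {
        this.delegate = new AtomicReference<>(initial);
    }

    void setDelegate(SimpleDataModel newDelegate) {
        delegate.set(newDelegate);
    }

    @Override
    public float getPreference(long userId, long itemId) {
        return delegate.get().getPreference(userId, itemId);
    }
}
```

The AtomicReference makes the swap safe even if the recommender is queried from another thread while the delegate is being replaced.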
Re: Boolean preferences and evaluation
No, it's really #2, since the first still has data that is not true/false. I am not sure what eval you are running, but an RMSE test wouldn't be useful in case #2. It would always be 0 since there is only one value in the universe: 1. No value can ever be different from the right value. On Tue, Jan 22, 2013 at 4:34 AM, Zia mel ziad.kame...@gmail.com wrote: Hi ! Can we say that both code 1 and code 2 below are using a boolean recommender, since they both use LogLikelihoodSimilarity? Which code is used by default when no preferences are available? When using GenericUserBasedRecommender in code 1 it gave a score during evaluation, but when using it in code 2 it gave 0. Is the score given by code 1 correct, given that the Mahout in Action book (page 23) says "In the case of Boolean preference data, only a precision-recall test is available anyway"? //-- Code 1 -- DataModel model = new GroupLensDataModel(new File("ratings.dat")); RecommenderBuilder recommenderBuilder = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) throws TasteException { UserSimilarity similarity = new LogLikelihoodSimilarity(model); UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); return new GenericUserBasedRecommender(model, neighborhood, similarity); }}; //--- Code 2 --- DataModel model = new GenericBooleanPrefDataModel(GenericBooleanPrefDataModel.toDataMap(new FileDataModel(new File("ua.base")))); RecommenderBuilder recommenderBuilder = new RecommenderBuilder() { public Recommender buildRecommender(DataModel model) throws TasteException { UserSimilarity similarity = new LogLikelihoodSimilarity(model); UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); return new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity); }}; Many Thanks !
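A tiny sketch of why an RMSE-style eval degenerates on boolean data, as the answer explains: every preference is 1, so a recommender that estimates 1 everywhere has zero error by construction. This is illustrative code, not Mahout's evaluator:

```java
// Compute root-mean-square error between actual and predicted ratings.
class BooleanRmseDemo {
    static double rmse(double[] actual, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double d = actual[i] - predicted[i];
            sum += d * d;
        }
        return Math.sqrt(sum / actual.length);
    }
}
```

With boolean data, `actual` is all 1s and any sane estimate is also 1, so `rmse` is identically 0 — which is why only a precision-recall style test is informative there.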
Re: Any utility to solve the matrix inversion in Map/Reduce Way
And do you really need an inverse, or a pseudo-inverse? But no, there are really no direct utilities for this. We could probably tell you how to do it efficiently, though, as long as you don't actually mean a full inverse. On Fri, Jan 18, 2013 at 11:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: Left unsaid in this comment is the fact that matrix inversion of any sizable matrix is almost always a mistake because it is (a) inaccurate and (b) slow. In scalable numerics it is also commonly true that the only really scalable problems are sparse. The reason for that is that systems whose cost grows with O(n^2) cannot be scaled to arbitrary size n. Sparse systems with only k items on average per row can often be handled with o(n) complexity, which is a requirement for a practical system. On Thu, Jan 17, 2013 at 8:49 PM, Koobas koo...@gmail.com wrote: Matrix inversion, as in explicitly computing the inverse, e.g. computing variance / covariance, or matrix inversion, as in solving a linear system of equations? On Thu, Jan 17, 2013 at 7:49 PM, Colin Wang colin.bin.wang.mah...@gmail.com wrote: Hi All, I want to solve matrix inversion, of course at big size, in a Map/Reduce way. I don't know if Mahout offers this kind of utility. Could you please give me some tips? Thank you, Colin
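To make the "you rarely need an explicit inverse" point concrete: when the end goal is x = A^-1 * b, solve A x = b directly instead of forming the inverse, which is both cheaper and more accurate. A minimal dense Gaussian-elimination solver with partial pivoting, purely as an illustration (for real work use a library routine):

```java
// Solve A x = b by Gaussian elimination with partial pivoting.
// Illustrative only; assumes A is square and nonsingular.
class LinearSolve {
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        // Work on copies so the caller's data is untouched.
        double[][] m = new double[n][];
        for (int i = 0; i < n; i++) m[i] = a[i].clone();
        double[] x = b.clone();
        // Forward elimination.
        for (int col = 0; col < n; col++) {
            // Partial pivoting: pick the largest remaining pivot.
            int pivot = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) pivot = r;
            double[] tmpRow = m[col]; m[col] = m[pivot]; m[pivot] = tmpRow;
            double tmp = x[col]; x[col] = x[pivot]; x[pivot] = tmp;
            for (int r = col + 1; r < n; r++) {
                double f = m[r][col] / m[col][col];
                for (int c = col; c < n; c++) m[r][c] -= f * m[col][c];
                x[r] -= f * x[col];
            }
        }
        // Back substitution.
        for (int r = n - 1; r >= 0; r--) {
            for (int c = r + 1; c < n; c++) x[r] -= m[r][c] * x[c];
            x[r] /= m[r][r];
        }
        return x;
    }
}
```

This is O(n^3) for dense matrices, which is exactly Ted's point: at scale you want sparse structure and iterative solvers, not a full inverse.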
Re: Problem with mahout and AWS
You should give more detail about the errors. You are running out of memory on the child workers. This is not surprising since the default memory they allocate is fairly small, and you're running a complete recommender system inside each mapper. It doesn't have much to do with the size of the instance you use. I am not sure what the second thing is; you should give more detail. On Fri, Jan 18, 2013 at 2:02 PM, Iñigo Llamosas inigollamo...@gmail.com wrote: Hi, I am trying to run a simple recommender on AWS, but I'm getting errors when reducing. These are the jar-parameter lines: s3://inigobucket/jars/mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob -Dmapred.input.dir=s3://inigobucket/data/grouplens10m/ratings.dat -Dmapred.output.dir=s3://inigobucket/output/ --recommenderClassName org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender It starts OK, but when reducing it gives 2 kinds of problems: - Heap space error. This confuses me because I had that error with a 2-slave m1.small cluster but also with a 5-slave c1.medium cluster. - org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File exists error. Any suggestions? Many thanks, Inigo
Re: trying to get grouplens example to run
That's the error right there: On Thu, Jan 17, 2013 at 9:57 PM, Kamal Ali k...@grokker.com wrote: Caused by: java.io.IOException: Unexpected input format on line: 1 1 5
RE: MatrixMultiplicationJob runs with 1 mapper only ?
Why do you need multiple mappers? Is one too slow? Many are not necessarily faster for small input. On Jan 16, 2013 10:46 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi, I tried to call it programmatically also but am facing the same issue: only a single map task is running, and it is spilling the map output continuously. Hence I'm not able to generate the output for large matrix multiplication. Code snippet: DistributedRowMatrix a = new DistributedRowMatrix(new Path("/test/points/matrixA"), new Path("/test/temp"), 100, 10); DistributedRowMatrix b = new DistributedRowMatrix(new Path("/test/points/matrixA"), new Path(tempDir), 100, 10); Configuration conf = new Configuration(); conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818"); conf.set("mapred.child.java.opts", "-Xmx2048m"); conf.set("mapred.max.split.size", "10485760"); a.setConf(conf); b.setConf(conf); a.times(b); Where am I going wrong? Any ideas? Thanks Stuti -----Original Message----- From: Stuti Awasthi Sent: Wednesday, January 16, 2013 2:55 PM To: Mahout User List Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ? Hey Sean, Thanks for the response. The MatrixMultiplicationJob help shows the usage like: usage: command [Generic Options] [Job-Specific Options] Here a Generic Option can be provided by -D property=value. Hence I tried with command-line -D options, but it seems they are not making any effect. It is also suggested in: https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/AbstractJob.html Here I have noted one thing after your suggestion: currently I'm passing arguments like -Dproperty=value rather than -D property=value. I tried with a space between -D and property=value also, but then it gives an error like: 13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected /test/points/matrixA while processing Job-Specific Options: No such error comes if I'm passing the arguments without a space after -D. 
By reference to Hadoop: The Definitive Guide: "Do not confuse setting Hadoop properties using the -D property=value option to GenericOptionsParser (and ToolRunner) with setting JVM system properties using the -Dproperty=value option to the java command. The syntax for JVM system properties does not allow any whitespace between the D and the property name, whereas GenericOptionsParser requires them to be separated by whitespace." Hence I suppose that generic options should be parsed as -D property=value rather than -Dproperty=value. Additionally I tried -Dmapred.max.split.size=10485760 through the command line as well, but again only a single map task started. Please suggest. -----Original Message----- From: Sean Owen [mailto:sro...@gmail.com] Sent: Wednesday, January 16, 2013 1:23 PM To: Mahout User List Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ? It's up to Hadoop in the end. Try calling FileInputFormat.setMaxInputSplitSize() with a smallish value, like your 10MB (1000). I don't know if Hadoop params can be set as sys properties like that anyway? On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi, I am trying to multiply dense matrices of size [100 x 100k]. The size of the file is 104MB, and with the default block size of 64MB only 2 blocks are created. So I reduced the block size to 10MB and now my file is divided into 11 blocks across the cluster. The cluster size is 10 nodes with 1 NN/JT and 9 DN/TT. Every time I run the Mahout MatrixMultiplicationJob through the command line, I can see on the JobTracker WebUI that only 1 map task is launched. According to my understanding of InputSplit, there should be 11 map tasks launched. Apart from this, the map task stays at 0.99% completion, and in the task logs I can see that the map task is spilling the map output. 
Mahout command: mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200 -Dio.file.buffer.size=131072 --inputPathA /test/matrixA --numRowsA 100 --numColsA 10 --inputPathB /test/matrixA --numRowsB 100 --numColsB 10 --tempDir /test/temp Now I want to know why only 1 map task is launched every time, and how I can tune the cluster so that I can perform a dense matrix multiplication of the order [90K x 1 Million]. Thanks Stuti
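For reference, the split arithmetic behind the "11 blocks" expectation in the thread: the number of input splits is roughly the file size divided by the split size, rounded up — assuming the input format actually honors the configured max split size:

```java
// Back-of-the-envelope input-split count: ceil(fileBytes / splitBytes).
class SplitMath {
    static long numSplits(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes; // ceiling division
    }
}
```

A 104 MB file with a 10 MB split size gives ceil(104/10) = 11 splits, hence up to 11 map tasks; if Hadoop still launches one mapper, the split-size setting is not reaching the job.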
Re: Test multiple similarities using the same data
You can try resetting all the random seeds with RandomUtils.useTestSeed() On Jan 16, 2013 4:01 PM, Zia mel ziad.kame...@gmail.com wrote: Hi How to evaluate a recommender using different similarities ? Once we call evaluator.evaluate(recommenderBuilder,..) it will decide the training and test data for that recommender and if we call it again for another setting (similarity,neighborhood) the data will be different. So how can we be consistent ? Thanks !
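The reason a fixed seed gives consistency: the train/test split is driven by a random number generator, and reseeding it identically reproduces the same split, so different similarity or neighborhood settings are compared on identical data. A stdlib-only illustration of the idea (not Mahout's actual splitting code):

```java
import java.util.Random;

// With the same seed, the same train/test mask comes out every time.
class SeededSplit {
    static boolean[] trainMask(int n, long seed, double trainFraction) {
        Random rng = new Random(seed);
        boolean[] mask = new boolean[n];
        for (int i = 0; i < n; i++) {
            mask[i] = rng.nextDouble() < trainFraction; // true = training row
        }
        return mask;
    }
}
```

This is what RandomUtils.useTestSeed() accomplishes globally inside Mahout: every evaluator run draws from the same fixed-seed sequence.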
Re: Recommend to a group of users
Not really directly, no. You can make N individual recommendations and combine them, and there are many ways to do that. You can blindly rank them on their absolute scores. You can interleave rankings so each user gets every Nth slot in the recommendation. A popular approach is to rank by least-aversion: the best recommendation is the one most acceptable to the person who will like it least in the group. You're minimizing maximum unhappiness, which is often how it works in groups! On Wed, Jan 16, 2013 at 4:56 PM, Zia mel ziad.kame...@gmail.com wrote: Hi Can we use Mahout to recommend to a group of users that share similar interests? Maybe some clustering or so. Thanks
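The least-aversion strategy (often called "least misery") can be sketched as: score each candidate item for every group member, take the minimum as the group score, and rank by that. Illustrative code with hypothetical names, not a Mahout API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Least-misery aggregation: an item's group score is the minimum of the
// individual members' predicted scores, so ranking by it minimizes the
// maximum unhappiness in the group.
class LeastMisery {
    // scores: one map of itemId -> predicted score per group member
    static Map<Long, Double> groupScores(List<Map<Long, Double>> scores) {
        Map<Long, Double> result = new HashMap<>(scores.get(0));
        for (Map<Long, Double> user : scores.subList(1, scores.size())) {
            // Keep only items every member has a score for.
            result.keySet().retainAll(user.keySet());
            for (Map.Entry<Long, Double> e : result.entrySet()) {
                e.setValue(Math.min(e.getValue(), user.get(e.getKey())));
            }
        }
        return result;
    }
}
```

Interleaving or ranking by absolute score are drop-in alternatives: only the aggregation step changes, not the per-user recommendation calls.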
Re: threshold assignment / selection
It's fairly arbitrary. Strong positive ratings are probably more than merely above average, but you could define the threshold higher or lower if you wanted. It's a good default. On Tue, Jan 15, 2013 at 3:58 PM, Zia mel ziad.kame...@gmail.com wrote: Hi, Why in the recommender is the threshold considered the user's average preference value plus one standard deviation? Can we assume that good recommendations are anything above the user's average preferences? Many thanks
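The default threshold described above, computed directly (population standard deviation here; Mahout's own implementation may differ in detail):

```java
// "Relevant" cutoff for a user: mean preference plus one standard deviation.
class Threshold {
    static double relevanceThreshold(double[] prefs) {
        double mean = 0;
        for (double p : prefs) mean += p;
        mean /= prefs.length;
        double var = 0;
        for (double p : prefs) var += (p - mean) * (p - mean);
        var /= prefs.length; // population variance
        return mean + Math.sqrt(var);
    }
}
```

For ratings {1, 2, 3, 4, 5} the mean is 3 and the standard deviation is about 1.41, so only items rated above roughly 4.41 count as relevant — stricter than "above average", which matches the reasoning in the reply.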
Re: Choosing precision
Precision is not a great metric for recommenders, but it exists. There is no best value here; I would choose something that mirrors how you will use the results. If you show top 3 recs, use 3. On Tue, Jan 15, 2013 at 4:51 PM, Zia mel ziad.kame...@gmail.com wrote: Hello, If I have users that have between 1 and 20 items, what would be the ideal way to evaluate the recommender using precision? Is there any recommended precision to choose, such as p@2, p@5, or p@10, and why? Many thanks
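Precision at N, as discussed, is just the fraction of the top N recommended items that turn out to be relevant in the held-out data. A minimal sketch:

```java
import java.util.List;
import java.util.Set;

// Precision@N: hits among the top N recommendations, divided by N
// (or by the list length, if fewer than N items were recommended).
class PrecisionAtN {
    static double precisionAtN(List<Long> ranked, Set<Long> relevant, int n) {
        int k = Math.min(n, ranked.size());
        if (k == 0) return 0.0;
        int hits = 0;
        for (long item : ranked.subList(0, k)) {
            if (relevant.contains(item)) hits++;
        }
        return hits / (double) k;
    }
}
```

This makes the advice concrete: if you display 3 recommendations, p@3 is the quantity that actually reflects what users see.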
Re: Choosing precision
The best tests are really with real users. A/B test different recommenders and see which has better performance. That's not always practical, though. The problem is that you don't even know what the best recommendations are. Splitting the data by date is reasonable, but recent items aren't necessarily the most liked. Splitting by rating is more reasonable on this point, but you still can't conclude that there aren't better recommendations among the un-rated items. Still, it ought to correlate. I think you will find precision/recall are very low in most cases, often a few percent. The result is noisy. AUC will tell you where all of the best recommendations in the test set fell in the list, rather than only measuring the top N's performance. This tells you more, and I think that's generally good. However, it is measuring performance over the entire list of recs, when you are unlikely to use more than the top N. Go ahead and use it, since there's not a lot better you can do in the lab, but be aware of the issues.
Re: RMSRecommenderEvaluator RMSE
You have the definition there already; what are you asking? On Jan 15, 2013 5:58 PM, Zia mel ziad.kame...@gmail.com wrote: Hi again, When evaluating preferences in recommenders and using RMSRecommenderEvaluator, is it RMSE/RMSD? http://en.wikipedia.org/wiki/Root_mean_square_deviation If we get a value of 1 or 10 for RMSE, what does that really mean? Can we represent RMSE as a % by dividing it by the range of preferences to get a % of error? For example, if the RMSE is 1 and the range is 0-5, can we say that the error of predicting is 1/5 = 20%? Thanks
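The computation the question describes, for reference: RMSE over predicted vs. actual ratings, plus the range-normalized variant. Dividing by the rating range gives a unitless fraction that is comparable across rating scales, though it is not literally a "percent chance of error":

```java
// RMSE and a range-normalized RMSE (sometimes called NRMSE).
class RmseDemo {
    static double rmse(double[] actual, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double d = actual[i] - predicted[i];
            sum += d * d;
        }
        return Math.sqrt(sum / actual.length);
    }

    // Divide by the rating range to get a scale-free error fraction.
    static double normalizedRmse(double[] a, double[] p, double min, double max) {
        return rmse(a, p) / (max - min);
    }
}
```

On a 0-5 scale, an RMSE of 1 normalizes to 0.2, matching the 20% figure in the question; just remember RMSE weights large errors more heavily than an average absolute error would.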
Re: Failed to create /META-INF/license file on Mac system
http://stackoverflow.com/questions/10522835/hadoop-java-io-ioexception-mkdirs-failed-to-create-some-path On Tue, Jan 15, 2013 at 9:42 PM, Yunming Zhang zhangyunming1...@gmail.com wrote: Hi, I was trying to set up Mahout 0.8 on my Macbook Pro with OSX so I could do some local testing, I am running Hadoop 1.0.3 (it worked fine with mahout in my cluster) I have set up Pseudo distribution Hadoop, and I could put testdata direction into HDFS, But when I try to execute $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job I get Exception in thread main java.io.IOException: Mkdirs failed to create /PATH-TO-TMP/hadoop-unjar6845980999143023006/META-INF/license at org.apache.hadoop.util.RunJar.unJar(RunJar.java:47) at org.apache.hadoop.util.RunJar.main(RunJar.java:132) It seems to be a really similar issue to this bug https://issues.apache.org/jira/browse/MAHOUT-780 but I am using Mahout 0.8, so I am not sure what is happening here, I have checked, there should be permission to the PATH-TO-TMP directory, so I don't think it is a permission issue Thanks Yunming
Re: MatrixMultiplicationJob runs with 1 mapper only ?
It's up to Hadoop in the end. Try calling FileInputFormat.setMaxInputSplitSize() with a smallish value, like your 10MB (1000). I don't know if Hadoop params can be set as sys properties like that anyway? On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi, I am trying to multiply dense matrices of size [100 x 100k]. The size of the file is 104MB, and with the default block size of 64MB only 2 blocks are created. So I reduced the block size to 10MB and now my file is divided into 11 blocks across the cluster. The cluster size is 10 nodes with 1 NN/JT and 9 DN/TT. Every time I run the Mahout MatrixMultiplicationJob through the command line, I can see on the JobTracker WebUI that only 1 map task is launched. According to my understanding of InputSplit, there should be 11 map tasks launched. Apart from this, the map task stays at 0.99% completion, and in the task logs I can see that the map task is spilling the map output. Mahout command: mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200 -Dio.file.buffer.size=131072 --inputPathA /test/matrixA --numRowsA 100 --numColsA 10 --inputPathB /test/matrixA --numRowsB 100 --numColsB 10 --tempDir /test/temp Now I want to know why only 1 map task is launched every time, and how I can tune the cluster so that I can perform a dense matrix multiplication of the order [90K x 1 Million]. Thanks Stuti