Re: How to map UUID to userId in Preference class to use mahout recommender?

2013-04-07 Thread Sean Owen
You can use the low-order bits, or have a look at what the UUID class
does to hash itself to 32 bits in hashCode() and emulate that for 64
bits. Collisions in a 64-bit space are very very very rare, enough to
not care about here by a wide margin. A collision only means you
confuse prefs from two users -- it still mostly works anyway.

Yes keys were originally Comparable. It was just too much memory /
performance overhead. Instead, you can use a mapping to/from 64-bit
values. See IDMigrator for instance.
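
A minimal sketch of the folding idea, written in Python rather than Java for brevity. It XORs the high and low 64-bit halves of the UUID, which is the same trick java.util.UUID.hashCode() uses before narrowing to 32 bits, and reinterprets the result as a signed 64-bit value so it would fit a Java long. The helper name and the sample UUID are made up for illustration.

```python
import uuid

MASK64 = (1 << 64) - 1

def uuid_to_long(u: uuid.UUID) -> int:
    """Fold a 128-bit UUID into a signed 64-bit integer (hypothetical helper)."""
    hi = u.int >> 64          # most significant 64 bits
    lo = u.int & MASK64       # least significant 64 bits
    folded = hi ^ lo          # XOR mixes both halves into one 64-bit value
    # Reinterpret as two's complement so the value fits a Java long.
    return folded - (1 << 64) if folded >= (1 << 63) else folded

# Illustrative UUID, not taken from the thread.
example = uuid.UUID("00000000-0000-0001-0000-000000000002")
user_id = uuid_to_long(example)  # hi = 1, lo = 2, so the fold is 3
```

If the folded values behave like uniform 64-bit numbers, the birthday approximation puts the chance of any collision among n users at roughly n^2 / 2^65, which is about 0.0003 even for 100 million users.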

On Mon, Apr 8, 2013 at 3:51 AM, Phoenix Bai baizh...@gmail.com wrote:
 Hi All,

 the input format required for mahout recommender is :

 *userId (long), itemId (long), rating (optional)*

 while, currently, my input format is:

 *userId (UUID, which is 128bit long), itemId (long), boolean*

 so, my question is, how could I convert userId in UUID format to long
 datatype?
 e.g. how to map value like *550e8400-e29b-41d4-a716-44665544* to long
 datatype?

 My current solution is to convert it to java UUID instance and extract the
 least significant bits and use it as long type userId.
 But I am worried about the collision that is not supposed to exist with
 uuid.

 I am wondering two things:
 1) if the collision is low, could I use above approach? what's the possible
 pros and cons?
 2) is it possible to change or extend Preference class to modify userId to
 String datatype? is it feasible?

 thanks


Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-06 Thread Sean Owen
For example, here's Y:

Y =

  -0.278098  -0.256438   0.127559  -0.045869  -0.769172  -0.255599   0.150450  -0.436548   0.209881  -0.526238
   0.613175  -0.600739  -0.291662  -1.142282   0.277204  -0.296846  -0.175122   0.031656  -0.202138  -0.254480
  -0.187816  -0.889571   0.052191  -0.304053   0.498097  -0.049822  -0.972282  -0.240532   0.155711  -0.627668
  -0.065179  -0.055424   0.977480   0.104342   0.594501   0.033205  -0.896222  -0.345715  -0.371288  -0.489602
  -0.434807  -0.403650   0.264583  -0.110285  -1.318951  -0.452470   0.274445  -0.755704   0.313150  -0.903234

and R from the QR decomposition of Y' * Y:

R =

   2.56259  -1.35164  -2.43837   1.27844  -0.17692  -0.30514   1.09366  -0.84664   0.58601   1.06875
       0.0   1.03316   2.61600  -0.46070  -1.46785  -0.10841   0.24828  -2.32186  -2.00163  -0.71470
       0.0       0.0   2.11507   1.15523   1.10757   0.36407  -0.31567   2.77361   0.77367  -0.84055
       0.0       0.0       0.0   0.54242  -0.01545   0.21761   0.26630   0.13972   0.44089   0.02783
       0.0       0.0       0.0       0.0       0.0      -0.0      -0.0       0.0       0.0      -0.0
       0.0       0.0       0.0       0.0       0.0       0.0       0.0      -0.0       0.0       0.0
       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0      -0.0
       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0      -0.0
       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0
       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0


Separately I tried avoiding the inverse altogether here and just using
the QR decomposition to solve a system where necessary. Probably a
better move anyway. But same result. I think I'm not really
quantifying the problem properly, but it's not really a matter of
condition number or machine precision. Condition numbers are > 1 in
these cases but not that large.


On Sun, Apr 7, 2013 at 12:19 AM, Koobas koo...@gmail.com wrote:
 I don't see why the inverse of Y'*Y does not exist.
 What Y do you end up with?


Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-05 Thread Sean Owen
(On this aside -- the Commons Math version uses Householder
reflections but operates on a transposed representation for just this
reason.)

On Thu, Apr 4, 2013 at 11:11 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 But then I started trying to build a HH version using vector ops and
 realized  that the likely reason for the speed is actually just due to the
 fact that the matrix is stored in row major form.  The operations in my GS
 implementation are very much row oriented.  The operations in the old HH
 implementation were very column oriented.

 It is hard to frame HH in a row major fashion.  I might be able to figure
 out a Givens rotation method that is row oriented.

 The payoff is that doing HH well (or Givens) should give about another 2x
 speedup.

 The downside is that nobody has time to fix stuff that isn't broken.


Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-05 Thread Sean Owen
OK yes you're on to something here. I should clarify. Koobas you are
right that the ALS algorithm itself is fine here as far as my
knowledge takes me. The thing it inverts to solve for a row of X is
something like (Y' * Cu * Y + lambda * I). No problem there, and
indeed I see why the regularization term is part of that.

I'm talking about a later step, after the factorization. You get a new
row in A and want to solve A = X*Y' for X, given the current Y. (And
vice versa). I'm using a QR decomposition for that, but not to
directly solve the system (and this may be the issue), but instead to
compute and save off (Y' * Y)^-1 so that we can figure A * Y *
(Y'*Y)^-1 very fast at runtime. That is to say the problem centers
around the inverse of Y'*Y and in this example, it does not even
exist.

I am not sure it's just a numerical precision thing since using an SVD
to get the inverse gives the same result.

But I certainly have examples where the data (A) is most certainly
rank at least k and I still get this bad behavior -- for example, when
lambda is very *high*.


On Fri, Apr 5, 2013 at 6:57 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 On Fri, Apr 5, 2013 at 2:40 AM, Koobas koo...@gmail.com wrote:

 Anyways, I saw no particular reason for the method to fail with k
 approaching or exceeding m and n.
 It does if there is no regularization.
 But with regularization in place, k can be pretty much anything.


 Ahh... this is an important point and it should handle all of the issues of
 poor conditioning.

 The regularizer takes the rank deficient A and makes it reasonably well
 conditioned.  How well conditioned depends on the choice of lambda, the
 regularizing scale constant.
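
A small sketch of that effect, assuming a made-up rank-1 Y with rows (1, 2) and (2, 4) and using the closed-form eigenvalues of a symmetric 2 x 2 matrix. The function names are illustrative only.

```python
import math

def sym2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    half_trace = (a + d) / 2.0
    det = a * d - b * b
    root = math.sqrt(max(half_trace * half_trace - det, 0.0))
    return half_trace + root, half_trace - root

def regularized_condition_number(a, b, d, lam):
    """Condition number of [[a, b], [b, d]] + lam * I (infinite if singular)."""
    big, small = sym2x2_eigenvalues(a + lam, b, d + lam)
    return big / small if small > 0.0 else math.inf

# Y has rows (1, 2) and (2, 4), so Y'Y = [[5, 10], [10, 20]] has rank 1.
unregularized = regularized_condition_number(5.0, 10.0, 20.0, 0.0)
with_lambda = regularized_condition_number(5.0, 10.0, 20.0, 0.1)
```

Here Y'Y has eigenvalues 25 and 0, so the regularized matrix has eigenvalues 25 + lambda and lambda, and its condition number is (25 + lambda) / lambda: infinite without regularization, about 251 for lambda = 0.1, and smaller still as lambda grows.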


Detecting rank-deficiency, or worse, via QR decomposition

2013-04-04 Thread Sean Owen
This is more of a linear algebra question, but I thought it worth
posing to the group --

As part of a process like ALS, you solve a system like A = X * Y' for
X or for Y, given the other two. A is sparse (m x n); X and Y are tall
and skinny (m x k and n x k, where k << m,n).

For example to solve for X, just:   X = A * Y * (Y' * Y)^-1

This fails if the k x k matrix Y' * Y is not invertible of course.
This can happen if the data is tiny and k is actually large relative
to m,n.

It also goes badly if it is nearly not invertible. The solution for X
can become very large, for example, for a small A, which is obviously
wrong. You can -- often -- detect this by looking at the diagonal of
R in a QR decomposition, looking for near-zero values.
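
One way to sketch that diagonal check, in plain Python. This uses modified Gram-Schmidt purely to keep the example short; it is not the Householder-based QR mentioned elsewhere in the thread, and the relative tolerance of 1e-9 is an arbitrary assumption, not a recommended value.

```python
import math

def qr_r_factor(a):
    """Return R from a modified Gram-Schmidt QR of a (given as a list of rows)."""
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    q = []
    for j in range(n):
        v = cols[j][:]
        for i, qi in enumerate(q):
            # Project out each earlier direction from the running remainder.
            r[i][j] = sum(x * y for x, y in zip(qi, v))
            v = [x - r[i][j] * y for x, y in zip(v, qi)]
        norm = math.sqrt(sum(x * x for x in v))
        r[j][j] = norm
        q.append([x / norm for x in v] if norm > 0.0 else [0.0] * m)
    return r

def numerical_rank(r, rel_tol=1e-9):
    """Count diagonal entries of R that are not tiny relative to the largest one."""
    diag = [abs(r[i][i]) for i in range(len(r))]
    biggest = max(diag)
    if biggest == 0.0:
        return 0
    return sum(1 for d in diag if d > rel_tol * biggest)
```

A numerical rank below k for the k x k matrix Y' * Y is the signal that the inverse should not be trusted.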

However I find a similar behavior even when the rank k seems
intuitively fine (easily low enough given the data), but when, for
example, the regularization term is way too high. X and Y are so
constrained that the inverse above becomes a badly behaved operator
too.

I think I understand the reasons for this intuitively. The goal isn't
to create a valid solution since there is none here; the goal is to
define and detect this bad situation reliably and suggest a fix to
parameters if possible.

I have had better success looking at the operator norm of (Y' * Y)^-1
(its largest singular value) to get a sense of when it is going to
potentially scale its input greatly, since that's a sign it's bad, but
I feel like I'm missing the rigorous understanding of what to do with
that info. I'm looking for a way to think about a cutoff or threshold
for that singular value that will make it be rejected (> 1?) but think
I have some unknown-unknowns in this space.

Any insights or pointers into the next concept that's required here
are appreciated.

Sean


Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-04 Thread Sean Owen
I think that's what I'm saying, yes. Small rows X shouldn't become
large rows of A -- and similarly small changes in X shouldn't mean
large changes in A. Not quite the same thing but both are relevant. I
see that this is just the ratio of largest and smallest singular
values. Is there established procedure for evaluating the
ill-conditioned-ness of matrices -- like a principled choice of
threshold above which you say it's ill-conditioned, based on k, etc.?

On Thu, Apr 4, 2013 at 3:19 PM, Koobas koo...@gmail.com wrote:
 So, the problem is that the kxk matrix is ill-conditioned, or is there more
 to it?



Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-04 Thread Sean Owen
Does it complete without problems? It may complete without error but
the result may be garbage. The matrix that's inverted is not going to
be singular due to round-off. Even if it's not you may find that the
resulting vectors are infinite or very large. In particular I at least
had to make the singularity threshold a lot larger than
Double.MIN_VALUE in the QR decomposition.

Try some simple dummy data like below, with maybe k=10. If it
completes without error that's a problem!

0,0,1
0,1,4
0,2,3
1,2,3
2,1,4
2,3,3
2,4,2
3,0,5
3,2,2
3,4,3
4,3,5
5,0,2
5,1,4

On Thu, Apr 4, 2013 at 7:05 PM, Koobas koo...@gmail.com wrote:
 I took Movie Lens 100K data without ratings and ran non-weighted ALS in
 Matlab.
 I set number of features k=2000, which is larger than the input matrix
 (1000 x 1700).
 I used QR to do the inversion.
 It runs without problems.
 Can you share your data?



 On Thu, Apr 4, 2013 at 1:10 PM, Koobas koo...@gmail.com wrote:

 Just to throw another bit.
 Just like Ted was saying.
 If you take the largest singular value over the smallest singular value,
 you get your condition number.
 If it turns out to be 10^16, then you're losing all the digits of double
 precision accuracy,
 meaning that your solver is nothing more than a random number generator.




 On Thu, Apr 4, 2013 at 12:21 PM, Dan Filimon dangeorge.fili...@gmail.com wrote:

 For what it's worth, here's what I remember from my Numerical Analysis
 course.

 The thing we were taught to use to figure out whether the matrix is ill
 conditioned is the condition number of a matrix (k(A) = norm(A) *
 norm(A^-1)). Here's a nice explanation of it [1].

 Suppose you want to solve Ax = b. How much worse results will you get
 using
 A if you're not really solving Ax = b but A(x + delta) = b + epsilon (x is
 still a solution for Ax = b).
 So, by perturbing the b vector by epsilon, how much worse is delta going
 to
 be? There's a short proof [1, page 4] but the inequality you get is:

 norm(delta) / norm(x) <= k(A) * norm(epsilon) / norm(b)

 The rule of thumb is that if m = log10(k(A)), you lose m digits of
 accuracy. So, equivalently, if m' = log2(k(A)) you lose m' bits of
 accuracy.
 Since floats are 32bits, you can decide that say, at most 2 bits may be
 lost, therefore any k(A) > 4 is not acceptable.
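
The rule of thumb can be sketched as below. The singular values are made up, and the function names are illustrative, not from any library.

```python
import math

def condition_number(singular_values):
    """Ratio of the largest to the smallest singular value."""
    smallest = min(singular_values)
    return math.inf if smallest == 0.0 else max(singular_values) / smallest

def digits_lost(cond, base=10.0):
    """Rule-of-thumb precision loss: the logarithm of the condition number."""
    return math.log(cond, base)

kappa = condition_number([8.0, 2.0, 0.5])   # 8 / 0.5 = 16
bits = digits_lost(kappa, base=2.0)          # log2(16), i.e. 4 bits
decimal_digits = digits_lost(kappa)          # log10(16), about 1.2 digits
```

Read the other way, a budget of 2 lost bits corresponds to a condition number of 2^2 = 4.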

 Anyway there are lots of possible norms and you need to look at ways of
 actually interpreting the condition number but from what I learned this is
 probably the thing you want to be looking at.

 Good luck!

 [1] http://www.math.ufl.edu/~kees/ConditionNumber.pdf
 [2] http://www.rejonesconsulting.com/CS210_lect07.pdf


 On Thu, Apr 4, 2013 at 5:26 PM, Sean Owen sro...@gmail.com wrote:

  I think that's what I'm saying, yes. Small rows X shouldn't become
  large rows of A -- and similarly small changes in X shouldn't mean
  large changes in A. Not quite the same thing but both are relevant. I
  see that this is just the ratio of largest and smallest singular
  values. Is there established procedure for evaluating the
  ill-conditioned-ness of matrices -- like a principled choice of
  threshold above which you say it's ill-conditioned, based on k, etc.?
 
  On Thu, Apr 4, 2013 at 3:19 PM, Koobas koo...@gmail.com wrote:
   So, the problem is that the kxk matrix is ill-conditioned, or is there
  more
   to it?
  
 





Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-04 Thread Sean Owen
It might make a difference that you're just running 1 iteration. Normally
it's run to 'convergence' -- or here let's say, 10+ iterations to be safe.

This is the QR factorization of Y' * Y at the finish? This seems like it
can't be right... Y has only 5 vectors in 10 dimensions and Y' * Y is
certainly not invertible. I get:

   1.20857  -0.20462   0.08707  -0.16972   0.17038   0.00342   0.24459  -0.23287   0.51142  -0.06083
       0.0   1.13242   0.23155   0.24354   0.32995   0.47781  -0.02832   0.43071  -0.24968   0.41470
       0.0       0.0   0.91070   0.37732   0.05296   0.39886  -0.62426   0.07809   0.53891   0.24877
       0.0       0.0       0.0   0.69369  -0.21648  -0.10501   0.09706  -0.03683  -0.10512   0.02849
       0.0       0.0       0.0       0.0   0.60165   0.37106  -0.00193  -0.23392   0.10109  -0.09897
       0.0       0.0       0.0       0.0       0.0       0.0      -0.0      -0.0      -0.0       0.0
       0.0       0.0       0.0       0.0       0.0       0.0       0.0      -0.0      -0.0      -0.0
       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0      -0.0       0.0
       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0      -0.0
       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0

I think there are some other differences here but probably not meaningful
in this context. For example I was doing implicit-feedback ALS. (But the
result above is from an Octave implementation of regular ALS like what
you're running)

There are a bunch of useful thoughts here I am going to both read up and
explore as conditions.


On Thu, Apr 4, 2013 at 8:54 PM, Koobas koo...@gmail.com wrote:

 BTW, my initialization of X and Y is simply random:
 X = rand(m,k);
 Y = rand(k,n);



 On Thu, Apr 4, 2013 at 3:51 PM, Koobas koo...@gmail.com wrote:

 It's done in one iteration.
  This is the R from QR factorization:

  5.0663   5.8122   4.9704   4.3987   6.3400   4.5970   5.0334   4.2581   3.3808   5.3250
       0   2.4036   1.1722   2.3296   1.6580   0.4575   1.1706   2.1040   1.6738   1.4839
       0        0   1.5085   0.0966   1.2581   0.5236   0.4712  -0.0411   0.3143   0.5957
       0        0        0   1.8682   0.1834  -0.3244  -0.0073   0.3817   1.1673   0.4783
       0        0        0        0   1.9569   0.8666   0.3201  -0.4167   0.0732   0.3114
       0        0        0        0        0   1.3520   0.2326  -0.1156  -0.2793   0.0103
       0        0        0        0        0        0   1.1689   0.3151   0.0590   0.0435
       0        0        0        0        0        0        0   1.6296  -0.3494  -0.0024
       0        0        0        0        0        0        0        0   1.4307   0.1803
       0        0        0        0        0        0        0        0        0   1.1404






Re: Parallel GenericRecommenderIRStatsEvaluator?

2013-04-01 Thread Sean Owen
No, just was never written I suppose back in the day. The way it is
structured now it creates a test split for each user, which is also
slow, and may be challenging to memory limitations as that's N data
models in memory. You could take a crack at a patch.

When I rewrote this aspect in a separate project it was certainly
threaded and relied on a single test split. It's much faster indeed.

On Mon, Apr 1, 2013 at 11:26 AM, Gabor Bernat ber...@primeranks.net wrote:
 Hello,

 Is there any good reason why the *GenericRecommenderIRStatsEvaluator* does
 not support parallel (multi-CPU) evaluation? Today it is quite common to have
 CPUs with more than one core, and IR evaluation on any reasonably sized
 data set takes forever to finish. I'm thinking if we could parallelize the
 evaluation, by breaking down the input into subsets, and accumulating at
 the end the measurements of each subset, the evaluation time could be
 heavily improved.

 For example I have a data set with 2+ million ratings, and evaluating IR
 with even 10% of this with a simple recommender takes more than 3 hours
 with just a single core of my CPU being kept busy...

 So?


 Bernát GÁBOR


Re: Reproducibility, and Recommender Algorithms in Mahout

2013-03-30 Thread Sean Owen
You should be able to get reproducible random seed values by calling
RandomUtils.useTestSeed() at the very start of your program. But if
your goal is to get an unbiased view of the quality of results, you
want to run several times and take the average yes.
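
The two options can be illustrated with a generic Python stand-in. Nothing here is Mahout API: noisy_evaluation is a hypothetical placeholder for one evaluation run whose score depends on a random train/test split.

```python
import random
import statistics

def noisy_evaluation(rng):
    """Stand-in for one evaluation run whose score depends on a random split."""
    return 1.0 + rng.uniform(-0.5, 0.5)

def average_score(seeds):
    """Run the evaluation once per seed and average the scores."""
    return statistics.mean(noisy_evaluation(random.Random(seed)) for seed in seeds)

first = noisy_evaluation(random.Random(42))
second = noisy_evaluation(random.Random(42))   # same seed, same score
averaged = average_score(range(10))            # different seeds, averaged
```

A fixed seed gives a repeatable number, which is handy for debugging, while the average over several seeds says more about how good the recommender actually is.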

On Sat, Mar 30, 2013 at 3:57 PM, Reinhard Denis Najogie
najo...@gmail.com wrote:
 Dear all,

 I am doing experiments as a part of my final project. I'm comparing the
 performance of Mahout's implementations of recommender algorithms on some
 public dataset (so far bookcross and grouplens). I want to ask 2 questions:

 1. The score (RMSE) results vary quite a lot each time I run an algorithm
 (sometimes +- 0.5 difference on some algorithms). Is there any way that I
 can make it produce the same result on each run? Maybe by setting a seed
 somewhere in the code? Or should I just do something like 10 runs and take
 the average score?

 2. Where can I see the list of all recommender algorithms already
 implemented by Mahout? From what I read on Mahout in Action book, there are
 6 algorithms: UserBased, ItemBased, Slope One, SVD, KnnItemBased, and
 TreeClustering. Are there new algorithms since then? Oh, and I found both
 KnnItem and TreeClustering are deprecated on the newest version of Mahout
 (0.8-SNAPSHOT) ? Why is this the case?

 --
 Regards,
 Reinhard Denis Najogie


Re: Setting preferences in GenericDataModel.

2013-03-29 Thread Sean Owen
Yes it's OK. You need to care for thread safety though, which will be
hard. The other problem is that changing the underlying data doesn't
necessarily invalidate caches above it. You'll have to consider that
part as well. I suppose this is part of why it was conceived as a
model where the data is only periodically re-read -- you gain speed
from immutability and cacheability. But you lose, of course, real-time
updates.

On Fri, Mar 29, 2013 at 5:46 PM, Ceyhun Can ÜLKER ceyhunc...@gmail.com wrote:
 Hello,

 I checked the implementation of GenericDataModel for adding and removing
 preferences after instantiation. Those methods (setPreference(long, long,
 float) and removePreference(long, long)) throw
 UnsupportedOperationException s. I'd like to know whether there is an
 important reason for not altering content of a GenericDataModel, since in
 our application data can fit into memory and we want our data to be up to
 date. The DataModel interface has those methods, and GenericDataModel is just
 an in-memory implementation of it.

 Would it be ok if I write an implementation of DataModel like
 GenericDataModel, but with setPreference and removePreference methods not
 throwing exceptions?

 Thanks,
 Ceyhun Can ULKER


Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sean Owen
This is really a Hadoop-level thing. I am not sure I have ever
successfully induced M/R to run multiple mappers on less than one
block of data, even with a low max split size. Reducers you can
control.

On Thu, Mar 28, 2013 at 9:04 AM, Sebastian Briesemeister
sebastian.briesemeis...@unister-gmbh.de wrote:
 Thank you.

 Splitting the files leads to multiple MR-tasks!

 Only changing the MR settings of hadoop did not help. In the future it
 would be nice if the drivers would scale themselves and would split the
 data according to the dataset size and the number of available MR-slots.


Re: sql data model w/where clause

2013-03-25 Thread Sean Owen
Modify the existing code to change the SQL -- it's just a matter of
copying a class that only specifies SQL and making new SQL statements.
I think there's a version that even reads from a Properties object.

On Mon, Mar 25, 2013 at 12:11 AM, Matt Mitchell goodie...@gmail.com wrote:
 Hi,

 I have a table of user preferences with the following columns:

 user_id
 item_id
 tag

 I want to build a data model in mahout, but not use the entire table. I'd
 like to add a where clause like where tag = 'A' when building the model
 instance. Is this possible? If not, any way around this besides creating a
 view or new table?

 Thanks,
 Matt


Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
Points from across several e-mails --

The initial item-feature matrix can be just random unit vectors too. I
have slightly better results with that.

You are finding the least-squares solution of A = U M' for U given A
and M. Yes you can derive that analytically as the zero of the
derivative of the error function.

With m users and n items, and k features, where k=n, then I suppose
you don't need any iterations at all since there is a trivial
solution: U = A, M = I(n) (the nxn identity matrix). You would not
find this on the first iteration, however, if you followed the
algorithm, because you would be starting from some random starting
point. But if you initialized M to the identity matrix, yes you'd find
the exact solution immediately.
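
A tiny check of that trivial solution, using a made-up 2-user, 3-item matrix and plain-Python helpers:

```python
def matmul(x, y):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*x)]

def identity(n):
    """The n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Made-up rating matrix with m = 2 users and n = 3 items, and k = n = 3.
A = [[5.0, 0.0, 3.0],
     [0.0, 4.0, 1.0]]
M = identity(3)                           # item-feature matrix, n x k
U = A                                     # user-feature matrix, m x k
reconstruction = matmul(U, transpose(M))  # equals A exactly
```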

Yes it is an iterative algorithm and you have to define a convergence
criterion. I use average absolute difference in (U M') entries from
one iteration to the next. (Well, a sample.) It's possible that you
reach your criterion in 1 iteration, or, not. It depends on the
criterion. Usually when you restart ALS on updated data, you use the
previous M matrix as a starting point. If the change in data is small,
one iteration should usually leave you still converged actually.
But, from random starting point -- not typical.
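
A sketch of that convergence test, assuming prev and curr hold the entries of U M' (or a sample of them) from two consecutive iterations. The threshold of 1e-3 is a placeholder, not a recommended value.

```python
def mean_abs_change(prev, curr):
    """Average absolute entry-wise difference between two equally sized matrices."""
    total = 0.0
    count = 0
    for prev_row, curr_row in zip(prev, curr):
        for p, c in zip(prev_row, curr_row):
            total += abs(p - c)
            count += 1
    return total / count

def has_converged(prev, curr, threshold=1e-3):
    """True when the reconstruction barely moved between iterations."""
    return mean_abs_change(prev, curr) < threshold
```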

ALS is similar to SVD only in broad terms. The SVD is not always used
to make a low-rank factorization, and, its outputs carry more
guarantees -- they are orthonormal bases because it has factored out
scaling factors into the third matrix Sigma. I think of ALS as more
simplistic and therefore possibly faster. With k features I believe
(?) the SVD is necessarily a k-iteration process at least, whereas ALS
might be of use after 1 iteration. The SVD is not a shortcut for
ALS. If you go to the trouble of a full SVD, you can certainly use
that factorization as-is though.

You need regularization.


It should be pointed out that the ALS often spoken of here is not
one that factors the input matrix A. There's a variant, that I have
had good results with, for 'implicit' feedback. There, you are
actually factoring the matrix P = (1 : A != 0, 0 : A == 0), and using
values in A as weights in the loss function. You're reconstructing
"interacted or not" and using input value as a confidence measure.
This works for ratings although the interpretation in this variant
doesn't line up with the nature of ratings. It works quite nicely for
things like clicks, etc.
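
A sketch of how the targets and weights could be built from a made-up click matrix. The weighting c = 1 + alpha * a is one common choice (it follows the Hu, Koren and Volinsky formulation) and is an assumption here; the message only says that values in A act as weights.

```python
def implicit_targets(a, alpha=1.0):
    """Split a count matrix into 0/1 targets P and per-entry confidence weights C."""
    p = [[1.0 if value != 0 else 0.0 for value in row] for row in a]
    c = [[1.0 + alpha * value for value in row] for row in a]
    return p, c

def weighted_squared_error(p, c, predicted):
    """Confidence-weighted squared reconstruction error (regularization omitted)."""
    return sum(
        c_ij * (p_ij - x_ij) ** 2
        for p_row, c_row, x_row in zip(p, c, predicted)
        for p_ij, c_ij, x_ij in zip(p_row, c_row, x_row)
    )

clicks = [[3.0, 0.0], [0.0, 1.0]]          # made-up click counts
P, C = implicit_targets(clicks, alpha=2.0)
```

Unobserved entries keep a weight of 1, so they still pull the reconstruction toward 0, only much more weakly than observed entries pull it toward 1.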

(Much more can be said on this point.)



On Mon, Mar 25, 2013 at 2:19 AM, Dominik Huebner cont...@dhuebner.com wrote:
 It's quite hard for me to get the mathematical concepts of the ALS
 recommenders. It would be great if someone could help me to figure out
 the details. This is my current status:

 1. The item-feature (M) matrix is initialized using the average ratings
 and random values (explicit case)

 2. The user-feature (U) matrix is solved using the partial derivative of
 the error function with respect to u_i (the columns of row-vectors of U)

 Supposed we use as many features as items are known and the error
 function does not use any regularization. Would U be solved within the
 first iteration? If not, I do not understand why more than one iteration
 is needed.
 Furthermore, I believe to have understood that using fewer features than
 items and also applying regularization, does not allow to solve U in a
 way that the stopping criterion can be met after only one iteration.
 Thus, iteration is required to gradually converge to the stopping
 criterion.

 I hope I have pointed out my problems clearly enough.



Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
OK, the 'k iterations' happen inline in one job? I thought the Lanczos
algorithm found the k eigenvalues/vectors one after the other. Yeah I
suppose that doesn't literally mean k map/reduce jobs. Yes the broader
idea was whether or not you might get something useful out of ALS
earlier.

On Mon, Mar 25, 2013 at 11:06 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 SVD need not be iterative at all. The SSVD code uses roughly 5 map-reduces
 to give you a high quality SVD approximation.  There is the option to add
 0, 1 or more extra iterations, but it is rare to need more than 1.

 ALS could well be of use after less work.  This is especially true for
 incremental solutions.


Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
On Mon, Mar 25, 2013 at 11:25 AM, Sebastian Schelter s...@apache.org wrote:
 Well in LSI it is ok to do that, as a missing entry means that the
 document contains zero occurrences of a given term which is totally fine.

 In Collaborative Filtering with explicit feedback, a missing rating is
 not automatically a rating of zero, it is simply unknown what the user
 would give as rating.

 For implicit data (number of interactions), a missing entry is indeed
 zero, but in most cases you might not have the same confidence in that
 observation as if you observed an interaction. Koren's ALS paper
 discusses this and introduces constructs to handle this, by putting more
 weight on minimizing the loss over observed interactions.

 In matrix factorization for CF, the factorization usually has to
 minimize the regularized loss over the known entries only. If all
 unknown entries were simply considered zero, I'd assume that the
 factorization that resulted would not generalize very well to unseen
 data. Some researchers title matrix factorization for CF as matrix
 completion which IMHO better describes the problem.

Yes it's just that you shouldn't if inputs are rating-like, not that
you literally couldn't. If your input is ratings on a scale of 1-5
then reconstructing a 0 everywhere else means you assume everything
not viewed is hated, which doesn't work at all. You can subtract the
mean from observed ratings, and then you assume everything unobserved
has an average rating.
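
The mean-subtraction step can be sketched as follows, with made-up ratings and None marking an unobserved entry:

```python
def center_observed(ratings):
    """Subtract the mean of the observed ratings; None marks an unobserved entry."""
    observed = [r for row in ratings for r in row if r is not None]
    mean = sum(observed) / len(observed)
    centered = [[0.0 if r is None else r - mean for r in row] for row in ratings]
    return centered, mean

ratings = [[5.0, None, 3.0],
           [None, 4.0, None]]          # made-up ratings on a 1-5 scale
centered, mean = center_observed(ratings)
```

After centering, a 0 in the matrix stands for "average" rather than "hated"; the mean is added back to any reconstructed value.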

But the assumption works nicely for click-like data. Better still when
you can weakly prefer to reconstruct the 0 for missing observations
and much more strongly prefer to reconstruct the 1 for observed
data.


Re: postgres recommendation adapter

2013-03-25 Thread Sean Owen
Are you using the 'integration' artifact? this is not in 'core'.

On Mon, Mar 25, 2013 at 12:43 PM, Matt Mitchell goodie...@gmail.com wrote:
 Yeah sorry. I'm attempting to load this class:

 org.apache.mahout.cf.taste.impl.model.jdbc.PostgreSQLBooleanPrefJDBCDataModel

 but getting a ClassNotFoundException

 I'm using version 0.7 of mahout-core and mahout-math, and version 0.5 of
 mahout-utils.

 - Matt


 On Mon, Mar 25, 2013 at 6:21 AM, Sean Owen sro...@gmail.com wrote:

 I think you'd have to define "not working" first

 On Mon, Mar 25, 2013 at 1:32 AM, Matt Mitchell goodie...@gmail.com
 wrote:
  Hi,
 
  I've seen references to a postgres, user pref class via google searches,
  but can't seem to get this to work using mahout-core version 0.7. Could
  someone describe how to get postgres working with Mahout CF?



Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
(The unobserved entries are still in the loss function, just with low
weight. They are also in the system of equations you are solving for.)

On Mon, Mar 25, 2013 at 1:38 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 Classic als wr is bypassing underlearning problem by cutting out unrated
 entries from linear equations system. It also still has a very defined
 regularization technique which allows to find optimal fit in theory (but
 still not in mahout, not without at least some additional sweat, i heard).


Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
On Mon, Mar 25, 2013 at 1:41 PM, Koobas koo...@gmail.com wrote:
 But the assumption works nicely for click-like data. Better still when
 you can weakly prefer to reconstruct the 0 for missing observations
 and much more strongly prefer to reconstruct the 1 for observed
 data.


 This does seem intuitive.
 How does the benefit manifest itself?
 In lowering the RMSE of reconstructing the interaction matrix?
 Are there any indicators that it results in better recommendations?
 Koobas

In this approach you are no longer reconstructing the interaction
matrix, so there is no RMSE vs the interaction matrix. You're
reconstructing a matrix of 0 and 1. Because entries are weighted
differently, you're not even minimizing RMSE over that matrix -- the
point is to take some errors more seriously than others. You're
minimizing a *weighted* RMSE, yes.

Yes of course the goal is better recommendations.  This broader idea
is harder to measure. You can use mean average precision to measure
the tendency to predict back interactions that were held out.
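
A minimal sketch of mean average precision over held-out interactions. The normalization by the number of held-out items is one common convention and is an assumption here.

```python
def average_precision(ranked_items, held_out):
    """Average precision of one ranked recommendation list against held-out items."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in held_out:
            hits += 1
            precision_sum += hits / rank   # precision at this cut-off
    return precision_sum / len(held_out) if held_out else 0.0

def mean_average_precision(per_user):
    """Mean of average precision over (ranked_items, held_out) pairs, one per user."""
    scores = [average_precision(ranked, held) for ranked, held in per_user]
    return sum(scores) / len(scores)
```

For example, a ranking a, b, c, d with held-out items a and c scores (1/1 + 2/3) / 2, roughly 0.83.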

Is it better? depends on better than *what*. Applying algorithms that
treat input like ratings doesn't work as well on click-like data. The
main problem is that these will tend to pay too much attention to
large values. For example if an item was clicked 1000 times, and you
are trying to actually reconstruct that 1000, then a 10% error
costs (0.1*1000)^2 = 10000. But a 10% error in reconstructing an
item that was clicked once costs (0.1*1)^2 = 0.01. The former is
considered a million times more important error-wise than the latter,
even though the intuition is that it's just 1000 times more important.
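
The arithmetic, spelled out as a short sketch (the function name is illustrative):

```python
def squared_error_cost(true_value, relative_error):
    """Squared error when the reconstruction is off by a given fraction."""
    return (relative_error * true_value) ** 2

popular = squared_error_cost(1000, 0.10)   # (0.1 * 1000)^2, about 10000
rare = squared_error_cost(1, 0.10)         # (0.1 * 1)^2, about 0.01
ratio = popular / rare                     # about 1,000,000
```

Squaring turns a 1000x difference in click counts into a 1000^2 difference in cost.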

Better than algorithms that ignore the weight entirely -- yes probably
if only because you are using more information. But as in all things
it depends.


Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
If your input is clicks, carts, etc. yes you ought to get generally
good results from something meant to consume implicit feedback, like
ALS (for implicit feedback, yes there are at least two main variants).
I think you are talking about the implicit version since you mention
0/1.

lambda is the regularization parameter. It is defined a bit
differently in the various papers though. Test a few values if you
can.
But you said no weights in the regularization... what do you mean?
you don't want to disable regularization entirely.

On Mon, Mar 25, 2013 at 2:14 PM, Koobas koo...@gmail.com wrote:
 On Mon, Mar 25, 2013 at 9:52 AM, Sean Owen sro...@gmail.com wrote:

 On Mon, Mar 25, 2013 at 1:41 PM, Koobas koo...@gmail.com wrote:
  But the assumption works nicely for click-like data. Better still when
  you can weakly prefer to reconstruct the 0 for missing observations
  and much more strongly prefer to reconstruct the 1 for observed
  data.
 
 
  This does seem intuitive.
  How does the benefit manifest itself?
  In lowering the RMSE of reconstructing the interaction matrix?
  Are there any indicators that it results in better recommendations?
  Koobas

 In this approach you are no longer reconstructing the interaction
 matrix, so there is no RMSE vs the interaction matrix. You're
 reconstructing a matrix of 0 and 1. Because entries are weighted
 differently, you're not even minimizing RMSE over that matrix -- the
 point is to take some errors more seriously than others. You're
 minimizing a *weighted* RMSE, yes.

 Yes of course the goal is better recommendations.  This broader idea
 is harder to measure. You can use mean average precision to measure
 the tendency to predict back interactions that were held out.

 Is it better? depends on better than *what*. Applying algorithms that
 treat input like ratings doesn't work as well on click-like data. The
 main problem is that these will tend to pay too much attention to
 large values. For example if an item was clicked 1000 times, and you
 are trying to actually reconstruct that 1000, then a 10% error
 costs (0.1*1000)^2 = 10000. But a 10% error in reconstructing an
 item that was clicked once costs (0.1*1)^2 = 0.01. The former is
 considered a million times more important error-wise than the latter,
 even though the intuition is that it's just 1000 times more important.

 Better than algorithms that ignore the weight entirely -- yes probably
 if only because you are using more information. But as in all things
 it depends.


 Let's say the following.
 Classic market basket.
 Implicit feedback.
 Ones and zeros in the input matrix, no weights in the regularization,
 lambda=1.
 What I will get is:
 A) a reasonable recommender,
 B) a joke of a recommender.


Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Sean Owen
You would have to make up the similarity metric separately since it depends
entirely on how you want to define it.
The part of the book you are talking about concerns rescoring, which is not
the same thing.
Combine the similarity metrics, I mean, not make two recommenders. Make a
metric that is the product of two other metrics. Normalize both of those
metrics to the range [0,1].

Sean
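
A minimal sketch of the "product of two metrics" idea above, in plain Java. Everything here is an assumption made for illustration: `SimpleUserSimilarity` is a hypothetical stand-in interface rather than Mahout's own, Jaccard overlap is just one possible attribute-based metric, and `rescale` assumes a correlation-style score in [-1,1].

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a user-user similarity; this is not Mahout's interface. */
interface SimpleUserSimilarity {
  double similarity(long userA, long userB);
}

/** Combines two similarities, each assumed to lie in [0,1], by multiplying them. */
final class ProductSimilarity implements SimpleUserSimilarity {
  private final SimpleUserSimilarity first;
  private final SimpleUserSimilarity second;

  ProductSimilarity(SimpleUserSimilarity first, SimpleUserSimilarity second) {
    this.first = first;
    this.second = second;
  }

  /** Maps a correlation-style score in [-1,1] onto [0,1]. */
  static double rescale(double s) {
    return (s + 1.0) / 2.0;
  }

  /** Jaccard overlap of two sets (e.g. hobbies); already in [0,1]. */
  static <T> double jaccard(Set<T> a, Set<T> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 0.0;
    }
    Set<T> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    int unionSize = a.size() + b.size() - intersection.size();
    return (double) intersection.size() / unionSize;
  }

  @Override
  public double similarity(long userA, long userB) {
    return first.similarity(userA, userB) * second.similarity(userA, userB);
  }
}
```

In Mahout itself the same logic would presumably live in a class implementing the library's `UserSimilarity` interface, which is then handed to the recommender in place of a single metric.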


On Mon, Mar 18, 2013 at 6:51 AM, Agata Filiana a.filian...@gmail.com wrote:

 Hi,

 Thanks, Sean, for the response. I like the idea of multiplying the
 similarity metric based on user properties with the one based on CF data.
 I understand that I have to create a separate similarity metric - can I do
 this with the help of Mahout, or does this have to be done separately, as in
 I have to implement my own similarity measure? It would be great to have
 some clue on how to get this started.
 Is this somehow similar to the subject of *Injecting domain-specific
 information* in the book Mahout in Action (with the example of the
 gender-based item similarity metric)?

 And also, how can I multiply the two results - will this affect the result
 of the evaluation of the recommender system? Or should it be normalized in
 some way?

 Thank you and sorry for the basic questions.

 Regards,

 Agata Filiana


 On 16 March 2013 13:41, Sean Owen sro...@gmail.com wrote:

  There are many ways to think about combining these two types of data.

  If you can make some similarity metric based on age, gender and interests,
  then you can use it as the similarity metric in
  GenericBooleanPrefUserBasedRecommender. You would be using both data sets
  in some way. Of course this means learning a whole different similarity
  metric somehow. A variant on this is to make a similarity metric based on
  user properties, and also use one based on CF data, and multiply them
  together to make a new combined similarity metric for this approach. This
  might work OK.

  It can also work to treat age and gender and other features as categorical
  features, and then model them as 'items' that the user interacts with. They
  would not have much of an effect here given how many items there are. In
  other models like ALS-WR you can weight these pseudo-items much more highly
  and get the desired effect to a degree.
 
 
 
  On Fri, Mar 15, 2013 at 4:37 PM, Agata Filiana a.filian...@gmail.com
  wrote:
 
   Hi,
  
   I'm fairly new to Mahout. Right now I am experimenting with Mahout by
   trying to build a simple recommendation system. What I have is just a
   boolean data set, with only the userID and itemID. I understand that for
   this case I have to use GenericBooleanPrefUserBasedRecommender - which I
   have, and it works fine.

   Apart from the userID and itemID data, I also have the user's attributes
   (their age, gender, list of interests). I would like to combine this into
   the recommendation system to increase the performance of the recommender.
   Is this possible to do or am I trying something that does not make sense?

   It would be great if you could give me any input or ideas for this. (Or
   any good reading on this matter.)
  
   Thank you!
  
   Regards,
  
   *Agata Filiana*
   Erasmus Mundus Student
  
 



 --
 *Agata Filiana
 *



Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Sean Owen
There is a difference between the recommender and the similarity metric it
uses. My suggestion was to either use your item data with the recommender
and hobby data with the similarity metric, or, use both in the similarity
metric by making a combined metric.


On Mon, Mar 18, 2013 at 9:44 AM, Agata Filiana a.filian...@gmail.com wrote:

 I understand how it works logically. However, I am having a problem
 understanding the implementation and how to get the final
 outcome.
 Say the user's attribute is Hobbies: hobby1,hobby2,hobby3
 So I would make the similarity metric of the users and hobbies.

 Then for the CF, using Mahout's GenericBooleanPrefUserBasedRecommender with
 the boolean data set (userID and itemID).

 Then somehow combine the two?

 But at the end, my goal is to recommend the items in the second data set
 (the itemID, not recommend the hobbies) - does this make sense? Or am I
 confusing myself?

 Agata


 On 18 March 2013 14:23, Sean Owen sro...@gmail.com wrote:

  You would have to make up the similarity metric separately since it depends
  entirely on how you want to define it.
  The part of the book you are talking about concerns rescoring, which is not
  the same thing.
  Combine the similarity metrics, I mean, not make two recommenders. Make a
  metric that is the product of two other metrics. Normalize both of those
  metrics to the range [0,1].
 
  Sean
 
 
  On Mon, Mar 18, 2013 at 6:51 AM, Agata Filiana a.filian...@gmail.com
  wrote:
 
   Hi,
  
   Thanks, Sean, for the response. I like the idea of multiplying the
   similarity metric based on user properties with the one based on CF data.
   I understand that I have to create a separate similarity metric - can I
   do this with the help of Mahout, or does this have to be done separately,
   as in I have to implement my own similarity measure? It would be great to
   have some clue on how to get this started.
   Is this somehow similar to the subject of *Injecting domain-specific
   information* in the book Mahout in Action (with the example of the
   gender-based item similarity metric)?

   And also, how can I multiply the two results - will this affect the
   result of the evaluation of the recommender system? Or should it be
   normalized in some way?
  
   Thank you and sorry for the basic questions.
  
   Regards,
  
   Agata Filiana
  
  
 



 --
 *Agata Filiana
 *



Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sean Owen
One word of caution: there are at least two papers on ALS, and they
define lambda differently. I think you are talking about "Collaborative
Filtering for Implicit Feedback Datasets".

I've been working with some folks who point out that alpha=40 seems to be
too high for most data sets. After running some tests on common data sets,
alpha=1 looks much better. YMMV.

In the end you have to evaluate these two parameters, and the # of
features, across a range to determine what's best.

Is this data set not a bunch of audio features? I am not sure it works for
ALS, not naturally at least.


On Mon, Mar 18, 2013 at 12:39 PM, Han JU ju.han.fe...@gmail.com wrote:

 Hi,

 I'm wondering whether anyone has tried the ParallelALS implicit feedback job
 on the Million Song dataset? Any pointers on alpha and lambda?

 In the paper alpha is 40 and lambda is 150, but I don't know what their
 r values in the matrix are. They said it is based on the time units that
 users have watched the show, so maybe they're big.

 Many thanks!
 --
 *JU Han*

 UTC   -  Université de Technologie de Compiègne
 * **GI06 - Fouille de Données et Décisionnel*

 +33 061960



Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sean Owen
Yes that's fine input then.

Large alpha should go with small R values, not large R. Really alpha
controls how much observed input (R != 0) is weighted towards 1 versus how
much unobserved input (R=0) is weighted to 0. I scale lambda by alpha to
complete this effect.
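
A minimal sketch of how alpha enters the weighting, assuming the formulation commonly attributed to the "Collaborative Filtering for Implicit Feedback Datasets" paper: a binary preference p = 1 when r > 0 (else 0) and a confidence c = 1 + alpha * r. The class and method names are hypothetical, and this is not Mahout's ParallelALS code.

```java
/** Illustrative implicit-feedback weighting; names are made up for this sketch. */
final class ImplicitFeedbackWeights {

  /** Binary target: 1 for an observed interaction, 0 for an unobserved one. */
  static double preference(double r) {
    return r > 0.0 ? 1.0 : 0.0;
  }

  /** Confidence in the target; unobserved entries (r = 0) keep weight 1. */
  static double confidence(double r, double alpha) {
    return 1.0 + alpha * r;
  }

  /** One cell's contribution to the weighted squared-error loss. */
  static double weightedSquaredError(double r, double predicted, double alpha) {
    double diff = preference(r) - predicted;
    return confidence(r, alpha) * diff * diff;
  }

  public static void main(String[] args) {
    double[] rValues = {0.0, 1.0, 50.0};
    double[] alphas = {1.0, 40.0};
    for (double alpha : alphas) {
      for (double r : rValues) {
        System.out.println("alpha=" + alpha + " r=" + r
            + " confidence=" + confidence(r, alpha));
      }
    }
  }
}
```

With large r values (say, play counts in the dozens), even alpha = 1 already puts far more weight on observed cells than on unobserved ones, which fits the remark that large alpha belongs with small R values.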


On Mon, Mar 18, 2013 at 1:06 PM, Han JU ju.han.fe...@gmail.com wrote:

 Thanks for quick responses.

 Yes it's that dataset. What I'm using is triplets of user_id song_id
 play_times, for ~1M users. No audio things, just plain text triples.

 It seems to me that the paper about implicit feedback matches this
 dataset well: no explicit ratings, but the number of times a song was
 listened to.

 Thank you Sean for the alpha value. I think they use big numbers because
 the values in their R matrix are big.


 2013/3/18 Sebastian Schelter ssc.o...@googlemail.com

  JU,
 
  are you referring to this dataset?
 
  http://labrosa.ee.columbia.edu/millionsong/tasteprofile
 
 
 


 --
 *JU Han*

 Software Engineer Intern @ KXEN Inc.
 UTC   -  Université de Technologie de Compiègne
 * **GI06 - Fouille de Données et Décisionnel*

 +33 061960



Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Sean Owen
I'm not sure what you mean. The only thing I am suggesting to combine are
two similarity metrics, not data or recommendations.
You combine metrics by multiplying their values.


On Mon, Mar 18, 2013 at 12:54 PM, Agata Filiana a.filian...@gmail.com wrote:

 In this case, would it be correct if I somehow loop through the item data
 and the hobby data and then combine the scores for a pair of users?

 I am having trouble with how to combine both similarities into one metric.
 Could you possibly give me a clue?

 Thank you

 On 18 March 2013 14:54, Sean Owen sro...@gmail.com wrote:

  There is a difference between the recommender and the similarity metric it
  uses. My suggestion was to either use your item data with the recommender
  and hobby data with the similarity metric, or, use both in the similarity
  metric by making a combined metric.
 
 
  On Mon, Mar 18, 2013 at 9:44 AM, Agata Filiana a.filian...@gmail.com
  wrote:
 
   I understand how it works logically. However, I am having a problem
   understanding the implementation and how to get the final outcome.
   Say the user's attribute is Hobbies: hobby1,hobby2,hobby3
   So I would make the similarity metric of the users and hobbies.

   Then for the CF, using Mahout's GenericBooleanPrefUserBasedRecommender
   with the boolean data set (userID and itemID).

   Then somehow combine the two?

   But at the end, my goal is to recommend the items in the second data set
   (the itemID, not recommend the hobbies) - does this make sense? Or am I
   confusing myself?
  
   Agata
  
  

Re: reproducibility

2013-03-17 Thread Sean Owen
What's your question? ALS has a random starting point which changes the
results a bit. Not sure about KNN though.


On Sun, Mar 17, 2013 at 3:03 AM, Koobas koo...@gmail.com wrote:

 Can anybody shed any light on the issue of reproducibility in Mahout,
 with and without Hadoop, specifically in the context of kNN and ALS
 recommenders?



Re: reproducibility

2013-03-17 Thread Sean Owen
If an algorithm has a stochastic/random element, no it won't necessarily
produce the same result, by design. If you can fix the seed of the random
number generator, you should get the same result. Except that if the
process is multi-threaded or distributed, even that doesn't guarantee it --
the RNG could be accessed in a different order. Even if you can control
your code it can be hard to control the RNGs in third-party libraries. Even
in a deterministic single-threaded program Java's floating point results
are not guaranteed to be the same across platforms (unless you use
strictfp).

ALS definitely has a random starting point, so reproducibility is not
guaranteed even from the top. If you fix the random seed in the context of
this project's unit tests, you *should* get the same result since I think
it manages to use no third-party RNGs and runs a test from a fixed starting
point in 1 thread.
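
As a small illustration of the fixed-seed point, here is a sketch using only `java.util.Random`. The `randomMatrix` helper is hypothetical and merely stands in for the kind of random starting factors an ALS-style algorithm might use; it is not Mahout's initialization code.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative only: a fixed seed makes a single-threaded random start repeatable. */
final class SeededInit {

  /** Fills a rows x cols matrix from a generator created with the given seed. */
  static double[][] randomMatrix(int rows, int cols, long seed) {
    Random random = new Random(seed);
    double[][] m = new double[rows][cols];
    for (int i = 0; i < rows; i++) {
      for (int j = 0; j < cols; j++) {
        m[i][j] = random.nextDouble();
      }
    }
    return m;
  }

  public static void main(String[] args) {
    double[][] a = randomMatrix(3, 2, 42L);
    double[][] b = randomMatrix(3, 2, 42L);
    System.out.println("same seed, same start: " + Arrays.deepEquals(a, b));
  }
}
```

This only holds while a single thread draws from the generator in a fixed order; once several threads or machines share the work, the order of draws can change, as noted above.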

KNN does not have a stochastic element. I think you would get the same
results on one platform, unless I'm missing something.

I don't think exact reproducibility is an issue. Certainly at scale where
the entire computation is distributed over such a complex cluster
environment. Most ML is about guessing at what's not known anyway. As long
as very small differences make only very small differences in the outcome,
differing FP behavior will make no or vanishingly small difference.

The only place where I think FP reproducibility matters -- of the sort that
numerical libraries care about -- is in under/overflow issues. But that is
solved by moving into a log space or something. You would never want to
depend on the nth significant digit of a float mattering.
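
The "move into log space" remedy for underflow can be sketched as follows. The names are made up for illustration, and the log-sum-exp helper is the standard trick for adding values that are stored as logarithms.

```java
/** Illustrative only: avoiding floating-point underflow by working with logarithms. */
final class LogSpace {

  /** Multiplies probabilities directly; underflows to 0.0 for long products. */
  static double naiveProduct(double[] probabilities) {
    double product = 1.0;
    for (double p : probabilities) {
      product *= p;
    }
    return product;
  }

  /** Log of the same product, computed as a sum of logs. */
  static double logProduct(double[] probabilities) {
    double sum = 0.0;
    for (double p : probabilities) {
      sum += Math.log(p);
    }
    return sum;
  }

  /** log(exp(x1) + ... + exp(xn)), shifted by the maximum to stay in range. */
  static double logSumExp(double[] logValues) {
    double max = Double.NEGATIVE_INFINITY;
    for (double v : logValues) {
      max = Math.max(max, v);
    }
    if (max == Double.NEGATIVE_INFINITY) {
      return max; // every term is log(0)
    }
    double sum = 0.0;
    for (double v : logValues) {
      sum += Math.exp(v - max);
    }
    return max + Math.log(sum);
  }

  public static void main(String[] args) {
    double[] p = new double[1000];
    java.util.Arrays.fill(p, 1e-5);
    System.out.println("naive product: " + naiveProduct(p));
    System.out.println("log product:   " + logProduct(p));
  }
}
```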




On Sun, Mar 17, 2013 at 1:43 PM, Koobas koo...@gmail.com wrote:

 I am asking the basic reproducibility question.
 If I run twice on the same dataset, with the same hardware setup, will I
 always get the same results?
 Or is there any chance that on two different runs, the same user will get
 slightly different suggestions?
 I am mostly revolving in the space of numerical libraries, where
 reproducibility is, sort of, a big deal.
 Maybe it's not much of a concern in machine learning.
 I am just curious.


 On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen sro...@gmail.com wrote:

  What's your question? ALS has a random starting point which changes the
  results a bit. Not sure about KNN though.
 
 

  On Sun, Mar 17, 2013 at 3:03 AM, Koobas koo...@gmail.com wrote:
 
   Can anybody shed any light on the issue of reproducibility in Mahout,
   with and without Hadoop, specifically in the context of kNN and ALS
   recommenders?
  
 



Re: Boosting User-Based with the user's attributes

2013-03-16 Thread Sean Owen
There are many ways to think about combining these two types of data.

If you can make some similarity metric based on age, gender and interests,
then you can use it as the similarity metric in
GenericBooleanPrefUserBasedRecommender. You would be using both data sets
in some way. Of course this means learning a whole different similarity
metric somehow. A variant on this is to make a similarity metric based on
user properties, and also use one based on CF data, and multiply them
together to make a new combined similarity metric for this approach. This
might work OK.

It can also work to treat age and gender and other features as categorical
features, and then model them as 'items' that the user interacts with. They
would not have much of an effect here given how many items there are. In
other models like ALS-WR you can weight these pseudo-items much more highly
and get the desired effect to a degree.
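
One way to picture the pseudo-item idea is to turn each categorical attribute value into an extra item ID and add "user, pseudo-item" rows to the boolean data. The sketch below is only an assumption about how that encoding could look: the ID offset, the 10-year age buckets, and all names are hypothetical, and the offset must be chosen above every real item ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: encodes user attributes as extra 'items' in boolean data. */
final class PseudoItems {
  private final long firstPseudoId;
  private final Map<String, Long> ids = new LinkedHashMap<>();

  /** firstPseudoId is assumed to be larger than any real item ID. */
  PseudoItems(long firstPseudoId) {
    this.firstPseudoId = firstPseudoId;
  }

  /** Returns a stable pseudo-item ID for a feature value such as gender=F. */
  long idFor(String feature, String value) {
    String key = feature + "=" + value;
    Long id = ids.get(key);
    if (id == null) {
      id = firstPseudoId + ids.size();
      ids.put(key, id);
    }
    return id;
  }

  /** Buckets a numeric age into a categorical value; the width of 10 is arbitrary. */
  static String ageBucket(int age) {
    int low = (age / 10) * 10;
    return low + "-" + (low + 9);
  }

  /** Builds "userID,itemID" lines that could be appended to the preference file. */
  List<String> rows(long userId, int age, String gender, List<String> interests) {
    List<String> rows = new ArrayList<>();
    rows.add(userId + "," + idFor("age", ageBucket(age)));
    rows.add(userId + "," + idFor("gender", gender));
    for (String interest : interests) {
      rows.add(userId + "," + idFor("interest", interest));
    }
    return rows;
  }
}
```

The pseudo-item IDs would also need to be filtered out of any final recommendation list, since the goal is to recommend real items only.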



On Fri, Mar 15, 2013 at 4:37 PM, Agata Filiana a.filian...@gmail.com wrote:

 Hi,

 I'm fairly new to Mahout. Right now I am experimenting with Mahout by trying
 to build a simple recommendation system. What I have is just a boolean data
 set, with only the userID and itemID. I understand that for this case I
 have to use GenericBooleanPrefUserBasedRecommender - which I have, and it
 works fine.

 Apart from the userID and itemID data, I also have the user's attributes
 (their age, gender, list of interests). I would like to combine this into
 the recommendation system to increase the performance of the recommender.
 Is this possible to do or am I trying something that does not make sense?

 It would be great if you can give me any inputs or ideas for this. (Or any
 good read based on this matter)

 Thank you!

 Regards,

 *Agata Filiana*
 Erasmus Mundus Student



Re: QR decomposition in ALS-WR code

2013-03-15 Thread Sean Owen
I think you are referring to the same step? QR decomposition is how you
solve for u_i, which I imagine is the same step you have in mind.


Re: Mahout and Hadoop 2

2013-03-13 Thread Sean Owen
I think someone submitted a different build profile that changes the
dependencies for you. I believe the issue is using hadoop-common and not
hadoop-core as well as changing versions. I think the rest is compile
compatible and probably runtime compatible. But I've not tried.


On Wed, Mar 13, 2013 at 7:58 PM, Jian Fang jian.fang.subscr...@gmail.com wrote:

 Hi,

 Is there any way to make Mahout 0.7 or 0.8 work with Hadoop 2.0.2-alpha?

 Seems Mahout builds against Hadoop 1.X by default in the pom.xml and it
 also requires hadoop-core.jar, which only exists in Hadoop 1.x if I
 remember correctly.

 Thanks,

 Jian



Re: Top-N recommendations from SVD

2013-03-06 Thread Sean Owen
Yeah that's right, he said 20 features, oops. And yes he says he's talking
about the recs only too, so that's not right either. That seems way too
long relative to factorization. And the factorization seems quite fast; how
many machines, and how many iterations?

I thought the shape of the computation was to cache B' (yes whose columns
are B rows) and multiply against the rows of A. There again probably wrong
given the latest timing info.


On Wed, Mar 6, 2013 at 10:25 AM, Josh Devins h...@joshdevins.com wrote:

 So the 80 hour estimate is _only_ for the U*M', top-n calculation and not
 the factorization. Factorization is on the order of 2-hours. For the
 interested, here's the pertinent code from the ALS `RecommenderJob`:


 http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148

 I'm sure this can be optimised, but by an order of magnitude? Something to
 try out, I'll report back if I find anything concrete.



 On 6 March 2013 11:13, Ted Dunning ted.dunn...@gmail.com wrote:

  Well, it would definitely not be the first time I counted incorrectly.
   Anytime I do arithmetic the result should be considered suspect.  I do
  think my numbers are correct, but then again, I always do.
 
  But the OP did say 20 dimensions which gives me back 5x.
 
  Inclusion of learning time is a good suspect.  On the other side of the
  ledger, if the multiply is doing any column wise access it is a likely
  performance bug.  The computation is AB'. Perhaps you refer to rows of B
  which are the columns of B'.
 
  Sent from my sleepy thumbs set to typing on my iPhone.
 
  On Mar 6, 2013, at 4:16 AM, Sean Owen sro...@gmail.com wrote:
 
   If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728
 Tflops
  --
   I think you're missing an M, and the features by an order of
 magnitude.
   That's still 1 day on an 8-core machine by this rule of thumb.
  
   The 80 hours is the model building time too (right?), not the time to
   multiply U*M'. This is dominated by iterations when building from
  scratch,
   and I expect took 75% of that 80 hours. So if the multiply was 20 hours
  --
   on 10 machines -- on Hadoop, then that's still slow but not out of the
   question for Hadoop, given it's usually a 3-6x slowdown over a parallel
   in-core implementation.
  
   I'm pretty sure what exists in Mahout here can be optimized further at
  the
   Hadoop level; I don't know that it's doing the multiply badly though.
 In
   fact I'm pretty sure it's caching cols in memory, which is a bit of
   'cheating' to speed up by taking a lot of memory.
  
  
   On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
  
   Hmm... each user's recommendations seem to be about 2.8 x 20M Flops = 60M
   Flops.  You should get about a Gflop per core in Java so this should take
   about 60 ms.  You can make this faster with more cores or by using ATLAS.
  
   Are you expecting 3 million unique people every 80 hours?  If no, then
  it
   is probably more efficient to compute the recommendations on the fly.
  
   How many recommendations per second are you expecting?  If you have 1
   million uniques per day (just for grins) and we assume 20,000 s/day to
   allow for peak loading, you have to do 50 queries per second peak.
  This
   seems to require 3 cores.  Use 16 to be safe.
  
   Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50 hours.
   I
   think that your map-reduce is under performing by about a factor of
 10.
   This is quite plausible with bad arrangement of the inner loops.  I
  think
   that you would have highest performance computing the recommendations
  for a
   few thousand items by a few thousand users at a time.  It might be
 just
   about as fast to do all items against a few users at a time.  The
 reason
   for this is that dense matrix multiply requires c n x k + m x k memory
  ops,
   but n x k x m arithmetic ops.  If you can re-use data many times, you
  can
   balance memory channel bandwidth against CPU speed.  Typically you
 need
  20
   or more re-uses to really make this fly.
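
The back-of-envelope estimates traded in this thread can be reproduced with a few lines. This sketch counts one multiply-add per (item, feature) pair as one flop and assumes about 1 Gflop/s per core, as the thread appears to; the class and method names are made up for illustration.

```java
/** Illustrative only: rough cost of scoring every item for every user. */
final class TopNCostEstimate {

  /** One dot product of length 'features' per item. */
  static double flopsPerUser(long items, int features) {
    return (double) items * features;
  }

  static double secondsPerUser(long items, int features, double flopsPerSecond) {
    return flopsPerUser(items, features) / flopsPerSecond;
  }

  /** Total single-core time, in hours, to score all items for all users. */
  static double totalHours(long users, long items, int features, double flopsPerSecond) {
    return users * secondsPerUser(items, features, flopsPerSecond) / 3600.0;
  }

  public static void main(String[] args) {
    long users = 2_600_000L;
    long items = 2_800_000L;
    double gflop = 1.0e9;
    System.out.println("ms per user, 20 features:   "
        + 1000.0 * secondsPerUser(items, 20, gflop));
    System.out.println("core-hours, 20 features:    " + totalHours(users, items, 20, gflop));
    System.out.println("core-hours, 100 features:   " + totalHours(users, items, 100, gflop));
  }
}
```

With 20 features this gives about 56 ms per user and roughly 40 core-hours in total, in line with the rounded "60 ms" and "50 hours" figures; with 100 features the total is about 200 core-hours, or about a day on 8 cores.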
  
  
 



Re: Top-N recommendations from SVD

2013-03-06 Thread Sean Owen
OK and he mentioned that 10 mappers were running, when it ought to be able
to use several per machine. The # of mappers is a function of the input
size really, so probably needs to turn down the max file split size to
induce more mappers?


On Wed, Mar 6, 2013 at 11:16 AM, Sebastian Schelter ssc.o...@googlemail.com
 wrote:

 Btw, all important jobs in ALS are map-only, so it's the number of map
 slots that counts.




Re: Top-N recommendations from SVD

2013-03-06 Thread Sean Owen
That too, even better. Isn't that already done? Could be in one place but
not another. IIRC there were also cases where it was a lot easier to pass
around an object internally and mutability solved the performance issue,
without much risk since it was only internal. You can (nay, must) always
copy the objects before being returned.



On Wed, Mar 6, 2013 at 4:01 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I would recommend against a mutable object on maintenance grounds.

 Better is to keep the threshold that a new score must meet and only
 construct the object on need.  That cuts the allocation down to negligible
 levels.
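
 The threshold approach described above can be sketched as a bounded min-heap whose smallest kept score acts as the bar a new score must clear. This is a minimal illustration with made-up names, not the code in Mahout's `RecommenderJob`; a `ScoredItem` is only allocated when a score actually makes the cut.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative only: top-N selection that allocates objects only on need. */
final class TopNSelector {

  /** Immutable (itemID, score) pair. */
  static final class ScoredItem {
    final long itemId;
    final double score;

    ScoredItem(long itemId, double score) {
      this.itemId = itemId;
      this.score = score;
    }
  }

  /** Returns the n highest-scoring items, best first. */
  static List<ScoredItem> topN(long[] itemIds, double[] scores, int n) {
    if (n <= 0) {
      return new ArrayList<>();
    }
    // Min-heap on score: the head is the weakest item currently kept.
    PriorityQueue<ScoredItem> heap =
        new PriorityQueue<>(n, Comparator.comparingDouble((ScoredItem s) -> s.score));
    double threshold = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < scores.length; i++) {
      double score = scores[i];
      if (heap.size() == n && score <= threshold) {
        continue; // cannot enter the top n, so nothing is allocated
      }
      heap.add(new ScoredItem(itemIds[i], score));
      if (heap.size() > n) {
        heap.poll(); // drop the weakest item
      }
      if (heap.size() == n) {
        threshold = heap.peek().score;
      }
    }
    List<ScoredItem> result = new ArrayList<>(heap);
    result.sort(Comparator.comparingDouble((ScoredItem s) -> s.score).reversed());
    return result;
  }
}
```

 Once the heap is full, most candidate scores fail the primitive comparison against the threshold, so the number of allocations stays far below the number of items scored.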

 On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen sro...@gmail.com wrote:

  OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
  probably, as most machines can handle 2 reducers at once).
  I think the recommendation step loads one whole matrix into memory.
 You're
  not running out of memory but if you're turning up the heap size to
  accommodate, you might be hitting swapping, yes. I think (?) the
  conventional wisdom is to turn off swap for Hadoop.
 
  Sebastian yes that is probably a good optimization; I've had good results
  reusing a mutable object in this context.
 
 
  On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins h...@joshdevins.com wrote:
 
   The factorization at 2-hours is kind of a non-issue (certainly fast
   enough). It was run with (if I recall correctly) 30 reducers across a
 35
   node cluster, with 10 iterations.
  
   I was a bit shocked at how long the recommendation step took and will
  throw
   some timing debug in to see where the problem lies exactly. There were
 no
   other jobs running on the cluster during these attempts, but it's
  certainly
   possible that something is swapping or the like. I'll be looking more
   closely today before I start to consider other options for calculating
  the
   recommendations.
  
  
 



Re: Top-N recommendations from SVD

2013-03-05 Thread Sean Owen
Without any tricks, yes you have to do this much work to really know which
are the largest values in UM' for every row. There's not an obvious twist
that speeds it up.

(Do you really want to compute all user recommendations? how many of the
2.6M are likely to be active soon, or, ever?)

First, usually it's only a subset of all items that are recommendable
anyway. You don't want them out of the model but don't need to consider
them. This is domain specific of course, but if 90% of the items are out
of stock or something, you need not bother to score them in the
first place.

Yes, LSH is exactly what I do as well. You hash the item feature vectors
into buckets and then only iterate over nearby buckets to find candidates.
You can avoid looking at 90+% of candidates this way without much if any
impact on top N.

Pruning is indeed third on the list but usually you get the problem to a
pretty good size from the points above.
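
To illustrate the bucketing idea, here is a minimal sketch of one common LSH variant, random-hyperplane hashing, in plain Java. It is an assumption about how such a scheme could look, not the implementation referred to above: all names are hypothetical, "nearby buckets" is taken to mean hashes differing in at most one bit, and the number of bits must stay below 32.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Illustrative only: random-hyperplane LSH over item feature vectors. */
final class HyperplaneLsh {
  private final double[][] hyperplanes; // one random vector per hash bit
  private final Map<Integer, List<Integer>> buckets = new HashMap<>();

  HyperplaneLsh(int numBits, int numFeatures, long seed) {
    Random random = new Random(seed);
    hyperplanes = new double[numBits][numFeatures];
    for (int b = 0; b < numBits; b++) {
      for (int f = 0; f < numFeatures; f++) {
        hyperplanes[b][f] = random.nextGaussian();
      }
    }
  }

  /** Bit b is set when the vector lies on the positive side of hyperplane b. */
  int hash(double[] vector) {
    int bits = 0;
    for (int b = 0; b < hyperplanes.length; b++) {
      double dot = 0.0;
      for (int f = 0; f < vector.length; f++) {
        dot += hyperplanes[b][f] * vector[f];
      }
      if (dot > 0.0) {
        bits |= 1 << b;
      }
    }
    return bits;
  }

  /** Adds an item (by index) to the bucket of its feature vector. */
  void index(int itemIndex, double[] itemVector) {
    buckets.computeIfAbsent(hash(itemVector), k -> new ArrayList<>()).add(itemIndex);
  }

  private List<Integer> bucket(int hash) {
    List<Integer> items = buckets.get(hash);
    return items == null ? new ArrayList<Integer>() : items;
  }

  /** Items in the user's own bucket plus buckets that differ by exactly one bit. */
  List<Integer> candidates(double[] userVector) {
    int h = hash(userVector);
    List<Integer> result = new ArrayList<>(bucket(h));
    for (int b = 0; b < hyperplanes.length; b++) {
      result.addAll(bucket(h ^ (1 << b)));
    }
    return result;
  }
}
```

Only the returned candidates are then scored exactly against the user vector, and the filter for non-recommendable items (out of stock and so on) can be applied to the same candidate list before scoring.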



On Tue, Mar 5, 2013 at 9:15 PM, Josh Devins h...@joshdevins.com wrote:

 Hi all,

 I have a conceptually simple problem. A user-item matrix, A, whose
 dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS with
 20 features reduces this in the usual way to A = UM'. Trying to generate
 top-n (where n=100) recommendations for all users in U is quite a long
 process though. Essentially, for every user, it's generating a prediction
 for all unrated items in M then taking the top-n (all in-memory). I'm using
 the standard ALS `RecommenderJob` for this.

 Considering that there are ~2.6M users and ~2.8M items, this is a really,
 really, time consuming way to find the top-n recommendations for all users
 in U. I feel like there could be a tricky way to avoid having to compute
 all item predictions of a user though. I can't find any reference in papers
 about improving this but at the moment, the estimate (with 10 mappers
 running the `RecommenderJob`) is ~80 hours. When I think about this problem
 I wonder if applying kNN or locality-sensitive min-hashing would somehow help
 me. Basically find the nearest neighbours directly and calculate
 predictions on those items only and not every item in M. On the flip side,
 I could start to reduce the item space, since it's quite large, basically
 start removing items that have low in-degrees since these probably don't
 contribute too much to the final recommendations. I don't like this so much
 though as it could remove some of the long-tail recommendations. At least,
 that is my intuition :)

 Thoughts anyone?

 Thanks in advance,

 Josh



Re: Top-N recommendations from SVD

2013-03-05 Thread Sean Owen
Ah OK, so this is quite a big problem. Still, it is quite useful to be able
to make recommendations in real-time, or near-real-time. It saves the
relatively quite large cost of precomputing, and lets you respond
immediately to new data. If the site has a lot of occasional or new users,
that can make a huge difference -- if I visit once, or once a month,
precomputing recommendations every day from tomorrow doesn't help much.

Of course, that can be difficult to reconcile with 100ms response times,
but with some tricks like LSH and some reasonable hardware I think you'd
find it possible at this scale. It does take a lot of engineering.
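
A minimal sketch of the LSH idea, assuming random-hyperplane (sign) hashing of the item feature vectors; this is one common variant and not necessarily the exact scheme used in any particular system. It buckets by angle, so it assumes that items pointing in roughly the user's direction are the ones with the largest dot products. All names and parameters here (`signature`, `build_buckets`, 8 hyperplanes, probing buckets one bit-flip away) are illustrative choices.

```python
import heapq
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def signature(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = 0
    for plane in planes:
        bits = (bits << 1) | int(dot(vec, plane) >= 0)
    return bits

def build_buckets(item_vecs, planes):
    """Group item ids by the signature of their feature vector."""
    buckets = {}
    for item_id, vec in item_vecs.items():
        buckets.setdefault(signature(vec, planes), []).append(item_id)
    return buckets

def candidates(user_vec, buckets, planes):
    """Items in the user's own bucket plus every bucket one bit-flip away."""
    sig = signature(user_vec, planes)
    nearby = [sig] + [sig ^ (1 << i) for i in range(len(planes))]
    return [item_id for s in nearby for item_id in buckets.get(s, [])]

def top_n_lsh(user_vec, item_vecs, buckets, planes, n):
    """Score only the candidate items instead of the whole catalogue."""
    scored = ((dot(user_vec, item_vecs[i]), i)
              for i in candidates(user_vec, buckets, planes))
    return [i for _, i in heapq.nlargest(n, scored)]

# Toy run: 1,000 random items with 20 features, 8 random hyperplanes.
rng = random.Random(42)
features, num_planes = 20, 8
planes = [[rng.gauss(0, 1) for _ in range(features)] for _ in range(num_planes)]
item_vecs = {i: [rng.gauss(0, 1) for _ in range(features)] for i in range(1000)}
buckets = build_buckets(item_vecs, planes)
user_vec = [rng.gauss(0, 1) for _ in range(features)]
print(len(candidates(user_vec, buckets, planes)), "of", len(item_vecs), "items scored")
print(top_n_lsh(user_vec, item_vecs, buckets, planes, n=5))
```

More hyperplanes mean smaller buckets and fewer candidates, at a higher risk of missing a true top-N item; the number of bits and how many neighbouring buckets to probe are the knobs to tune.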



On Tue, Mar 5, 2013 at 9:43 PM, Josh Devins h...@joshdevins.com wrote:

 Thanks Sean, at least I know I'm mostly on the right track ;)

 So in our case (a large, social, consumer website), this is already a small
 subset of all users (and items for that matter) and is really only the
 active users. In fact, in future iterations, the number of users will
 likely grow by around 3x (or at least, that's my optimistic target). So
 it's not very likely to be able to calculate recommendations for fewer
 users, but I like the idea of leaving all items in the matrix but not
 computing preference predictions for all of them. I will think on this and
 see if it fits for our domain (probably will work), and maybe a pull
 request to Mahout if I can make this generic in some way! LSH was my
 instinctual approach also but wasn't totally sure if this was sane! I'll
 have a look into this as well if needed.

 Thanks for the advice!

 Josh



 On 5 March 2013 22:23, Sean Owen sro...@gmail.com wrote:

  Without any tricks, yes you have to do this much work to really know which
  are the largest values in UM' for every row. There's not an obvious twist
  that speeds it up.

  (Do you really want to compute all user recommendations? How many of the
  2.6M are likely to be active soon, or ever?)

  First, usually it's only a subset of all items that are recommendable
  anyway. You don't want them out of the model but don't need to consider
  them. This is domain specific of course, but if 90% of the items are out
  of stock or something, you need not bother to score them in the
  first place.

  Yes, LSH is exactly what I do as well. You hash the item feature vectors
  into buckets and then only iterate over nearby buckets to find candidates.
  You can avoid looking at 90+% of candidates this way without much if any
  impact on top N.

  Pruning is indeed third on the list but usually you get the problem to a
  pretty good size from the points above.
 
 
 
 



Re: FileDataModel

2013-03-03 Thread Sean Owen
That's true, it does now. Depending on the implementation, you may still
need to rebuild things to reflect the changes. Also note that this wouldn't
invalidate caches you put on top.


On Sun, Mar 3, 2013 at 7:55 AM, Nadia Najjar ned...@gmail.com wrote:

 Thanks, Sean!
 The remove/setPreference methods throw an UnsupportedOperationException. I
 read in an old thread that you had updated these methods to work.  I'm not
 sure what I'm missing here. Can you point me in the right direction?


 On Mar 2, 2013, at 6:42 AM, Sean Owen wrote:

  Yes to integrate any new data everything must be reloaded.
  On Mar 2, 2013 6:34 AM, Nadia Najjar ned...@gmail.com wrote:
 
  I am using a FileDataModel and remove and insert preferences before
  estimating preferences. Do I need to rebuild the recommender after these
  methods are called for it to be reflected in the prediction?




Re: FileDataModel

2013-03-02 Thread Sean Owen
Yes to integrate any new data everything must be reloaded.
On Mar 2, 2013 6:34 AM, Nadia Najjar ned...@gmail.com wrote:

 I am using a FileDataModel and remove and insert preferences before
 estimating preferences. Do I need to rebuild the recommender after these
 methods are called for it to be reflected in the prediction?


Re: Hadoop version compatibility

2013-03-02 Thread Sean Owen
Although I don't know of any specific incompatibility, I would not be
surprised. 0.18 is pretty old. As you can see in pom.xml it currently works
against the latest stable version, 1.1.1.


On Sat, Mar 2, 2013 at 6:16 PM, MARCOS UBIRAJARA
marcosubiraj...@ig.com.br wrote:

 Dear Gentlemen,

 First of all, many thanks to this active and vibrant community, and to
 the Mahout creators as well.

 I'm taking my first steps with Mahout and Hadoop so that I can move ahead
 with my research.

 I'm facing some problems with Mahout 0.7 and Hadoop 0.18.


 Please let me know if both are compatible, and if not, which Hadoop version
 is compatible with Mahout 0.7?


 Thanks in advance for your help; it will surely be very helpful,


 Marcos
 Manaus
 Amazon - Brasil



Re: How to remove popular items?

2013-02-27 Thread Sean Owen
It's true, although many of the algorithms will by nature not emphasize
popular items.
There is an old and semi-deprecated class in the project
called InverseUserFrequency, which you can use to manually de-emphasize
popular items internally. I wouldn't really recommend it.

You can always use IDRescorer, yes. If you have business rules that dictate
some things must be filtered, that's the right way to go. Purely as a tool
to demote popular items, it's a bit heavy-handed and not the ideal way to
solve it.
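
To make the two options concrete, here is a library-agnostic sketch (plain Python, not Mahout's API): a hard filter for business rules combined with a soft down-weighting of popular items. The weight log(numUsers / usersWithItem) is an assumption in the spirit of inverse user frequency, not a description of what Mahout's class computes, and the function names and sample numbers are made up.

```python
import math

def popularity_weight(num_users, users_with_item):
    """Smaller weight for items that many users have interacted with."""
    return math.log(num_users / users_with_item)

def rescore(scores, item_counts, num_users, blocked=frozenset()):
    """Drop blocked items, down-weight popular ones, return ids best-first."""
    adjusted = {
        item: score * popularity_weight(num_users, item_counts[item])
        for item, score in scores.items()
        if item not in blocked
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)

scores = {"hit": 0.9, "niche": 0.6, "banned": 0.95}
item_counts = {"hit": 900, "niche": 10, "banned": 50}
print(rescore(scores, item_counts, num_users=1000, blocked={"banned"}))
```

The hard filter corresponds to the business-rule case; the multiplicative weight is the blunt instrument described above, since it reorders results without the model itself knowing about it.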


On Wed, Feb 27, 2013 at 1:39 PM, Aleksei Udatšnõi a.udac...@gmail.com wrote:

 Consider using IDRescorer to penalize or skip items.


 On Mon, Feb 4, 2013 at 6:54 PM, Zia mel ziad.kame...@gmail.com wrote:

  Hi, is there a current way to remove the popular items in the
  recommendations? Something like STOP words.
  Thanks !
 



Re: Vector distance within a cluster

2013-02-27 Thread Sean Owen
A common measure of cluster coherence is the mean distance or mean squared
distance between the members and the cluster centroid. It sounds like
this is the kind of thing you're measuring with these all-pairs distances.
That could be a measure too; I've usually seen that done by taking the
maximum such intracluster distance, the 'diameter'.
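
A small sketch of those three measures, assuming Euclidean distance and points given as plain tuples; the function names are illustrative and this is not Mahout's clustering API.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Component-wise mean of the cluster members."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def mean_distance_to_centroid(points):
    c = centroid(points)
    return sum(euclidean(p, c) for p in points) / len(points)

def mean_squared_distance_to_centroid(points):
    c = centroid(points)
    return sum(euclidean(p, c) ** 2 for p in points) / len(points)

def diameter(points):
    """Largest pairwise distance inside the cluster (0.0 for a single point)."""
    return max((euclidean(p, q) for p, q in combinations(points, 2)), default=0.0)

cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(mean_distance_to_centroid(cluster))
print(mean_squared_distance_to_centroid(cluster))
print(diameter(cluster))
```

The centroid-based measures cost O(n) per cluster, while the diameter needs all pairs, O(n^2), which matters for large clusters.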

To answer Ted's question -- you're measuring internal consistency. You're
not trying to find clusters that match some external standard that says
these 100 docs should cluster together, etc.

I'm speaking off the cuff, but I think the idea was that L1/Manhattan
distance may give you clusters that tend to spread out over fewer rather than
more dimensions, and so that may make them more interpretable -- because
they will tend to be nearly identical in the other several dimensions and
those homogeneous dimensions tell you what they're about.

The reason is that L1 is indifferent across dimensions -- moving a unit
in any dimension makes you a unit further from or closer to another point --
while in L2 moving along a dimension where you are already close does
little.
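
A tiny worked example of that difference, using two made-up 2-D points: one point sits at the origin and the other is 10 units away in the first dimension and identical in the second. Moving one unit in the far dimension adds 1 to both distances. Moving one unit in the close dimension still adds 1 under L1, but under L2 the distance only goes from 10 to sqrt(101), roughly 10.05.

```python
import math

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

origin = (0.0, 0.0)
start = (10.0, 0.0)  # far in dimension 0, identical in dimension 1
moves = {"far dimension": (11.0, 0.0), "close dimension": (10.0, 1.0)}

for name, moved in moves.items():
    l1_change = manhattan(origin, moved) - manhattan(origin, start)
    l2_change = euclidean(origin, moved) - euclidean(origin, start)
    print(name, "L1 change:", l1_change, "L2 change:", round(l2_change, 4))
```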

On Wed, Feb 27, 2013 at 3:23 PM, Chris Harrington ch...@heystaks.com wrote:

 Hmmm, you may have to dumb things down for me here. I don't have much
 of a background in the area of ML and I'm just piecing things together and
 learning as I go.
 So I don't really understand what you mean by "Coherence against an
 external standard"? Or "internal consistency/homogeneity"? Or "One thought
 along these lines is to add L_1 regularization to the k-means algorithm."
 Is L_1 regularization the same as Manhattan distance?

 That aside, I'm outputting a file with the top terms and the text of 20
 random documents that ended up in that cluster and eyeballing that. It's not
 very high-tech or efficient but it was the only way I knew to make a
 relevance judgment on a cluster topic. For example, if the majority of the
 samples are sport related and 82.6% of the vector distances in my cluster
 are quite similar, I'm happy to call that cluster sport.



Re: Cross recommendation

2013-02-24 Thread Sean Owen
I may not be 100% following the thread, but:

Similarity metrics won't care whether some items are really actions and
some items are items. The math is the same. The problem which you may be
alluding to is the one I mentioned earlier -- there is no connection
between item and item-action in the model, when there plainly is in real
life. The upside is what Ted mentioned: you get to treat actions like views
separately from purchases, and yes it's also certain those aren't the same
thing in real life. YMMV.

The piece of code you're playing with has nothing to do with latent factor
models and won't learn weights. It's going to assume by default that all
items (+actions) are equal.

(user+action,item) doesn't make sense. You compute item-item similarity
from (user,item+action) data. Some of the results are really item-action
similarities or action-action. It may be useful, maybe not, to know these
things too but you can just look at item-item if you want.



On Sun, Feb 24, 2013 at 4:39 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Yes I understand that you need (user, item+action) input for user based
 recs returned from recommender.recommend(userID, n).

 But can you expect item similarity to work with the same input? I am fuzzy
 about how item similarity is calculated in cf/taste.

 I was expecting to train one recommender with (user, item+action) and call
 recommender1.recommend(userID, n) to get recs but also train another
 recommender with (user+action, item) to get recommender2.mostSimilarItems(
 itemID, n). I realize it's a hack but that aside is this second recommender
 required? I'd expect it to return items that use all actions to calculate
 similarity and therefore will use view information to improve the
 similarity calculation.

 No?


 On Feb 23, 2013, at 10:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 No.

 It is uniformly better to have (item+action, user).  In fact, I would
 prefer to have it the other way around when describing it to match the
 matrix row x column convention.

 (user, item+action) where action is binary leads to A = [A_1 | A_2] = user
 by 2xitem.  The alternative of (user+action, item) leads to

     [ A_1 ]
 A = [     ] = 2xuser by item
     [ A_2 ]

 This last form doesn't have a uniform set of users to connect the items
 together.  When you compute the cooccurrence matrix you get A_1' A_1 + A_2'
 A_2 which gives you recommendations from 1=>1 and from 2=>2, but no
 recommendations 1=>2 or 2=>1.  Thus, no cross recommendations.
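
 The block-matrix argument can be checked on a toy example. In this sketch A_1 and A_2 are made-up 2-user by 2-item binary matrices (say, purchases and views). The side-by-side form yields a cooccurrence matrix containing the cross block A_1' A_2, while the stacked form collapses to A_1' A_1 + A_2' A_2 with no cross terms.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cooccurrence(a):
    """A'A for a matrix given as a list of rows."""
    return matmul(transpose(a), a)

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

A1 = [[1, 0], [0, 1]]  # action 1: user 0 -> item 0, user 1 -> item 1
A2 = [[0, 1], [1, 0]]  # action 2: user 0 -> item 1, user 1 -> item 0

side_by_side = [r1 + r2 for r1, r2 in zip(A1, A2)]  # user by 2 x item
stacked = A1 + A2                                    # 2 x user by item

full = cooccurrence(side_by_side)
cross_block = [row[2:] for row in full[:2]]          # upper-right block
print("cross block A_1' A_2:", cross_block)
print("stacked A'A:", cooccurrence(stacked))
```

 In the stacked form each (user, action) pair is a separate row, so nothing ties a user's action-1 row to the same user's action-2 row, and the cross block never appears.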



 On Sat, Feb 23, 2013 at 10:39 AM, Pat Ferrel pat.fer...@gmail.com wrote:

  But the discussion below led me to realize that cf/taste is doing
  something in addition to [B'B] h_p, which returns user-history-based recs.
  I'm getting better results currently from item-similarity-based recs, which
  I blend with user-history-based recs. To get item-similarity-based recs
  cf/taste is using a similarity metric and I'd guess that it uses the input
  matrix to get these results (something like the dot product for cosine).
  For item similarity should I create a training set of (item, user+action)?




Re: GenericUserBasedRecommender vs GenericItemBasedRecommender

2013-02-21 Thread Sean Owen
It's also valid, yes. The difference is partly due to asymmetry, but also
just historical (i.e. no great reason). The item-item system uses a
different strategy for picking candidates based on CandidateItemStrategy.


On Thu, Feb 21, 2013 at 2:37 PM, Koobas koo...@gmail.com wrote:

 In the GenericUserBasedRecommender the concept of a neighborhood seems to
 be fundamental.
 I.e., it is a classic implementation of the kNN algorithm.

 But it is not the case with the GenericItemBasedRecommender.
 I understand that the two approaches are not meant to be completely
 symmetric,
 but still, wouldn't it make sense, from the performance perspective, to
 compute items' neighborhoods first,
 and then use them to compute recommendations?

 If kNN was run on items first, then every item-item similarity would be
 computed once.
 It looks like in the GenericItemBasedRecommender each item-item similarity
 will be computed multiple times.
 (How much, depends on the data, but still.)

 I am wondering if anybody has any thoughts on the validity of doing
 item-item kNN in the context of:
 1) performance,
 2) quality of recommendations.



Re: Precision used by mahout

2013-02-20 Thread Sean Owen
I think all of the code uses double-precision floats. I imagine much of it
could work as well with single-precision floats.
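
To get a feel for what is lost when going from double to single precision, here is a small sketch that emulates single precision with the standard `struct` module. It is a plain-Python illustration, not Mahout code; the helper names are made up, and repeatedly adding 0.1 is just a stand-in for the accumulations a distance computation performs.

```python
import struct

def to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(values, single=False):
    """Sum values, optionally rounding every intermediate result to single precision."""
    total = 0.0
    for v in values:
        total = to_single(total + to_single(v)) if single else total + v
    return total

values = [0.1] * 10000
print("double:", accumulate(values))
print("single:", accumulate(values, single=True))
```

Whether that drift matters depends on the algorithm; for something like k-means assignments it usually only matters when two centroids are nearly equidistant.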

MapReduce and a GPU are very different things though, and I'm not sure how
you would use both together effectively.


On Wed, Feb 20, 2013 at 7:10 AM, shruti ranade shrutiranad...@gmail.com wrote:

 Hi,

 I am a beginner in Mahout. I am working on a k-means MR implementation and
 trying to run it on a GPGPU. *I wanted to know if Mahout computations are
 all double precision or single precision.*

 Please suggest any documentation that I should refer to.

 Thanks,
 Shruti



Re: Precision used by mahout

2013-02-20 Thread Sean Owen
I think this is quite possible too. I just think there's little point in
matching this up with Hadoop. They represent entirely different
architectures for large-scale computation. I mean, you can probably write
an M/R job that uses GPUs on workers, but I imagine it would be an
artificial marriage of technologies. Probably Hadoop being used simply to
distribute data.

If you want to use a GPU, and want to use it properly, most of your work is
to create an effective in-core parallel implementation, not distributed
across computers and distributed file systems. You use JNI or CUDA bindings
in Java to push computations into hardware from Java.

This is an exercise in a) modifying a matrix/vector library to use native
hardware, then b) writing algorithms that use that library. I think your
best starting point in Java may be something more general like Commons Math.




On Wed, Feb 20, 2013 at 10:22 AM, 万代豊 20525entrad...@gmail.com wrote:

 This is a topic that I'm interested in too.
 I believe item-based recommendation in Mahout (and not only in Mahout)
 spends a good deal of its time
 multiplying the cooccurrence matrix by the user preference vector.
 If we could offload this multiplication to a GPGPU, that
 would be a great acceleration.
 What I'm not really clear on is how a double-precision multiplication
 inside the Java Virtual Machine can take advantage of the hardware
 accelerator. (I mean, how can you make the GPGPU visible to Mahout through
 the JVM?)

 If we could get past this, in addition to what Ted Dunning presented the
 other day on Solr's involvement in building/loading the cooccurrence matrix
 for Mahout recommendation, it should be a big leap in innovating Mahout
 recommendation.

 Am I missing something or just dreaming?
 Regards,
 Y.Mandai




Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-17 Thread Sean Owen
I agree with that explanation. Is that why it's unsupervised? Well, I think
of recommendation in the context of things like dimension reduction, which
are just structure-finding exercises. Often the input has no positive or
negative label (a click); everything is 'positive'. If you're predicting
anything, it's not one target, but many targets, one per item, as if you
have many small supervised problems.

Whatever that is called -- I was just saying that it's not a simple
supervised problem, and so it's not necessarily true that the things you do
when testing that kind of thing apply here.

Viewed through the supervised lens, I suppose you could say that this
process only ever predicts the positive class, and that's different. In
fact it is not classifying given test examples at all... it's like it is
telling you which of many classifiers (items) would be most likely to
return the positive class.

On Sun, Feb 17, 2013 at 11:56 AM, Osman Başkaya
osman.bask...@computer.org wrote:

 I am sorry to extend the unsupervised/supervised discussion which is not
 the main question here but I need to ask.

 Sean, I don't understand your last answer. Let's assume our rating scale is
 from 1 to 5. We can say that those movies which a particular user rates as
 5 are relevant for him/her. 5 is just a number; we can use a *relevance
 threshold* like you did and we can follow the method described in Cremonesi
 et al., "Performance of Recommender Algorithms on Top-N Recommendation
 Tasks" (http://goo.gl/pejO7; *2. Testing Methodology*, p. 2).

 Are you saying that this job is unsupervised since no user can rate all of
 the movies? For this reason, we can't be sure that our predicted top-N list
 contains no relevant item, because it is possible that our top-N
 recommendation list has relevant movie(s) which haven't been rated by the
 user *yet* as relevant. By using this evaluation procedure we miss them.

 In short, the following assumption can be problematic:

 We randomly select 1000 additional items unrated by
  user u. We may assume that most of them will not be
  of interest to user u.


 Although bigger N values mostly overcome this problem, it still does not
 seem totally supervised.


 On Sun, Feb 17, 2013 at 1:49 AM, Sean Owen sro...@gmail.com wrote:

  The very question at hand is how to label the data as relevant and not
  relevant results. The question exists because this is not given, which is
  why I would not call this a supervised problem. That may just be semantics,
  but the point I wanted to make is that the reasons why choosing a random
  training set is correct for a supervised learning problem are not reasons
  to determine the labels randomly from among the given data. It is a good
  idea if you're doing, say, logistic regression. It's not the best way here.
  This also seems to reflect the difference between whatever you want to call
  this and your garden variety supervised learning problem.

  On Sat, Feb 16, 2013 at 11:15 PM, Ted Dunning ted.dunn...@gmail.com wrote:

   Sean

   I think it is still a supervised learning problem in that there is a
   labelled training data set and an unlabeled test data set.

   Learning a ranking doesn't change the basic dichotomy between supervised
   and unsupervised.  It just changes the desired figure of merit.
  
 



 --
 Osman Başkaya
 Koc University
 MS Student | Computer Science and Engineering



Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
No, this is not a problem.

Yes it builds a model for each user, which takes a long time. It's
accurate, but time-consuming. It's meant for small data. You could rewrite
your own test to hold out data for all test users at once. That's what I
did when I rewrote a lot of this just because it was more useful to have
larger tests.

There are several ways to choose the test data. One common way is by time,
but there is no time information here by default. The problem is that, for
example, recent ratings may be low -- or at least not high ratings. But the
evaluation is of course asking the recommender for items that are predicted
to be highly rated. Random selection has the same problem. Choosing by
rating at least makes the test coherent.

It does bias the training set, but, the test set is supposed to be small.

There is no way to actually know, a priori, what the top recommendations
are. You have no information to evaluate most recommendations. This makes a
precision/recall test fairly uninformative in practice. Still, it's better
than nothing and commonly understood.
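
A minimal sketch of that kind of test for a single user, in plain Python rather than the evaluator's actual code: the items rated at or above a relevance threshold are held out as the 'right answers', the rest stay in the training data, and the top-N list is compared against the held-out set. Treating the threshold as inclusive is an assumption, and the function names and sample ratings are made up.

```python
def split_by_relevance(user_ratings, threshold):
    """Hold out items rated at or above threshold; keep the rest for training."""
    relevant = {item for item, r in user_ratings.items() if r >= threshold}
    training = {item: r for item, r in user_ratings.items() if r < threshold}
    return relevant, training

def precision_recall(recommended, relevant):
    """Precision and recall of a top-N list against the held-out relevant items."""
    hits = len(set(recommended) & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

ratings = {1: 5.0, 2: 4.0, 3: 2.0, 4: 1.0}
relevant, training = split_by_relevance(ratings, threshold=4.0)
# Pretend a recommender trained on `training` (plus all other users) returned:
recommended = [1, 7, 8, 9]
print(precision_recall(recommended, relevant))
```

Items 7, 8 and 9 count as misses here even though the user never rated them, which is exactly why the absolute numbers from such a test are hard to interpret.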

Precision and recall won't be high on tests like this, for that reason.
Still, I don't get values this low for MovieLens data with any normal
algorithm. You may be getting them because you chose an algorithm or
parameters that don't work well.




On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Yılmaz ahmetyilmazefe...@yahoo.com wrote:

 Hi,

 I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
 code. I think that there are two important problems here.

 According to my understanding the experimental protocol used in this code
 is something like this:

 It takes away a certain percentage of users as test users.
 For each test user it builds a training set consisting of the ratings given
 by all other users plus the ratings of the test user which are below the
 relevanceThreshold.
 It then builds a model, makes a recommendation to the test user, and finds
 the intersection between this recommendation list and the items which are
 rated above the relevanceThreshold by the test user.
 It then calculates the precision and recall in the usual way.

 Problems:
 1. (mild) It builds a model for every test user, which can take a lot of
 time.

 2. (severe) Only the ratings (of the test user) which are below the
 relevanceThreshold are put into the training set. This means that the
 algorithm only knows the preferences of the test user about the items which
 s/he doesn't like. This is not a good representation of user ratings.

 Moreover when I run this evaluator on movielens 1m data, the precision and
 recall turned out to be, respectively,

 0.011534185658699288
 0.007905982905982885

 and the run took about 13 minutes on my intel core i3. (I used user based
 recommendation with k=2)


 Although I know that it is not OK to judge the performance of a
 recommendation algorithm by looking at these absolute precision and recall
 values, these numbers still seem too low to me, which might be the result
 of the second problem I mentioned above.

 Am I missing something?

 Thanks
 Ahmet



Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
Yes. But: the test sample is small. Using 40% of your data to test is
probably far too much.

My point is that it may be the least-bad thing to do. What test are you
proposing instead, and why is it coherent with what you're testing?




On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Yılmaz ahmetyilmazefe...@yahoo.com wrote:

 But modeling a user only by his/her low ratings can be problematic since
 people generally are more precise (I believe) in their high ratings.
 Another problem is that recommender algorithms in general first mean
 normalize the ratings for each user. Suppose that we have the following
 ratings of 3 people (A, B, and C) on 5 items.

 A's ratings: 1 2 3 4 5
 B's ratings: 1 3 5 2 4
 C's ratings: 1 2 3 4 5


 Suppose that A is the test user. Now if we put only the low ratings of A
 (1, 2, and 3) into the training set and mean-normalize the ratings, then A
 will appear more similar to B than to C, which is not true: A and C have
 identical ratings.
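
 The effect can be reproduced with a few lines of Python. This sketch assumes each user's ratings are centered on the mean of the ratings that user has in the training data, and that similarity is the cosine over co-rated items; a real recommender may use a different similarity, so the exact numbers are only illustrative.

```python
import math

def centered(ratings):
    """Subtract the user's own mean rating from each of their ratings."""
    mean = sum(ratings.values()) / len(ratings)
    return {item: r - mean for item, r in ratings.items()}

def similarity(u, v):
    """Cosine similarity of mean-centered ratings over co-rated items."""
    cu, cv = centered(u), centered(v)
    common = cu.keys() & cv.keys()
    num = sum(cu[i] * cv[i] for i in common)
    den = (math.sqrt(sum(cu[i] ** 2 for i in common))
           * math.sqrt(sum(cv[i] ** 2 for i in common)))
    return num / den if den else 0.0

A_full = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
B = {1: 1, 2: 3, 3: 5, 4: 2, 5: 4}
C = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
A_train = {item: r for item, r in A_full.items() if r <= 3}  # low ratings only

print("full data:   A~B", similarity(A_full, B), " A~C", similarity(A_full, C))
print("low ratings: A~B", similarity(A_train, B), " A~C", similarity(A_train, C))
```

 With all five ratings A is closer to C than to B, as expected; with only A's three low ratings the order flips, because A's mean drops to 2 while B's and C's stay at 3.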




 



Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
This is a good answer for evaluation of supervised ML, but, this is
unsupervised. Choosing randomly is choosing the 'right answers' randomly,
and that's plainly problematic.


On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:

 I think, it is better to choose ratings of the test user in a random
 fashion.

 On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen sro...@gmail.com wrote:
  Yes. But: the test sample is small. Using 40% of your data to test is
  probably quite too much.
 
  My point is that it may be the least-bad thing to do. What test are you
  proposing instead, and why is it coherent with what you're testing?
 



Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
Sure, if you were predicting ratings for one movie given a set of ratings
for that movie and the ratings for many other movies. That isn't what the
recommender problem is. Here, the problem is to list N movies most likely
to be top-rated. The precision-recall test is, in turn, a test of top N
results, not a test over prediction accuracy. We aren't talking about RMSE
here or even any particular means of generating top N recommendations. You
don't even have to predict ratings to make a top N list.


On Sat, Feb 16, 2013 at 9:28 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:

 No, rating prediction is clearly a supervised ML problem




Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
If you're suggesting that you hold out only high-rated items, and then
sample them, then that's what is done already in the code, except without
the sampling. The sampling doesn't buy anything that I can see.

If you're suggesting holding out a random subset and then throwing away the
held-out items with low rating, then it's also the same idea, except you're
randomly throwing away some lower-rated data from both test and train. I
don't see what that helps either.


On Sat, Feb 16, 2013 at 9:41 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:

 What I mean is you can choose ratings randomly and try to recommend
 the ones above the threshold.




Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
I understand the idea, but this boils down to the current implementation,
plus going back and throwing out some additional training data that is
lower rated -- it's in neither test nor training. Anything's possible, but I
do not imagine this is a helpful practice in general.


On Sat, Feb 16, 2013 at 10:29 PM, Tevfik Aytekin
tevfik.ayte...@gmail.com wrote:

 I'm suggesting the second one. That way the test user's ratings in
 the training set will consist of both low and high rated items, which
 prevents the problem pointed out by Ahmet.




Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
The very question at hand is how to label the data as relevant and not
relevant results. The question exists because this is not given, which is
why I would not call this a supervised problem. That may just be semantics,
but the point I wanted to make is this: the reasons that choosing a random
training set is correct for a supervised learning problem are not reasons
to determine the labels randomly from among the given data. It is a good
idea if you're doing, say, logistic regression. It's not the best way here.
This also seems to reflect the difference between whatever you want to call
this and your garden-variety supervised learning problem.

On Sat, Feb 16, 2013 at 11:15 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Sean

 I think it is still a supervised learning problem in that there is a
 labelled training data set and an unlabeled test data set.

 Learning a ranking doesn't change the basic dichotomy between supervised
 and unsupervised.  It just changes the desired figure of merit.



Re: Improving quality of item similarities?

2013-02-14 Thread Sean Owen
Yes, I don't know if removing that data would improve results. It might
mean you can compute things faster, at little or no observable loss in
quality of the results.

I'm not sure, but you probably have repeat purchases of the same item, and
items of different value. Working in that data may help here since you have
relatively few items.
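
For reference, the log-likelihood ratio statistic that this kind of similarity
is built on can be sketched as below. This is a standalone sketch of the
standard G^2 test on a 2x2 co-occurrence table; it is not copied from Mahout,
the class name is hypothetical, and the counts in main are made up. Here k11 is
the number of users who bought both items, k12 and k21 those who bought only
one of them, and k22 those who bought neither:

```java
// Hypothetical standalone sketch of the log-likelihood ratio (G^2) test on a
// 2x2 co-occurrence table. Not taken from Mahout's source.
class LlrSketch {

  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized entropy of a list of counts: N*ln(N) - sum(k*ln(k)).
  static double entropy(long... counts) {
    long sum = 0;
    double parts = 0.0;
    for (long c : counts) {
      sum += c;
      parts += xLogX(c);
    }
    return xLogX(sum) - parts;
  }

  // k11: both items, k12: first only, k21: second only, k22: neither.
  static double llr(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Clamp at zero to absorb small negative values from rounding.
    return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
  }

  public static void main(String[] args) {
    // Counts that look independent score near zero.
    System.out.println(llr(5, 5, 5, 5));
    // Counts that always co-occur score much higher.
    System.out.println(llr(10, 0, 0, 10));
  }
}
```

Whether filtering low-count users and items changes these scores enough to
matter is something to measure on your own data.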


On Thu, Feb 14, 2013 at 10:25 AM, Julian Ortega jorte...@gmail.com wrote:

 Hi everyone.

 I have a data set that looks like this:

 Number of users: 198651
 Number of items: 9972

 Statistics of purchases from users
 
 mean number of purchases
 3.3
 stdDev number of purchases
 3.5
 min number of purchases
 1
 max number of purchases
 176
 median number of purchases
 2

 Statistics of purchased items
 
 mean number of times bought
 65.1
 stdDev number of times bought
 120.7
 min number of times bought
 1
 max number of times bought
 3278
 median number of times bought
 25

 I'm using a GenericItemBasedRecommender with LogLikelihoodSimilarity to
 generate a list of similar items. However, I've been wondering how I should
 pre-process the data before passing it to the recommender to improve the
 quality.

 Some things I have considered are:

- Removing all users that have 5 or fewer purchases
- Removing all items that have been purchased 5 or fewer times

 In general terms, would that make sense? Presumably it will make the matrix
 less sparse and also avoid weak associations, although if I'm not
 mistaken LogLikelihood accounts for a low number of occurrences.

 Any thoughts?

 Thanks,
 Julian



Re: Shopping cart

2013-02-14 Thread Sean Owen
This sounds like a job for frequent item set mining, which is kind of a
special case of the ideas you've mentioned here. Given N items in a cart,
which next item most frequently occurs in a purchased cart?
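
To make the question concrete, here is a brute-force sketch. It is only an
illustration of the counting idea, not a real frequent itemset miner such as
Apriori or FP-Growth, and the class and method names are hypothetical. It looks
at purchased carts that contain everything in the current cart and returns the
other item that shows up most often:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical brute-force sketch: which other item most often appears in
// purchased carts that contain all items of the current cart?
class NextItemSketch {

  static Long mostFrequentNext(List<Set<Long>> purchasedCarts, Set<Long> current) {
    Map<Long, Integer> counts = new HashMap<>();
    for (Set<Long> cart : purchasedCarts) {
      if (!cart.containsAll(current)) {
        continue; // only carts that extend the current cart are counted
      }
      for (Long item : cart) {
        if (!current.contains(item)) {
          counts.merge(item, 1, Integer::sum);
        }
      }
    }
    Long best = null;
    int bestCount = 0;
    for (Map.Entry<Long, Integer> e : counts.entrySet()) {
      int c = e.getValue();
      // Ties are broken by the lower item ID so the result is deterministic.
      if (c > bestCount || (c == bestCount && e.getKey() < best)) {
        best = e.getKey();
        bestCount = c;
      }
    }
    return best; // null if no purchased cart contains the current cart
  }

  static Set<Long> cart(Long... items) {
    return new HashSet<>(Arrays.asList(items));
  }

  public static void main(String[] args) {
    List<Set<Long>> carts = Arrays.asList(
        cart(1L, 2L, 3L), cart(1L, 2L, 4L), cart(1L, 2L, 3L, 5L), cart(2L, 6L));
    System.out.println(mostFrequentNext(carts, cart(1L, 2L)));
  }
}
```

A real miner avoids rescanning every cart per query by precomputing frequent
sets above a support threshold.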


On Thu, Feb 14, 2013 at 6:30 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 I thought you might say that but we don't have the add-to-cart action. We
 have to calculate cart purchases by matching cart IDs or session IDs. So we
 only have cart purchases with items.

 If we had the add-to-cart and the purchase we could use your cross-action
 method for getting recs by training only on those two actions.

 Still without the add-to-cart the method below should work, right? The
 main problem being finding a similar cart in the training set quickly. Are
 there other problems?

 On Feb 14, 2013, at 9:19 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 I think that this is an excellent use case for cross recommendation from
 cart contents (items) to cart purchases (items).  The cross aspect is that
 the recommendation is from two different kinds of actions, not two kinds of
 things.  The first action is insertion into a cart and the second is
 purchase of an item.

 On Thu, Feb 14, 2013 at 9:53 AM, Pat Ferrel pat.fer...@gmail.com wrote:

  There are several methods for recommending things given a shopping cart
  contents. At the risk of using the same tool for every problem I was
  thinking about a recommender's use here.
 
  I'd do something like train on shopping cart purchases so row = cartID,
  column = itemID.
  Given cart contents I could find the most similar cart in the training
 set
  by using a similarity measure then get recs for this closest matched
 cart.
 
  The search for similar carts may be slow if I have to check for pairwise
  similarity so I could cluster and find the best cluster then search it
 for
  the best cart. I could create a decision tree on all trained carts and
 walk
  as far as I can down the tree to find the cart with the most
 cooccurrences.
  There may be other cooccurrence based methods in mahout??? With the id of
  the cart I can then get recs from the training set. I could also fold-in
  the new cart contents to the training set and ask for recs based on it
  (this seems like it would take a long time to compute). This last would
  also pollute the trained matrix with partial carts over time.
 
  This seems like another place where Lucene might help but are there other
  mahout methods to look at before I dive into Lucene?




Re: Shopping cart

2013-02-14 Thread Sean Owen
I don't think it's necessarily slow; this is how item-based recommenders
work. The only thing stopping you from using Mahout directly is that I
don't think there's an easy way to say "recommend to this collection of
items". But that's what is happening inside when you recommend for a user.

You can just roll your own version of it. Yes, you are computing similarity
for k carted items by all N items, but is N so large? Hundreds of
thousands of products? This is still likely pretty fast even if the
similarity is over millions of carts. Some smart precomputation and caching
goes a long way too.
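
A minimal sketch of what "roll your own" could look like follows. It assumes
you already have some item-item similarity function; the Similarity interface,
the class name, and the toy similarity in main are all hypothetical and are not
Mahout API. Each candidate item is scored by summing its similarity to the k
carted items, which is the k x N loop described above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of recommending to a collection of items (a cart).
class CartRecommenderSketch {

  // Stand-in for whatever item-item similarity you have; not a Mahout type.
  interface Similarity {
    double similarity(long a, long b);
  }

  static List<Long> recommend(Set<Long> cart, Collection<Long> allItems,
                              Similarity sim, int howMany) {
    final Map<Long, Double> scores = new HashMap<>();
    for (long candidate : allItems) {
      if (cart.contains(candidate)) {
        continue; // never recommend what is already in the cart
      }
      double total = 0.0;
      for (long carted : cart) {
        total += sim.similarity(carted, candidate);
      }
      scores.put(candidate, total);
    }
    List<Long> ranked = new ArrayList<>(scores.keySet());
    // Highest score first; ties broken by lower item ID.
    ranked.sort((x, y) -> {
      int c = Double.compare(scores.get(y), scores.get(x));
      return c != 0 ? c : Long.compare(x, y);
    });
    return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
  }

  public static void main(String[] args) {
    // Toy similarity for illustration only: closer IDs are "more similar".
    Similarity sim = (a, b) -> 1.0 / (1 + Math.abs(a - b));
    Set<Long> cart = new HashSet<>(Arrays.asList(10L, 12L));
    List<Long> all = Arrays.asList(9L, 10L, 11L, 12L, 13L);
    System.out.println(recommend(cart, all, sim, 2));
  }
}
```

Caching the similarity values for popular items would be the first
optimization to try.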


On Thu, Feb 14, 2013 at 7:10 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Yes, one time tested way to do this is the apriori algo which looks at
 frequent item sets and creates rules.

 I was looking for a shortcut using a recommender, which would be super
 easy to try. The rule builder is a little harder to implement but we can
 also test precision on that and compare the two.

 The recommender method below should be reasonable AFAICT except for the
 method(s) of retrieving recs, which seem likely to be slow.

 On Feb 14, 2013, at 9:45 AM, Sean Owen sro...@gmail.com wrote:

 This sounds like a job for frequent item set mining, which is kind of a
 special case of the ideas you've mentioned here. Given N items in a cart,
 which next item most frequently occurs in a purchased cart?


 On Thu, Feb 14, 2013 at 6:30 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  I thought you might say that but we don't have the add-to-cart action. We
  have to calculate cart purchases by matching cart IDs or session IDs. So
 we
  only have cart purchases with items.
 
  If we had the add-to-cart and the purchase we could use your cross-action
  method for getting recs by training only on those two actions.
 
  Still without the add-to-cart the method below should work, right? The
  main problem being finding a similar cart in the training set quickly.
 Are
  there other problems?
 
  On Feb 14, 2013, at 9:19 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  I think that this is an excellent use case for cross recommendation from
  cart contents (items) to cart purchases (items).  The cross aspect is
 that
  the recommendation is from two different kinds of actions, not two kinds
 of
  things.  The first action is insertion into a cart and the second is
  purchase of an item.
 
  On Thu, Feb 14, 2013 at 9:53 AM, Pat Ferrel pat.fer...@gmail.com
 wrote:
 
  There are several methods for recommending things given a shopping cart
  contents. At the risk of using the same tool for every problem I was
  thinking about a recommender's use here.
 
  I'd do something like train on shopping cart purchases so row = cartID,
  column = itemID.
  Given cart contents I could find the most similar cart in the training
  set
  by using a similarity measure then get recs for this closest matched
  cart.
 
  The search for similar carts may be slow if I have to check for pairwise
  similarity so I could cluster and find the best cluster then search it
  for
  the best cart. I could create a decision tree on all trained carts and
  walk
  as far as I can down the tree to find the cart with the most
  cooccurrences.
  There may be other cooccurrence based methods in mahout??? With the id
 of
  the cart I can then get recs from the training set. I could also fold-in
  the new cart contents to the training set and ask for recs based on it
  (this seems like it would take a long time to compute). This last would
  also pollute the trained matrix with partial carts over time.
 
  This seems like another place where Lucene might help but are there
 other
  mahout methods to look at before I dive into Lucene?
 
 




Re: Shopping cart

2013-02-14 Thread Sean Owen
Yes your only issue there, which I think you had touched on, was that you
have to put your current cart (which hasn't been purchased) into the model
in order to get an answer out of a recommender. I think we've talked about
the recommend-to-anonymous function in the context of another system, which
is exactly what you need here.

Yes, all you have to do then is reproduce the recommender computation. But
I understand that you were hoping to avoid rewriting it. It's really just a
loop though, so not much work to reproduce.

100K items x a few items in a cart is a few hundred thousand similarities.
This isn't trivial but not going to take seconds, I think. Yes this gets
much faster if you can precompute item-item similarity. Computing NxN pairs
is going to take a long time though when N=100,000. So yes something like
clustering is the nice way to scale that. Then your clusters greatly limit
the number of candidates to consider because you can round every other
inter-cluster similarity to 0.

By this point, I imagine it's about as hard to whip up a frequent itemset
implementation, or crib one and adapt it. There is one in Mahout. That's
probably the right tool for the job.



On Thu, Feb 14, 2013 at 8:19 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 I'm creating a matrix of cart ids and items ids so cart x items in cart.
 The 'preference' then is cartID, itemID. This will create the correct
 matrix I think.

 For any cart id I would get a ranked list of recommended items that was
 calculated from other carts. This seems like what is needed in a SC
 recommender. So doing this should give a recommend to this collection of
 items, right?

 The only issue is finding the best cart to get the recs. I would be doing
 a pair-wise similarity comparison for N carts to the current cart contents
 and the result would have to come back in a very short amount of time, on
 the order of the time to get recs for 3M users and 100K items.

 Not sure what N is yet but the # of items is the same as in the purchase
 matrix. So finding the best cart to get recs for will be N similarity
 comparisons--worst case. Each cart is likely to have only a few items in it
 and I imagine this speeds the similarity calc.

 I guess I'll try it as described and optimize for speed if the precision
 is good compared to the apriori algo.

 On Feb 14, 2013, at 10:57 AM, Sean Owen sro...@gmail.com wrote:

 I don't think it's necessarily slow; this is how item-based recommenders
 work. The only thing stopping you from using Mahout directly is that I
 don't think there's an easy way to say recommend to this collection of
 items. But that's what is happening inside when you recommend for a user.

 You can just roll your own version of it. Yes you are computing similarity
 for k carted items  by all N items, but is N so large? hundreds of
 thousands of products? this is still likely pretty fast even if the
 similarity is over millions of carts. Some smart precomputation and caching
 goes a long way too.


 On Thu, Feb 14, 2013 at 7:10 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  Yes, one time tested way to do this is the apriori algo which looks at
  frequent item sets and creates rules.
 
  I was looking for a shortcut using a recommender, which would be super
  easy to try. The rule builder is a little harder to implement but we can
  also test precision on that and compare the two.
 
  The recommender method below should be reasonable AFAICT except for the
  method(s) of retrieving recs, which seem likely to be slow.
 
  On Feb 14, 2013, at 9:45 AM, Sean Owen sro...@gmail.com wrote:
 
  This sounds like a job for frequent item set mining, which is kind of a
  special case of the ideas you've mentioned here. Given N items in a cart,
  which next item most frequently occurs in a purchased cart?
 
 
  On Thu, Feb 14, 2013 at 6:30 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:
 
  I thought you might say that but we don't have the add-to-cart action.
 We
  have to calculate cart purchases by matching cart IDs or session IDs. So
  we
  only have cart purchases with items.
 
  If we had the add-to-cart and the purchase we could use your
 cross-action
  method for getting recs by training only on those two actions.
 
  Still without the add-to-cart the method below should work, right? The
  main problem being finding a similar cart in the training set quickly.
  Are
  there other problems?
 
  On Feb 14, 2013, at 9:19 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  I think that this is an excellent use case for cross recommendation from
  cart contents (items) to cart purchases (items).  The cross aspect is
  that
  the recommendation is from two different kinds of actions, not two kinds
  of
  things.  The first action is insertion into a cart and the second is
  purchase of an item.
 
  On Thu, Feb 14, 2013 at 9:53 AM, Pat Ferrel pat.fer...@gmail.com
  wrote:
 
  There are several methods for recommending things given a shopping cart
  contents. At the risk

Re: Implicit preferences

2013-02-10 Thread Sean Owen
I think you'd have to hack the code to not exclude previously-seen items,
or at least, not of the type you wish to consider. Yes you would also have
to hack it to add rather than replace existing values. Or for test
purposes, just do the adding yourself before inputting the data.
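
"Doing the adding yourself" could look like the sketch below. The action names
and weights are hypothetical placeholders, not recommended values, and the
class is not part of Mahout. Repeated events for the same user and item are
summed into a single preference value before the data is written out as
user,item,value lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical pre-processing sketch: sum weighted implicit events into one
// preference value per (user, item) pair before feeding a recommender.
class ImplicitPrefSketch {

  // Placeholder weights; choosing good ones is the experiment itself.
  static double weight(String action) {
    switch (action) {
      case "purchase":
        return 5.0;
      case "add-to-cart":
        return 2.0;
      case "view":
        return 0.5;
      default:
        return 0.0;
    }
  }

  // Each event is {userID, itemID, action}. Keys of the result are "user,item".
  static Map<String, Double> aggregate(String[][] events) {
    Map<String, Double> prefs = new LinkedHashMap<>();
    for (String[] e : events) {
      prefs.merge(e[0] + "," + e[1], weight(e[2]), Double::sum);
    }
    return prefs;
  }

  public static void main(String[] args) {
    String[][] events = {
        {"1", "100", "view"}, {"1", "100", "view"},
        {"1", "100", "purchase"}, {"2", "100", "view"}};
    for (Map.Entry<String, Double> e : aggregate(events).entrySet()) {
      System.out.println(e.getKey() + "," + e.getValue());
    }
  }
}
```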

My hunch is that it will hurt non-trivially to treat different interaction
types as different items. You probably want to predict that someone who
viewed a product over and over is likely to buy it, but this would only
weakly tend to occur if the bought-item is not the same thing as the
viewed-item. You'd learn they go together but not as strongly as ought to
be obvious from the fact that they're the same. Still, interesting thought.

There ought to be some 'signal' in this data, just a question of how much
vs noise. A purchase means much more than a page view of course; it's not
as subject to noise. Finding a means to use that info is probably going to
help.




On Sat, Feb 9, 2013 at 7:50 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 I'd like to experiment with using several types of implicit preference
 values with recommenders. I have purchases as an implicit pref of high
 strength. I'd like to see if add-to-cart, view-product-details,
 impressions-seen, etc. can increase offline precision in holdout tests.
 These less than obvious implicit prefs will get a much lower value than
 purchase and i'll experiment with different mixes. The problem is that some
 of these prefs will indicate that the user, for whom I'm getting recs, has
 expressed a preference.

 Using these implicit prefs seems reasonable in finding similarity of taste
 between users but presents several problems. 1) how to encode the prefs,
 each impression-seen will increase the strength of preference of a user for
 an item but the recommender framework replaces the preference value for
 items preferred more than once, doesn't it? 2) AFAIK the current
 recommender framework will return recs only for items that the user in
 question has expressed no preference for. If I use something like
 view-product-details or impressions-seen, I will be removing anything the
 user has seen from the recs, which is not what I want in this experiment.

 Has anyone tried something like this? I'm not convinced that these other
 implicit preferences will add anything to the recommender, just trying to
 find out.


Re: Implicit preferences

2013-02-10 Thread Sean Owen
Yeah, I bet it does actually work well... but aren't you basically spending
an extra step to make the item-item matrix, to relearn that bought-X and
viewed-X go together? Yeah, you learn a lot more along the way, as this is
item-based recommendation at heart. It seems like you could add back that
knowledge.
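
For readers following the algebra in Ted's message quoted below, the purchase
prediction r_p = B'A h_v + B'B h_p can be sketched with small dense arrays.
This is only an illustration: it uses raw cooccurrence counts with no
sparsification or weighting, the class name is hypothetical, and the matrices
in main are made up. A is users x viewed items and B is users x purchased
items:

```java
import java.util.Arrays;

// Hypothetical dense sketch of r_p = B'A h_v + B'B h_p with raw counts.
class CrossRecSketch {

  // Computes X'Y for X (users x m) and Y (users x n); the result is m x n.
  static double[][] transposeTimes(double[][] x, double[][] y) {
    int m = x[0].length;
    int n = y[0].length;
    double[][] out = new double[m][n];
    for (int u = 0; u < x.length; u++) {
      for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
          out[i][j] += x[u][i] * y[u][j];
        }
      }
    }
    return out;
  }

  static double[] times(double[][] m, double[] v) {
    double[] out = new double[m.length];
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < v.length; j++) {
        out[i] += m[i][j] * v[j];
      }
    }
    return out;
  }

  // a: user x view matrix, b: user x purchase matrix,
  // hv: the user's view history, hp: the user's purchase history.
  static double[] recommend(double[][] a, double[][] b, double[] hv, double[] hp) {
    double[] fromViews = times(transposeTimes(b, a), hv);     // B'A h_v (cross term)
    double[] fromPurchases = times(transposeTimes(b, b), hp); // B'B h_p
    double[] r = new double[fromViews.length];
    for (int i = 0; i < r.length; i++) {
      r[i] = fromViews[i] + fromPurchases[i];
    }
    return r;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 1}, {1, 0}, {0, 1}}; // views: 3 users x 2 items
    double[][] b = {{1, 0}, {1, 0}, {0, 1}}; // purchases: 3 users x 2 items
    double[] hv = {1, 0}; // this user viewed item 0
    double[] hp = {0, 1}; // and purchased item 1
    System.out.println(Arrays.toString(recommend(a, b, hv, hp)));
  }
}
```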


On Sun, Feb 10, 2013 at 5:36 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Actually treating the different interactions separately can lead to very
 good recommendations.  The only issue is that the interactions are no
 longer dyadic.

 If you think about it, having two different kinds of interactions is like
 adjoining interaction matrices for the two different kinds of interaction.
  Suppose that you have user x views in matrix A and you have user x
 purchases in matrix B.  The complete interaction matrix of user x (views +
 purchases) is [A | B].

 When you compute cooccurrence in this matrix, you get

                       [ A' ]           [ A'A  A'B ]
    [A | B]' [A | B] = [    ] [A | B] = [          ]
                       [ B' ]           [ B'A  B'B ]

 This matrix is (view + purchase) x (view + purchase).  But we don't care
 about predicting views so we only really need a matrix that is purchase x
 (view + purchase).  This is just the bottom part of the matrix above, or
 [ B'A | B'B ].  When you produce purchase recommendations r_p by multiplying by a
 mixed view and purchase history vector h which has a view part h_v and a
 purchase part h_p, you get

   r_p = [ B' A  B' B ] h = B'A h_v + B'B h_p

 That is a prediction of purchases based on past views and past purchase.

 Note that this general form applies for both decomposition techniques such
 as SVD, ALS and LLL as well as for sparsification techniques such as the
 LLR sparsification.  All that changes is the mechanics of how you do the
 multiplications.  Weighting of components works the same as well.

 What is very different here is that we have a component of cross
 recommendation.  That is the B'A in the formula above.  This is very
 different from a normal recommendation and cannot be computed with the
 simple self-join that we normally have in Mahout cooccurrence computation
 and also very different from the decompositions that we normally do.  It
 isn't hard to adapt the Mahout computations, however.

 When implementing a recommendation using a search engine as the base, these
 same techniques also work extremely well in my experience.  What happens is
 that for each item that you would like to recommend, you would have one
 field that has components of B'A and one field that has components of B'B.
  It is handy to simply use the binary values of the sparsified versions of
 these and let the search engine handle the weighting of different
 components at query time.  Having these components separated into different
 fields in the search index seems to help quite a lot, which makes a fair
 bit of sense.

 On Sun, Feb 10, 2013 at 9:55 AM, Sean Owen sro...@gmail.com wrote:
 
  I think you'd have to hack the code to not exclude previously-seen items,
  or at least, not of the type you wish to consider. Yes you would also
 have
  to hack it to add rather than replace existing values. Or for test
  purposes, just do the adding yourself before inputting the data.
 
  My hunch is that it will hurt non-trivially to treat different
 interaction
  types as different items. You probably want to predict that someone who
  viewed a product over and over is likely to buy it, but this would only
  weakly tend to occur if the bought-item is not the same thing as the
  viewed-item. You'd learn they go together but not as strongly as ought to
  be obvious from the fact that they're the same. Still, interesting
 thought.
 
  There ought to be some 'signal' in this data, just a question of how much
  vs noise. A purchase means much more than a page view of course; it's not
  as subject to noise. Finding a means to use that info is probably going
 to
  help.
 
 
 
 
  On Sat, Feb 9, 2013 at 7:50 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
   I'd like to experiment with using several types of implicit preference
   values with recommenders. I have purchases as an implicit pref of high
   strength. I'd like to see if add-to-cart, view-product-details,
   impressions-seen, etc. can increase offline precision in holdout tests.
   These less than obvious implicit prefs will get a much lower value than
   purchase and i'll experiment with different mixes. The problem is that
 some
   of these prefs will indicate that the user, for whom I'm getting recs,
 has
   expressed a preference.
  
   Using these implicit prefs seems reasonable in finding similarity of
 taste
   between users but presents several problems. 1) how to encode the
 prefs,
   each impression-seen will increase the strength of preference of a user
 for
   an item but the recommender framework replaces the preference value for
   items preferred more than once, doesn't

Re: Rating scale

2013-02-04 Thread Sean Owen
You don't have to fix a scale. But your data needs to be consistent.
It wouldn't work to have users rate on a 1-5 scale one day, and 1-100
tomorrow (unless you go back and normalize the old data to 1-100).
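
Normalizing old data from one scale to another is a simple linear map. A
minimal sketch, with a hypothetical class name and assuming both scales are
closed numeric ranges:

```java
// Hypothetical sketch: linearly map a rating from one closed range to another,
// for example from a 1-5 scale onto a 1-100 scale.
class RescaleSketch {

  static double rescale(double value, double oldMin, double oldMax,
                        double newMin, double newMax) {
    return newMin + (value - oldMin) * (newMax - newMin) / (oldMax - oldMin);
  }

  public static void main(String[] args) {
    System.out.println(rescale(1, 1, 5, 1, 100)); // lowest old rating
    System.out.println(rescale(3, 1, 5, 1, 100)); // middle old rating
    System.out.println(rescale(5, 1, 5, 1, 100)); // highest old rating
  }
}
```

The endpoints map to the endpoints of the new scale, and a 3 on the old 1-5
scale lands at the midpoint of the new one.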

On Mon, Feb 4, 2013 at 3:56 PM, Zia mel ziad.kame...@gmail.com wrote:
 Hi, is it necessary to have a fixed rating scale while running
 recommendations, or can it be dynamic based on the users' data?

 Many Thanks


Re: Failed to execute goal Surefire plugin -- any ideas?

2013-02-04 Thread Sean Owen
You can pass -DskipTests to Maven (for example, mvn -DskipTests install) to
skip the tests, since that's what it is complaining about. There aren't any
current failures in trunk, so it could be something specific to your setup,
or a flaky test. It may still be something to fix.

On Mon, Feb 4, 2013 at 3:37 PM, jellyman colm_r...@hotmail.com wrote:
 Hi everyone,

 Can you help me please? I'm new to Mahout and am trying to get it running
 on my local Windows box in the Eclipse IDE, but I'm stuck.
 Here is what I have done so far:

 1. Pulled down the latest source from:
 http://svn.apache.org/repos/asf/mahout/trunk
 2. Followed the instructions here:
 https://cwiki.apache.org/MAHOUT/buildingmahout.html
 3. Ran mvn compile from inside the core directory -- the result is good
 4. Ran mvn install from inside the core directory -- I get an error message
 like:
 BUILD FAILED. Failed to execute goal
 org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test)
 on project mahout-core. There are test failures.
 5. I then run: mvn -X install for more information. Error message is:
 org.apache.maven.lifecycle.LifecycleExecutionException: failed to execute
 goal org.apache.maven.plugins:maven-surefire-plugin:2.13:test

 I'm running Maven v3.0.4 and Eclipse 4.2. I have C:\Program
 Files\Java\jdk1.7.0_06\bin in the environment path etc...

 Just wondering can anyone help me? Any ideas/suggestions that you would like
 to share?
 Thank a mill in advance,
 jelly.





Re: Threshold-based neighborhood and getReach

2013-02-04 Thread Sean Owen
You are asking for a smaller and smaller neighborhood around a user.
At some point the neighborhood includes no users, for some people --
or, the neighborhood includes no new items. Nothing can be
recommended, and so recall drops. Precision and recall tend to go in
opposite directions for similar reasons.
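
The effect is easy to reproduce with a toy similarity matrix. In the sketch
below, reach is approximated as the fraction of users who have at least one
neighbor at or above the threshold; that is a simplification of what the
evaluator reports, and the class name and the numbers are made up:

```java
// Hypothetical sketch: as the similarity threshold rises, fewer users have any
// neighbors at all, so fewer users can receive recommendations.
class ReachSketch {

  // Fraction of users with at least one other user at or above the threshold.
  static double reach(double[][] sim, double threshold) {
    int covered = 0;
    for (int u = 0; u < sim.length; u++) {
      boolean hasNeighbor = false;
      for (int v = 0; v < sim.length; v++) {
        if (v != u && sim[u][v] >= threshold) {
          hasNeighbor = true;
          break;
        }
      }
      if (hasNeighbor) {
        covered++;
      }
    }
    return (double) covered / sim.length;
  }

  public static void main(String[] args) {
    double[][] sim = {
        {1.0, 0.9, 0.5, 0.1},
        {0.9, 1.0, 0.4, 0.2},
        {0.5, 0.4, 1.0, 0.3},
        {0.1, 0.2, 0.3, 1.0}};
    for (double t : new double[] {0.3, 0.5, 0.9, 0.95}) {
      System.out.println("threshold " + t + " -> reach " + reach(sim, t));
    }
  }
}
```

A fixed-size neighborhood always returns the nearest n users regardless of how
weak the similarity is, which is why its reach stays high.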

On Mon, Feb 4, 2013 at 3:54 PM, Zia mel ziad.kame...@gmail.com wrote:
 Hi, when selecting a threshold-based neighborhood, precision increases as
 the threshold increases, which makes sense. However, getReach shows that
 recommendations are provided for at most 0.2 of users, decreasing to
 0.0002; is that normal? The recall also drops. When using a
 fixed-size neighborhood, getReach gives much higher results.

 //=== Code used ===

 // Body of RecommenderBuilder.buildRecommender(DataModel model):
 UserNeighborhood neighborhood =
     new ThresholdUserNeighborhood(thresholdValue, similarity, model);
 return new GenericUserBasedRecommender(model, neighborhood, similarity);

 // Evaluation: precision and recall at 10, over 100% of the users.
 IRStatistics stats = evaluator.evaluate(recommenderBuilder, null,
     model, null, 10, GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
     1.0);

 stats.getReach();
 //===

 Thanks


Re: Server sizing Hadoop + Mahout

2013-02-02 Thread Sean Owen
The problem with this POV is that it assumes it's obvious what the
right outcome is. With a transaction test or a disk write test or big
sort, it's obvious and you can make a benchmark. With ML, it's not
even close.

For example, I can make you a recommender that is literally as fast as
you like by picking any random set of items. Classifiers can likewise
do so by randomly picking a class. Specifying even a desired answer
isn't useful, since then you are just selecting a process that picks a
particular answer on a particular data set.

I don't think that works, since the classic idea of benchmark is not
well-defined here, but you're welcome to go create and run whatever
tests you like.

On Sat, Feb 2, 2013 at 3:19 PM, jordi jord...@gmail.com wrote:
 Hi Sean! First of all, thanks for your reply!
 I do agree that it's very complicated to size an environment, since
 there are many variables that should be considered. You have mentioned some
 of them: the algorithm, the distribution of data, the amount of data, type of
 hardware, etc.
 But I don't agree that it's impossible to give a baseline.
 Maybe it would be a good idea for the Mahout+Hadoop community to take a look at
 these guys (Standard Performance Evaluation Corporation, http://www.spec.org/).
 They run the same benchmark on different types of architectures, empirically
 establishing a baseline that can be used as a good starting point for capacity
 planning.
 They have a lot of benchmarks depending on CPU, Java Client Server, etc.
 Obviously, that's only a starting point: before your software goes live in
 production, it's desirable to benchmark your software again by running a
 load test, adapting your infrastructure to the performance results.




Re: (near) real time recommender/predictor

2013-01-31 Thread Sean Owen
It's a good question. I think you can achieve a partial solution in Mahout.

Real-time suggests that you won't be able to make use of
Hadoop-based implementations, since they are by nature big batch
processes.

All of the implementations accept the same input -- user,item,value.
That's OK; you can probably just reduce all of your user-thing
interactions to tuples like this. Any reasonable mapping should be OK.
Tags can be items too.
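
One possible mapping, as a sketch only: treat every tag, like, or content item
as an "item" identified by a string key, assign each distinct key a numeric ID,
and emit user,item,value lines. The key format ("tag:...", "like:...") and the
class name are hypothetical, and a real system would need to persist the ID
mapping:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: reduce mixed user-thing interactions to
// user,item,value tuples by giving each distinct thing a numeric item ID.
class TupleSketch {

  private final Map<String, Long> itemIds = new HashMap<>();

  // Returns a stable numeric ID for a string key, assigning one on first use.
  long itemId(String key) {
    Long id = itemIds.get(key);
    if (id == null) {
      id = (long) itemIds.size();
      itemIds.put(key, id);
    }
    return id;
  }

  String toLine(long userId, String itemKey, double value) {
    return userId + "," + itemId(itemKey) + "," + value;
  }

  public static void main(String[] args) {
    TupleSketch t = new TupleSketch();
    System.out.println(t.toLine(7L, "tag:hadoop", 1.0));
    System.out.println(t.toLine(7L, "like:post42", 1.0));
    System.out.println(t.toLine(8L, "tag:hadoop", 1.0));
  }
}
```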

I don't think any of the implementations take advantage of time.

The non-Hadoop implementations are not-quite-realtime. The model is
loading data into memory from backing store, computing and maybe
caching partial results, and serving results as quickly as possible.
New input can't be immediately used, no. It comes into play when the
model is reloaded only.

I think you have very sparse input -- a high number of users and
items (tags, likes), but relatively few interactions. Matrix
factorization / latent factor models work well here. The ones in
Mahout that are not Hadoop-based may work for you, like
SVDRecommender. It's worth a try.

(Advertisement: the new recommender product I am commercializing,
Myrrix, does the real-time and matrix factorization thing just fine.
It's easy enough to start with that I would encourage you to
experiment with the open source system also:
http://myrrix.com/download/)



On Thu, Jan 31, 2013 at 7:02 PM, Frederik Kraus
frederik.kr...@gmail.com wrote:
 Hi Guys,

 I'm rather new to the whole Mahout ecosystem, so please excuse if the 
 questions I have are rather dumb ;)

 Our problem basically boils down to this: we want to match users with
 either the content they are interested in and/or the content they could
 contribute to. To do this matching we have several dimensions both of users
 and content items (things like: contribution history, tags, browsing history,
 diggs, likes, ….).

 As interest of users can change over time some kind of CF algorithm including 
 temporal effects would obviously be best, but for the time being those 
 effects could probably be neglected.

 Now my questions:

 - what algorithm from the mahout toolkit would best fit our case?
 - How can we get this near realtime, i.e. not having to recalculate the 
 entire model when user dimensions change and/or new content is being added to 
 the system (or updated)
 - how would we model the user and item vectors (especially things like 
 tags)?
 - any hints on where to start? ;)

 Thanks a lot!

 Fred.



Re: Using setPreference() to update recommendations in DataModel in Memory

2013-01-30 Thread Sean Owen
It throws an exception except in a few implementations, mostly the
ones based on a database. It isn't something that's really used -- you
instead update the backing store indirectly. Yes, the model is batch
re-reads of data once in a while. Updates are not in real time in this
model.
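
The batch-reload pattern can be sketched in plain Java as below. This is a
single-threaded illustration, not Mahout code: the class is hypothetical and an
in-memory map stands in for the backing store (a file or database). New
preferences go to the store, and reads only see them after the next reload:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a read-only in-memory model that is periodically
// rebuilt from a backing store.
class ReloadingModelSketch {

  // Stands in for the backing store (file, database, ...).
  private final Map<String, Double> backingStore = new LinkedHashMap<>();

  // The snapshot that reads are served from; replaced wholesale on reload.
  private Map<String, Double> snapshot = Collections.emptyMap();

  // New data is written to the store only, not to the live snapshot.
  void record(long user, long item, double value) {
    backingStore.put(user + "," + item, value);
  }

  // Batch re-read: build a fresh snapshot from the store.
  void reload() {
    snapshot = new HashMap<>(backingStore);
  }

  // Returns null if the preference is not in the current snapshot.
  Double get(long user, long item) {
    return snapshot.get(user + "," + item);
  }

  public static void main(String[] args) {
    ReloadingModelSketch model = new ReloadingModelSketch();
    model.record(1, 10, 4.0);
    System.out.println("before reload: " + model.get(1, 10));
    model.reload();
    System.out.println("after reload: " + model.get(1, 10));
  }
}
```

In Mahout the corresponding step is a refresh of the recommender and its data
model through the Refreshable interface; check the javadoc of your DataModel
implementation for what a refresh actually reloads.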

On Wed, Jan 30, 2013 at 8:21 AM, Henning Kuich hku...@gmail.com wrote:
 So what does the method do instead?

 And basically the conclusion is: To update your recommender with new
 preference values, you need to reload the data model and everything that
 follows?

 Thanks,

 Henning


 On Tue, Jan 29, 2013 at 7:30 PM, Sean Owen sro...@gmail.com wrote:

 It doesn't really work this way. The model is predicated on loading the
 data from backing store periodically. In the short term it is read only.
 This method is misleading in a sense.
 On Jan 29, 2013 3:31 PM, Henning Kuich hku...@gmail.com wrote:

  Dear All,
 
  I would like to be able to update recommendations in the DataModel, and I
  understand that this can be done with the setPreference() method. So this
  can be used to create a new user-item-preference entry into the data
 model,
  or update an already existing one.
 
  My question is the following:
 
  I run my recommender.recommend, and get a recommendation for user1.
  As it happens, user1 now rates 5 other items, and I use the
 setPreference()
  method to place those 5 new ratings into my DataModel.
  If I now re-run the recommender.recommend, does the recommender
  automatically incorporate the 5 new ratings that have just been made, or
 do
  I need to update the recommender in between? And if so, how do I do this?
 
  I hope this question makes sense, and many thanks in advance.
 
  Henning
  --
 
  P. Henning J. L. Kuich
  email: hku...@gmail.com
  twitter: @hkuich http://twitter.com/hkuich
  facebook: henning.kuich
  G+: hkuich
 
  Confidentiality Notice: This e-mail message, including any
  attachments, is for the sole use of the intended recipient(s) and may
  contain confidential and privileged information.  Any unauthorized
  review, use, disclosure or distribution is prohibited.  If you are not
  the intended recipient, please contact the sender by reply e-mail and
  destroy all copies of the original message.
 






Re: Using setPreference() to update recommendations in DataModel in Memory

2013-01-29 Thread Sean Owen
It doesn't really work this way. The model is predicated on loading the
data from backing store periodically. In the short term it is read only.
This method is misleading in a sense.
On Jan 29, 2013 3:31 PM, Henning Kuich hku...@gmail.com wrote:

 Dear All,

 I would like to be able to update recommendations in the DataModel, and I
 understand that this can be done with the setPreference() method. So this
 can be used to create a new user-item-preference entry into the data model,
 or update an already existing one.

 My question is the following:

 I run my recommender.recommend, and get a recommendation for user1.
 As it happens, user1 now rates 5 other items, and I use the setPreference()
 method to place those 5 new ratings into my DataModel.
 If I now re-run the recommender.recommend, does the recommender
 automatically incorporate the 5 new ratings that have just been made, or do
 I need to update the recommender in between? And if so, how do I do this?

 I hope this question makes sense, and many thanks in advance.

 Henning



Re: Question about server/computer architecture...

2013-01-29 Thread Sean Owen
This is quite small and certainly doesn't require Hadoop. That's the good
news. Any reasonable server will do well for you. You won't be memory
bound. More cores will let you serve more QPS.

Your pain points will be elsewhere like tuning for best quality and real
time updates. See my separate email for a possible different solution.

Sean
 On Jan 29, 2013 5:21 PM, Henning Kuich hku...@gmail.com wrote:

 Thanks for the quick answer Ted.

 I want to build a User-based recommender for an e-commerce start-up. The 1M
 ratings dataset from grouplens is about what we are expecting in the nearer
 future. the data will be preferences either from 1-5 or 1-3...

 I hope this makes my question a bit more complete.. sorry about that!




 On Tue, Jan 29, 2013 at 5:47 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  Depends on what you want to do with Mahout.
 
  What is that?
 
  How much data?
 
  What kind of data?
 
  On Tue, Jan 29, 2013 at 7:14 AM, Henning Kuich hku...@gmail.com wrote:
 
   Dear All,
  
   is there a preferred computer architecture for Mahout? for example, do
   multicore processors help? is there anything else in terms of server
   hardware that one should know about, or anything that might be
  particularly
   favorable to implement Mahout?
  
   Thanks in advance,
  
   Henning
  
  
 






Re: QRDecomposition performance

2013-01-28 Thread Sean Owen
Is it worth simply using the Commons Math implementation?

On Mon, Jan 28, 2013 at 8:04 AM, Sebastian Schelter s...@apache.org wrote:
 This is great news and will automatically boost the performance of all
 our ALS-based recommenders which are all using QRDecomposition internally.

 On 28.01.2013 04:02, Ted Dunning wrote:
 Did that.

 You are right.  The QRD in mahout is abysmally slow.  I wrote a new version
 on the airplane that seems to be about 10x faster and still just about as
 accurate (and vastly simpler).  I will put up some tests and a patch in the
 next week or so.



Re: MatrixMultiplicationJob runs with 1 mapper only ?

2013-01-28 Thread Sean Owen
These are Hadoop settings, not Mahout settings. You may need to set them
in your cluster config. Even then, Hadoop treats them only as suggestions.

The question still remains why you think you need several mappers. Why?

On Mon, Jan 28, 2013 at 1:28 PM, Stuti Awasthi stutiawas...@hcl.com wrote:
 Hi,
 I would like to again consolidate all the steps which I performed.

 Issue : MatrixMultiplication example is getting executed with only 1 map task.

 Steps :
 1. I created a file with size 104MB which is divided into 11 blocks with size 
 10MB each. The file contains 200x10 size of matrix.
 2. I exported $MAHOUT_OPTS to the following
   $   echo $MAHOUT_OPTS
   -Dmapred.min.split.size=10485760 -Dmapred.map.tasks=7
 3.  Tried to execute matrix multiplication example using commandline :
 mahout matrixmult --inputPathA /test/points/matrixA --numRowsA 200 --numColsA 
 10 --inputPathB /test/points/matrixA --numRowsB 200 --numColsB 10 
 --tempDir /test/temp

 When I check the Jobtracker UI , its shows me following for the running job :
 Running Map Tasks : 1
 Occupied Map Slots: 1

 How can I distribute the map task on different mappers for 
 MatrixMultiplication Job dynamically.
 Is it even possible that MatrixMultiplication can run distributedly on 
 multiple mappers as it internally uses CompositeInputFormat .

 Please Suggest

 Thanks
 Stuti


 -Original Message-
 From: Sean Owen [mailto:sro...@gmail.com]
 Sent: Wednesday, January 23, 2013 6:42 PM
 To: Mahout User List
 Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?

 Mappers are usually extremely fast since they start themselves on top of the 
 data and their job is usually just parsing and emitting key value pairs. 
 Hadoop's choices are usually fine.

 If not it is usually because the mapper is emitting far more data than it 
 ingests. Are you computing some kind of Cartesian product of input?

 That's slow no matter what. More mappers may increase parallelism but it's
 still a lot of I/O. Avoid it if you can by sampling or pruning unimportant
 values. Otherwise, try to implement a Combiner.
 On Jan 23, 2013 12:04 PM, Jonas Grote jfgr...@gmail.com wrote:

 I'd play with the mapred.map.tasks option. Setting it to something
 bigger than 1 gave me performance improvements for various hadoop jobs
 on my cluster.


 2013/1/16 Ashish paliwalash...@gmail.com

  I am afraid I don't know the answer. Need to experiment a bit more.
  I
 have
  not used CompositeInputFormat so cannot comment.
 
  Probably, someone else on the ML(Mailing List) would be able to
  guide
 here.
 
 
  On Wed, Jan 16, 2013 at 6:01 PM, Stuti Awasthi
  stutiawas...@hcl.com
  wrote:
 
   Thanks Ashish,
  
   So according to the link if one is using CompositeInputFormat then
   it
  will
   take entire file as Input to a mapper without considering
   InputSplits/blocksize.
   If I am understanding it correctly then it is asking to break
   [Original Input File]-[flie1,file2,] .
  
   So If my file is  [/test/MatrixA] -- [/test/smallfiles/file1,
   [/test/smallfiles/file2, [/test/smallfiles/file3...  ]
  
   Now will the input path in MatrixMultiplicationJob will be
   directory
 path
   : /test/smallfiles  ??
  
   Will breaking file in such manner will cause problem in
   algorithmic execution of MR job. Im not sure if output will be correct .
  
   -Original Message-
   From: Ashish [mailto:paliwalash...@gmail.com]
   Sent: Wednesday, January 16, 2013 5:44 PM
   To: user@mahout.apache.org
   Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
  
   MatrixMultiplicationJob internally sets InputFormat as
  CompositeInputFormat
  
   JobConf conf = new JobConf(initialConf,
   MatrixMultiplicationJob.class);
   conf.setInputFormat(CompositeInputFormat.class);
  
   and AFAIK, CompositeInputFormat ignores the splits. See this
  
 
 http://stackoverflow.com/questions/8654200/hadoop-file-splits-composit
 einputformat-inner-join
  
   Unfortunately, I don't know any other alternative as of now.
  
  
   On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi
   stutiawas...@hcl.com
   wrote:
  
The issue is that currently my matrix is of dimension
(100x100k), Later it can be (1MX10M) or big.
   
Even now if my job is running with the single mapper for
(100x100k) and it is not able to complete the Job. As I
mentioned map task just proceed to 0.99% and started spilling
the map output. Hence I wanted to tune my job so that Mahout is
able to complete the job and I can utilize my cluster resources.
   
As MatrixMultiplicationJob is a MR, so it should be able to
handle parallel map tasks. I am not sure if there is any
algorithmic constraints due to which it runs only with single mapper ?
I have taken the reference of thread so that I can set
Configuration myself rather by getting it with getConf() but did
not got any
 success
   
   
 http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers

Re: Precision question

2013-01-28 Thread Sean Owen
Impossible to say. More data means a more reliable estimate all else equal.
That's about it.
On Jan 28, 2013 5:17 PM, Zia mel ziad.kame...@gmail.com wrote:

 Any thoughts of this ?

 On Sat, Jan 26, 2013 at 10:55 AM, Zia mel ziad.kame...@gmail.com wrote:
  OK, in the precision test when we reduce the size of sample to 0.1 or 0.05,
  would the results be related when we check with all the data ? For
  example, if we have data1 and data2 and test them using 0.1 and found
  that data 1 is producing better results , would the same thing stand
  when we check with all data?
 
   IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 10,
       GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 0.05);
 
  Many thanks
 
  On Fri, Jan 25, 2013 at 12:26 PM, Sean Owen sro...@gmail.com wrote:
  No, it takes a fixed "at" value. You can modify it to do whatever you want.
  You will see it doesn't bother with users with little data, like
  < 2*at data points.
 
  On Fri, Jan 25, 2013 at 6:23 PM, Zia mel ziad.kame...@gmail.com
 wrote:
  Interesting. Using
   IRStatistics stats = evaluator.evaluate(recommenderBuilder,
  null, model, null, 5,
 
  GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
  1.0);
 
  Can it be adjusted to each user ? In other words, is there a way to
  select a threshold instead of using 5 ?  mm Something like selecting y
  set , each set have a min of z user ?
 
 
 
  On Fri, Jan 25, 2013 at 12:09 PM, Sean Owen sro...@gmail.com wrote:
  The way I do it is to set x different for each user, to the number of
  items in the user's test set -- you ask for x recommendations.
  This makes precision == recall, note. It dodges this problem though.
 
  Otherwise, if you fix x, the condition you need is stronger, really:
  each user needs >= x *test set* items in addition to training set
  items to make this test fair.
 
 
  On Fri, Jan 25, 2013 at 4:10 PM, Zia mel ziad.kame...@gmail.com
 wrote:
  When selecting precision at x let's say 5 , should I check that all
  users have 5 items or more? For example, if a user have 3 items and
  they were removed as top items,  then how can the recommender suggest
  items since there are no items to learn from?
  Thanks !



Re: Precision question

2013-01-28 Thread Sean Owen
Yes, several independent samples of all the data will, together, give
you a better estimate of the real metric value than any individual
one.

On Mon, Jan 28, 2013 at 5:41 PM, Zia mel ziad.kame...@gmail.com wrote:
 What about running several tests on small data , can't that give an
 indicator of how big data will perform ?
 Thanks



Re: Precision question

2013-01-25 Thread Sean Owen
The way I do it is to set x different for each user, to the number of
items in the user's test set -- you ask for x recommendations.
This makes precision == recall, note. It dodges this problem though.

Otherwise, if you fix x, the condition you need is stronger, really:
each user needs >= x *test set* items in addition to training set
items to make this test fair.
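
The point about precision == recall can be sketched in plain Java. This is
only an illustration, not the Mahout evaluator: the class and method names
(PrecisionAtTestSetSize, hits, precision, recall) and the item IDs are made
up for the example. When x equals the size of the user's test set, both
metrics divide the same hit count by the same number.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PrecisionAtTestSetSize {

    /** Counts how many of the top-x recommended items are in the held-out test set. */
    public static int hits(List<Long> recommended, Set<Long> testItems, int x) {
        int count = 0;
        int limit = Math.min(x, recommended.size());
        for (int i = 0; i < limit; i++) {
            if (testItems.contains(recommended.get(i))) {
                count++;
            }
        }
        return count;
    }

    /** Hits divided by the number of recommendations requested (x must be > 0). */
    public static double precision(List<Long> recommended, Set<Long> testItems, int x) {
        return (double) hits(recommended, testItems, x) / x;
    }

    /** Hits divided by the number of test items (test set must be non-empty). */
    public static double recall(List<Long> recommended, Set<Long> testItems, int x) {
        return (double) hits(recommended, testItems, x) / testItems.size();
    }

    public static void main(String[] args) {
        // Hypothetical data: 3 held-out items and a ranked list of 5 recommendations.
        Set<Long> testItems = new HashSet<>(Arrays.asList(10L, 11L, 12L));
        List<Long> recommended = Arrays.asList(11L, 42L, 10L, 7L, 12L);

        int perUserX = testItems.size(); // x chosen per user
        System.out.println("x = test set size: precision="
            + precision(recommended, testItems, perUserX)
            + " recall=" + recall(recommended, testItems, perUserX));

        int fixedX = 5; // fixed x larger than the test set
        System.out.println("x = 5: precision="
            + precision(recommended, testItems, fixedX)
            + " recall=" + recall(recommended, testItems, fixedX));
    }
}
```

With a fixed x larger than the test set, precision is capped below 1 even
for a perfect recommender, which is why the fixed-x case needs every user
to have at least x test items.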


On Fri, Jan 25, 2013 at 4:10 PM, Zia mel ziad.kame...@gmail.com wrote:
 When selecting precision at x let's say 5 , should I check that all
 users have 5 items or more? For example, if a user have 3 items and
 they were removed as top items,  then how can the recommender suggest
 items since there are no items to learn from?
 Thanks !


Re: Precision question

2013-01-25 Thread Sean Owen
No, it takes a fixed "at" value. You can modify it to do whatever you want.
You will see it doesn't bother with users with little data, like
< 2*at data points.

On Fri, Jan 25, 2013 at 6:23 PM, Zia mel ziad.kame...@gmail.com wrote:
 Interesting. Using
  IRStatistics stats = evaluator.evaluate(recommenderBuilder, null, model, null, 5,
      GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);

 Can it be adjusted to each user ? In other words, is there a way to
 select a threshold instead of using 5 ?  mm Something like selecting y
 set , each set have a min of z user ?



 On Fri, Jan 25, 2013 at 12:09 PM, Sean Owen sro...@gmail.com wrote:
 The way I do it is to set x different for each user, to the number of
 items in the user's test set -- you ask for x recommendations.
 This makes precision == recall, note. It dodges this problem though.

 Otherwise, if you fix x, the condition you need is stronger, really:
 each user needs >= x *test set* items in addition to training set
 items to make this test fair.


 On Fri, Jan 25, 2013 at 4:10 PM, Zia mel ziad.kame...@gmail.com wrote:
 When selecting precision at x let's say 5 , should I check that all
 users have 5 items or more? For example, if a user have 3 items and
 they were removed as top items,  then how can the recommender suggest
 items since there are no items to learn from?
 Thanks !


Re: EMR setup for seq2sparse

2013-01-24 Thread Sean Owen
In my experience, using many small instances hurts since there is more
data transferred (less data is local to any given computation) and the
instances have lower I/O performance.

On the high end, super-big instances become counter-productive because
they are not as cheap on the spot market -- and you should be using
the spot market for everything but your master for sure.

m1.xlarge is a good default. EMR's default config says that each can
handle 3 reducers. So set your parallelism to at least 3 times the
number of workers you run, e.g. at least 18 reducers for 6 workers.


If you can get away with computing on one machine, without Hadoop, do
so. Distributing via Hadoop tends to cost 5x as much computing
resource or more. And, you can rent amazingly huge machines in the
cloud.

There's still a point past which you can't fit on one machine, or it's
not economical -- the huge EC2 instances are expensive and not on the
spot market. But it may be big enough for a lot of problems.




On Thu, Jan 24, 2013 at 2:01 PM, Matti Kokkola matti.kokk...@iki.fi wrote:

 Hi,

 I'm using Mahout to vectorize and cluster data consisting of short
 texts. So far I have done vectorizing on a single multi-core machine
 and been quite happy with the results. However, now we are doing a
 lot of small adjustments to increase the quality of results and thus
 would like to tighten the feedback loop, ie. get vectors more quickly.

 Does anyone have good reference setup for Amazon EMR configuration for such
 a task? I tried with 6 m1.small instances, but terminated the job after 24
 hrs, because I thought there is something wrong with the setup. I pretty
 much followed the guides in Mahout wiki for the basic setup.

 In the test case, my seq file size was 50MB and previous seq2sparse runs
 have resulted around 400k vectors from that data.

 Rest of the configuration was as follows:
 - mahout v0.7
 - 6 instances, instance type default (m1.small)
 - numReducers 6
 - maxNGramsize 2

 Does this sound right (24 hrs and more to come...) for the given data size?
 How much improvement should I expect if I use m1.large instances instead?
 Any other recommendations?-)

 br, Matti


Re: Boolean preferences and evaluation

2013-01-24 Thread Sean Owen
Not quite, the evaluation considers every item in the test set to be
good, but you would and should fix the test set size across
evaluations for this reason. You are right that there is a big
assumption there -- that everything in the test set is good. You have
to believe your test split process supports that assumption.

On Thu, Jan 24, 2013 at 6:37 PM, Zia mel ziad.kame...@gmail.com wrote:
 In general boolean recommender will get higher precision than using a
 recommender with preferences,  since the boolean considers every item
 as good which is not true! So is there a way to make a realistic
 measure from boolean ? For example, does dividing the precision by 2
 make sense since we get high precision using boolean?
 Thanks



 On Wed, Jan 23, 2013 at 3:49 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 LLR should not be used to indicate proximity, but rather simply as a value
 to compare to a threshold.

 On Thu, Jan 24, 2013 at 1:45 AM, Zia mel ziad.kame...@gmail.com wrote:

 OK .  The TanimotoCoefficientSimilarity and LogLikelihoodSimilarity
 used in MIA page 54 and 55 provide a score, so it seems they were not
 using a Boolean recommender , something like code 1 maybe? Thanks

 On Tue, Jan 22, 2013 at 10:42 AM, Sean Owen sro...@gmail.com wrote:
  Yes any metric that concerns estimated value vs real value can't be
  used since all values are 1. Yes, when you use the non-boolean version
  with boolean data you always get 1. When you use the boolean version
  with boolean data you will get nonsense since the output of this
  recommender is not an estimated rating at all.
 
  On Tue, Jan 22, 2013 at 4:40 PM, Zia mel ziad.kame...@gmail.com wrote:
  I got 0 when I used GenericUserBasedRecommender in code 2 but when
  using GenericBooleanPrefUserBasedRecommender score was not 0 . I
  repeat the test with different data and again I got some results.
  Moreover , when I use
   DataModel model = new FileDataModel(new File(ua.base));
  in code 2, the MAE score was higher.
 
  When you say RMSE can't be used with boolean data, I assume MAE also
  can't be used?
 
  Thanks !
 
  On Tue, Jan 22, 2013 at 10:08 AM, Sean Owen sro...@gmail.com wrote:
  RMSE can't
  be used with boolean data.



Re: Boolean preferences and evaluation

2013-01-24 Thread Sean Owen
Well, if you are throwing away rating data, you are throwing away
rating data. They are no longer 100% different but 100% the same.
If that's not a good thing to do, don't do it.

It's possible that using ratings gets better precision, and it's
possible that it doesn't. It depends on whether the ratings data are
useful or noise, and whether you use them or not.

On Thu, Jan 24, 2013 at 7:52 PM, Zia mel ziad.kame...@gmail.com wrote:
 There should be something to solve this :) . For example, 2 users
 having the same items could rate them 100% different , but using the
 boolean their items will be recommended to each other.

 Is there a chance that using preferences would get higher precision
 than boolean? if so, when is that case?


Re: Boolean preferences and evaluation

2013-01-24 Thread Sean Owen
Yes, but the similarities are no longer weights, because there is
nothing to weight. They are used to compute a score directly, which is
not a weighted average but a function of the similarities themselves.

While it is true that more distant neighbors have less effect in
general, when the similarities *are* used as weights, it's not true
that a small bad contribution can't hurt. A small bad contribution can
still be bad.
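
To make the "small bad contribution" point concrete, here is a sketch of a
generic similarity-weighted average in plain Java. It is an illustration
under assumptions, not Mahout's actual estimation code: the class
WeightedEstimate, its estimate() method, and the similarity and rating
values are all hypothetical.

```java
public class WeightedEstimate {

    /**
     * Similarity-weighted average of neighbor ratings:
     * sum(sim_i * rating_i) / sum(|sim_i|).
     */
    public static double estimate(double[] similarities, double[] ratings) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            weightedSum += similarities[i] * ratings[i];
            totalWeight += Math.abs(similarities[i]);
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Two close neighbors who both rated the item highly.
        double nearOnly = estimate(new double[] {0.9, 0.8}, new double[] {5.0, 4.0});

        // Same neighbors plus one distant neighbor (similarity 0.1) with a poor rating.
        double withFar = estimate(new double[] {0.9, 0.8, 0.1}, new double[] {5.0, 4.0, 1.0});

        // The distant neighbor's weight is small, but the estimate still moves down.
        System.out.println("near neighbors only: " + nearOnly);
        System.out.println("with distant neighbor: " + withFar);
    }
}
```

The low weight attenuates the distant neighbor's rating but does not
remove it, so each additional weak neighbor can still shift the estimate.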

On Thu, Jan 24, 2013 at 7:58 PM, Koobas koo...@gmail.com wrote:
 A naive question:
 Boolean recommender means that we are ignoring ratings,
 but aren't recommendations still weighted by user-user similarities or
 item-item similarities?
 Which would also mean that increasing the neighborhood will not deteriorate
 the results,
 because bad contributions from farther neighbors are attenuated by their
 lower similarities.


Re: MatrixMultiplicationJob runs with 1 mapper only ?

2013-01-23 Thread Sean Owen
Mappers are usually extremely fast since they start themselves on top of
the data and their job is usually just parsing and emitting key value
pairs. Hadoop's choices are usually fine.

If not it is usually because the mapper is emitting far more data than it
ingests. Are you computing some kind of Cartesian product of input?

That's slow no matter what. More mappers may increase parallelism but it's
still a lot of I/O. Avoid it if you can by sampling or pruning unimportant
values. Otherwise, try to implement a Combiner.
On Jan 23, 2013 12:04 PM, Jonas Grote jfgr...@gmail.com wrote:

 I'd play with the mapred.map.tasks option. Setting it to something bigger
 than 1 gave me performance improvements for various hadoop jobs on my
 cluster.


 2013/1/16 Ashish paliwalash...@gmail.com

  I am afraid I don't know the answer. Need to experiment a bit more. I
 have
  not used CompositeInputFormat so cannot comment.
 
  Probably, someone else on the ML(Mailing List) would be able to guide
 here.
 
 
  On Wed, Jan 16, 2013 at 6:01 PM, Stuti Awasthi stutiawas...@hcl.com
  wrote:
 
   Thanks Ashish,
  
   So according to the link if one is using CompositeInputFormat then it
  will
   take entire file as Input to a mapper without considering
   InputSplits/blocksize.
   If I am understanding it correctly then it is asking to break [Original
   Input File]-[flie1,file2,] .
  
   So If my file is  [/test/MatrixA] -- [/test/smallfiles/file1,
   [/test/smallfiles/file2, [/test/smallfiles/file3...  ]
  
   Now will the input path in MatrixMultiplicationJob will be directory
 path
   : /test/smallfiles  ??
  
   Will breaking file in such manner will cause problem in algorithmic
   execution of MR job. Im not sure if output will be correct .
  
   -Original Message-
   From: Ashish [mailto:paliwalash...@gmail.com]
   Sent: Wednesday, January 16, 2013 5:44 PM
   To: user@mahout.apache.org
   Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
  
   MatrixMultiplicationJob internally sets InputFormat as
  CompositeInputFormat
  
   JobConf conf = new JobConf(initialConf, MatrixMultiplicationJob.class);
   conf.setInputFormat(CompositeInputFormat.class);
  
   and AFAIK, CompositeInputFormat ignores the splits. See this
  
 
 http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join
  
   Unfortunately, I don't know any other alternative as of now.
  
  
   On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi stutiawas...@hcl.com
   wrote:
  
The issue is that currently my matrix is of dimension (100x100k),
Later it can be (1MX10M) or big.
   
Even now if my job is running with the single mapper for (100x100k)
and it is not able to complete the Job. As I mentioned map task just
proceed to 0.99% and started spilling the map output. Hence I wanted
to tune my job so that Mahout is able to complete the job and I can
utilize my cluster resources.
   
As MatrixMultiplicationJob is a MR, so it should be able to handle
parallel map tasks. I am not sure if there is any algorithmic
constraints due to which it runs only with single mapper ?
I have taken the reference of thread so that I can set Configuration
myself rather by getting it with getConf() but did not got any
 success
   
   
 http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reduc
ers-in-DistributedRowMatrix-Jobs-td888980.html
   
Stuti
   
-Original Message-
From: Sean Owen [mailto:sro...@gmail.com]
Sent: Wednesday, January 16, 2013 4:46 PM
To: Mahout User List
Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
   
Why do you need multiple mappers? Is one too slow? Many are not
necessarily faster for small input On Jan 16, 2013 10:46 AM, Stuti
Awasthi stutiawas...@hcl.com wrote:
   
 Hi,
 I tried to call programmatically also but facing same issue : Only
 single MapTask is running and that too spilling the map output
 continuously.
 Hence im not able to generate the output for large matrix
   multiplication.

 Code Snippet :

 DistributedRowMatrix a = new DistributedRowMatrix(
     new Path("/test/points/matrixA"), new Path("/test/temp"),
     Integer.parseInt("100"), Integer.parseInt("10"));
 DistributedRowMatrix b = new DistributedRowMatrix(
     new Path("/test/points/matrixA"), new Path(tempDir),
     Integer.parseInt("100"), Integer.parseInt("10"));
 Configuration conf = new Configuration();
 conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818");
 conf.set("mapred.child.java.opts", "-Xmx2048m");
 conf.set("mapred.max.split.size", "10485760");
 a.setConf(conf);
 b.setConf(conf);
 a.times(b);

 Where Im going wrong. Any idea ?

 Thanks
 Stuti
 -Original Message-
 From: Stuti Awasthi
 Sent: Wednesday, January 16, 2013 2:55 PM
 To: Mahout User List
 Subject: RE: MatrixMultiplicationJob

Re: Finding best NearestNUserNeighborhood size

2013-01-23 Thread Sean Owen
The stochastic nature of the evaluation means your results will vary
randomly from run to run. This looks to my eyeballs like most of the
variation you see. You probably want to average over many runs.

You will probably find that accuracy peaks around some neighborhood size:
adding more useful neighbors helps but at some point the next nearest isn't
so similar and the additional data harms the result more than helps.
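
As a rough sketch of averaging over runs, in plain Java rather than the
Mahout evaluator: the class AverageOverRuns and its methods are
hypothetical helpers, and the two sample values are the RMSE figures for
neighborhood size 3 from the two runs quoted in this thread. Two runs is
far too few in practice; the spread is shown only to indicate how noisy a
single run can be.

```java
public class AverageOverRuns {

    /** Mean of the scores from repeated evaluation runs. */
    public static double mean(double[] scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.length;
    }

    /** Sample standard deviation (n - 1 denominator); needs at least two runs. */
    public static double sampleStdDev(double[] scores) {
        double m = mean(scores);
        double squares = 0.0;
        for (double s : scores) {
            squares += (s - m) * (s - m);
        }
        return Math.sqrt(squares / (scores.length - 1));
    }

    public static void main(String[] args) {
        // RMSE for neighborhood size 3 in RUN 1 and RUN 2 of the quoted message.
        double[] sizeThree = {0.5425283201773704, 0.6784589323396696};
        System.out.printf("n=3: mean=%.4f stddev=%.4f%n",
            mean(sizeThree), sampleStdDev(sizeThree));
    }
}
```

If the difference between two neighborhood sizes is smaller than the
run-to-run spread, it is probably not a real difference.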
On Jan 23, 2013 1:01 PM, Zia mel ziad.kame...@gmail.com wrote:

 Hi
 I used NearestNUserNeighborhood with RMSE in a user recommender that
 use PearsonCorrelationSimilarity , I found that changing the
 neighborhood size has no clear pattern or effect. Sometimes it
 increase others decrease. While using the neighborhood size with
 precision has a better pattern. Any reason? Another point is that the
 RMSE change for every run since it choose different sample , so would
 running the code for 10 or 20 times and taking the average be a good
 idea or there is better thing to do?

 //-- RUN 1
  2,  0.5523623146152608
  3,  0.5425283201773704
  4,  0.669846658662311
  5,  0.5956616542334392
  6,  0.6033911039809353
  7,  0.6135206544496685
  8,  0.5740444208649034
  9,  0.642798288443049
  10,  0.626653651472

 //-- RUN 2
  2,  0.5415411343523825
  3,  0.6784589323396696
  4,  0.6347069968141124
  5,  0.6968820296725008
  6,  0.5953849874479478
  7,  0.6791828191904128
  8,  0.6072462830257853
  9,  0.6461346217476011
  10,  0.6043919119341171

 Thanks !



Re: Finding best NearestNUserNeighborhood size

2013-01-23 Thread Sean Owen
That is good for making a test repeatable because you are picking the same
random sample repeatedly. For evaluation purposes here that's not a good
thing and you do want several actually different samples of the result.
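
The difference can be sketched with java.util.Random alone. This is not
how Mahout's RandomUtils or the evaluator is implemented; SeededSplit and
trainingIndices() are hypothetical names used only to show that a fixed
seed reproduces the same training/test split, while varying the seed
gives independent samples to average over.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SeededSplit {

    /** Returns the indices in 0..n-1 that land in the training set for this seed. */
    public static List<Integer> trainingIndices(int n, double trainingFraction, long seed) {
        Random random = new Random(seed);
        List<Integer> training = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // Each data point is kept for training with probability trainingFraction.
            if (random.nextDouble() < trainingFraction) {
                training.add(i);
            }
        }
        return training;
    }

    public static void main(String[] args) {
        // Same seed: the split is identical on every run (good for repeatable tests).
        System.out.println(trainingIndices(20, 0.7, 42L));
        System.out.println(trainingIndices(20, 0.7, 42L));

        // Different seeds: independent samples, which is what an evaluation
        // averaged over several runs needs.
        for (long seed = 1; seed <= 3; seed++) {
            System.out.println(trainingIndices(20, 0.7, seed));
        }
    }
}
```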
On Jan 23, 2013 1:19 PM, Stevo Slavić ssla...@gmail.com wrote:

 When evaluating recommender before running evaluator put

 RandomUtils.useTestSeed();

 to make splitting of data set consistent; don't use it in production, just
 for evaluation.
 This is all explained more thoroughly in Mahout in Action book.

 Kind regards,
 Stevo Slavic.


 On Wed, Jan 23, 2013 at 2:01 PM, Zia mel ziad.kame...@gmail.com wrote:

  Hi
  I used NearestNUserNeighborhood with RMSE in a user recommender that
  use PearsonCorrelationSimilarity , I found that changing the
  neighborhood size has no clear pattern or effect. Sometimes it
  increase others decrease. While using the neighborhood size with
  precision has a better pattern. Any reason? Another point is that the
  RMSE change for every run since it choose different sample , so would
  running the code for 10 or 20 times and taking the average be a good
  idea or there is better thing to do?
 
  //-- RUN 1
   2,  0.5523623146152608
   3,  0.5425283201773704
   4,  0.669846658662311
   5,  0.5956616542334392
   6,  0.6033911039809353
   7,  0.6135206544496685
   8,  0.5740444208649034
   9,  0.642798288443049
   10,  0.626653651472
 
  //-- RUN 2
   2,  0.5415411343523825
   3,  0.6784589323396696
   4,  0.6347069968141124
   5,  0.6968820296725008
   6,  0.5953849874479478
   7,  0.6791828191904128
   8,  0.6072462830257853
   9,  0.6461346217476011
   10,  0.6043919119341171
 
  Thanks !
 



Re: Boolean preferences and evaluation

2013-01-23 Thread Sean Owen
These can use non-boolean data as the value will just be ignored. The
opposite is what does not work.
On Jan 23, 2013 4:45 PM, Zia mel ziad.kame...@gmail.com wrote:

 OK .  The TanimotoCoefficientSimilarity and LogLikelihoodSimilarity
 used in MIA page 54 and 55 provide a score, so it seems they were not
 using a Boolean recommender , something like code 1 maybe? Thanks

 On Tue, Jan 22, 2013 at 10:42 AM, Sean Owen sro...@gmail.com wrote:
  Yes any metric that concerns estimated value vs real value can't be
  used since all values are 1. Yes, when you use the non-boolean version
  with boolean data you always get 1. When you use the boolean version
  with boolean data you will get nonsense since the output of this
  recommender is not an estimated rating at all.
 
  On Tue, Jan 22, 2013 at 4:40 PM, Zia mel ziad.kame...@gmail.com wrote:
  I got 0 when I used GenericUserBasedRecommender in code 2 but when
  using GenericBooleanPrefUserBasedRecommender score was not 0 . I
  repeat the test with different data and again I got some results.
  Moreover , when I use
   DataModel model = new FileDataModel(new File(ua.base));
  in code 2, the MAE score was higher.
 
  When you say RMSE can't be used with boolean data, I assume MAE also
  can't be used?
 
  Thanks !
 
  On Tue, Jan 22, 2013 at 10:08 AM, Sean Owen sro...@gmail.com wrote:
  RMSE can't
  be used with boolean data.



Re: ItemBased and data size

2013-01-23 Thread Sean Owen
It's hard to make such a generalization, but all else equal, I'd expect
more data to improve results and decrease error, yes.

On Wed, Jan 23, 2013 at 8:02 PM, Zia mel ziad.kame...@gmail.com wrote:
 Is there a relation between ItemBased and data size? I found when I
 increase the data size the MAE decrease. Does that indicate anything?

 Many thanks


Re: Boolean preferences and evaluation

2013-01-22 Thread Sean Owen
That sounds reversed. Are you sure? Without pref values, you should
get 0. With values, you almost certainly won't get 0 RMSE. RMSE can't
be used with boolean data.

Code #4 needs to use the boolean user-based recommender or else you
will get 1 for all estimates and results are randomly ordered then.
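
A small plain-Java sketch of why RMSE (and MAE) carry no information on
boolean data: if every actual value is 1 and every estimate is 1, the
error is 0 by construction. BooleanRmse and its methods are hypothetical
helpers for illustration, not Mahout's evaluator code.

```java
public class BooleanRmse {

    /** Root mean squared error between estimated and actual values. */
    public static double rmse(double[] estimated, double[] actual) {
        double squares = 0.0;
        for (int i = 0; i < estimated.length; i++) {
            double diff = estimated[i] - actual[i];
            squares += diff * diff;
        }
        return Math.sqrt(squares / estimated.length);
    }

    /** Mean absolute error between estimated and actual values. */
    public static double mae(double[] estimated, double[] actual) {
        double total = 0.0;
        for (int i = 0; i < estimated.length; i++) {
            total += Math.abs(estimated[i] - actual[i]);
        }
        return total / estimated.length;
    }

    public static void main(String[] args) {
        // Boolean data: the only value in the universe is 1.
        double[] ones = {1.0, 1.0, 1.0};
        System.out.println("boolean RMSE=" + rmse(ones, ones) + " MAE=" + mae(ones, ones));

        // Rated data: estimates can actually differ from the real ratings.
        double[] estimated = {3.0, 5.0};
        double[] actual = {4.0, 4.0};
        System.out.println("rated RMSE=" + rmse(estimated, actual)
            + " MAE=" + mae(estimated, actual));
    }
}
```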

On Tue, Jan 22, 2013 at 4:04 PM, Zia mel ziad.kame...@gmail.com wrote:
 Thanks Sean.

 - When I used GenericUserBasedRecommender in code 2 I got 0 , but when
 using GenericBooleanPrefUserBasedRecommender both MAE and RMSE in case
 2 gave me scores, so only RMSE is not useful or also MAE ?

 - If I want to compare between recommenders that use preferences and
 those that don't use , does using code 3 and 4 below with
 GenericRecommenderIRStatsEvaluator makes sense? Since using code 2
 with GenericBooleanPrefUserBasedRecommender creates different
 recommender that uses weights.

 //---  Code 3 -

  DataModel model = new FileDataModel(new File("ua.base"));

  RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
  RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    public Recommender buildRecommender(DataModel model) throws TasteException {
      UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
      UserNeighborhood neighborhood = new NearestNUserNeighborhood(k, similarity, model);
      return new GenericUserBasedRecommender(model, neighborhood, similarity);
    }
  };

 //--- Code 4 ---

   DataModel model = new GenericBooleanPrefDataModel(
 GenericBooleanPrefDataModel.toDataMap(
   new FileDataModel(new File(ua.base;

 RecommenderIRStatsEvaluator evaluator = new
 GenericRecommenderIRStatsEvaluator();
 RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {

   public Recommender buildRecommender(DataModel model) throws
 TasteException {
  UserSimilarity similarity = new LogLikelihoodSimilarity(model);
  UserNeighborhood neighborhood = new
 NearestNUserNeighborhood(k, similarity, model);
return new GenericUserBasedRecommender(model, neighborhood,
 similarity);
   }};


 On Tue, Jan 22, 2013 at 1:58 AM, Sean Owen sro...@gmail.com wrote:
 No it's really #2, since the first still has data that is not
 true/false. I am not sure what eval you are running, but an RMSE test
 wouldn't be useful in case #2. It would always be 0 since there is
 only one value in the universe: 1. No value can ever be different from
 the right value.

 On Tue, Jan 22, 2013 at 4:34 AM, Zia mel ziad.kame...@gmail.com wrote:
 Hi !

 Can we say that both code 1 and 2 below are using boolean recommender
 since they both use LogLikelihoodSimilarity? Which code is used by
 default when no preferences are available ? When using
 GenericUserBasedRecommender in code 1 it gave a score during
 evaluation , but when using it in code 2 it gave 0 , is the score
 given by code 1 correct since in MAI book page 23 said In the case of
 Boolean preference data, only a precision-recall test is available
 anyway.

 //--- Code 1 ---
 DataModel model = new GroupLensDataModel(new File("ratings.dat"));
 RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
   public Recommender buildRecommender(DataModel model)
       throws TasteException {
     UserSimilarity similarity = new LogLikelihoodSimilarity(model);
     UserNeighborhood neighborhood =
         new NearestNUserNeighborhood(2, similarity, model);
     return new GenericUserBasedRecommender(model, neighborhood, similarity);
   }
 };

 //--- Code 2 ---
 DataModel model = new GenericBooleanPrefDataModel(
     GenericBooleanPrefDataModel.toDataMap(
         new FileDataModel(new File("ua.base"))));

 RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
   public Recommender buildRecommender(DataModel model)
       throws TasteException {
     UserSimilarity similarity = new LogLikelihoodSimilarity(model);
     UserNeighborhood neighborhood =
         new NearestNUserNeighborhood(2, similarity, model);
     return new GenericBooleanPrefUserBasedRecommender(
         model, neighborhood, similarity);
   }
 };

 Many Thanks !


Re: Boolean preferences and evaluation

2013-01-22 Thread Sean Owen
Yes, any metric that compares an estimated value against the real value
can't be used, MAE included, since all values are 1. And yes, when you
use the non-boolean version with boolean data you always get an estimate
of 1. When you use the boolean version with boolean data you will get
nonsense, since the output of that recommender is not an estimated
rating at all.
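
To make the point concrete, here is a minimal sketch in plain Java (no Mahout
classes; the class and method names are made up for illustration). It computes
MAE and RMSE by hand and shows that both collapse to 0 when every actual and
every estimated value is 1, which is the boolean-data situation described above.

```java
final class BooleanPrefErrorSketch {

    /** Mean absolute error between actual and estimated preference values. */
    static double mae(double[] actual, double[] estimated) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - estimated[i]);
        }
        return sum / actual.length;
    }

    /** Root mean squared error between actual and estimated preference values. */
    static double rmse(double[] actual, double[] estimated) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - estimated[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / actual.length);
    }

    public static void main(String[] args) {
        // Boolean data: the only value in the universe is 1, so an estimate
        // can never differ from the "right" value.
        double[] ones = {1.0, 1.0, 1.0, 1.0};
        System.out.println("MAE on boolean data:  " + mae(ones, ones));
        System.out.println("RMSE on boolean data: " + rmse(ones, ones));
    }
}
```

With real ratings the two inputs differ and the metrics become informative
again; with boolean data they cannot tell a good recommender from a bad one.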

On Tue, Jan 22, 2013 at 4:40 PM, Zia mel ziad.kame...@gmail.com wrote:
 I got 0 when I used GenericUserBasedRecommender in code 2 but when
 using GenericBooleanPrefUserBasedRecommender score was not 0 . I
 repeat the test with different data and again I got some results.
 Moreover , when I use
  DataModel model = new FileDataModel(new File(ua.base));
 in code 2, the MAE score was higher.

 When you say RMSE can't be used with boolean data, I assume MAE also
 can't be used?

 Thanks !

 On Tue, Jan 22, 2013 at 10:08 AM, Sean Owen sro...@gmail.com wrote:
 RMSE can't
 be used with boolean data.


Re: Question - Mahout Taste - User-Based Recommendations...

2013-01-22 Thread Sean Owen
Yes that's right. Look at UserBasedRecommender.mostSimilarUserIDs(),
and Recommender.estimatePreference(). These do what you are interested
in, and yes they are easy since they are just steps in the
recommendation process anyway.
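
As a rough illustration of why these two questions are just steps of the
top-N process, here is a self-contained sketch in plain Java. It is not
Mahout's implementation: the class, the toy data and the similarity measure
(1 / (1 + mean absolute rating difference over co-rated items)) are all
assumptions chosen to keep the example short. The method names only echo the
Mahout ones mentioned above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class UserBasedSketch {

    // userID -> (itemID -> rating)
    private final Map<Long, Map<Long, Double>> prefs;

    UserBasedSketch(Map<Long, Map<Long, Double>> prefs) {
        this.prefs = prefs;
    }

    /** Toy data, invented for this example. */
    static Map<Long, Map<Long, Double>> toyData() {
        Map<Long, Map<Long, Double>> data = new HashMap<>();
        data.put(1L, Map.of(10L, 5.0, 20L, 3.0));
        data.put(2L, Map.of(10L, 5.0, 20L, 3.0, 30L, 4.0));
        data.put(3L, Map.of(10L, 1.0, 20L, 5.0, 30L, 2.0));
        return data;
    }

    /** Illustrative similarity: higher when co-rated items have closer ratings. */
    double similarity(long a, long b) {
        Map<Long, Double> prefsA = prefs.get(a);
        Map<Long, Double> prefsB = prefs.get(b);
        double totalDiff = 0.0;
        int common = 0;
        for (Map.Entry<Long, Double> e : prefsA.entrySet()) {
            Double other = prefsB.get(e.getKey());
            if (other != null) {
                totalDiff += Math.abs(e.getValue() - other);
                common++;
            }
        }
        return common == 0 ? 0.0 : 1.0 / (1.0 + totalDiff / common);
    }

    /** Question 1: which users are closest to a given user? */
    List<Long> mostSimilarUserIds(long user, int howMany) {
        List<Long> others = new ArrayList<>(prefs.keySet());
        others.remove(Long.valueOf(user));
        others.sort(Comparator.comparingDouble((Long u) -> similarity(user, u)).reversed());
        return others.subList(0, Math.min(howMany, others.size()));
    }

    /** Question 2: similarity-weighted average of the neighbours' ratings. */
    double estimatePreference(long user, long item, int neighbours) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (long n : mostSimilarUserIds(user, neighbours)) {
            Double rating = prefs.get(n).get(item);
            if (rating != null) {
                double s = similarity(user, n);
                weighted += s * rating;
                totalWeight += s;
            }
        }
        return totalWeight == 0.0 ? Double.NaN : weighted / totalWeight;
    }
}
```

A top-N recommendation is then just estimatePreference() applied to every item
the user has not rated yet, sorted by the estimate.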

On Tue, Jan 22, 2013 at 6:38 PM, Henning Kuich hku...@gmail.com wrote:
 Dear All,

 I am wondering if I understand the User-based recommendation algorithm
 correctly.

 I need to be able to answer the following questions, given users and
 ratings:

 1) Which users are closest to a given user
 and
 2) given a user and a product, predict the preference for the product

 apart from the standard return topN recommendations. But as I understand
 it, the above two questions are just subquestions of the topN problem,
 correct? Because the algorithm determines the closest users since it's a
 user-based recommender, and since it calculates all potential user-item
 associations, the second question should also be taken care of.

 Do I understand this correctly?

 I would greatly appreciate any help,

 Henning




 Confidentiality Notice: This e-mail message, including any
 attachments, is for the sole use of the intended recipient(s) and may
 contain confidential and privileged information.  Any unauthorized
 review, use, disclosure or distribution is prohibited.  If you are not
 the intended recipient, please contact the sender by reply e-mail and
 destroy all copies of the original message.


Re: Question - Mahout Taste - User-Based Recommendations...

2013-01-22 Thread Sean Owen
That's a question of item-item similarity. For that you need something
based on an ItemSimilarity, which means the item-based implementation
rather than the user-based one. You can also use ItemSimilarity directly
to iterate over the candidate items and find the most similar one, but
the item-based recommender would do that for you.

On Tue, Jan 22, 2013 at 7:50 PM, Henning Kuich hku...@gmail.com wrote:
 Oh, I forgot one thing: Is it just as simple using the User-based
 recommendation to find similar products, or is this only possible using
 item-based recommendations? So basically if a given user rated a certain
 product with x stars, to figure out what item is most like the one he has
 just rated, but using only user-based recommendation algorithms?

 HK


 On Tue, Jan 22, 2013 at 7:44 PM, Henning Kuich hku...@gmail.com wrote:

 That's what I thought. I just wanted to make sure!

 Thanks so much for the quick reply!

 HK



 On Tue, Jan 22, 2013 at 7:40 PM, Sean Owen sro...@gmail.com wrote:

 Yes that's right. Look at UserBasedRecommender.mostSimilarUserIDs(),
 and Recommender.estimatePreference(). These do what you are interested
 in, and yes they are easy since they are just steps in the
 recommendation process anyway.

 On Tue, Jan 22, 2013 at 6:38 PM, Henning Kuich hku...@gmail.com wrote:
  Dear All,
 
  I am wondering if I understand the User-based recommendation algorithm
  correctly.
 
  I need to be able to answer the following questions, given users and
  ratings:
 
  1) Which users are closest to a given user
  and
  2) given a user and a product, predict the preference for the product
 
  apart from the standard return topN recommendations. But as I
 understand
  it, the above two questions are just subquestions of the topN problem,
  correct? Because the algorithm determines the closest users since
 it's a
  user-based recommender, and since it calculates all potential user-item
  associations, the second question should also be taken care of.
 
  Do I understand this correctly?
 
  I would greatly appreciate any help,
 
  Henning
 
 
 
 


Re: Changing in-memory DataModel to a DB dependent only DataModel after building recommender

2013-01-21 Thread Sean Owen
You would have to write this yourself, yes.
If you're not keeping the data in memory, you're not updating the
results in real-time. So there's no real need to keep any DataModel
around at all. Just pre-compute and store recommendations and update
them periodically. Nothing has to be on-line then.

On Mon, Jan 21, 2013 at 7:54 PM, Ceyhun Can ÜLKER ceyhunc...@gmail.com wrote:
 Hello,

 In our application we are using ReloadFromJDBCDataModel for its speed
 advantage of in-memory representation and being able to update periodically
 to pull in new data from a database source.

 However, once the recommender is built we do not want to keep the ratings
 data in memory (we would like to query the database when rating data is
 needed). We want to replace the ReloadFromJDBCDataModel with a
 MySqlJDBCDataModel after build. But there is no setter method for it,
 furthermore, the field that keeps the DataModel is in AbstractRecommender
 (superclass of SVDRecommender) and it is declared final.

 We thought we could write a new class that derives from DataModel, which
 initially keeps a Reload model instance (let's call this delegateModel), has
 a setter method for it, and delegates all DataModel methods, so that we
 could set this delegateModel field to another instance, say
 MySqlJDBCDataModel instance. Is this a good method for removing in-memory
 representation dependency after the build process?

 How can we achieve this change? Or is there an alternative and better way
 to achieve this?

 Thanks
 Ceyhun Can Ulker


Re: Changing in-memory DataModel to a DB dependent only DataModel after building recommender

2013-01-21 Thread Sean Owen
If you don't have the data in memory you can't compute anything. The
recommender itself doesn't do anything without data. That's why it
seemed like you really just wanted to compute everything offline
first, in which case the simplest solution is to store it however you
like and fetch that result however you like.

On Mon, Jan 21, 2013 at 8:22 PM, Ceyhun Can ÜLKER ceyhunc...@gmail.com wrote:
 Hi again,

 Thank you for your quick reply, Sean. I couldn't understand one point. What
 do you mean by pre-compute and store recommendations? Doesn't it mean
 having a dense (rather filled?) rating matrix? So it would make memory
 usage much worse, even if it is possible. Wouldn't it better to keep the
 model and compute whenever necessary?

 Thanks
 Ceyhun Can Ulker


 On Mon, Jan 21, 2013 at 9:58 PM, Sean Owen sro...@gmail.com wrote:

 You would have to write this yourself, yes.
 If you're not keeping the data in memory, you're not updating the
 results in real-time. So there's no real need to keep any DataModel
 around at all. Just pre-compute and store recommendations and update
 them periodically. Nothing has to be on-line then.

 On Mon, Jan 21, 2013 at 7:54 PM, Ceyhun Can ÜLKER ceyhunc...@gmail.com
 wrote:
  Hello,
 
  In our application we are using ReloadFromJDBCDataModel for its speed
  advantage of in-memory representation and being able to update
 periodically
  to pull in new data from a database source.
 
  However, once the recommender is build we do not want to keep the ratings
  data in memory (we would like to query the database when rating data is
  needed). We want to replace the ReloadFromJDBCDataModel with a
  MySqlJDBCDataModel after build. But there is no setter method for it,
  furthermore, the field that keeps the DataModel is in AbstractRecommender
  (superclass of SVDRecommender) and it is declared final.
 
  We thought we could write a new class that derives from DataModel, which
  initial keeps a Reload model instance (let's call this delegateModel),
 has
  a setter method for it, and delegates all DataModel methods, so that we
  could set this delegateModel field to another instance, say
  MySqlJDBCDataModel instance. Is this a good method for removing in-memory
  representation dependency after the build process?
 
  How can we achieve this change? Or is there an alternative and better way
  to achieve this?
 
  Thanks
  Ceyhun Can Ulker



Re: Boolean preferences and evaluation

2013-01-21 Thread Sean Owen
No it's really #2, since the first still has data that is not
true/false. I am not sure what eval you are running, but an RMSE test
wouldn't be useful in case #2. It would always be 0 since there is
only one value in the universe: 1. No value can ever be different from
the right value.

On Tue, Jan 22, 2013 at 4:34 AM, Zia mel ziad.kame...@gmail.com wrote:
 Hi !

 Can we say that both code 1 and 2 below are using boolean recommender
 since they both use LogLikelihoodSimilarity? Which code is used by
 default when no preferences are available ? When using
 GenericUserBasedRecommender in code 1 it gave a score during
 evaluation , but when using it in code 2 it gave 0 , is the score
 given by code 1 correct since in MAI book page 23 said In the case of
 Boolean preference data, only a precision-recall test is available
 anyway.

 //--- Code 1 ---
 DataModel model = new GroupLensDataModel(new File("ratings.dat"));
 RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
   public Recommender buildRecommender(DataModel model)
       throws TasteException {
     UserSimilarity similarity = new LogLikelihoodSimilarity(model);
     UserNeighborhood neighborhood =
         new NearestNUserNeighborhood(2, similarity, model);
     return new GenericUserBasedRecommender(model, neighborhood, similarity);
   }
 };

 //--- Code 2 ---
 DataModel model = new GenericBooleanPrefDataModel(
     GenericBooleanPrefDataModel.toDataMap(
         new FileDataModel(new File("ua.base"))));

 RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
   public Recommender buildRecommender(DataModel model)
       throws TasteException {
     UserSimilarity similarity = new LogLikelihoodSimilarity(model);
     UserNeighborhood neighborhood =
         new NearestNUserNeighborhood(2, similarity, model);
     return new GenericBooleanPrefUserBasedRecommender(
         model, neighborhood, similarity);
   }
 };

 Many Thanks !


Re: Any utility to solve the matrix inversion in Map/Reduce Way

2013-01-18 Thread Sean Owen
And, do you really need an inverse, or pseudo-inverse?
But, no, there are really no direct utilities for this. But we could
probably tell you how to do it efficiently, as long as you don't
actually mean a full inverse.

On Fri, Jan 18, 2013 at 11:58 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 Left unsaid in this comment is the fact that matrix inversion of any
 sizable matrix is almost always a mistake because it is (a) inaccurate, (b)
 slow.

 In scalable numerics it is also commonly true that the only really scalable
 problems are sparse.  The reason for that is that systems whose cost grows
 with O(n^2) cannot be scaled to arbitrary size n.  Sparse systems with only
 k items on average per row can often be handled with o(n) complexity, which
 is a requirement for a practical system.

 On Thu, Jan 17, 2013 at 8:49 PM, Koobas koo...@gmail.com wrote:

 Matrix inversion, as in explicitly computing the inverse,
 e.g. computing variance / covariance,
 or matrix inversion, as in solving a linear system of equations?


 On Thu, Jan 17, 2013 at 7:49 PM, Colin Wang 
 colin.bin.wang.mah...@gmail.com
  wrote:

  Hi All,
 
  I want to solve the matrix inversion, of course, big size, in Map/Reduce
  way.
  I don't know if Mahout offers this kind of utility. Could you please give
  me some tips?
 
  Thank you,
  Colin
 



Re: Problem with mahout and AWS

2013-01-18 Thread Sean Owen
You should give more detail about the errors. You are running out of
memory on the child workers. This is not surprising, since the default
memory they allocate is fairly small and you're running a complete
recommender system inside each mapper. It has little to do with the
size of the instance you use.

I am not sure what the second thing is, you should give more detail.

On Fri, Jan 18, 2013 at 2:02 PM, Iñigo Llamosas inigollamo...@gmail.com wrote:
 Hi,

 I am trying to run a simple recommender on AWS, but I'm getting errors when
 reducing. These are the jar-parameters lines:

 s3://inigobucket/jars/mahout-core-0.8-SNAPSHOT-job.jar

 org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob
 -Dmapred.input.dir=s3://inigobucket/data/grouplens10m/ratings.dat
 -Dmapred.output.dir=s3://inigobucket/output/
 --recommenderClassName
 org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender

 Starts OK, but when reducing it gives 2 kind of problems.

 -Heap space error. This confuses me because I had that error with a 2
 m.small slave cluster but also with a 5 c1.medium slave cluster
 -org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File
 exists error.

 Any suggestion?

 Many thanks,

 Inigo


Re: trying to get grouplens example to run

2013-01-17 Thread Sean Owen
That's the error right there:

On Thu, Jan 17, 2013 at 9:57 PM, Kamal Ali k...@grokker.com wrote:
 Caused by: java.io.IOException: Unexpected input format on line: 1 1 5


RE: MatrixMultiplicationJob runs with 1 mapper only ?

2013-01-16 Thread Sean Owen
Why do you need multiple mappers? Is one too slow? Many mappers are not
necessarily faster for small input.
On Jan 16, 2013 10:46 AM, Stuti Awasthi stutiawas...@hcl.com wrote:

 Hi,
 I tried to call programmatically also but facing same issue : Only single
 MapTask is running and that too spilling the map output  continuously.
 Hence im not able to generate the output for large matrix multiplication.

 Code Snippet :

 DistributedRowMatrix a = new DistributedRowMatrix(
     new Path("/test/points/matrixA"), new Path("/test/temp"),
     Integer.parseInt("100"), Integer.parseInt("10"));
 DistributedRowMatrix b = new DistributedRowMatrix(
     new Path("/test/points/matrixA"), new Path(tempDir),
     Integer.parseInt("100"), Integer.parseInt("10"));
 Configuration conf = new Configuration();
 conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818");
 conf.set("mapred.child.java.opts", "-Xmx2048m");
 conf.set("mapred.max.split.size", "10485760");
 a.setConf(conf);
 b.setConf(conf);
 a.times(b);

 Where Im going wrong. Any idea ?

 Thanks
 Stuti
 -Original Message-
 From: Stuti Awasthi
 Sent: Wednesday, January 16, 2013 2:55 PM
 To: Mahout User List
 Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?

 Hey Sean,
 Thanks for response. MatrixMultiplicationJob help shows the usage like :
 usage: command [Generic Options] [Job-Specific Options]

 Here Generic Option can be provided by -D property=value. Hence I tried
 with commandline -D options but it seems like that it is not making any
 effect.  It is also suggested in :

 https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/AbstractJob.html

 Here I have noted 1 thing after your suggestion  that currently Im passing
 arguments like -Dproperty=value rather than -D property=value. I tried
 with space between -D and property=value also but then its giving error
 like:
 13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected
 /test/points/matrixA while processing Job-Specific Options:

 No such error comes if im passing the arguments without space between -D.

 By reference of Hadoop Definite Guide : Do not confuse setting Hadoop
 properties using the -D property=value option to GenericOptionsParser (and
 ToolRunner) with setting JVM system properties using the
 -Dproperty=value option to the java command. The syntax for JVM system
 properties does not allow any whitespace between the D and the property
 name, whereas GenericOptionsParser requires them to be separated by
 whitespace.

 Hence I suppose that GenericOptions should be parsed by -D property=value
 rather than -Dproperty=value.

 Additionally I tried -Dmapred.max.split.size=10485760 also through
 commandline but again only single MapTask started.

 Please Suggest


 -Original Message-
 From: Sean Owen [mailto:sro...@gmail.com]
 Sent: Wednesday, January 16, 2013 1:23 PM
 To: Mahout User List
 Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?

 It's up to Hadoop in the end.

 Try calling FileInputFormat.setMaxInputSplitSize() with a smallish value,
 like your 10MB (1000).

 I don't know if Hadoop params can be set as sys properties like that
 anyway?

 On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi stutiawas...@hcl.com
 wrote:
  Hi,
 
  I am trying to multiply a dense matrix of size [100 x 100k]. The size of
 the file is 104MB and with default block sizeof 64MB only 2 blocks are
 getting created.
  So I reduced the block size to 10MB and now my file divided into 11
 blocks across the cluster. Cluster size is 10 nodes with 1 NN/JT and 9
 DN/TT.
 
  Everytime Im running Mahout MatrixMultiplicationJob through commandline,
 I can see on JobTracker WebUI that only 1 map task is launched. According
 to my understanding of Inputsplit, there should be 11 map tasks launched.
  Apart from this Map task stays at 0.99% completion and in the Tasks Logs
 , I can see that map task is spilling the map output.
 
  Mahout Command:
 
  mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M
  -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200
  -Dio.file.buffer.size=131072 --inputPathA /test/matrixA --numRowsA 100
  --numColsA 10 --inputPathB /test/matrixA --numRowsB 100 --numColsB
  10 --tempDir /test/temp
 
  Now here I want to know that why only 1 map task is launched everytime
 and how can I performance tune the cluster so that I can perform the dense
 matrix multiplication of the order [90K x 1 Million] .
 
  Thanks
  Stuti
 
 
  ::DISCLAIMER::
  --
  --
  
 
  The contents of this e-mail and any attachment(s) are confidential and
 intended for the named recipient(s) only.
  E-mail transmission is not guaranteed to be secure or error-free as
  information could be intercepted, corrupted, lost, destroyed, arrive
  late or incomplete, or may contain viruses in transmission. The e mail
 and its contents (with or without referred errors) shall

Re: Test multiple similarities using the same data

2013-01-16 Thread Sean Owen
You can try resetting all the random seeds with RandomUtils.useTestSeed()
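
The reason a fixed seed helps: the training/test split is drawn from a random
number generator, so the same seed yields the same split on every run. A
minimal sketch of that idea with plain java.util.Random follows. It does not
call Mahout's RandomUtils, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class RepeatableSplitSketch {

    /**
     * Returns the indices (0..size-1) that land in the training set.
     * The same seed always produces the same selection.
     */
    static List<Integer> trainingIndices(int size, double trainingFraction, long seed) {
        Random random = new Random(seed);
        List<Integer> chosen = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            if (random.nextDouble() < trainingFraction) {
                chosen.add(i);
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Two "evaluation runs" with the same seed see the same training data,
        // so differences in score come from the recommender, not the split.
        List<Integer> firstRun = trainingIndices(20, 0.7, 42L);
        List<Integer> secondRun = trainingIndices(20, 0.7, 42L);
        System.out.println(firstRun.equals(secondRun));
    }
}
```
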
On Jan 16, 2013 4:01 PM, Zia mel ziad.kame...@gmail.com wrote:

 Hi

 How to evaluate a recommender using different similarities ? Once we call
 evaluator.evaluate(recommenderBuilder,..)
 it will decide the training and test data for that recommender and if
 we call it again for another setting (similarity,neighborhood) the
 data will be different. So how can we be consistent ?

 Thanks !



Re: Recommend to a group of users

2013-01-16 Thread Sean Owen
Not really directly, no. You can make N individual recommendations and
combine them, and there are many ways to do that. You can blindly rank
them on their absolute scores. You can interleave rankings so each
user gets every Nth slot in the recommendation. A popular approach is to
rank by least-aversion: the best recommendation is the one most
acceptable to the person in the group who will like it least. You're
minimizing maximum unhappiness: often how it works in groups!
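
A small sketch of the least-aversion idea in plain Java. The class name, the
input layout (one array of per-member estimated scores for each item) and the
numbers are assumptions for illustration; the per-member scores would come
from whatever individual recommender you already run.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GroupRecommendationSketch {

    /** Lowest score any group member gives the item. */
    static double minimum(double[] scores) {
        double min = Double.POSITIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
        }
        return min;
    }

    /**
     * Ranks items so that the item whose least-happy member is happiest
     * comes first. scoresByItem maps itemID -> one estimated score per member.
     */
    static List<Long> rankByLeastAversion(Map<Long, double[]> scoresByItem) {
        List<Long> items = new ArrayList<>(scoresByItem.keySet());
        items.sort(Comparator.comparingDouble(
                (Long item) -> minimum(scoresByItem.get(item))).reversed());
        return items;
    }

    public static void main(String[] args) {
        Map<Long, double[]> scores = new HashMap<>();
        scores.put(1L, new double[] {5.0, 1.0}); // one member loves it, one hates it
        scores.put(2L, new double[] {3.5, 3.0}); // both find it acceptable
        scores.put(3L, new double[] {4.0, 2.0});
        System.out.println(rankByLeastAversion(scores));
    }
}
```

Ranking by the sum or average instead would put item 1 or item 2 first
depending on the numbers; least-aversion deliberately ignores enthusiasm and
looks only at the unhappiest member.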

On Wed, Jan 16, 2013 at 4:56 PM, Zia mel ziad.kame...@gmail.com wrote:
 Hi

 Can we use Mahout to recommend to a group of users that share similar
 interests? Maybe some clustering or so.

 Thanks


Re: threshold assignment / selection

2013-01-15 Thread Sean Owen
It's fairly arbitrary. Strong positive ratings are probably more than
merely above average, but you could define the threshold higher or
lower if you wanted. It's a good default.
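
For reference, the default being discussed (user's average preference plus one
standard deviation) is simple to compute by hand. The sketch below is plain
Java with a made-up class name, and it uses the population standard deviation
as an assumption; the evaluator's own computation may differ in that detail.

```java
final class RelevanceThresholdSketch {

    /** Mean of the user's ratings plus one (population) standard deviation. */
    static double threshold(double[] ratings) {
        double mean = 0.0;
        for (double r : ratings) {
            mean += r;
        }
        mean /= ratings.length;

        double variance = 0.0;
        for (double r : ratings) {
            variance += (r - mean) * (r - mean);
        }
        variance /= ratings.length; // population variance: an assumption here

        return mean + Math.sqrt(variance);
    }

    public static void main(String[] args) {
        double[] ratings = {2, 4, 4, 4, 5, 5, 7, 9}; // mean 5, std dev 2
        System.out.println("Items rated at or above " + threshold(ratings)
                + " count as strongly liked");
    }
}
```

Using just the mean as the cutoff would mark roughly half of a user's items as
"good"; adding a standard deviation keeps only the clearly-above-average ones.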

On Tue, Jan 15, 2013 at 3:58 PM, Zia mel ziad.kame...@gmail.com wrote:
 Hi
 Why in recommender the threshold is considered the user’s average
 preferences value plus one standard deviation ?
 Can we assume that the good recommendations are anything above the
 user's average preferences?

 Many thanks


Re: Choosing precision

2013-01-15 Thread Sean Owen
Precision is not a great metric for recommenders, but it exists. There
is no best value here; I would choose something that mirrors how you
will use the results. If you show top 3 recs, use 3.
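
For concreteness, precision at N is just the share of the top N recommended
items that are in the held-out "relevant" set. A minimal plain-Java sketch
(class and method names are made up; dividing by N even when fewer than N
items were recommended is one convention among several):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PrecisionAtNSketch {

    /** Fraction of the first n recommendations that are relevant. */
    static double precisionAt(int n, List<Long> recommended, Set<Long> relevant) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        int limit = Math.min(n, recommended.size());
        int hits = 0;
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(recommended.get(i))) {
                hits++;
            }
        }
        return (double) hits / n;
    }

    public static void main(String[] args) {
        List<Long> recommended = Arrays.asList(1L, 2L, 3L, 4L, 5L);
        Set<Long> relevant = new HashSet<>(Arrays.asList(2L, 5L, 9L));
        // If the UI shows three recommendations, p@3 is the number to watch.
        System.out.println("p@3 = " + precisionAt(3, recommended, relevant));
        System.out.println("p@5 = " + precisionAt(5, recommended, relevant));
    }
}
```
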

On Tue, Jan 15, 2013 at 4:51 PM, Zia mel ziad.kame...@gmail.com wrote:
 Hello,

 If I have users that have items between 1-20 , what would be the ideal
 way to evaluate the recommender using precision? Is there any
 recommended precision to choose such as  p@2 , p@5 p@10 or others and
 why?

 Many thanks


Re: Choosing precision

2013-01-15 Thread Sean Owen
The best tests are really from real users. A/B test different
recommenders and see which has better performance. That's not quite
practical though.

The problem is that you don't even know what the best recommendations
are. Splitting the data by date is reasonable, but recent items aren't
necessarily most-liked. Splitting by rating is more reasonable on this
point, but you still can't conclude that there aren't better
recommendations from among the un-rated items.

Still, it ought to correlate. I think you will find precision/recall are
very low in most cases, often a few percent. The result is noisy.
AUC will tell you about where all of those best recommendations in
the test set fell into the list, rather than only measuring the top
N's performance. This tells you more, and I think that's generally
good. However it is measuring performance over the entire list of
recs, when you are unlikely to use more than the top N.

Go ahead and use it since there's not a lot better you can do in the
lab, but be aware of the issues.
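
To show what AUC measures over the whole ranked list, here is a plain-Java
sketch under simple assumptions: the input is the full recommendation list,
best first, with a flag saying whether each item was one of the held-out
"best" items. AUC is then the fraction of (good, not-good) pairs in which the
good item is ranked higher. Class and method names are hypothetical.

```java
final class AucSketch {

    /**
     * rankedIsGood[i] is true when the item at rank i (0 = top) is a held-out
     * good item. Returns NaN when there are no pairs to compare.
     */
    static double auc(boolean[] rankedIsGood) {
        long goodSeen = 0;      // good items ranked above the current position
        long correctPairs = 0;  // (good, not-good) pairs in the right order
        long notGood = 0;
        for (boolean good : rankedIsGood) {
            if (good) {
                goodSeen++;
            } else {
                notGood++;
                correctPairs += goodSeen;
            }
        }
        long totalPairs = goodSeen * notGood;
        return totalPairs == 0 ? Double.NaN : (double) correctPairs / totalPairs;
    }

    public static void main(String[] args) {
        // Good items at ranks 0 and 2 out of four recommendations.
        System.out.println(auc(new boolean[] {true, false, true, false}));
    }
}
```

Unlike precision at N, a good item at rank 50 still contributes here, which is
both the strength and the weakness mentioned above.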


Re: RMSRecommenderEvaluator RMSE

2013-01-15 Thread Sean Owen
You have the definition there already, what are you asking?
On Jan 15, 2013 5:58 PM, Zia mel ziad.kame...@gmail.com wrote:

 Hi again ,

 When evaluating preferences in recommenders and using
 RMSRecommenderEvaluator, is it RMSE/RMSD
 http://en.wikipedia.org/wiki/Root_mean_square_deviation

 If we get a value of 1 or 10 for RMSE what does that really mean ? Can
 we represent RMSE by a % by dividing it on the range of preferences to
 get a % of the error. For example if the RMSE is 1 and range is from
 0-5 can we say that the error of predicting is 1/5= 20% ?

 Thanks



Re: Failed to create /META-INF/license file on Mac system

2013-01-15 Thread Sean Owen
http://stackoverflow.com/questions/10522835/hadoop-java-io-ioexception-mkdirs-failed-to-create-some-path

On Tue, Jan 15, 2013 at 9:42 PM, Yunming Zhang
zhangyunming1...@gmail.com wrote:
 Hi,

 I was trying to set up Mahout 0.8 on my Macbook Pro with OSX so I could do 
 some local testing, I am running Hadoop 1.0.3 (it worked fine with mahout in 
 my cluster)

 I have set up Pseudo distribution Hadoop, and I could put testdata direction 
 into HDFS,

 But when I try to execute

 $MAHOUT_HOME/bin/mahout 
 org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

 I get

 Exception in thread main java.io.IOException: Mkdirs failed to create 
 /PATH-TO-TMP/hadoop-unjar6845980999143023006/META-INF/license
 at org.apache.hadoop.util.RunJar.unJar(RunJar.java:47)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:132)

 It seems to be a really similar issue to this bug
 https://issues.apache.org/jira/browse/MAHOUT-780

 but I am using Mahout 0.8, so I am not sure what is happening here, I have 
 checked, there should be permission to the PATH-TO-TMP directory, so I don't 
 think it is a permission issue

 Thanks

 Yunming


Re: MatrixMultiplicationJob runs with 1 mapper only ?

2013-01-15 Thread Sean Owen
It's up to Hadoop in the end.

Try calling FileInputFormat.setMaxInputSplitSize() with a smallish
value, like your 10MB (1000).

I don't know if Hadoop params can be set as sys properties like that anyway?
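
As a back-of-the-envelope check on the numbers in the question, this plain-Java
sketch estimates how many splits a simple file-based input format would
produce for a given maximum split size. It is only arithmetic: it ignores
Hadoop's slop factor and minimum split size, and it says nothing about whether
the input format this particular job uses will split the file at all. The
class and method names are made up.

```java
final class SplitCountSketch {

    /** Rough split count: file size divided by max split size, rounded up. */
    static long approximateSplits(long fileBytes, long maxSplitBytes) {
        if (maxSplitBytes <= 0) {
            throw new IllegalArgumentException("maxSplitBytes must be positive");
        }
        return (fileBytes + maxSplitBytes - 1) / maxSplitBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 104 * mb; // the 104MB matrix file from the question
        System.out.println("64MB splits: " + approximateSplits(fileSize, 64 * mb));
        System.out.println("10MB splits: " + approximateSplits(fileSize, 10 * mb));
    }
}
```

If the job still launches a single map task with a 10MB limit, the limit is
probably not reaching the job, or the input format is not splitting the file.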

On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi stutiawas...@hcl.com wrote:
 Hi,

 I am trying to multiply a dense matrix of size [100 x 100k]. The size of the 
 file is 104MB and with default block sizeof 64MB only 2 blocks are getting 
 created.
 So I reduced the block size to 10MB and now my file divided into 11 blocks 
 across the cluster. Cluster size is 10 nodes with 1 NN/JT and 9 DN/TT.

 Everytime Im running Mahout MatrixMultiplicationJob through commandline, I 
 can see on JobTracker WebUI that only 1 map task is launched. According to my 
 understanding of Inputsplit, there should be 11 map tasks launched.
 Apart from this Map task stays at 0.99% completion and in the Tasks Logs , I 
 can see that map task is spilling the map output.

 Mahout Command:

 mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M 
 -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200 
 -Dio.file.buffer.size=131072 --inputPathA /test/matrixA --numRowsA 100 
 --numColsA 10 --inputPathB /test/matrixA --numRowsB 100 --numColsB 10 
 --tempDir /test/temp

 Now here I want to know that why only 1 map task is launched everytime and 
 how can I performance tune the cluster so that I can perform the dense matrix 
 multiplication of the order [90K x 1 Million] .

 Thanks
 Stuti



