Re: Is Mahout obsolete now?

2015-10-19 Thread Sean Owen
No, this is pretty wrong. Spark is not, in general, a real-time anything. Spark Streaming is a near-real-time streaming framework, but it is not something you can build models with. Spark MLlib / ML are offline / batch. Not sure what you mean by Hadoop engine, but Spark does not build on

Re: Negative preferences

2014-08-15 Thread Sean Owen
I have used thumbs-down-like interactions as a kind of anti-click that subtracts from the interaction between the user and item. The negative scores can be naturally applied in a matrix-factorization-like model like ALS, but that's not the situation here. Others probably have better first-hand

Re: ALS, weighed vs. non-weighed regularization paper

2014-06-16 Thread Sean Owen
Yeah I've turned that over in my head. I am not sure I have a great answer. But I interpret the net effect to be that the model prefers simple explanations for active users, at the cost of more error in the approximation. One would rather pick a basis that more naturally explains the data observed

Re: Does Mahout handle missing values in train and test data, for Decision Forest?

2014-04-22 Thread Sean Owen
From looking at the code recently, no it is not handled. On Tue, Apr 22, 2014 at 1:27 PM, Himanshu himanshu.ash...@gmail.com wrote: In Weka it is possible to mark the field with a question mark ? for unknown values and these are handled. Is there a similar way to mark unknown/missing field

RE: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1

2014-04-02 Thread Sean Owen
-Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Wednesday, 2 April 2014 4:05 PM To: Mahout User List Subject: Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1 Hm, OK something

Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1

2014-04-01 Thread Sean Owen
-Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Monday, 31 March 2014 7:05 PM To: Mahout User List Subject: RE: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1 But you have a bunch of Hadoop 0.20 jars on your

Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1

2014-04-01 Thread Sean Owen
This may be getting to you're-on-your-own territory since you're modifying the build. This error means your directory structure doesn't match up with declarations. You said somewhere that the parent of module X was Y, but the location given points to the pom of a module that isn't Y. On Wed, Apr

Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1

2014-04-01 Thread Sean Owen
[INFO] Thanks and Regards, Truong Phan -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent

RE: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1

2014-03-31 Thread Sean Owen
But you have a bunch of Hadoop 0.20 jars on your classpath! Definitely a problem. Those should not be there. On Mar 31, 2014 7:09 AM, Phan, Truong Q troung.p...@team.telstra.com wrote: Yes, I did rebuild it. oracle@bpdevdmsdbs01:

Re: Profiling with visualvm

2014-03-30 Thread Sean Owen
Profiled what exactly, a Hadoop job? If you profile a client, you aren't learning anything about the work, but just that the client process is blocked waiting for Hadoop jobs to complete. On Mar 30, 2014 10:08 AM, Mahmood Naderan nt_mahm...@yahoo.com wrote: Hi, I profiled the Mahout command

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-06 Thread Sean Owen
Are you sure? Are you crazy?) would be more palatable to some teams than installing tarballs, is what I'm getting at. On Wed, Mar 5, 2014 at 1:30 PM, Sean Owen sro...@gmail.com wrote: You can always install whatever version of anything on your cluster that you want

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-06 Thread Sean Owen
thought someone cleaned that up... On Thu, Mar 6, 2014 at 3:34 PM, Kevin Moulart kevinmoul...@gmail.com wrote: Ok so should I try and recompile and change the guava version to 11.0.2 in the pom ? Kévin Moulart 2014-03-06 16:26 GMT+01:00 Sean Owen sro...@gmail.com: That's gonna be a Guava

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
CDH 4.5 and 4.6 are both 0.7 + patches. Neither contains 0.8, since it has (tiny) breaking changes vs 0.7 and this is a minor version update. CDH5 contains 0.8 + patches. I did not say CDH4 has 0.8 -- re-read the message of mine that was quoted.

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
I don't follow what here makes you say they are cut-down releases. They are release plus patches, not release minus patches. The question is not about how to use 0.7, but how to use 1.0-SNAPSHOT. Why would switching to the official 0.7 release help? I think the answer is you build Mahout for

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
, Sean Owen sro...@gmail.com wrote: I don't follow what here makes you say they are cut down releases? meaning it seems to be pretty much 2 releases behind the official. But i definitely don't follow CDH developments in this department, you seem in a better position to explain the existing

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
be dragons warning. I know that complicates things but people do use your releases a long time. I personally wished I could upgrade Pig on CDH 4 for new features but there was no simple way on a managed cluster. On Wed, Mar 5, 2014 at 12:12 PM, Sean Owen sro...@gmail.com wrote: I don't understand

Re: Mahout on Spark?

2014-02-19 Thread Sean Owen
Agree that 'merging' is so infeasible as to not make sense. Mahout has been ML on M/R and that's its thing, which seems fine. IMHO this project has been hurt by an active unwillingness to define scope, and pretending it's helpful to have little bits of lots of ideas and technologies. I also

Re: Mahout on Spark?

2014-02-19 Thread Sean Owen
To set expectations appropriately, I think it's important to point out this is completely infeasible short of a total rewrite, and I can't imagine that will happen. It may not be obvious if you haven't looked at the code how completely dependent on M/R it is. You can swap out M/R and Spark if you

Re: [Edit] Approach for Clustering Data

2014-02-18 Thread Sean Owen
FYI, CDH5 includes version 0.8 + patches. But 0.9 should work fine with CDH4. You do have to build with the Hadoop 2.x profile, as usual. On Tue, Feb 18, 2014 at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: Bikash, Don't use that version. Use a more recent release. We can't help that

Re: get similar items

2014-02-12 Thread Sean Owen
Try LogLikelihoodSimilarity. On Wed, Feb 12, 2014 at 9:06 AM, 12481...@qq.com 12481...@qq.com wrote: Hi Sean, you said It depends what ItemSimilarity you are using. what kind of ItemSimilarity can work correctly without preference? thanks. -- View this message in context:

Re: Mahout 0.9 with cloudera

2014-02-06 Thread Sean Owen
Yeah that's the version that's bundled with 4.x. 5.x has basically 0.8 plus patches to work on MR2. Mahout is not really something you have to install, even though it does get packaged and dumped onto the cluster nodes. Just use it against your cluster -- it can be from a machine that isn't part

Re: Mahout 0.8 Random Forest Accuracy

2013-10-18 Thread Sean Owen
Yes I looked at the impl here, and I think it is aging, since I'm not sure Deneche had time to put in many bells or whistles at the start, and not sure it's been touched much since. My limited experience is that it generally does less clever stuff than R, which in turn is less clever than sklearn

Re: Tuning parameters for ALS-WR

2013-09-11 Thread Sean Owen
On Wed, Sep 11, 2013 at 12:22 AM, Parimi Rohit rohit.par...@gmail.com wrote: 1. Do we have to follow this setting, to compare algorithms? Can't we report the parameter combination for which we get highest mean average precision for the test data, when trained on the train set, without any

Re: running mahout on Hadoop 2.0.0-cdh4.3.1

2013-09-10 Thread Sean Owen
You are trying to run on Hadoop 2 and Mahout only works with Hadoop 1 and related branches. This won't work. However the CDH distributions also come in an 'mr1' flavor that stands a much better chance of working with something that is built for Hadoop 1. Use 2.0.0-mr1-4.3.1 instead. (PS 4.3.2 and

Re: ALS and SVD feature vectors

2013-09-04 Thread Sean Owen
The feature vectors? Rows of X and Y? No, they definitely should not be normalized. It will change the approximation you so carefully built quite a lot. As you say U and V are orthonormal in the SVD. But you still multiply all of them together with Sigma when making recs. (Or you embed Sigma in

Re: Install mahout 0.8 with hadoop 2.0

2013-08-13 Thread Sean Owen
I think it all minimally works on Hadoop 2.0.x, though I haven't tried it recently -- it does require a recompile. This is different from it working on MRv2 versus MRv1. I'm almost certain it does not work on MRv2 and doubt it will. The effort is not large, but it's subtle. A few hacks may fail

Re: Data distribution guidance for recommendation engines

2013-08-01 Thread Sean Owen
On Thu, Aug 1, 2013 at 3:15 AM, Chloe Guszo chloe.gu...@gmail.com wrote: If I split my data into train and test sets, I can show good performance of Good performance according to what metric? it makes a lot of difference whether you are talking about precision/recall or RMSE. the model on the

Re: Latent Dirichlet Allocatio (cvb)

2013-07-31 Thread Sean Owen
FWIW I know Mahout 0.8 works fine with CDH4 (the mr1 version of course) and is what CDH5 will include. Should be no problems there. On Wed, Jul 31, 2013 at 4:33 PM, Marco zentrop...@yahoo.co.uk wrote: great. at least i know what's wrong :) will check out if cloudera supports mahout 0.8.

Re: Calculating affinity

2013-07-23 Thread Sean Owen
Here's just one perspective -- Yes this is kind of how things like ALS work. The input values are viewed as 'weights', not ratings. They're not reconstructed directly but used as a weight in a loss function. This turns out to make more sense when paired with a squared-error loss function, as it

Myrrix is now a part of Cloudera

2013-07-16 Thread Sean Owen
This may be relevant enough to announce here: http://blog.cloudera.com/blog/2013/07/myrrix-joins-cloudera-to-bring-big-learning-to-hadoop/ (Brief recap: Myrrix is a product / project / tiny company related to large scale-recommenders, and shares some APIs and background with Mahout.) I think

Re: LZ4 file extensions from Mahout recommender

2013-07-04 Thread Sean Owen
This is nothing to do with Mahout, but how your Hadoop cluster is configured. I assume you have turned on map / reduce output compression and are using the LZO codec. On Thu, Jul 4, 2013 at 11:06 AM, Sugato Samanta sugato@gmail.com wrote: Hello, I was trying to execute the recommendation

Re: UseConcMarkSweepGC with Mahout

2013-07-02 Thread Sean Owen
This is old-ish advice. I tend to favor UseParallelOldGC even on Java 7, over G1GC, even though it may even be a default now? The Old just means it also uses a parallel collector thread on the old generation. In general it's good to make use of increasingly multi-core machines by making GC

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Sean Owen
Yeah this has gone well off-road. ALS is not non-deterministic because of hardware errors or cosmic rays. It's also nothing to do with floating-point round-off, or certainly, that is not the primary source of non-determinism to several orders of magnitude. ALS starts from a random solution and

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Sean Owen
On Tue, Jun 25, 2013 at 12:44 AM, Michael Kazekin kazm...@hotmail.com wrote: But doesn't alternation guarantee convexity? No, the problem remains non-convex. At each step, where half the parameters are fixed, yes that constrained problem is convex. But each of these is not the same as the
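The point about alternation can be illustrated with a toy sketch. This is not Mahout's ALS-WR: it assumes a small, fully observed matrix, rank 1, and plain (unweighted) L2 regularization, and the names als_rank1, lam and seed are made up for the example. With one factor held fixed, each update is an ordinary regularized least-squares problem with a closed-form solution, so the loss can never go up in a half-step, even though the joint problem in both factors stays non-convex and the starting point is random.

```python
import random


def als_rank1(R, lam=0.1, iters=20, seed=0):
    """Toy rank-1 alternating least squares on a fully observed matrix R.

    Returns the two factor vectors and the loss after every half-step.
    """
    rng = random.Random(seed)
    n_users, n_items = len(R), len(R[0])
    # Random starting point: a different seed gives a different start.
    u = [rng.uniform(0.1, 1.0) for _ in range(n_users)]
    v = [rng.uniform(0.1, 1.0) for _ in range(n_items)]

    def loss():
        err = sum((R[i][j] - u[i] * v[j]) ** 2
                  for i in range(n_users) for j in range(n_items))
        return err + lam * (sum(x * x for x in u) + sum(y * y for y in v))

    history = [loss()]
    for _ in range(iters):
        # Half-step 1: v fixed, each u[i] has a closed-form minimizer.
        vv = sum(y * y for y in v)
        for i in range(n_users):
            u[i] = sum(R[i][j] * v[j] for j in range(n_items)) / (lam + vv)
        history.append(loss())
        # Half-step 2: u fixed, each v[j] has a closed-form minimizer.
        uu = sum(x * x for x in u)
        for j in range(n_items):
            v[j] = sum(R[i][j] * u[i] for i in range(n_users)) / (lam + uu)
        history.append(loss())
    return u, v, history


R = [[5.0, 3.0, 1.0],
     [4.0, 2.0, 1.0],
     [1.0, 1.0, 5.0]]
u, v, history = als_rank1(R, seed=1)
```

The loss history is non-increasing by construction, which says nothing about reaching a global optimum: that is the gap between "each step is convex" and "the problem is convex".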

Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Sean Owen
someone can check my facts here, but the log-likelihood ratio follows a chi-square distribution. You can figure an actual probability from that in the usual way, from its CDF. You would need to tweak the code you see in the project to compute an actual LLR by normalizing the input. You could use
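As a rough sketch of "the usual way": assuming the statistic is a properly computed LLR (the G statistic, including the factor of 2) from a 2x2 contingency table, it is asymptotically chi-square with 1 degree of freedom, and that CDF has a closed form in terms of the error function. The function names below are made up for the example.

```python
import math


def chi2_cdf_1dof(x: float) -> float:
    """CDF of the chi-square distribution with 1 degree of freedom."""
    if x <= 0.0:
        return 0.0
    return math.erf(math.sqrt(x / 2.0))


def llr_p_value(llr: float) -> float:
    """Upper-tail probability of seeing a statistic at least this large
    if the two events were independent (1 degree of freedom assumed)."""
    return math.erfc(math.sqrt(max(llr, 0.0) / 2.0))


# 3.841... is the familiar 95% critical value for 1 degree of freedom.
cdf_at_critical = chi2_cdf_1dof(3.841458820694124)
```

The chi-square approximation only holds for a correctly scaled statistic, so a raw similarity score should not be fed into this directly.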

Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Sean Owen
being similar is 1 - p (which is exactly the CDF for that value of X). Now, my question is: in the contingency table case, why would I normalize? It's a ratio already, isn't it? On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen sro...@gmail.com wrote: someone can check my facts here, but the log

Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Sean Owen
? On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen sro...@gmail.com wrote: I think the quickest answer is: the formula computes the test statistic as a difference of log values, rather than log of ratio of values. By not normalizing, the entropy is multiplied by a factor (sum of the counts) vs

Re: Negative Preferences in a Recommender

2013-06-18 Thread Sean Owen
Yes the model has no room for literally negative input. I think that conceptually people do want negative input, and in this model, negative numbers really are the natural thing to express that. You could give negative input a small positive weight. Or extend the definition of c so that it is

Re: Negative Preferences in a Recommender

2013-06-18 Thread Sean Owen
I'm suggesting using numbers like -1 for thumbs-down ratings, and then using these as a positive weight towards 0, just like positive values are used as positive weighting towards 1. Most people don't make many negative ratings. For them, what you do with these doesn't make a lot of difference.
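One way to read this suggestion, as a hedged sketch: the sign of the input picks the target (1 for positive, 0 for negative) and the magnitude becomes a positive weight on that target. The function name and the alpha scale factor are hypothetical, and real implicit-feedback models often add a base weight for unobserved entries, which is left out here.

```python
def to_target_and_weight(rating: float, alpha: float = 1.0):
    """Map a signed rating to (target, weight).

    Positive ratings pull the prediction towards 1, negative ratings
    pull it towards 0; in both cases the weight is positive.
    alpha is a hypothetical scaling knob, not a Mahout parameter.
    """
    target = 1.0 if rating > 0 else 0.0
    weight = alpha * abs(rating)
    return target, weight


thumbs_down = to_target_and_weight(-1.0)   # target 0.0, weight 1.0
strong_like = to_target_and_weight(3.0)    # target 1.0, weight 3.0
```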

Re: Mahout compatibility with Hadoop

2013-06-17 Thread Sean Owen
Is it compatible with any Hadoop release? Of course; would it make sense if not? I'm not sure where you get this idea. 0.5 was, I think, compiled vs 0.20.x. The last release was vs 1.0.3 or so. The current release is vs 1.1.x. In all cases these are the latest stable Apache releases, so not sure

Re: Mahout compatibility with Hadoop

2013-06-17 Thread Sean Owen
is really an ancient release. We have two or three issues left for 0.8, then we'll have a code freeze and do testing before we release 0.8. -sebastian -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Monday, June 17, 2013 4:53 PM To: Mahout User List Subject: Re

Re: Mahout compatibility with Hadoop

2013-06-17 Thread Sean Owen
Yes you have to refer to the 'mrv1' artifacts if I recall correctly, if you use CDH4. You are talking about CDH3, which is different. On Mon, Jun 17, 2013 at 3:23 PM, cont...@dhuebner.com wrote: Well, I just set up CDH4 with Mahout for testing a few days ago. It still required some fixing of

Re: Running Mahout recommendations on a Cassandra data set

2013-06-12 Thread Sean Owen
This is more of a Hadoop question. The input hides behind the InputFormat implementation. If you have an InputFormat that can read and produce the same key-value pairs that you'd get from a SequenceFileInputFormat / TextInputFormat and HDFS, yes the rest just works automatically. You have to

Re: Social Network Link Prediction in Mahout

2013-06-08 Thread Sean Owen
Use an implementation that doesn't expect a rating. These are so-called 'boolean' implementations, like GenericBooleanPrefDataModel. For example you can build an item-based recommender with the boolean version of the item-based recommender and a log-likelihood similarity. Or, yes you can calculate

Re: [DRAFT] 0.8 Release Announcement + Future Plans Discussion

2013-06-08 Thread Sean Owen
I agree with deprecating all of that FWIW. On Sat, Jun 8, 2013 at 6:33 PM, Grant Ingersoll gsing...@apache.org wrote: Collaborative Filtering: - all recommenders in o.a.m.cf.taste.impl.recommender.knn - the TreeClusteringRecommender in o.a.m.cf.taste.impl.recommender - the SlopeOne

Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
In point 1, I don't think I'd say it that way. It's not true that test/training is divided by user, because every user would either be 100% in the training or 100% in the test data. Instead you hold out part of the data for each user, or at least, for some subset of users. Then you can see whether
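A minimal sketch of the per-user hold-out described here, assuming preferences are kept as a dict from user ID to a set of item IDs. The function name, the test_fraction default and the seed are illustrative choices, not anything from Mahout's evaluator.

```python
import random


def holdout_per_user(prefs, test_fraction=0.25, seed=42):
    """Split each user's items into a training part and a held-out part.

    Every user stays in the training data; only some of their items
    are moved to the test set.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in prefs.items():
        shuffled = sorted(items)      # sort first so the split is repeatable
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test[user] = set(shuffled[:n_test])
        train[user] = set(shuffled[n_test:])
    return train, test


prefs = {
    "alice": {1, 2, 3, 4},
    "bob": {2, 3, 5, 6, 7, 8, 9, 10},
    "carol": {4},
}
train, test = holdout_per_user(prefs)
```

Users with very few items (like "carol" above) end up with nothing held out, which is one reason evaluation is often restricted to a subset of users.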

Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
, 2013 at 2:58 PM, Sean Owen sro...@gmail.com wrote: In point 1, I don't think I'd say it that way. It's not true that test/training is divided by user, because every user would either be 100% in the training or 100% in the test data. Instead you hold out part of the data for each user

Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
:50 PM, Sean Owen sro...@gmail.com wrote: It depends on the algorithm I suppose. In some cases, the already-known items would always be top recommendations and the test would tell you nothing. Just like in an RMSE test -- if you already know the right answers your score is always a perfect 0

Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
I believe the suggestion is just for purposes of evaluation. You would not return these items in practice, yes. Although there are cases where you do want to return known items. For example, maybe you are modeling user interaction with restaurant categories. This could be useful, because as soon

Re: Database connection pooling for a recommendation engine

2013-06-05 Thread Sean Owen
Not sure, is this really related to Mahout? I don't know of an equivalent of J2EE / Tomcat for C++, but there must be something. As a general principle, you will have to load your data into memory if you want to perform the computations on the fly in real time. So how you access the data isn't

Re: IRStats Evaluation for Recommender Systems

2013-05-30 Thread Sean Owen
There's nothing direct, but you can probably save yourself time by copying the code that computes these stats and apply them to your pre-computed values. It's not terribly complex, just counting the intersection and union size and deriving some stats from it. The split is actually based on value
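For pre-computed recommendations, the counting involved is roughly the following. This is a generic precision/recall sketch over sets, not a copy of the project's evaluator code, and the function name is made up.

```python
def precision_recall(recommended, relevant):
    """Precision and recall from the overlap of two item collections."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four recommended items, three held-out relevant items, two in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6])
```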

Re: mahout ssvd tuning problem

2013-05-22 Thread Sean Owen
I feel like I've seen this too and it's just a bug. You're not running out of memory. Are you also setting io.sort.factor? That can help too. You might try as high as 100. Also have you tried a Combiner? If you can apply it, it should help too, as it is designed to reduce the amount of stuff

Re: mahout ssvd tuning problem

2013-05-22 Thread Sean Owen
about combiner. Thanks for your answer. On 22.05.2013 14:59, Sean Owen wrote: I feel like I've seen this too and it's just a bug. You're not running out of memory. Are you also setting io.sort.factor? that can help too. You might try as high as 100. Also have you tried

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
It doesn't matter, in the sense that it is never going to be fast enough for real-time at any reasonable scale if actually run off a database directly. One operation results in thousands of queries. It's going to read data into memory anyway and cache it there. So, whatever is easiest for you. The

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
19, 2013 at 8:00 PM, Sean Owen sro...@gmail.com wrote: It doesn't matter, in the sense that it is never going to be fast enough for real-time at any reasonable scale if actually run off a database directly. One operation results in thousands of queries. It's going to read data into memory anyway

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
/docs/RecommenderArchitecture.png Hope that helps Manuel On 19.05.2013 at 19:20, Sean Owen wrote: I'm first saying that you really don't want to use the database as a data model directly. It is far too slow. Instead you want to use a data model implementation that reads all of the data

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
for showing the past ratings of a user. Ahmet From: Sean Owen sro...@gmail.com To: Mahout User List user@mahout.apache.org Sent: Sunday, May 19, 2013 9:26 PM Subject: Re: Which database should I use with Mahout I think everyone is agreeing

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
an option for transferring a lot of data: https://github.com/facebook/scribe#readme I would suggest that you just start with the technology that you know best and then solve the problems as soon as you get them. /Manuel On 19.05.2013 at 20:26, Sean Owen wrote: I think everyone

Re: How to extend FileDataModel

2013-05-16 Thread Sean Owen
Why not? It's just the object reference that is local to the function. The Map itself is not, and is on the heap like everything else in the JVM. On Thu, May 16, 2013 at 2:19 AM, huangjia cucumbergua...@gmail.com wrote: Hi, I want to build a recommendation model based on Mahout. My dataset format

Re: How to execute RecommenderJob without preference value

2013-05-11 Thread Sean Owen
You can't have a blank line, if that's what you mean, yes. That's not a valid record. A terminal newline is fine. But the error seems to be something else: java.io.FileNotFoundException: File does not exist: /user/hadoop/temp/preparePreferenceMatrix/numUsers.bin

Re: ALSWR MovieLens 100k

2013-05-09 Thread Sean Owen
This sounds like overfitting. More features let you fit your training set better, but at some point, fitting too well means you fit other test data less well. Lambda resists overfitting, so setting it too low increases the overfitting problem. I assume you still get better test set results with

Re: ALSWR MovieLens 100k

2013-05-09 Thread Sean Owen
it is: http://i.imgur.com/3e1eTE5.png I've used 75% for training and 25% for evaluation. Well reasonably lambda gives close enough results, however not better. Thanks, Bernát GÁBOR On Thu, May 9, 2013 at 2:46 PM, Sean Owen sro...@gmail.com wrote: This sounds like overfitting. More features

Re: ALSWR MovieLens 100k

2013-05-09 Thread Sean Owen
). Bernát GÁBOR On Thu, May 9, 2013 at 3:05 PM, Sean Owen sro...@gmail.com wrote: (The MAE metric may also be a complicating issue... it's measuring average error where all elements are equally weighted, but as the WR suggests in ALS-WR, the loss function being minimized weights different

Re: ALSWR MovieLens 100k

2013-05-09 Thread Sean Owen
and the one used for implicit data). @Gabor, what do you specify for the constructor argument usesImplicitFeedback ? On 09.05.2013 15:33, Sean Owen wrote: RMSE would have the same potential issue. ALS-WR is going to prefer to minimize one error at the expense of letting another get much larger

Re: ALSWR MovieLens 100k

2013-05-09 Thread Sean Owen
Yes, you overfit the training data set, so you under-fit the test set. I'm trying to suggest why more degrees of freedom (features) makes for a worse fit. It doesn't, on the training set, but those same parameters may fit the test set increasingly badly. It doesn't make sense to evaluate on a

Re: Question about evaluating a Recommender System

2013-05-08 Thread Sean Owen
It is true that a process based on user-user similarity only won't be able to recommend item 4 in this example. This is a drawback of the algorithm and not something that can be worked around. You could try not to choose this item in the test set, but then that does not quite reflect reality in

Re: Question about evaluating a Recommender System

2013-05-08 Thread Sean Owen
this? Any help will be highly appreciated. Best Regards, Jimmy Zhongduo Lin (Jimmy) MASc candidate in ECE department University of Toronto On 2013-05-08 4:44 AM, Sean Owen wrote: It is true that a process based on user-user similarity only won't be able to recommend item 4 in this example

Re: Question about evaluating a Recommender System

2013-05-08 Thread Sean Owen
relative to the variance of the data set using Mahout? Unfortunately I got an error using the precision and recall evaluation method, I guess that's because the data are too sparse. Best Regards, Jimmy On 13-05-08 10:05 AM, Sean Owen wrote: It may be true that the results are best

Re: Question about evaluating a Recommender System

2013-05-08 Thread Sean Owen
It may be selected as a test item. Other algorithms can predict the '4'. The test process is random so as to not favor one algorithm. I think you are just arguing that the algorithm you are using isn't good for your data -- so just don't use it. Is that not the answer? I don't know what you mean

Re: Question about evaluating a Recommender System

2013-05-08 Thread Sean Owen
absolute difference or RMSE. How can I say RMSE is worse relative to the variance of the data set using Mahout? Unfortunately I got an error using the precision and recall evaluation method, I guess that's because the data are too sparse. Best Regards, Jimmy On 13-05-08 10:05 AM, Sean Owen wrote

Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
If you have no ratings, how are you using RMSE? This typically measures error in reconstructing ratings. I think you are probably measuring something meaningless. On Mon, May 6, 2013 at 10:17 AM, William icswilliam2...@gmail.com wrote: I have a dataset about user and movie (no rate). But I want to

Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
wrote: Sean Owen srowen at gmail.com writes: If you have no ratings, how are you using RMSE? this typically measures error in reconstructing ratings. I think you are probably measuring something meaningless. I suppose the rate of seen movies are 1. Is it right? If I use Collaborative

Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
Mahout has algorithms for one-class collaborative filtering. On Mon, May 6, 2013 at 1:42 PM, Sean Owen sro...@gmail.com wrote: ALS-WR weights the error on each term differently, so the average error doesn't really have meaning here, even if you are comparing the difference with 1. I think you

Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
? Are there matrix factorization algorithms in Mahout which can work with this kind of data (that is, the kind of data which consists of users and the movies they have seen). On Mon, May 6, 2013 at 10:34 PM, Sean Owen sro...@gmail.com wrote: Yes, it goes by the name 'boolean prefs' in the project

Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
only 1's. On Mon, May 6, 2013 at 11:29 PM, Sean Owen sro...@gmail.com wrote: Parallel ALS is exactly an example of where you can use matrix factorization for 0/1 data. On Mon, May 6, 2013 at 9:22 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Hi Sean, Isn't boolean preferences

Re: Clustering product views and sales

2013-05-06 Thread Sean Owen
It sounds like you don't quite have a cold start problem. You have a few behaviors, a few views or clicks, not zero. So you really just need to find an approach that's quite comfortable with sparse input. A low-rank factorization model like ALS works fine in this case, for example. There's a

Re: Mahout database pooling best practice

2013-05-01 Thread Sean Owen
Rather, it needs to extend ConnectionPoolDataSource. But you can ignore it if you're sure you are using a pooling implementation. You might just double-check that. On Wed, May 1, 2013 at 9:25 AM, Mugoma Joseph O. mug...@yengas.com wrote: Thanks Sean. From source, AbstractJDBCDataModel.java

Re: Fold-in for ALSWR

2013-04-30 Thread Sean Owen
I should say that it depends of course on what you are implementing. You can also write an algorithm to factor R, not P. If you're doing that, then I would not expect values to be so low. But I thought you were following the version where you factor P = R != 0. Multiplying by 3 and adding 1 would

Re: Time Based Recommender System

2013-04-30 Thread Sean Owen
No, time is in the data model but nothing uses it that I know of. On Tue, Apr 30, 2013 at 3:18 PM, Chirag Lakhani clakh...@zaloni.com wrote: I was wondering if the collaborative filtering library in Mahout has any algorithms that incorporate concept drift i.e. time dynamics. From my own

Re: Time Based Recommender System

2013-04-30 Thread Sean Owen
GraphLab -- http://docs.graphlab.org/collaborative_filtering.html#SVD_PLUS_PLUS On Tue, Apr 30, 2013 at 3:30 PM, Chirag Lakhani clakh...@zaloni.com wrote: Do you know of any other large scale machine learning platforms that do incorporate it? On Tue, Apr 30, 2013 at 10:21 AM, Sean Owen sro

Re: Mahout database pooling best practice

2013-04-29 Thread Sean Owen
If you are actually using a connection pool, ignore it, it just means the implementation doesn't appear to extend the usual connection pool class in the JDK. Just make sure you are in fact using this class and you're fine. On Tue, Apr 30, 2013 at 4:01 AM, Mugoma Joseph O. mug...@yengas.com wrote:

Re: Fold-in for ALSWR

2013-04-29 Thread Sean Owen
ALS-WR is not predicting your input matrix R, but the matrix P which is R != 0. It is not predicting ratings, but a 0/1 indicator of whether the connection exists. So the values are usually in [0,1]. On Tue, Apr 30, 2013 at 2:40 AM, Chloe chloe.gu...@gmail.com wrote: Dear Sean, Thanks a lot
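The matrix P described here is just the element-wise indicator of R. A tiny sketch, with R as a hypothetical list of rows where 0 means "no interaction":

```python
def indicator_matrix(R):
    """Return P where P[i][j] is 1 if R[i][j] is non-zero, else 0."""
    return [[1 if r != 0 else 0 for r in row] for row in R]


R = [[5, 0, 3],
     [0, 0, 1],
     [4, 2, 0]]
P = indicator_matrix(R)
```

Since the factorization approximates P rather than R, predicted values mostly fall in [0,1] no matter how large the original ratings were.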

Re: Mahout Similarity Caching

2013-04-23 Thread Sean Owen
+ which is way too much for what I need. Thanks, Bernát GÁBOR On Tue, Apr 23, 2013 at 12:53 AM, Sean Owen sro...@gmail.com wrote: 49 seconds is orders of magnitude too long -- something is very wrong here, for so little data. Are you running this off a database? or are you somehow counting

Re: Mahout Similarity Caching

2013-04-23 Thread Sean Owen
I agree, but how is pre-adding a cached value for X different than requesting X from the cache? Either way you get X in the cache. Computing offline seems the same as computing on-line, but in some kind of warm-up state or phase. Which can be concurrent with serving early requests even. You can do

Re: Mahout Similarity Caching

2013-04-22 Thread Sean Owen
49 seconds is orders of magnitude too long -- something is very wrong here, for so little data. Are you running this off a database? or are you somehow counting the overhead of 3-4K network calls? On Mon, Apr 22, 2013 at 11:22 PM, Gabor Bernat ber...@primeranks.net wrote: Hello, I'm using

Re: Error creating assembly archive job: error in opening zip file

2013-04-18 Thread Sean Owen
Probably a corrupt download inside Maven. Delete ~/.m2/repository entirely. On Apr 19, 2013 12:23 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hm. This is really not a known error. Which suggests something really platitudinarian: open file handle limits? lack of disk space? Sorry if that's not

Re: Boosting User-Based with the user's attributes

2013-04-17 Thread Sean Owen
a lot for the insight, very useful! * Agata Filiana Erasmus Mundus DMKM Student 2011-2013 http://www.em-dmkm.eu/ * On 16 April 2013 16:40, Sean Owen sro...@gmail.com wrote: Of course it's not meaningless. They provide a basis for ranking items, so you can return top-K recommendations

Re: Boosting User-Based with the user's attributes

2013-04-16 Thread Sean Owen
In the usual recommender, the output is a weighted average of ratings. In a model where there are no ratings, this has no meaning -- everything is 1 implicitly. So the output is something else, and here it's a sum of similarities actually. On Tue, Apr 16, 2013 at 3:05 PM, Agata Filiana
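A hedged sketch of what "a sum of similarities" can look like for rating-free data. It assumes preferences are a dict from user to a set of items, uses Jaccard (Tanimoto) overlap as a stand-in similarity, and skips neighborhood selection entirely; none of the names below come from Mahout.

```python
def jaccard(a, b):
    """Overlap of two item sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_items(user, prefs, similarity=jaccard):
    """Score each unseen item by summing the similarities of the other
    users who have it. With no ratings there is nothing to average."""
    seen = prefs[user]
    scores = {}
    for other, items in prefs.items():
        if other == user:
            continue
        sim = similarity(seen, items)
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + sim
    return scores


prefs = {
    "a": {1, 2},
    "b": {1, 2, 3},
    "c": {2, 4},
    "d": {1, 3},
}
scores = score_items("a", prefs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The scores are not predicted ratings, but they still order the candidate items, which is all a top-K list needs.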

Re: Boosting User-Based with the user's attributes

2013-04-16 Thread Sean Owen
Of course it's not meaningless. They provide a basis for ranking items, so you can return top-K recommendations. If it's normally based on similarity and ratings -- and you have no ratings -- similarity is of course the only thing you can base the result on. On Tue, Apr 16, 2013 at 3:36 PM, Agata

Re: log-likelihood ratio value in item similarity calculation

2013-04-12 Thread Sean Owen
Yes that's true, it is more usually bits. Here it's natural log / nats. Since it's unnormalized anyway another constant factor doesn't hurt and it means not having to change the base. On Fri, Apr 12, 2013 at 8:01 AM, Phoenix Bai baizh...@gmail.com wrote: I got 168, because I use log base 2
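To make the "constant factor" point concrete, here is a minimal sketch of converting a log-likelihood ratio between nats and bits. The function name is made up for illustration; the value 116.69 is the figure quoted elsewhere in this thread.

```python
import math

def nats_to_bits(value_in_nats):
    """Convert a quantity measured with natural logs (nats) to base-2 logs (bits)."""
    # log2(x) = ln(x) / ln(2), so the two units differ only by the factor 1 / ln(2).
    return value_in_nats / math.log(2)

llr_nats = 116.69                   # value quoted in the thread (natural log)
llr_bits = nats_to_bits(llr_nats)   # same statistic expressed with log base 2
print(round(llr_bits))              # rounds to 168, the base-2 figure mentioned
```

Because the factor is the same for every pair, rankings by LLR do not depend on which base is used.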

Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Sean Owen
This sounds like just a most-similar-items problem. That's good news because that's simpler. The only question is how you want to compute item-item similarities. That could be based on user-item interactions. If you're on Hadoop, try the RowSimilarityJob (where you will need rows to be items,

Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Sean Owen
#A and #C to other users who order #B ... I still don't want this if the items are similar and/or the users similar. Cheers Billy On 11 Apr 2013 18:28, Sean Owen sro...@gmail.com wrote: This sounds like just a most-similar-items problem. That's good news because that's simpler. The only

Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Sean Owen
. These may be much more valuable for cross-sell than things in the same order. On Thu, Apr 11, 2013 at 12:50 PM, Sean Owen sro...@gmail.com wrote: You can try treating your orders as the 'users'. Then just compute item-item similarities per usual. On Thu, Apr 11, 2013 at 7:59 PM, Billy b
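As a rough illustration of "treating your orders as the 'users'", the sketch below counts how often two items appear in the same order. The order data and the function name are hypothetical, and raw co-occurrence is only the simplest possible similarity; it is not the RowSimilarityJob implementation.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(orders):
    """Count, for each unordered item pair, the number of orders containing both."""
    counts = defaultdict(int)
    for items in orders:
        # Deduplicate and sort so each pair is counted once per order,
        # always under the same (a, b) key with a < b.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return dict(counts)

# Hypothetical orders: each inner list plays the role of one "user".
orders = [["A", "B"], ["A", "B", "C"], ["B", "C"]]
pair_counts = cooccurrence(orders)
```

These counts could then feed any item-item similarity measure, such as the log-likelihood ratio discussed in the neighbouring threads.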

Re: log-likelihood ratio value in item similarity calculation

2013-04-11 Thread Sean Owen
Yes I also get (er, Mahout gets) 117 (116.69), FWIW. I think the second question concerned counts vs relative frequencies -- normalized, or not. Like whether you divide all the counts by their sum or not. For a fixed set of observations that does change the LLR because it is unnormalized, not
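The counts-versus-frequencies point can be checked with a small sketch of the log-likelihood ratio for a 2x2 contingency table, written in the entropy form with natural logs. The function names and the example counts are assumptions for illustration, not the numbers from the thread.

```python
import math

def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy: N*log(N) - sum(k*log(k)), with N = sum of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) of a 2x2 table, in nats."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

# Hypothetical counts: both events, only first, only second, neither.
base = llr(10, 5, 5, 80)
doubled = llr(20, 10, 10, 160)                  # same proportions, twice the data
as_frequencies = llr(0.10, 0.05, 0.05, 0.80)    # counts divided by their sum (100)
```

Doubling every count doubles the statistic, and dividing the counts by their sum of 100 divides it by 100, which is what "unnormalized" means here.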

Re: log-likelihood ratio value in item similarity calculation

2013-04-10 Thread Sean Owen
These events do sound 'similar'. They occur together about half the time either one of them occurs. You might have many pairs that end up being similar for the same reason, and this is not surprising. They're all really similar. The mapping here from LLR's range of [0,inf) to [0,1] is pretty
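One simple monotone way to squash a value from [0, inf) into [0, 1) is 1 - 1/(1 + x). I believe this is the form Mahout's log-likelihood similarity uses, but treat that as an assumption rather than a statement about the actual code; the function name below is made up.

```python
def llr_to_similarity(llr_value):
    """Map a non-negative LLR to [0, 1); larger LLR gives a value closer to 1."""
    if llr_value < 0:
        raise ValueError("LLR is expected to be non-negative")
    return 1.0 - 1.0 / (1.0 + llr_value)

weak = llr_to_similarity(1.0)       # 0.5
strong = llr_to_similarity(116.69)  # already very close to 1
```

The curve saturates quickly, so many pairs with quite different LLR values end up with similarities that are all close to 1.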

Re: log-likelihood ratio value in item similarity calculation

2013-04-10 Thread Sean Owen
, 2013 at 5:50 PM, Sean Owen sro...@gmail.com wrote: These events do sound 'similar'. They occur together about half the time either one of them occurs. You might have many pairs that end up being similar for the same reason, and this is not surprising. They're all really similar. The mapping here

Re: Fold-in for ALSWR

2013-04-10 Thread Sean Owen
For simplicity let's consider a brand-new user first, not a new rating for existing user. I'll use the notation from my slides that you mention, A = X * Y'. To clarify, I think you mean you have a new A_u row, and want to know X_u. The two expressions are not alternatives, they're the same thing,

Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-08 Thread Sean Owen
. Anyway -- long story short, a simple check on the inf norm of X' * X or Y' * Y seems to suffice to decide that lambda is too big and go complain about it rather than proceed. On Sun, Apr 7, 2013 at 10:00 AM, Sean Owen sro...@gmail.com wrote: All that said I don't think inverting is the issue here

Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-07 Thread Sean Owen
...@gmail.com wrote: Okay, you do have a problem. Y'*Y is 10x10, but its rank is 5. Has to have something to do with the input data. On Sat, Apr 6, 2013 at 7:47 PM, Sean Owen sro...@gmail.com wrote: For example, here's Y: Y = -0.278098 -0.256438 0.127559 -0.045869 -0.769172 -0.255599

Re: I believe the TanimotoSimilarity scorer actually uses the Jaccard similarity measure

2013-04-07 Thread Sean Owen
I had not heard of Tanimoto being generalized to n-way similarity, but then again, I can't say I know much at all authoritative about the term. The Wikipedia page says it's incorrectly used to describe a lot of things. Here, we're only looking at 2-way comparisons, pair-wise similarity. As far as
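For the pair-wise, set-based case discussed here, the measure in question is the size of the intersection divided by the size of the union. A minimal sketch follows; the function name is illustrative, and returning 0.0 for two empty sets is my own assumption, not necessarily what Mahout does.

```python
def tanimoto(items_a, items_b):
    """Pair-wise Jaccard/Tanimoto coefficient of two item collections."""
    a, b = set(items_a), set(items_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: define the similarity of two empty sets as 0
    return len(a & b) / union

# Two hypothetical users who share 2 of 4 distinct items.
score = tanimoto({1, 2, 3}, {2, 3, 4})  # 2 / 4 = 0.5
```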
