No, this is pretty wrong. Spark is not, in general, a real-time
anything. Spark Streaming is a near-real-time streaming framework, but
it is not something you can build models with. Spark MLlib / ML are
offline / batch. Not sure what you mean by Hadoop engine, but Spark
does not build on
I have used thumbs-down-like interactions as a kind of anti-click, one
that subtracts from the interaction between the user and item. Negative
scores can be applied naturally in a matrix-factorization model
like ALS, but that's not the situation here.
Others probably have better first-hand
Yeah I've turned that over in my head. I am not sure I have a great
answer. But I interpret the net effect to be that the model prefers
simple explanations for active users, at the cost of more error in the
approximation. One would rather pick a basis that more naturally
explains the data observed
From looking at the code recently, no, it is not handled.
On Tue, Apr 22, 2014 at 1:27 PM, Himanshu himanshu.ash...@gmail.com wrote:
In Weka it is possible to mark a field with a question mark (?) for unknown
values, and these are handled. Is there a similar way to mark an
unknown/missing field
-Original Message-
From: Sean Owen [mailto:sro...@gmail.com]
Sent: Wednesday, 2 April 2014 4:05 PM
To: Mahout User List
Subject: Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1
Hm, OK something
-Original Message-
From: Sean Owen [mailto:sro...@gmail.com]
Sent: Monday, 31 March 2014 7:05 PM
To: Mahout User List
Subject: RE: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1
But you have a bunch of Hadoop 0.20 jars on your
This may be getting into you're-on-your-own territory since you're
modifying the build. This error means your directory structure doesn't
match the declarations. You said somewhere that the parent of
module X was Y, but the location given points to the pom of a module
that isn't Y.
On Wed, Apr
Thanks and Regards,
Truong Phan
P+ 61 2 8576 5771
M + 61 4 1463 7424
E troung.p...@team.telstra.com
W www.telstra.com
-Original Message-
From: Sean Owen [mailto:sro...@gmail.com]
Sent
But you have a bunch of Hadoop 0.20 jars on your classpath! Definitely a
problem. Those should not be there.
On Mar 31, 2014 7:09 AM, Phan, Truong Q troung.p...@team.telstra.com
wrote:
Yes, I did rebuild it.
oracle@bpdevdmsdbs01:
Profiled what exactly, a Hadoop job? If you profile a client, you aren't
learning anything about the work, but just that the client process is
blocked waiting for Hadoop jobs to complete.
On Mar 30, 2014 10:08 AM, Mahmood Naderan nt_mahm...@yahoo.com wrote:
Hi,
I profiled the Mahout command
Are you sure?
Are
you crazy?) would be more palatable to some teams than installing
tarballs, is what I'm getting at.
On Wed, Mar 5, 2014 at 1:30 PM, Sean Owen sro...@gmail.com wrote:
You can always install whatever version of anything on your cluster
that you want
thought someone cleaned
that up...
On Thu, Mar 6, 2014 at 3:34 PM, Kevin Moulart kevinmoul...@gmail.com wrote:
Ok so should I try and recompile and change the guava version to 11.0.2 in
the pom ?
Kévin Moulart
2014-03-06 16:26 GMT+01:00 Sean Owen sro...@gmail.com:
That's gonna be a Guava
CDH 4.5 and 4.6 are both 0.7 + patches. Neither contains 0.8, since it
has (tiny) breaking changes vs 0.7 and this is a minor version update.
CDH5 contains 0.8 + patches. I did not say CDH4 has 0.8 -- re-read the
message of mine that was quoted.
I don't follow what here makes you say they are cut down releases?
They are release plus patches not release minus patches.
The question is not about how to use 0.7, but how to use 1.0-SNAPSHOT.
Why would switching to the official 0.7 release help?
I think the answer is you build Mahout for
, Sean Owen sro...@gmail.com wrote:
I don't follow what here makes you say they are cut down releases?
meaning it seems to be pretty much 2 releases behind the official. But I
definitely don't follow CDH developments in this department; you seem in a
better position to explain the existing
be dragons warning. I know that complicates
things but people do use your releases a long time. I personally wished I
could upgrade Pig on CDH 4 for new features but there was no simple way on
a managed cluster.
On Wed, Mar 5, 2014 at 12:12 PM, Sean Owen sro...@gmail.com wrote:
I don't understand
Agree that 'merging' is so infeasible as to not make sense. Mahout has
been ML on M/R and that's its thing, which seems fine. IMHO this
project has been hurt by an active unwillingness to define scope, and
by pretending it's helpful to have little bits of lots of ideas and
technologies.
I also
To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. It may not be obvious if you haven't looked
at the code how completely dependent on M/R it is.
You can swap out M/R and Spark if you
FYI, CDH5 includes version 0.8 + patches. But 0.9 should work fine
with CDH4. You do have to build with the Hadoop 2.x profile, as usual.
On Tue, Feb 18, 2014 at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Bikash,
Don't use that version. Use a more recent release. We can't help that
Try LogLikelihoodSimilarity.
On Wed, Feb 12, 2014 at 9:06 AM, 12481...@qq.com 12481...@qq.com wrote:
Hi Sean, you said It depends what ItemSimilarity you are using.
what kind of ItemSimilarity can work correctly without preference?
thanks.
Yeah that's the version that's bundled with 4.x. 5.x has basically 0.8
plus patches to work on MR2.
Mahout is not really something you have to install. Even though it
does get packaged and dumped onto the cluster nodes. Just use it
against your cluster -- it can be from a machine that isn't part
Yes I looked at the impl here, and I think it is aging, since I'm not
sure Deneche had time to put in many bells or whistles at the start,
and not sure it's been touched much since.
My limited experience is that it generally does less clever stuff than
R, which in turn is less clever than sklearn
On Wed, Sep 11, 2013 at 12:22 AM, Parimi Rohit rohit.par...@gmail.comwrote:
1. Do we have to follow this setting, to compare algorithms? Can't we
report the parameter combination for which we get highest mean average
precision for the test data, when trained on the train set, without any
You are trying to run on Hadoop 2, and Mahout only works with Hadoop 1 and
related branches. This won't work.
However the CDH distributions also come in an 'mr1' flavor that stands a
much better chance of working with something that is built for Hadoop 1.
Use 2.0.0-mr1-4.3.1 instead. (PS 4.3.2 and
The feature vectors? rows of X and Y? no, they definitely should not be
normalized. It will change the approximation you so carefully built quite a
lot.
As you say, U and V are orthonormal in the SVD. But you still multiply all
of them together with Sigma when making recs. (Or you embed Sigma in
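The point about Sigma can be seen in a tiny NumPy sketch (toy matrix and values, not Mahout code): the orthonormal factors alone do not reproduce the data; the singular values carry all the scale.

```python
import numpy as np

# Toy ratings matrix (users x items); values are invented for illustration.
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Scores must include Sigma: U and V alone are just orthonormal directions.
scores = U @ np.diag(s) @ Vt        # reconstructs R (up to float error)
print(np.allclose(scores, R))       # True

# Dropping Sigma (i.e. "normalizing" the factors) destroys the approximation.
no_sigma = U @ Vt
print(np.allclose(no_sigma, R))     # False
```

Whether you keep Sigma separate or fold it into one (or both) of the factors is a matter of bookkeeping, but it has to be multiplied in somewhere.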
I think it all minimally works on Hadoop 2.0.x, though I haven't tried
it recently -- it does require a recompile.
This is different from it working on MRv2 versus MRv1. I'm almost
certain it does not work on MRv2 and doubt it will.
The effort is not large, but it's subtle. A few hacks may fail
On Thu, Aug 1, 2013 at 3:15 AM, Chloe Guszo chloe.gu...@gmail.com wrote:
If I split my data into train and test sets, I can show good performance of
Good performance according to what metric? It makes a lot of
difference whether you are talking about precision/recall or RMSE.
the model on the
FWIW I know Mahout 0.8 works fine with CDH4 (the mr1 version of
course) and is what CDH5 will include. Should be no problems there.
On Wed, Jul 31, 2013 at 4:33 PM, Marco zentrop...@yahoo.co.uk wrote:
great. at least i know what's wrong :)
will check out if cloudera supports mahout 0.8.
Here's just one perspective --
Yes this is kind of how things like ALS work. The input values are
viewed as 'weights', not ratings. They're not reconstructed directly
but used as a weight in a loss function. This turns out to make more
sense when paired with a squared-error loss function, as it
This may be relevant enough to announce here:
http://blog.cloudera.com/blog/2013/07/myrrix-joins-cloudera-to-bring-big-learning-to-hadoop/
(Brief recap: Myrrix is a product / project / tiny company related to
large scale-recommenders, and shares some APIs and background with
Mahout.)
I think
This is nothing to do with Mahout, but with how your Hadoop cluster is
configured. I assume you have turned on map / reduce output compression
and are using the LZO codec.
On Thu, Jul 4, 2013 at 11:06 AM, Sugato Samanta sugato@gmail.com wrote:
Hello,
I was trying to execute the recommendation
This is old-ish advice. I tend to favor UseParallelOldGC over G1GC,
even on Java 7, though G1 may be the default now.
The Old just means it also uses a parallel collector thread on the
old generation. In general it's good to make use of increasingly
multi-core machines by making GC
Yeah this has gone well off-road.
ALS is not non-deterministic because of hardware errors or cosmic
rays. It's also nothing to do with floating-point round-off, or
certainly, that is not the primary source of non-determinism to
several orders of magnitude.
ALS starts from a random solution and
On Tue, Jun 25, 2013 at 12:44 AM, Michael Kazekin kazm...@hotmail.com wrote:
But doesn't alternation guarantee convexity?
No, the problem remains non-convex. At each step, where half the
parameters are fixed, yes that constrained problem is convex. But each
of these is not the same as the
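A minimal NumPy sketch of the alternation (toy data and parameters, not Mahout's implementation): each half-step is a convex ridge problem solved exactly, the random starting point is what makes separate runs differ, and the joint objective only descends to a local minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.random((6, 5))            # toy dense matrix to factor (values invented)
k, lam = 2, 0.1                   # rank and regularization, illustrative only

X = rng.random((6, k))            # random starting point: the reason two
Y = rng.random((5, k))            # ALS runs give different factors

def loss():
    fit = np.linalg.norm(R - X @ Y.T) ** 2
    reg = lam * (np.linalg.norm(X) ** 2 + np.linalg.norm(Y) ** 2)
    return fit + reg

before = loss()
for _ in range(20):
    # With Y fixed, every row of X has a closed-form convex solution,
    # and vice versa -- but the combined problem in (X, Y) is non-convex.
    X = np.linalg.solve(Y.T @ Y + lam * np.eye(k), Y.T @ R.T).T
    Y = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ R).T
after = loss()
```

Each sweep is guaranteed not to increase the regularized loss, which is exactly the "each step is convex" property; it says nothing about reaching a global optimum.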
someone can check my facts here, but the log-likelihood ratio follows
a chi-square distribution. You can figure an actual probability from
that in the usual way, from its CDF. You would need to tweak the code
you see in the project to compute an actual LLR by normalizing the
input.
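The statistic and p-value can be sketched in a few lines. This mirrors the entropy-based formula in Mahout's LogLikelihood class, but it is my own transcription, not the project's code; the counts are invented.

```python
from math import log, sqrt, erfc

def xlogx(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    # Unnormalized entropy, N log N - sum x log x, in natural log (nats).
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Log-likelihood ratio test statistic for a 2x2 contingency table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def p_value(g):
    # LLR is asymptotically chi-square with 1 degree of freedom here,
    # so the p-value is its survival function at g.
    return erfc(sqrt(g / 2.0))

g = llr(10, 0, 0, 10)    # perfectly dependent events: 40*ln(2), ~27.73 nats
```

For independent counts like (5, 5, 5, 5) the statistic is 0, and larger values give vanishingly small p-values via the chi-square CDF.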
You could use
being similar is 1 - p (which is exactly the
CDF for that value of X).
Now, my question is: in the contingency table case, why would I normalize?
It's a ratio already, isn't it?
On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen sro...@gmail.com wrote:
someone can check my facts here, but the log
?
On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen sro...@gmail.com wrote:
I think the quickest answer is: the formula computes the test
statistic as a difference of log values, rather than log of ratio of
values. By not normalizing, the entropy is multiplied by a factor (sum
of the counts) vs
Yes the model has no room for literally negative input. I think that
conceptually people do want negative input, and in this model,
negative numbers really are the natural thing to express that.
You could give negative input a small positive weight. Or extend the
definition of c so that it is
I'm suggesting using numbers like -1 for thumbs-down ratings, and then
using these as a positive weight towards 0, just like positive values
are used as positive weighting towards 1.
Most people don't make many negative ratings. For them, what you do
with these doesn't make a lot of difference.
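That scheme can be sketched in a few lines (values invented; this follows the implicit-feedback formulation where the preference target is 0/1 and the magnitude of the input becomes a confidence weight):

```python
import numpy as np

# Raw interaction strengths: positive = clicks/likes, -1 = thumbs-down,
# 0 = no observation. The numbers are made up for illustration.
R = np.array([[3.0, -1.0,  0.0],
              [0.0,  2.0, -1.0]])

alpha = 40.0                     # confidence scaling factor (illustrative)

P = (R > 0).astype(float)        # target: 1 for positive interactions, else 0
C = 1.0 + alpha * np.abs(R)      # confidence: a thumbs-down contributes a
                                 # *positive* weight pulling the prediction
                                 # toward 0, not a negative weight
```

So a -1 rating is not fed in literally; it becomes high confidence that the preference is 0, which is the natural extension of the definition of c mentioned above.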
Is it compatible with any Hadoop release? of course, would it make sense if not?
I'm not sure where you get this idea. 0.5 was, I think, compiled vs
0.20.x. The last release was vs 1.0.3 or so. The current release is vs
1.1.x. In all cases these are the latest stable Apache releases, so
not sure
is really an
ancient release.
We have two or three issues left for 0.8, then we'll have a code freeze
and do testing before we release 0.8.
-sebastian
-Original Message-
From: Sean Owen [mailto:sro...@gmail.com]
Sent: Monday, June 17, 2013 4:53 PM
To: Mahout User List
Subject: Re
Yes you have to refer to the 'mrv1' artifacts if I recall correctly,
if you use CDH4. You are talking about CDH3, which is different.
On Mon, Jun 17, 2013 at 3:23 PM, cont...@dhuebner.com wrote:
Well, I just setup up CDH4 with Mahout for testing a few days ago. It still
required some fixing of
This is more of a Hadoop question. The input hides behind the
InputFormat implementation. If you have an InputFormat that can read
and produce the same key-value pairs that you'd get from a
SequenceFileInputFormat / TextInputFormat and HDFS, yes the rest just
works automatically. You have to
Use an implementation that doesn't expect a rating. These are
so-called 'boolean' implementations, like GenericBooleanPrefDataModel.
For example you can build an item-based recommender with the boolean
version of the item-based recommender and a log-likelihood similarity.
Or, yes you can calculate
I agree with deprecating all of that FWIW.
On Sat, Jun 8, 2013 at 6:33 PM, Grant Ingersoll gsing...@apache.org wrote:
Collaborative Filtering:
- all recommenders in o.a.m.cf.taste.impl.recommender.knn
- the TreeClusteringRecommender in o.a.m.cf.taste.impl.recommender
- the SlopeOne
In point 1, I don't think I'd say it that way. It's not true that
test/training is divided by user, because every user would either be
100% in the training or 100% in the test data. Instead you hold out
part of the data for each user, or at least, for some subset of users.
Then you can see whether
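A per-user holdout can be sketched like this (plain Python, not Mahout's evaluator; the holdout fraction and data are arbitrary):

```python
import random

def per_user_split(prefs, holdout=0.25, seed=0):
    # Hold out a fraction of each user's items -- every user contributes
    # to both train and test, rather than whole users being split off.
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in prefs.items():
        items = list(items)
        rng.shuffle(items)
        cut = max(1, int(len(items) * holdout))
        test[user] = items[:cut]
        train[user] = items[cut:]
    return train, test

prefs = {"u1": ["a", "b", "c", "d"], "u2": ["b", "c", "e", "f"]}
train, test = per_user_split(prefs)
```

The recommender is then trained on `train` and asked to recover each user's held-out items from `test`.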
, 2013 at 2:58 PM, Sean Owen sro...@gmail.com wrote:
In point 1, I don't think I'd say it that way. It's not true that
test/training is divided by user, because every user would either be
100% in the training or 100% in the test data. Instead you hold out
part of the data for each user
:50 PM, Sean Owen sro...@gmail.com wrote:
It depends on the algorithm I suppose. In some cases, the
already-known items would always be top recommendations and the test
would tell you nothing. Just like in an RMSE test -- if you already
know the right answers your score is always a perfect 0
I believe the suggestion is just for purposes of evaluation. You would
not return these items in practice, yes.
Although there are cases where you do want to return known items. For
example, maybe you are modeling user interaction with restaurant
categories. This could be useful, because as soon
Not sure, is this really related to Mahout?
I don't know of an equivalent of J2EE / Tomcat for C++, but there must
be something.
As a general principle, you will have to load your data into memory if
you want to perform the computations on the fly in real time. So how
you access the data isn't
There's nothing direct, but you can probably save yourself time by copying
the code that computes these stats and apply them to your pre-computed
values. It's not terribly complex, just counting the intersection and union
size and deriving some stats from it.
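For a Tanimoto-style similarity the counting really is short (a sketch over sets of user IDs, not the project's code):

```python
def tanimoto(a, b):
    # |A intersect B| / |A union B| over the sets of users who
    # interacted with each of the two items.
    a, b = set(a), set(b)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

sim = tanimoto({"u1", "u2", "u3"}, {"u2", "u3", "u4"})   # 2 / 4 = 0.5
```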
The split is actually based on value
I feel like I've seen this too and it's just a bug. You're not running
out of memory.
Are you also setting io.sort.factor? that can help too. You might try
as high as 100.
Also have you tried a Combiner? if you can apply it it should help too
as it is designed to reduce the amount of stuff
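The knobs above live in mapred-site.xml on an MRv1 cluster; a hedged example (property names are from Hadoop 1.x; the values are purely illustrative):

```xml
<configuration>
  <!-- Merge more spill files per pass during the sort phase. -->
  <property>
    <name>io.sort.factor</name>
    <value>100</value>
  </property>
  <!-- Larger in-memory sort buffer means fewer spills to disk. -->
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
</configuration>
```

A Combiner, by contrast, is set in the job code (via setCombinerClass), not in configuration.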
about combiner.
Thanks for your answer.
W dniu 22.05.2013 14:59, Sean Owen pisze:
I feel like I've seen this too and it's just a bug. You're not running
out of memory.
Are you also setting io.sort.factor? that can help too. You might try
as high as 100.
Also have you tried
It doesn't matter, in the sense that it is never going to be fast
enough for real-time at any reasonable scale if actually run off a
database directly. One operation results in thousands of queries. It's
going to read data into memory anyway and cache it there. So, whatever
is easiest for you. The
19, 2013 at 8:00 PM, Sean Owen sro...@gmail.com wrote:
It doesn't matter, in the sense that it is never going to be fast
enough for real-time at any reasonable scale if actually run off a
database directly. One operation results in thousands of queries. It's
going to read data into memory anyway
/docs/RecommenderArchitecture.png
Hope that helps
Manuel
Am 19.05.2013 um 19:20 schrieb Sean Owen:
I'm first saying that you really don't want to use the database as a
data model directly. It is far too slow.
Instead you want to use a data model implementation that reads all of
the data
for showing the past ratings of a user.
Ahmet
From: Sean Owen sro...@gmail.com
To: Mahout User List user@mahout.apache.org
Sent: Sunday, May 19, 2013 9:26 PM
Subject: Re: Which database should I use with Mahout
I think everyone is agreeing
an option for transferring a lot of data:
https://github.com/facebook/scribe#readme
I would suggest that you just start with the technology that you know best
and then solve the problems as you get to them.
/Manuel
Am 19.05.2013 um 20:26 schrieb Sean Owen:
I think everyone
Why not? It's just the object reference that is local to the function.
The Map itself is not; it lives on the heap like everything else in the
JVM.
On Thu, May 16, 2013 at 2:19 AM, huangjia cucumbergua...@gmail.com wrote:
Hi,
I want to build a recommendation model based on Mahout. My dataset format
You can't have a blank line, if that's what you mean, yes. That's not
a valid record. A terminal newline is fine.
But the error seems to be something else:
java.io.FileNotFoundException: File does not exist:
/user/hadoop/temp/preparePreferenceMatrix/numUsers.bin
This sounds like overfitting. More features lets you fit your training
set better, but at some point, fitting too well means you fit other
test data less well. Lambda resists overfitting, so setting it too low
increases the overfitting problem.
I assume you still get better test set results with
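The effect lambda has can be seen in a plain ridge-regression sketch (synthetic data; this is an analogy for the regularizer's role, not ALS-WR itself): more regularization shrinks the solution, trading training fit for generalization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X[:, 0] + 0.1 * rng.standard_normal(50)   # only feature 0 truly matters

def ridge(X, y, lam):
    # Closed-form regularized least squares: (X'X + lam*I)^-1 X'y.
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

w_small = ridge(X, y, 0.001)   # nearly unregularized: free to overfit
w_big = ridge(X, y, 100.0)     # heavily regularized: weights shrink
```

The solution norm decreases monotonically as lambda grows, which is exactly the resistance to overfitting described above; set it too low and the extra degrees of freedom chase noise in the training set.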
it is: http://i.imgur.com/3e1eTE5.png
I've used 75% for training and 25% for evaluation.
Well, a reasonable lambda gives close enough results, though not better.
Thanks,
Bernát GÁBOR
On Thu, May 9, 2013 at 2:46 PM, Sean Owen sro...@gmail.com wrote:
This sounds like overfitting. More features
).
Bernát GÁBOR
On Thu, May 9, 2013 at 3:05 PM, Sean Owen sro...@gmail.com wrote:
(The MAE metric may also be a complicating issue... it's measuring
average error where all elements are equally weighted, but as the WR
suggests in ALS-WR, the loss function being minimized weights
different
and the one used for implicit data). @Gabor, what do you
specify for the constructor argument usesImplicitFeedback ?
On 09.05.2013 15:33, Sean Owen wrote:
RMSE would have the same potential issue. ALS-WR is going to prefer to
minimize one error at the expense of letting another get much larger
Yes, you overfit the training data set, so you under-fit the test
set. I'm trying to suggest why more degrees of freedom (features)
makes for a worse fit. It doesn't, on the training set, but those
same parameters may fit the test set increasingly badly.
It doesn't make sense to evaluate on a
It is true that a process based on user-user similarity only won't be
able to recommend item 4 in this example. This is a drawback of the
algorithm and not something that can be worked around. You could try
not to choose this item in the test set, but then that does not quite
reflect reality in
this?
Any help will be highly appreciated.
Best Regards,
Jimmy
Zhongduo Lin (Jimmy)
MASc candidate in ECE department
University of Toronto
On 2013-05-08 4:44 AM, Sean Owen wrote:
It is true that a process based on user-user similarity only won't be
able to recommend item 4 in this example
relative to the
variance of the data set using Mahout? Unfortunately I got an error using
the precision and recall evaluation method, I guess that's because the data
are too sparse.
Best Regards,
Jimmy
On 13-05-08 10:05 AM, Sean Owen wrote:
It may be true that the results are best
It may be selected as a test item. Other algorithms can predict the
'4'. The test process is random so as to not favor one algorithm.
I think you are just arguing that the algorithm you are using isn't
good for your data -- so just don't use it. Is that not the answer?
I don't know what you mean
absolute difference or RMSE. How can I say RMSE is worse
relative to the variance of the data set using Mahout? Unfortunately
I got an error using the precision and recall evaluation method, I
guess that's because the data are too sparse.
Best Regards,
Jimmy
On 13-05-08 10:05 AM, Sean Owen wrote
If you have no ratings, how are you using RMSE? this typically
measures error in reconstructing ratings.
I think you are probably measuring something meaningless.
On Mon, May 6, 2013 at 10:17 AM, William icswilliam2...@gmail.com wrote:
I have a dataset of users and movies (no ratings). But I want to
wrote:
Sean Owen srowen at gmail.com writes:
If you have no ratings, how are you using RMSE? this typically
measures error in reconstructing ratings.
I think you are probably measuring something meaningless.
I suppose the rating of seen movies is 1. Is that right?
If I use Collaborative
Mahout has algorithms for one-class
collaborative filtering.
On Mon, May 6, 2013 at 1:42 PM, Sean Owen sro...@gmail.com wrote:
ALS-WR weights the error on each term differently, so the average
error doesn't really have meaning here, even if you are comparing the
difference with 1. I think you
?
Are there matrix factorization algorithms in Mahout which can work
with this kind of data (that is, the kind of data which consists of
users and the movies they have seen).
On Mon, May 6, 2013 at 10:34 PM, Sean Owen sro...@gmail.com wrote:
Yes, it goes by the name 'boolean prefs' in the project
only 1's.
On Mon, May 6, 2013 at 11:29 PM, Sean Owen sro...@gmail.com wrote:
Parallel ALS is exactly an example of where you can use matrix
factorization for 0/1 data.
On Mon, May 6, 2013 at 9:22 PM, Tevfik Aytekin tevfik.ayte...@gmail.com
wrote:
Hi Sean,
Isn't boolean preferences
It sounds like you don't quite have a cold start problem. You have a
few behaviors, a few views or clicks, not zero. So you really just
need to find an approach that's quite comfortable with sparse input. A
low-rank factorization model like ALS works fine in this case, for
example.
There's a
Rather, it needs to extend ConnectionPoolDataSource. But you can
ignore it if you're sure you are using a pooling implementation. You
might just double-check that.
On Wed, May 1, 2013 at 9:25 AM, Mugoma Joseph O. mug...@yengas.com wrote:
Thanks Sean.
From source, AbstractJDBCDataModel.java
I should say that it depends of course on what you are implementing.
You can also write an algorithm to factor R, not P. If you're doing
that, then I would not expect values to be so low. But I thought you
were following the version where you factor P = R != 0.
Multiplying by 3 and adding 1 would
No, time is in the data model but nothing uses it that I know of.
On Tue, Apr 30, 2013 at 3:18 PM, Chirag Lakhani clakh...@zaloni.com wrote:
I was wondering if the collaborative filtering library in Mahout has any
algorithms that incorporate concept drift i.e. time dynamics. From my own
GraphLab -- http://docs.graphlab.org/collaborative_filtering.html#SVD_PLUS_PLUS
On Tue, Apr 30, 2013 at 3:30 PM, Chirag Lakhani clakh...@zaloni.com wrote:
Do you know of any other large scale machine learning platforms that do
incorporate it?
On Tue, Apr 30, 2013 at 10:21 AM, Sean Owen sro
If you are actually using a connection pool, ignore it, it just means
the implementation doesn't appear to extend the usual connection pool
class in the JDK. Just make sure you are in fact using this class and
you're fine.
On Tue, Apr 30, 2013 at 4:01 AM, Mugoma Joseph O. mug...@yengas.com wrote:
ALS-WR is not predicting your input matrix R, but the matrix P which
is R != 0. It is not predicting ratings, but a 0/1 indicator of
whether the connection exists. So the values are usually in [0,1].
On Tue, Apr 30, 2013 at 2:40 AM, Chloe chloe.gu...@gmail.com wrote:
Dear Sean,
Thanks a lot
+ which is way too much for what I need.
Thanks,
Bernát GÁBOR
On Tue, Apr 23, 2013 at 12:53 AM, Sean Owen sro...@gmail.com wrote:
49 seconds is orders of magnitude too long -- something is very wrong
here, for so little data. Are you running this off a database? or are
you somehow counting
I agree, but how is pre-adding a cached value for X different than
requesting X from the cache? Either way you get X in the cache.
Computing offline seems the same as computing on-line, but in some
kind of warm-up state or phase. Which can be concurrent with serving
early requests even. You can do
49 seconds is orders of magnitude too long -- something is very wrong
here, for so little data. Are you running this off a database? or are
you somehow counting the overhead of 3-4K network calls?
On Mon, Apr 22, 2013 at 11:22 PM, Gabor Bernat ber...@primeranks.net wrote:
Hello,
I'm using
Probably a corrupt download inside Maven. Delete ~/.m2/repository entirely
On Apr 19, 2013 12:23 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Hm. This is really not a known error. Which suggests something really
platitudinarian: open file handle limits? lack of disk space? Sorry if
that's not
a lot for the insight,very useful!
Agata Filiana
Erasmus Mundus DMKM Student 2011-2013 http://www.em-dmkm.eu/
On 16 April 2013 16:40, Sean Owen sro...@gmail.com wrote:
Of course it's not meaningless. They provide a basis for ranking
items, so you can return top-K recommendations
In the usual recommender, the output is a weighted average of ratings.
In a model where there are no ratings, this has no meaning --
everything is 1 implicitly. So the output is something else, and
here it's a sum of similarities actually.
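That scoring rule is short enough to sketch (invented similarity values, not real output): with no ratings to average, a candidate is ranked by summing its similarities to the items the user already has.

```python
# Toy item-item similarities; keys are sorted pairs of item IDs.
sim = {("a", "c"): 0.9, ("b", "c"): 0.4, ("a", "d"): 0.2}

def score(user_items, candidate):
    # Sum of similarities between the candidate and the user's items.
    return sum(sim.get(tuple(sorted((i, candidate))), 0.0)
               for i in user_items)

s = score({"a", "b"}, "c")   # 0.9 + 0.4 = 1.3
```

The absolute value of the score has no rating-like meaning; it only provides a basis for ranking and returning the top-K items.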
On Tue, Apr 16, 2013 at 3:05 PM, Agata Filiana
Of course it's not meaningless. They provide a basis for ranking
items, so you can return top-K recommendations.
If it's normally based on similarity and ratings -- and you have no
ratings -- similarity is of course the only thing you can base the
result on.
On Tue, Apr 16, 2013 at 3:36 PM, Agata
Yes that's true, it is more usually bits. Here it's natural log / nats.
Since it's unnormalized anyway another constant factor doesn't hurt and it
means not having to change the base.
On Fri, Apr 12, 2013 at 8:01 AM, Phoenix Bai baizh...@gmail.com wrote:
I got 168, because I use log base 2
This sounds like just a most-similar-items problem. That's good news
because that's simpler. The only question is how you want to compute
item-item similarities. That could be based on user-item interactions.
If you're on Hadoop, try the RowSimilarityJob (where you will need
rows to be items,
#A and #C to other users who order #B ... I still don't want this if the
items are similar and/or the users similar.
Cheers
Billy
On 11 Apr 2013 18:28, Sean Owen sro...@gmail.com wrote:
This sounds like just a most-similar-items problem. That's good news
because that's simpler. The only
. These may be much more
valuable for cross-sell than things in the same order.
On Thu, Apr 11, 2013 at 12:50 PM, Sean Owen sro...@gmail.com wrote:
You can try treating your orders as the 'users'. Then just compute
item-item similarities per usual.
On Thu, Apr 11, 2013 at 7:59 PM, Billy b
Yes I also get (er, Mahout gets) 117 (116.69), FWIW.
I think the second question concerned counts vs relative frequencies
-- normalized, or not. Like whether you divide all the counts by their
sum or not. For a fixed set of observations that does change the LLR
because it is unnormalized, not
These events do sound 'similar'. They occur together about half the
time either one of them occurs. You might have many pairs that end up
being similar for the same reason, and this is not surprising. They're
all really similar.
The mapping here from LLR's range of [0,inf) to [0,1] is pretty
, 2013 at 5:50 PM, Sean Owen sro...@gmail.com wrote:
These events do sound 'similar'. They occur together about half the
time either one of them occurs. You might have many pairs that end up
being similar for the same reason, and this is not surprising. They're
all really similar.
The mapping here
For simplicity let's consider a brand-new user first, not a new rating
for existing user. I'll use the notation from my slides that you
mention, A = X * Y'. To clarify, I think you mean you have a new A_u
row, and want to know X_u.
The two expressions are not alternatives, they're the same thing,
.
Anyway -- long story short, a simple check on the inf norm of X' * X
or Y' * Y seems to suffice to decide that lambda is too big and go
complain about it rather than proceed.
On Sun, Apr 7, 2013 at 10:00 AM, Sean Owen sro...@gmail.com wrote:
All that said I don't think inverting is the issue here
...@gmail.com wrote:
Okay, you do have a problem.
Y'*Y is 10x10, but its rank is 5.
Has to have something to do with the input data.
On Sat, Apr 6, 2013 at 7:47 PM, Sean Owen sro...@gmail.com wrote:
For example, here's Y:
Y =
-0.278098 -0.256438 0.127559 -0.045869 -0.769172 -0.255599
I had not heard of Tanimoto being generalized to n-way similarity, but
then again, I can't say I know much at all authoritative about the
term. The Wikipedia page says it's incorrectly used to describe a lot
of things. Here, we're only looking at 2-way comparisons, pair-wise
similarity. As far as