You can use the low-order bits, or have a look at what the UUID class
does to hash itself to 32 bits in hashCode() and emulate that for 64
bits. Collisions in a 64-bit space are very very very rare, enough to
not care about here by a wide margin. A collision only means you
confuse prefs from two
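A minimal sketch of the 64-bit fold being described (the class name and the sample UUID are mine, not from the thread; XOR-ing the UUID's two 64-bit halves parallels what UUID.hashCode() does to get 32 bits):

```java
import java.util.UUID;

public class UuidFold {
    // Fold a 128-bit UUID down to 64 bits, analogous to how
    // UUID.hashCode() folds it down to 32 bits.
    static long fold64(UUID id) {
        return id.getMostSignificantBits() ^ id.getLeastSignificantBits();
    }

    public static void main(String[] args) {
        // msb = 1, lsb = 2, so the fold is 1 ^ 2 = 3
        UUID id = UUID.fromString("00000000-0000-0001-0000-000000000002");
        System.out.println(fold64(id)); // prints 3
    }
}
```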
For example, here's Y:
Y =
-0.278098 -0.256438  0.127559 -0.045869 -0.769172 -0.255599  0.150450 -0.436548  0.209881 -0.526238
 0.613175 -0.600739 -0.291662 -1.142282  0.277204 -0.296846 -0.175122  0.031656 -0.202138 -0.254480
-0.187816 -0.889571  0.052191 -0.304053
(On this aside -- the Commons Math version uses Householder
reflections but operates on a transposed representation for just this
reason.)
On Thu, Apr 4, 2013 at 11:11 PM, Ted Dunning ted.dunn...@gmail.com wrote:
But then I started trying to build a HH version using vector ops and
realized
OK yes you're on to something here. I should clarify. Koobas you are
right that the ALS algorithm itself is fine here as far as my
knowledge takes me. The thing it inverts to solve for a row of X is
something like (Y' * Cu * Y + lambda * I). No problem there, and
indeed I see why the
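Written out in the standard implicit-feedback ALS notation (this is the textbook form of the solve being discussed, not a quote from the thread; C_u is the diagonal confidence matrix for user u and p(u) the 0/1 preference vector):

```latex
x_u = \left( Y^\top C_u Y + \lambda I \right)^{-1} Y^\top C_u \, p(u)
```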
This is more of a linear algebra question, but I thought it worth
posing to the group --
As part of a process like ALS, you solve a system like A = X * Y' for
X or for Y, given the other two. A is sparse (m x n); X and Y are tall
and skinny (m x k and n x k, where k << m, n)
For example to solve for
I think that's what I'm saying, yes. Small rows X shouldn't become
large rows of A -- and similarly small changes in X shouldn't mean
large changes in A. Not quite the same thing but both are relevant. I
see that this is just the ratio of largest and smallest singular
values. Is there established
the condition number but from what I learned this is
probably the thing you want to be looking at.
Good luck!
[1] http://www.math.ufl.edu/~kees/ConditionNumber.pdf
[2] http://www.rejonesconsulting.com/CS210_lect07.pdf
On Thu, Apr 4, 2013 at 5:26 PM, Sean Owen sro...@gmail.com wrote:
I
It might make a difference that you're just running 1 iteration. Normally
it's run to 'convergence' -- or here let's say, 10+ iterations to be safe.
This is the QR factorization of Y' * Y at the finish? This seems like it
can't be right... Y has only 5 vectors in 10 dimensions and Y' * Y is
No, just was never written I suppose back in the day. The way it is
structured now it creates a test split for each user, which is also
slow, and may run up against memory limitations, as that's N data
models in memory. You could take a crack at a patch.
When I rewrote this aspect in a separate
You should be able to get reproducible random seed values by calling
RandomUtils.useTestSeed() at the very start of your program. But if
your goal is to get an unbiased view of the quality of results, you
want to run several times and take the average yes.
On Sat, Mar 30, 2013 at 3:57 PM,
Yes it's OK. You need to take care of thread safety though, which will be
hard. The other problem is that changing the underlying data doesn't
necessarily invalidate caches above it. You'll have to consider that
part as well. I suppose this is part of why it was conceived as a
model where the data is
This is really a Hadoop-level thing. I am not sure I have ever
successfully induced M/R to run multiple mappers on less than one
block of data, even with a low max split size. Reducers you can
control.
On Thu, Mar 28, 2013 at 9:04 AM, Sebastian Briesemeister
Modify the existing code to change the SQL -- it's just a matter of
copying a class that only specifies SQL and making new SQL statements.
I think there's a version that even reads from a Properties object.
On Mon, Mar 25, 2013 at 12:11 AM, Matt Mitchell goodie...@gmail.com wrote:
Hi,
I have a
Points from across several e-mails --
The initial item-feature matrix can be just random unit vectors too. I
have slightly better results with that.
You are finding the least-squares solution of A = U M' for U given A
and M. Yes you can derive that analytically as the zero of the
derivative of
OK, the 'k iterations' happen inline in one job? I thought the Lanczos
algorithm found the k eigenvalues/vectors one after the other. Yeah I
suppose that doesn't literally mean k map/reduce jobs. Yes the broader
idea was whether or not you might get something useful out of ALS
earlier.
On Mon,
On Mon, Mar 25, 2013 at 11:25 AM, Sebastian Schelter s...@apache.org wrote:
Well in LSI it is ok to do that, as a missing entry means that the
document contains zero occurrences of a given term which is totally fine.
In Collaborative Filtering with explicit feedback, a missing rating is
not
a ClassNotFoundException
I'm using version 0.7 of mahout-core and mahout-math, and version 0.5 of
mahout-utils.
- Matt
On Mon, Mar 25, 2013 at 6:21 AM, Sean Owen sro...@gmail.com wrote:
I think you'd have to define not working first
On Mon, Mar 25, 2013 at 1:32 AM, Matt Mitchell goodie...@gmail.com
(The unobserved entries are still in the loss function, just with low
weight. They are also in the system of equations you are solving for.)
On Mon, Mar 25, 2013 at 1:38 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Classic ALS-WR bypasses the underlearning problem by cutting out unrated
On Mon, Mar 25, 2013 at 1:41 PM, Koobas koo...@gmail.com wrote:
But the assumption works nicely for click-like data. Better still when
you can weakly prefer to reconstruct the 0 for missing observations
and much more strongly prefer to reconstruct the 1 for observed
data.
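The weak-0 / strong-1 weighting described above can be sketched with the confidence function from the implicit-feedback ALS paper (the class name and alpha value here are illustrative):

```java
public class ImplicitWeights {
    // Confidence for an observation count r, as in implicit-feedback ALS:
    // unobserved cells keep weight 1 (weakly prefer reconstructing 0),
    // observed cells get weight 1 + alpha * r (strongly prefer the 1).
    static double confidence(double r, double alpha) {
        return 1.0 + alpha * r;
    }

    public static void main(String[] args) {
        System.out.println(confidence(0, 40)); // unobserved: 1.0
        System.out.println(confidence(3, 40)); // 3 clicks: 121.0
    }
}
```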
This does seem
On Mon, Mar 25, 2013 at 9:52 AM, Sean Owen sro...@gmail.com wrote:
On Mon, Mar 25, 2013 at 1:41 PM, Koobas koo...@gmail.com wrote:
But the assumption works nicely for click-like data. Better still when
you can weakly prefer to reconstruct the 0 for missing observations
and much more
be normalized in
a way?
Thank you and sorry for the basic questions.
Regards,
Agata Filiana
On 16 March 2013 13:41, Sean Owen sro...@gmail.com wrote:
There are many ways to think about combining these two types of data.
If you can make some similarity metric based on age, gender
sense? Or am I
confusing myself?
Agata
On 18 March 2013 14:23, Sean Owen sro...@gmail.com wrote:
You would have to make up the similarity metric separately since it
depends
entirely on how you want to define it.
The part of the book you are talking about concerns rescoring, which
One word of caution is that there are at least two papers on ALS and they
define lambda differently. I think you are talking about Collaborative
Filtering for Implicit Feedback Datasets.
I've been working with some folks who point out that alpha=40 seems to be
too high for most data sets. After
://labrosa.ee.columbia.edu/millionsong/tasteprofile
On 18.03.2013 17:47, Sean Owen wrote:
One word of caution, is that there are at least two papers on ALS and
they
define lambda differently. I think you are talking about Collaborative
Filtering for Implicit Feedback Datasets.
I've been working
somehow loop through the item data
and the hobby data and then combine the score for a pair of users?
I am having trouble combining both similarities into one metric;
could you possibly give me a clue?
Thank you
On 18 March 2013 14:54, Sean Owen sro...@gmail.com wrote
What's your question? ALS has a random starting point which changes the
results a bit. Not sure about KNN though.
On Sun, Mar 17, 2013 at 3:03 AM, Koobas koo...@gmail.com wrote:
Can anybody shed any light on the issue of reproducibility in Mahout,
with and without Hadoop, specifically in the
of, a big deal.
Maybe it's not much of a concern in machine learning.
I am just curious.
On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen sro...@gmail.com wrote:
What's your question? ALS has a random starting point which changes the
results a bit. Not sure about KNN though.
On Sun, Mar 17
There are many ways to think about combining these two types of data.
If you can make some similarity metric based on age, gender and interests,
then you can use it as the similarity metric in
GenericBooleanPrefUserBasedRecommender. You would be using both data sets
in some way. Of course this
I think you are referring to the same step? QR decomposition is how you
solve for u_i, which I imagine is the same step you have in mind.
I think someone submitted a different build profile that changes the
dependencies for you. I believe the issue is using hadoop-common and not
hadoop-core as well as changing versions. I think the rest is compile
compatible and probably runtime compatible. But I've not tried.
On Wed, Mar 13, 2013
it is a likely
performance bug. The computation is AB'. Perhaps you refer to rows of B
which are the columns of B'.
Sent from my sleepy thumbs set to typing on my iPhone.
On Mar 6, 2013, at 4:16 AM, Sean Owen sro...@gmail.com wrote:
If there are 100 features, it's more like 2.6M * 2.8M * 100
OK and he mentioned that 10 mappers were running, when it ought to be able
to use several per machine. The # of mappers is a function of the input
size really, so probably needs to turn down the max file split size to
induce more mappers?
On Wed, Mar 6, 2013 at 11:16 AM, Sebastian Schelter
the allocation down to negligible
levels.
On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen sro...@gmail.com wrote:
OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
probably, as most machines can handle 2 reducers at once).
I think the recommendation step loads one whole matrix
Without any tricks, yes you have to do this much work to really know which
are the largest values in UM' for every row. There's not an obvious twist
that speeds it up.
(Do you really want to compute all user recommendations? how many of the
2.6M are likely to be active soon, or, ever?)
First,
if this was sane! I'll
have a look into this as well if needed.
Thanks for the advice!
Josh
On 5 March 2013 22:23, Sean Owen sro...@gmail.com wrote:
Without any tricks, yes you have to do this much work to really know
which
are the largest values in UM' for every row. There's not an obvious twist
methods throw an UnsupportedOperationException. I
read in an old thread that you had updated these methods to work. I'm not
sure what I'm missing here. Can you point me in the right direction?
On Mar 2, 2013, at 6:42 AM, Sean Owen wrote:
Yes to integrate any new data everything must
Yes to integrate any new data everything must be reloaded.
On Mar 2, 2013 6:34 AM, Nadia Najjar ned...@gmail.com wrote:
I am using a FileDataModel and remove and insert preferences before
estimating preferences. Do I need to rebuild the recommender after these
methods are called for it to be
Although I don't know of any specific incompatibility, I would not be
surprised. 0.18 is pretty old. As you can see in pom.xml it currently works
against the latest stable version, 1.1.1.
On Sat, Mar 2, 2013 at 6:16 PM, MARCOS UBIRAJARA
marcosubiraj...@ig.com.brwrote:
Dear Gentleman,
First
It's true, although many of the algorithms will by nature not emphasize
popular items.
There is an old and semi-deprecated class in the project
called InverseUserFrequency, which you can use to manually de-emphasize
popular items internally. I wouldn't really recommend it.
You can always use
A common measure of cluster coherence is the mean distance or mean squared
difference between the members and the cluster centroid. It sounds like
this is the kind of thing you're measuring with this all-pairs distances.
That could be a measure too; I've usually seen that done by taking the
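A self-contained sketch of that coherence measure, the mean distance from members to the centroid (all names here are mine, not a Mahout API):

```java
public class Coherence {
    // Mean Euclidean distance from each member to the cluster centroid,
    // one common measure of cluster coherence.
    static double meanDistanceToCentroid(double[][] points) {
        int n = points.length, d = points[0].length;
        double[] centroid = new double[d];
        for (double[] p : points)
            for (int j = 0; j < d; j++) centroid[j] += p[j] / n;
        double sum = 0;
        for (double[] p : points) {
            double sq = 0;
            for (int j = 0; j < d; j++) {
                double diff = p[j] - centroid[j];
                sq += diff * diff;
            }
            sum += Math.sqrt(sq);
        }
        return sum / n;
    }

    public static void main(String[] args) {
        double[][] cluster = {{0, 0}, {2, 0}, {0, 2}, {2, 2}};
        // centroid is (1,1); every point is sqrt(2) away
        System.out.println(meanDistanceToCentroid(cluster)); // ~1.4142
    }
}
```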
I may not be 100% following the thread, but:
Similarity metrics won't care whether some items are really actions and
some items are items. The math is the same. The problem which you may be
alluding to is the one I mentioned earlier -- there is no connection
between item and item-action in the
It's also valid, yes. The difference is partly due to asymmetry, but also
just historical (i.e. no great reason). The item-item system uses a
different strategy for picking candidates based on CandidateItemStrategy.
On Thu, Feb 21, 2013 at 2:37 PM, Koobas koo...@gmail.com wrote:
In the
I think all of the code uses double-precision floats. I imagine much of it
could work as well with single-precision floats.
MapReduce and a GPU are very different things though, and I'm not sure how
you would use both together effectively.
On Wed, Feb 20, 2013 at 7:10 AM, shruti ranade
over this in addition to what Ted Dunning presented the
other day on Solr involment in building/loading cooccurrence matrix for
Mahout recommendation, it should be a big leap in innovating Mahout
recommendation.
Am I missing something or just dreaming?
Regards,
Y.Mandai
2013/2/20 Sean Owen sro
Although bigger N values mostly overcome this problem, it still does not
seem totally supervised.
On Sun, Feb 17, 2013 at 1:49 AM, Sean Owen sro...@gmail.com wrote:
The very question at hand is how to label the data as relevant and not
relevant results. The question exists because
No, this is not a problem.
Yes it builds a model for each user, which takes a long time. It's
accurate, but time-consuming. It's meant for small data. You could rewrite
your own test to hold out data for all test users at once. That's what I
did when I rewrote a lot of this just because it was
similar to B than C, which is not true.
From: Sean Owen sro...@gmail.com
To: Mahout User List user@mahout.apache.org; Ahmet Ylmaz
ahmetyilmazefe...@yahoo.com
Sent: Saturday, February 16, 2013 8:41 PM
Subject: Re: Problems with Mahout's
of the test user in a random
fashion.
On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen sro...@gmail.com wrote:
Yes. But: the test sample is small. Using 40% of your data to test is
probably quite too much.
My point is that it may be the least-bad thing to do. What test are you
proposing
prediction is clearly a supervised ML problem
On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen sro...@gmail.com wrote:
This is a good answer for evaluation of supervised ML, but, this is
unsupervised. Choosing randomly is choosing the 'right answers' randomly,
and that's plainly problematic
If you're suggesting that you hold out only high-rated items, and then
sample them, then that's what is done already in the code, except without
the sampling. The sampling doesn't buy anything that I can see.
If you're suggesting holding out a random subset and then throwing away the
held-out
at 10:29 PM, Tevfik Aytekin
tevfik.ayte...@gmail.comwrote:
I'm suggesting the second one. That way the test user's ratings in
the training set will be composed of both low- and high-rated items,
which prevents the problem pointed out by Ahmet.
On Sat, Feb 16, 2013 at 11:19 PM, Sean Owen sro
The very question at hand is how to label the data as relevant and not
relevant results. The question exists because this is not given, which is
why I would not call this a supervised problem. That may just be semantics,
but the point I wanted to make is that the reasons choosing a random
training
Yes, I don't know if removing that data would improve results. It might
mean you can compute things faster, at little or no observable loss in
quality of the results.
I'm not sure, but you probably have repeat purchases of the same item, and
items of different value. Working in that data may help
This sounds like a job for frequent item set mining, which is kind of a
special case of the ideas you've mentioned here. Given N items in a cart,
which next item most frequently occurs in a purchased cart?
On Thu, Feb 14, 2013 at 6:30 PM, Pat Ferrel pat.fer...@gmail.com wrote:
I thought you
harder to implement but we can
also test precision on that and compare the two.
The recommender method below should be reasonable AFAICT except for the
method(s) of retrieving recs, which seem likely to be slow.
On Feb 14, 2013, at 9:45 AM, Sean Owen sro...@gmail.com wrote:
This sounds like
comparisons--worst case. Each cart is likely to have only a few items in it
and I imagine this speeds the similarity calc.
I guess I'll try it as described and optimize for speed if the precision
is good compared to the apriori algo.
On Feb 14, 2013, at 10:57 AM, Sean Owen sro...@gmail.com wrote
I think you'd have to hack the code to not exclude previously-seen items,
or at least, not of the type you wish to consider. Yes you would also have
to hack it to add rather than replace existing values. Or for test
purposes, just do the adding yourself before inputting the data.
My hunch is that
of the sparsified versions of
these and let the search engine handle the weighting of different
components at query time. Having these components separated into different
fields in the search index seems to help quite a lot, which makes a fair
bit of sense.
On Sun, Feb 10, 2013 at 9:55 AM, Sean Owen
You don't have to fix a scale. But your data needs to be consistent.
It wouldn't work to have users rate on a 1-5 scale one day, and 1-100
tomorrow (unless you go back and normalize the old data to 1-100).
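For instance, a one-off rescaling of old 1-5 ratings onto a 1-100 scale might look like this (a hypothetical helper, not a Mahout API):

```java
public class RescaleRatings {
    // Map an old rating r in [1,5] linearly onto [1,100] so old and
    // new data are consistent before feeding them to the recommender.
    static double rescale(double r) {
        return 1.0 + (r - 1.0) * 99.0 / 4.0;
    }

    public static void main(String[] args) {
        System.out.println(rescale(1.0)); // 1.0
        System.out.println(rescale(5.0)); // 100.0
        System.out.println(rescale(3.0)); // 50.5
    }
}
```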
On Mon, Feb 4, 2013 at 3:56 PM, Zia mel ziad.kame...@gmail.com wrote:
Hi , is there a
You can -DskipTests to skip tests, since that's what it is complaining
about. There aren't any current failures in trunk so could be
something specific to your setup. Or a flaky test. It may still be
something to fix.
On Mon, Feb 4, 2013 at 3:37 PM, jellyman colm_r...@hotmail.com wrote:
Hi
You are asking for a smaller and smaller neighborhood around a user.
At some point the neighborhood includes no users, for some people --
or, the neighborhood includes no new items. Nothing can be
recommended, and so recall drops. Precision and recall tend to go in
opposite directions for similar
The problem with this POV is that it assumes it's obvious what the
right outcome is. With a transaction test or a disk write test or big
sort, it's obvious and you can make a benchmark. With ML, it's not
even close.
For example, I can make you a recommender that is literally as fast as
you like
It's a good question. I think you can achieve a partial solution in Mahout.
Real-time suggests that you won't be able to make use of
Hadoop-based implementations, since they are by nature big batch
processes.
All of the implementations accept the same input -- user,item,value.
That's OK; you can
It doesn't really work this way. The model is predicated on loading the
data from backing store periodically. In the short term it is read only.
This method is misleading in a sense.
On Jan 29, 2013 3:31 PM, Henning Kuich hku...@gmail.com wrote:
Dear All,
I would like to be able to update
This is quite small and certainly doesn't require Hadoop. That's the good
news. Any reasonable server will do well for you. You won't be memory
bound. More cores will let you serve more QPS.
Your pain points will be elsewhere like tuning for best quality and real
time updates. See my separate
Is it worth simply using the Commons Math implementation?
On Mon, Jan 28, 2013 at 8:04 AM, Sebastian Schelter s...@apache.org wrote:
This is great news and will automatically boost the performance of all
our ALS-based recommenders which are all using QRDecomposition internally.
On 28.01.2013
Is it even possible that MatrixMultiplication can run distributed on
multiple mappers, as it internally uses CompositeInputFormat?
Please Suggest
Thanks
Stuti
-Original Message-
From: Sean Owen [mailto:sro...@gmail.com]
Sent: Wednesday, January 23, 2013 6:42 PM
To: Mahout User
= evaluator.evaluate(recommenderBuilder,
null, model, null, 10,
GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
0.05);
Many thanks
On Fri, Jan 25, 2013 at 12:26 PM, Sean Owen sro...@gmail.com
Yes several independent samples of all the data will, together, give
you a better estimate of the real metric value than any individual
one.
On Mon, Jan 28, 2013 at 5:41 PM, Zia mel ziad.kame...@gmail.com wrote:
What about running several tests on small data , can't that give an
indicator of
The way I do it is to set x different for each user, to the number of
items in the user's test set -- you ask for x recommendations.
This makes precision == recall, note. It dodges this problem though.
Otherwise, if you fix x, the condition you need is stronger, really:
each user needs >= x test
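Why precision equals recall in that setup: both share the numerator (hits), and when x = |test set| they share the denominator too. A sketch (class name and IDs are illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PrecisionRecall {
    // precision = hits / |recommended|; recall = hits / |relevant|.
    // With |recommended| set to |test set|, the two are identical.
    static double precision(Set<Long> recs, Set<Long> test) {
        Set<Long> hits = new HashSet<>(recs);
        hits.retainAll(test);
        return (double) hits.size() / recs.size();
    }

    static double recall(Set<Long> recs, Set<Long> test) {
        Set<Long> hits = new HashSet<>(recs);
        hits.retainAll(test);
        return (double) hits.size() / test.size();
    }

    public static void main(String[] args) {
        Set<Long> test = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        // Ask for x = 3 recommendations, the size of the test set.
        Set<Long> recs = new HashSet<>(Arrays.asList(2L, 3L, 9L));
        System.out.println(precision(recs, test)); // 2/3
        System.out.println(recall(recs, test));    // 2/3, same value
    }
}
```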
Something like selecting y
sets, each set having a min of z users?
On Fri, Jan 25, 2013 at 12:09 PM, Sean Owen sro...@gmail.com wrote:
The way I do it is to set x different for each user, to the number of
items in the user's test set -- you ask for x recommendations.
This makes precision
In my experience, using many small instances hurts since there is more
data transferred (less data is local to any given computation) and the
instances have lower I/O performance.
On the high end, super-big instances become counter-productive because
they are not as cheap on the spot market -- and
On Tue, Jan 22, 2013 at 10:42 AM, Sean Owen sro...@gmail.com wrote:
Yes any metric that concerns estimated value vs real value can't be
used since all values are 1. Yes, when you use the non-boolean version
with boolean data you always get 1. When you use the boolean version
with boolean
Well, if you are throwing away rating data, you are throwing away
rating data. They are no longer 100% different but 100% the same.
If that's not a good thing to do, don't do it.
It's possible that using ratings gets better precision, and it's
possible that it doesn't. It depends on whether the
Yes, but the similarities are no longer weights, because there is
nothing to weight. They are used to compute a score directly, which is
not a weighted average but a function of the similarities themselves.
While it is true that more distant neighbors have less effect in
general, when the
not got any
success
http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reducers-in-DistributedRowMatrix-Jobs-td888980.html
Stuti
-Original Message-
From: Sean Owen [mailto:sro...@gmail.com]
Sent: Wednesday, January 16, 2013 4:46 PM
The stochastic nature of the evaluation means your results will vary
randomly from run to run. This looks to my eyeballs like most of the
variation you see. You probably want to average over many runs.
You will probably find that accuracy peaks around some neighborhood size:
adding more useful
That is good for making a test repeatable because you are picking the same
random sample repeatedly. For evaluation purposes here that's not a good
thing and you do want several actually different samples of the result.
On Jan 23, 2013 1:19 PM, Stevo Slavić ssla...@gmail.com wrote:
When
they were not
using a Boolean recommender , something like code 1 maybe? Thanks
On Tue, Jan 22, 2013 at 10:42 AM, Sean Owen sro...@gmail.com wrote:
Yes any metric that concerns estimated value vs real value can't be
used since all values are 1. Yes, when you use the non-boolean version
It's hard to make such generalization, but all else equal, I'd expect
more data to improve results and decrease error, yes.
On Wed, Jan 23, 2013 at 8:02 PM, Zia mel ziad.kame...@gmail.com wrote:
Is there a relation between ItemBased and data size? I found when I
increase the data size the MAE
GenericUserBasedRecommender(model, neighborhood,
similarity);
}};
On Tue, Jan 22, 2013 at 1:58 AM, Sean Owen sro...@gmail.com wrote:
No it's really #2, since the first still has data that is not
true/false. I am not sure what eval you are running, but an RMSE test
wouldn't be useful
Moreover, when I use
DataModel model = new FileDataModel(new File(ua.base));
in code 2, the MAE score was higher.
When you say RMSE can't be used with boolean data, I assume MAE also
can't be used?
Thanks !
On Tue, Jan 22, 2013 at 10:08 AM, Sean Owen sro...@gmail.com wrote:
RMSE can't
Yes that's right. Look at UserBasedRecommender.mostSimilarUserIDs(),
and Recommender.estimatePreference(). These do what you are interested
in, and yes they are easy since they are just steps in the
recommendation process anyway.
On Tue, Jan 22, 2013 at 6:38 PM, Henning Kuich hku...@gmail.com
for the quick reply!
HK
On Tue, Jan 22, 2013 at 7:40 PM, Sean Owen sro...@gmail.com wrote:
You would have to write this yourself, yes.
If you're not keeping the data in memory, you're not updating the
results in real-time. So there's no real need to keep any DataModel
around at all. Just pre-compute and store recommendations and update
them periodically. Nothing has to be on-line then.
matrix? So it would make memory
usage much worse, even if it is possible. Wouldn't it be better to keep the
model and compute whenever necessary?
Thanks
Ceyhun Can Ulker
On Mon, Jan 21, 2013 at 9:58 PM, Sean Owen sro...@gmail.com wrote:
You would have to write this yourself, yes.
If you're
No it's really #2, since the first still has data that is not
true/false. I am not sure what eval you are running, but an RMSE test
wouldn't be useful in case #2. It would always be 0 since there is
only one value in the universe: 1. No value can ever be different from
the right value.
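To see why RMSE degenerates on boolean data, compute it by hand: with every actual and estimated value equal to 1, the error is identically 0 (a sketch, not Mahout's evaluator):

```java
public class BooleanRmse {
    // Root-mean-square error between actual and predicted values.
    static double rmse(double[] actual, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double d = actual[i] - predicted[i];
            sum += d * d;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static void main(String[] args) {
        // Boolean data: the only value in the universe is 1, so the
        // metric is always 0 and carries no information.
        double[] actual = {1, 1, 1}, predicted = {1, 1, 1};
        System.out.println(rmse(actual, predicted)); // 0.0
    }
}
```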
On Tue,
And, do you really need an inverse, or pseudo-inverse?
But, no, there are really no direct utilities for this. But we could
probably tell you how to do it efficiently, as long as you don't
actually mean a full inverse.
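One way to avoid a full (pseudo-)inverse, as suggested above: solve the least-squares system directly. Here is a toy normal-equations solve for a two-column A (the class is illustrative; in real code you would hand A to a QR solver, e.g. Commons Math's QRDecomposition, which is more stable and never forms an inverse):

```java
public class LeastSquares {
    // Solve min ||A x - b|| via the normal equations (A'A) x = A'b
    // for a 2-column A, instead of forming any explicit inverse of A.
    static double[] solve(double[][] a, double[] b) {
        // Accumulate the 2x2 Gram matrix A'A and the vector A'b.
        double g00 = 0, g01 = 0, g11 = 0, c0 = 0, c1 = 0;
        for (int i = 0; i < a.length; i++) {
            g00 += a[i][0] * a[i][0];
            g01 += a[i][0] * a[i][1];
            g11 += a[i][1] * a[i][1];
            c0  += a[i][0] * b[i];
            c1  += a[i][1] * b[i];
        }
        double det = g00 * g11 - g01 * g01;
        return new double[] {(g11 * c0 - g01 * c1) / det,
                             (g00 * c1 - g01 * c0) / det};
    }

    public static void main(String[] args) {
        double[][] a = {{1, 0}, {0, 1}, {1, 1}};
        double[] b = {1, 2, 3};
        double[] x = solve(a, b);
        System.out.println(x[0] + " " + x[1]); // 1.0 2.0
    }
}
```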
On Fri, Jan 18, 2013 at 11:58 AM, Ted Dunning ted.dunn...@gmail.com wrote:
You should give more detail about the errors. You are running out of
memory on the child workers. This is not surprising since the default
memory they allocate is fairly small, and you're running a complete
recommender system inside each mapper. It doesn't have much to do with
the size of the instance
That's the error right there:
On Thu, Jan 17, 2013 at 9:57 PM, Kamal Ali k...@grokker.com wrote:
Caused by: java.io.IOException: Unexpected input format on line: 1 1 5
Please Suggest
-Original Message-
From: Sean Owen [mailto:sro...@gmail.com]
Sent: Wednesday, January 16, 2013 1:23 PM
To: Mahout User List
Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
It's up to Hadoop in the end.
Try calling FileInputFormat.setMaxInputSplitSize
You can try resetting all the random seeds with RandomUtils.useTestSeed()
On Jan 16, 2013 4:01 PM, Zia mel ziad.kame...@gmail.com wrote:
Hi
How to evaluate a recommender using different similarities ? Once we call
evaluator.evaluate(recommenderBuilder,..)
it will decide the training and test
Not really directly, no. You can make N individual recommendations and
combine them, and there are many ways to do that. You can blindly rank
them on their absolute scores. You can interleave rankings so each
gets every Nth slot in the recommendation. A popular metric is to rank
by least-aversion
It's fairly arbitrary. Strong positive ratings are probably more than
merely above average, but you could define the threshold higher or
lower if you wanted. It's a good default.
On Tue, Jan 15, 2013 at 3:58 PM, Zia mel ziad.kame...@gmail.com wrote:
Hi
Why in recommender the threshold is
Precision is not a great metric for recommenders, but it exists. There
is no best value here; I would choose something that mirrors how you
will use the results. If you show top 3 recs, use 3.
On Tue, Jan 15, 2013 at 4:51 PM, Zia mel ziad.kame...@gmail.com wrote:
Hello,
If I have users that
The best tests are really from real users. A/B test different
recommenders and see which has better performance. That's not quite
practical though.
The problem is that you don't even know what the best recommendations
are. Splitting the data by date is reasonable, but recent items aren't
You have the definition there already, what are you asking?
On Jan 15, 2013 5:58 PM, Zia mel ziad.kame...@gmail.com wrote:
Hi again ,
When evaluting preferences in recommenders and using
RMSRecommenderEvaluator, is it RMSE/RMSD
http://en.wikipedia.org/wiki/Root_mean_square_deviation
If we
http://stackoverflow.com/questions/10522835/hadoop-java-io-ioexception-mkdirs-failed-to-create-some-path
On Tue, Jan 15, 2013 at 9:42 PM, Yunming Zhang
zhangyunming1...@gmail.com wrote:
Hi,
I was trying to set up Mahout 0.8 on my Macbook Pro with OSX so I could do
some local testing, I am
It's up to Hadoop in the end.
Try calling FileInputFormat.setMaxInputSplitSize() with a smallish
value, like your 10MB (1000).
I don't know if Hadoop params can be set as sys properties like that anyway?
On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi stutiawas...@hcl.com wrote:
Hi,
I am