This is not right. The sequential version would have finished long before
this for any reasonable value of k.
I do note, however, that you have set k = 200,000 where you only have
300,000 documents. Depending on which value you set (I don't have the code
handy), this may actually be increased
Use a better recommender. Slope One is just there for completeness.
Sent from my iPhone
On Dec 8, 2013, at 2:24, Siddharth Patnaik spatnai...@gmail.com wrote:
What should be done to improve the runtime performance?
The problem of correlation of features is clearly present in text, but it is
not so clear what the effect will be. For naive bayes this has the effect of
making the classifier overconfident but it usually still works reasonably
well. For logistic regression without regularization it can
On Sun, Dec 8, 2013 at 5:50 PM, Fernando Santos
fernandoleandro1...@gmail.com wrote:
Actually I had never heard of PCA and LDA. I'll take a look at them.
PCA and LDA are probably not quite what you want for Naive Bayes,
especially in Mahout. There is an assumption of a sparse binary
that
aren't co-rated can't meaningfully be included in this computation.
On Sun, Dec 1, 2013 at 8:29 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
Good point Amit.
Not sure how much this matters. It may be that
PearsonCorrelationSimilarity is a bad name that should
, or you have another one you
can forward to me, your doctoral dissertation? Thanks.
Jason Xin
-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Friday, December 06, 2013 7:56 PM
To: user@mahout.apache.org
Subject: Re: Question about Pearson Correlation in non
Angelo,
The first question is how you intend to define which items are similar.
Also, what is the intended use of the clustering? Without knowing that, it
is very hard to say how to best do the clustering.
For instance, are two records more similar if the records are at the same
time of day?
to
determine the optimal number of clusters that best fits the dataset and
passing that information as a parameter to k-means clustering (KMeansDriver
class).
Regards
Prabhakar
On Tue, Dec 3, 2013 at 6:00 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Can you be more specific about which code you
Ani,
I really don't understand your second point.
Here is how I view things ... if you can phrase things in those terms, it
might help me understand your question.
The TF part of TF-IDF refers to the term frequencies in a document.
Typically, each possible word is assigned to a positive
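To make the TF part concrete, here is a toy sketch (my own illustration, not Mahout code) that counts term frequencies in a document:

```java
import java.util.HashMap;
import java.util.Map;

public class TermFrequency {
    // Count how often each distinct word occurs in the document text.
    public static Map<String, Integer> tf(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : document.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

In TF-IDF, these raw counts are then damped and weighted by each term's inverse document frequency across the collection.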
Can you be more specific about which code you are asking about?
The ball k-means implementation provides a capability somewhat like this,
but perhaps in a more clearly defined way.
On Tue, Dec 3, 2013 at 9:34 AM, Prabhakar Srinivasan
prabhakar.sriniva...@gmail.com wrote:
Hello!
Can someone
Peter,
What you say is a bit confusing to me.
You say you have centers already. But then you talk about algorithms which
find the centers.
Also, you say you want to assign points based on centers, but you also say
that clusters have different shapes, area, size and point count. Do you
mean
Elephant bird is distinctly superior to Pig Vector for many things (it
moved forward, Pig Vector did not).
I believe there is also a Twitter-internal project known as PigML which is
much more what Pig Vector wanted to be.
There is also https://github.com/hanborq/pigml, but I think it is very
Do you want to cluster users or items?
For items, the vectorization that you suggest will work reasonably well,
especially if you use TF.IDF weighting and normalize the resulting vectors.
You can also use one of the matrix decomposition techniques and cluster the
resulting vectors. The spectral
Inline
On Mon, Dec 2, 2013 at 8:55 AM, optimusfan optimus...@yahoo.com wrote:
... To accomplish this, we used AdaptiveLogisticRegression and trained 46
binary classification models. Our approach has been to do an 80/20 split
on the data, holding the 20% back for cross-validation of the
, Ted Dunning ted.dunn...@gmail.com
wrote:
On Fri, Nov 29, 2013 at 10:16 PM, Amit Nithian anith...@gmail.com
wrote:
Hi Ted,
Thanks for your response. I thought that the mean of a sparse vector is
simply the mean of the defined elements? Why would the vectors become
dense unless
Did the training run use both machines?
How large is the input for the test run?
Is it contained in a single file?
On Sat, Nov 30, 2013 at 11:22 AM, Fernando Santos
fernandoleandro1...@gmail.com wrote:
Hello everyone,
I'm trying to do a text classification task. My dataset is not that
The new Ball k-means and streaming k-means implementations have non-Hadoop
versions. The streaming k-means implementation also has a threaded
implementation that runs without Hadoop.
The threaded streaming k-means implementation should be pretty fast.
On Sun, Dec 1, 2013 at 7:55 PM, Shan Lu
The default with the Mahout encoders is two probes. This is unnecessary
with the intercept term, of course, if you protect the intercept term from
other updates, possibly by encoding other data using a view of the original
feature vector.
For each probe, a different hash is used so each value is
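A toy version of multi-probe hashed encoding (my own sketch, not the Mahout encoder; the probe-salted string hash is an assumption):

```java
public class HashedEncoder {
    // Each feature value lands in two hashed slots ("probes"), so a
    // single collision cannot wipe out the feature entirely.
    public static double[] encode(String feature, double weight, int size) {
        double[] v = new double[size];
        for (int probe = 0; probe < 2; probe++) {
            // Salt the hash with the probe number so the two slots differ.
            int slot = Math.floorMod((feature + ":" + probe).hashCode(), size);
            v[slot] += weight;
        }
        return v;
    }

    // Sum of all components, used below to check total added weight.
    public static double sum(double[] v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s;
    }
}
```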
after encoding a new value in the vector? This would give a user the
information that the length of the chosen vector is too short. So far,
I did not find any method in the api to check for that.
2013/11/29 Ted Dunning ted.dunn...@gmail.com:
The default with the Mahout encoders is two probes
Well, the best way to compute correlation using sparse vectors is to make
sure you keep them sparse. To do that, you must avoid subtracting the mean
by expanding whatever formulae you are using. For instance, if you are
computing
(x - m_x) . (y - m_y)
(here . means dot product)
If you do
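Spelled out as a sketch (plain HashMaps standing in for sparse vectors, n the full logical dimension; my own toy code, not Mahout's):

```java
import java.util.Map;

public class SparseCenteredDot {
    // (x - m_x) . (y - m_y) expanded is x.y - n*m_x*m_y, so only the
    // nonzero entries (the map keys) are ever touched; absent keys are
    // zeros and never need to be materialized.
    public static double centeredDot(Map<Integer, Double> x,
                                     Map<Integer, Double> y, int n) {
        double sumX = 0.0, sumY = 0.0, dot = 0.0;
        for (double v : x.values()) sumX += v;
        for (double v : y.values()) sumY += v;
        // Sparse dot product: iterate one vector's nonzeros only.
        for (Map.Entry<Integer, Double> e : x.entrySet()) {
            Double yv = y.get(e.getKey());
            if (yv != null) dot += e.getValue() * yv;
        }
        return dot - n * (sumX / n) * (sumY / n);
    }
}
```

Pearson correlation then divides this quantity by the product of the similarly expanded centered norms.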
On Fri, Nov 29, 2013 at 10:16 PM, Amit Nithian anith...@gmail.com wrote:
Hi Ted,
Thanks for your response. I thought that the mean of a sparse vector is
simply the mean of the defined elements? Why would the vectors become
dense unless you're meaning that all the undefined elements (0?) now
public double currentLearningRate() {
    return mu0 * Math.pow(decayFactor, getStep())
        * Math.pow(getStep() + stepOffset, forgettingExponent);
}
I presume that you would like an Adagrad-like solution to replace the above?
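For contrast, a minimal Adagrad-style per-feature rate might look like this (hypothetical sketch, not a patch against the Mahout code; the epsilon constant is an assumption):

```java
public class Adagrad {
    // Per-feature learning rate: eta / sqrt(sum of squared gradients).
    // sumSq is caller-owned accumulator state, one slot per feature,
    // so frequently-updated features get smaller steps over time.
    public static double rate(double eta, double[] sumSq,
                              int feature, double gradient) {
        sumSq[feature] += gradient * gradient;
        return eta / Math.sqrt(1e-8 + sumSq[feature]);
    }
}
```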
On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning ted.dunn
of
OnlineLogisticRegression,
which in turn builds up to Adaptive/Cross Fold.
b) for truly on-line learning where no repeated passes through the
data..
What would it take to get to an implementation ? How can any one help ?
Regards,
On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning ted.dunn
On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com
Are we to assume that SGD is still a work in progress and implementations (
Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?
They are too raw to be accepted uncritically, for sure. They have
Have you looked at the streaming k-means work? The basic idea is that you
generate a sketch of the data which you can then cluster in-memory. That
lets you use very advanced centroid generation algorithms that require lots
of processing.
On Tue, Nov 26, 2013 at 6:29 AM, Chih-Hsien Wu
Well, first off, let me say that I am much less of a fan now of the magical
cross validation approach and adaptation based on that than I was when I
wrote the ALR code. There are definitely legs in the ideas, but my
implementation has a number of flaws.
For example:
a) the way that I provide
On Mon, Nov 25, 2013 at 3:14 AM, Manuel Blechschmidt
manuel.blechschm...@gmx.de wrote:
There are/were multiple kNN implementations in Mahout:
Recommender knn
, Andreas Bauer b...@gmx.net wrote:
Ok, I'll have a look. Thanks! I know mahout is intended for large scale
machine learning, but I guess it shouldn't have problems with such small
data either.
Ted Dunning ted.dunn...@gmail.com schrieb:
On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer b
For recommendation work, I suggest that it would be better to simply code
out an explicit OR query.
On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler kkrugler_li...@transpac.com wrote:
Hi Pat,
On Nov 7, 2013, at 7:30pm, Pat Ferrel pat.fer...@gmail.com wrote:
Another approach would be to weight
On Thu, Nov 7, 2013 at 12:50 AM, Gokhan Capan gkhn...@gmail.com wrote:
This particular approach is discussed, and proven to increase the accuracy
in Collaborative filtering with Temporal Dynamics by Yehuda Koren. The
decay function is parameterized per user, keeping track of how consistent
Why is FEATURE_NUMBER != 13?
With 12 features that are already lovely and continuous, just stick them in
elements 1..12 of a 13 long vector and put a constant value at the
beginning of it. Hashed encoding is good for sparse stuff, but confusing
for your case.
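The suggested layout is a toy one-liner (hypothetical helper, not Mahout API):

```java
public class DenseEncoding {
    // Slot 0 holds a constant intercept term; the 12 continuous
    // features go straight into slots 1..12, no hashing needed.
    public static double[] encode(double[] features) {
        double[] v = new double[features.length + 1];
        v[0] = 1.0; // constant intercept term
        System.arraycopy(features, 0, v, 1, features.length);
        return v;
    }
}
```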
Also, it looks like you only pass
On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer b...@gmx.net wrote:
Hi,
Thanks for your comments.
I modified the examples from the mahout in action book, therefore I used
the hashed approach and that's why i used 100 features. I'll adjust the
number.
Makes sense. But the book was doing
No. Scheduling is outside of Mahout's scope.
On Wed, Oct 30, 2013 at 12:55 PM, Cassio Melo melo.cas...@gmail.com wrote:
I wonder if Mahout (more precisely
org.apache.mahout.cf.taste package) has any helper class to execute
scheduled tasks like fetch data, compute similarity, etc.
Thank
Actually that isn't quite correct.
Watchmaker was removed. That was a genetic algorithm implementation.
EP or evolutionary programming still has an implementation in Mahout in the
class org.apache.mahout.ep.EvolutionaryProcess
This algorithm is documented here: http://arxiv.org/abs/0803.3838
Tim,
Yes, RF's are ensemble learners, but that doesn't mean that you couldn't
wrap them up with other classifiers to have a higher level ensemble.
On Sat, Oct 19, 2013 at 6:48 AM, Tim Peut t...@timpeut.com wrote:
Thanks for the info and suggestions everyone.
On 19 October 2013 01:00, Ted
On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut t...@timpeut.com wrote:
Has anyone found that Mahout's random forest doesn't perform as well as
other implementations? If not, is there any reason why it wouldn't perform
as well?
This is disappointing, but not entirely surprising. There has been
On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser
j.barrett.straus...@gmail.com wrote:
How difficult would it be to wrap the RF classifier into an ensemble
learner?
It is callable. Should be relatively easy.
Search engines do cool things.
On Fri, Oct 11, 2013 at 7:42 AM, Jens Bonerz jbon...@googlemail.com wrote:
what a nice idea :-) really like that approach
2013/10/11 Ted Dunning ted.dunn...@gmail.com
You don't need Mahout for this.
A very easy way to do this is to gather all the words
For language detection, you are going to have a hard time doing better than
one of the standard packages for the purpose. See here:
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
On Thu, Oct 10, 2013 at 1:01 AM, Dean Jones dean.m.jo...@gmail.com wrote:
Hi Si,
Cool. Sounds like you are ahead of the game.
Sent from my iPhone
On Oct 10, 2013, at 13:15, Dean Jones dean.m.jo...@gmail.com wrote:
On 10 October 2013 12:46, Ted Dunning ted.dunn...@gmail.com wrote:
For language detection, you are going to have a hard time doing better than
one
Yes. Should work to use character n-grams. There are oddities in the
stats because the different n-grams are not independent, but Naive Bayes
methods are in such a state of sin that it shouldn't hurt any worse.
No... I don't think that there is a capability built in to generate the
character
of the rest could be trimmed away by config or adherence to conventions I
suspect. In the demo site I'm working on I've had to adopt some slightly
hacky conventions that I'll describe some day.
On Oct 1, 2013, at 10:38 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Pat,
Ellen and some folks
On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov
msoko...@safaribooksonline.com wrote:
On 10/9/13 3:08 PM, Pat Ferrel wrote:
Solr uses cosine similarity for its queries. The implementation on
github uses Mahout LLR for calculating the item-item similarity matrix but
when you do the
On Wed, Oct 9, 2013 at 2:07 PM, Pat Ferrel p...@occamsmachete.com wrote:
2) What you are doing is something else that I was calling a shopping-cart
recommender. You are using the item-set in the current cart and finding
similar, what, items? A different way to tackle this is to store all other
On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov
msoko...@safaribooksonline.com wrote:
It sounds like you are doing item-item similarities for recommendations,
not actually calculating user-history based recs, is that true?
Yes that's true so far. Our recommender system has the ability to
iPhone
On Oct 6, 2013, at 12:37, Jens Bonerz jbon...@googlemail.com wrote:
Hmmm.. has ballkmeans made it already into the 0.8 release? can't find it
in the list of available programs when calling the mahout binary...
2013/10/3 Ted Dunning ted.dunn...@gmail.com
What you are seeing here
Why do you say that this is unacceptable?
If the phrase is the most common way that the word English is used, this isn't
such a bad thing.
In general, with machine learning, the idea is to let the data speak. If the
data say something you don't like, you have to be careful about
MahoutCluster similarProducts.txt
What am I missing?
2013/10/3 Ted Dunning ted.dunn...@gmail.com
Yes. That will work.
The sketch will then contain 10,000 x log N centroids. If N = 10^9, log N
\approx 30, so the sketch will have about 300,000 weighted centroids
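The arithmetic checks out (assuming log base 2, which matches log 10^9 ≈ 30; toy code, not the streaming k-means implementation):

```java
public class SketchSize {
    // Estimated number of weighted centroids in the sketch: k * log2(n).
    public static long estimate(long k, long n) {
        return Math.round(k * (Math.log(n) / Math.log(2)));
    }
}
```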
On Fri, Oct 4, 2013 at 6:13 AM, Puneet Arora arorapuneet2...@gmail.com wrote:
yes you guessed correct that I am using naive bayes, but how can I handle
this type of problem.
I didn't hear about a problem.
You said you didn't like weights on words like English to reflect the fact
that they
on their short description text? What else could I use?
2013/10/1 Ted Dunning ted.dunn...@gmail.com
At such small sizes, I would guess that the sequential version of the
streaming k-means or ball k-means would be better options.
On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 jbon
happen, if I define a very high number that is
guaranteed to be the estimated number of clusters.
for example, if I set it to 10,000 clusters when an estimate of 5,000 is
likely, will that work?
2013/10/2 Ted Dunning ted.dunn...@gmail.com
The way that the new streaming k-means works
At such small sizes, I would guess that the sequential version of the
streaming k-means or ball k-means would be better options.
On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 jbon...@googlemail.com wrote:
Hello all,
I am currently trying to create clusters from a group of 50,000 strings that
Yes. You can turn the normal item-item relationships around to get this.
What you have is an item x feature matrix. Normally, one has a user x item
matrix in cooccurrence analysis and you get an item x item matrix.
If you consider the features to be users in the computation, then the
resulting
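In matrix terms: with B the (binary) item x feature matrix, treating features as users makes the item-item cooccurrence matrix B * B^T. A toy sketch (my own illustration, not the Mahout job):

```java
public class Cooccurrence {
    // c = b * b^T: c[i][j] counts the features shared by items i and j,
    // exactly as users shared by two items would be counted normally.
    public static int[][] itemItem(int[][] b) {
        int items = b.length, feats = b[0].length;
        int[][] c = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                for (int f = 0; f < feats; f++) {
                    c[i][j] += b[i][f] * b[j][f];
                }
            }
        }
        return c;
    }
}
```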
?
2013/9/20 Ted Dunning ted.dunn...@gmail.com
It also depends on what you are doing. Several parts of Mahout have non
Hadoop versions.
On Fri, Sep 20, 2013 at 5:53 AM, parnab kumar parnab.2...@gmail.com wrote:
It is always possible to run mahout without a cluster on a single machine
but do not expect too much performance gain on it if
Right now the best in terms of speed without losing quality in Mahout is
the streaming k-means implementation.
One exciting possibility is that you probably can combine a streaming
k-means pre-pass with a regularized k-means algorithm in order to get
results more like Lingo. You could also
On Wed, Sep 11, 2013 at 12:07 AM, Sean Owen sro...@gmail.com wrote:
2. Do we have to tune the similarityclass parameter in item-based CF?
If
so, do we compare the mean average precision values based on validation
data, and then report the same for the test set?
Yes you are
You definitely need to separate into three sets.
Another way to put it is that with cross validation, any learning algorithm
needs to have test data withheld from it. The remaining data is training
data to be used by the learning algorithm.
Some training algorithms such as the one that you
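The three-way separation described above could be sketched like this (hypothetical helper with an assumed 80/10/10 ratio; plain Java, nothing Mahout-specific):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class Split {
    // 80/10/10 split into train / validation / test; the test slice is
    // withheld from both the learning algorithm and all tuning.
    public static List<List<Integer>> threeWay(List<Integer> data, long seed) {
        List<Integer> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int n = shuffled.size();
        int trainEnd = (int) (0.8 * n);
        int validEnd = (int) (0.9 * n);
        return List.of(shuffled.subList(0, trainEnd),
                       shuffled.subList(trainEnd, validEnd),
                       shuffled.subList(validEnd, n));
    }
}
```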
On Fri, Sep 6, 2013 at 9:33 AM, Pat Ferrel pat.fer...@gmail.com wrote:
One of the unique things about the Solr recommender is online recs. Two
scenarios come to mind:
1) ask the user to pick from among a list of videos, taking the picks as
preferences and making recs. Make more and see if
That means "If I Recall Correctly". It is internet slang.
See also http://en.wiktionary.org/wiki/Appendix:English_internet_slang
On Sat, Sep 7, 2013 at 12:39 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:
Sebastian, what is IIRC?
On Sat, Sep 7, 2013 at 8:24 PM, Sebastian Schelter
Darius's comments are good.
You also have to think about what similar means to you. From the data you
describe, I see several possibilities:
- geo-location from machine id (if it includes IP address)
- content from the query
- frequency of posting
- diurnal phase of posting (tells us time
On Sat, Sep 7, 2013 at 2:35 PM, Pat Ferrel p...@occamsmachete.com wrote:
...
Clustering can be done by doing SVD or ALS on the user x thing matrix
first
or by directly clustering the columns of the user x thing matrix after
some
kind of IDF weighting. I think that only the streaming
Ahh...
That makes a lot of sense.
On Thu, Sep 5, 2013 at 11:38 PM, Lauren Massa-Lochridge
laurl...@ieee.org wrote:
Ted Dunning ted.dunning at gmail.com writes:
OK.
So the easy answer strikes out.
On Sat, Aug 3, 2013 at 5:04 AM, Swami Kevala
swami.kevala at ishafoundation.org
On Wed, Sep 4, 2013 at 6:58 PM, Alan Krumholz alan_krumh...@yahoo.com.mx wrote:
I pulled that code
(org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:215)) and
I think it is trying to read a file from one of the paths I passed to the
method but with a
I haven't seen any discussion of this other than what you reference.
On Thu, Sep 5, 2013 at 7:59 AM, Henry Lee honesthe...@gmail.com wrote:
I am about to implement Jake Mannix's suggestion out of Twitter fork.
Has anyone already implemented true L-LDA out of Mahout?
I think that Dominik's comments are exactly on target.
As far as implementation is concerned, I think that it is very important to
not distort the basic recommendation algorithm with business rules like
this. It is much better to post-process the results to impose your will
directly. One
On Wed, Sep 4, 2013 at 10:59 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Now, what happens in the case of SVD?
The vectors are normal by definition.
Are singular values used at all, or just left and right singular vectors?
SVD does not take weights so it cannot ignore or weigh out a
You also have to watch out in the case of web errors. Maven can store an
error message instead of a well formed file in your repo leading to all
kinds of confusion. Try deleting this:
*rm -rf ~/.m2/repository/com/ibm*
On Tue, Aug 27, 2013 at 7:37 AM, Stevo Slavić ssla...@gmail.com wrote:
-similarity case.
The cross-correlation sparsification via cooccurrence is probably pretty
weak, no?
On Aug 18, 2013, at 11:53 AM, Ted Dunning ted.dunn...@gmail.com wrote:
Outside of the context of your demo, suppose that you have events a, b, c
and d. Event a is the one we are centered
the data
from Mahout.
On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Yes. That would be interesting.
On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:
A little digression: Might a Matrix implementation backed by a Solr
index
values between the two sequence pairs to flip the
order at will... which is information that co-occurrence of course does not
know about.
On Sat, Aug 17, 2013 at 10:30 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
This is nice. As you say, k11 is the only part that is used in
cooccurrence and it doesn't weight by prevalence, either.
At this size it is hard to demonstrate much difference, because it is
hard to show interesting values of LLR without absurdly strong coordination
between items.
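For reference, the 2x2 log-likelihood ratio that k11 feeds into can be computed like this (a sketch in the style of Mahout's LogLikelihood utility, written from the standard G^2 formula):

```java
public class Llr {
    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts.
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    // G^2 for the 2x2 contingency table; k11 is the joint count.
    public static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + colEntropy - matrixEntropy);
    }
}
```

Independent counts give an LLR near zero; strong coordination between items gives a large positive score.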
On Fri,
-identify as a member of
the small demand set Ted Dunning describes, I figure I can chime in. As
always, YMMV.
No. There is very small demand for Mahout on Hadoop 2.0 so far and the
forward/backward incompatibility of 2.0 has made it difficult to motivate
moving to 2.0.
The bigtop guys built a maven profile for 0.23 some time ago. I don't know
the status of that.
I don't think that the differences are
Why do you think this?
On Tue, Aug 13, 2013 at 11:56 AM, sam wu swu5...@gmail.com wrote:
Mahout 0.9 snapshot
RowSimilarityJob.java , sampleDown method
line 291 or 300
double rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow)
/ observationsPerRow;
return either 0.0
, observationsPerRow) /
observationsPerRow;
we get rowSampleRate = 0.0 (not 0.7)
do we totally skip this column or sample column entries with 0.7 probability
(roughly getting 700 entries)?
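The suspicion above is the classic integer-division trap: when both operands are ints, 700 / 1000 truncates to 0 before the widening to double. A minimal illustration (toy methods, not the actual RowSimilarityJob code):

```java
public class SampleRate {
    // Buggy form: Math.min(700, 1000) / 1000 is int division -> 0,
    // only then widened to the double 0.0.
    public static double buggy(int maxObservationsPerRow, int observationsPerRow) {
        return Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow;
    }

    // Fixed form: cast one operand to double before dividing,
    // so 700 / 1000 keeps its fractional value 0.7.
    public static double fixed(int maxObservationsPerRow, int observationsPerRow) {
        return (double) Math.min(maxObservationsPerRow, observationsPerRow)
            / observationsPerRow;
    }
}
```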
On Tue, Aug 13, 2013 at 11:58 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
Why do you think this?
On Tue
features for each vector so
that I can convert the dense vectors to sparse vectors.
Your thoughts on this are welcome.
Thanks,
Ashvini
On Mon, Aug 12, 2013 at 10:55 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
Aside from your issues with clusterdumper, the values you want can be had
from
The tasks that you need to do include:
a) group your history by user id
b) extract the features you want to use from each user history
c) repeat clustering and adjusting the scaling of your features until you
are happy
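Step (a) above could be sketched like this (hypothetical Event type, plain Java streams rather than anything Mahout-specific):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class HistoryGrouping {
    // One raw history event: which user touched which item.
    public record Event(String userId, String item) {}

    // Group raw events into one item list per user id, ready for
    // per-user feature extraction in the next step.
    public static Map<String, List<String>> byUser(List<Event> events) {
        return events.stream().collect(Collectors.groupingBy(
            Event::userId,
            Collectors.mapping(Event::item, Collectors.toList())));
    }
}
```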
If you have a few hundred examples of customers broken down by the
On Mon, Aug 12, 2013 at 12:52 PM, Martin, Nick nimar...@pssd.com wrote:
I'd love to contribute so I'll get on JIRA and sign up for the dev@mailing
list to start getting a feel for that process.
Sounds like you already know the drill.
Welcome!
item space. This should be a very easy change if my thinking is
correct.
On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
4) To add more metadata to the Solr output will be left to the consumer
Aside from your issues with clusterdumper, the values you want can be had
from a sparse vector using v.iterateNonZero() and v.norm(0).
The issue with clusterdumper is odd.
Are you saying that the display shows all the components of the vector? Or
that there is an in-memory representation that
Check out the streaming k-means code.
It provides capabilities for weighted samples.
On Sat, Aug 10, 2013 at 6:57 AM, William Moran echofo...@gmail.com wrote:
Hi,
How would I go about changing the weighting of certain words when preparing
data for kmeans?
Also, in clusterdumps I have
On Fri, Aug 9, 2013 at 12:30 PM, Matt Molek mpmo...@gmail.com wrote:
From some local IR precision/recall testing, I've found that user based
recommenders do better on my data, so I'd like to stick with user based if
I can. I know precision/recall measures aren't always that important when
From: Ted Dunning ted.dunn...@gmail.com
To: user@mahout.apache.org user@mahout.apache.org; Otis Gospodnetic
otis_gospodne...@yahoo.com
Sent: Wednesday, August 7, 2013 11:48 PM
Subject: Re: Is OnlineSummarizer mergeable?
Otis,
What statistics do you need?
What
That might slow down the job enormously for certain nasty inputs.
The more that I think about things, the more convinced I am that there
should be a post-processing pass to enforce things like not recommending
input items. The recommendation algorithm itself should not be distorted
to do this if
Mahout is a library. You can link against any version you like and still
have a perfectly valid Hadoop program.
On Wed, Aug 7, 2013 at 11:51 AM, Adam Baron adam.j.ba...@gmail.com wrote:
Suneel,
Unfortunately no, we're still on Mahout 0.7. My team is one of many teams
which share a
If you are doing a student project, it may be best for you to do this as a
separate github project that *depends* on Mahout rather than trying to
build a modification to Mahout in the first instance.
The reasons that I say this include:
a) the Apache process will probably be foreign to you at
On Thu, Aug 8, 2013 at 1:31 PM, Sushanth Bhat(MT2012147)
sushanth.b...@iiitb.org wrote:
One more doubt I have that do we need to start our project without Mahout
library, I mean just implementing algorithm?
I would suggest that Mahout would be very useful for your project.
Use Maven and
still a little too phat...which is what made me think of your
OnlineSummarizer as a possible, slimmer alternative.
Otis
Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
http://sematext.com/spm
From: Ted Dunning ted.dunn...@gmail.com
Rafal,
The major problems with these sorts of metrics with recommendations include
a) different algorithms pull up different data and you don't have any
deeply scored reference data. The problem is similar to search except
without test collections. There are some partial solutions to this
b)
On Wed, Aug 7, 2013 at 3:56 PM, John Meagher john.meag...@gmail.com wrote:
Continuous values are being used now in addition to a large set of
boolean flags. I think I could convert the continuous values to some
sort of bucketed values that could be represented as additional flags.
If that
On Wed, Aug 7, 2013 at 7:29 AM, cont...@dhuebner.com wrote:
This typically won't be fast enough if you have something like a random
forest, but if your final targeting model is logistic regression, it
probably will be fast enough.
So usually I do need to train a custom model for each user
It isn't as mergeable as I would like. If you have randomized record
selection, it should be possible, but perverse ordering can cause serious
errors.
It would be better to use something like a Q-digest.
http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf
On Wed, Aug 7, 2013 at
On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
4) To add more metadata to the Solr output will be left to the consumer
for now. If there is a good data set to use we can illustrate how to do it
in the project. Ted may have some data for this from musicbrainz.
I am
/ HBase -
http://sematext.com/spm
From: Ted Dunning ted.dunn...@gmail.com
To: user@mahout.apache.org user@mahout.apache.org
Sent: Wednesday, August 7, 2013 4:51 PM
Subject: Re: Is OnlineSummarizer mergeable?
It isn't as mergeable as I would like. If you
There is a considerable amount of discussion going on about a new edition
of Mahout in Action.
On Wed, Aug 7, 2013 at 12:36 PM, Piero Giacomelli pgiac...@gmail.com wrote:
Basically all my examples will be based on mahout 0.8. So for example the
k-means clustering will be used with the updated
By non-text, do you mean continuous values? Or sparse sets of tokens?
The general idea for Naive Bayes is that it requires input consisting of
sparse sets of tokens.
On Wed, Aug 7, 2013 at 2:00 PM, John Meagher john.meag...@gmail.com wrote:
I'm just starting work with Mahout and I'm
that create data structures that cannot be merged
Loss of accuracy that is not predictably small or configurable
Thank you,
Otis
Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
http://sematext.com/spm
From: Ted Dunning ted.dunn
On Tue, Aug 6, 2013 at 5:27 PM, Dominik Hübner cont...@dhuebner.com wrote:
I wonder how model based approaches might be scaled to a large number of
users. My understanding is that I would have to train some model like a
decision tree or naive bayes (or regression … etc.) for each user and do
Concur here. Obviously CrossRowSimilarityJob and RowSimilarityJob will be
able to share some down-stream code. But there are economies in RSJ that
probably can't apply to CRSJ.
On Mon, Aug 5, 2013 at 7:20 AM, Sebastian Schelter s...@apache.org wrote:
I think the downsampling belongs into