Hi,
the counts for two events are:
                    Event A    Everything but A
Event B             k11=7      k12=8
Everything but B    k21=13     k22=300,000
according to the code, I will get:
rowEntropy = entropy(7,8) + entropy(13, 300,000) = 222
colEntropy = entropy(7,13) + entropy(8, 300,000) = 152
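A minimal Python sketch that reproduces the two sums above. It assumes that entropy(...) is the unnormalized form N*log(N) - sum(k*log(k)) and that logs are base 2. Neither is stated in the message, but under those assumptions the values come out at roughly 222 and 152. The function name entropy mirrors the message; the base parameter and the helper x_log_x are hypothetical.

```python
import math

def entropy(*counts, base=2.0):
    """Unnormalized entropy of a list of counts: N*log(N) - sum(k*log(k))."""
    def x_log_x(x):
        # Convention: 0 * log(0) is treated as 0.
        return 0.0 if x == 0 else x * math.log(x, base)
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

# Counts from the table in the message.
k11, k12, k21, k22 = 7, 8, 13, 300_000

# Same grouping as in the message: per-row and per-column pairs.
row_entropy = entropy(k11, k12) + entropy(k21, k22)
col_entropy = entropy(k11, k21) + entropy(k12, k22)

print(round(row_entropy), round(col_entropy))
```

With natural logs instead (base=math.e), both sums shrink by a factor of ln 2, so the base matters when comparing against printed values.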
These events do sound 'similar'. They occur together about half the
time either one of them occurs. You might have many pairs that end up
being similar for the same reason, and this is not surprising. They're
all really similar.
The mapping here from LLR's range of [0,inf) to [0,1] is pretty
Good point.
By the way, why use counts instead of probabilities? Is it for an easier and
more efficient implementation?
Also, do you think the similarity score using counts might differ much
from the one using probabilities?
Thank you very much for your prompt reply.
On Wed, Apr 10, 2013 at 5:50 PM, Sean Owen
Thanks! I actually didn't know you can do that. :)
On Tue, Apr 9, 2013 at 7:22 PM, Johannes Schulte johannes.schu...@gmail.com
wrote:
dataPoints can be in memory or from disk, and you can sample the dataPoints
for initialClusters.
On Tue, Apr 9, 2013 at 6:16 PM, Johannes Schulte
Yes using counts is more efficient. Certainly it makes the LLR value
different since the results are not normalized; all the input values are N
times larger (N = sum k), and so the LLR is N times larger. 2x more events
in the same ratio will make the LLR 2x larger too.
That's just fine if you're
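The scaling claim can be checked numerically. The sketch below uses the standard G² log-likelihood ratio for a 2x2 table, written with unnormalized entropies and natural logs. It is an illustration under those assumptions, not necessarily the exact code being discussed, and the names x_log_x, entropy and llr are my own.

```python
import math

def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy: N*log(N) - sum(k*log(k))."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    row = entropy(k11 + k12, k21 + k22)   # entropy of the row sums
    col = entropy(k11 + k21, k12 + k22)   # entropy of the column sums
    mat = entropy(k11, k12, k21, k22)     # entropy of all four cells
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

base = llr(7, 8, 13, 300_000)
doubled = llr(14, 16, 26, 600_000)  # same ratios, twice the events
print(doubled / base)               # ratio is 2, up to round-off
```

Because every term has the form k*log(k) and the log(scale) parts cancel, multiplying all four counts by the same factor multiplies the LLR by that factor.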
Hi Jörn,
On Tuesday, April 09, 2013 10:12:47 PM Jörn Kottmann wrote:
Logistic Regression (is that similar to our maxent?)
Online Passive Aggressive
HMM
The datasets we train OpenNLP on are usually rather small and can
easily be processed with a single CPU; does Mahout support
Hi all,
I'm following the tutorials at
http://searchhub.org/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/
to
generate a vector file and a dictionary file from a solr index.
Does anyone know or have any links to a resource that describes how to use
these files for
BTW I have this working on trivial data and am in the process of measuring its
results on some real world data. It does a lot with DistributedRowMatrix and so
I'll be interested to see how it performs with a larger data set.
Does anyone know of a public data set that provides things like views
Doesn't the random part of random forest defend against overfitting?
-Original Message-
From: ey-chih chow [mailto:eyc...@gmail.com]
Sent: Saturday, April 06, 2013 5:45 PM
To: user@mahout.apache.org
Subject: Re: Classification Algorithms in Mahout
I actually got a lot of overfitting.
Thanks, we will try the MapReduce version as you described
From: Dan Filimon dangeorge.fili...@gmail.com
To: user@mahout.apache.org
Sent: Wednesday, April 10, 2013 1:19 PM
Subject: Re: In-memory kmeans clustering
Thanks! I actually didn't know you can do that.
Thanks for your response, I will give it a try. Our follow-up JIRA issue
is here:
https://issues.apache.org/jira/browse/OPENNLP-574
Jörn
On 04/10/2013 05:04 PM, Isabel Drost-Fromm wrote:
Hi Jörn,
On Tuesday, April 09, 2013 10:12:47 PM Jörn Kottmann wrote:
Logistic Regression (is that
FWIW, this paper talks about the equivalence of Logistic Regression and Maxent
http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf
From: Jörn Kottmann kottm...@gmail.com
To: user@mahout.apache.org
Sent: Wednesday, April 10, 2013 4:23 PM
Subject:
Hi everyone,
I am reaching out to the list requesting some help/advice on implementing
fold-in with the Alternating Least Squares algo in Mahout, a problem on
which I am stumped. I've read other posts on the list and over on SO, like:
For simplicity let's consider a brand-new user first, not a new rating
for existing user. I'll use the notation from my slides that you
mention, A = X * Y'. To clarify, I think you mean you have a new A_u
row, and want to know X_u.
The two expressions are not alternatives, they're the same thing,
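For the brand-new-user case in the notation A = X * Y', one common way to get X_u from a new row A_u is an ordinary least-squares solve against the fixed item factors, X_u = A_u * Y * (Y'Y)^-1. The sketch below is only that: it assumes two latent features, no regularization and no confidence weighting, and the function name fold_in_user and the toy numbers are hypothetical, not taken from Mahout.

```python
def fold_in_user(a_u, Y):
    """Least-squares solve for x_u in a_u ~= x_u * Y' with 2 latent features.

    a_u: list of the new user's values, one per item.
    Y:   list of (y0, y1) item-factor pairs, one per item.
    """
    # Gram matrix G = Y'Y (symmetric 2x2).
    g00 = sum(y[0] * y[0] for y in Y)
    g01 = sum(y[0] * y[1] for y in Y)
    g11 = sum(y[1] * y[1] for y in Y)
    # Right-hand side b = a_u * Y (length 2).
    b0 = sum(a * y[0] for a, y in zip(a_u, Y))
    b1 = sum(a * y[1] for a, y in zip(a_u, Y))
    det = g00 * g11 - g01 * g01
    if det == 0:
        raise ValueError("Y'Y is singular; item factors are degenerate")
    # x_u = b * G^-1, using the closed-form inverse of a 2x2 matrix.
    return ((b0 * g11 - b1 * g01) / det, (b1 * g00 - b0 * g01) / det)

# Toy item factors and a row generated exactly from x_u = (2, 3).
Y = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
a_u = [2.0, 3.0, 5.0]
x_u = fold_in_user(a_u, Y)
print(x_u)
```

A real fold-in would normally add the same regularization term used during training before inverting Y'Y.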
On Wed, Apr 10, 2013 at 10:38 AM, Pat Ferrel p...@occamsmachete.com wrote:
Does anyone know of a public data set that provides things like views and
purchases?
I don't.
Retail data may be hard to impossible, but one can improvise.
It seems to be fairly common to use Wikipedia articles (Myrrix, GraphLab).
Another idea is to use StackOverflow tags (Myrrix examples).
Although they are only good for emulating implicit feedback.
On Wed, Apr 10, 2013 at 6:48 PM, Ted
I have retail data but can't publish results from it. If I could get a public
sample I'd share how the technique worked out.
Not sure how to simulate this data. It has the important characteristic that
every purchase is also a view but not the other way around and Ted's technique
is a way to