What about using a distance metric like this one?
http://en.wikipedia.org/wiki/Normalized_Google_distance
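
For reference, here is a minimal sketch of the metric described on that page, assuming you already have occurrence counts from some corpus or search index. The function name and the example counts are made up for illustration; only the formula comes from the definition of the Normalized Google distance.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD between two terms, computed from raw counts.

    fx, fy : number of documents containing each term
    fxy    : number of documents containing both terms
    n      : total number of documents indexed
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if fxy == 0:
        # Terms never co-occur: the distance is unbounded.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical counts, not taken from any real index.
close = normalized_google_distance(fx=1000, fy=800, fxy=700, n=10**6)
far = normalized_google_distance(fx=1000, fy=800, fxy=5, n=10**6)
print(close, far)
```

Terms that almost always appear together get a distance near 0, and the distance grows as co-occurrence becomes rarer.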
From: Joel Nothman [joel.noth...@gmail.com]
Sent: 19 February 2014 22:50
To: scikit-learn-general
Subject: Re: [Scikit-learn-general] Logistic
2014-02-19 20:57 GMT+01:00 Pavel Soriano:
> I thought about using the values of the coefficients of the fitted logit
> equation to get a glimpse of which words in the vocabulary, or which style
> features, most affect the classification decision. Is it correct to
> assume that if the coeffici
Sounds like you're on the right path. Looking at the misclassified
documents and the feature coefficients is a common way to debug a
classifier, especially if you use boolean features.
If you're using a sklearn vectorizer this might be of interest to you:
http://stackoverflow.com/questions/669
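
Independent of that link, here is a rough sketch of the debugging loop described above: pair each coefficient with its vocabulary term, then list the misclassified documents. The toy corpus and labels are invented for illustration, and the choice of `CountVectorizer(binary=True)` with the `liblinear` solver is an assumption, not necessarily the original poster's setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, made up for illustration only (1 = contact-style text, 0 = other).
docs = [
    "call tel now",
    "tel number here",
    "tel and fax",
    "meeting at noon",
    "meeting notes attached",
    "project meeting today",
]
labels = [1, 1, 1, 0, 0, 0]

# binary=True produces boolean presence/absence features.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

# liblinear is one of the solvers that supports an L1 penalty.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, labels)

# vocabulary_ maps each term to its column index; for a two-class problem
# coef_[0] holds one weight per column.
weights = {term: clf.coef_[0][idx] for term, idx in vectorizer.vocabulary_.items()}
ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

# Print only the terms that L1 regularization left with a nonzero weight.
for term, weight in ranked:
    if weight != 0.0:
        print(f"{term:10s} {weight:+.3f}")

# Documents the fitted model gets wrong (here checked on the training set).
predicted = clf.predict(X)
misclassified = [doc for doc, y, p in zip(docs, labels, predicted) if y != p]
print(misclassified)
```

On real data you would inspect misclassified documents from a held-out set rather than the training set.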
It is correct to assume that a positive coefficient contributes positively
to a decision.
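
A small sketch of why the sign matters, assuming the usual logistic regression form where the score is the intercept plus the weighted sum of the features. The weights below are hypothetical numbers chosen for illustration, not values from any fitted model.

```python
import math


def decision_score(weights, intercept, features):
    """Linear score: intercept + sum of weight * feature value."""
    return intercept + sum(
        weights.get(name, 0.0) * value for name, value in features.items()
    )


def sigmoid(z):
    """Map a linear score to a probability for the positive class."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for two boolean word features.
weights = {"tel": 1.5, "meeting": -1.2}
intercept = -0.1

base = sigmoid(decision_score(weights, intercept, {}))
with_tel = sigmoid(decision_score(weights, intercept, {"tel": 1.0}))
with_meeting = sigmoid(decision_score(weights, intercept, {"meeting": 1.0}))

# A positive weight raises the positive-class probability when the feature
# is present; a negative weight lowers it.
print(base, with_tel, with_meeting)
```

Turning on a boolean feature with weight w multiplies the odds of the positive class by exp(w), so the sign of w decides the direction of the change.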
However, because the features are interdependent, the raw strength of a
feature isn't always straightforward to interpret. For example, the model might
give a big positive coefficient to "Tel" and a similar negative
Hello scikit!
I need some insights into what I am doing.
Currently I am building a two-class text classifier using word-level unigrams
and some writing-style features. I am using a logistic regression model with
L1 regularization. I get decent performance (around 0.70 F-measure) for the given