Hi, I've been looking at using sklearn.metrics.cluster.mutual_info_score as
a feature selection metric for language data, which is quite a common thing
to do in the NLP community. My problem is that it is really slow, because I
need to iterate over all the features in the dataset and pass them in one by
one to get the score for each feature. I implemented a version of the score
that works directly on matrices; the problem, however, is that I don't quite
get the same results out of the two functions. I've been looking at the
implementation of mutual_info_score, but I can't figure out what the outer
product is doing, so I can't tell whether my implementation has an error and,
if so, where. It is a lot faster, though.
>>> X.shape
(1448, 52641)
>>> %timeit np.array([mutual_info_score(y, X[:,i]) for i in range(X.shape[1])])
1 loops, best of 3: 23.6 s per loop
>>> %timeit mi2(X,y)
1 loops, best of 3: 1.22 s per loop
def mutual_info_score(labels_true, labels_pred, contingency=None):
    if contingency is None:
        labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
        contingency = contingency_matrix(labels_true, labels_pred)
    contingency = np.array(contingency, dtype='float')
    contingency_sum = np.sum(contingency)
    pi = np.sum(contingency, axis=1)
    pj = np.sum(contingency, axis=0)
    outer = np.outer(pi, pj)
    nnz = contingency != 0.0
    # normalized contingency
    contingency_nm = contingency[nnz]
    log_contingency_nm = np.log(contingency_nm)
    contingency_nm /= contingency_sum
    # log(a / b) should be calculated as log(a) - log(b) for
    # possible loss of precision
    log_outer = -np.log(outer[nnz]) + log(pi.sum()) + log(pj.sum())
    mi = (contingency_nm * (log_contingency_nm - log(contingency_sum))
          + contingency_nm * log_outer)
    return mi.sum()
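For context, here is a tiny example of what I *think* np.outer is doing there (my own reading, so take it with a grain of salt):

```python
import numpy as np

# Tiny contingency table: rows = one labelling, columns = the other.
contingency = np.array([[3., 1.],
                        [0., 2.]])
pi = contingency.sum(axis=1)   # row marginal counts n_i: [4., 2.]
pj = contingency.sum(axis=0)   # column marginal counts n_j: [3., 3.]
outer = np.outer(pi, pj)       # outer[i, j] = n_i * n_j
N = contingency.sum()          # == pi.sum() == pj.sum()

# -log(outer) + log(pi.sum()) + log(pj.sum()) then equals
# log(N**2 / (n_i * n_j)) = -log(p_i * p_j), i.e. the marginal-product
# term in MI = sum_ij p_ij * log(p_ij / (p_i * p_j)), kept in count
# space to avoid dividing the table early.
log_outer = -np.log(outer) + np.log(pi.sum()) + np.log(pj.sum())
```

If that reading is right, the outer product is just the table of marginal count products n_i * n_j, one entry per cell of the contingency table.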
def mi2(X, y):
    classes = np.unique(y)
    num_classes = classes.shape[0]
    mi_f = np.zeros((X.shape[1], num_classes))
    c_all = float(np.sum(X))  # total count (currently unused)
    f_counts = np.sum(X, axis=0)
    p_f = f_counts / float(X.shape[0])  # marginal feature probabilities
    num_docs = float(X.shape[0])
    for i, cls in enumerate(classes):
        cls_mask = y == cls
        cls_count = float(np.sum(cls_mask))
        p_fc = np.sum(X[cls_mask], axis=0) / num_docs  # joint p(feature, class)
        p_c = cls_count / num_docs                     # marginal p(class)
        mi_partial = p_fc * np.log(p_fc / (p_f * p_c))
        mi_f[:, i] = np.nan_to_num(mi_partial)  # zero out 0 * log(0) cells
    return np.sum(mi_f, axis=1)
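My current suspicion about the discrepancy (unverified): mutual_info_score sums over *all* cells of each contingency table, i.e. both the feature-present and feature-absent counts, whereas mi2 only accumulates the "present" cells. A vectorized sketch that accumulates both cells per class, assuming X is a dense 0/1 indicator matrix, would look something like:

```python
import numpy as np

def mi_vectorized(X, y):
    """Sketch: MI of each binary column of X with the labels y.

    Assumes X is a dense 0/1 indicator matrix of shape
    (n_samples, n_features). Unlike mi2, this accumulates both rows of
    each 2 x n_classes contingency table (feature present *and* absent),
    which is what contingency_matrix builds for mutual_info_score.
    """
    n = float(X.shape[0])
    n_f = X.sum(axis=0)                   # docs containing each feature
    mi = np.zeros(X.shape[1])
    for cls in np.unique(y):
        mask = (y == cls)
        n_c = float(mask.sum())           # docs in this class
        n_fc = X[mask].sum(axis=0)        # feature present AND class == cls
        # Two joint cells for this class: (present, cls) and (absent, cls).
        for n_joint, n_marg in ((n_fc, n_f), (n_c - n_fc, n - n_f)):
            with np.errstate(divide='ignore', invalid='ignore'):
                term = (n_joint / n) * np.log(n * n_joint / (n_marg * n_c))
            mi += np.nan_to_num(term)     # 0 * log(0) cells contribute 0
    return mi
```

On a perfectly correlated binary feature this gives log(2) nats, and 0 for an independent one, which matches what mutual_info_score should produce, but I haven't checked it against the full dataset.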
Can anyone clarify how the current implementation of mutual_info_score
works, and is there a way of making it (a lot) faster?
p.s. I've also tried
sklearn.metrics.cluster.mutual_info_score_fast.expected_mutual_information,
but that doesn't help with the speed issue either.
>>> %timeit conts = [contingency_matrix(y, X[:,i]) for i in range(X.shape[1])]; [expected_mutual_information(c, 1448) for c in conts]
1 loops, best of 3: 35.9 s per loop
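As an aside on the speed question: the per-feature loop over contingency_matrix could perhaps be avoided entirely, since the joint counts for every feature at once are just a matrix product with a one-hot label matrix. A sketch with made-up data:

```python
import numpy as np

# Toy binary document-feature matrix and class labels.
X = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 1, 0],
              [0, 1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

# One-hot encode the labels: Y[d, c] = 1 iff document d is in class c.
# (This indexing trick assumes labels are 0..k-1, as they are here.)
Y = np.eye(len(np.unique(y)))[y]

# counts[c, f] = number of documents in class c where feature f is present,
# i.e. the 'present' row of every per-feature contingency table at once.
counts = np.dot(Y.T, X)
```

With a scipy.sparse X the same product should stay sparse, so this might also help at the 52k-feature scale above.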
--------------------------------
Matti Lyra
DPhil Student
Text Analytics Group
Chichester 1, R203
School of Engineering and Informatics
University of Sussex
Brighton, UK
[email protected]
Tel: +441273 872956
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general