Hi, I've been looking at using sklearn.metrics.cluster.mutual_info_score as a
feature selection metric for language data, which is quite common in the NLP
community. My problem is that it's really slow: I have to iterate over all the
features in the dataset and pass them to the function one by one to get a
score per feature. I implemented a version of the score that works directly on
matrices; the problem, however, is that I don't quite get the same results out
of the two functions. I've been looking at the implementation of
mutual_info_score, but I can't figure out what the outer product is doing, so
I can't tell whether my implementation has an error and, if so, where. It is a
lot faster though.

>>> X.shape
(1448, 52641)

>>> %timeit np.array([mutual_info_score(y, X[:,i]) for i in range(X.shape[1])])
1 loops, best of 3: 23.6 s per loop

>>> %timeit mi2(X,y)
1 loops, best of 3: 1.22 s per loop

def mutual_info_score(labels_true, labels_pred, contingency=None):
    if contingency is None:
        labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
        contingency = contingency_matrix(labels_true, labels_pred)
    contingency = np.array(contingency, dtype='float')
    contingency_sum = np.sum(contingency)
    pi = np.sum(contingency, axis=1)
    pj = np.sum(contingency, axis=0)
    outer = np.outer(pi, pj)
    nnz = contingency != 0.0
    # normalized contingency
    contingency_nm = contingency[nnz]
    log_contingency_nm = np.log(contingency_nm)
    contingency_nm /= contingency_sum
    # log(a / b) should be calculated as log(a) - log(b) to avoid
    # possible loss of precision
    log_outer = -np.log(outer[nnz]) + log(pi.sum()) + log(pj.sum())
    mi = (contingency_nm * (log_contingency_nm - log(contingency_sum))
          + contingency_nm * log_outer)
    return mi.sum()
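If I understand the code above correctly (please correct me if not),
np.outer(pi, pj)[i, j] is just pi[i] * pj[j], the product of the two marginal
counts, so outer / N**2 is the joint distribution you would expect if the two
labelings were independent. Here is the toy example I used to check that
reading; this is my own sketch with names of my choosing, not library code:

```python
import numpy as np

# Toy contingency matrix: rows = values of labels_true, cols = labels_pred.
contingency = np.array([[4.0, 1.0],
                        [2.0, 3.0]])
N = contingency.sum()

pi = contingency.sum(axis=1)   # marginal counts of labels_true
pj = contingency.sum(axis=0)   # marginal counts of labels_pred
outer = np.outer(pi, pj)       # outer[i, j] = pi[i] * pj[j]

# Direct formula: MI = sum_ij p_ij * log(p_ij / (p_i * p_j)),
# where outer / N**2 is the joint distribution under independence.
p_ij = contingency / N
mi_direct = np.sum(p_ij * np.log(p_ij / (outer / N ** 2)))

# sklearn-style computation with the log decomposition; pi.sum() and
# pj.sum() both equal N, so log_outer = log(N**2 / (pi[i] * pj[j])).
nnz = contingency != 0.0
log_outer = -np.log(outer[nnz]) + np.log(pi.sum()) + np.log(pj.sum())
mi_sklearn = np.sum((contingency[nnz] / N)
                    * (np.log(contingency[nnz]) - np.log(N) + log_outer))

# mi_direct and mi_sklearn agree on this example
```

So, as far as I can tell, the outer product is only there to get the product
of marginals for every cell in one vectorized step.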


def mi2(X, y):
    classes = np.unique(y)
    num_classes = classes.shape[0]
    mi_f = np.zeros((X.shape[1], num_classes))
    num_docs = float(X.shape[0])
    p_f = np.sum(X, axis=0) / num_docs      # P(feature)
    for i, cls in enumerate(classes):
        cls_mask = y == cls
        p_c = np.sum(cls_mask) / num_docs   # P(class)
        p_fc = np.sum(X[cls_mask], axis=0) / num_docs  # P(feature, class)
        mi_partial = p_fc * np.log(p_fc / (p_f * p_c))
        mi_f[:, i] = np.nan_to_num(mi_partial)
    return np.sum(mi_f, axis=1)
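One thing I suspect about the discrepancy: mi2 only sums over the "feature
present" cells of each feature's contingency table, whereas mutual_info_score
builds the full table from the column values (including the zero entries), so
the two won't agree in general even for binary data. Here is a vectorized
sketch that also counts the "feature absent" cells, assuming X is a dense 0/1
indicator matrix (mi_all_features is just my name for it, and I've only
checked it on small examples):

```python
import numpy as np

def mi_all_features(X, y):
    # Vectorized MI between every binary feature column of X and the
    # labels y. Assumes X is a dense 0/1 indicator matrix.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    num_docs = float(X.shape[0])
    p_f = X.sum(axis=0) / num_docs            # P(feature present)
    mi = np.zeros(X.shape[1])
    for cls in np.unique(y):
        cls_mask = y == cls
        p_c = cls_mask.sum() / num_docs       # P(class)
        p_fc = X[cls_mask].sum(axis=0) / num_docs   # P(present, class)
        p_nfc = p_c - p_fc                    # P(absent, class)
        # Sum p * log(p / (p_marginal * p_c)) over both cells of the
        # feature axis, skipping empty cells (their contribution is 0).
        for p_joint, p_marg in ((p_fc, p_f), (p_nfc, 1.0 - p_f)):
            nz = p_joint > 0
            mi[nz] += p_joint[nz] * (np.log(p_joint[nz])
                                     - np.log(p_marg[nz] * p_c))
    return mi
```

This keeps a single loop over the classes, so it should scale like mi2 rather
than like the per-feature loop over mutual_info_score.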


Can anyone clarify how the current implementation of mutual_info_score works,
and is there a way of making it (a lot) faster?

P.S. I've also tried
sklearn.metrics.cluster.mutual_info_score_fast.expected_mutual_information,
but that doesn't help with the speed issue either.

>>> %timeit conts = [contingency_matrix(y, X[:,i]) for i in range(X.shape[1])]; [expected_mutual_information(c, 1448) for c in conts]
1 loops, best of 3: 35.9 s per loop


--------------------------------
Matti Lyra
DPhil Student

Text Analytics Group
Chichester 1, R203 
School of Engineering and Informatics
University of Sussex
Brighton, UK
[email protected]
Tel: +441273 872956




_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general