[Scikit-learn-general] SMOTE implementation for over-sampling

Karsten Jeschkies Sat, 24 Nov 2012 07:46:34 -0800

Hi,

I've written a quick implementation of the SMOTE over-sampling algorithm (
http://arxiv.org/abs/1106.1813).


I haven't found any in Python. Here it is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
Created on 24.11.2012

@author: karsten

This is an implementation of the SMOTE Algorithm.
See: "SMOTE: synthetic minority over-sampling technique" by
Chawla, N.V et al.
'''

import numpy as np
from random import randrange, choice
from sklearn.neighbors import NearestNeighbors

def SMOTE(T, N, k):
    """
    Returns (N/100) * n_minority_samples synthetic minority samples.

    Parameters
    ----------
    T : array-like, shape = [n_minority_samples, n_features]
        Holds the minority samples
    N : percetange of new synthetic samples:
        n_synthetic_samples = N/100 * n_minority_samples. Can be < 100.
    k : int. Number of nearest neighbours.

    Returns
    -------
    S : array, shape = [(N/100) * n_minority_samples, n_features]
    """
    n_minority_samples, n_features = T.shape

    if N < 100:
        #create synthetic samples only for a subset of T.
        #TODO: select random minortiy samples
        N = 100
        pass

    if (N % 100) != 0:
        raise ValueError("N must be < 100 or multiple of 100")

    N = N/100
    n_synthetic_samples = N * n_minority_samples
    S = np.zeros(shape=(n_synthetic_samples, n_features))

    #Learn nearest neighbours
    neigh = NearestNeighbors(n_neighbors = k)
    neigh.fit(T)

    #Calculate synthetic samples
    for i in xrange(n_minority_samples):
        nn = neigh.kneighbors(T[i], return_distance=False)
        for n in xrange(N):
            nn_index = choice(nn[0])
            #NOTE: nn includes T[i], we don't want to select it
            while nn_index == i:
                nn_index = choice(nn[0])

            dif = T[nn_index] - T[i]
            gap = np.random.random()
            S[n + i * N, :] = T[i,:] + gap * dif[:]

    return S


I am going to test it on some bigger data.
There are a few things that might need to change: choice is the Python
implemententation. I did not find the numpy equivalent. Also, what
does neigh.kneighbors really return? I thought it was the indices of the
nearest neighbours but somehow it is an array of an array.
Any other suggestions?

Oh, if it's of any use for anybody feel free to use it or even add it to
scikit. Just tell me if you find any errors or what I should change to make
it more suitable for any API.

best regards,
Karsten

------------------------------------------------------------------------------
Monitor your physical, virtual and cloud infrastructure from a single
web console. Get in-depth insight into apps, servers, databases, vmware,
SAP, cloud infrastructure, etc. Download 30-day Free Trial.
Pricing starts from $795 for 25 servers or applications!
http://p.sf.net/sfu/zoho_dev2dev_nov

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

[Scikit-learn-general] SMOTE implementation for over-sampling

Reply via email to