Hi Andreas,

you can find an extensive description of these techniques in this doctoral
thesis by a friend of mine at Oxford University (pp. 99-105), together
with the appropriate references.

http://people.maths.ox.ac.uk/tsanas/Preprints/DPhil%20thesis.pdf

Let me provide you with a brief summary:

- mRMR is an algorithm that computes the mutual information across the
features of a dataset (redundancy), as well as the mutual information between
each feature and the target class (relevance), and then sequentially orders
the features by building sets that maximize relevance and minimize
redundancy. Therefore, the first feature will be the one with the highest
relevance and lowest redundancy, the first and second features together will
be the next set maximizing relevance and minimizing redundancy, and so on...
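If it helps, here's a rough sketch of that greedy step in Python, using
scikit-learn's mutual information estimators. The function name mrmr is
mine, the score here is the standard "relevance minus mean redundancy"
(MID) variant, and it assumes discrete (or pre-discretized) features --
it's just to show the idea, not a reference implementation:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr(X, y, k):
    """Greedy mRMR: at each step, add the feature that maximizes
    relevance (MI with the class) minus mean redundancy (average MI
    with the features selected so far)."""
    n_features = X.shape[1]
    # relevance of each feature with respect to the class
    # (discrete_features=True because we assume discretized inputs)
    relevance = mutual_info_classif(X, y, discrete_features=True)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        remaining = [f for f in range(n_features) if f not in selected]
        scores = [relevance[f]
                  - np.mean([mutual_info_score(X[:, f], X[:, s])
                             for s in selected])
                  for f in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```

So on a toy dataset where feature 1 is an exact duplicate of feature 0,
the duplicate gets heavily penalized for redundancy after feature 0 is
picked, even if its relevance is identical.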

- The Gram-Schmidt orthogonalization (GSO) instead selects features based on
their ability to linearly "explain" the class. Briefly, consider an NxM
input matrix IN, where N is the number of samples and M the number of
features, and an Nx1 output vector OUT; then:
1) find the feature F (an Nx1 column of IN) which is most aligned with OUT
2) project the remaining columns of IN onto the orthogonal complement of F,
obtaining a new input matrix IN2 of size Nx(M-1), and create a new output
vector OUT2 (Nx1) from what F could not explain of OUT
3) repeat the process
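Steps 1-3 can be sketched with plain numpy along these lines (gso_rank is
my naming, and this is only the basic idea with no numerical safeguards
beyond a small epsilon, not an optimized implementation):

```python
import numpy as np

def gso_rank(X, y, k):
    """Rank k features by Gram-Schmidt orthogonalization: repeatedly
    pick the feature most aligned with the residual target, then
    remove its direction from the target and the remaining features."""
    X = X.astype(float).copy()
    y = y.astype(float).copy()
    remaining = list(range(X.shape[1]))
    order = []
    for _ in range(k):
        # |cosine| of the angle between each remaining feature and y
        norms = np.linalg.norm(X[:, remaining], axis=0)
        align = np.abs(X[:, remaining].T @ y) / (norms * np.linalg.norm(y) + 1e-12)
        best = remaining[int(np.argmax(align))]
        f = X[:, best]
        u = f / (np.linalg.norm(f) + 1e-12)  # unit vector along F
        order.append(best)
        remaining.remove(best)
        # project y and the remaining features onto the orthogonal
        # complement of F (what F could not explain)
        y = y - (u @ y) * u
        for j in remaining:
            X[:, j] = X[:, j] - (u @ X[:, j]) * u
    return order
```

For example, if the target is essentially 3*x0 + x1 plus noise, the
ranking comes out as feature 0 first, then feature 1.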

- Relief is based on a nearest-neighbour approach. Basically, for T
iterations you select a random instance of the NxM dataset (defined as
before), then identify the instance closest to it in feature space with the
same class (the Hit) and the instance closest to it with a different class
(the Miss). Each of the M features is assigned a positive relevance if it
makes the chosen instance close to the Hit and far from the Miss, and a
negative relevance in the opposite case.
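A minimal version of that update rule might look like this (my naming
again; the range normalization is the usual convention in the Relief
literature, and this sketch skips the refinements of ReliefF such as
averaging over several neighbours):

```python
import numpy as np

def relief(X, y, n_iter=100, rng=None):
    """Basic Relief for a two-class problem: the weight of feature j
    increases when the sampled instance is closer to its nearest Hit
    than to its nearest Miss along j, and decreases otherwise."""
    rng = np.random.default_rng(rng)
    n, m = X.shape
    span = X.max(axis=0) - X.min(axis=0)  # per-feature range
    span[span == 0] = 1.0                 # guard against constant features
    w = np.zeros(m)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # exclude the instance itself
        same = (y == y[i])
        hit = int(np.argmin(np.where(same, dist, np.inf)))
        miss = int(np.argmin(np.where(~same, dist, np.inf)))
        # reward features that separate the Miss, penalize those
        # that separate the Hit
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / (span * n_iter)
    return w
```

On data where one feature separates the classes and another is pure
noise, the first feature ends up with a clearly larger weight.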

These three algorithms are quite well known and applied in a variety of
scenarios, as you can see from the DPhil thesis I referenced above.

mRMR is quite general purpose, GSO works well if you plan on using a
discriminant model afterwards, and Relief combines nicely with kNN.

Let me know if you guys need further details!

Andrea
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general