On Thu, Jun 07, 2012 at 03:09:11PM +0000, LI Wei wrote:
> Intuitively maybe we can set the missing values using the average over the
> nearest neighbors calculated using these existing features? Not sure
> whether it is the correct way to do it :-)

That's known as "imputation" (or in a particular variant, "k-NN impute").
In general, how you treat missing values will depend a lot on your
statistical assumptions, and thus it would be very unwise to have a "one
size fits all" approach to handling missing data, at least without
qualifying it as based on one assumption or another.
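
For concreteness, here is a rough NumPy sketch of the k-NN impute idea from
the quoted question. It is only an illustration under my own assumptions:
missing entries are encoded as NaN, distances are mean squared differences
over the features two rows both observed, and the name knn_impute and the
default k=3 are made up for this example rather than taken from any library.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs with the mean of that feature over the k nearest rows."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    for i in np.flatnonzero(missing.any(axis=1)):
        # Compare row i with every row on the features both have observed.
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            dist = (diff ** 2).sum(axis=1) / n_shared
        dist[n_shared == 0] = np.inf  # no shared features to compare on
        dist[i] = np.inf              # a row is not its own neighbour
        order = np.argsort(dist)
        for j in np.flatnonzero(missing[i]):
            # Nearest rows that actually observed feature j.
            donors = [r for r in order
                      if not missing[r, j] and np.isfinite(dist[r])][:k]
            if donors:
                filled[i, j] = X[donors, j].mean()
    return filled

X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.0],
              [0.9, 1.9, 5.0],
              [8.0, 9.0, 50.0]])
X_filled = knn_impute(X, k=2)
```

With k=2 the missing entry in the first row is filled from the two rows that
sit close to it, and the distant fourth row is ignored. Entries with no
usable neighbour are left as NaN rather than guessed.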

The relevant assumptions play a role similar to the
independent-and-identically-distributed assumption. They are "missing at
random" (MAR: given the features you did observe, the probability that a
feature is missing does not depend on that feature's own unobserved value)
and "missing completely at random" (MCAR: the probability that a given
feature is missing is independent of ALL the feature values for that
training case, observed or not).
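
A small simulation may make the two assumptions easier to tell apart. This
is a sketch on synthetic data of my own invention: x0 is always observed, x1
is the feature that sometimes goes missing, and the drop probabilities (0.3,
0.5, 0.1) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x0 = rng.normal(size=n)       # always observed
x1 = x0 + rng.normal(size=n)  # sometimes missing, correlated with x0

# MCAR: every x1 entry is dropped with the same probability.
mcar_missing = rng.random(n) < 0.3

# MAR: the drop probability depends only on the observed feature x0.
mar_missing = rng.random(n) < np.where(x0 > 0, 0.5, 0.1)

def missing_rate_by_group(missing, group):
    """Fraction of missing x1 entries inside and outside `group`."""
    return missing[group].mean(), missing[~group].mean()

mcar_rates = missing_rate_by_group(mcar_missing, x0 > 0)
mar_rates = missing_rate_by_group(mar_missing, x0 > 0)
```

Under MCAR the two rates should come out roughly equal; under MAR they
differ, but only through x0, which you can always see and condition on.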

In the case of neural networks, for MAR or MCAR data, simply setting the
feature to zero is not completely crazy, especially when doing stochastic
gradient descent: the gradient for the first-layer weights attached to that
input gets multiplied by that zero, so plain SGD leaves those weights
untouched for that specific training case. In fact, artificially introducing
zeros ("masking noise") is a neat way to encourage robustness for some
problems even when you don't have missing data. For not-missing-at-random
data you'd need to modify the cost function to incorporate your model of how
frequently and when things drop out, and probably estimate the parameters of
that model simultaneously with the MLP parameters -- not something you can
really prepackage.
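
To illustrate the zero-filling argument, here is a minimal sketch in which a
single logistic unit stands in for the first layer of an MLP (an assumption
made for brevity). The function names, the learning rate and the NaN
encoding of missing values are all mine, not part of any library.

```python
import numpy as np

def sgd_step(w, b, x, y, lr=0.1):
    """One plain SGD step for a single logistic unit on one training case."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    grad_w = (p - y) * x  # each weight's gradient is scaled by its input
    grad_b = p - y
    return w - lr * grad_w, b - lr * grad_b

def zero_fill(x):
    """Replace missing (NaN) features with zero."""
    return np.where(np.isnan(x), 0.0, x)

def masking_noise(x, drop_prob, rng):
    """Zero each feature independently with probability drop_prob."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, np.nan, 2.0])  # second feature is missing
w_new, b_new = sgd_step(w, 0.0, zero_fill(x), y=1.0)

# Masking noise: corrupt a fully observed input before the update.
rng = np.random.default_rng(0)
x_noisy = masking_noise(np.array([1.0, 3.0, 2.0]), drop_prob=0.5, rng=rng)
```

After the step, the weight on the missing feature is unchanged while the
other weights and the bias have moved. Note that this only holds for plain
SGD; weight decay or momentum would still move that weight.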

David
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general