Hi John.
I think there is no doubt that making use of missing values is
beneficial in real applications.
Also, you are right, decision trees are particularly good for handling them.
This is more about implementation and API issues.
Not raising an error doesn't mean it does something useful. We raise an
error because we don't have
an implementation available.
I haven't really caught up with Gilles' rewrite in #2131, but that might
make implementation easier.
Your way of wanting to split on nan is not the only way to handle
missing values, though.
An alternative is to go down both sides of the test. Which is more
appropriate depends on the data, I guess.
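To make the "go down both sides" alternative concrete, here is a minimal sketch in plain Python. It is not how sklearn's trees are implemented; the dict-based tree, the `n_left`/`n_right` counts, and the `predict` helper are all hypothetical and only illustrate the idea. When the tested feature is NaN, the prediction blends both subtrees, weighted by how many training samples went each way.

```python
import math

# Hypothetical toy tree: internal nodes test x[feature] <= threshold.
# n_left / n_right are assumed training-sample counts, used as blend weights.
TREE = {
    "feature": 0, "threshold": 5.0, "n_left": 30, "n_right": 10,
    "left": {"value": 100.0},
    "right": {"value": 200.0},
}


def predict(node, x):
    """Predict for one sample, descending both branches when the tested value is NaN."""
    if "value" in node:  # leaf node
        return node["value"]
    v = x[node["feature"]]
    if math.isnan(v):
        # Missing value: blend both subtrees, weighted by training counts.
        total = node["n_left"] + node["n_right"]
        return (node["n_left"] * predict(node["left"], x)
                + node["n_right"] * predict(node["right"], x)) / total
    child = node["left"] if v <= node["threshold"] else node["right"]
    return predict(child, x)
```

A sample with an observed value follows a single path as usual; only samples with a NaN in the tested feature get the weighted blend.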
There are also general open questions on how to encode missing labels,
though. As most estimators don't
support working with missing values, I think we mostly thought about
imputation.
Representing missing values as NaN might create significant overhead. Also,
it might hide problems that users had in preprocessing. So we have to
think about whether that is really the best way.
If you can provide a PR (relative to Gilles' branch) that handles splits
on NaN optionally, but does not have a
negative impact on datasets without missing values, I'm pretty sure our
expert tree growers would appreciate that :)
Andy
On 07/16/2013 12:50 AM, John Prior wrote:
In the course of trying to build a model to predict home prices, I
replaced missing values (nan's) with -inf's in order to allow a
regression tree (RandomForestRegressor) to split the missing values
into their own branch, and then I encountered this bug/error.
Exception ValueError: ValueError('Attempting to find a split with an
empty sample_mask',) in 'sklearn.tree._tree.Tree.recursive_partition'
ignored
I found a relevant bug thread, but the agreed solution seemed to be to
throw exceptions whenever a feature value is not finite.
I disagree with this strategy.
Often missing values represent a systematic, structural behavior of
the system being modeled. Having an algorithm that explicitly
recognizes this fact would be very useful.
From a user perspective, it would be helpful and easy to have
regressors/classifiers that handle "nan"s or missing values without
having to do special data pre-processing. In particular, CART is well
suited to overcome the "problem" of missing data; i.e. missing values
"should be" grouped together and considered as an alternate candidate
for a node split.
The alternatives of imputation and throwing out samples (brushing) are
not as attractive as explicit splits on "nan" or "-inf" since they
bias the raw data and assume that there was no "meaning" in the pattern
of missing values. Sometimes that "nan" really means something
important, which is not easy to represent with medians, means, or even
model-imputed values. Adding a "has_finite_value" feature doesn't help
because the original feature still will have imputed values in it,
which could corrupt the information that the finite feature values
encode.
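A small sketch of that concern, in plain Python rather than any sklearn API: the helper `impute_with_indicator` is hypothetical, and median imputation is just one assumed choice. It fills missing entries with the median and returns a "has_finite_value" column alongside.

```python
import math
from statistics import median


def impute_with_indicator(column):
    """Median-impute non-finite entries and return a has_finite_value flag per row."""
    finite = [v for v in column if math.isfinite(v)]
    fill = median(finite)  # assumed imputation strategy
    imputed = [v if math.isfinite(v) else fill for v in column]
    has_finite_value = [1 if math.isfinite(v) else 0 for v in column]
    return imputed, has_finite_value


imputed, flag = impute_with_indicator([1.0, float("nan"), 3.0, 10.0])
```

Here the missing entry is filled with 3.0, so in the imputed column it looks exactly like the row that really measured 3.0; only the separate flag column tells them apart.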
If exceptions are thrown when a "nan" is encountered, then sklearn
would force the user to use imputation or brushing, which I think
ignores "real life data" reality. CART is one of the few algorithms
that could gracefully handle these exceptions and actually produce
better models through this explicit acknowledgement that "missing
values happen". It would be a shame to lose this opportunity to
advance the state-of-the-art in modeling by just throwing exceptions
when nan's are encountered; it would be far better to enhance the
algorithm to be able to split on the decision "is nan".
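As a rough sketch of what an "is nan" split could look like for one feature of a regression tree (plain Python, not sklearn's splitter; `sse`, `best_split`, and the choice to send NaN rows to the right child for ordinary threshold splits are all assumptions made for illustration):

```python
import math


def sse(ys):
    """Sum of squared errors around the mean (0.0 for an empty group)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(xs, ys):
    """Return (kind, threshold, cost) of the cheapest single split, or None."""
    candidates = []
    # Candidate 1: the decision "is nan" (missing rows vs. observed rows).
    missing = [y for x, y in zip(xs, ys) if math.isnan(x)]
    present = [y for x, y in zip(xs, ys) if not math.isnan(x)]
    if missing and present:
        candidates.append(("is_nan", None, sse(missing) + sse(present)))
    # Candidate 2: ordinary thresholds between consecutive observed values.
    # Assumption: NaN rows fall to the right, since "nan <= t" is False.
    observed = sorted({x for x in xs if not math.isnan(x)})
    for lo, hi in zip(observed, observed[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if not x <= t]
        candidates.append(("threshold", t, sse(left) + sse(right)))
    return min(candidates, key=lambda c: c[2], default=None)


nan = float("nan")
kind, threshold, cost = best_split([1.0, 2.0, nan, nan, 3.0],
                                   [10.0, 11.0, 50.0, 52.0, 12.0])
```

In this toy data the rows with a missing feature have much higher targets, so the "is_nan" candidate wins over every numeric threshold.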
I hope this makes sense. It certainly would make my life easier!
Thanks!
------------------------------------------------------------------------------
See everything from the browser to the database with AppDynamics
Get end-to-end visibility with application monitoring from AppDynamics
Isolate bottlenecks and diagnose root cause in seconds.
Start your free trial of AppDynamics Pro today!
http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general