Hi John.
I think there is no doubt that making use of missing values is beneficial in real applications.
Also, you are right, decision trees are particularly good for handling them.
This is more about implementation and API issues.
Not raising an error wouldn't mean the estimator does something useful with missing values; we raise one because no implementation is available.


I haven't really caught up with Gilles' rewrite in #2131, but that might make the implementation easier. Splitting on NaN is not the only way to handle missing values, though: an alternative is to send the sample down both sides of the test. Which is more appropriate probably depends on the data.
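To make the "go down both sides" alternative concrete, here is a minimal sketch at prediction time, using a made-up dict-based node representation (not scikit-learn's internal `Tree` structure): a sample with a missing feature descends both children and the predictions are averaged.

```python
import numpy as np

def predict_both_sides(node, x):
    """Predict with a toy tree; a NaN feature descends both subtrees,
    and the two subtree predictions are averaged (unweighted here;
    a real implementation would weight by training samples per side)."""
    if not isinstance(node, dict):   # leaf: a plain value
        return node
    v = x[node["feature"]]
    if np.isnan(v):                  # missing value: follow both children
        return 0.5 * (predict_both_sides(node["left"], x)
                      + predict_both_sides(node["right"], x))
    branch = "left" if v <= node["threshold"] else "right"
    return predict_both_sides(node[branch], x)

# one split on feature 0 at threshold 1.0, with leaf values 10 and 20
tree = {"feature": 0, "threshold": 1.0, "left": 10.0, "right": 20.0}
print(predict_both_sides(tree, np.array([0.5])))     # goes left -> 10.0
print(predict_both_sides(tree, np.array([np.nan])))  # both sides -> 15.0
```

The "split on NaN" strategy would instead add a third, dedicated branch (or assign NaNs to one side) at training time.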

There are also general open questions about how to encode missing values, though. Since most estimators don't support working with missing values, we have mostly thought about imputation so far. Representing missing values as NaN might create significant overhead, and it might also hide preprocessing problems on the user's side. So we have to think about whether that is really the best way.
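For comparison, the imputation route mentioned above can be sketched in plain numpy (no scikit-learn API assumed): fill NaNs with the per-column median and, optionally, append indicator columns so the missingness pattern itself is not silently discarded.

```python
import numpy as np

def impute_median(X, add_indicator=True):
    """Replace NaNs with per-column medians; optionally append a 0/1
    missing-value indicator column per feature."""
    X = X.astype(float).copy()
    mask = np.isnan(X)
    medians = np.nanmedian(X, axis=0)          # medians ignoring NaNs
    X[mask] = np.take(medians, np.nonzero(mask)[1])
    if add_indicator:
        return np.hstack([X, mask.astype(float)])
    return X

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])
print(impute_median(X))
# columns 0-1: imputed data, columns 2-3: missingness indicators
```

This keeps the "NaN carries information" signal as explicit features, at the cost of the biases John describes below.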

If you can provide a PR (relative to Gilles' branch) that optionally handles splits on NaN without a negative impact on datasets that have no missing values, I'm pretty sure our expert tree growers would appreciate it :)

Andy


On 07/16/2013 12:50 AM, John Prior wrote:
In the course of trying to build a model to predict home prices, I replaced missing values (NaNs) with -inf so that a regression tree (RandomForestRegressor) would split the missing values into their own branch, and then I encountered this bug/error:

Exception ValueError: ValueError('Attempting to find a split with an empty sample_mask',) in 'sklearn.tree._tree.Tree.recursive_partition' ignored
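(The replacement step above amounts to something like the following numpy sketch: with every split test of the form "x <= threshold", -inf guarantees missing values always go to the left branch.)

```python
import numpy as np

X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])

# replace NaNs with -inf so "x <= threshold" always routes them left
X_filled = np.where(np.isnan(X), -np.inf, X)
print(X_filled)
```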

I found a relevant bug thread, but the agreed solution seemed to be to throw an exception whenever a feature value is not finite.

I disagree with this strategy.

Often missing values represent a systematic, structural behavior of the system being modeled. Having an algorithm that explicitly recognizes this fact would be very useful.

From a user perspective, it would be helpful and easy to have regressors/classifiers that handle NaNs (missing values) without requiring special data preprocessing. In particular, CART is well suited to overcome the "problem" of missing data: missing values "should be" grouped together and considered as an alternate candidate for a node split.

The alternatives of imputation and throwing out samples (brushing) are not as attractive as explicit splits on "nan" or "-inf", since they bias the raw data and assume that there was no "meaning" in the pattern of missing values. Sometimes a "nan" really means something important, which is not easy to represent with medians, means, or even model-imputed values. Adding a "has_finite_value" feature doesn't help either, because the original feature will still contain imputed values, which could corrupt the information that the finite feature values encode.

If exceptions are thrown when a "nan" is encountered, then sklearn forces the user into imputation or brushing, which I think ignores the reality of "real life data". CART is one of the few algorithms that could gracefully handle these cases and actually produce better models through an explicit acknowledgement that "missing values happen". It would be a shame to lose this opportunity to advance the state of the art in modeling by just throwing exceptions when NaNs are encountered; it would be far better to enhance the algorithm to be able to split on the decision "is nan".

I hope this makes sense. It certainly would make my life easier!

Thanks!



_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

