Re: [Scikit-learn-general] Gracefully modeling missing values (nan's) in CART (classification or regression trees).

Andreas Mueller Mon, 22 Jul 2013 05:28:56 -0700

Hi John.

I think there is no doubt that making use of missing values is beneficial in real applications.

Also, you are right, decision trees are particularly good for handling them.
This is more about implementation and API issues.

Not raising an error doesn't mean it does something useful. We raise an error because we don't have an implementation available.

I haven't really caught up with Gilles' rewrite in #2131, but that might make implementation easier.
Your way of wanting to split on nan is not the only way to handle missing values, though.
An alternative is to go down both sides of the test. Which is more appropriate depends on the data, I guess.
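
To make the two options concrete, here is a minimal sketch of prediction with a single split. It is not scikit-learn code: `predict_stump`, `nan_value`, `left_fraction`, and the policy names are hypothetical, chosen only for illustration. With "own_branch" a missing value gets its own leaf; with "both" the sample goes down both sides and the two leaf values are averaged, weighted by the fraction of training samples that went left.

```python
import math

def predict_stump(x, threshold, left_value, right_value,
                  nan_value, left_fraction, policy):
    """Predict with a one-split regression stump (illustrative only).

    policy="own_branch": a missing x is answered by a dedicated leaf (nan_value).
    policy="both": a missing x goes down both sides; the leaf values are
    averaged, weighted by the share of training samples on each side.
    """
    if math.isnan(x):
        if policy == "own_branch":
            return nan_value
        if policy == "both":
            return left_fraction * left_value + (1.0 - left_fraction) * right_value
        raise ValueError("unknown policy: %r" % (policy,))
    # Ordinary case: the usual threshold test.
    return left_value if x <= threshold else right_value
```

For example, with leaf values 10.0 and 20.0 and a quarter of the training samples on the left, a missing value is predicted as 17.5 under "both", while under "own_branch" it is predicted as whatever the dedicated leaf learned.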

There are also general open questions on how to encode missing labels, though. As most estimators don't support working with missing values, I think we mostly thought about imputation.

Representing missing values as NaN might create significant overhead. Also, it might hide problems that users had in preprocessing. So we have to think about whether that is really the best way.

If you can provide a PR (relative to Gilles' branch) that handles splits on NaN optionally, but does not have a negative impact on datasets without missing values, I'm pretty sure our expert tree growers would appreciate that :)
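
As a rough illustration of what "optionally" could mean, here is a pure-Python sketch of a split search over one feature. It is not the tree code in scikit-learn or in Gilles' branch; `best_split`, `allow_nan_split`, and `sse` are hypothetical names. The assumptions: cost is the summed squared error of the children, rows with NaN are routed to the right child for ordinary threshold candidates (an arbitrary choice), and the "is nan" candidate is only considered when the flag is set and NaNs are present, so data without missing values is searched exactly as before.

```python
import math

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys, allow_nan_split=False):
    """Return (kind, threshold, cost) of the lowest-cost split on one feature.

    kind is "threshold" (x <= threshold vs. the rest) or "is_nan"
    (missing vs. present, threshold is None). NaN input raises ValueError
    unless allow_nan_split is True.
    """
    nan_ys = [y for x, y in zip(xs, ys) if math.isnan(x)]
    if nan_ys and not allow_nan_split:
        raise ValueError("NaN in input and allow_nan_split is False")
    finite = sorted((x, y) for x, y in zip(xs, ys) if not math.isnan(x))
    best = (None, None, float("inf"))
    # Ordinary candidates: midpoints between distinct consecutive values.
    for i in range(1, len(finite)):
        if finite[i][0] == finite[i - 1][0]:
            continue
        threshold = (finite[i - 1][0] + finite[i][0]) / 2.0
        left = [y for _, y in finite[:i]]
        # Assumption of this sketch: NaN rows follow the right child.
        right = [y for _, y in finite[i:]] + nan_ys
        cost = sse(left) + sse(right)
        if cost < best[2]:
            best = ("threshold", threshold, cost)
    # Extra candidate, only when NaNs exist: missing vs. present.
    if nan_ys and finite:
        cost = sse([y for _, y in finite]) + sse(nan_ys)
        if cost < best[2]:
            best = ("is_nan", None, cost)
    return best
```

On a column where the missing rows have clearly different targets, the "is nan" candidate wins; on a column without NaN, the result is the same whether the flag is on or off.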


Andy


On 07/16/2013 12:50 AM, John Prior wrote:

In the course of trying to build a model to predict home prices, I replaced missing values (nan's) with -inf's in order to allow a regression tree (RandomForestRegressor) to split the missing values into their own branch, and then I encountered this bug/error.
Exception ValueError: ValueError('Attempting to find a split with an empty sample_mask',) in 'sklearn.tree._tree.Tree.recursive_partition' ignored
I found a relevant bug thread, but the agreed solution seemed to be to throw exceptions whenever a feature value is not finite.
I disagree with this strategy.
Often missing values represent a systematic, structural behavior of the system being modeled. Having an algorithm that explicitly recognizes this fact would be very useful.
From a user perspective, it would be helpful and easy to have regressors/classifiers that handle "nan"s or missing values without having to do special data pre-processing. In particular, CART is well suited to overcome the "problem" of missing data; i.e. missing values "should be" grouped together and considered as an alternate candidate for a node split.
The alternatives of imputation and throwing out samples (brushing) are not as attractive as explicit splits on "nan" or "-inf" since they bias the raw data and assume that there was no "meaning" in the pattern of missing values. Sometimes that "nan" really means something important, which is not easy to represent with medians, means, or even model-imputed values. Adding a "has_finite_value" feature doesn't help because the original feature still will have imputed values in it, which could corrupt the information that the finite feature values encode.
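
A small sketch of the indicator-plus-imputation approach described above, to show where the concern comes from. The helper name `impute_with_indicator` is hypothetical, and median imputation is only one of the options mentioned; this is an illustration, not a library routine.

```python
import math
from statistics import median

def impute_with_indicator(column):
    """Median-impute one feature column and build a has_finite_value flag.

    Returns (imputed, has_finite_value). Assumes the column contains at
    least one non-missing value.
    """
    finite = [v for v in column if not math.isnan(v)]
    fill = median(finite)
    imputed = [fill if math.isnan(v) else v for v in column]
    has_finite_value = [0 if math.isnan(v) else 1 for v in column]
    return imputed, has_finite_value
```

For the column [1.0, nan, 3.0, nan, 8.0] the imputed feature becomes [1.0, 3.0, 3.0, 3.0, 8.0]: within that column the genuine 3.0 can no longer be told apart from the two filled-in values, and only the separate flag column [1, 0, 1, 0, 1] records which rows were missing.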
If exceptions are thrown when a "nan" is encountered, then sklearn would force the user to use imputation or brushing, which I think ignores "real life data" reality. CART is one of the few algorithms that could gracefully handle these exceptions and actually produce better models through this explicit acknowledgement that "missing values happen". It would be a shame to lose this opportunity to advance the state-of-the-art in modeling by just throwing exceptions when nan's are encountered; it would be far better to enhance the algorithm to be able to split on the decision "is nan".
I hope this makes sense. It certainly would make my life easier!

Thanks!


------------------------------------------------------------------------------
See everything from the browser to the database with AppDynamics
Get end-to-end visibility with application monitoring from AppDynamics
Isolate bottlenecks and diagnose root cause in seconds.
Start your free trial of AppDynamics Pro today!
http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
