While building a model to predict home prices, I replaced missing values
(NaNs) with -inf to let a regression tree (RandomForestRegressor) split the
missing values into their own branch, and I ran into this bug/error:
Exception ValueError: ValueError('Attempting to find a split with an empty
sample_mask',) in 'sklearn.tree._tree.Tree.recursive_partition' ignored
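For reference, the pre-processing step was roughly the following (a sketch
with toy data; the array contents are made up):

```python
import numpy as np

# Toy feature matrix with missing values (hypothetical data).
X = np.array([[1.0, np.nan],
              [np.nan, 3.0],
              [2.0, 4.0]])

# Replace every NaN with -inf so that, in principle, any "x <= t" split
# would send all missing values down the same branch of the tree.
X_inf = np.where(np.isnan(X), -np.inf, X)
```

It was this -inf-filled matrix that triggered the error above when fed to
RandomForestRegressor.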

I found a relevant bug thread, but the agreed-upon solution seemed to be to
raise an exception whenever a feature value is not finite.

I disagree with this strategy.

Often missing values represent a systematic, structural behavior of the
system being modeled. Having an algorithm that explicitly recognizes this
fact would be very useful.

From a user perspective, it would be helpful to have regressors/classifiers
that handle NaNs (missing values) without requiring special data
pre-processing. In particular, CART is well suited to overcome the "problem"
of missing data; i.e., missing values should be grouped together and
considered as an alternate candidate for a node split.

The alternatives of imputation and throwing out samples (brushing) are not
as attractive as explicit splits on NaN or -inf, since they bias the raw
data and assume that there was no meaning in the pattern of missing values.
Sometimes a NaN really does mean something important, which is not easy to
represent with medians, means, or even model-imputed values. Adding a
"has_finite_value" feature doesn't help either, because the original feature
will still contain imputed values, which could corrupt the information that
the finite feature values encode.

If exceptions are thrown whenever a NaN is encountered, then sklearn forces
the user to use imputation or brushing, which I think ignores the reality of
real-life data. CART is one of the few algorithms that can gracefully handle
missing values and actually produce better models through this explicit
acknowledgement that "missing values happen". It would be a shame to lose
the opportunity to advance the state of the art in modeling by simply
throwing exceptions when NaNs are encountered; it would be far better to
enhance the algorithm so it can split on the decision "is NaN".
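In the meantime, one workaround that stays finite (and so should pass a
finiteness check) is to fill NaNs with a sentinel strictly below the
observed range; an ordinary threshold split can then isolate the missing
group. A minimal sketch, where the sentinel choice is my own assumption:

```python
import numpy as np

# Hypothetical single feature with missing values.
x = np.array([np.nan, 5.0, 7.0, np.nan, 9.0])

# A finite sentinel below the observed minimum keeps all values finite
# while still letting a plain "x <= t" split isolate the missing group.
sentinel = np.nanmin(x) - 1.0                 # 4.0 for this toy data
x_filled = np.where(np.isnan(x), sentinel, x)

# Any threshold between the sentinel and the true minimum acts as "is NaN".
threshold = (sentinel + np.nanmin(x)) / 2.0   # 4.5
missing_branch = x_filled <= threshold
```

This still biases the data (the missing group can only ever land on the low
side of a split), which is why a true "is NaN" split candidate would be
preferable.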

I hope this makes sense. It certainly would make my life easier!

Thanks!
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general