Re: Does DecisionTree model in MLlib deal with missing values?

2015-01-12 Thread Sean Owen
On Sun, Jan 11, 2015 at 9:46 PM, Christopher Thom
christopher.t...@quantium.com.au wrote:
 Is there any plan to extend the data types that would be accepted by the Tree 
 models in Spark? e.g. Many models that we build contain a large number of 
 string-based categorical factors. Currently the only strategy is to map these 
 string values to integers, and store the mapping so the data can be remapped 
 when the model is scored. A viable solution, but cumbersome for models with 
 hundreds of these kinds of factors.

I think there is nothing on the roadmap, except that in the newer ML
API (the bits under spark.ml), there's fuller support for the idea of
a pipeline of transformations, of which performing this encoding could
be one step.
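
A minimal sketch of such an encoding step, in plain Python rather than Spark. The `CategoryEncoder` class and its method names are hypothetical, made up for illustration; they are not part of MLlib or spark.ml. It learns one string-to-index mapping per column, serializes the mappings to JSON so they can be stored with the model, and reapplies them at scoring time.

```python
import json


class CategoryEncoder:
    """Hypothetical helper: maps string categories to 0.0, 1.0, 2.0, ... per column."""

    def __init__(self, mappings=None):
        # column name -> {category string -> index}
        self.mappings = mappings or {}

    def fit(self, rows, columns):
        """Learn one mapping per column from a list of dict rows."""
        for col in columns:
            # Sort so the assigned indices are deterministic across runs.
            categories = sorted({row[col] for row in rows})
            self.mappings[col] = {cat: i for i, cat in enumerate(categories)}
        return self

    def transform(self, row):
        """Return a copy of the row with mapped columns replaced by float indices."""
        encoded = dict(row)
        for col, mapping in self.mappings.items():
            value = row[col]
            if value not in mapping:
                # Assumption: an unseen category is an error; other policies are possible.
                raise KeyError(f"unseen category {value!r} in column {col!r}")
            encoded[col] = float(mapping[value])
        return encoded

    def to_json(self):
        return json.dumps(self.mappings, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(json.loads(text))


train_rows = [
    {"state": "NSW", "channel": "web", "spend": 12.5},
    {"state": "VIC", "channel": "store", "spend": 3.0},
    {"state": "NSW", "channel": "store", "spend": 7.25},
]
encoder = CategoryEncoder().fit(train_rows, ["state", "channel"])
saved = encoder.to_json()  # store this next to the trained model

# At scoring time, reload the mappings and apply them to new data.
scorer = CategoryEncoder.from_json(saved)
encoded_row = scorer.transform({"state": "VIC", "channel": "web", "spend": 1.0})
print(encoded_row)
```

With hundreds of factors, the same object covers all of them, so only one JSON document has to travel with the model.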

Since it's directly relevant, I don't mind mentioning that we did
build this sort of logic on top of MLlib and PMML. There's nothing
hard about it, just a pile of translation and counting code, such as
in 
https://github.com/OryxProject/oryx/blob/master/oryx-app-common/src/main/java/com/cloudera/oryx/app/rdf/RDFPMMLUtils.java

So there are bits you can reuse out there, especially if your goal is
to get to PMML, which will want to represent all the actual
categorical values in its DataDictionary rather than their encodings.


 Concerning missing data, I haven't been able to figure out how to use NULL 
 values in LabeledPoints, and I'm not sure whether DecisionTrees correctly 
 handle the case of missing data. The only thing I've been able to work out is 
 to use a placeholder value,

Right, I don't think missing values are supported. In model training, you
can simply ignore, at each node, the data that can't be routed because it
lacks the feature needed by that node's decision rule. This is OK as long
as not much data is missing.
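
One way to read "ignore data that lacks the feature" is sketched below in plain Python. It illustrates the idea only and is not how MLlib works. The hypothetical `split_gini` function scores a candidate rule `feature <= threshold` using only the rows where that feature is present, with `None` standing in for a missing value.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(rows, feature, threshold):
    """Weighted Gini impurity of splitting on features[feature] <= threshold.

    rows is a list of (features, label) pairs. A feature value of None is
    treated as missing, and that row is skipped for this split only.
    Returns (impurity, number_of_rows_used).
    """
    left, right = [], []
    for features, label in rows:
        value = features[feature]
        if value is None:
            continue  # the rule cannot route this row, so leave it out here
        (left if value <= threshold else right).append(label)
    used = len(left) + len(right)
    if used == 0:
        return 0.0, 0
    impurity = (len(left) * gini(left) + len(right) * gini(right)) / used
    return impurity, used


rows = [
    ([1.0, 5.0], 0),
    ([2.0, None], 0),   # feature 1 is missing
    ([3.0, 9.0], 1),
    ([4.0, 8.0], 1),
]
impurity, used = split_gini(rows, feature=1, threshold=6.0)
print(impurity, used)
```

The row with the missing value still counts for splits on feature 0; it is only dropped where the rule needs the value it lacks.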

In scoring you can't not-answer. Again if you refer to PMML, you can
see some ideas about how to handle this:
http://www.dmg.org/v4-2-1/TreeModel.html#xsdType_MISSING-VALUE-STRATEGY

- Make no prediction (nullPrediction)
- Just copy the last prediction (lastPrediction)
- Use a model-supplied default for the node (defaultChild)
- Use some confidence-weighted combination of the answers you'd get by
following both paths (weightedConfidence)

I have opted, in the past, for simply defaulting to the subtree with
more training examples. All of these strategies are approximations,
yes.
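
A rough sketch of that last strategy, defaulting to the subtree with more training examples, in plain Python. The `Node` class and `predict` function are hypothetical and assume each node recorded how many training examples reached it; this is neither MLlib's tree model nor a PMML evaluator.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical binary tree node; a leaf has a prediction and no children."""
    count: int                         # training examples that reached this node
    prediction: Optional[float] = None
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None      # taken when value <= threshold
    right: Optional["Node"] = None     # taken when value > threshold


def predict(node: Node, features: Sequence[Optional[float]]) -> float:
    """Walk the tree; on a missing value, follow the child that saw more training data."""
    while node.left is not None and node.right is not None:
        value = features[node.feature]
        if value is None:
            node = node.left if node.left.count >= node.right.count else node.right
        elif value <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


tree = Node(
    count=100, feature=0, threshold=2.5,
    left=Node(count=30, prediction=0.0),
    right=Node(
        count=70, feature=1, threshold=6.0,
        left=Node(count=20, prediction=0.0),
        right=Node(count=50, prediction=1.0),
    ),
)

print(predict(tree, [1.0, 9.0]))    # feature 0 present: goes left
print(predict(tree, [None, 9.0]))   # feature 0 missing: right child has more examples
print(predict(tree, [None, None]))  # both missing: larger child at every step
```

Ties go to the left child here, which is an arbitrary choice for the sketch.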

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does DecisionTree model in MLlib deal with missing values?

2015-01-11 Thread Sean Owen
I do not recall seeing support for missing values.

Categorical values are encoded as 0.0, 1.0, 2.0, ... When training the
model, you indicate which features are to be interpreted as categorical
with the categoricalFeaturesInfo parameter, which maps each categorical
feature's index to its count of distinct values.
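
As a small illustration of that shape, here is a plain Python sketch with made-up data. The records, variable names, and the choice of sorted order for assigning indices are all assumptions for the example. It encodes two categorical features as doubles and builds the index-to-count map described above.

```python
# Raw records: (label, [colour, size_cm, shape]); features 0 and 2 are categorical.
raw = [
    (1.0, ["red", 3.5, "round"]),
    (0.0, ["green", 1.2, "square"]),
    (1.0, ["blue", 2.8, "round"]),
]
categorical_indices = [0, 2]

# Assign each distinct category an index 0, 1, 2, ... in sorted order.
category_index = {
    i: {cat: k for k, cat in enumerate(sorted({feats[i] for _, feats in raw}))}
    for i in categorical_indices
}

# Feature index -> number of distinct categories for that feature.
categorical_features_info = {i: len(m) for i, m in category_index.items()}

# Every feature becomes a double; categorical ones carry their category index.
encoded = [
    (label, [float(category_index[i][v]) if i in category_index else float(v)
             for i, v in enumerate(feats)])
    for label, feats in raw
]

print(categorical_features_info)
print(encoded[0])
```

Feature 1 is absent from the map, so it would be treated as continuous.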

On Sun, Jan 11, 2015 at 6:54 AM, Carter gyz...@hotmail.com wrote:
 Hi, I am new to the MLlib in Spark. Can the DecisionTree model in MLlib deal
 with missing values? If so, what data structure should I use for the input?

 Moreover, my data has categorical features, but the LabeledPoint requires
 double data type, in this case what can I do?

 Thank you very much.



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Does-DecisionTree-model-in-MLlib-deal-with-missing-values-tp21080.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: Does DecisionTree model in MLlib deal with missing values?

2015-01-11 Thread Christopher Thom
Is there any plan to extend the data types that would be accepted by the Tree 
models in Spark? e.g. Many models that we build contain a large number of 
string-based categorical factors. Currently the only strategy is to map these 
string values to integers, and store the mapping so the data can be remapped 
when the model is scored. A viable solution, but cumbersome for models with 
hundreds of these kinds of factors.

Concerning missing data, I haven't been able to figure out how to use NULL 
values in LabeledPoints, and I'm not sure whether DecisionTrees correctly 
handle the case of missing data. The only thing I've been able to work out is 
to use a placeholder value, which is not really what is needed. I think this 
will introduce bias in the model if there is a significant proportion of 
missing data. e.g. suppose we have a factor that is TimeSpentonX. If 20% of 
values are missing, what numeric value should this missing data be replaced 
with? Almost every choice will bias the final model... What we really want is 
for the algorithm to just ignore those values.
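
To make the concern concrete, here is a small arithmetic sketch in plain Python. The numbers are invented for illustration, with 2 of 10 values (20%) missing. It compares the observed values against two common placeholder choices.

```python
from statistics import mean, pstdev

# Hypothetical TimeSpentonX values; None marks a missing entry.
values = [4.0, 6.0, None, 5.0, 7.0, 3.0, None, 6.0, 5.0, 4.0]
observed = [v for v in values if v is not None]

zero_filled = [0.0 if v is None else v for v in values]
mean_filled = [mean(observed) if v is None else v for v in values]

print(mean(observed), pstdev(observed))
print(mean(zero_filled), pstdev(zero_filled))   # mean pulled toward the placeholder
print(mean(mean_filled), pstdev(mean_filled))   # mean kept, spread shrunk
```

Filling with zero drags the mean down, while filling with the observed mean keeps the mean but understates the spread, so either placeholder changes the distribution the tree sees.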

cheers
chris





Does DecisionTree model in MLlib deal with missing values?

2015-01-10 Thread Carter
Hi, I am new to the MLlib in Spark. Can the DecisionTree model in MLlib deal
with missing values? If so, what data structure should I use for the input?

Moreover, my data has categorical features, but the LabeledPoint requires
double data type, in this case what can I do?

Thank you very much.


