[ https://issues.apache.org/jira/browse/SPARK-26579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-26579.
----------------------------------
    Resolution: Invalid

> SparkML DecisionTree, how does the algorithm identify categorical features?
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26579
>                 URL: https://issues.apache.org/jira/browse/SPARK-26579
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.4.0
>         Environment: os: CentOS 7
> software: pyspark.
>            Reporter: Xufeng Wang
>            Priority: Major
>
> I am confused about the decision tree and other tree-based models. My current
> project involves data with both nominal and continuous features. I have
> converted the nominal data to numeric indices using the StringIndexer
> transformer from the ml.feature module. Then I assembled all the feature
> values into a vector-type column named features. The values in the feature
> vector, as far as I can see, are all of double datatype.
> I kept getting the error saying that maxBins should be at least as large as
> the largest number of categories among all categorical features. After I
> corrected the maxBins size, I still see some features (continuous from the
> beginning) with values bigger than my maxBins. Since the pipeline works with
> a maxBins that is smaller than some continuous values, I conclude that the
> algorithm automatically picks which features are categorical and which ones
> are continuous. But how does it figure out which is which, when all of the
> features are of double datatype?
> Another question, if anyone can help: what is the tree type of the Spark
> decision tree? Is it CART or something else?
> Last question: what are the procedures for treating categorical features in
> tree-based algorithms?
> Thank you in advance.
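
The behaviour described in the question fits the way Spark ML tree learners are generally understood to work: categorical features are recognised from the ML attribute metadata attached to the features column (StringIndexer marks its output as nominal, and VectorAssembler carries that into the vector), not from the double datatype. The sketch below is a plain-Python model of that rule, not Spark's own code. The function names `check_max_bins` and `is_categorical`, the `categorical_arity` mapping, and the sample values are all hypothetical and exist only for illustration.

```python
# Illustrative model only; this is NOT Spark's implementation.
# Assumption: a feature is treated as categorical only when the features
# column carries nominal metadata for it. That metadata is modelled here as
# a dict mapping feature index -> number of categories.

def check_max_bins(max_bins, categorical_arity):
    """Raise ValueError if any categorical feature has more categories than max_bins."""
    for index, arity in sorted(categorical_arity.items()):
        if arity > max_bins:
            raise ValueError(
                f"maxBins ({max_bins}) must be at least {arity}, "
                f"the number of categories in feature {index}"
            )
    return True


def is_categorical(index, categorical_arity):
    """A feature is categorical only if metadata declares it so."""
    return index in categorical_arity


# Hypothetical data set: features 0 and 2 came from StringIndexer,
# feature 1 is a continuous measurement whose values may exceed max_bins.
categorical_arity = {0: 12, 2: 40}
row = [3.0, 9876.5, 17.0]

print(check_max_bins(40, categorical_arity))
print([is_categorical(i, categorical_arity) for i in range(len(row))])
```

In this model the value 9876.5 in feature 1 never triggers the check, because only features listed in the metadata are compared against maxBins. That matches the observation that continuous features can hold values larger than maxBins without error.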