[ https://issues.apache.org/jira/browse/SPARK-32973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-32973.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 29868
[https://github.com/apache/spark/pull/29868]

> FeatureHasher does not check categoricalCols in inputCols
> ---------------------------------------------------------
>
>                 Key: SPARK-32973
>                 URL: https://issues.apache.org/jira/browse/SPARK-32973
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, ML
>    Affects Versions: 2.3.0, 2.4.0, 3.0.0, 3.1.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Trivial
>             Fix For: 3.1.0
>
>
> The documentation for {{categoricalCols}} says:
> {code:java}
> Numeric columns to treat as categorical features. By default only string and
> boolean columns are treated as categorical, so this param can be used to
> explicitly specify the numerical columns to treat as categorical. Note, the
> relevant columns must also be set in inputCols.
> {code}
>
> However, the check that every column in {{categoricalCols}} is also in
> {{inputCols}} was never implemented.
> For example, in 2.4.7 and current master (3.1.0):
> {code:java}
> scala> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.feature._
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
> scala> val df = Seq((2.0, 1, "foo"),(3.0, 2, "bar")).toDF("real", "int", "string")
> df: org.apache.spark.sql.DataFrame = [real: double, int: int ... 1 more field]
> scala> val n = 100
> n: Int = 100
> scala> val hasher = new FeatureHasher().setInputCols("int", "string").setCategoricalCols(Array("real")).setOutputCol("features").setNumFeatures(n)
> hasher: org.apache.spark.ml.feature.FeatureHasher = featureHasher_fbe05968b33f
> scala> hasher.transform(df).show
> +----+---+------+--------------------+
> |real|int|string|            features|
> +----+---+------+--------------------+
> | 2.0|  1|   foo|(100,[2,39],[1.0,...|
> | 3.0|  2|   bar|(100,[2,42],[2.0,...|
> +----+---+------+--------------------+
> {code}
>
> The categorical column "real" is not in {{inputCols}} ("int", "string"), yet
> the transform runs without any error.
>
> I think there are two options:
> 1. Remove the comment "Note, the relevant columns must also be set in
> inputCols.", since this requirement seems unnecessary;
> 2. Add a check to make sure all {{categoricalCols}} are in {{inputCols}}.
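
For illustration, option 2 from the issue description (checking that all categoricalCols are in inputCols) could look roughly like the sketch below. It is plain Scala with no Spark dependency; the object and method names are hypothetical, and it is not necessarily what pull request 29868 implements.

```scala
// Hypothetical helper, not part of Spark's API: checks that every
// categorical column is also listed among the input columns.
object CategoricalColsCheck {

  // Returns the categorical columns that are absent from inputCols,
  // in their original order and without duplicates.
  def missingCategoricalCols(
      inputCols: Seq[String],
      categoricalCols: Seq[String]): Seq[String] = {
    val inputSet = inputCols.toSet
    categoricalCols.filterNot(inputSet.contains).distinct
  }

  // Throws IllegalArgumentException (via require) if any categorical
  // column is not in inputCols; otherwise returns normally.
  def requireCategoricalInInput(
      inputCols: Seq[String],
      categoricalCols: Seq[String]): Unit = {
    val missing = missingCategoricalCols(inputCols, categoricalCols)
    require(
      missing.isEmpty,
      s"categoricalCols must be a subset of inputCols, but these are missing: ${missing.mkString(", ")}")
  }
}
```

With the columns from the example in the issue, `requireCategoricalInInput(Seq("int", "string"), Seq("real"))` would fail fast instead of silently ignoring "real", while adding "real" to the input columns would pass.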