[ 
https://issues.apache.org/jira/browse/SPARK-32973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32973.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 29868
[https://github.com/apache/spark/pull/29868]

> FeatureHasher does not check categoricalCols in inputCols
> ---------------------------------------------------------
>
>                 Key: SPARK-32973
>                 URL: https://issues.apache.org/jira/browse/SPARK-32973
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, ML
>    Affects Versions: 2.3.0, 2.4.0, 3.0.0, 3.1.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Trivial
>             Fix For: 3.1.0
>
>
> doc related to {{categoricalCols}}:
> {code:java}
> Numeric columns to treat as categorical features. By default only string and 
> boolean columns are treated as categorical, so this param can be used to 
> explicitly specify the numerical columns to treat as categorical. Note, the 
> relevant columns must also be set in inputCols. {code}
>  
> However, the check that every column in {{categoricalCols}} also appears in 
> {{inputCols}} was never implemented.
> For example, in 2.4.7 and on current master (3.1.0):
> {code:java}
> scala> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.feature._
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
> scala> val df = Seq((2.0, 1, "foo"), (3.0, 2, "bar")).toDF("real", "int", "string")
> df: org.apache.spark.sql.DataFrame = [real: double, int: int ... 1 more field]
> scala> val n = 100
> n: Int = 100
> scala> val hasher = new FeatureHasher().setInputCols("int", "string").setCategoricalCols(Array("real")).setOutputCol("features").setNumFeatures(n)
> hasher: org.apache.spark.ml.feature.FeatureHasher = featureHasher_fbe05968b33f
> scala> hasher.transform(df).show
> +----+---+------+--------------------+
> |real|int|string|            features|
> +----+---+------+--------------------+
> | 2.0|  1|   foo|(100,[2,39],[1.0,...|
> | 3.0|  2|   bar|(100,[2,42],[2.0,...|
> +----+---+------+--------------------+
> {code}
>  
> Here the {{categoricalCols}} entry "real" is not in {{inputCols}} ("int", 
> "string"), yet {{transform}} runs without raising an error.
>  
> I think there are two options:
> 1. Remove the comment "Note, the relevant columns must also be set in 
> inputCols.", since this requirement seems unnecessary;
> 2. Add a check to make sure all {{categoricalCols}} are in {{inputCols}}.
>  
>  
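
For illustration, option 2 could look roughly like the sketch below. It is a minimal, Spark-free example and not the code merged in pull request 29868; the object name CategoricalColsCheck and both method names are hypothetical, and the column names are taken from the reproduction above.

```scala
// Hypothetical helper illustrating option 2. This is an assumption about what
// such a check could look like, not the actual FeatureHasher implementation.
object CategoricalColsCheck {

  /** Returns the categorical column names that do not appear in inputCols. */
  def missingCategoricalCols(
      inputCols: Seq[String],
      categoricalCols: Seq[String]): Seq[String] = {
    val inputSet = inputCols.toSet
    categoricalCols.filterNot(inputSet.contains).distinct
  }

  /** Throws IllegalArgumentException if any categorical column is not an input column. */
  def validate(inputCols: Seq[String], categoricalCols: Seq[String]): Unit = {
    val missing = missingCategoricalCols(inputCols, categoricalCols)
    require(
      missing.isEmpty,
      s"categoricalCols must be a subset of inputCols, but these are missing: ${missing.mkString(", ")}")
  }
}
```

With the values from the report, CategoricalColsCheck.validate(Seq("int", "string"), Seq("real")) would fail with an IllegalArgumentException, while adding "real" to the input columns would pass.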



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
