[ https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley resolved SPARK-11569.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 17233
[https://github.com/apache/spark/pull/17233]

> StringIndexer transform fails when column contains nulls
> --------------------------------------------------------
>
>                 Key: SPARK-11569
>                 URL: https://issues.apache.org/jira/browse/SPARK-11569
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0
>            Reporter: Maciej Szymkiewicz
>             Fix For: 2.2.0
>
>
> Transforming a column containing {{null}} values using {{StringIndexer}} results in a {{java.lang.NullPointerException}}:
> {code}
> from pyspark.ml.feature import StringIndexer
> df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
> df.printSchema()
> ## root
> ##  |-- k: string (nullable = true)
> ##  |-- v: long (nullable = true)
> indexer = StringIndexer(inputCol="k", outputCol="kIdx")
> indexer.fit(df).transform(df)
> ## <repr(<pyspark.sql.dataframe.DataFrame at 0x7f4b0d8e7110>) failed: py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
> ## : java.lang.NullPointerException
> {code}
> The problem disappears when we drop the rows containing nulls
> {code}
> df1 = df.na.drop()
> indexer.fit(df1).transform(df1)
> {code}
> or replace the {{null}} values
> {code}
> from pyspark.sql.functions import col, when
> k = col("k")
> df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
> indexer.fit(df2).transform(df2)
> {code}
> and it cannot be reproduced using the Scala API:
> {code}
> import org.apache.spark.ml.feature.StringIndexer
> val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
> df.printSchema
> // root
> //  |-- k: string (nullable = true)
> //  |-- v: integer (nullable = false)
> val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
> indexer.fit(df).transform(df).count
> // 2
> {code}
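
For readers who want to see the effect of the "replace nulls" workaround without a Spark session, here is a minimal plain-Python sketch. It is not Spark's implementation: fill_nulls, fit_index and transform are hypothetical helper names, and the alphabetical tie-break is an assumption of this sketch. The only parts taken from the report are the "__NA__" sentinel and the idea of substituting it for null before indexing. Labels are indexed by descending frequency, as StringIndexer does.

```python
from collections import Counter

NA_TOKEN = "__NA__"  # sentinel string used in the workaround from the report


def fill_nulls(values, token=NA_TOKEN):
    """Replace None with a sentinel, like when(k.isNull(), "__NA__").otherwise(k)."""
    return [token if v is None else v for v in values]


def fit_index(values):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(values, index):
    """Look up the index of every value; every value must be a known label."""
    return [index[v] for v in values]


column = ["a", None, "a", "b", None, None]
filled = fill_nulls(column)          # nulls become an ordinary label
index = fit_index(filled)            # the sentinel gets its own index
indexed = transform(filled, index)
print(index)
print(indexed)
```

Because the sentinel is treated as an ordinary label, the null rows are kept and share one index of their own, whereas the df.na.drop() workaround removes those rows entirely.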