[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984540#comment-15984540 ]
Calin Cocan commented on SPARK-12965:
-------------------------------------

I have encountered the same problem using the StringIndexer and VectorAssembler components.
In my opinion this particular issue can be fixed directly in these ML classes.

In the StringIndexer fit method, replacing

    val counts = dataset.select(col($(inputCol)).cast(StringType))

with

    val counts = dataset.select(col(s"`${$(inputCol)}`").cast(StringType))

should do the trick.

A change must also be made in the StringIndexerModel transform method. The call

    dataset.where(filterer(dataset($(inputCol))))

must be replaced with

    dataset.where(filterer(dataset(s"`${$(inputCol)}`")))

BTW, a similar problem can be encountered in the VectorAssembler transform method at this line (105 in Spark 2.1):

    case _: NumericType | BooleanType => dataset(c).cast(DoubleType).as(s"${c}_double_$uid")

Replacing dataset(columnName) with its backquoted column name should fix the problem:

    case _: NumericType | BooleanType => dataset(s"`$c`").cast(DoubleType).as(s"${c}_double_$uid")

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> -----------------------------------------------------------------------
>
>                 Key: SPARK-12965
>                 URL: https://issues.apache.org/jira/browse/SPARK-12965
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>            Reporter: Joshua Taylor
>         Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way
> that other methods do. E.g., given a DataFrame df, {{df.col("`a.b`")}} will
> return a column. On a StringIndexer indexer,
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting
> and transforming seem to have no effect.
> Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}}, with the same contents
> as a_bIndex.
> {code}
> import java.util.Arrays;
> import java.util.List;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.ml.feature.StringIndexer;
> import org.apache.spark.sql.DataFrame;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SQLContext;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
>
> public class SparkMLDotColumn {
>   public static void main(String[] args) {
>     // Get the contexts
>     SparkConf conf = new SparkConf()
>         .setMaster("local[*]")
>         .setAppName("test")
>         .set("spark.ui.enabled", "false");
>     JavaSparkContext sparkContext = new JavaSparkContext(conf);
>     SQLContext sqlContext = new SQLContext(sparkContext);
>
>     // Create a schema with a single string column named "a.b"
>     StructType schema = new StructType(new StructField[] {
>         DataTypes.createStructField("a.b", DataTypes.StringType, false)
>     });
>     // Create a two-row RDD and DataFrame
>     List<Row> rows = Arrays.asList(RowFactory.create("foo"),
>         RowFactory.create("bar"));
>     JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>     DataFrame df = sqlContext.createDataFrame(rdd, schema);
>
>     // Copy the dotted column to a name without a dot
>     df = df.withColumn("a_b", df.col("`a.b`"));
>
>     // Indexing the dot-free column works as expected
>     StringIndexer indexer0 = new StringIndexer();
>     indexer0.setInputCol("a_b");
>     indexer0.setOutputCol("a_bIndex");
>     df = indexer0.fit(df).transform(df);
>
>     // Indexing the backquoted dotted column adds no output column
>     StringIndexer indexer1 = new StringIndexer();
>     indexer1.setInputCol("`a.b`");
>     indexer1.setOutputCol("abIndex");
>     df = indexer1.fit(df).transform(df);
>
>     df.show();
>   }
> }
> {code}
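
The backquoting proposed in the comment above (wrapping the raw column name in backquotes before it is resolved) could be sketched roughly as below. This is only an illustration, not Spark code: the class ColumnQuoting and its method quote are hypothetical names, and the doubling of embedded backquotes is an assumption about how a quoted identifier should be escaped. It also assumes the caller passes the raw column name (a.b), not an already backquoted one.

```java
// Hypothetical helper, not part of Spark: wraps a raw column name in
// backquotes so that a name such as "a.b" is resolved as a single column
// rather than as field "b" of a struct column "a".
final class ColumnQuoting {

    private ColumnQuoting() {
        // Utility class, no instances.
    }

    // Assumption: an embedded backquote is escaped by doubling it.
    static String quote(String rawName) {
        if (rawName == null || rawName.isEmpty()) {
            throw new IllegalArgumentException("column name must be non-empty");
        }
        return "`" + rawName.replace("`", "``") + "`";
    }
}
```

With such a helper, a caller could write df.col(ColumnQuoting.quote("a.b")) instead of building the backquoted string by hand. Whether the ML classes should quote the name themselves or leave that to the caller is the open question in this issue.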