[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-12965: -------------------------------------- Affects Version/s: (was: 1.6.0) 2.2.0 1.6.3 2.0.2 2.1.0 > Indexer setInputCol() doesn't resolve column names like DataFrame.col() > ----------------------------------------------------------------------- > > Key: SPARK-12965 > URL: https://issues.apache.org/jira/browse/SPARK-12965 > Project: Spark > Issue Type: Bug > Components: ML > Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 > Reporter: Joshua Taylor > Attachments: SparkMLDotColumn.java > > > The setInputCol() method doesn't seem to resolve column names in the same way > that other methods do. E.g., Given a DataFrame df, {{df.col("`a.b`")}} will > return a column. On a StringIndexer indexer, > {{indexer.setInputCol("`a.b`")}} produces leads to an indexer where fitting > and transforming seem to have no effect. Running the following code produces: > {noformat} > +---+---+--------+ > |a.b|a_b|a_bIndex| > +---+---+--------+ > |foo|foo| 0.0| > |bar|bar| 1.0| > +---+---+--------+ > {noformat} > but I think it should have another column, {{abIndex}} with the same contents > as a_bIndex. > {code} > public class SparkMLDotColumn { > public static void main(String[] args) { > // Get the contexts > SparkConf conf = new SparkConf() > .setMaster("local[*]") > .setAppName("test") > .set("spark.ui.enabled", "false"); > JavaSparkContext sparkContext = new JavaSparkContext(conf); > SQLContext sqlContext = new SQLContext(sparkContext); > > // Create a schema with a single string column named "a.b" > StructType schema = new StructType(new StructField[] { > DataTypes.createStructField("a.b", > DataTypes.StringType, false) > }); > // Create an empty RDD and DataFrame > List<Row> rows = Arrays.asList(RowFactory.create("foo"), > RowFactory.create("bar")); > JavaRDD<Row> rdd = sparkContext.parallelize(rows); > DataFrame df = sqlContext.createDataFrame(rdd, schema); > > df = df.withColumn("a_b", df.col("`a.b`")); > > StringIndexer indexer0 = new StringIndexer(); > indexer0.setInputCol("a_b"); > indexer0.setOutputCol("a_bIndex"); > df = indexer0.fit(df).transform(df); > > StringIndexer indexer1 = new StringIndexer(); > indexer1.setInputCol("`a.b`"); > indexer1.setOutputCol("abIndex"); > df = indexer1.fit(df).transform(df); > > df.show(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org