[jira] [Commented] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

Calin Cocan (JIRA) Wed, 26 Apr 2017 03:24:22 -0700

    [ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984540#comment-15984540 ]


Calin Cocan commented on SPARK-12965:
-------------------------------------

I have encountered the same problem using the StringIndexer and VectorAssembler 
components.
In my opinion this particular issue can be fixed directly in these ML classes.

In the StringIndexer fit method, replacing
  val counts = dataset.select(col($(inputCol)).cast(StringType))
with 
  val counts = dataset.select(col(s"`${$(inputCol)}`").cast(StringType))

should do the trick.
A change is also needed in the StringIndexerModel transform method. The 
call
   dataset.where(filterer(dataset($(inputCol))))
must be replaced with
   dataset.where(filterer(dataset(s"`${$(inputCol)}`")))

BTW, a similar problem can be encountered in the VectorAssembler transform method at 
this line (105 in Spark 2.1):
case _: NumericType | BooleanType => 
dataset(c).cast(DoubleType).as(s"${c}_double_$uid")

Replacing dataset(c) with its backquoted form should fix the 
problem:

case _: NumericType | BooleanType => 
dataset(s"`$c`").cast(DoubleType).as(s"${c}_double_$uid")
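
The backquoting proposed above could be factored into one small helper instead of being repeated at each call site. What follows is only a minimal sketch in plain Java with no Spark dependency: the class name ColumnNames and the method quote are hypothetical and are not part of Spark. It assumes the caller passes a raw, unquoted column name, and it doubles any embedded backtick, which is, as far as I know, how Spark SQL escapes a backtick inside a quoted identifier.

```java
// Hypothetical helper, not part of Spark: builds a backquoted column
// reference from a raw column name so that a dot in the name is not
// interpreted as nested-field access.
public final class ColumnNames {
    private ColumnNames() {}

    /**
     * Wraps a raw (unquoted) column name in backticks.
     * Any backtick already present in the name is doubled
     * (assumption: this matches Spark SQL's escaping rule).
     */
    public static String quote(String rawName) {
        return "`" + rawName.replace("`", "``") + "`";
    }

    public static void main(String[] args) {
        System.out.println(quote("a.b"));    // prints `a.b`
        System.out.println(quote("a_b"));    // prints `a_b`
        System.out.println(quote("we`ird")); // prints `we``ird`
    }
}
```

With such a helper the patched calls would pass quote(name) rather than building the string inline. Note that a name which is already backquoted, such as the "`a.b`" passed to setInputCol() in the original report, would be quoted a second time, so this sketch only covers the case where callers supply the raw name.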

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> -----------------------------------------------------------------------
>
>                 Key: SPARK-12965
>                 URL: https://issues.apache.org/jira/browse/SPARK-12965
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>            Reporter: Joshua Taylor
>         Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way 
> that other methods do.  E.g., given a DataFrame df, {{df.col("`a.b`")}} will 
> return a column.  On a StringIndexer indexer, 
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting 
> and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}} with the same contents 
> as a_bIndex.
> {code}
> public class SparkMLDotColumn {
>       public static void main(String[] args) {
>               // Get the contexts
>               SparkConf conf = new SparkConf()
>                               .setMaster("local[*]")
>                               .setAppName("test")
>                               .set("spark.ui.enabled", "false");
>               JavaSparkContext sparkContext = new JavaSparkContext(conf);
>               SQLContext sqlContext = new SQLContext(sparkContext);
>               
>               // Create a schema with a single string column named "a.b"
>               StructType schema = new StructType(new StructField[] {
>                               DataTypes.createStructField("a.b", DataTypes.StringType, false)
>               });
>               // Create an RDD and a DataFrame holding two rows
>               List<Row> rows = Arrays.asList(RowFactory.create("foo"), RowFactory.create("bar"));
>               JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>               DataFrame df = sqlContext.createDataFrame(rdd, schema);
>               
>               df = df.withColumn("a_b", df.col("`a.b`"));
>               
>               StringIndexer indexer0 = new StringIndexer();
>               indexer0.setInputCol("a_b");
>               indexer0.setOutputCol("a_bIndex");
>               df = indexer0.fit(df).transform(df);
>               
>               StringIndexer indexer1 = new StringIndexer();
>               indexer1.setInputCol("`a.b`");
>               indexer1.setOutputCol("abIndex");
>               df = indexer1.fit(df).transform(df);
>               
>               df.show();
>       }
> }
> {code}



