[ https://issues.apache.org/jira/browse/SPARK-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-12711:
--------------------------------------
    Assignee: Grzegorz Chilkiewicz

> ML StopWordsRemover does not protect itself from column name duplication
> ------------------------------------------------------------------------
>
>                 Key: SPARK-12711
>                 URL: https://issues.apache.org/jira/browse/SPARK-12711
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 1.6.0
>            Reporter: Grzegorz Chilkiewicz
>            Assignee: Grzegorz Chilkiewicz
>            Priority: Trivial
>              Labels: ml, mllib, newbie, suggestion
>             Fix For: 1.6.1, 2.0.0
>
>
> At work we were taking a closer look at ML transformers and estimators, and I
> spotted that anomaly.
> At first glance, the resolution looks simple: add the following line to
> StopWordsRemover.transformSchema (as is done in e.g.
> PCA.transformSchema, StandardScaler.transformSchema, and
> OneHotEncoder.transformSchema):
> {code}
> require(!schema.fieldNames.contains($(outputCol)),
>   s"Output column ${$(outputCol)} already exists.")
> {code}
> Am I correct? Is this a bug? If so, I am willing to prepare an
> appropriate pull request.
> Maybe a better idea is to make use of super.transformSchema in
> StopWordsRemover (and possibly in all other places)?
> Links to the files at GitHub mentioned above:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala#L147
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L109-L111
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L101-L102
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L138-L139
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L75-L76

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
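The proposed guard can be sketched in plain Scala without a Spark dependency. This is a minimal illustration of the logic only, not the actual Spark ML code: `validateOutputCol`, `fieldNames`, and `outputCol` are hypothetical stand-ins for the real `transformSchema`, `schema.fieldNames`, and `$(outputCol)`.

```scala
// Minimal sketch of the duplicate-output-column guard that the issue
// proposes adding to StopWordsRemover.transformSchema.
// `fieldNames` stands in for schema.fieldNames; `outputCol` for $(outputCol).
object SchemaCheck {
  def validateOutputCol(fieldNames: Seq[String], outputCol: String): Unit = {
    // require throws IllegalArgumentException when the condition is false,
    // mirroring the behavior of the check in PCA / StandardScaler / OneHotEncoder.
    require(!fieldNames.contains(outputCol),
      s"Output column $outputCol already exists.")
  }
}
```

With this guard in place, transforming a DataFrame whose schema already contains the configured output column fails fast with a clear message instead of producing a duplicated column name.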