[ 
https://issues.apache.org/jira/browse/SPARK-16464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385255#comment-15385255
 ] 

Liwei Lin commented on SPARK-16464:
-----------------------------------

In Scala, {{withColumn}} either adds a new column or replaces the existing 
column that has the same name (please refer to 
[Dataset.withColumn|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1708]):

{code}
// results are the same for Spark 1.6.1 and current master

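// setup: assumes a sqlContext is already in scope (e.g. spark-shell on 1.6.x)
import sqlContext.implicits._  // enables the $"colName" syntax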

val ds0 = sqlContext.range(1, 4)
ds0.show()
/* prints
  +---+
  | id|
  +---+
  |  1|
  |  2|
  |  3|
  +---+
*/

val ds1 = ds0.withColumn("newId", $"id")
ds1.show()
/* prints
  +---+-----+
  | id|newId|
  +---+-----+
  |  1|    1|
  |  2|    2|
  |  3|    3|
  +---+-----+
*/

val ds2 = ds1.withColumn("newId", $"id" * 2)
ds2.show()
/* prints
  +---+-----+
  | id|newId|
  +---+-----+
  |  1|    2|
  |  2|    4|
  |  3|    6|
  +---+-----+
*/
{code}
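
A quick way to confirm that the second {{withColumn}} call replaced the column rather than duplicating it is to inspect {{columns}}. The following is only a sketch: the local-mode setup, the app name, and the {{names}} / {{duplicates}} values are illustrative assumptions, not part of the original report.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed local setup; getOrCreate reuses an already running context if there is one.
val conf = new SparkConf().setAppName("withColumn-check").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // enables the $"colName" syntax

val ds0 = sqlContext.range(1, 4)
val ds1 = ds0.withColumn("newId", $"id")
val ds2 = ds1.withColumn("newId", $"id" * 2)  // same name: replaces, does not append

// Column names of the result, in order.
val names = ds2.columns.toSeq

// Any name that occurs more than once would indicate a duplicate column.
val duplicates = names.groupBy(identity).collect {
  case (name, occurrences) if occurrences.size > 1 => name
}

assert(names == Seq("id", "newId"))
assert(duplicates.isEmpty)
```

Running the same kind of check against the SparkR DataFrame from the report would show whether the duplicate is introduced on the R side.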

> withColumn() allows illegal creation of duplicate column names on DataFrame
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-16464
>                 URL: https://issues.apache.org/jira/browse/SPARK-16464
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR, SQL
>    Affects Versions: 1.6.1
>         Environment: Databricks.com
>            Reporter: Neil Dewar
>            Priority: Minor
>
> If I take an existing DataFrame, I am permitted to use withColumn() to create 
> a duplicate column name.  I assume this should be illegal, and withColumn 
> should be prevented from permitting this.  Some functions subsequently fail 
> due to the duplicate column names.  Example:
> sdfCar <- createDataFrame(sqlContext, mtcars)
> sdfCar1 <- withColumn(sdfCar, "isEfficient", sdfCar$mpg<=20)
> sdfCar1 <- withColumn(sdfCar1, "isEfficient", ifelse(sdfCar1$mpg == 
> sdfCar1$mpg,1,0))
> sdfCar2 <- subset(sdfCar1, select=sdfCar1$isEfficient)
> # subset() command fails with message: "Reference 'isEfficient' is ambiguous"
> Note: I only know that this occurs in SparkR; it might also affect other languages' APIs.


