[ 
https://issues.apache.org/jira/browse/SPARK-16464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370049#comment-15370049
 ] 

Neil Dewar commented on SPARK-16464:
------------------------------------

Thank you Dongjoon,
Let me try to explain a little more.  My point in logging the bug is not the 
particular error message from the third line of code, but that in the second 
line of code I am able to create a column with a duplicate column name.  

My third line of code was there only to illustrate that downstream issues occur 
if there are two columns with the same name.  I intentionally used trivial 
logic just to generate a column with the same name.  The 2.0 logic blocks the 
trivial logic in my third statement, which makes sense, but in my second line 
of code withColumn still creates a column with a duplicate name, which should 
be illegal.

I just tried two different lines of code in Spark 2.0, after my line 2:
sdfCar1 <- withColumn(sdfCar1, "isEfficient", sdfCar1$mpg * 2.0 )
summary(sdfCar1)

Each of them fails with a different error.  I'm prevented from using the data 
frame sdfCar1, presumably because Spark cannot cope with two columns of the 
same name (although the error messages do not report a duplicate column name 
as the problem).  

The bottom line seems to be that withColumn() allows the creation of two 
columns with the same name on a DataFrame, which should be illegal.  The 
request is to prevent the withColumn() function from creating columns that 
duplicate existing column names.
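
The requested behaviour can be sketched as a user-side guard.  This is only an 
illustration, not Spark's implementation: assertNewColumnName() and 
withColumnStrict() are hypothetical names, and exact, case-sensitive name 
matching is an assumption.  It relies on SparkR's columns() to list the 
existing column names.

```r
# Hypothetical helper: stop if newName is already among the existing names.
# Exact, case-sensitive matching is an assumption of this sketch.
assertNewColumnName <- function(existing, newName) {
  if (newName %in% existing) {
    stop(sprintf("Column '%s' already exists", newName), call. = FALSE)
  }
  invisible(TRUE)
}

# Hypothetical wrapper around SparkR's withColumn().
# SparkR must be loaded when this is called; defining it does not require it.
withColumnStrict <- function(df, colName, col) {
  assertNewColumnName(columns(df), colName)
  withColumn(df, colName, col)
}
```

With such a wrapper, a call like withColumnStrict(sdfCar1, "isEfficient", 
sdfCar1$mpg * 2.0) would stop with a message naming the duplicate column 
instead of adding a second isEfficient column.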

> withColumn() allows illegal creation of duplicate column names on DataFrame
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-16464
>                 URL: https://issues.apache.org/jira/browse/SPARK-16464
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR, SQL
>    Affects Versions: 1.6.1
>         Environment: Databricks.com
>            Reporter: Neil Dewar
>            Priority: Minor
>
> If I take an existing DataFrame, I am permitted to use withColumn() to create 
> a duplicate column name.  I assume this should be illegal, and withColumn 
> should be prevented from permitting this.  Some functions subsequently fail 
> due to the duplicate column names.  Example:
> sdfCar <- createDataFrame(sqlContext, mtcars)
> sdfCar1 <- withColumn(sdfCar, "isEfficient", sdfCar$mpg<=20)
> sdfCar1 <- withColumn(sdfCar1, "isEfficient", ifelse(sdfCar1$mpg == sdfCar1$mpg,1,0))
> sdfCar2 <- subset(sdfCar1, select=sdfCar1$isEfficient)
> # subset() command fails with message: "Reference 'isEfficient' is ambiguous"
> Note: I only know that this occurs in SparkR; it might affect other language APIs.



