[ 
https://issues.apache.org/jira/browse/SPARK-16464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384059#comment-15384059
 ] 

Neil Dewar commented on SPARK-16464:
------------------------------------

Hi Dongjoon,
When a second column has been created with the same name as an existing 
column, I don't think one version of the column effectively hides the other.

Going back to the original code that I included in the post:
`sdfCar <- createDataFrame(sqlContext, mtcars)
sdfCar1 <- withColumn(sdfCar, "isEfficient", sdfCar$mpg<=20)
sdfCar1 <- withColumn(sdfCar1, "isEfficient", ifelse(sdfCar1$mpg == sdfCar1$mpg,1,0))`

If I run `str(sdfCar1)` it displays both versions of the column with the same 
name, so the original is not hidden.

If I run functions that operate on DataFrames, such as `summary(sdfCar1)` or 
`columns(sdfCar1)`, they fail, reporting:
> Reference 'isEfficient' is ambiguous, could be: isEfficient#62, 
> isEfficient#63.

The same thing happens if I run functions that operate on a column, such as:
`var(sdfCar1$isEfficient)`
`factorial(sdfCar1$isEfficient)`
`log10(sdfCar1$isEfficient)`
`sdfCar2 <- subset(sdfCar1, sdfCar1$isEfficient < 0.5)`
`sdfCar1 <- withColumn(sdfCar1, "newCol", ifelse(sdfCar1$isEfficient > 30))`

They all fail with the same error:
> Reference 'isEfficient' is ambiguous, could be: isEfficient#62, 
> isEfficient#63.

My point here is that the behavior of SparkR is unstable when withColumn() has 
been used to add multiple columns of the same name to a DataFrame.

Since most (if not all) SparkR functions expect that a DataFrame does not have 
multiple columns with the same name, I think `withColumn()` should prevent this 
from occurring.
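
As a rough sketch of the kind of check I mean (this is a hypothetical helper, not part of SparkR, and the name `assertNewColumn` is made up for illustration), the guard only needs the existing column names as a character vector, such as the result of `columns()` on a DataFrame that does not yet contain duplicates:

```r
# Hypothetical guard, not a SparkR function: refuse a column name that is
# already present in the vector of existing column names.
assertNewColumn <- function(existingCols, colName) {
  if (colName %in% existingCols) {
    stop(sprintf("Column '%s' already exists", colName), call. = FALSE)
  }
  invisible(colName)
}

# A plain character vector stands in for columns(sdfCar) here.
cols <- c("mpg", "cyl", "isEfficient")
assertNewColumn(cols, "newCol")  # passes silently

# The duplicate name is rejected; capture the error message for inspection.
dup <- tryCatch(assertNewColumn(cols, "isEfficient"),
                error = function(e) conditionMessage(e))
print(dup)
```

Calling something like this before `withColumn()` would surface the problem at the point where the duplicate is created rather than in a later call. Whether `withColumn()` should instead replace the existing column is a separate design question.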

> withColumn() allows illegal creation of duplicate column names on DataFrame
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-16464
>                 URL: https://issues.apache.org/jira/browse/SPARK-16464
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR, SQL
>    Affects Versions: 1.6.1
>         Environment: Databricks.com
>            Reporter: Neil Dewar
>            Priority: Minor
>
> If I take an existing DataFrame, I am permitted to use withColumn() to create 
> a duplicate column name.  I assume this should be illegal, and withColumn 
> should be prevented from permitting this.  Some functions subsequently fail 
> due to the duplicate column names.  Example:
> sdfCar <- createDataFrame(sqlContext, mtcars)
> sdfCar1 <- withColumn(sdfCar, "isEfficient", sdfCar$mpg<=20)
> sdfCar1 <- withColumn(sdfCar1, "isEfficient", ifelse(sdfCar1$mpg == sdfCar1$mpg,1,0))
> sdfCar2 <- subset(sdfCar1, select=sdfCar1$isEfficient)
> # subset() command fails with message: "Reference 'isEfficient' is ambiguous"
> Note: I only know that this occurs in SparkR; it might affect other language APIs.


