[ https://issues.apache.org/jira/browse/SPARK-16464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384059#comment-15384059 ]
Neil Dewar commented on SPARK-16464:
------------------------------------

Hi Dongjoon,

When a second column is created with the same name as an existing column, I don't think one version of the column effectively hides the other. Going back to the original code that I included in the post:

`sdfCar <- createDataFrame(sqlContext, mtcars)`
`sdfCar1 <- withColumn(sdfCar, "isEfficient", sdfCar$mpg<=20)`
`sdfCar1 <- withColumn(sdfCar1, "isEfficient", ifelse(sdfCar1$mpg == sdfCar1$mpg,1,0))`

If I run `str(sdfCar1)`, it displays both versions of the column with the same name, so the original is not hidden.

If I run functions that operate on DataFrames, such as `summary(sdfCar1)` or `columns(sdfCar1)`, they fail, reporting:

> Reference 'isEfficient' is ambiguous, could be: isEfficient#62, isEfficient#63.

The same thing happens if I run functions that operate on a column, such as:

`var(sdfCar1$isEfficient)`
`factorial(sdfCar1$isEfficient)`
`log10(sdfCar1$isEfficient)`
`sdfCar2 <- subset(sdfCar1, sdfCar1$isEfficient < 0.5)`
`sdfCar1 <- withColumn(sdfCar1, "newCol", ifelse(sdfCar1$isEfficient > 30))`

They all fail with the same error:

> Reference 'isEfficient' is ambiguous, could be: isEfficient#62, isEfficient#63.

My point here is that the behavior of SparkR is unstable once withColumn() has been used to add multiple columns with the same name to a DataFrame. As most (if not all) SparkR functions expect that a DataFrame does not have multiple columns with the same name, I think `withColumn()` should prevent this from occurring.

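As a rough illustration of the kind of guard suggested above (this is not SparkR's actual behavior), the sketch below rejects a new column name that already exists. `assertNewColumnName` is a hypothetical helper, and a plain character vector stands in for a DataFrame's column names, so the example runs in base R without Spark.

```r
# Hypothetical helper, not part of SparkR: stop if the requested column
# name is already present among the existing column names.
assertNewColumnName <- function(existingNames, colName) {
  if (colName %in% existingNames) {
    stop(sprintf("Column '%s' already exists", colName), call. = FALSE)
  }
  invisible(TRUE)
}

# Assumption: a character vector stands in for the column names of a
# DataFrame that already has an "isEfficient" column.
existing <- c(names(mtcars), "isEfficient")

# A genuinely new name passes silently.
assertNewColumnName(existing, "newCol")

# A duplicate name is rejected; capture the error message instead of aborting.
result <- tryCatch(
  assertNewColumnName(existing, "isEfficient"),
  error = function(e) conditionMessage(e)
)
print(result)
```

A wrapper around `withColumn()` could call such a check before adding the column, so the failure surfaces at the point where the duplicate is created rather than in a later call.
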
> withColumn() allows illegal creation of duplicate column names on DataFrame
> ---------------------------------------------------------------------------
>
> Key: SPARK-16464
> URL: https://issues.apache.org/jira/browse/SPARK-16464
> Project: Spark
> Issue Type: Bug
> Components: SparkR, SQL
> Affects Versions: 1.6.1
> Environment: Databricks.com
> Reporter: Neil Dewar
> Priority: Minor
>
> If I take an existing DataFrame, I am permitted to use withColumn() to create
> a column whose name duplicates an existing one. I assume this should be
> illegal, and withColumn() should be prevented from permitting it. Some
> functions subsequently fail due to the duplicate column names. Example:
> sdfCar <- createDataFrame(sqlContext, mtcars)
> sdfCar1 <- withColumn(sdfCar, "isEfficient", sdfCar$mpg<=20)
> sdfCar1 <- withColumn(sdfCar1, "isEfficient", ifelse(sdfCar1$mpg == sdfCar1$mpg,1,0))
> sdfCar2 <- subset(sdfCar1, select=sdfCar1$isEfficient)
> # subset() command fails with message: "Reference 'isEfficient' is ambiguous"
> Note: I only know that this affects SparkR; it might also affect other
> language APIs.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)