[ https://issues.apache.org/jira/browse/SPARK-16464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370049#comment-15370049 ]
Neil Dewar commented on SPARK-16464:
------------------------------------

Thank you Dongjoon,

Let me try to explain a little more. My point in logging the bug is not the particular error message from the third line of code, but that in the second line of code I am able to create a column with a duplicate column name. My third line of code was only there to illustrate that downstream issues occur when two columns have the same name. I intentionally used trivial logic just to generate any column with the same name.

The 2.0 logic blocks the trivial logic in my third statement, which makes sense, but in my second line of code withColumn() still creates a column with a duplicate name, which should be illegal.

I just tried two different lines of code in Spark 2.0, after my line 2:

    sdfCar1 <- withColumn(sdfCar1, "isEfficient", sdfCar1$mpg * 2.0)
    summary(sdfCar1)

Both fail, each with a different error. I'm prevented from using the data frame sdfCar1, presumably because Spark cannot cope with two columns of the same name (although neither error message reports a duplicate column name as the problem).

The bottom line seems to be that withColumn() allows the creation of two columns with the same name on a DataFrame, which should be illegal. The request is to prevent withColumn() from creating columns that duplicate existing column names.

> withColumn() allows illegal creation of duplicate column names on DataFrame
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-16464
>                 URL: https://issues.apache.org/jira/browse/SPARK-16464
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR, SQL
>    Affects Versions: 1.6.1
>         Environment: Databricks.com
>            Reporter: Neil Dewar
>            Priority: Minor
>
> If I take an existing DataFrame, I am permitted to use withColumn() to create
> a duplicate column name. I assume this should be illegal, and withColumn
> should be prevented from permitting this.
> Some functions subsequently fail due to the duplicate column names. Example:
>
> sdfCar <- createDataFrame(sqlContext, mtcars)
> sdfCar1 <- withColumn(sdfCar, "isEfficient", sdfCar$mpg<=20)
> sdfCar1 <- withColumn(sdfCar1, "isEfficient", ifelse(sdfCar1$mpg == sdfCar1$mpg,1,0))
> sdfCar2 <- subset(sdfCar1, select=sdfCar1$isEfficient)
> # subset() command fails with message: "Reference 'isEfficient' is ambiguous"
>
> Note: I have only confirmed this in SparkR; it might affect other language APIs.
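
As a rough illustration of the check being requested, here is a minimal sketch in plain Python (not SparkR, so it runs without a cluster). The helper names `duplicate_column_names` and `add_column_name` are hypothetical and are not part of any Spark API; they work on an ordinary list of column names, such as the list PySpark exposes as `df.columns`. This only sketches the idea of rejecting a duplicate name up front. It is not how withColumn() is or should be implemented.

```python
from collections import Counter


def duplicate_column_names(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    seen = set()
    duplicates = []
    for name in names:
        if counts[name] > 1 and name not in seen:
            seen.add(name)
            duplicates.append(name)
    return duplicates


def add_column_name(names, new_name):
    """Hypothetical guard: refuse to append a name that is already present.

    Returns a new list and leaves the input list unchanged.
    """
    if new_name in names:
        raise ValueError("column '%s' already exists" % new_name)
    return names + [new_name]


# Column names from the report: a few mtcars columns plus the added one.
cols = add_column_name(["mpg", "cyl", "disp"], "isEfficient")

# A second attempt to add the same name is rejected up front, instead of
# surfacing later as an "ambiguous reference" error.
try:
    add_column_name(cols, "isEfficient")
except ValueError as err:
    print(err)
```

Under this sketch, the second withColumn() call from the report would correspond to the rejected `add_column_name(cols, "isEfficient")` call, so the duplicate would be caught where it is created, not in a later subset() or summary() call.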