Luke created SPARK-28189:
----------------------------

Summary: Pyspark - df.drop is Case Sensitive when Referring to Upstream Tables
Key: SPARK-28189
URL: https://issues.apache.org/jira/browse/SPARK-28189
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.0
Reporter: Luke

Column names are generally case insensitive in PySpark, and df.drop("col") is generally case insensitive as well. However, when the argument refers to a column of an upstream table, such as one side of a join, drop becomes case sensitive. For example:

{code:python}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(vals1, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])

df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
{code}

Selecting through the upstream DataFrame and dropping by name both ignore case:

{code:python}
# from above, df1 consists of columns ['KEY', 'field']
# from above, df2 consists of columns ['KEY', 'CAPS']
df_joined.select(df2['key'])  # will give a result
df_joined.drop('caps')        # will also give a result
{code}

However, note the following:

{code:python}
df_joined.drop(df2['key'])   # no-op
df_joined.drop(df2['caps'])  # no-op
df_joined.drop(df2['KEY'])   # will drop column as expected
df_joined.drop(df2['CAPS'])  # will drop column as expected
{code}

In summary, df.drop(df2['col']) does not follow the case insensitivity expected for column names, even though select, join, and dropping a column by name are generally case insensitive.
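
A possible workaround, assuming the behaviour is as described above: look up the exact-case name in the upstream DataFrame's `columns` list first, then pass that name when building the column reference for drop. The helper `resolve_column_name` below is hypothetical (it is not part of PySpark) and is only a minimal sketch in plain Python. The Spark calls are left as comments because they need a running session.

```python
def resolve_column_name(columns, name):
    """Return the entry of `columns` that equals `name` ignoring case.

    Hypothetical helper, not a PySpark API. Raises ValueError when there is
    no match or more than one, so an ambiguous name is never guessed.
    """
    matches = [c for c in columns if c.lower() == name.lower()]
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one match for %r, found %r" % (name, matches)
        )
    return matches[0]


# Usage with the DataFrames from the report (requires a Spark session):
#   exact = resolve_column_name(df2.columns, "caps")
#   df_joined = df_joined.drop(df2[exact])

# Plain-Python check of the lookup itself:
print(resolve_column_name(["KEY", "CAPS"], "caps"))
```

This sidesteps the case-sensitive path rather than fixing it; drop itself would still need to resolve df2['caps'] the way select does.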