Jurriaan Pruis created SPARK-15127:
--------------------------------------

             Summary: Column names are handled incorrectly when they originate from a single Dataframe
                 Key: SPARK-15127
                 URL: https://issues.apache.org/jira/browse/SPARK-15127
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core, SQL
    Affects Versions: 1.6.1, 2.0.0
         Environment: Mac OS X 10.11.4 And Ubuntu Linux 16.04 LTS
            Reporter: Jurriaan Pruis

I think I found a bug in the way columns are handled in (Py)Spark.

h3. How to reproduce

{code}
df = sc.parallelize([[1, 'A', 'Not B'], [1, 'Not A', 'B']]).toDF(['id', 'a', 'b'])
example = sc.parallelize([[1],[2]]).toDF(['id'])
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
{code}

Results in:

{code}
+---+---+-----+
| id|  a|    b|
+---+---+-----+
|  1|  A|Not B|
+---+---+-----+
{code}

Expected result:

{code}
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
{code}

When the aliases are used in the select statement, it does work properly:

{code}
example.join(df_a, 'id').join(df_b, 'id').select('id', 'df_a.a', 'df_b.b').show()
{code}

This gives the expected result:

{code}
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
{code}

Not sure if this is expected behaviour.

It also works when creating a new Dataframe using toDF():

{code}
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
df_a = df_a.toDF(*df_a.columns)
df_b = df_b.toDF(*df_b.columns)
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
{code}

This gives the expected result:

{code}
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
{code}

But it does not work when doing the same with a select (which, according to the docs, should return a *new* Dataframe):

{code}
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
df_a = df_a.select(*df_a.columns)
df_b = df_b.select(*df_b.columns)
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
{code}

Results in:

{code}
+---+---+-----+
| id|  a|    b|
+---+---+-----+
|  1|  A|Not B|
+---+---+-----+
{code}

At the very least the documentation is unclear here, and this may also be a Column handling bug.