Jurriaan Pruis created SPARK-15127:
--------------------------------------

             Summary: Column names are handled incorrectly when they originate from a single Dataframe
                 Key: SPARK-15127
                 URL: https://issues.apache.org/jira/browse/SPARK-15127
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core, SQL
    Affects Versions: 1.6.1, 2.0.0
         Environment: Mac OS X 10.11.4 And Ubuntu Linux 16.04 LTS
            Reporter: Jurriaan Pruis


I think I found a bug in the way columns are handled in (py)Spark.

h3. How to reproduce
{code}
df = sc.parallelize([[1, 'A', 'Not B'], [1, 'Not A', 'B']]).toDF(['id', 'a', 'b'])

example = sc.parallelize([[1],[2]]).toDF(['id'])

df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')

example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
{code}
Results in:

{code}
+---+---+-----+
| id|  a|    b|
+---+---+-----+
|  1|  A|Not B|
+---+---+-----+
{code}

Expected result:

{code}
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
{code}
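
The expected result can be cross-checked without Spark. The following is a minimal plain-Python sketch of the intended query, using the same rows as the reproduction above. The helper name `expected_rows` is hypothetical and only models the inner joins on `id`; it says nothing about how Spark resolves columns internally.

```python
# Plain-Python model of the intended query; no Spark required.
# Rows mirror the reproduction data: (id, a, b).
df_rows = [(1, 'A', 'Not B'), (1, 'Not A', 'B')]
example_ids = [1, 2]


def expected_rows(ids, rows):
    """Inner-join ids with the a == 'A' rows and the b == 'B' rows on id."""
    rows_a = [r for r in rows if r[1] == 'A']  # corresponds to df_a
    rows_b = [r for r in rows if r[2] == 'B']  # corresponds to df_b
    return [
        (i, ra[1], rb[2])          # (id, df_a.a, df_b.b)
        for i in ids
        for ra in rows_a if ra[0] == i
        for rb in rows_b if rb[0] == i
    ]


print(expected_rows(example_ids, df_rows))  # [(1, 'A', 'B')]
```

The single row `(1, 'A', 'B')` matches the expected table above: `a` comes from the row kept by `df_a` and `b` from the row kept by `df_b`.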

When the aliases are used in the select statement, it does work properly:
{code}
example.join(df_a, 'id').join(df_b, 'id').select('id', 'df_a.a', 'df_b.b').show()
{code}

This gives the expected result:

{code}
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
{code}

I am not sure whether this is the expected behaviour.


It also works when creating a new Dataframe using toDF():
{code}
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
df_a = df_a.toDF(*df_a.columns)
df_b = df_b.toDF(*df_b.columns)
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
{code}

This gives the expected result:
{code}
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
{code}

But it does not work when the same is done with a select (which, according to the docs, should return a *new* Dataframe):

{code}
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
df_a = df_a.select(*df_a.columns)
df_b = df_b.select(*df_b.columns)
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
{code}

Results in:

{code}
+---+---+-----+
| id|  a|    b|
+---+---+-----+
|  1|  A|Not B|
+---+---+-----+
{code}

At the very least, the documentation is unclear here, and this may be a Column handling bug too.




