[jira] [Assigned] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-40149: --- Assignee: Wenchen Fan > Star expansion after outer join asymmetrically includes joining key > --- > > Key: SPARK-40149 > URL: https://issues.apache.org/jira/browse/SPARK-40149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Otakar Truněček >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 3.4.0, 3.3.1, 3.2.3 > > > When star expansion is used on left side of a join, the result will include > joining key, while on the right side of join it doesn't. I would expect the > behaviour to be symmetric (either include on both sides or on neither). > Example: > {code:python} > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > spark = SparkSession.builder.getOrCreate() > df_left = spark.range(5).withColumn('val', f.lit('left')) > df_right = spark.range(3, 7).withColumn('val', f.lit('right')) > df_merged = ( > df_left > .alias('left') > .join(df_right.alias('right'), on='id', how='full_outer') > .withColumn('left_all', f.struct('left.*')) > .withColumn('right_all', f.struct('right.*')) > ) > df_merged.show() > {code} > result: > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {0, left}| {null}| > | 1|left| null| {1, left}| {null}| > | 2|left| null| {2, left}| {null}| > | 3|left|right| {3, left}| {right}| > | 4|left|right| {4, left}| {right}| > | 5|null|right|{null, null}| {right}| > | 6|null|right|{null, null}| {right}| > +---++-++-+ > {code} > This behaviour started with release 3.2.0. Previously the key was not > included on either side. > Result from Spark 3.1.3 > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {left}| {null}| > | 6|null|right| {null}| {right}| > | 5|null|right| {null}| {right}| > | 1|left| null| {left}| {null}| > | 3|left|right| {left}| {right}| > | 2|left| null| {left}| {null}| > | 4|left|right| {left}| {right}| > +---++-++-+ {code} > I have a gut feeling this is related to these issues: > https://issues.apache.org/jira/browse/SPARK-39376 > https://issues.apache.org/jira/browse/SPARK-34527 > https://issues.apache.org/jira/browse/SPARK-38603 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40149: Assignee: (was: Apache Spark) > Star expansion after outer join asymmetrically includes joining key > --- > > Key: SPARK-40149 > URL: https://issues.apache.org/jira/browse/SPARK-40149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Otakar Truněček >Priority: Blocker > > When star expansion is used on left side of a join, the result will include > joining key, while on the right side of join it doesn't. I would expect the > behaviour to be symmetric (either include on both sides or on neither). > Example: > {code:python} > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > spark = SparkSession.builder.getOrCreate() > df_left = spark.range(5).withColumn('val', f.lit('left')) > df_right = spark.range(3, 7).withColumn('val', f.lit('right')) > df_merged = ( > df_left > .alias('left') > .join(df_right.alias('right'), on='id', how='full_outer') > .withColumn('left_all', f.struct('left.*')) > .withColumn('right_all', f.struct('right.*')) > ) > df_merged.show() > {code} > result: > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {0, left}| {null}| > | 1|left| null| {1, left}| {null}| > | 2|left| null| {2, left}| {null}| > | 3|left|right| {3, left}| {right}| > | 4|left|right| {4, left}| {right}| > | 5|null|right|{null, null}| {right}| > | 6|null|right|{null, null}| {right}| > +---++-++-+ > {code} > This behaviour started with release 3.2.0. Previously the key was not > included on either side. > Result from Spark 3.1.3 > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {left}| {null}| > | 6|null|right| {null}| {right}| > | 5|null|right| {null}| {right}| > | 1|left| null| {left}| {null}| > | 3|left|right| {left}| {right}| > | 2|left| null| {left}| {null}| > | 4|left|right| {left}| {right}| > +---++-++-+ {code} > I have a gut feeling this is related to these issues: > https://issues.apache.org/jira/browse/SPARK-39376 > https://issues.apache.org/jira/browse/SPARK-34527 > https://issues.apache.org/jira/browse/SPARK-38603 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40149: Assignee: Apache Spark > Star expansion after outer join asymmetrically includes joining key > --- > > Key: SPARK-40149 > URL: https://issues.apache.org/jira/browse/SPARK-40149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Otakar Truněček >Assignee: Apache Spark >Priority: Blocker > > When star expansion is used on left side of a join, the result will include > joining key, while on the right side of join it doesn't. I would expect the > behaviour to be symmetric (either include on both sides or on neither). > Example: > {code:python} > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > spark = SparkSession.builder.getOrCreate() > df_left = spark.range(5).withColumn('val', f.lit('left')) > df_right = spark.range(3, 7).withColumn('val', f.lit('right')) > df_merged = ( > df_left > .alias('left') > .join(df_right.alias('right'), on='id', how='full_outer') > .withColumn('left_all', f.struct('left.*')) > .withColumn('right_all', f.struct('right.*')) > ) > df_merged.show() > {code} > result: > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {0, left}| {null}| > | 1|left| null| {1, left}| {null}| > | 2|left| null| {2, left}| {null}| > | 3|left|right| {3, left}| {right}| > | 4|left|right| {4, left}| {right}| > | 5|null|right|{null, null}| {right}| > | 6|null|right|{null, null}| {right}| > +---++-++-+ > {code} > This behaviour started with release 3.2.0. Previously the key was not > included on either side. > Result from Spark 3.1.3 > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {left}| {null}| > | 6|null|right| {null}| {right}| > | 5|null|right| {null}| {right}| > | 1|left| null| {left}| {null}| > | 3|left|right| {left}| {right}| > | 2|left| null| {left}| {null}| > | 4|left|right| {left}| {right}| > +---++-++-+ {code} > I have a gut feeling this is related to these issues: > https://issues.apache.org/jira/browse/SPARK-39376 > https://issues.apache.org/jira/browse/SPARK-34527 > https://issues.apache.org/jira/browse/SPARK-38603 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org