[jira] [Updated] (SPARK-26572) Join on distinct column with monotonically_increasing_id produces wrong output
[ https://issues.apache.org/jira/browse/SPARK-26572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26572: -- Affects Version/s: 2.1.3 > Join on distinct column with monotonically_increasing_id produces wrong output > -- > > Key: SPARK-26572 > URL: https://issues.apache.org/jira/browse/SPARK-26572 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.2, 2.3.2, 2.4.0 > Environment: Running on Ubuntu 18.04LTS and Intellij 2018.2.5 >Reporter: Sören Reichardt >Assignee: Peter Toth >Priority: Major > Labels: correctness > Fix For: 2.3.4, 2.4.1, 3.0.0 > > > When joining a table with projected monotonically_increasing_id column after > calling distinct with another table the operators do not get executed in the > right order. > Here is a minimal example: > {code:java} > import org.apache.spark.sql.{DataFrame, SparkSession, functions} > object JoinBug extends App { > // Spark session setup > val session = SparkSession.builder().master("local[*]").getOrCreate() > import session.sqlContext.implicits._ > session.sparkContext.setLogLevel("error") > // Bug in Spark: "monotonically_increasing_id" is pushed down when it > shouldn't be. Push down only happens when the > // DF containing the "monotonically_increasing_id" expression is on the > left side of the join. > val baseTable = Seq((1), (1)).toDF("idx") > val distinctWithId = baseTable.distinct.withColumn("id", > functions.monotonically_increasing_id()) > val monotonicallyOnRight: DataFrame = baseTable.join(distinctWithId, "idx") > val monotonicallyOnLeft: DataFrame = distinctWithId.join(baseTable, "idx") > monotonicallyOnLeft.show // Wrong > monotonicallyOnRight.show // Ok in Spark 2.2.2 - also wrong in Spark 2.4.0 > } > {code} > It produces the following output: > {code:java} > Wrong: > +---++ > |idx| id | > +---++ > | 1|369367187456 | > | 1|369367187457 | > +---++ > Right: > +---++ > |idx| id | > +---++ > | 1|369367187456 | > | 1|369367187456 | > +---++ > {code} > We assume that the join operator triggers a pushdown of expressions > (monotonically_increasing_id in this case) which gets pushed down to be > executed before distinct. This produces non-distinct rows with unique id's. > However it seems like this behavior only appears if the table with the > projected expression is on the left side of the join in Spark 2.2.2 (for > version 2.4.0 it fails on both joins). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26572) Join on distinct column with monotonically_increasing_id produces wrong output
[ https://issues.apache.org/jira/browse/SPARK-26572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26572: -- Affects Version/s: 2.0.2 > Join on distinct column with monotonically_increasing_id produces wrong output > -- > > Key: SPARK-26572 > URL: https://issues.apache.org/jira/browse/SPARK-26572 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0 > Environment: Running on Ubuntu 18.04LTS and Intellij 2018.2.5 >Reporter: Sören Reichardt >Assignee: Peter Toth >Priority: Major > Labels: correctness > Fix For: 2.3.4, 2.4.1, 3.0.0 > > > When joining a table with projected monotonically_increasing_id column after > calling distinct with another table the operators do not get executed in the > right order. > Here is a minimal example: > {code:java} > import org.apache.spark.sql.{DataFrame, SparkSession, functions} > object JoinBug extends App { > // Spark session setup > val session = SparkSession.builder().master("local[*]").getOrCreate() > import session.sqlContext.implicits._ > session.sparkContext.setLogLevel("error") > // Bug in Spark: "monotonically_increasing_id" is pushed down when it > shouldn't be. Push down only happens when the > // DF containing the "monotonically_increasing_id" expression is on the > left side of the join. > val baseTable = Seq((1), (1)).toDF("idx") > val distinctWithId = baseTable.distinct.withColumn("id", > functions.monotonically_increasing_id()) > val monotonicallyOnRight: DataFrame = baseTable.join(distinctWithId, "idx") > val monotonicallyOnLeft: DataFrame = distinctWithId.join(baseTable, "idx") > monotonicallyOnLeft.show // Wrong > monotonicallyOnRight.show // Ok in Spark 2.2.2 - also wrong in Spark 2.4.0 > } > {code} > It produces the following output: > {code:java} > Wrong: > +---++ > |idx| id | > +---++ > | 1|369367187456 | > | 1|369367187457 | > +---++ > Right: > +---++ > |idx| id | > +---++ > | 1|369367187456 | > | 1|369367187456 | > +---++ > {code} > We assume that the join operator triggers a pushdown of expressions > (monotonically_increasing_id in this case) which gets pushed down to be > executed before distinct. This produces non-distinct rows with unique id's. > However it seems like this behavior only appears if the table with the > projected expression is on the left side of the join in Spark 2.2.2 (for > version 2.4.0 it fails on both joins). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26572) Join on distinct column with monotonically_increasing_id produces wrong output
[ https://issues.apache.org/jira/browse/SPARK-26572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-26572: - Labels: correctness (was: ) > Join on distinct column with monotonically_increasing_id produces wrong output > -- > > Key: SPARK-26572 > URL: https://issues.apache.org/jira/browse/SPARK-26572 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.4.0 > Environment: Running on Ubuntu 18.04LTS and Intellij 2018.2.5 >Reporter: Sören Reichardt >Priority: Major > Labels: correctness > > When joining a table with projected monotonically_increasing_id column after > calling distinct with another table the operators do not get executed in the > right order. > Here is a minimal example: > {code:java} > import org.apache.spark.sql.{DataFrame, SparkSession, functions} > object JoinBug extends App { > // Spark session setup > val session = SparkSession.builder().master("local[*]").getOrCreate() > import session.sqlContext.implicits._ > session.sparkContext.setLogLevel("error") > // Bug in Spark: "monotonically_increasing_id" is pushed down when it > shouldn't be. Push down only happens when the > // DF containing the "monotonically_increasing_id" expression is on the > left side of the join. > val baseTable = Seq((1), (1)).toDF("idx") > val distinctWithId = baseTable.distinct.withColumn("id", > functions.monotonically_increasing_id()) > val monotonicallyOnRight: DataFrame = baseTable.join(distinctWithId, "idx") > val monotonicallyOnLeft: DataFrame = distinctWithId.join(baseTable, "idx") > monotonicallyOnLeft.show // Wrong > monotonicallyOnRight.show // Ok in Spark 2.2.2 - also wrong in Spark 2.4.0 > } > {code} > It produces the following output: > {code:java} > Wrong: > +---++ > |idx| id | > +---++ > | 1|369367187456 | > | 1|369367187457 | > +---++ > Right: > +---++ > |idx| id | > +---++ > | 1|369367187456 | > | 1|369367187456 | > +---++ > {code} > We assume that the join operator triggers a pushdown of expressions > (monotonically_increasing_id in this case) which gets pushed down to be > executed before distinct. This produces non-distinct rows with unique id's. > However it seems like this behavior only appears if the table with the > projected expression is on the left side of the join in Spark 2.2.2 (for > version 2.4.0 it fails on both joins). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26572) Join on distinct column with monotonically_increasing_id produces wrong output
[ https://issues.apache.org/jira/browse/SPARK-26572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-26572: - Affects Version/s: 2.3.2 > Join on distinct column with monotonically_increasing_id produces wrong output > -- > > Key: SPARK-26572 > URL: https://issues.apache.org/jira/browse/SPARK-26572 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.2, 2.4.0 > Environment: Running on Ubuntu 18.04LTS and Intellij 2018.2.5 >Reporter: Sören Reichardt >Priority: Major > Labels: correctness > > When joining a table with projected monotonically_increasing_id column after > calling distinct with another table the operators do not get executed in the > right order. > Here is a minimal example: > {code:java} > import org.apache.spark.sql.{DataFrame, SparkSession, functions} > object JoinBug extends App { > // Spark session setup > val session = SparkSession.builder().master("local[*]").getOrCreate() > import session.sqlContext.implicits._ > session.sparkContext.setLogLevel("error") > // Bug in Spark: "monotonically_increasing_id" is pushed down when it > shouldn't be. Push down only happens when the > // DF containing the "monotonically_increasing_id" expression is on the > left side of the join. > val baseTable = Seq((1), (1)).toDF("idx") > val distinctWithId = baseTable.distinct.withColumn("id", > functions.monotonically_increasing_id()) > val monotonicallyOnRight: DataFrame = baseTable.join(distinctWithId, "idx") > val monotonicallyOnLeft: DataFrame = distinctWithId.join(baseTable, "idx") > monotonicallyOnLeft.show // Wrong > monotonicallyOnRight.show // Ok in Spark 2.2.2 - also wrong in Spark 2.4.0 > } > {code} > It produces the following output: > {code:java} > Wrong: > +---++ > |idx| id | > +---++ > | 1|369367187456 | > | 1|369367187457 | > +---++ > Right: > +---++ > |idx| id | > +---++ > | 1|369367187456 | > | 1|369367187456 | > +---++ > {code} > We assume that the join operator triggers a pushdown of expressions > (monotonically_increasing_id in this case) which gets pushed down to be > executed before distinct. This produces non-distinct rows with unique id's. > However it seems like this behavior only appears if the table with the > projected expression is on the left side of the join in Spark 2.2.2 (for > version 2.4.0 it fails on both joins). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26572) Join on distinct column with monotonically_increasing_id produces wrong output
[ https://issues.apache.org/jira/browse/SPARK-26572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sören Reichardt updated SPARK-26572: Description: When joining a table with projected monotonically_increasing_id column after calling distinct with another table the operators do not get executed in the right order. Here is a minimal example: {code:java} import org.apache.spark.sql.{DataFrame, SparkSession, functions} object JoinBug extends App { // Spark session setup val session = SparkSession.builder().master("local[*]").getOrCreate() import session.sqlContext.implicits._ session.sparkContext.setLogLevel("error") // Bug in Spark: "monotonically_increasing_id" is pushed down when it shouldn't be. Push down only happens when the // DF containing the "monotonically_increasing_id" expression is on the left side of the join. val baseTable = Seq((1), (1)).toDF("idx") val distinctWithId = baseTable.distinct.withColumn("id", functions.monotonically_increasing_id()) val monotonicallyOnRight: DataFrame = baseTable.join(distinctWithId, "idx") val monotonicallyOnLeft: DataFrame = distinctWithId.join(baseTable, "idx") monotonicallyOnLeft.show // Wrong monotonicallyOnRight.show // Ok in Spark 2.2.2 - also wrong in Spark 2.4.0 } {code} It produces the following output: {code:java} Wrong: +---++ |idx| id | +---++ | 1|369367187456 | | 1|369367187457 | +---++ Right: +---++ |idx| id | +---++ | 1|369367187456 | | 1|369367187456 | +---++ {code} We assume that the join operator triggers a pushdown of expressions (monotonically_increasing_id in this case) which gets pushed down to be executed before distinct. This produces non-distinct rows with unique id's. However it seems like this behavior only appears if the table with the projected expression is on the left side of the join in Spark 2.2.2 (for version 2.4.0 it fails on both joins). was: When joining a table with projected monotonically_increasing_id column after calling distinct with another table the operators do not get executed in the right order. Here is a minimal example: {code:java} import org.apache.spark.sql.{DataFrame, SparkSession, functions} object JoinBug extends App { // Spark session setup val session = SparkSession.builder().master("local[*]").getOrCreate() import session.sqlContext.implicits._ session.sparkContext.setLogLevel("error") // Bug in Spark: "monotonically_increasing_id" is pushed down when it shouldn't be. Push down only happens when the // DF containing the "monotonically_increasing_id" expression is on the left side of the join. val baseTable = Seq((1), (1)).toDF("idx") val distinctWithId = baseTable.distinct.withColumn("id", functions.monotonically_increasing_id()) val monotonicallyOnRight: DataFrame = baseTable.join(distinctWithId, "idx") val monotonicallyOnLeft: DataFrame = distinctWithId.join(baseTable, "idx") monotonicallyOnLeft.show // Wrong monotonicallyOnRight.show // Ok in Spark 2.2.2 - also wrong in Spark 2.4.0 } {code} It produces the following output: {code:java} Wrong: +---++ |idx| id | +---++ | 1|369367187456 | | 1|369367187457 | +---++ Right: +---++ |idx| id | +---++ | 1|369367187456 | | 1|369367187456 | +---++ {code} We assume that the join operator triggers a pushdown of expressions (monotonically_increasing_id in this case) which gets pushed down to be executed before distinct. This produces non-distinct rows with unique id's. However it seems like this behavior only appears if the table with the projected expression is on the left side of the join. > Join on distinct column with monotonically_increasing_id produces wrong output > -- > > Key: SPARK-26572 > URL: https://issues.apache.org/jira/browse/SPARK-26572 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.4.0 > Environment: Running on Ubuntu 18.04LTS and Intellij 2018.2.5 >Reporter: Sören Reichardt >Priority: Major > > When joining a table with projected monotonically_increasing_id column after > calling distinct with another table the operators do not get executed in the > right order. > Here is a minimal example: > {code:java} > import org.apache.spark.sql.{DataFrame, SparkSession, functions} > object JoinBug extends App { > // Spark session setup > val session = SparkSession.builder().master("local[*]").getOrCreate() > import session.sqlContext.implicits._ > session.sparkContext.setLogLevel("error") > // Bug in Spark: "monotonically_increasing_id" is pushed down when it > shouldn't be. Push down only happens when the
[jira] [Updated] (SPARK-26572) Join on distinct column with monotonically_increasing_id produces wrong output
[ https://issues.apache.org/jira/browse/SPARK-26572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sören Reichardt updated SPARK-26572: Summary: Join on distinct column with monotonically_increasing_id produces wrong output (was: Join on distinct column with monotonically_increasing_id produced wrong output) > Join on distinct column with monotonically_increasing_id produces wrong output > -- > > Key: SPARK-26572 > URL: https://issues.apache.org/jira/browse/SPARK-26572 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.4.0 > Environment: Running on Ubuntu 18.04LTS and Intellij 2018.2.5 >Reporter: Sören Reichardt >Priority: Major > > When joining a table with projected monotonically_increasing_id column after > calling distinct with another table the operators do not get executed in the > right order. > Here is a minimal example: > {code:java} > import org.apache.spark.sql.{DataFrame, SparkSession, functions} > object JoinBug extends App { > // Spark session setup > val session = SparkSession.builder().master("local[*]").getOrCreate() > import session.sqlContext.implicits._ > session.sparkContext.setLogLevel("error") > // Bug in Spark: "monotonically_increasing_id" is pushed down when it > shouldn't be. Push down only happens when the > // DF containing the "monotonically_increasing_id" expression is on the > left side of the join. > val baseTable = Seq((1), (1)).toDF("idx") > val distinctWithId = baseTable.distinct.withColumn("id", > functions.monotonically_increasing_id()) > val monotonicallyOnRight: DataFrame = baseTable.join(distinctWithId, "idx") > val monotonicallyOnLeft: DataFrame = distinctWithId.join(baseTable, "idx") > monotonicallyOnLeft.show // Wrong > monotonicallyOnRight.show // Ok in Spark 2.2.2 - also wrong in Spark 2.4.0 > } > {code} > It produces the following output: > {code:java} > Wrong: > +---++ > |idx| id | > +---++ > | 1|369367187456 | > | 1|369367187457 | > +---++ > Right: > +---++ > |idx| id | > +---++ > | 1|369367187456 | > | 1|369367187456 | > +---++ > {code} > We assume that the join operator triggers a pushdown of expressions > (monotonically_increasing_id in this case) which gets pushed down to be > executed before distinct. This produces non-distinct rows with unique id's. > However it seems like this behavior only appears if the table with the > projected expression is on the left side of the join. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org