[ https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Emlyn Corrin updated SPARK-24051:
---------------------------------
    Description:

I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) query, demonstrated by the Java program below. It was simplified from a much more complex query, but I'm having trouble simplifying it further without removing the erroneous behaviour.

{code:java}
package sparktest;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SparkTest")
                .setMaster("local[*]");
        SparkSession session = SparkSession.builder().config(conf).getOrCreate();

        Row[] arr1 = new Row[]{
                RowFactory.create(1, 42),
                RowFactory.create(2, 99)};
        StructType sch1 = new StructType(new StructField[]{
                new StructField("a", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("b", DataTypes.IntegerType, true, Metadata.empty())});
        Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
        ds1.show();

        Row[] arr2 = new Row[]{
                RowFactory.create(3)};
        StructType sch2 = new StructType(new StructField[]{
                new StructField("a", DataTypes.IntegerType, true, Metadata.empty())});
        Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
                .withColumn("b", functions.lit(0));
        ds2.show();

        Column[] cols = new Column[]{
                new Column("a"),
                new Column("b").as("b"),
                functions.count(functions.lit(1))
                        .over(Window.partitionBy())
                        .as("n")};
        Dataset<Row> ds = ds1
                .select(cols)
                .union(ds2.select(cols))
                .where(new Column("n").geq(1))
                .drop("n");
        ds.show();
        //ds.explain(true);
    }
}
{code}

It just calculates the union of two datasets,
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
+---+---+
{code}
with
{code}
+---+---+
|  a|  b|
+---+---+
|  3|  0|
+---+---+
{code}
The expected result is:
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
|  3|  0|
+---+---+
{code}
but instead it prints:
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  0|
|  2|  0|
|  3|  0|
+---+---+
{code}
Notice how the value in column b is always zero, overriding the original values in rows 1 and 2. Making seemingly trivial changes, such as replacing {{new Column("b").as("b", Metadata.empty()),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.

was:

I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) query, demonstrated by the Java program below. It was simplified from a much more complex query, but I'm having trouble simplifying it further without removing the erroneous behaviour.

{code:java}
package sparktest;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SparkTest")
                .setMaster("local[*]");
        SparkSession session = SparkSession.builder().config(conf).getOrCreate();

        Row[] arr1 = new Row[]{
                RowFactory.create(1, 42),
                RowFactory.create(2, 99)};
        StructType sch1 = new StructType(new StructField[]{
                new StructField("a", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("b", DataTypes.IntegerType, true, Metadata.empty())});
        Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
        ds1.show();

        Row[] arr2 = new Row[]{
                RowFactory.create(3)};
        StructType sch2 = new StructType(new StructField[]{
                new StructField("a", DataTypes.IntegerType, true, Metadata.empty())});
        Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
                .withColumn("b", functions.lit(0));
        ds2.show();

        Column[] cols = new Column[]{
                new Column("a"),
                new Column("b").as("b", Metadata.empty()),
                functions.count(functions.lit(1))
                        .over(Window.partitionBy(new Column("a")))
                        .as("n")};
        Dataset<Row> ds = ds1
                .select(cols)
                .union(ds2.select(cols))
                .where(new Column("n").geq(1))
                .drop("n");
        ds.show();
        //ds.explain(true);
    }
}
{code}

It just calculates the union of two datasets,
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
+---+---+
{code}
with
{code}
+---+---+
|  a|  b|
+---+---+
|  3|  0|
+---+---+
{code}
The expected result is:
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
|  3|  0|
+---+---+
{code}
but instead it prints:
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  0|
|  2|  0|
|  3|  0|
+---+---+
{code}
Notice how the value in column b is always zero, overriding the original values in rows 1 and 2. Making seemingly trivial changes, such as replacing {{new Column("b").as("b", Metadata.empty()),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.


> Incorrect results
> -----------------
>
>                 Key: SPARK-24051
>                 URL: https://issues.apache.org/jira/browse/SPARK-24051
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Emlyn Corrin
>            Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) query, demonstrated by the Java program below. It was simplified from a much more complex query, but I'm having trouble simplifying it further without removing the erroneous behaviour.
> {code:java}
> package sparktest;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
>
> import java.util.Arrays;
>
> public class Main {
>     public static void main(String[] args) {
>         SparkConf conf = new SparkConf()
>                 .setAppName("SparkTest")
>                 .setMaster("local[*]");
>         SparkSession session = SparkSession.builder().config(conf).getOrCreate();
>
>         Row[] arr1 = new Row[]{
>                 RowFactory.create(1, 42),
>                 RowFactory.create(2, 99)};
>         StructType sch1 = new StructType(new StructField[]{
>                 new StructField("a", DataTypes.IntegerType, true, Metadata.empty()),
>                 new StructField("b", DataTypes.IntegerType, true, Metadata.empty())});
>         Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
>         ds1.show();
>
>         Row[] arr2 = new Row[]{
>                 RowFactory.create(3)};
>         StructType sch2 = new StructType(new StructField[]{
>                 new StructField("a", DataTypes.IntegerType, true, Metadata.empty())});
>         Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
>                 .withColumn("b", functions.lit(0));
>         ds2.show();
>
>         Column[] cols = new Column[]{
>                 new Column("a"),
>                 new Column("b").as("b"),
>                 functions.count(functions.lit(1))
>                         .over(Window.partitionBy())
>                         .as("n")};
>         Dataset<Row> ds = ds1
>                 .select(cols)
>                 .union(ds2.select(cols))
>                 .where(new Column("n").geq(1))
>                 .drop("n");
>         ds.show();
>         //ds.explain(true);
>     }
> }
> {code}
> It just calculates the union of two datasets,
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
> Making seemingly trivial changes, such as replacing {{new Column("b").as("b", Metadata.empty()),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
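For reference, the semantics the query is expected to have can be modelled without Spark: each union branch tags its rows with the branch's global row count {{n}} (a {{count(1)}} window over an empty {{partitionBy()}}), the branches are unioned, rows with {{n >= 1}} are kept, and {{n}} is dropped, so the original {{(a, b)}} values must pass through unchanged. The sketch below is plain Python, purely illustrative; the helper names {{with_count}} and {{union_where_drop}} are hypothetical and not Spark APIs.

```python
# Plain-Python model of the query's intended semantics (no Spark involved).
# Rows are (a, b) tuples; helper names are illustrative only.

def with_count(rows):
    """count(1) over Window.partitionBy(): every row sees the branch's total."""
    n = len(rows)
    return [(a, b, n) for (a, b) in rows]

def union_where_drop(branch1, branch2):
    """Union the two counted branches, keep rows with n >= 1, then drop n."""
    unioned = with_count(branch1) + with_count(branch2)
    return [(a, b) for (a, b, n) in unioned if n >= 1]

ds1 = [(1, 42), (2, 99)]
ds2 = [(3, 0)]  # ds2 after withColumn("b", lit(0))

print(union_where_drop(ds1, ds2))  # -> [(1, 42), (2, 99), (3, 0)]
```

Under these semantics the {{where}} clause filters nothing out (every {{n}} is at least 1), which is why the correct answer keeps {{b = 42}} and {{b = 99}} rather than zeroing them.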