[ https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449122#comment-16449122 ]
Hyukjin Kwon commented on SPARK-24051: -------------------------------------- Thanks for elaborating it. The details look very helpful. Will try to have some time to take a look as well. > Incorrect results for certain queries using Java and Python APIs on Spark > 2.3.0 > ------------------------------------------------------------------------------- > > Key: SPARK-24051 > URL: https://issues.apache.org/jira/browse/SPARK-24051 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.0 > Reporter: Emlyn Corrin > Priority: Major > > I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) > query, demonstrated by the Java program below. It was simplified from a much > more complex query, but I'm having trouble simplifying it further without > removing the erroneous behaviour. > {code:java} > package sparktest; > import org.apache.spark.SparkConf; > import org.apache.spark.sql.*; > import org.apache.spark.sql.expressions.Window; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import java.util.Arrays; > public class Main { > public static void main(String[] args) { > SparkConf conf = new SparkConf() > .setAppName("SparkTest") > .setMaster("local[*]"); > SparkSession session = > SparkSession.builder().config(conf).getOrCreate(); > Row[] arr1 = new Row[]{ > RowFactory.create(1, 42), > RowFactory.create(2, 99)}; > StructType sch1 = new StructType(new StructField[]{ > new StructField("a", DataTypes.IntegerType, true, > Metadata.empty()), > new StructField("b", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1); > ds1.show(); > Row[] arr2 = new Row[]{ > RowFactory.create(3)}; > StructType sch2 = new StructType(new StructField[]{ > new StructField("a", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2) > .withColumn("b", functions.lit(0)); > ds2.show(); > Column[] cols = new Column[]{ > new Column("a"), > new Column("b").as("b"), > functions.count(functions.lit(1)) > .over(Window.partitionBy()) > .as("n")}; > Dataset<Row> ds = ds1 > .select(cols) > .union(ds2.select(cols)) > .where(new Column("n").geq(1)) > .drop("n"); > ds.show(); > //ds.explain(true); > } > } > {code} > It just calculates the union of 2 datasets, > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > +---+---+ > {code} > with > {code:java} > +---+---+ > | a| b| > +---+---+ > | 3| 0| > +---+---+ > {code} > The expected result is: > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > | 3| 0| > +---+---+ > {code} > but instead it prints: > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 0| > | 2| 0| > | 3| 0| > +---+---+ > {code} > notice how the value in column c is always zero, overriding the original > values in rows 1 and 2. > Making seemingly trivial changes, such as replacing {{new > Column("b").as("b"),}} with just {{new Column("b"),}} or removing the > {{where}} clause after the union, make it behave correctly again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org