[ https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Emlyn Corrin updated SPARK-24051:
---------------------------------
    Description:

I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) query, demonstrated by the Java program below. It was simplified from a much more complex query, but I'm having trouble simplifying it further without removing the erroneous behaviour.

{code:java}
package sparktest;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SparkTest")
                .setMaster("local[*]");
        SparkSession session = SparkSession.builder().config(conf).getOrCreate();

        Row[] arr1 = new Row[]{
                RowFactory.create(1, 42),
                RowFactory.create(2, 99)};
        StructType sch1 = new StructType(new StructField[]{
                new StructField("a", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("b", DataTypes.IntegerType, true, Metadata.empty())});
        Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
        ds1.show();

        Row[] arr2 = new Row[]{
                RowFactory.create(3)};
        StructType sch2 = new StructType(new StructField[]{
                new StructField("a", DataTypes.IntegerType, true, Metadata.empty())});
        Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
                .withColumn("b", functions.lit(0));
        ds2.show();

        Column[] cols = new Column[]{
                new Column("a"),
                new Column("b").as("b"),
                functions.count(functions.lit(1))
                        .over(Window.partitionBy())
                        .as("n")};
        Dataset<Row> ds = ds1
                .select(cols)
                .union(ds2.select(cols))
                .where(new Column("n").geq(1))
                .drop("n");
        ds.show();
        //ds.explain(true);
    }
}
{code}

It just calculates the union of two datasets,
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
+---+---+
{code}
with
{code}
+---+---+
|  a|  b|
+---+---+
|  3|  0|
+---+---+
{code}
The expected result is:
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
|  3|  0|
+---+---+
{code}
but instead it prints:
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  0|
|  2|  0|
|  3|  0|
+---+---+
{code}
Notice how the value in column b is always zero, overriding the original values in rows 1 and 2. Making seemingly trivial changes, such as replacing {{new Column("b").as("b", Metadata.empty()),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.

was:

I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) query, demonstrated by the Java program below. It was simplified from a much more complex query, but I'm having trouble simplifying it further without removing the erroneous behaviour.

{code:java}
package sparktest;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SparkTest")
                .setMaster("local[*]");
        SparkSession session = SparkSession.builder().config(conf).getOrCreate();

        Row[] arr1 = new Row[]{
                RowFactory.create(1, 42),
                RowFactory.create(2, 99)};
        StructType sch1 = new StructType(new StructField[]{
                new StructField("a", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("b", DataTypes.IntegerType, true, Metadata.empty())});
        Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
        ds1.show();

        Row[] arr2 = new Row[]{
                RowFactory.create(3)};
        StructType sch2 = new StructType(new StructField[]{
                new StructField("a", DataTypes.IntegerType, true, Metadata.empty())});
        Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
                .withColumn("b", functions.lit(0));
        ds2.show();

        Column[] cols = new Column[]{
                new Column("a"),
                new Column("b").as("b", Metadata.empty()),
                functions.count(functions.lit(1))
                        .over(Window.partitionBy(new Column("a")))
                        .as("n")};
        Dataset<Row> ds = ds1
                .select(cols)
                .union(ds2.select(cols))
                .where(new Column("n").geq(1))
                .drop("n");
        ds.show();
        //ds.explain(true);
    }
}
{code}

It just calculates the union of two datasets,
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
+---+---+
{code}
with
{code}
+---+---+
|  a|  b|
+---+---+
|  3|  0|
+---+---+
{code}
The expected result is:
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
|  3|  0|
+---+---+
{code}
but instead it prints:
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  0|
|  2|  0|
|  3|  0|
+---+---+
{code}
Notice how the value in column b is always zero, overriding the original values in rows 1 and 2. Making seemingly trivial changes, such as replacing {{new Column("b").as("b", Metadata.empty()),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.


> Incorrect results
> -----------------
>
>                 Key: SPARK-24051
>                 URL: https://issues.apache.org/jira/browse/SPARK-24051
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Emlyn Corrin
>            Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) query, demonstrated by the Java program below. It was simplified from a much more complex query, but I'm having trouble simplifying it further without removing the erroneous behaviour.
> {code:java}
> package sparktest;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
>
> import java.util.Arrays;
>
> public class Main {
>     public static void main(String[] args) {
>         SparkConf conf = new SparkConf()
>                 .setAppName("SparkTest")
>                 .setMaster("local[*]");
>         SparkSession session = SparkSession.builder().config(conf).getOrCreate();
>
>         Row[] arr1 = new Row[]{
>                 RowFactory.create(1, 42),
>                 RowFactory.create(2, 99)};
>         StructType sch1 = new StructType(new StructField[]{
>                 new StructField("a", DataTypes.IntegerType, true, Metadata.empty()),
>                 new StructField("b", DataTypes.IntegerType, true, Metadata.empty())});
>         Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
>         ds1.show();
>
>         Row[] arr2 = new Row[]{
>                 RowFactory.create(3)};
>         StructType sch2 = new StructType(new StructField[]{
>                 new StructField("a", DataTypes.IntegerType, true, Metadata.empty())});
>         Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
>                 .withColumn("b", functions.lit(0));
>         ds2.show();
>
>         Column[] cols = new Column[]{
>                 new Column("a"),
>                 new Column("b").as("b"),
>                 functions.count(functions.lit(1))
>                         .over(Window.partitionBy())
>                         .as("n")};
>         Dataset<Row> ds = ds1
>                 .select(cols)
>                 .union(ds2.select(cols))
>                 .where(new Column("n").geq(1))
>                 .drop("n");
>         ds.show();
>         //ds.explain(true);
>     }
> }
> {code}
> It just calculates the union of two datasets,
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
> Making seemingly trivial changes, such as replacing {{new Column("b").as("b", Metadata.empty()),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
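For reference, the semantics the query is expected to have can be modelled without Spark: each union branch tags its rows with the branch's global row count {{n}} (a {{count(1)}} window over an empty {{partitionBy()}}), the branches are unioned, rows with {{n >= 1}} are kept, and {{n}} is dropped, so the original {{(a, b)}} values must pass through unchanged. The sketch below is plain Python, purely illustrative; the helper names {{with_count}} and {{union_where_drop}} are hypothetical and not Spark APIs.

```python
# Plain-Python model of the query's intended semantics (no Spark involved).
# Rows are (a, b) tuples; helper names are illustrative only.

def with_count(rows):
    """count(1) over Window.partitionBy(): every row sees the branch's total."""
    n = len(rows)
    return [(a, b, n) for (a, b) in rows]

def union_where_drop(branch1, branch2):
    """Union the two counted branches, keep rows with n >= 1, then drop n."""
    unioned = with_count(branch1) + with_count(branch2)
    return [(a, b) for (a, b, n) in unioned if n >= 1]

ds1 = [(1, 42), (2, 99)]
ds2 = [(3, 0)]  # ds2 after withColumn("b", lit(0))

print(union_where_drop(ds1, ds2))  # -> [(1, 42), (2, 99), (3, 0)]
```

Under these semantics the {{where}} clause filters nothing out (every {{n}} is at least 1), which is why the correct answer keeps {{b = 42}} and {{b = 99}} rather than zeroing them.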