[jira] [Assigned] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0

2018-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24051:


Assignee: Apache Spark

> Incorrect results for certain queries using Java and Python APIs on Spark 
> 2.3.0
> ---
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Assignee: Apache Spark
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> notice how the value in column c is always zero, overriding the original 
> values in rows 1 and 2.
>  Making seemingly trivial changes, such as replacing {{new 
> Column("b").as("b"),}} with just {{new Column("b"),}} or removing the 
> {{where}} clause after the union, make it behave correctly again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0

2018-04-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24051:


Assignee: (was: Apache Spark)

> Incorrect results for certain queries using Java and Python APIs on Spark 
> 2.3.0
> ---
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> notice how the value in column c is always zero, overriding the original 
> values in rows 1 and 2.
>  Making seemingly trivial changes, such as replacing {{new 
> Column("b").as("b"),}} with just {{new Column("b"),}} or removing the 
> {{where}} clause after the union, make it behave correctly again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org