[ https://issues.apache.org/jira/browse/SPARK-12880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Saif Addin Ellafi updated SPARK-12880:
--------------------------------------
Description:

scala> val overVint = Window.partitionBy("product", "bnd", "age").orderBy(asc("yyyymm"))
scala> val df_data2 = df_data.withColumn("result", lag("baleom", 1).over(overVint))

scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509").groupBy("yyyymm", "closed", "ever_closed").agg(sum("result").as("result")).show
+------+------+-----------+--------------------+
|yyyymm|closed|ever_closed|              result|
+------+------+-----------+--------------------+
|200509|     1|          1|1.2672666129980398E7|
|200509|     0|          0|2.7104834668856387E9|
|200509|     0|          1| 1.151339011298214E8|
+------+------+-----------+--------------------+

scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509").groupBy("yyyymm", "closed", "ever_closed").agg(sum("result").as("result")).show
+------+------+-----------+--------------------+
|yyyymm|closed|ever_closed|              result|
+------+------+-----------+--------------------+
|200509|     1|          1|1.2357681589980595E7|
|200509|     0|          0| 2.709930867575646E9|
|200509|     0|          1|1.1595048973981345E8|
+------+------+-----------+--------------------+

Running the identical query twice returns different sums.

Does NOT happen with columns that are not produced by the window function.
Happens both in cluster mode and in local mode.
Before the groupBy operation, the data looks good and is consistent.
Happens when the data is large (this case is 1.4 billion rows; it does not happen if I use limit(100000)).
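A self-contained sketch for reproducing the setup without the original data is below. The schema of df_data is not given in this report, so the column types and sample rows are assumptions; only product, bnd, age, yyyymm, and baleom are implied by the window definition, and closed/ever_closed by the groupBy. One plausible source of the run-to-run differences is that orderBy(asc("yyyymm")) is not a total order when a partition holds ties on yyyymm, in which case lag has no single well-defined previous row.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Stand-in for df_data: the real schema is not in the report, so these
// columns and rows are assumptions.
val df_data = sqlContext.createDataFrame(Seq(
  ("MAIN", "High", 30, 200508, 125.0, 0, 0),
  ("MAIN", "High", 30, 200509, 250.0, 0, 0),
  ("MAIN", "High", 30, 200509, 500.0, 1, 1)
)).toDF("product", "bnd", "age", "yyyymm", "baleom", "closed", "ever_closed")

// Same window as above. orderBy(asc("yyyymm")) does not break ties, so
// lag(..., 1) can pick a different neighbor for tied rows on each run.
val overVint = Window.partitionBy("product", "bnd", "age").orderBy(asc("yyyymm"))
val df_data2 = df_data.withColumn("result", lag("baleom", 1).over(overVint))

val agg = df_data2
  .filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509")
  .groupBy("yyyymm", "closed", "ever_closed")
  .agg(sum("result").as("result"))

// Each show() recomputes the plan (nothing is cached), mirroring the two
// runs in the report; on a large shuffled dataset the sums can differ.
agg.show()
agg.show()

If a unique row id exists or can be added (row_id here is hypothetical), including it as a tie-breaker, orderBy(asc("yyyymm"), asc("row_id")), makes the window ordering deterministic and would be a useful cross-check.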
> Different results on groupBy after window function
> --------------------------------------------------
>
>                 Key: SPARK-12880
>                 URL: https://issues.apache.org/jira/browse/SPARK-12880
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Saif Addin Ellafi
>            Priority: Critical

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org