[ https://issues.apache.org/jira/browse/SPARK-12880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Saif Addin Ellafi updated SPARK-12880:
--------------------------------------
Description:

scala> val overVint = Window.partitionBy("product", "bnd", "age").orderBy(asc("yyyymm"))
scala> val df_data2 = df_data.withColumn("result", lag("baleom", 1).over(overVint))

scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509").groupBy("yyyymm", "closed", "ever_closed").agg(sum("result").as("result")).show
+------+------+-----------+--------------------+
|yyyymm|closed|ever_closed|              result|
+------+------+-----------+--------------------+
|200509|     1|          1|1.2672666129980398E7|
|200509|     0|          0|2.7104834668856387E9|
|200509|     0|          1| 1.151339011298214E8|
+------+------+-----------+--------------------+

scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509").groupBy("yyyymm", "closed", "ever_closed").agg(sum("result").as("result")).show
+------+------+-----------+--------------------+
|yyyymm|closed|ever_closed|              result|
+------+------+-----------+--------------------+
|200509|     1|          1|1.2357681589980595E7|
|200509|     0|          0| 2.709930867575646E9|
|200509|     0|          1|1.1595048973981345E8|
+------+------+-----------+--------------------+

Running the identical query twice returns different sums.

Does NOT happen with columns that are not produced by the window function.
Happens both in cluster mode and in local mode.
Before the groupBy operation, the data looks good and is consistent.
Happens when the data is large (this case is 1.4 billion rows; it does not happen if I use limit(100000)).
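A self-contained sketch for reproducing the setup without the original data is below. The schema of df_data is not given in this report, so the column types and sample rows are assumptions; only product, bnd, age, yyyymm, and baleom are implied by the window definition, and closed/ever_closed by the groupBy. One plausible source of the run-to-run differences is that orderBy(asc("yyyymm")) is not a total order when a partition holds ties on yyyymm, in which case lag has no single well-defined previous row.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Stand-in for df_data: the real schema is not in the report, so these
// columns and rows are assumptions.
val df_data = sqlContext.createDataFrame(Seq(
  ("MAIN", "High", 30, 200508, 125.0, 0, 0),
  ("MAIN", "High", 30, 200509, 250.0, 0, 0),
  ("MAIN", "High", 30, 200509, 500.0, 1, 1)
)).toDF("product", "bnd", "age", "yyyymm", "baleom", "closed", "ever_closed")

// Same window as above. orderBy(asc("yyyymm")) does not break ties, so
// lag(..., 1) can pick a different neighbor for tied rows on each run.
val overVint = Window.partitionBy("product", "bnd", "age").orderBy(asc("yyyymm"))
val df_data2 = df_data.withColumn("result", lag("baleom", 1).over(overVint))

val agg = df_data2
  .filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509")
  .groupBy("yyyymm", "closed", "ever_closed")
  .agg(sum("result").as("result"))

// Each show() recomputes the plan (nothing is cached), mirroring the two
// runs in the report; on a large shuffled dataset the sums can differ.
agg.show()
agg.show()

If a unique row id exists or can be added (row_id here is hypothetical), including it as a tie-breaker, orderBy(asc("yyyymm"), asc("row_id")), makes the window ordering deterministic and would be a useful cross-check.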
> Different results on groupBy after window function
> --------------------------------------------------
>
>                 Key: SPARK-12880
>                 URL: https://issues.apache.org/jira/browse/SPARK-12880
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Saif Addin Ellafi
>            Priority: Critical

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org