RE: pyspark.sql.functions.last not working as expected
Is the issue that the default frame for an ordered window is rangeBetween(-sys.maxsize, 0)? That would explain the behavior below. Is this default documented somewhere?

From: Alexander Peletz [mailto:alexand...@slalom.com]
Sent: Wednesday, August 17, 2016 8:48 PM
To: user <user@spark.apache.org>
Subject: RE: pyspark.sql.functions.last not working as expected
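A minimal PySpark sketch of the hypothesis above (an assumption, not confirmed against the docs): with an ordered window and no explicit frame, last() is evaluated over a running frame that ends at the current row, while an explicit frame spanning the whole partition gives the partition-wide last value. Column aliases are illustrative; sys.maxsize stands in for unbounded frame bounds, since Spark 2.0 has no named constants for them.

    import sys
    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import last

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 0, None), ("a", 1, "x"), ("a", 2, "y"),
         ("a", 3, "z"), ("a", 4, None)],
        ["key", "order", "value"])

    # Ordered window with the default (running) frame.
    w_default = Window.partitionBy("key").orderBy("order")

    # Same window with an explicit frame covering the whole partition.
    w_full = (Window.partitionBy("key").orderBy("order")
              .rowsBetween(-sys.maxsize, sys.maxsize))

    df.select(
        "key", "order",
        last("value", ignorenulls=True).over(w_default).alias("last_running"),
        last("value", ignorenulls=True).over(w_full).alias("last_partition"),
    ).orderBy("key", "order").show()

If the default really is rangeBetween(-sys.maxsize, 0), last_running should match the test's expected values quoted below, and last_partition should match the partition-wide results Alexander expected.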
RE: pyspark.sql.functions.last not working as expected
So here is the test case from the commit that added the first/last methods: https://github.com/apache/spark/pull/10957/commits/defcc02a8885e884d5140b11705b764a51753162

    test("last/first with ignoreNulls") {
      val nullStr: String = null
      val df = Seq(
        ("a", 0, nullStr),
        ("a", 1, "x"),
        ("a", 2, "y"),
        ("a", 3, "z"),
        ("a", 4, nullStr),
        ("b", 1, nullStr),
        ("b", 2, nullStr)).
        toDF("key", "order", "value")
      val window = Window.partitionBy($"key").orderBy($"order")
      checkAnswer(
        df.select(
          $"key",
          $"order",
          first($"value").over(window),
          first($"value", ignoreNulls = false).over(window),
          first($"value", ignoreNulls = true).over(window),
          last($"value").over(window),
          last($"value", ignoreNulls = false).over(window),
          last($"value", ignoreNulls = true).over(window)),
        Seq(
          Row("a", 0, null, null, null, null, null, null),
          Row("a", 1, null, null, "x", "x", "x", "x"),
          Row("a", 2, null, null, "x", "y", "y", "y"),
          Row("a", 3, null, null, "x", "z", "z", "z"),
          Row("a", 4, null, null, "x", null, null, "z"),
          Row("b", 1, null, null, null, null, null, null),
          Row("b", 2, null, null, null, null, null, null)))
    }

I would expect the correct results to be as follows instead of what is used above. Shouldn't we always return the first or last value in the partition based on the ordering? It looks like something else is going on... can someone explain?

        Seq(
          Row("a", 0, null, null, "x", null, null, "z"),
          Row("a", 1, null, null, "x", null, null, "z"),
          Row("a", 2, null, null, "x", null, null, "z"),
          Row("a", 3, null, null, "x", null, null, "z"),
          Row("a", 4, null, null, "x", null, null, "z"),
          Row("b", 1, null, null, null, null, null, null),
          Row("b", 2, null, null, null, null, null, null)))

From: Alexander Peletz [mailto:alexand...@slalom.com]
Sent: Wednesday, August 17, 2016 11:57 AM
To: user <user@spark.apache.org>
Subject: pyspark.sql.functions.last not working as expected
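Reading the test's expected rows as running values suggests first() and last() operate over the window frame rather than the whole partition. A hedged PySpark reproduction of the same data (assuming the ignorenulls parameter available in Spark 2.0; alias names are illustrative):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import first, last

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 0, None), ("a", 1, "x"), ("a", 2, "y"), ("a", 3, "z"),
         ("a", 4, None), ("b", 1, None), ("b", 2, None)],
        ["key", "order", "value"])

    w = Window.partitionBy("key").orderBy("order")
    df.select(
        "key", "order",
        first("value", ignorenulls=True).over(w).alias("first_nonnull"),
        last("value", ignorenulls=True).over(w).alias("last_nonnull"),
    ).orderBy("key", "order").show()
    # last_nonnull at ("a", 2) comes out "y", not the partition-wide "z",
    # consistent with a frame that ends at the current row.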
pyspark.sql.functions.last not working as expected
Hi,

I am using Spark 2.0 and I am getting unexpected results using the last() method. Has anyone else experienced this? I get the sense that last() is working correctly within a given data partition but not across the entire RDD. first() seems to work as expected, so I can work around this by using a window in reverse order and calling first() instead of last(), but it would be great if last() actually worked.

Thanks,
Alexander

Alexander Peletz
Consultant
slalom
Fortune 100 Best Companies to Work For 2016
Glassdoor Best Places to Work 2016
Consulting Magazine Best Firms to Work For 2015
316 Stuart Street, Suite 300, Boston, MA 02116
706.614.5033 cell | 617.316.5400 office
alexand...@slalom.com
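A sketch of the reverse-order workaround described above (column and alias names are illustrative): because the default frame always starts at the beginning of the partition, first() over a descending window returns what last() over the ascending window was expected to return.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import col, first

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 1, "x"), ("a", 2, "y"), ("a", 3, "z")],
        ["key", "order", "value"])

    # Descending order: the frame's first row is the partition's last row
    # under the original ascending order, so every row gets "z" here.
    w_desc = Window.partitionBy("key").orderBy(col("order").desc())
    df.withColumn("last_value", first("value").over(w_desc)).show()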