Here is the test case from the commit that added the first/last methods:
https://github.com/apache/spark/pull/10957/commits/defcc02a8885e884d5140b11705b764a51753162



+  test("last/first with ignoreNulls") {
+    val nullStr: String = null
+    val df = Seq(
+      ("a", 0, nullStr),
+      ("a", 1, "x"),
+      ("a", 2, "y"),
+      ("a", 3, "z"),
+      ("a", 4, nullStr),
+      ("b", 1, nullStr),
+      ("b", 2, nullStr)).
+      toDF("key", "order", "value")
+    val window = Window.partitionBy($"key").orderBy($"order")
+    checkAnswer(
+      df.select(
+        $"key",
+        $"order",
+        first($"value").over(window),
+        first($"value", ignoreNulls = false).over(window),
+        first($"value", ignoreNulls = true).over(window),
+        last($"value").over(window),
+        last($"value", ignoreNulls = false).over(window),
+        last($"value", ignoreNulls = true).over(window)),
+      Seq(
+        Row("a", 0, null, null, null, null, null, null),
+        Row("a", 1, null, null, "x", "x", "x", "x"),
+        Row("a", 2, null, null, "x", "y", "y", "y"),
+        Row("a", 3, null, null, "x", "z", "z", "z"),
+        Row("a", 4, null, null, "x", null, null, "z"),
+        Row("b", 1, null, null, null, null, null, null),
+        Row("b", 2, null, null, null, null, null, null)))
+  }



I would expect the correct results to be as follows, instead of what is used 
above. Shouldn't we always return the first or last value in the partition 
based on the ordering? It looks like something else is going on... can someone 
explain?

+      Seq(
+        Row("a", 0, null, null, "x", null, null, "z"),
+        Row("a", 1, null, null, "x", null, null, "z"),
+        Row("a", 2, null, null, "x", null, null, "z"),
+        Row("a", 3, null, null, "x", null, null, "z"),
+        Row("a", 4, null, null, "x", null, null, "z"),
+        Row("b", 1, null, null, null, null, null, null),
+        Row("b", 2, null, null, null, null, null, null)))
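For what it's worth, the difference between the two expectations looks like a window-frame question: with an ORDER BY and no explicit frame, Spark's default frame runs from UNBOUNDED PRECEDING to CURRENT ROW, so last() only sees rows up to the current one, while the Seq I wrote above corresponds to a whole-partition frame. A plain-Python sketch of the two semantics (no Spark needed; first_val/last_val are hypothetical helpers for illustration, not Spark APIs):

```python
# Sketch of two window-frame semantics for last(..., ignoreNulls = true).
# Assumption illustrated: the test's answers come from a running frame
# (UNBOUNDED PRECEDING .. CURRENT ROW); the whole-partition frame gives
# the alternative Seq above.

rows = [
    ("a", 0, None), ("a", 1, "x"), ("a", 2, "y"),
    ("a", 3, "z"), ("a", 4, None),
    ("b", 1, None), ("b", 2, None),
]

def first_val(vals, ignore_nulls):
    # First value in the frame; skip nulls only when asked to.
    for v in vals:
        if v is not None or not ignore_nulls:
            return v
    return None

def last_val(vals, ignore_nulls):
    return first_val(list(reversed(vals)), ignore_nulls)

running, whole = [], []
for key, order, _ in rows:
    part = [v for k, o, v in rows if k == key]                 # whole partition
    frame = [v for k, o, v in rows if k == key and o <= order] # running frame
    running.append((key, order, last_val(frame, ignore_nulls=True)))
    whole.append((key, order, last_val(part, ignore_nulls=True)))

# running matches the test's last($"value", ignoreNulls = true) column,
# e.g. ("a", 2) -> "y" and ("a", 4) -> "z"; whole matches the Seq above,
# i.e. every "a" row -> "z" and every "b" row -> null.
```

In Spark itself the whole-partition behavior can be requested with an explicit frame, e.g. rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) on the window spec (hedged: the exact constant names vary across Spark releases).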



From: Alexander Peletz [mailto:alexand...@slalom.com]
Sent: Wednesday, August 17, 2016 11:57 AM
To: user <user@spark.apache.org>
Subject: pyspark.sql.functions.last not working as expected

Hi,

I am using Spark 2.0 and I am getting unexpected results using the last() 
method. Has anyone else experienced this? I get the sense that last() is 
working correctly within a given data partition but not across the entire RDD. 
first() seems to work as expected, so I can work around this by using a window 
in reverse order with first() instead of last(), but it would be great if 
last() actually worked.


Thanks,
Alexander


Alexander Peletz
Consultant

slalom

Fortune 100 Best Companies to Work For 2016
Glassdoor Best Places to Work 2016
Consulting Magazine Best Firms to Work For 2015

316 Stuart Street, Suite 300
Boston, MA 02116
706.614.5033 cell | 617.316.5400 office
alexand...@slalom.com
