RE: pyspark.sql.functions.last not working as expected
Is the issue that the default frame for an ordered window is rangeBetween(-sys.maxsize, 0)? That would explain the behavior below. Is this default documented somewhere?

From: Alexander Peletz [mailto:alexand...@slalom.com]
Sent: Wednesday, August 17, 2016 8:48 PM
To: user <user@spark.apache.org>
Subject: RE: pyspark.sql.functions.last not working as expected
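A minimal PySpark sketch of the hypothesis above (an assumption, not confirmed against the docs): with an ordered window and no explicit frame, last() is evaluated over a running frame that ends at the current row, while an explicit frame spanning the whole partition gives the partition-wide last value. Column aliases are illustrative; sys.maxsize stands in for unbounded frame bounds, since Spark 2.0 has no named constants for them.

    import sys
    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import last

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 0, None), ("a", 1, "x"), ("a", 2, "y"),
         ("a", 3, "z"), ("a", 4, None)],
        ["key", "order", "value"])

    # Ordered window with the default (running) frame.
    w_default = Window.partitionBy("key").orderBy("order")

    # Same window with an explicit frame covering the whole partition.
    w_full = (Window.partitionBy("key").orderBy("order")
              .rowsBetween(-sys.maxsize, sys.maxsize))

    df.select(
        "key", "order",
        last("value", ignorenulls=True).over(w_default).alias("last_running"),
        last("value", ignorenulls=True).over(w_full).alias("last_partition"),
    ).orderBy("key", "order").show()

If the default really is rangeBetween(-sys.maxsize, 0), last_running should match the test's expected values quoted below, and last_partition should match the partition-wide results Alexander expected.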
RE: pyspark.sql.functions.last not working as expected
So here is the test case from the commit that added the first/last methods: https://github.com/apache/spark/pull/10957/commits/defcc02a8885e884d5140b11705b764a51753162

    test("last/first with ignoreNulls") {
      val nullStr: String = null
      val df = Seq(
        ("a", 0, nullStr),
        ("a", 1, "x"),
        ("a", 2, "y"),
        ("a", 3, "z"),
        ("a", 4, nullStr),
        ("b", 1, nullStr),
        ("b", 2, nullStr)).
        toDF("key", "order", "value")
      val window = Window.partitionBy($"key").orderBy($"order")
      checkAnswer(
        df.select(
          $"key",
          $"order",
          first($"value").over(window),
          first($"value", ignoreNulls = false).over(window),
          first($"value", ignoreNulls = true).over(window),
          last($"value").over(window),
          last($"value", ignoreNulls = false).over(window),
          last($"value", ignoreNulls = true).over(window)),
        Seq(
          Row("a", 0, null, null, null, null, null, null),
          Row("a", 1, null, null, "x", "x", "x", "x"),
          Row("a", 2, null, null, "x", "y", "y", "y"),
          Row("a", 3, null, null, "x", "z", "z", "z"),
          Row("a", 4, null, null, "x", null, null, "z"),
          Row("b", 1, null, null, null, null, null, null),
          Row("b", 2, null, null, null, null, null, null)))
    }

I would expect the correct results to be as follows instead of what is used above. Shouldn't we always return the first or last value in the partition based on the ordering? It looks like something else is going on... can someone explain?

        Seq(
          Row("a", 0, null, null, "x", null, null, "z"),
          Row("a", 1, null, null, "x", null, null, "z"),
          Row("a", 2, null, null, "x", null, null, "z"),
          Row("a", 3, null, null, "x", null, null, "z"),
          Row("a", 4, null, null, "x", null, null, "z"),
          Row("b", 1, null, null, null, null, null, null),
          Row("b", 2, null, null, null, null, null, null)))

From: Alexander Peletz [mailto:alexand...@slalom.com]
Sent: Wednesday, August 17, 2016 11:57 AM
To: user <user@spark.apache.org>
Subject: pyspark.sql.functions.last not working as expected
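Reading the test's expected rows as running values suggests first() and last() operate over the window frame rather than the whole partition. A hedged PySpark reproduction of the same data (assuming the ignorenulls parameter available in Spark 2.0; alias names are illustrative):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import first, last

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 0, None), ("a", 1, "x"), ("a", 2, "y"), ("a", 3, "z"),
         ("a", 4, None), ("b", 1, None), ("b", 2, None)],
        ["key", "order", "value"])

    w = Window.partitionBy("key").orderBy("order")
    df.select(
        "key", "order",
        first("value", ignorenulls=True).over(w).alias("first_nonnull"),
        last("value", ignorenulls=True).over(w).alias("last_nonnull"),
    ).orderBy("key", "order").show()
    # last_nonnull at ("a", 2) comes out "y", not the partition-wide "z",
    # consistent with a frame that ends at the current row.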
pyspark.sql.functions.last not working as expected
Hi,

I am using Spark 2.0 and I am getting unexpected results using the last() method. Has anyone else experienced this? I get the sense that last() is working correctly within a given data partition but not across the entire RDD. first() seems to work as expected, so I can work around this by using a window in reverse order and calling first() instead of last(), but it would be great if last() actually worked.

Thanks,
Alexander

Alexander Peletz
Consultant
slalom
Fortune 100 Best Companies to Work For 2016
Glassdoor Best Places to Work 2016
Consulting Magazine Best Firms to Work For 2015
316 Stuart Street, Suite 300, Boston, MA 02116
706.614.5033 cell | 617.316.5400 office
alexand...@slalom.com
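A sketch of the reverse-order workaround described above (column and alias names are illustrative): because the default frame always starts at the beginning of the partition, first() over a descending window returns what last() over the ascending window was expected to return.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import col, first

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 1, "x"), ("a", 2, "y"), ("a", 3, "z")],
        ["key", "order", "value"])

    # Descending order: the frame's first row is the partition's last row
    # under the original ascending order, so every row gets "z" here.
    w_desc = Window.partitionBy("key").orderBy(col("order").desc())
    df.withColumn("last_value", first("value").over(w_desc)).show()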