Maciej Szymkiewicz created SPARK-9978:
-----------------------------------------

             Summary: Window functions require partitionBy to work as expected
                 Key: SPARK-9978
                 URL: https://issues.apache.org/jira/browse/SPARK-9978
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.4.1
            Reporter: Maciej Szymkiewicz


I am trying to reproduce the following query:

{code}
df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.registerTempTable("df")
sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()

+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}

using the PySpark API. Unfortunately, it doesn't work as expected:

{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()

+----+--+
|   x|rn|
+----+--+
| 0.5| 1|
|0.25| 1|
|0.75| 1|
+----+--+
{code}
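For anyone triaging this, the plan for the query can be printed with explain (a diagnostic sketch using the same df and imports as above; I haven't verified what the 1.4.1 plans show):

{code}
# Print the parsed/analyzed/optimized/physical plans to inspect
# how the orderBy-only WindowSpec is resolved (diagnostic only).
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).explain(True)
{code}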

As a workaround, it is possible to call partitionBy without additional arguments:

{code}
df.select(
    df["x"],
    rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
).show()

+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}
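If this pattern recurs, the empty partitionBy() can be hidden behind a small helper (a hypothetical convenience wrapper, not part of the PySpark API):

{code}
from pyspark.sql.window import Window

def ordered_window(*cols):
    """Hypothetical helper: build a WindowSpec ordered by cols,
    starting from an explicit empty partitionBy() so that it
    behaves like SQL's OVER (ORDER BY ...)."""
    return Window.partitionBy().orderBy(*cols)

df.select(df["x"], rowNumber().over(ordered_window("x")).alias("rn")).show()
{code}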

but as far as I can tell this is not documented and is rather counterintuitive
considering the SQL syntax. Moreover, this problem doesn't affect the Scala API:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
import sqlContext.implicits._

case class Record(x: Double)
val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75) :: Nil)
df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show

+----+--+
|   x|rn|
+----+--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
+----+--+
{code}



