Maciej Szymkiewicz created SPARK-9978: -----------------------------------------
Summary: Window functions require partitionBy to work as expected Key: SPARK-9978 URL: https://issues.apache.org/jira/browse/SPARK-9978 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Reporter: Maciej Szymkiewicz I am trying to reproduce following query: {code} df.registerTempTable("df") sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show() +----+--+ | x|rn| +----+--+ |0.25| 1| | 0.5| 2| |0.75| 3| +----+--+ {code} using PySpark API. Unfortunately it doesn't work as expected: {code} from pyspark.sql.window import Window from pyspark.sql.functions import rowNumber df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}]) df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show() +----+--+ | x|rn| +----+--+ | 0.5| 1| |0.25| 1| |0.75| 1| +----+--+ {code} As a workaround It is possible to call partitionBy without additional arguments: {code} df.select( df["x"], rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn") ).show() +----+--+ | x|rn| +----+--+ |0.25| 1| | 0.5| 2| |0.75| 3| +----+--+ {code} but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect Scala API: {code} import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.rowNumber case class Record(x: Double) val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75)) df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show +----+--+ | x|rn| +----+--+ |0.25| 1| | 0.5| 2| |0.75| 3| +----+--+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org