Jose Antonio created SPARK-11481:
------------------------------------

             Summary: orderBy with multiple columns in WindowSpec does not work properly
                 Key: SPARK-11481
                 URL: https://issues.apache.org/jira/browse/SPARK-11481
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 1.5.1
         Environment: All
            Reporter: Jose Antonio


When using multiple columns in the orderBy of a WindowSpec, the ordering appears to apply only to the first column.

A possible workaround is to sort the DataFrame first and then apply the window spec over the sorted DataFrame.

e.g. 
THIS DOES NOT WORK:
import sys
from pyspark.sql import Window
import pyspark.sql.functions as func

window_sum = Window.partitionBy('user_unique_id') \
    .orderBy('creation_date', 'mib_id', 'day') \
    .rowsBetween(-sys.maxsize, 0)

df = df.withColumn('user_version', func.sum(df.group_counter).over(window_sum))

THIS WORKS WELL:
df = df.sort('user_unique_id', 'creation_date', 'mib_id', 'day')

window_sum = Window.partitionBy('user_unique_id') \
    .orderBy('creation_date', 'mib_id', 'day') \
    .rowsBetween(-sys.maxsize, 0)

df = df.withColumn('user_version', func.sum(df.group_counter).over(window_sum))
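
For reference, the expected semantics can be sketched in plain Python (no Spark): within each partition, rows should be ordered by ALL orderBy keys lexicographically, and the running sum should accumulate in that full order, not just by the first key. The column names are taken from the snippets above; the sample values are made up for illustration.

```python
import itertools

# Sample rows: (user_unique_id, creation_date, mib_id, day, group_counter).
# Values are illustrative only.
rows = [
    ("u1", "2015-01-02", 2, 1, 10),
    ("u1", "2015-01-01", 9, 1, 1),
    ("u1", "2015-01-01", 3, 1, 5),
]

def running_sums(rows):
    """Cumulative sum per partition, ordered by ALL three orderBy keys."""
    result = {}
    part = lambda r: r[0]  # partitionBy('user_unique_id')
    for _, grp in itertools.groupby(sorted(rows, key=part), key=part):
        total = 0
        # Full multi-column order: creation_date, then mib_id, then day.
        for r in sorted(grp, key=lambda r: (r[1], r[2], r[3])):
            total += r[4]
            result[r] = total
    return result

sums = running_sums(rows)
# The row with mib_id=3 sorts before mib_id=9 on the same creation_date,
# so the running sums are 5, then 6, then 16.
```

The reported bug is that the PySpark result matches ordering by 'creation_date' alone, whereas it should match the full-tuple ordering shown here.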

Also, can anybody confirm that this is a valid workaround?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
