[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-------------------------------------
    Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
collect_list($"number").over(myWindow))
{code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows (which have the same 
value on word) inside each Window but doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy on the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program just didn't finish in time/crashed thus 
causing my program to be very slow or crash (because as it's a global orderBy, 
it will just go to one executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
avg($"number").over(myWindow)){code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows (which have the same 
value on word) inside each Window but doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy on the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program just didn't finish in time/crashed thus 
causing my program to be very slow or crash (because as it's a global orderBy, 
it will just go to one executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.


> Window orderBy causing full-DF orderBy 
> ---------------------------------------
>
>                 Key: SPARK-29265
>                 URL: https://issues.apache.org/jira/browse/SPARK-29265
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>         Environment: Any
>            Reporter: Florentino Sainz
>            Priority: Minor
>         Attachments: Test.scala, TestSpark.zip
>
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program just didn't finish in time/crashed thus 
> causing my program to be very slow or crash (because as it's a global 
> orderBy, it will just go to one executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to