[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Florentino Sainz updated SPARK-29265: ------------------------------------- Description: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala, sbt project (src and build.sbt) attached too in TestSpark.zip. was: Hi, I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached). Having this Window definition: {code:scala} val myWindow = Window.partitionBy($"word").orderBy("word") val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow)){code} As a user, I would expect either: 1- Error/warning (because trying to sort on one of the columns of the window partitionBy) 2- A mostly-useless operation which just orders the rows inside each Window but doesn't affect performance too much. Currently what I see: *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is performing a global orderBy of the whole DataFrame. Similar to dataframe.orderBy("word").* *In my real environment, my program just didn't finish in time/crashed thus causing my program to be very slow or crash (because as it's a global orderBy, it will just go to one executor).* In the test I can see how all elements of my DF are in a single partition (side-effect of the global orderBy) Full Code showing the error (see how the mapPartitions shows 99 rows in one partition) attached in Test.scala, sbt project (src and build.sbt) attached too. > Window orderBy causing full-DF orderBy > --------------------------------------- > > Key: SPARK-29265 > URL: https://issues.apache.org/jira/browse/SPARK-29265 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.3.0, 2.4.3, 2.4.4 > Environment: Any > Reporter: Florentino Sainz > Priority: Minor > Attachments: Test.scala, TestSpark.zip > > > Hi, > > I had this problem in "real" environments and also made a self-contained test > ( [^Test.scala] attached). > Having this Window definition: > {code:scala} > val myWindow = Window.partitionBy($"word").orderBy("word") > val filt2 = filtrador.withColumn("avg_Time", > avg($"number").over(myWindow)){code} > > As a user, I would expect either: > 1- Error/warning (because trying to sort on one of the columns of the window > partitionBy) > 2- A mostly-useless operation which just orders the rows inside each Window > but doesn't affect performance too much. > > Currently what I see: > *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is > performing a global orderBy of the whole DataFrame. Similar to > dataframe.orderBy("word").* > *In my real environment, my program just didn't finish in time/crashed thus > causing my program to be very slow or crash (because as it's a global > orderBy, it will just go to one executor).* > > In the test I can see how all elements of my DF are in a single partition > (side-effect of the global orderBy) > > Full Code showing the error (see how the mapPartitions shows 99 rows in one > partition) attached in Test.scala, sbt project (src and build.sbt) attached > too in TestSpark.zip. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org