Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21252#discussion_r186318646

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
    @@ -127,13 +127,36 @@ case class TakeOrderedAndProjectExec(
         projectList: Seq[NamedExpression],
         child: SparkPlan) extends UnaryExecNode {
    +  val sortInMemThreshold = conf.sortInMemForLimitThreshold
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (limit < sortInMemThreshold) {
    +      super.requiredChildDistribution
    +    } else {
    +      Seq(AllTuples)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
    +    if (limit < sortInMemThreshold) {
    +      super.requiredChildOrdering
    +    } else {
    +      Seq(sortOrder)
    +    }
    +  }
    --- End diff --

    Previously this only shuffled the limited rows from each partition, and the final sort was applied only to that limited data from all partitions. Now it requires sorting and shuffling the whole dataset. I suspect that would degrade performance in the end.
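    For context, here is a minimal Scala sketch of the two-phase top-K strategy the reviewer describes. This is not Spark's actual implementation; `takeOrderedTwoPhase` and its parameters are hypothetical stand-ins, with plain `Seq`s playing the role of an RDD's partitions:

        // Phase 1 runs on the map side; only `limit` rows per partition
        // survive to the shuffle. Phase 2 merges those small slices on a
        // single node, so the final sort touches at most
        // limit * numPartitions rows rather than the whole dataset.
        def takeOrderedTwoPhase[T: Ordering](
            partitions: Seq[Seq[T]],   // stand-in for an RDD's partitions
            limit: Int): Seq[T] = {
          // Phase 1 (per partition): local top-K.
          val perPartitionTopK = partitions.map(_.sorted.take(limit))
          // Phase 2 (single reducer): sort only the surviving rows.
          perPartitionTopK.flatten.sorted.take(limit)
        }

        // Example: takeOrderedTwoPhase(Seq(Seq(5, 1, 9), Seq(3, 7), Seq(2, 8)), limit = 2)
        // shuffles at most 2 rows per partition and returns Seq(1, 2).

    With the proposed `AllTuples` distribution above the threshold, every row of every partition would instead cross the shuffle and be sorted before the limit is applied, which is the cost the comment is flagging.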