natangsalgia commented on PR #25328: URL: https://github.com/apache/spark/pull/25328#issuecomment-2377480524
@cloud-fan:

> 2. The `outputOrdering` is not very useful. We can only apply it if a) all the bucket columns are read. b) there is only one file in each bucket. This condition is really hard to meet, and even if we meet, sorting an already sorted file is pretty fast and avoiding the sort is not that useful. I think it's worth to give up this optimization so that explain don't need to get stuck.

We see cases where sorting the pre-sorted data adds 30+ minutes to the runtime when reading terabytes of data. There was a similar discussion on the mailing list this August [1]. Are there plans to remove this config? Removing it could break Spark users with large datasets who benefit from this optimization.

[1]: https://lists.apache.org/thread/10j29fspp5vs9p7w5c20f8sg1pbmq0hr

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org