natangsalgia commented on PR #25328:
URL: https://github.com/apache/spark/pull/25328#issuecomment-2377480524

   @cloud-fan :
   
   > 2. The `outputOrdering` is not very useful. We can only apply it if a) all 
the bucket columns are read. b) there is only one file in each bucket. This 
condition is really hard to meet, and even if we meet, sorting an already 
sorted file is pretty fast and avoiding the sort is not that useful. I think 
it's worth to give up this optimization so that explain don't need to get stuck.
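   
   For reference, the two conditions quoted above can be sketched roughly as follows. This is only an illustration of the check as described in the quote, not Spark's actual implementation; `BucketedScan`, its fields, and `can_skip_sort` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class BucketedScan:
    """Hypothetical description of a bucketed table scan (not a Spark class)."""
    bucket_columns: Sequence[str]            # columns the table is bucketed by
    read_columns: Sequence[str]              # columns the query actually reads
    files_per_bucket: Dict[int, List[str]]   # bucket id -> data files in that bucket


def can_skip_sort(scan: BucketedScan) -> bool:
    """Return True only when both quoted conditions hold."""
    # a) every bucket column is among the columns being read
    all_bucket_columns_read = set(scan.bucket_columns) <= set(scan.read_columns)
    # b) no bucket is split across more than one file
    one_file_per_bucket = all(
        len(files) <= 1 for files in scan.files_per_bucket.values()
    )
    return all_bucket_columns_read and one_file_per_bucket


# One file per bucket and the bucket column is read: the sort could be skipped.
single_file_scan = BucketedScan(
    bucket_columns=["user_id"],
    read_columns=["user_id", "amount"],
    files_per_bucket={0: ["part-0"], 1: ["part-1"]},
)

# Bucket 0 has two files, so the ordering cannot be assumed across the bucket.
multi_file_scan = BucketedScan(
    bucket_columns=["user_id"],
    read_columns=["user_id", "amount"],
    files_per_bucket={0: ["part-0a", "part-0b"], 1: ["part-1"]},
)
```

   Under these assumptions, a table written with several files per bucket would fail condition b) and would still need the sort.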
   
   We see cases where sorting the pre-sorted data adds 30+ minutes to the 
runtime when reading terabytes of data. There was a similar discussion on the 
mailing list this August [[1]].
   
   Are there plans to remove this config? Removing it could break Spark users 
with large datasets that benefit from this optimization.
   
   [1]: https://lists.apache.org/thread/10j29fspp5vs9p7w5c20f8sg1pbmq0hr


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: reviews-h...@spark.apache.org
