HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768780252


   The discussion would probably be more constructive and productive if each 
idea came with an explanation of the actual case: which storage is expected to 
benefit from it, and how the idea boosts performance or helps avoid too many 
concurrent requests in flight. After that, we could determine "who" should 
decide the parallelism and "who" should provide the information for hints.
   (e.g. I don't think higher parallelism always gives the best performance 
when the storage is only ready for lower parallelism.)
   
   My feeling is that these things are all about "optimizations", whereas the 
use case I described is about a "requirement". So if we agree on the necessity 
of the latter, the former looks to be off topic here, and we'd better discuss 
it in another thread. (Either the mailing list or a sketched WIP PR, if someone 
has some ideas.)
   
   Btw, one thing we might need to revisit: if I understand correctly, once 
the data source requires distribution/ordering, end users lose the ability to 
give a hint on the number of output partitions (based on heuristics, roughly 
knowing the output data source and the characteristics of the output data). 
Would this be addressed when we replace the default number of shuffle 
partitions with None, or might AQE disregard it?

