HeartSaVioR commented on pull request #31355: URL: https://github.com/apache/spark/pull/31355#issuecomment-768780252
The discussion would probably be more productive if each idea came with an explanation based on an actual case: which storage is expected to benefit from it, and how the idea would boost performance or help avoid too many concurrent requests in flight. After that we could determine "who" should decide the parallelism and "who" should provide the information for hints. (For example, I don't think higher parallelism always gives the best performance when the storage is only ready for lower parallelism.)

My feeling is that these things are all about "optimizations", whereas the use case I described is about a "requirement". So if we agree on the necessity of the latter, the former looks to be off topic here, and we'd better discuss it in another thread (either the mailing list or a sketched WIP PR, if someone has some ideas).

Btw, one thing we might need to revisit: if I understand correctly, once the data source requires distribution/ordering, end users lose the ability to give a hint on the number of output partitions (a hint based on heuristics, from knowing the output data source and roughly the characteristics of the output data). Would that be addressed when we replace the default number of shuffle partitions with None, or might AQE disregard it?