jon-chuang commented on issue #1221: URL: https://github.com/apache/arrow-datafusion/issues/1221#issuecomment-968350452
Regarding shuffling, I saw in some benchmarks for [TiDB's distributed query engine](https://www.youtube.com/watch?v=mmzoSkEhYrA) (which, incidentally, also relies on columnar storage) that an MPP-style shuffle seemed to produce better results than the MapReduce-style shuffle of Apache Spark. There are some open questions, such as whether Java could be the cause of this discrepancy, but it may still be worth thinking about how to optimize the shuffles. I don't know enough about DataFusion to know whether it takes data movement into account when generating query plans.
