[GitHub] [spark] viirya commented on pull request #30392: [SPARK-33465][CORE] RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead

2020-11-17 Thread GitBox
viirya commented on pull request #30392: URL: https://github.com/apache/spark/pull/30392#issuecomment-729336622 I misread some code here. Although I think `treeReduce` could resolve the issue of handling all priority queues at the driver, it seems to be a general issue for the `reduce` API and ther

[GitHub] [spark] viirya commented on pull request #30392: [SPARK-33465][CORE] RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead

2020-11-17 Thread GitBox
viirya commented on pull request #30392: URL: https://github.com/apache/spark/pull/30392#issuecomment-729330116 > There will not be an additional map task - it will get pipelined with the `mapPartitions` - with the `iter.reduceLeft` in `reduce` working on a single element. Essentially, I a

[GitHub] [spark] viirya commented on pull request #30392: [SPARK-33465][CORE] RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead

2020-11-17 Thread GitBox
viirya commented on pull request #30392: URL: https://github.com/apache/spark/pull/30392#issuecomment-729205954 > I guess I'd just say, do you have any evidence it speeds things up? If not, I wouldn't change it. If it clearly helps in more imaginable cases than it hurts, OK. Yeah, it

[GitHub] [spark] viirya commented on pull request #30392: [SPARK-33465][CORE] RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead

2020-11-17 Thread GitBox
viirya commented on pull request #30392: URL: https://github.com/apache/spark/pull/30392#issuecomment-729168362 > I suppose the potential overhead of treeReduce is that it may proceed in several phases, whereas reduce does it in one pass. If the executor-side reduce is unnecessary because

[GitHub] [spark] viirya commented on pull request #30392: [SPARK-33465][CORE] RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead

2020-11-17 Thread GitBox
viirya commented on pull request #30392: URL: https://github.com/apache/spark/pull/30392#issuecomment-728773432 > I am not sure I follow - the `reduce` will reduce it at driver - based on the individual priority queues per partition - while `treeReduce` will progressively reduce it in exec
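
For readers of this thread, the contrast quoted above can be sketched outside Spark. This is only an illustrative stand-in in plain Python, not Spark's implementation: sorted lists play the role of the per-partition priority queues, and the names `top_k`, `merge`, `flat_reduce`, `tree_reduce`, and the `fan_in` parameter are hypothetical. It assumes at least one partition.

```python
import heapq
from functools import reduce


def top_k(items, k):
    """Smallest k items of one partition (stand-in for a bounded priority queue)."""
    return heapq.nsmallest(k, items)


def merge(a, b, k):
    """Merge two partial results and keep only the smallest k."""
    return heapq.nsmallest(k, a + b)


def flat_reduce(partials, k):
    """Merge every partial result in one place (the driver, in the thread's terms)."""
    return reduce(lambda a, b: merge(a, b, k), partials)


def tree_reduce(partials, k, fan_in=2):
    """Merge partial results in rounds of `fan_in`, so the last step sees fewer inputs."""
    level = list(partials)
    while len(level) > 1:
        level = [
            reduce(lambda a, b: merge(a, b, k), level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


# Four hypothetical partitions; each is first reduced to its own top 3.
partitions = [[9, 4, 7], [1, 8, 3], [6, 2, 5], [10, 0, 11]]
partials = [top_k(p, 3) for p in partitions]

print(flat_reduce(partials, 3))
print(tree_reduce(partials, 3))
```

Both strategies return the same result; they differ only in where the merging work happens and in how many rounds it takes, which is the trade-off debated in the comments.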

[GitHub] [spark] viirya commented on pull request #30392: [SPARK-33465][CORE] RDD.takeOrdered should get rid of usage of reduce or use treeReduce instead

2020-11-16 Thread GitBox
viirya commented on pull request #30392: URL: https://github.com/apache/spark/pull/30392#issuecomment-728735611 I will run a benchmark later if the tests pass.