viirya commented on pull request #30392:
URL: https://github.com/apache/spark/pull/30392#issuecomment-729336622
I misread some code here. Although I think `treeReduce` could resolve the
issue of handling all priority queues at the driver, it seems a general issue
for `reduce` API and ther
viirya commented on pull request #30392:
URL: https://github.com/apache/spark/pull/30392#issuecomment-729330116
> There will not be an additional map task - it will get pipelined with the
`mapPartitions` - with the `iter.reduceLeft` in `reduce` working on a single
element. Essentially, I a
viirya commented on pull request #30392:
URL: https://github.com/apache/spark/pull/30392#issuecomment-729205954
> I guess I'd just say, do you have any evidence it speeds things up? If not
I wouldn't change it. If it clearly helps in more imaginable cases than it
hurts, OK.
Yeah, it
viirya commented on pull request #30392:
URL: https://github.com/apache/spark/pull/30392#issuecomment-729168362
> I suppose the potential overhead of treeReduce is that it may proceed in
several phases, whereas reduce does it in one pass. If the executor-side reduce
is unnecessary because
viirya commented on pull request #30392:
URL: https://github.com/apache/spark/pull/30392#issuecomment-728773432
> I am not sure I follow - the `reduce` will reduce it at driver - based on
the individual priority queues per partition - while `treeReduce` will
progressively reduce it in exec
viirya commented on pull request #30392:
URL: https://github.com/apache/spark/pull/30392#issuecomment-728735611
I will run a benchmark later if the tests pass.
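
For readers following the thread, the trade-off under discussion (merging every per-partition priority queue at the driver with `reduce`, versus merging progressively in several phases with `treeReduce`) can be illustrated with a small stand-alone model. This is only a sketch in plain Python, not Spark code and not the change proposed in the pull request: the helper names (`top_k`, `merge`, `flat_reduce`, `tree_reduce`) and the `fan_in` parameter are hypothetical, and the grouping rule is a simplification rather than Spark's actual `treeReduce` scheduling.

```python
import heapq
from functools import reduce


def top_k(values, k):
    """Per-partition step: keep only the k largest values (a bounded queue)."""
    return heapq.nlargest(k, values)


def merge(a, b, k):
    """Combine two bounded queues into one holding at most k values."""
    return heapq.nlargest(k, a + b)


def flat_reduce(queues, k):
    """Simplified model of `reduce`: all queues are merged in one place."""
    driver_merges = len(queues) - 1
    result = reduce(lambda a, b: merge(a, b, k), queues)
    return result, driver_merges


def tree_reduce(queues, k, fan_in=2):
    """Simplified model of `treeReduce`: merge in rounds of `fan_in` queues
    (assumed to run away from the driver) until at most `fan_in` remain,
    then do the final merge at the driver."""
    level = list(queues)
    rounds = 0
    while len(level) > fan_in:
        level = [
            reduce(lambda a, b: merge(a, b, k), level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
        rounds += 1
    driver_merges = len(level) - 1
    result = reduce(lambda a, b: merge(a, b, k), level)
    return result, driver_merges, rounds


# Eight hypothetical partitions, each reduced to its own top-3 queue first.
partitions = [[i * 10 + j for j in range(5)] for i in range(8)]
queues = [top_k(p, 3) for p in partitions]

flat_result, flat_driver_merges = flat_reduce(queues, 3)
tree_result, tree_driver_merges, tree_rounds = tree_reduce(queues, 3)
```

In this model both strategies return the same top-k values. The tree variant leaves fewer merges for the final step but needs extra rounds, which is the potential overhead raised in the quoted comments. Whether that pays off in practice is exactly what the benchmark mentioned above would have to show.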