The thread a couple of days ago about CoGroupByKey being possibly broken in
Beam 2.36.0 [1] contained an interesting detail that got me thinking.
CoGbkResultCoder doesn't override getEncodedElementByteSize (nor does it
know how to compute the size efficiently), so sampling the size of a
CoGbkResult element requires iterating over the entire result.  This can be
incredibly expensive, given that a CoGbkResult can contain an arbitrarily
large number of elements (even more than would fit in memory), and
iterating over them requires paging them in from shuffle.  All of those
elements must be encoded, even if they never would have been otherwise
(since the consumer of the CoGbkResult fuses with the operation producing
it).

Making CoGbkResultCoder.getEncodedElementByteSize simply return 0 provided
a significant performance boost in a few of the jobs I tested, and in fact
fixed some jobs that had been failing with OOMs.  The obvious downside is
that this estimate will now be wrong in the Dataflow UI, but given that
things like GBK are also reported incorrectly, do we particularly care?
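
Concretely, the change amounts to something like the following override on
CoGbkResult.CoGbkResultCoder (a sketch of what I tried locally, not a
finalized patch):

    // Inside CoGbkResult.CoGbkResultCoder: skip size estimation entirely
    // rather than paying for a full encode of a possibly huge result.
    @Override
    protected long getEncodedElementByteSize(CoGbkResult value)
        throws Exception {
      // Report zero instead of iterating (and re-encoding) every element;
      // the size reported for this coder in the UI becomes meaningless.
      return 0;
    }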

Curious what the community thinks here; imo the benefit far outweighs the
negative.


[1] https://lists.apache.org/thread/5y56kbgm3q0m1byzf7186rrkomrcfldm
