I'm wondering if it would make sense to have a built-in Beam transformation
for calculating the Cartesian product of PCollections.

Just this past week, I've encountered two separate cases where calculating
a Cartesian product was a bottleneck. The in-memory option of using
something like Python's itertools.product() is convenient, but it only
scales to a single node.

Unfortunately, implementing a scalable Cartesian product seems to be
somewhat non-trivial. I found two versions of this question on
StackOverflow, but neither contains a code solution:
https://stackoverflow.com/questions/35008721/how-to-get-the-cartesian-product-of-two-pcollections
https://stackoverflow.com/questions/41050477/how-to-do-a-cartesian-product-of-two-pcollections-in-dataflow/

There's a fair amount of nuance in an efficient and scalable
implementation. My team has an internal implementation of a
CartesianProduct transform, based on using hashing to split one
PCollection into a finite number of groups and then CoGroupByKey. On the
other hand, if any of the input PCollections are small, using side inputs
would probably be the way to go, since that avoids the need for a shuffle
entirely.
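
For illustration, roughly what the two ideas look like in the Python SDK
(a simplified sketch, not our internal code; the names and shard count
are made up and would need tuning):

    import apache_beam as beam

    NUM_SHARDS = 100  # arbitrary; pick based on input sizes

    def cartesian_product(pcoll_a, pcoll_b):
        # Assign each element of A to one of NUM_SHARDS groups. Any roughly
        # uniform assignment works, since each element of A only needs to
        # land in a single group.
        keyed_a = pcoll_a | 'KeyA' >> beam.Map(
            lambda x: (hash(x) % NUM_SHARDS, x))
        # Replicate every element of B into all NUM_SHARDS groups.
        keyed_b = pcoll_b | 'FanOutB' >> beam.FlatMap(
            lambda y: ((k, y) for k in range(NUM_SHARDS)))

        # Co-group on the shard key and emit all pairs within each group.
        def emit_pairs(kv):
            _, grouped = kv
            for x in grouped['a']:
                for y in grouped['b']:
                    yield (x, y)

        return ({'a': keyed_a, 'b': keyed_b}
                | 'CoGroup' >> beam.CoGroupByKey()
                | 'EmitPairs' >> beam.FlatMap(emit_pairs))

    def cartesian_product_with_small(large, small):
        # If `small` is small enough to be a side input, pair it with every
        # element of `large` directly and skip the shuffle.
        return large | 'Cross' >> beam.FlatMap(
            lambda x, ys: ((x, y) for y in ys),
            ys=beam.pvalue.AsIter(small))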

Any thoughts?

Cheers,
Stephan
