I'd like to write a Beam PTransform that applies an *existing* Beam
transform to each set of grouped values, separately, and combines the
result. Is anything like this possible with Beam using the Python SDK?

Here are the closest things I've come up with (sketches of each follow
this list):
1. If each set of *inputs* to my transform fits into memory, I could use
GroupByKey followed by FlatMap.
2. If each set of *outputs* from my transform fits into memory, I could
use CombinePerKey.
3. If I knew the number of groups ahead of time, I could use Partition,
followed by applying my transform to each partition, followed by
Flatten.
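
For concreteness, option (1) might look roughly like this, where
keyed_values is a PCollection of (key, value) pairs and
process_in_memory is a hypothetical stand-in for running my logic over
one fully materialized group:

    import apache_beam as beam

    # Only viable if each group's inputs fit into memory: GroupByKey
    # hands FlatMap the complete set of values for each key at once.
    result = (
        keyed_values
        | beam.GroupByKey()
        | beam.FlatMap(lambda kv: process_in_memory(list(kv[1]))))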
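
Option (2) would instead wrap the logic in a CombineFn (MyCombineFn and
my_shuffle_logic are hypothetical stand-ins; same imports and
keyed_values as above). The catch is that the accumulator, i.e., each
group's output, must fit in memory:

    class MyCombineFn(beam.CombineFn):
        def create_accumulator(self):
            return []

        def add_input(self, acc, value):
            acc.append(value)  # the accumulator holds the whole group
            return acc

        def merge_accumulators(self, accs):
            return [v for acc in accs for v in acc]

        def extract_output(self, acc):
            return my_shuffle_logic(acc)  # per-group output in memory

    result = keyed_values | beam.CombinePerKey(MyCombineFn())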
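
And option (3), assuming the set of keys is known when the pipeline is
constructed (GROUP_KEYS is hypothetical, and MyTransform stands in for
my existing transform; re-keying the outputs is omitted for brevity):

    GROUP_KEYS = ['group_a', 'group_b']  # static list of the ~20 keys

    partitions = keyed_values | beam.Partition(
        lambda kv, _: GROUP_KEYS.index(kv[0]), len(GROUP_KEYS))

    # Apply the existing transform once per partition, then recombine.
    result = ([
        part
        | f'Values{i}' >> beam.Map(lambda kv: kv[1])
        | f'MyTransform{i}' >> MyTransform()
        for i, part in enumerate(partitions)
    ] | beam.Flatten())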

In my scenario, none of these conditions holds. For example, I
currently have ~20 groups of values, with each group holding ~1 TB of
data. My custom transform simply shuffles this data around, so each set
of outputs is also ~1 TB in size.

In my particular case, it seems my options are either to relax these
constraints or to manually convert each step of my existing transform
to apply per key. This conversion process is tedious but
straightforward: the GroupByKey and ParDo steps that my transform is
built out of just need to deal with an expanded key, as sketched below.
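
To illustrate, suppose my transform keys its input internally and then
groups (make_key and ProcessFn are hypothetical). The per-key version
just threads the group key through as part of an expanded key:

    # Before: operates on bare values, grouping on an internal key.
    #   values | beam.Map(lambda v: (make_key(v), v))
    #          | beam.GroupByKey()
    #          | beam.ParDo(ProcessFn())

    # After: operates on (group_key, value) pairs; the internal
    # GroupByKey now groups on the (group_key, internal_key) pair.
    per_key = (
        keyed_values
        | beam.Map(lambda kv: ((kv[0], make_key(kv[1])), kv[1]))
        | beam.GroupByKey()
        | beam.ParDo(ProcessFn()))  # adjusted to ignore the group key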

I wonder, could this be something built into Beam itself, e.g., as
TransformPerKey? The PTransforms that result from combining other Beam
transforms (e.g., _ChainPTransform in Python) are private, so this
seems like something that would need to exist in Beam itself, if it
could exist at all.
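
Usage might look something like this (entirely hypothetical; no such
transform exists in Beam today):

    # Apply an existing PTransform to each key's values independently,
    # without materializing any group's inputs or outputs in memory.
    result = keyed_values | beam.TransformPerKey(MyTransform())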

Cheers,
Stephan
