Hey dev@! Myself and a few other Spotify data engineers have put together a design doc for SMB Join support in Beam <https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit?usp=sharing>, and have a working Java implementation we've started to put up for PR ([0 <https://github.com/apache/beam/pull/8823>], [1 <https://github.com/apache/beam/pull/8824>], [2 <https://github.com/apache/beam/pull/8486>]). There's more detailed information in the document, but the tl;dr is that SMB is a strategy to optimize joins for file-based sources by modifying the initial write operation to write records in sorted buckets based on the desired join key. This means that subsequent joins of datasets written in this way are only sequential file reads, no shuffling involved. We've seen some pretty substantial performance speedups with our implementation and would love to get it checked in to Beam's Java SDK.
We'd appreciate any suggestions or feedback on our proposal--the design doc should be public to comment on. Thanks! Claire / Neville