Hey dev@!

Myself and a few other Spotify data engineers have put together a design
doc for SMB Join support in Beam
<https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit?usp=sharing>,
and
have a working Java implementation we've started to put up for PR ([0
<https://github.com/apache/beam/pull/8823>], [1
<https://github.com/apache/beam/pull/8824>], [2
<https://github.com/apache/beam/pull/8486>]). There's more detailed
information in the document, but the tl;dr is that SMB is a strategy to
optimize joins for file-based sources by modifying the initial write
operation to write records in sorted buckets based on the desired join key.
This means that subsequent joins of datasets written in this way are only
sequential file reads, no shuffling involved. We've seen some pretty
substantial performance speedups with our implementation and would love to
get it checked in to Beam's Java SDK.

We'd appreciate any suggestions or feedback on our proposal--the design doc
should be public to comment on.

Thanks!
Claire / Neville

Reply via email to