Thank you Claire, this looks promising. Explicitly adding a few folks that might have feedback: +Ismaël Mejía <[email protected]> +Robert Bradshaw <[email protected]> +Lukasz Cwik <[email protected]> +Chamikara Jayalath <[email protected]>
On Mon, Jun 17, 2019 at 2:12 PM Claire McGinty <[email protected]> wrote: > Hey dev@! > > Myself and a few other Spotify data engineers have put together a design > doc for SMB Join support in Beam > <https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit?usp=sharing>, > and > have a working Java implementation we've started to put up for PR ([0 > <https://github.com/apache/beam/pull/8823>], [1 > <https://github.com/apache/beam/pull/8824>], [2 > <https://github.com/apache/beam/pull/8486>]). There's more detailed > information in the document, but the tl;dr is that SMB is a strategy to > optimize joins for file-based sources by modifying the initial write > operation to write records in sorted buckets based on the desired join key. > This means that subsequent joins of datasets written in this way are only > sequential file reads, no shuffling involved. We've seen some pretty > substantial performance speedups with our implementation and would love to > get it checked in to Beam's Java SDK. > > We'd appreciate any suggestions or feedback on our proposal--the design > doc should be public to comment on. > > Thanks! > Claire / Neville >
