Thank you Claire, this looks promising. Explicitly adding a few folks that might have feedback: +Ismaël Mejía <ieme...@gmail.com> +Robert Bradshaw <rober...@google.com> +Lukasz Cwik <lc...@google.com> +Chamikara Jayalath <chamik...@google.com>
On Mon, Jun 17, 2019 at 2:12 PM Claire McGinty <claire.d.mcgi...@gmail.com> wrote: > Hey dev@! > > Myself and a few other Spotify data engineers have put together a design > doc for SMB Join support in Beam > <https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit?usp=sharing>, > and > have a working Java implementation we've started to put up for PR ([0 > <https://github.com/apache/beam/pull/8823>], [1 > <https://github.com/apache/beam/pull/8824>], [2 > <https://github.com/apache/beam/pull/8486>]). There's more detailed > information in the document, but the tl;dr is that SMB is a strategy to > optimize joins for file-based sources by modifying the initial write > operation to write records in sorted buckets based on the desired join key. > This means that subsequent joins of datasets written in this way are only > sequential file reads, no shuffling involved. We've seen some pretty > substantial performance speedups with our implementation and would love to > get it checked in to Beam's Java SDK. > > We'd appreciate any suggestions or feedback on our proposal--the design > doc should be public to comment on. > > Thanks! > Claire / Neville >