Thank you, Claire, this looks promising. Explicitly adding a few folks who
might have feedback: +Ismaël Mejía <ieme...@gmail.com> +Robert Bradshaw
<rober...@google.com> +Lukasz Cwik <lc...@google.com> +Chamikara Jayalath
<chamik...@google.com>

On Mon, Jun 17, 2019 at 2:12 PM Claire McGinty <claire.d.mcgi...@gmail.com>
wrote:

> Hey dev@!
>
> A few other Spotify data engineers and I have put together a design doc
> for SMB Join support in Beam
> <https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit?usp=sharing>
> and have a working Java implementation that we've started to put up for PR
> ([0] <https://github.com/apache/beam/pull/8823>,
> [1] <https://github.com/apache/beam/pull/8824>,
> [2] <https://github.com/apache/beam/pull/8486>). There's more detailed
> information in the document, but the tl;dr is that SMB (sort-merge-bucket)
> is a strategy to optimize joins for file-based sources by modifying the
> initial write operation to write records in sorted buckets based on the
> desired join key. Subsequent joins of datasets written this way then need
> only sequential file reads, with no shuffling involved. We've seen some pretty
> substantial performance speedups with our implementation and would love to
> get it checked in to Beam's Java SDK.
>
> We'd appreciate any suggestions or feedback on our proposal--the design
> doc should be public to comment on.
>
> Thanks!
> Claire / Neville
>
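
For readers skimming the thread, here is a rough, in-memory sketch of the idea described above: records are written into buckets chosen by a hash of the join key, each bucket is sorted by that key, and two datasets written this way are joined bucket by bucket with a single sequential merge pass. This is only an illustration in plain Python, not the proposed Beam API or the Spotify implementation; `NUM_BUCKETS`, the use of `crc32` as the hash, and all function names are assumptions made for the example.

```python
import heapq
from itertools import groupby
from zlib import crc32

# Hypothetical bucket count; both datasets must agree on it and on the hash.
NUM_BUCKETS = 4


def bucket_of(key):
    """Deterministic bucket id for a string key (crc32 is a stand-in hash)."""
    return crc32(key.encode("utf-8")) % NUM_BUCKETS


def write_sorted_buckets(records):
    """Simulate the write step: partition (key, value) pairs by hash of the
    key, then sort each bucket by key. Real files are replaced by lists."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in records:
        buckets[bucket_of(key)].append((key, value))
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])
    return buckets


def merge_join_bucket(left, right):
    """Inner-join two key-sorted buckets with one sequential pass over each."""
    tagged_left = ((k, 0, v) for k, v in left)
    tagged_right = ((k, 1, v) for k, v in right)
    # heapq.merge lazily interleaves the two already-sorted streams.
    merged = heapq.merge(tagged_left, tagged_right, key=lambda t: (t[0], t[1]))
    for key, group in groupby(merged, key=lambda t: t[0]):
        lefts, rights = [], []
        for _, side, value in group:
            (lefts if side == 0 else rights).append(value)
        for lv in lefts:
            for rv in rights:
                yield key, lv, rv


def smb_join(left_buckets, right_buckets):
    """Join bucket i of one dataset with bucket i of the other; equal keys
    always share a bucket, so no records need to move between buckets."""
    joined = []
    for lb, rb in zip(left_buckets, right_buckets):
        joined.extend(merge_join_bucket(lb, rb))
    return joined


if __name__ == "__main__":
    users = [("u3", "Cy"), ("u1", "Ann"), ("u2", "Bob")]
    plays = [("u2", "song-b"), ("u1", "song-a"), ("u1", "song-c")]
    for row in sorted(smb_join(write_sorted_buckets(users),
                               write_sorted_buckets(plays))):
        print(row)
```

The sort and the hash partitioning are paid once, at write time; every later join over the same key only streams through matching bucket pairs, which is where the "no shuffling" property comes from.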
