Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

Ahmet Altay Fri, 21 Jun 2019 10:38:59 -0700

Thank you Claire, this looks promising. Explicitly adding a few folks that
might have feedback: +Ismaël Mejía <[email protected]> +Robert Bradshaw
<[email protected]> +Lukasz Cwik <[email protected]> +Chamikara Jayalath
<[email protected]>


On Mon, Jun 17, 2019 at 2:12 PM Claire McGinty <[email protected]>
wrote:

> Hey dev@!
>
> A few other Spotify data engineers and I have put together a design
> doc for Sort Merge Bucket (SMB) join support in Beam
> <https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit?usp=sharing>,
> and have a working Java implementation that we've started to put up for PR ([0
> <https://github.com/apache/beam/pull/8823>], [1
> <https://github.com/apache/beam/pull/8824>], [2
> <https://github.com/apache/beam/pull/8486>]). There's more detailed
> information in the document, but the tl;dr is that SMB is a strategy to
> optimize joins for file-based sources by modifying the initial write
> operation to write records in sorted buckets based on the desired join key.
> This means that subsequent joins of datasets written in this way require only
> sequential file reads, with no shuffle involved. We've seen some pretty
> substantial performance speedups with our implementation and would love to
> get it checked in to Beam's Java SDK.
>
> We'd appreciate any suggestions or feedback on our proposal--the design
> doc should be public to comment on.
>
> Thanks!
> Claire / Neville
>
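
For anyone skimming the thread, the write-then-join idea in the tl;dr above can be sketched roughly as follows. This is a minimal in-memory Python illustration and not the Java implementation from the PRs: lists stand in for bucket files, and the names `bucket_id`, `write_sorted_buckets`, `merge_join_bucket`, `smb_join`, the bucket count, and the CRC32 hash are all assumptions made for the example rather than anything taken from the design doc.

```python
import zlib
from itertools import groupby
from operator import itemgetter

NUM_BUCKETS = 4  # hypothetical value; both datasets use the same count in this sketch


def bucket_id(key, num_buckets=NUM_BUCKETS):
    # Deterministic hash so every dataset routes a given key to the same bucket.
    # CRC32 is a placeholder choice, not the hash used by the proposal.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_sorted_buckets(records, num_buckets=NUM_BUCKETS):
    """Write phase: partition (key, value) records by key hash, then sort each bucket by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))  # stable sort, ordered by join key
    return buckets


def merge_join_bucket(left, right):
    """Inner-join two key-sorted buckets in one sequential pass over each."""
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lg = next(left_groups, None)
    rg = next(right_groups, None)
    while lg is not None and rg is not None:
        if lg[0] < rg[0]:
            lg = next(left_groups, None)   # left key has no match, advance left
        elif lg[0] > rg[0]:
            rg = next(right_groups, None)  # right key has no match, advance right
        else:
            right_values = [value for _, value in rg[1]]
            for _, left_value in lg[1]:
                for right_value in right_values:
                    yield (lg[0], left_value, right_value)
            lg = next(left_groups, None)
            rg = next(right_groups, None)


def smb_join(left_buckets, right_buckets):
    """Join phase: pair up buckets by index; no re-partitioning (shuffle) is needed."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("this sketch assumes equal bucket counts on both sides")
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        joined.extend(merge_join_bucket(left, right))
    return joined


# Made-up sample data, for illustration only.
users = [("u2", "Bo"), ("u1", "Ann"), ("u3", "Cy")]
plays = [("u1", "song-a"), ("u3", "song-b"), ("u1", "song-c"), ("u4", "song-d")]

user_buckets = write_sorted_buckets(users)
play_buckets = write_sorted_buckets(plays)
joined = smb_join(user_buckets, play_buckets)
```

The point of the sketch is that the cost of sorting and partitioning is paid once at write time; each later join only walks matching bucket pairs in key order.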
