Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

Neville Li Thu, 27 Jun 2019 08:13:15 -0700

Ping again. Any chance someone could take a look to get this going? It's
just a design doc and a basic metadata/IO implementation. We're not talking
about the actual source/sink code yet (already done, but saved for future PRs).


On Fri, Jun 21, 2019 at 1:38 PM Ahmet Altay <al...@google.com> wrote:

> Thank you Claire, this looks promising. Explicitly adding a few folks that
> might have feedback: +Ismaël Mejía <ieme...@gmail.com> +Robert Bradshaw
> <rober...@google.com> +Lukasz Cwik <lc...@google.com> +Chamikara Jayalath
> <chamik...@google.com>
>
> On Mon, Jun 17, 2019 at 2:12 PM Claire McGinty <claire.d.mcgi...@gmail.com>
> wrote:
>
>> Hey dev@!
>>
>> A few other Spotify data engineers and I have put together a design
>> doc for SMB Join support in Beam
>> <https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit?usp=sharing>,
>> and have a working Java implementation we've started to put up for PR ([0
>> <https://github.com/apache/beam/pull/8823>], [1
>> <https://github.com/apache/beam/pull/8824>], [2
>> <https://github.com/apache/beam/pull/8486>]). There's more detailed
>> information in the document, but the tl;dr is that SMB is a strategy to
>> optimize joins for file-based sources by modifying the initial write
>> operation to write records in sorted buckets based on the desired join key.
>> This means that subsequent joins of datasets written in this way need only
>> sequential file reads, with no shuffle involved. We've seen some pretty
>> substantial performance speedups with our implementation and would love to
>> get it checked in to Beam's Java SDK.
>>
>> We'd appreciate any suggestions or feedback on our proposal--the design
>> doc should be public to comment on.
>>
>> Thanks!
>> Claire / Neville
>>
>
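
For readers skimming the thread, here is a rough, language-neutral sketch of the idea in the tl;dr above: write each dataset into sorted buckets keyed on the join key, then join matching buckets with a sequential merge. This is not the Beam API or the implementation in the PRs. In-memory lists stand in for bucket files, and the function names, the CRC32 bucketing, and the inner-join semantics are all assumptions made for illustration.

```python
import zlib
from itertools import groupby


def bucket_id(key, num_buckets):
    # Assumption: a stable hash (CRC32 here) so every writer agrees on bucket placement.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_sorted_buckets(records, key_fn, num_buckets):
    # "Write" step: partition records by hashed key, then sort each bucket by key.
    buckets = [[] for _ in range(num_buckets)]
    for record in records:
        buckets[bucket_id(key_fn(record), num_buckets)].append(record)
    return [sorted(bucket, key=key_fn) for bucket in buckets]


def _grouped(sorted_records, key_fn):
    # Yield (key, [records]) for each run of equal keys in an already sorted bucket.
    for key, group in groupby(sorted_records, key=key_fn):
        yield key, list(group)


def merge_join_bucket(left, right, left_key, right_key):
    # "Read" step: one sequential pass over two sorted buckets (inner join).
    left_groups = _grouped(left, left_key)
    right_groups = _grouped(right, right_key)
    l, r = next(left_groups, None), next(right_groups, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left_groups, None)
        elif l[0] > r[0]:
            r = next(right_groups, None)
        else:
            for a in l[1]:
                for b in r[1]:
                    yield l[0], a, b
            l, r = next(left_groups, None), next(right_groups, None)


def smb_join(left_buckets, right_buckets, left_key, right_key):
    # Bucket i on one side can only match bucket i on the other, so no shuffle is needed.
    # Assumption: both sides were written with the same hash and bucket count.
    for left, right in zip(left_buckets, right_buckets):
        yield from merge_join_bucket(left, right, left_key, right_key)


# Hypothetical example data: (user_id, name) and (user_id, track).
users = [("u2", "Bo"), ("u1", "Al"), ("u3", "Cy")]
plays = [("u1", "songA"), ("u3", "songB"), ("u1", "songC"), ("u4", "songD")]

first = lambda record: record[0]
user_buckets = write_sorted_buckets(users, first, num_buckets=4)
play_buckets = write_sorted_buckets(plays, first, num_buckets=4)
joined = list(smb_join(user_buckets, play_buckets, first, first))
```

The sorting cost is paid once at write time. Each later join touches every bucket once, in order, and buckets can be processed independently of one another.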
