Thanks, I added a few comments.

If I understood correctly, you basically assign keyed elements to
different buckets, which are written to unique files, and merge files for
the same key while reading?
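To make sure I'm reading it right: something like the following k-way merge is what I have in mind for the read side. This is an illustrative standalone sketch only (class and method names are mine, lists stand in for bucket files; it is not the code from the PRs):

```java
import java.util.*;

// Hedged sketch, not Beam's actual SMB code: a k-way merge over several
// pre-sorted "bucket files", modeled here as in-memory lists. Because each
// bucket is already sorted by the join key, records with equal keys come out
// adjacent, so joining them is a sequential scan with no shuffle.
public class SortedBucketMergeSketch {

  public static List<String> mergeSorted(List<List<String>> buckets) {
    int[] pos = new int[buckets.size()];
    // Heap of bucket indices, ordered by each bucket's current head record.
    PriorityQueue<Integer> heap = new PriorityQueue<>(
        Comparator.comparing(b -> buckets.get(b).get(pos[b])));
    for (int b = 0; b < buckets.size(); b++) {
      if (!buckets.get(b).isEmpty()) heap.add(b);
    }
    List<String> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      int b = heap.poll();                 // bucket with the smallest head key
      merged.add(buckets.get(b).get(pos[b]++));
      if (pos[b] < buckets.get(b).size()) heap.add(b); // re-enter with new head
    }
    return merged;
  }

  public static void main(String[] args) {
    // Two sorted "bucket files" for the same bucket id, one per input dataset.
    List<String> merged = mergeSorted(Arrays.asList(
        Arrays.asList("alice", "carol", "eve"),
        Arrays.asList("bob", "carol", "dan")));
    System.out.println(merged); // [alice, bob, carol, carol, dan, eve]
  }
}
```

If that's roughly the mechanism, then the read side is O(total records * log(number of buckets)) with purely sequential file access.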

Some of my concerns are:

(1) Seems like you rely on in-memory sorting of buckets. Will this end
up limiting the size of the PCollection you can process?
(2) Seems like you rely on Reshuffle.viaRandomKey(), which is actually
implemented using a shuffle (the very operation this proposal tries to
replace).
(3) I think at least some shuffle implementations work in a similar way
(writing to files and merging), so I'm wondering whether the performance
benefits you see apply only to a very specific case and may limit the
functionality in other ways.

Thanks,
Cham


On Thu, Jun 27, 2019 at 8:12 AM Neville Li <neville....@gmail.com> wrote:

> Ping again. Any chance someone takes a look to get this thing going? It's
> just a design doc and basic metadata/IO impl. We're not talking about
> actual source/sink code yet (already done but saved for future PRs).
>
> On Fri, Jun 21, 2019 at 1:38 PM Ahmet Altay <al...@google.com> wrote:
>
>> Thank you Claire, this looks promising. Explicitly adding a few folks
>> that might have feedback: +Ismaël Mejía <ieme...@gmail.com> +Robert
>> Bradshaw <rober...@google.com> +Lukasz Cwik <lc...@google.com> +Chamikara
>> Jayalath <chamik...@google.com>
>>
>> On Mon, Jun 17, 2019 at 2:12 PM Claire McGinty <
>> claire.d.mcgi...@gmail.com> wrote:
>>
>>> Hey dev@!
>>>
>>> A few other Spotify data engineers and I have put together a design
>>> doc for SMB Join support in Beam
>>> <https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit?usp=sharing>,
>>> and have a working Java implementation we've started to put up for PR ([0
>>> <https://github.com/apache/beam/pull/8823>], [1
>>> <https://github.com/apache/beam/pull/8824>], [2
>>> <https://github.com/apache/beam/pull/8486>]). There's more detailed
>>> information in the document, but the tl;dr is that SMB is a strategy to
>>> optimize joins for file-based sources by modifying the initial write
>>> operation to write records in sorted buckets based on the desired join key.
>>> This means that subsequent joins of datasets written in this way are only
>>> sequential file reads, no shuffling involved. We've seen some pretty
>>> substantial performance speedups with our implementation and would love to
>>> get it checked in to Beam's Java SDK.
>>>
>>> We'd appreciate any suggestions or feedback on our proposal--the design
>>> doc should be public to comment on.
>>>
>>> Thanks!
>>> Claire / Neville
>>>
>>