Ping again. Any chance someone could take a look to get this going? It's just a design doc and a basic metadata/IO implementation. The actual source/sink code isn't part of this yet (it's already done, but saved for future PRs).

On Fri, Jun 21, 2019 at 1:38 PM Ahmet Altay <al...@google.com> wrote:

> Thank you Claire, this looks promising. Explicitly adding a few folks that
> might have feedback: +Ismaël Mejía <ieme...@gmail.com> +Robert Bradshaw
> <rober...@google.com> +Lukasz Cwik <lc...@google.com> +Chamikara Jayalath
> <chamik...@google.com>
>
> On Mon, Jun 17, 2019 at 2:12 PM Claire McGinty <claire.d.mcgi...@gmail.com>
> wrote:
>
>> Hey dev@!
>>
>> Myself and a few other Spotify data engineers have put together a design
>> doc for SMB Join support in Beam
>> <https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit?usp=sharing>,
>> and have a working Java implementation we've started to put up for PR ([0
>> <https://github.com/apache/beam/pull/8823>], [1
>> <https://github.com/apache/beam/pull/8824>], [2
>> <https://github.com/apache/beam/pull/8486>]). There's more detailed
>> information in the document, but the tl;dr is that SMB is a strategy to
>> optimize joins for file-based sources by modifying the initial write
>> operation to write records in sorted buckets based on the desired join key.
>> This means that subsequent joins of datasets written in this way are only
>> sequential file reads, no shuffling involved. We've seen some pretty
>> substantial performance speedups with our implementation and would love to
>> get it checked in to Beam's Java SDK.
>>
>> We'd appreciate any suggestions or feedback on our proposal--the design
>> doc should be public to comment on.
>>
>> Thanks!
>> Claire / Neville
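
For anyone skimming the thread, here is a rough sketch of the idea in the tl;dr above: bucket and sort records by join key at write time, then join matching buckets with a single forward pass. It is not the Java implementation from the PRs. Every name in it (`bucket_id`, `write_sorted_buckets`, `merge_join_bucket`, `smb_join`, `NUM_BUCKETS`) is hypothetical, in-memory lists stand in for bucket files, and CRC32 is just an assumed stable hash for illustration.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical; both datasets must use the same bucket count


def bucket_id(key, num_buckets=NUM_BUCKETS):
    # Assumed hash: any stable hash shared by both writers would do.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_sorted_buckets(records, num_buckets=NUM_BUCKETS):
    """Group (key, value) records into buckets, each sorted by key.

    Each inner list stands in for one bucket file on disk.
    """
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])
    return buckets


def merge_join_bucket(left, right):
    """Inner-join two key-sorted lists by walking both forward once."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


def smb_join(left_buckets, right_buckets):
    """Join bucket N of one dataset with bucket N of the other; no shuffle."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("datasets must be written with the same bucket count")
    result = []
    for lb, rb in zip(left_buckets, right_buckets):
        result.extend(merge_join_bucket(lb, rb))
    return result


users = [("u2", "Bob"), ("u1", "Ann"), ("u3", "Cy")]
plays = [("u1", "songA"), ("u3", "songB"), ("u1", "songC"), ("u4", "songD")]
joined = smb_join(write_sorted_buckets(users), write_sorted_buckets(plays))
```

Because equal keys always hash to the same bucket index and each bucket is already sorted, the join only ever compares bucket N with bucket N and never reorders data at read time. That is the property the proposal relies on.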