Thanks Robert. Agree with the FileIO point. I'll look into it and see what
needs to be done.

Eugene pointed out that we shouldn't build on FileBased{Source,Sink}. So
for writes I'll probably build on top of WriteFiles. Read might be a bigger
change w.r.t. collocating ordered elements across files within a bucket and
TBH I'm not even sure where to start.

I'll file separate PRs for core changes needed for discussion. WDYT?

On Mon, Jul 22, 2019 at 4:20 AM Robert Bradshaw <rober...@google.com> wrote:

> On Fri, Jul 19, 2019 at 5:16 PM Neville Li <neville....@gmail.com> wrote:
> >
> > Forking this thread to discuss action items regarding the change. We can
> > keep technical discussion in the original thread.
> >
> > Background: our SMB POC showed promising performance & cost saving
> > improvements and we'd like to adopt it for production soon (by EOY). We
> > want to contribute it to Beam so it's better generalized and maintained. We
> > also want to avoid divergence between our internal version and the PR while
> > it's in progress, specifically any breaking change in the produced SMB data.
>
> All good goals.
>
> > To achieve that I'd like to propose a few action items.
> >
> > 1. Reach a consensus about bucket and shard strategy, key handling,
> > bucket file and metadata format, etc., anything that affects produced SMB
> > data.
> > 2. Revise the existing PR according to #1
> > 3. Reduce duplicate file IO logic by reusing FileIO.Sink, Compression,
> > etc., but keep the existing file level abstraction
> > 4. (Optional) Merge code into extensions::smb but mark clearly as
> > @experimental
> > 5. Incorporate ideas from the discussion, e.g. ShardingFn,
> > GroupByKeyAndSortValues, FileIO generalization, key URN, etc.
> >
> > #1-4 gives us something usable in the short term, while #1 guarantees
> > that production data produced today are usable when #5 lands on master. #4
> > also gives early adopters a chance to give feedback.
> > Due to the scope of #5, it might take much longer and a couple of big
> > PRs to achieve, which we can keep iterating on.
> >
> > What are your thoughts on this?
>
> I would like to see some resolution on the FileIO abstractions before
> merging into experimental. (We have a FileBasedSink that would mostly
> already work, so it's a matter of coming up with an analogous Source
> interface.) Specifically I would not want to merge a set of per file
> type smb IOs without a path forward to this or the determination that
> it's not possible/desirable.
>
