On Fri, Jul 19, 2019 at 5:16 PM Neville Li <neville....@gmail.com> wrote:
>
> Forking this thread to discuss action items regarding the change. We can keep
> technical discussion in the original thread.
>
> Background: our SMB POC showed promising performance & cost saving
> improvements and we'd like to adopt it for production soon (by EOY). We want
> to contribute it to Beam so it's better generalized and maintained. We also
> want to avoid divergence between our internal version and the PR while it's
> in progress, specifically any breaking change in the produced SMB data.

All good goals.

> To achieve that I'd like to propose a few action items.
>
> 1. Reach a consensus about bucket and shard strategy, key handling, bucket
> file and metadata format, etc., anything that affects produced SMB data.
> 2. Revise the existing PR according to #1
> 3. Reduce duplicate file IO logic by reusing FileIO.Sink, Compression, etc.,
> but keep the existing file level abstraction
> 4. (Optional) Merge code into extensions::smb but mark clearly as
> @experimental
> 5. Incorporate ideas from the discussion, e.g. ShardingFn,
> GroupByKeyAndSortValues, FileIO generalization, key URN, etc.
>
> #1-4 gives us something usable in the short term, while #1 guarantees that
> production data produced today are usable when #5 lands on master. #4 also
> gives early adopters a chance to give feedback.
> Due to the scope of #5, it might take much longer and a couple of big PRs to
> achieve, which we can keep iterating on.
>
> What are your thoughts on this?

I would like to see some resolution on the FileIO abstractions before
merging into experimental. (We have a FileBasedSink that would mostly
already work, so it's a matter of coming up with an analogous Source
interface.) Specifically, I would not want to merge a set of per-file-type
SMB IOs without either a path forward to a shared abstraction or a
determination that one is not possible or desirable.