Thanks Robert. Agree with the FileIO point. I'll look into it and see what needs to be done.
Eugene pointed out that we shouldn't build on FileBased{Source,Sink}. So for writes I'll probably build on top of WriteFiles. Read might be a bigger change w.r.t. collocating ordered elements across files within a bucket and TBH I'm not even sure where to start.

I'll file separate PRs for core changes needed for discussion. WDYT?

On Mon, Jul 22, 2019 at 4:20 AM Robert Bradshaw <rober...@google.com> wrote:

> On Fri, Jul 19, 2019 at 5:16 PM Neville Li <neville....@gmail.com> wrote:
> >
> > Forking this thread to discuss action items regarding the change. We can keep technical discussion in the original thread.
> >
> > Background: our SMB POC showed promising performance & cost saving improvements and we'd like to adopt it for production soon (by EOY). We want to contribute it to Beam so it's better generalized and maintained. We also want to avoid divergence between our internal version and the PR while it's in progress, specifically any breaking change in the produced SMB data.
>
> All good goals.
>
> > To achieve that I'd like to propose a few action items.
> >
> > 1. Reach a consensus about bucket and shard strategy, key handling, bucket file and metadata format, etc., anything that affect produced SMB data.
> > 2. Revise the existing PR according to #1
> > 3. Reduce duplicate file IO logic by reusing FileIO.Sink, Compression, etc., but keep the existing file level abstraction
> > 4. (Optional) Merge code into extensions::smb but mark clearly as @experimental
> > 5. Incorporate ideas from the discussion, e.g. ShardingFn, GroupByKeyAndSortValues, FileIO generalization, key URN, etc.
> >
> > #1-4 gives us something usable in the short term, while #1 guarantees that production data produced today are usable when #5 lands on master. #4 also gives early adopters a chance to give feedback.
> > Due to the scope of #5, it might take much longer and a couple of big PRs to achieve, which we can keep iterating on.
> >
> > What are your thoughts on this?
>
> I would like to see some resolution on the FileIO abstractions before merging into experimental. (We have a FileBasedSink that would mostly already work, so it's a matter of coming up with an analogous Source interface.) Specifically I would not want to merge a set of per file type smb IOs without a path forward to this or the determination that it's not possible/desirable.