On Fri, Jul 19, 2019 at 5:16 PM Neville Li <neville....@gmail.com> wrote:
>
> Forking this thread to discuss action items regarding the change. We can keep 
> technical discussion in the original thread.
>
> Background: our SMB POC showed promising performance & cost saving 
> improvements and we'd like to adopt it for production soon (by EOY). We want 
> to contribute it to Beam so it's better generalized and maintained. We also 
> want to avoid divergence between our internal version and the PR while it's 
> in progress, specifically any breaking change in the produced SMB data.

All good goals.

> To achieve that I'd like to propose a few action items.
>
> 1. Reach a consensus about bucket and shard strategy, key handling, bucket 
> file and metadata format, etc., anything that affects produced SMB data.
> 2. Revise the existing PR according to #1
> 3. Reduce duplicate file IO logic by reusing FileIO.Sink, Compression, etc., 
> but keep the existing file level abstraction
> 4. (Optional) Merge code into extensions::smb but mark clearly as 
> @Experimental
> 5. Incorporate ideas from the discussion, e.g. ShardingFn, 
> GroupByKeyAndSortValues, FileIO generalization, key URN, etc.
>
> #1-4 give us something usable in the short term, while #1 guarantees that 
> production data produced today are usable when #5 lands on master. #4 also 
> gives early adopters a chance to give feedback.
> Due to the scope of #5, it might take much longer and a couple of big PRs to 
> achieve, which we can keep iterating on.
>
> What are your thoughts on this?

I would like to see some resolution on the FileIO abstractions before
merging into experimental. (We have a FileBasedSink that would mostly
already work, so it's a matter of coming up with an analogous Source
interface.) Specifically, I would not want to merge a set of
per-file-type SMB IOs without either a path forward to these shared
abstractions or a determination that they are not possible or
desirable.
