Re: Sort Merge Bucket - Action Items

2019-08-12 Thread Claire McGinty
Hi! Wanted to bump this thread with some updated PRs that reflect these discussions (updated IOs that parameterize FileIO#Sink, and re-use ReadableFile). The base pull requests are: - https://github.com/apache/beam/pull/8823 (BucketMetadata implementation) -

Re: Sort Merge Bucket - Action Items

2019-07-26 Thread Kenneth Knowles
There is still considerable value in knowing data sources statically so you can do things like fetch sizes and other metadata and adjust pipeline shape. I would not expect to delete these, but to implement them on top of SDF while still giving them a clear URN and payload so runners can know that

Re: Sort Merge Bucket - Action Items

2019-07-26 Thread Robert Bradshaw
On Thu, Jul 25, 2019 at 11:09 PM Eugene Kirpichov wrote: > > Hi Gleb, > > Regarding the future of io.Read: ideally things would go as follows > - All runners support SDF at feature parity with Read (mostly this is just > the Dataflow runner's liquid sharding and size estimation for bounded >

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Eugene Kirpichov
Hi Gleb, Regarding the future of io.Read: ideally things would go as follows - All runners support SDF at feature parity with Read (mostly this is just the Dataflow runner's liquid sharding and size estimation for bounded sources, and backlog for unbounded sources, but I recall that a couple of

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Claire McGinty
As far as I/O code re-use, the consensus seems to be to make the SMB module as composable as possible using existing Beam components, ideally as-is or with very basic tweaks. To be clear, what I care about is that WriteFiles(X) and > WriteSmbFiles(X) can share the same X, for X in {Avro, Parquet,

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Gleb Kanterov
What is the long-term plan for org.apache.beam.sdk.io.Read? Is it going away in favor of SDF, or we are always going to have both? I was looking into AvroIO.read and AvroIO.readAll, both of them use AvroSource. AvroIO.readAll is using SDF, and it's implemented with ReadAllViaFileBasedSource that

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Robert Bradshaw
On Thu, Jul 25, 2019 at 12:35 AM Kenneth Knowles wrote: > > From the peanut gallery, keeping a separate implementation for SMB seems > fine. Dependencies are serious liabilities for both upstream and downstream. > It seems like the reuse angle is generating extra work, and potentially > making

Re: Sort Merge Bucket - Action Items

2019-07-24 Thread Kenneth Knowles
>From the peanut gallery, keeping a separate implementation for SMB seems fine. Dependencies are serious liabilities for both upstream and downstream. It seems like the reuse angle is generating extra work, and potentially making already-complex implementations more complex, instead of helping

Re: Sort Merge Bucket - Action Items

2019-07-24 Thread Neville Li
I spoke too soon. Turns out for unsharded writes, numShards can't be determined until the last finalize transform, which is again different from the current SMB proposal (static number of buckets & shards). I'll end up with more code specialized for SMB in order to generalize existing sink code,

Re: Sort Merge Bucket - Action Items

2019-07-23 Thread Neville Li
So I spent one afternoon trying some ideas for reusing the last few transforms WriteFiles. WriteShardsIntoTempFilesFn extends DoFn*, Iterable>, *FileResult*> => GatherResults extends PTransform, PCollection>> => FinalizeTempFileBundles extends PTransform*>>, WriteFilesResult> I replaced

Re: Sort Merge Bucket - Action Items

2019-07-23 Thread Chamikara Jayalath
On Mon, Jul 22, 2019 at 1:41 PM Robert Bradshaw wrote: > On Mon, Jul 22, 2019 at 7:39 PM Eugene Kirpichov > wrote: > > > > On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw > wrote: > >> > >> On Mon, Jul 22, 2019 at 4:04 PM Neville Li > wrote: > >> > > >> > Thanks Robert. Agree with the FileIO

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Robert Bradshaw
On Mon, Jul 22, 2019 at 7:39 PM Eugene Kirpichov wrote: > > On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw wrote: >> >> On Mon, Jul 22, 2019 at 4:04 PM Neville Li wrote: >> > >> > Thanks Robert. Agree with the FileIO point. I'll look into it and see what >> > needs to be done. >> > >> >

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Eugene Kirpichov
On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw wrote: > On Mon, Jul 22, 2019 at 4:04 PM Neville Li wrote: > > > > Thanks Robert. Agree with the FileIO point. I'll look into it and see > what needs to be done. > > > > Eugene pointed out that we shouldn't build on FileBased{Source,Sink}. So >

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Robert Bradshaw
On Mon, Jul 22, 2019 at 4:04 PM Neville Li wrote: > > Thanks Robert. Agree with the FileIO point. I'll look into it and see what > needs to be done. > > Eugene pointed out that we shouldn't build on FileBased{Source,Sink}. So for > writes I'll probably build on top of WriteFiles. Meaning it

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Neville Li
Thanks Robert. Agree with the FileIO point. I'll look into it and see what needs to be done. Eugene pointed out that we shouldn't build on FileBased{Source,Sink}. So for writes I'll probably build on top of WriteFiles. Read might be a bigger change w.r.t. collocating ordered elements across files

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Robert Bradshaw
On Fri, Jul 19, 2019 at 5:16 PM Neville Li wrote: > > Forking this thread to discuss action items regarding the change. We can keep > technical discussion in the original thread. > > Background: our SMB POC showed promising performance & cost saving > improvements and we'd like to adopt it for

Sort Merge Bucket - Action Items

2019-07-19 Thread Neville Li
Forking this thread to discuss action items regarding the change. We can keep technical discussion in the original thread. Background: our SMB POC showed promising performance & cost saving improvements and we'd like to adopt it for production soon (by EOY). We want to contribute it to Beam so