Hi! Wanted to bump this thread with some updated PRs that reflect these
discussions (updated IOs that parameterize FileIO#Sink, and re-use
ReadableFile). The base pull requests are:
- https://github.com/apache/beam/pull/8823 (BucketMetadata implementation)
-
There is still considerable value in knowing data sources statically so you
can do things like fetch sizes and other metadata and adjust pipeline
shape. I would not expect to delete these, but to implement them on top of
SDF while still giving them a clear URN and payload so runners can know
that
On Thu, Jul 25, 2019 at 11:09 PM Eugene Kirpichov wrote:
Hi Gleb,
Regarding the future of io.Read: ideally things would go as follows
- All runners support SDF at feature parity with Read (mostly this is just
the Dataflow runner's liquid sharding and size estimation for bounded
sources, and backlog for unbounded sources, but I recall that a couple of
As far as I/O code re-use goes, the consensus seems to be to make the SMB
module as composable as possible using existing Beam components, ideally
as-is or with very basic tweaks.
To be clear, what I care about is that WriteFiles(X) and
WriteSmbFiles(X) can share the same X, for X in {Avro, Parquet,
What is the long-term plan for org.apache.beam.sdk.io.Read? Is it going
away in favor of SDF, or are we always going to have both?
I was looking into AvroIO.read and AvroIO.readAll; both of them
use AvroSource. AvroIO.readAll is using SDF, and it's implemented with
ReadAllViaFileBasedSource that
On Thu, Jul 25, 2019 at 12:35 AM Kenneth Knowles wrote:
From the peanut gallery, keeping a separate implementation for SMB seems
fine. Dependencies are serious liabilities for both upstream and
downstream. It seems like the reuse angle is generating extra work, and
potentially making already-complex implementations more complex, instead of
helping
I spoke too soon. Turns out for unsharded writes, numShards can't be
determined until the last finalize transform, which is again different from
the current SMB proposal (static number of buckets & shards).
I'll end up with more code specialized for SMB in order to generalize
existing sink code,
So I spent one afternoon trying some ideas for reusing the last few
transforms in WriteFiles.
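WriteShardsIntoTempFilesFn extends DoFn<KV<ShardedKey<Integer>, Iterable<UserT>>, FileResult<DestinationT>>
=> GatherResults<ResultT> extends PTransform<PCollection<ResultT>, PCollection<List<ResultT>>>
=> FinalizeTempFileBundles extends PTransform<PCollection<List<FileResult<DestinationT>>>, WriteFilesResult<DestinationT>>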
I replaced
On Mon, Jul 22, 2019 at 1:41 PM Robert Bradshaw wrote:
On Mon, Jul 22, 2019 at 7:39 PM Eugene Kirpichov wrote:
On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw wrote:
On Mon, Jul 22, 2019 at 4:04 PM Neville Li wrote:
>
> Thanks Robert. Agree with the FileIO point. I'll look into it and see what
> needs to be done.
>
> Eugene pointed out that we shouldn't build on FileBased{Source,Sink}. So for
> writes I'll probably build on top of WriteFiles.
Meaning it
Thanks Robert. Agree with the FileIO point. I'll look into it and see what
needs to be done.
Eugene pointed out that we shouldn't build on FileBased{Source,Sink}. So
for writes I'll probably build on top of WriteFiles. Read might be a bigger
change w.r.t. collocating ordered elements across files
On Fri, Jul 19, 2019 at 5:16 PM Neville Li wrote:
Forking this thread to discuss action items regarding the change. We can
keep technical discussion in the original thread.
Background: our SMB POC showed promising performance & cost saving
improvements and we'd like to adopt it for production soon (by EOY). We
want to contribute it to Beam so