Re: Dynamic file-based sinks

Josh Mon, 31 Jul 2017 02:03:08 -0700

That's great news, thanks Reuven! I will try this out soon.

On Sat, Jul 29, 2017 at 2:33 AM, Reuven Lax <re...@google.com.invalid>
wrote:


> The AvroIO PR is now merged, so you can write to different destinations
> based on the value. It's available in head, and will be in Beam 2.2.0.
>
> On Wed, Jul 26, 2017 at 10:00 AM, Reuven Lax <re...@google.com> wrote:
>
> > Yes, there was! TextIO support is already merged into Beam (it missed the
> > 2.1 cutoff, so it will be in Beam 2.2.0). AvroIO support is in
> > https://github.com/apache/beam/pull/3541. This is almost ready to merge
> -
> > still waiting for final review from kennknowles on the Beam translation
> > changes.
> >
> > Nobody is working on BigtableIO yet, however the framework used in
> > BigQueryIO, TextIO, and AvroIO should be easy to generalize to other
> sinks.
> >
> > On Wed, Jul 26, 2017 at 7:41 AM, Josh <jof...@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> Was there any progress on this recently? I am particularly interested in
> >> using value-dependent destinations in BigtableIO (writing to a specific
> >> table depending on the value) and AvroIO (writing to specific GCS
> buckets
> >> depending on the value).
> >>
> >> Thanks,
> >> Josh
> >>
> >> On Fri, Jun 9, 2017 at 5:35 PM, Reuven Lax <re...@google.com.invalid>
> >> wrote:
> >>
> >> > I'm putting together a proof-of-concept PR for option 1 to see how it
> >> > looks.
> >> >
> >> > On Thu, Jun 8, 2017 at 4:07 PM, Reuven Lax <re...@google.com> wrote:
> >> >
> >> > > After looking at everyone's comments, I think option 1 is the better
> >> > > approach - map destinations to a FilenamePolicy. It is a good
> >> parallel to
> >> > > what we do in BigQueryIO (the main difference is that we're mapping
> >> to a
> >> > > sharded filename, instead of a single destination like in
> BigQueryIO).
> >> > >
> >> > > The main limitation is that numShards cannot be dynamic per
> >> destination.
> >> > I
> >> > > think that's fine for two reasons:
> >> > >
> >> > > 1. We generally discourage people from statically setting numShards,
> >> as
> >> > > often runner-determined sharding is better.
> >> > > 2. In a case where users know that certain types of output files
> need
> >> a
> >> > > different number of shards, they can always partition. e.g.
> partition
> >> > into
> >> > > a 10-shard and a 100-shard sink, with each sink writing dynamic
> files.
> >> > >
> >> > > Eugene also brought up destination directory, but that part of the
> >> > > FilenamePolicy interface is more a hint than anything else.
> >> > > DestinationDirectory is realistically just the base directory for
> the
> >> > temp
> >> > > files, and the FilenamePolicy is free to ignore it.
> >> > >
> >> > > Reuven
> >> > >
> >> > > On Wed, May 24, 2017 at 1:54 PM, Eugene Kirpichov <
> >> > > kirpic...@google.com.invalid> wrote:
> >> > >
> >> > >> Hmm, on one hand this looks syntactically very appealing, on the
> >> other
> >> > >> hand, it's icky to have a function return a PTransform at runtime,
> >> only
> >> > to
> >> > >> have some information be immediately extracted from that transform.
> >> > >> Moreover, not all TextIO.Write transforms will be legal to return -
> >> e.g.
> >> > >> most likely you're not allowed to return a transform that itself
> uses
> >> > >> dynamic destinations.
> >> > >>
> >> > >> We should think more about how to decompose this problem.
> >> > >> I think there are 2 natural elements to writing files:
> >> > >> 1) where to put the files (let's call this file location)
> >> > >> 2) how to write to a single file (let's call this file format. In
> >> case
> >> > of
> >> > >> Avro, this may theoretically include e.g. schema to be embedded in
> >> the
> >> > >> file).
> >> > >> There should be represented by different interfaces/classes in the
> >> API.
> >> > >>
> >> > >> Then:
> >> > >> - Writing a set of elements to a single file location using a
> single
> >> > file
> >> > >> format = "write operation"
> >> > >> - WriteFiles is able to route different elements to different write
> >> > >> operations, with potentially different both locations and formats.
> >> I.e.
> >> > >> it's configured by something like BQ's DynamicDestinations
> >> > >> - TextIO and AvroIO are thin wrappers over WriteFiles
> >> > >> - AvroIO in the future may be extended to support different schemas
> >> for
> >> > >> different files - then it would be even more like BigQuery: it'd
> take
> >> > also
> >> > >> a SerializableFunction<T, GenericRecord> and a
> >> > >> SerializableFunction<DestinationT, Schema>. That means that
> perhaps
> >> it
> >> > >> may
> >> > >> provide its own DynamicDestinations-like API to its users, more
> >> specific
> >> > >> than the one exposed by low-level WriteFiles.
> >> > >>
> >> > >> This is pretty vague, but I think "AvroIO with dynamic schema and
> >> with
> >> > >> (type of input PCollection = T) != (type being written =
> >> GenericRecord)"
> >> > >> is
> >> > >> a good target to guide search for the perfect API. WDYT?
> >> > >>
> >> > >> On Wed, May 24, 2017 at 11:24 AM Reuven Lax
> <re...@google.com.invalid
> >> >
> >> > >> wrote:
> >> > >>
> >> > >> > Did you see that I modified the second proposal so that users can
> >> map
> >> > >> > DestinationT to the actual PTransform (i.e. DestinationT->TextIO
> or
> >> > >> > DestinationT->AvroIO). This means that users do not have to deal
> >> with
> >> > >> > FileBasedSink or even know it exists.
> >> > >> >
> >> > >> > I prefer the second approach for two reason:
> >> > >> >
> >> > >> > 1. It allows customizing some useful things that the
> FilenamePolicy
> >> > does
> >> > >> > not. e.g. it's very reasonable to want to customize the output
> >> > directory
> >> > >> > and have a different number output shards for each directory. If
> >> the
> >> > >> > function returns a TextIO or AvroIO they can do that. If there's
> >> > simply
> >> > >> a
> >> > >> > mapping to a FilenamePolicy, the can't do that.
> >> > >> >
> >> > >> > 2. The majority of users don't need to deal with
> >> DefaultFilenamePolicy
> >> > >> > today. Allowing them to use the TextIO etc. builders for this
> will
> >> be
> >> > >> > more-familiar than the DefaultFilenamePolicy.Config option
> >> suggested.
> >> > >> >
> >> > >> > On Wed, May 24, 2017 at 10:59 AM, Kenneth Knowles
> >> > >> <k...@google.com.invalid>
> >> > >> > wrote:
> >> > >> >
> >> > >> > > I commented a little in the doc I want to reply on list because
> >> this
> >> > >> is a
> >> > >> > > really great feature.
> >> > >> > >
> >> > >> > > The two alternatives, as I understand them, both include
> mapping
> >> > your
> >> > >> > > elements to an intermediate DestinationT that you can group by
> >> > before
> >> > >> > > writing. Then the big picture decision is whether to map each
> >> > >> > DestinationT
> >> > >> > > to a different FilenamePolicy (which may need to be made more
> >> > >> powerful)
> >> > >> > or
> >> > >> > > map each DestinationT to a different FileBasedSink.
> >> > >> > >
> >> > >> > > I think both are reasonable, modulo pitfalls that I'm probably
> >> > >> glossing
> >> > >> > > over. I favor the FilenamePolicy version a bit, because it is
> >> > focused
> >> > >> > just
> >> > >> > > on the file names, whereas the FileBasedSink version seems a
> bit
> >> > >> > > overpowered for the use case. The other consideration is that
> >> > >> > > FilenamePolicy is intended for user consumption, while
> >> FileBasedSink
> >> > >> is
> >> > >> > not
> >> > >> > > so much.
> >> > >> > >
> >> > >> > > Kenn
> >> > >> > >
> >> > >> > > On Thu, May 18, 2017 at 10:31 PM, Reuven Lax
> >> > <re...@google.com.invalid
> >> > >> >
> >> > >> > > wrote:
> >> > >> > >
> >> > >> > > > While Beam now supports file-based sinks that can depend on
> the
> >> > >> current
> >> > >> > > > window, I've seen interest in value-dependent sinks as well
> >> (and
> >> > >> > there's
> >> > >> > > a
> >> > >> > > > long-standing JIRA for this). I wrote up a short API proposal
> >> for
> >> > >> this
> >> > >> > > for
> >> > >> > > > discussion on the list.
> >> > >> > > >
> >> > >> > > > https://docs.google.com/document/d/1Bd9mJO1YC8vOoFObJFupVURB
> >> MCl7j
> >> > >> > > > Wt6hOgw6ClwxE4/edit?usp=sharing
> >> > >> > > >
> >> > >> > > > Reuven
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>

Re: Dynamic file-based sinks

Reply via email to