Hi all,

Was there any progress on this recently? I am particularly interested in
using value-dependent destinations in BigtableIO (writing to a specific
table depending on the value) and AvroIO (writing to specific GCS buckets
depending on the value).

Thanks,
Josh

On Fri, Jun 9, 2017 at 5:35 PM, Reuven Lax <[email protected]> wrote:

> I'm putting together a proof-of-concept PR for option 1 to see how it
> looks.
>
> On Thu, Jun 8, 2017 at 4:07 PM, Reuven Lax <[email protected]> wrote:
>
> > After looking at everyone's comments, I think option 1 is the better
> > approach - map destinations to a FilenamePolicy. It is a good parallel to
> > what we do in BigQueryIO (the main difference is that we're mapping to a
> > sharded filename, instead of a single destination like in BigQueryIO).
> >
> > The main limitation is that numShards cannot be dynamic per destination.
> I
> > think that's fine for two reasons:
> >
> > 1. We generally discourage people from statically setting numShards, as
> > often runner-determined sharding is better.
> > 2. In a case where users know that certain types of output files need a
> > different number of shards, they can always partition. e.g. partition
> into
> > a 10-shard and a 100-shard sink, with each sink writing dynamic files.
> >
> > Eugene also brought up destination directory, but that part of the
> > FilenamePolicy interface is more a hint than anything else.
> > DestinationDirectory is realistically just the base directory for the
> temp
> > files, and the FilenamePolicy is free to ignore it.
> >
> > Reuven
> >
> > On Wed, May 24, 2017 at 1:54 PM, Eugene Kirpichov <
> > [email protected]> wrote:
> >
> >> Hmm, on one hand this looks syntactically very appealing, on the other
> >> hand, it's icky to have a function return a PTransform at runtime, only
> to
> >> have some information be immediately extracted from that transform.
> >> Moreover, not all TextIO.Write transforms will be legal to return - e.g.
> >> most likely you're not allowed to return a transform that itself uses
> >> dynamic destinations.
> >>
> >> We should think more about how to decompose this problem.
> >> I think there are 2 natural elements to writing files:
> >> 1) where to put the files (let's call this file location)
> >> 2) how to write to a single file (let's call this file format. In case
> of
> >> Avro, this may theoretically include e.g. schema to be embedded in the
> >> file).
> >> There should be represented by different interfaces/classes in the API.
> >>
> >> Then:
> >> - Writing a set of elements to a single file location using a single
> file
> >> format = "write operation"
> >> - WriteFiles is able to route different elements to different write
> >> operations, with potentially different both locations and formats. I.e.
> >> it's configured by something like BQ's DynamicDestinations
> >> - TextIO and AvroIO are thin wrappers over WriteFiles
> >> - AvroIO in the future may be extended to support different schemas for
> >> different files - then it would be even more like BigQuery: it'd take
> also
> >> a SerializableFunction<T, GenericRecord> and a
> >> SerializableFunction<DestinationT, Schema>. That means that perhaps it
> >> may
> >> provide its own DynamicDestinations-like API to its users, more specific
> >> than the one exposed by low-level WriteFiles.
> >>
> >> This is pretty vague, but I think "AvroIO with dynamic schema and with
> >> (type of input PCollection = T) != (type being written = GenericRecord)"
> >> is
> >> a good target to guide search for the perfect API. WDYT?
> >>
> >> On Wed, May 24, 2017 at 11:24 AM Reuven Lax <[email protected]>
> >> wrote:
> >>
> >> > Did you see that I modified the second proposal so that users can map
> >> > DestinationT to the actual PTransform (i.e. DestinationT->TextIO or
> >> > DestinationT->AvroIO). This means that users do not have to deal with
> >> > FileBasedSink or even know it exists.
> >> >
> >> > I prefer the second approach for two reason:
> >> >
> >> > 1. It allows customizing some useful things that the FilenamePolicy
> does
> >> > not. e.g. it's very reasonable to want to customize the output
> directory
> >> > and have a different number output shards for each directory. If the
> >> > function returns a TextIO or AvroIO they can do that. If there's
> simply
> >> a
> >> > mapping to a FilenamePolicy, the can't do that.
> >> >
> >> > 2. The majority of users don't need to deal with DefaultFilenamePolicy
> >> > today. Allowing them to use the TextIO etc. builders for this will be
> >> > more-familiar than the DefaultFilenamePolicy.Config option suggested.
> >> >
> >> > On Wed, May 24, 2017 at 10:59 AM, Kenneth Knowles
> >> <[email protected]>
> >> > wrote:
> >> >
> >> > > I commented a little in the doc I want to reply on list because this
> >> is a
> >> > > really great feature.
> >> > >
> >> > > The two alternatives, as I understand them, both include mapping
> your
> >> > > elements to an intermediate DestinationT that you can group by
> before
> >> > > writing. Then the big picture decision is whether to map each
> >> > DestinationT
> >> > > to a different FilenamePolicy (which may need to be made more
> >> powerful)
> >> > or
> >> > > map each DestinationT to a different FileBasedSink.
> >> > >
> >> > > I think both are reasonable, modulo pitfalls that I'm probably
> >> glossing
> >> > > over. I favor the FilenamePolicy version a bit, because it is
> focused
> >> > just
> >> > > on the file names, whereas the FileBasedSink version seems a bit
> >> > > overpowered for the use case. The other consideration is that
> >> > > FilenamePolicy is intended for user consumption, while FileBasedSink
> >> is
> >> > not
> >> > > so much.
> >> > >
> >> > > Kenn
> >> > >
> >> > > On Thu, May 18, 2017 at 10:31 PM, Reuven Lax
> <[email protected]
> >> >
> >> > > wrote:
> >> > >
> >> > > > While Beam now supports file-based sinks that can depend on the
> >> current
> >> > > > window, I've seen interest in value-dependent sinks as well (and
> >> > there's
> >> > > a
> >> > > > long-standing JIRA for this). I wrote up a short API proposal for
> >> this
> >> > > for
> >> > > > discussion on the list.
> >> > > >
> >> > > > https://docs.google.com/document/d/1Bd9mJO1YC8vOoFObJFupVURBMCl7j
> >> > > > Wt6hOgw6ClwxE4/edit?usp=sharing
> >> > > >
> >> > > > Reuven
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Reply via email to