Hi all, Was there any progress on this recently? I am particularly interested in using value-dependent destinations in BigtableIO (writing to a specific table depending on the value) and AvroIO (writing to specific GCS buckets depending on the value).
Thanks, Josh On Fri, Jun 9, 2017 at 5:35 PM, Reuven Lax <[email protected]> wrote: > I'm putting together a proof-of-concept PR for option 1 to see how it > looks. > > On Thu, Jun 8, 2017 at 4:07 PM, Reuven Lax <[email protected]> wrote: > > > After looking at everyone's comments, I think option 1 is the better > > approach - map destinations to a FilenamePolicy. It is a good parallel to > > what we do in BigQueryIO (the main difference is that we're mapping to a > > sharded filename, instead of a single destination like in BigQueryIO). > > > > The main limitation is that numShards cannot be dynamic per destination. > I > > think that's fine for two reasons: > > > > 1. We generally discourage people from statically setting numShards, as > > often runner-determined sharding is better. > > 2. In a case where users know that certain types of output files need a > > different number of shards, they can always partition. e.g. partition > into > > a 10-shard and a 100-shard sink, with each sink writing dynamic files. > > > > Eugene also brought up destination directory, but that part of the > > FilenamePolicy interface is more a hint than anything else. > > DestinationDirectory is realistically just the base directory for the > temp > > files, and the FilenamePolicy is free to ignore it. > > > > Reuven > > > > On Wed, May 24, 2017 at 1:54 PM, Eugene Kirpichov < > > [email protected]> wrote: > > > >> Hmm, on one hand this looks syntactically very appealing, on the other > >> hand, it's icky to have a function return a PTransform at runtime, only > to > >> have some information be immediately extracted from that transform. > >> Moreover, not all TextIO.Write transforms will be legal to return - e.g. > >> most likely you're not allowed to return a transform that itself uses > >> dynamic destinations. > >> > >> We should think more about how to decompose this problem. > >> I think there are 2 natural elements to writing files: > >> 1) where to put the files (let's call this file location) > >> 2) how to write to a single file (let's call this file format. In case > of > >> Avro, this may theoretically include e.g. schema to be embedded in the > >> file). > >> There should be represented by different interfaces/classes in the API. > >> > >> Then: > >> - Writing a set of elements to a single file location using a single > file > >> format = "write operation" > >> - WriteFiles is able to route different elements to different write > >> operations, with potentially different both locations and formats. I.e. > >> it's configured by something like BQ's DynamicDestinations > >> - TextIO and AvroIO are thin wrappers over WriteFiles > >> - AvroIO in the future may be extended to support different schemas for > >> different files - then it would be even more like BigQuery: it'd take > also > >> a SerializableFunction<T, GenericRecord> and a > >> SerializableFunction<DestinationT, Schema>. That means that perhaps it > >> may > >> provide its own DynamicDestinations-like API to its users, more specific > >> than the one exposed by low-level WriteFiles. > >> > >> This is pretty vague, but I think "AvroIO with dynamic schema and with > >> (type of input PCollection = T) != (type being written = GenericRecord)" > >> is > >> a good target to guide search for the perfect API. WDYT? > >> > >> On Wed, May 24, 2017 at 11:24 AM Reuven Lax <[email protected]> > >> wrote: > >> > >> > Did you see that I modified the second proposal so that users can map > >> > DestinationT to the actual PTransform (i.e. DestinationT->TextIO or > >> > DestinationT->AvroIO). This means that users do not have to deal with > >> > FileBasedSink or even know it exists. > >> > > >> > I prefer the second approach for two reason: > >> > > >> > 1. It allows customizing some useful things that the FilenamePolicy > does > >> > not. e.g. it's very reasonable to want to customize the output > directory > >> > and have a different number output shards for each directory. If the > >> > function returns a TextIO or AvroIO they can do that. If there's > simply > >> a > >> > mapping to a FilenamePolicy, the can't do that. > >> > > >> > 2. The majority of users don't need to deal with DefaultFilenamePolicy > >> > today. Allowing them to use the TextIO etc. builders for this will be > >> > more-familiar than the DefaultFilenamePolicy.Config option suggested. > >> > > >> > On Wed, May 24, 2017 at 10:59 AM, Kenneth Knowles > >> <[email protected]> > >> > wrote: > >> > > >> > > I commented a little in the doc I want to reply on list because this > >> is a > >> > > really great feature. > >> > > > >> > > The two alternatives, as I understand them, both include mapping > your > >> > > elements to an intermediate DestinationT that you can group by > before > >> > > writing. Then the big picture decision is whether to map each > >> > DestinationT > >> > > to a different FilenamePolicy (which may need to be made more > >> powerful) > >> > or > >> > > map each DestinationT to a different FileBasedSink. > >> > > > >> > > I think both are reasonable, modulo pitfalls that I'm probably > >> glossing > >> > > over. I favor the FilenamePolicy version a bit, because it is > focused > >> > just > >> > > on the file names, whereas the FileBasedSink version seems a bit > >> > > overpowered for the use case. The other consideration is that > >> > > FilenamePolicy is intended for user consumption, while FileBasedSink > >> is > >> > not > >> > > so much. > >> > > > >> > > Kenn > >> > > > >> > > On Thu, May 18, 2017 at 10:31 PM, Reuven Lax > <[email protected] > >> > > >> > > wrote: > >> > > > >> > > > While Beam now supports file-based sinks that can depend on the > >> current > >> > > > window, I've seen interest in value-dependent sinks as well (and > >> > there's > >> > > a > >> > > > long-standing JIRA for this). I wrote up a short API proposal for > >> this > >> > > for > >> > > > discussion on the list. > >> > > > > >> > > > https://docs.google.com/document/d/1Bd9mJO1YC8vOoFObJFupVURBMCl7j > >> > > > Wt6hOgw6ClwxE4/edit?usp=sharing > >> > > > > >> > > > Reuven > >> > > > > >> > > > >> > > >> > > > > >
