After looking at everyone's comments, I think option 1 is the better approach - map destinations to a FilenamePolicy. It is a good parallel to what we do in BigQueryIO (the main difference is that we're mapping to a sharded filename, instead of a single destination like in BigQueryIO).
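To make option 1 concrete, here is a minimal sketch in plain Java. It deliberately does not use Beam's real classes: ShardNamePolicy, DestinationRouter and byFirstField are hypothetical names invented for illustration, and the "first comma-separated field is the destination" rule is just an assumption for the example. The point is only the shape: element -> DestinationT -> policy -> sharded filename.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical stand-in for a filename policy (NOT Beam's FilenamePolicy):
// it turns a shard index into the name of one output file.
interface ShardNamePolicy {
    String filenameFor(int shardNumber, int numShards);
}

// Hypothetical helper: maps each element to a destination, then maps that
// destination to the policy that names its sharded output files.
final class DestinationRouter<T, DestinationT> {
    private final Function<T, DestinationT> destinationFn;
    private final Function<DestinationT, ShardNamePolicy> policyFn;

    DestinationRouter(Function<T, DestinationT> destinationFn,
                      Function<DestinationT, ShardNamePolicy> policyFn) {
        this.destinationFn = destinationFn;
        this.policyFn = policyFn;
    }

    // numShards is passed in once per write; every destination shares it.
    String filenameFor(T element, int shardNumber, int numShards) {
        DestinationT destination = destinationFn.apply(element);
        return policyFn.apply(destination).filenameFor(shardNumber, numShards);
    }
}

final class DynamicDestinationsSketch {
    // Assumption for the example: the destination is the first CSV field.
    static DestinationRouter<String, String> byFirstField(String baseDir) {
        return new DestinationRouter<String, String>(
            line -> line.split(",", 2)[0],
            dest -> (shard, total) -> String.format(
                Locale.ROOT, "%s/%s/part-%05d-of-%05d.txt", baseDir, dest, shard, total));
    }

    public static void main(String[] args) {
        DestinationRouter<String, String> router = byFirstField("/tmp/out");
        // Shard 3 of 10 for an element whose destination is "us".
        System.out.println(router.filenameFor("us,alice,42", 3, 10));
    }
}
```

Note that in this sketch the destination only influences the file name; the shard count comes from the caller and is the same for every destination.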
The main limitation is that numShards cannot be dynamic per destination. I think that's fine for two reasons:

1. We generally discourage people from statically setting numShards, as runner-determined sharding is often better.

2. In a case where users know that certain types of output files need a different number of shards, they can always partition, e.g. partition into a 10-shard and a 100-shard sink, with each sink writing dynamic files.

Eugene also brought up destination directory, but that part of the FilenamePolicy interface is more a hint than anything else. DestinationDirectory is realistically just the base directory for the temp files, and the FilenamePolicy is free to ignore it.

Reuven

On Wed, May 24, 2017 at 1:54 PM, Eugene Kirpichov <[email protected]> wrote:

> Hmm, on one hand this looks syntactically very appealing, on the other
> hand, it's icky to have a function return a PTransform at runtime, only to
> have some information be immediately extracted from that transform.
> Moreover, not all TextIO.Write transforms will be legal to return - e.g.
> most likely you're not allowed to return a transform that itself uses
> dynamic destinations.
>
> We should think more about how to decompose this problem.
> I think there are 2 natural elements to writing files:
> 1) where to put the files (let's call this file location)
> 2) how to write to a single file (let's call this file format. In case of
> Avro, this may theoretically include e.g. schema to be embedded in the
> file).
> These should be represented by different interfaces/classes in the API.
>
> Then:
> - Writing a set of elements to a single file location using a single file
> format = "write operation"
> - WriteFiles is able to route different elements to different write
> operations, with potentially different locations and formats. I.e.
> it's configured by something like BQ's DynamicDestinations
> - TextIO and AvroIO are thin wrappers over WriteFiles
> - AvroIO in the future may be extended to support different schemas for
> different files - then it would be even more like BigQuery: it'd also take
> a SerializableFunction<T, GenericRecord> and a
> SerializableFunction<DestinationT, Schema>. That means that perhaps it may
> provide its own DynamicDestinations-like API to its users, more specific
> than the one exposed by low-level WriteFiles.
>
> This is pretty vague, but I think "AvroIO with dynamic schema and with
> (type of input PCollection = T) != (type being written = GenericRecord)" is
> a good target to guide search for the perfect API. WDYT?
>
> On Wed, May 24, 2017 at 11:24 AM Reuven Lax <[email protected]>
> wrote:
>
> > Did you see that I modified the second proposal so that users can map
> > DestinationT to the actual PTransform (i.e. DestinationT->TextIO or
> > DestinationT->AvroIO)? This means that users do not have to deal with
> > FileBasedSink or even know it exists.
> >
> > I prefer the second approach for two reasons:
> >
> > 1. It allows customizing some useful things that the FilenamePolicy does
> > not. e.g. it's very reasonable to want to customize the output directory
> > and have a different number of output shards for each directory. If the
> > function returns a TextIO or AvroIO they can do that. If there's simply a
> > mapping to a FilenamePolicy, they can't do that.
> >
> > 2. The majority of users don't need to deal with DefaultFilenamePolicy
> > today. Allowing them to use the TextIO etc. builders for this will be
> > more familiar than the DefaultFilenamePolicy.Config option suggested.
> >
> > On Wed, May 24, 2017 at 10:59 AM, Kenneth Knowles <[email protected]>
> > wrote:
> >
> > > I commented a little in the doc; I want to reply on list because this is a
> > > really great feature.
> > >
> > > The two alternatives, as I understand them, both include mapping your
> > > elements to an intermediate DestinationT that you can group by before
> > > writing. Then the big picture decision is whether to map each DestinationT
> > > to a different FilenamePolicy (which may need to be made more powerful) or
> > > map each DestinationT to a different FileBasedSink.
> > >
> > > I think both are reasonable, modulo pitfalls that I'm probably glossing
> > > over. I favor the FilenamePolicy version a bit, because it is focused just
> > > on the file names, whereas the FileBasedSink version seems a bit
> > > overpowered for the use case. The other consideration is that
> > > FilenamePolicy is intended for user consumption, while FileBasedSink is not
> > > so much.
> > >
> > > Kenn
> > >
> > > On Thu, May 18, 2017 at 10:31 PM, Reuven Lax <[email protected]>
> > > wrote:
> > >
> > > > While Beam now supports file-based sinks that can depend on the current
> > > > window, I've seen interest in value-dependent sinks as well (and there's a
> > > > long-standing JIRA for this). I wrote up a short API proposal for this for
> > > > discussion on the list.
> > > >
> > > > https://docs.google.com/document/d/1Bd9mJO1YC8vOoFObJFupVURBMCl7jWt6hOgw6ClwxE4/edit?usp=sharing
> > > >
> > > > Reuven
