I'm putting together a proof-of-concept PR for option 1 to see how it looks.
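
For anyone skimming the thread, here is a rough sketch of the shape option 1 could take, written in plain Java so it runs without Beam on the classpath. Everything in it is an assumption for illustration: the `FilenamePolicy` and `DynamicDestinations` interfaces below are simplified stand-ins (not Beam's actual classes or signatures), and `write`, `demo`, and the `out/<region>/part-...` naming scheme are hypothetical. It only shows the routing idea: element to DestinationT, DestinationT to FilenamePolicy, with one numShards value shared by every destination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DynamicDestinationsSketch {

    /** Hypothetical stand-in for a filename policy: names one shard of one destination. */
    interface FilenamePolicy {
        String filenameFor(int shard, int numShards);
    }

    /** Hypothetical hook for option 1: element -> DestinationT -> FilenamePolicy. */
    interface DynamicDestinations<T, DestinationT> {
        DestinationT getDestination(T element);

        FilenamePolicy getFilenamePolicy(DestinationT destination);
    }

    /**
     * Routes each element to a file name. Shards are assigned round-robin per
     * destination; numShards is the same for every destination, which is the
     * limitation discussed in the thread.
     */
    static <T, D> Map<String, List<T>> write(
            List<T> elements, DynamicDestinations<T, D> destinations, int numShards) {
        Map<D, Integer> seenPerDestination = new HashMap<>();
        Map<String, List<T>> files = new LinkedHashMap<>();
        for (T element : elements) {
            D dest = destinations.getDestination(element);
            // merge returns the updated count, so the first element lands in shard 0.
            int shard = (seenPerDestination.merge(dest, 1, Integer::sum) - 1) % numShards;
            String name = destinations.getFilenamePolicy(dest).filenameFor(shard, numShards);
            files.computeIfAbsent(name, k -> new ArrayList<>()).add(element);
        }
        return files;
    }

    /** Example: the destination is the region prefix before ':' in each element. */
    public static Map<String, List<String>> demo() {
        DynamicDestinations<String, String> byRegion =
                new DynamicDestinations<String, String>() {
                    @Override
                    public String getDestination(String element) {
                        return element.substring(0, element.indexOf(':'));
                    }

                    @Override
                    public FilenamePolicy getFilenamePolicy(String region) {
                        return (shard, numShards) -> String.format(
                                "out/%s/part-%05d-of-%05d.txt", region, shard, numShards);
                    }
                };
        return write(Arrays.asList("us:alice", "eu:bob", "us:carol"), byRegion, 2);
    }

    public static void main(String[] args) {
        demo().forEach((file, contents) -> System.out.println(file + " <- " + contents));
    }
}
```

A user who needs a different shard count for some outputs would, under this sketch, partition the input first and call `write` once per partition with its own numShards.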
On Thu, Jun 8, 2017 at 4:07 PM, Reuven Lax <[email protected]> wrote:

> After looking at everyone's comments, I think option 1 is the better
> approach - map destinations to a FilenamePolicy. It is a good parallel to
> what we do in BigQueryIO (the main difference is that we're mapping to a
> sharded filename, instead of a single destination like in BigQueryIO).
>
> The main limitation is that numShards cannot be dynamic per destination. I
> think that's fine for two reasons:
>
> 1. We generally discourage people from statically setting numShards, as
> often runner-determined sharding is better.
> 2. In a case where users know that certain types of output files need a
> different number of shards, they can always partition, e.g. into a
> 10-shard and a 100-shard sink, with each sink writing dynamic files.
>
> Eugene also brought up destination directory, but that part of the
> FilenamePolicy interface is more a hint than anything else.
> DestinationDirectory is realistically just the base directory for the temp
> files, and the FilenamePolicy is free to ignore it.
>
> Reuven
>
> On Wed, May 24, 2017 at 1:54 PM, Eugene Kirpichov <
> [email protected]> wrote:
>
>> Hmm, on one hand this looks syntactically very appealing, on the other
>> hand, it's icky to have a function return a PTransform at runtime, only to
>> have some information be immediately extracted from that transform.
>> Moreover, not all TextIO.Write transforms will be legal to return - e.g.
>> most likely you're not allowed to return a transform that itself uses
>> dynamic destinations.
>>
>> We should think more about how to decompose this problem.
>> I think there are 2 natural elements to writing files:
>> 1) where to put the files (let's call this file location)
>> 2) how to write to a single file (let's call this file format. In case of
>> Avro, this may theoretically include e.g. schema to be embedded in the
>> file).
>> These should be represented by different interfaces/classes in the API.
>>
>> Then:
>> - Writing a set of elements to a single file location using a single file
>> format = "write operation"
>> - WriteFiles is able to route different elements to different write
>> operations, with potentially different locations and formats. I.e.
>> it's configured by something like BQ's DynamicDestinations
>> - TextIO and AvroIO are thin wrappers over WriteFiles
>> - AvroIO in the future may be extended to support different schemas for
>> different files - then it would be even more like BigQuery: it'd also take
>> a SerializableFunction<T, GenericRecord> and a
>> SerializableFunction<DestinationT, Schema>. That means that perhaps it
>> may provide its own DynamicDestinations-like API to its users, more
>> specific than the one exposed by low-level WriteFiles.
>>
>> This is pretty vague, but I think "AvroIO with dynamic schema and with
>> (type of input PCollection = T) != (type being written = GenericRecord)"
>> is a good target to guide the search for the perfect API. WDYT?
>>
>> On Wed, May 24, 2017 at 11:24 AM Reuven Lax <[email protected]>
>> wrote:
>>
>> > Did you see that I modified the second proposal so that users can map
>> > DestinationT to the actual PTransform (i.e. DestinationT->TextIO or
>> > DestinationT->AvroIO)? This means that users do not have to deal with
>> > FileBasedSink or even know it exists.
>> >
>> > I prefer the second approach for two reasons:
>> >
>> > 1. It allows customizing some useful things that the FilenamePolicy does
>> > not. e.g. it's very reasonable to want to customize the output directory
>> > and have a different number of output shards for each directory. If the
>> > function returns a TextIO or AvroIO they can do that. If there's simply
>> > a mapping to a FilenamePolicy, they can't do that.
>> >
>> > 2. The majority of users don't need to deal with DefaultFilenamePolicy
>> > today. Allowing them to use the TextIO etc. builders for this will be
>> > more familiar than the DefaultFilenamePolicy.Config option suggested.
>> >
>> > On Wed, May 24, 2017 at 10:59 AM, Kenneth Knowles
>> > <[email protected]> wrote:
>> >
>> > > I commented a little in the doc, but I want to reply on list because
>> > > this is a really great feature.
>> > >
>> > > The two alternatives, as I understand them, both include mapping your
>> > > elements to an intermediate DestinationT that you can group by before
>> > > writing. Then the big picture decision is whether to map each
>> > > DestinationT to a different FilenamePolicy (which may need to be made
>> > > more powerful) or map each DestinationT to a different FileBasedSink.
>> > >
>> > > I think both are reasonable, modulo pitfalls that I'm probably
>> > > glossing over. I favor the FilenamePolicy version a bit, because it is
>> > > focused just on the file names, whereas the FileBasedSink version
>> > > seems a bit overpowered for the use case. The other consideration is
>> > > that FilenamePolicy is intended for user consumption, while
>> > > FileBasedSink is not so much.
>> > >
>> > > Kenn
>> > >
>> > > On Thu, May 18, 2017 at 10:31 PM, Reuven Lax <[email protected]>
>> > > wrote:
>> > >
>> > > > While Beam now supports file-based sinks that can depend on the
>> > > > current window, I've seen interest in value-dependent sinks as well
>> > > > (and there's a long-standing JIRA for this). I wrote up a short API
>> > > > proposal for this for discussion on the list.
>> > > >
>> > > > https://docs.google.com/document/d/1Bd9mJO1YC8vOoFObJFupVURBMCl7jWt6hOgw6ClwxE4/edit?usp=sharing
>> > > >
>> > > > Reuven
