Hmm, on one hand this looks syntactically very appealing, on the other hand, it's icky to have a function return a PTransform at runtime, only to have some information be immediately extracted from that transform. Moreover, not all TextIO.Write transforms will be legal to return - e.g. most likely you're not allowed to return a transform that itself uses dynamic destinations.
We should think more about how to decompose this problem. I think there are 2 natural elements to writing files: 1) where to put the files (let's call this file location) 2) how to write to a single file (let's call this file format. In case of Avro, this may theoretically include e.g. schema to be embedded in the file). There should be represented by different interfaces/classes in the API. Then: - Writing a set of elements to a single file location using a single file format = "write operation" - WriteFiles is able to route different elements to different write operations, with potentially different both locations and formats. I.e. it's configured by something like BQ's DynamicDestinations - TextIO and AvroIO are thin wrappers over WriteFiles - AvroIO in the future may be extended to support different schemas for different files - then it would be even more like BigQuery: it'd take also a SerializableFunction<T, GenericRecord> and a SerializableFunction<DestinationT, Schema>. That means that perhaps it may provide its own DynamicDestinations-like API to its users, more specific than the one exposed by low-level WriteFiles. This is pretty vague, but I think "AvroIO with dynamic schema and with (type of input PCollection = T) != (type being written = GenericRecord)" is a good target to guide search for the perfect API. WDYT? On Wed, May 24, 2017 at 11:24 AM Reuven Lax <[email protected]> wrote: > Did you see that I modified the second proposal so that users can map > DestinationT to the actual PTransform (i.e. DestinationT->TextIO or > DestinationT->AvroIO). This means that users do not have to deal with > FileBasedSink or even know it exists. > > I prefer the second approach for two reason: > > 1. It allows customizing some useful things that the FilenamePolicy does > not. e.g. it's very reasonable to want to customize the output directory > and have a different number output shards for each directory. If the > function returns a TextIO or AvroIO they can do that. If there's simply a > mapping to a FilenamePolicy, the can't do that. > > 2. The majority of users don't need to deal with DefaultFilenamePolicy > today. Allowing them to use the TextIO etc. builders for this will be > more-familiar than the DefaultFilenamePolicy.Config option suggested. > > On Wed, May 24, 2017 at 10:59 AM, Kenneth Knowles <[email protected]> > wrote: > > > I commented a little in the doc I want to reply on list because this is a > > really great feature. > > > > The two alternatives, as I understand them, both include mapping your > > elements to an intermediate DestinationT that you can group by before > > writing. Then the big picture decision is whether to map each > DestinationT > > to a different FilenamePolicy (which may need to be made more powerful) > or > > map each DestinationT to a different FileBasedSink. > > > > I think both are reasonable, modulo pitfalls that I'm probably glossing > > over. I favor the FilenamePolicy version a bit, because it is focused > just > > on the file names, whereas the FileBasedSink version seems a bit > > overpowered for the use case. The other consideration is that > > FilenamePolicy is intended for user consumption, while FileBasedSink is > not > > so much. > > > > Kenn > > > > On Thu, May 18, 2017 at 10:31 PM, Reuven Lax <[email protected]> > > wrote: > > > > > While Beam now supports file-based sinks that can depend on the current > > > window, I've seen interest in value-dependent sinks as well (and > there's > > a > > > long-standing JIRA for this). I wrote up a short API proposal for this > > for > > > discussion on the list. > > > > > > https://docs.google.com/document/d/1Bd9mJO1YC8vOoFObJFupVURBMCl7j > > > Wt6hOgw6ClwxE4/edit?usp=sharing > > > > > > Reuven > > > > > >
