Hmm, on one hand this looks syntactically very appealing, on the other
hand, it's icky to have a function return a PTransform at runtime, only to
have some information be immediately extracted from that transform.
Moreover, not all TextIO.Write transforms will be legal to return - e.g.
most likely you're not allowed to return a transform that itself uses
dynamic destinations.

We should think more about how to decompose this problem.
I think there are 2 natural elements to writing files:
1) where to put the files (let's call this file location)
2) how to write to a single file (let's call this file format. In case of
Avro, this may theoretically include e.g. schema to be embedded in the
file).
There should be represented by different interfaces/classes in the API.

Then:
- Writing a set of elements to a single file location using a single file
format = "write operation"
- WriteFiles is able to route different elements to different write
operations, with potentially different both locations and formats. I.e.
it's configured by something like BQ's DynamicDestinations
- TextIO and AvroIO are thin wrappers over WriteFiles
- AvroIO in the future may be extended to support different schemas for
different files - then it would be even more like BigQuery: it'd take also
a SerializableFunction<T, GenericRecord> and a
SerializableFunction<DestinationT, Schema>. That means that perhaps it may
provide its own DynamicDestinations-like API to its users, more specific
than the one exposed by low-level WriteFiles.

This is pretty vague, but I think "AvroIO with dynamic schema and with
(type of input PCollection = T) != (type being written = GenericRecord)" is
a good target to guide search for the perfect API. WDYT?

On Wed, May 24, 2017 at 11:24 AM Reuven Lax <[email protected]>
wrote:

> Did you see that I modified the second proposal so that users can map
> DestinationT to the actual PTransform (i.e. DestinationT->TextIO or
> DestinationT->AvroIO). This means that users do not have to deal with
> FileBasedSink or even know it exists.
>
> I prefer the second approach for two reason:
>
> 1. It allows customizing some useful things that the FilenamePolicy does
> not. e.g. it's very reasonable to want to customize the output directory
> and have a different number output shards for each directory. If the
> function returns a TextIO or AvroIO they can do that. If there's simply a
> mapping to a FilenamePolicy, the can't do that.
>
> 2. The majority of users don't need to deal with DefaultFilenamePolicy
> today. Allowing them to use the TextIO etc. builders for this will be
> more-familiar than the DefaultFilenamePolicy.Config option suggested.
>
> On Wed, May 24, 2017 at 10:59 AM, Kenneth Knowles <[email protected]>
> wrote:
>
> > I commented a little in the doc I want to reply on list because this is a
> > really great feature.
> >
> > The two alternatives, as I understand them, both include mapping your
> > elements to an intermediate DestinationT that you can group by before
> > writing. Then the big picture decision is whether to map each
> DestinationT
> > to a different FilenamePolicy (which may need to be made more powerful)
> or
> > map each DestinationT to a different FileBasedSink.
> >
> > I think both are reasonable, modulo pitfalls that I'm probably glossing
> > over. I favor the FilenamePolicy version a bit, because it is focused
> just
> > on the file names, whereas the FileBasedSink version seems a bit
> > overpowered for the use case. The other consideration is that
> > FilenamePolicy is intended for user consumption, while FileBasedSink is
> not
> > so much.
> >
> > Kenn
> >
> > On Thu, May 18, 2017 at 10:31 PM, Reuven Lax <[email protected]>
> > wrote:
> >
> > > While Beam now supports file-based sinks that can depend on the current
> > > window, I've seen interest in value-dependent sinks as well (and
> there's
> > a
> > > long-standing JIRA for this). I wrote up a short API proposal for this
> > for
> > > discussion on the list.
> > >
> > > https://docs.google.com/document/d/1Bd9mJO1YC8vOoFObJFupVURBMCl7j
> > > Wt6hOgw6ClwxE4/edit?usp=sharing
> > >
> > > Reuven
> > >
> >
>

Reply via email to