Re: Dynamic file-based sinks

Reuven Lax Thu, 08 Jun 2017 16:07:53 -0700

After looking at everyone's comments, I think option 1 is the better
approach - map destinations to a FilenamePolicy. It is a good parallel to
what we do in BigQueryIO (the main difference is that we're mapping to a
sharded filename, instead of a single destination like in BigQueryIO).
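
As a rough illustration of what "map destinations to a FilenamePolicy" could look like, here is a minimal plain-Java sketch. It does not use Beam's real classes: ShardedNamePolicy, DynamicNamingSketch and policyPerDestination are hypothetical names, and the destination is assumed to be a plain String key. The point is only that each destination resolves to its own policy, and each policy yields a sharded filename rather than a single target.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical stand-in for a filename policy (not Beam's FilenamePolicy):
// given a shard index and the total shard count, produce one output filename.
interface ShardedNamePolicy {
    String filenameFor(int shard, int numShards);
}

final class DynamicNamingSketch {
    // Returns a function from destination key to a policy rooted at a
    // per-destination subdirectory of baseDir. The String key is an assumption.
    static Function<String, ShardedNamePolicy> policyPerDestination(String baseDir) {
        return destination -> (shard, numShards) ->
            String.format(Locale.ROOT, "%s/%s/part-%05d-of-%05d.txt",
                baseDir, destination, shard, numShards);
    }

    public static void main(String[] args) {
        Function<String, ShardedNamePolicy> policies = policyPerDestination("/tmp/out");
        // Each destination gets its own family of sharded filenames.
        System.out.println(policies.apply("errors").filenameFor(0, 3));
        System.out.println(policies.apply("clicks").filenameFor(2, 3));
    }
}
```

Note that the shard count is passed in by the caller here and is the same for every destination, which matches the limitation discussed next.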


The main limitation is that numShards cannot be dynamic per destination. I
think that's fine for two reasons:

1. We generally discourage people from statically setting numShards, as
often runner-determined sharding is better.
2. In a case where users know that certain types of output files need a
different number of shards, they can always partition, e.g. into a
10-shard sink and a 100-shard sink, with each sink writing dynamic files.
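
The partitioning workaround in point 2 can be sketched in plain Java as follows. This is only a model of the idea, not Beam code: PartitionByShardCountSketch, SinkInput and the isHighVolume predicate are hypothetical, and in a real pipeline the split would presumably be done with a partitioning transform feeding two separate writes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class PartitionByShardCountSketch {
    // Hypothetical description of one sink: a fixed shard count plus the
    // elements that were routed to it.
    static final class SinkInput {
        final int numShards;
        final List<String> elements = new ArrayList<>();
        SinkInput(int numShards) { this.numShards = numShards; }
    }

    // Splits the input in two: elements matching isHighVolume go to the
    // 100-shard sink, everything else goes to the 10-shard sink.
    static List<SinkInput> partition(List<String> input, Predicate<String> isHighVolume) {
        SinkInput small = new SinkInput(10);
        SinkInput large = new SinkInput(100);
        for (String element : input) {
            (isHighVolume.test(element) ? large : small).elements.add(element);
        }
        return Arrays.asList(small, large);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("click:a", "audit:b", "click:c");
        for (SinkInput sink : partition(input, e -> e.startsWith("click:"))) {
            System.out.println(sink.numShards + " shards <- " + sink.elements);
        }
    }
}
```

Within each of the two sinks the shard count stays static, while the filenames can still vary per destination.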

Eugene also brought up destination directory, but that part of the
FilenamePolicy interface is more a hint than anything else.
DestinationDirectory is realistically just the base directory for the temp
files, and the FilenamePolicy is free to ignore it.
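
To make the "hint" reading concrete, here is a small plain-Java sketch under the assumption that the directory is simply handed to the policy as an argument. HintedNamePolicy and both factory methods are hypothetical and are not the actual FilenamePolicy interface; the sketch only shows that one policy may honor the directory while another is free to ignore it.

```java
import java.util.Locale;

// Hypothetical interface (not Beam's FilenamePolicy): the directory is
// passed in as a hint that implementations may or may not use.
interface HintedNamePolicy {
    String filenameFor(String directoryHint, int shard, int numShards);
}

final class DirectoryHintSketch {
    // Places final files under the hinted directory.
    static HintedNamePolicy honoringHint() {
        return (hint, shard, numShards) ->
            String.format(Locale.ROOT, "%s/out-%d-of-%d", hint, shard, numShards);
    }

    // Ignores the hint and always writes under a directory of its own choosing.
    static HintedNamePolicy ignoringHint(String ownDirectory) {
        return (hint, shard, numShards) ->
            String.format(Locale.ROOT, "%s/out-%d-of-%d", ownDirectory, shard, numShards);
    }

    public static void main(String[] args) {
        System.out.println(honoringHint().filenameFor("/tmp/staging", 1, 4));
        System.out.println(ignoringHint("/data/final").filenameFor("/tmp/staging", 1, 4));
    }
}
```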

Reuven

On Wed, May 24, 2017 at 1:54 PM, Eugene Kirpichov <
[email protected]> wrote:

> Hmm, on one hand this looks syntactically very appealing, on the other
> hand, it's icky to have a function return a PTransform at runtime, only to
> have some information be immediately extracted from that transform.
> Moreover, not all TextIO.Write transforms will be legal to return - e.g.
> most likely you're not allowed to return a transform that itself uses
> dynamic destinations.
>
> We should think more about how to decompose this problem.
> I think there are 2 natural elements to writing files:
> 1) where to put the files (let's call this file location)
> 2) how to write to a single file (let's call this file format. In case of
> Avro, this may theoretically include e.g. schema to be embedded in the
> file).
> These should be represented by different interfaces/classes in the API.
>
> Then:
> - Writing a set of elements to a single file location using a single file
> format = "write operation"
> - WriteFiles is able to route different elements to different write
> operations, with potentially different locations and formats. I.e.
> it's configured by something like BQ's DynamicDestinations
> - TextIO and AvroIO are thin wrappers over WriteFiles
> - AvroIO in the future may be extended to support different schemas for
> different files - then it would be even more like BigQuery: it'd take also
> a SerializableFunction<T, GenericRecord> and a
> SerializableFunction<DestinationT, Schema>. That means that perhaps it may
> provide its own DynamicDestinations-like API to its users, more specific
> than the one exposed by low-level WriteFiles.
>
> This is pretty vague, but I think "AvroIO with dynamic schema and with
> (type of input PCollection = T) != (type being written = GenericRecord)" is
> a good target to guide search for the perfect API. WDYT?
>
> On Wed, May 24, 2017 at 11:24 AM Reuven Lax <[email protected]>
> wrote:
>
> > Did you see that I modified the second proposal so that users can map
> > DestinationT to the actual PTransform (i.e. DestinationT->TextIO or
> > DestinationT->AvroIO)? This means that users do not have to deal with
> > FileBasedSink or even know it exists.
> >
> > I prefer the second approach for two reasons:
> >
> > 1. It allows customizing some useful things that the FilenamePolicy does
> > not. e.g. it's very reasonable to want to customize the output directory
> > and have a different number of output shards for each directory. If the
> > function returns a TextIO or AvroIO they can do that. If there's simply a
> > mapping to a FilenamePolicy, they can't do that.
> >
> > 2. The majority of users don't need to deal with DefaultFilenamePolicy
> > today. Allowing them to use the TextIO etc. builders for this will be
> > more familiar than the DefaultFilenamePolicy.Config option suggested.
> >
> > On Wed, May 24, 2017 at 10:59 AM, Kenneth Knowles <[email protected]>
> > wrote:
> >
> > > I commented a little in the doc, but I want to reply on list because
> > > this is a really great feature.
> > >
> > > The two alternatives, as I understand them, both include mapping your
> > > elements to an intermediate DestinationT that you can group by before
> > > writing. Then the big picture decision is whether to map each DestinationT
> > > to a different FilenamePolicy (which may need to be made more powerful) or
> > > map each DestinationT to a different FileBasedSink.
> > >
> > > I think both are reasonable, modulo pitfalls that I'm probably glossing
> > > over. I favor the FilenamePolicy version a bit, because it is focused just
> > > on the file names, whereas the FileBasedSink version seems a bit
> > > overpowered for the use case. The other consideration is that
> > > FilenamePolicy is intended for user consumption, while FileBasedSink is not
> > > so much.
> > >
> > > Kenn
> > >
> > > On Thu, May 18, 2017 at 10:31 PM, Reuven Lax <[email protected]>
> > > wrote:
> > >
> > > > While Beam now supports file-based sinks that can depend on the current
> > > > window, I've seen interest in value-dependent sinks as well (and there's a
> > > > long-standing JIRA for this). I wrote up a short API proposal for this for
> > > > discussion on the list.
> > > >
> > > > https://docs.google.com/document/d/1Bd9mJO1YC8vOoFObJFupVURBMCl7jWt6hOgw6ClwxE4/edit?usp=sharing
> > > >
> > > > Reuven
> > > >
> > >
> >
>
