Re: Dynamic file-based sinks

Reuven Lax Fri, 09 Jun 2017 09:38:30 -0700

I'm putting together a proof-of-concept PR for option 1 to see how it
looks.


On Thu, Jun 8, 2017 at 4:07 PM, Reuven Lax <[email protected]> wrote:

> After looking at everyone's comments, I think option 1 is the better
> approach - map destinations to a FilenamePolicy. It is a good parallel to
> what we do in BigQueryIO (the main difference is that we're mapping to a
> sharded filename, instead of a single destination like in BigQueryIO).
>
> The main limitation is that numShards cannot be dynamic per destination. I
> think that's fine for two reasons:
>
> 1. We generally discourage people from statically setting numShards, as
> often runner-determined sharding is better.
> 2. In a case where users know that certain types of output files need a
> different number of shards, they can always partition. e.g. partition into
> a 10-shard and a 100-shard sink, with each sink writing dynamic files.
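
A minimal sketch of the partitioning workaround in point 2, written in plain Java rather than against Beam's API. The class `ShardPartitionSketch`, its methods, the set of "large" destinations, and the filename layout are all hypothetical, chosen only to illustrate the idea: the shard count is fixed per sink, while the filename inside each sink still varies by destination.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Beam.
final class ShardPartitionSketch {
    /** Shard count of each statically configured sink: sink 0 and sink 1. */
    static final int[] SHARDS_PER_SINK = {10, 100};

    /** Picks sink 1 (100 shards) for destinations assumed to be large, else sink 0. */
    static int sinkFor(String destination, Set<String> largeDestinations) {
        return largeDestinations.contains(destination) ? 1 : 0;
    }

    /** Within a sink, the filename still depends on the destination (assumed layout). */
    static String filename(String destination, int shard, int numShards) {
        return String.format(Locale.ROOT, "%s/part-%05d-of-%05d",
                destination, shard, numShards);
    }
}
```

In a real pipeline the first function would play the role of the partition function, and each partition would feed its own sink with its own numShards.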
>
> Eugene also brought up destination directory, but that part of the
> FilenamePolicy interface is more a hint than anything else.
> DestinationDirectory is realistically just the base directory for the temp
> files, and the FilenamePolicy is free to ignore it.
>
> Reuven
>
> On Wed, May 24, 2017 at 1:54 PM, Eugene Kirpichov <[email protected]>
> wrote:
>
>> Hmm, on one hand this looks syntactically very appealing, on the other
>> hand, it's icky to have a function return a PTransform at runtime, only to
>> have some information be immediately extracted from that transform.
>> Moreover, not all TextIO.Write transforms will be legal to return - e.g.
>> most likely you're not allowed to return a transform that itself uses
>> dynamic destinations.
>>
>> We should think more about how to decompose this problem.
>> I think there are 2 natural elements to writing files:
>> 1) where to put the files (let's call this file location)
>> 2) how to write to a single file (let's call this file format. In case of
>> Avro, this may theoretically include e.g. schema to be embedded in the
>> file).
>> These should be represented by different interfaces/classes in the API.
>>
>> Then:
>> - Writing a set of elements to a single file location using a single file
>> format = "write operation"
>> - WriteFiles is able to route different elements to different write
>> operations, potentially with different locations and formats, i.e.
>> it's configured by something like BQ's DynamicDestinations
>> - TextIO and AvroIO are thin wrappers over WriteFiles
>> - AvroIO in the future may be extended to support different schemas for
>> different files - then it would be even more like BigQuery: it'd also take
>> a SerializableFunction<T, GenericRecord> and a
>> SerializableFunction<DestinationT, Schema>. That means that perhaps it may
>> provide its own DynamicDestinations-like API to its users, more specific
>> than the one exposed by low-level WriteFiles.
>>
>> This is pretty vague, but I think "AvroIO with dynamic schema and with
>> (type of input PCollection = T) != (type being written = GenericRecord)"
>> is a good target to guide the search for the perfect API. WDYT?
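
The decomposition above (file location, file format, and a "write operation" that pairs them) can be sketched in plain Java. Every name here (`FileLocation`, `FileFormat`, `WriteOperation`, `DynamicWriteSketch`) is hypothetical and none of it is Beam's API. Files are modelled as an in-memory map from filename to records, and sharding and windowing are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical names throughout; none of these are Beam classes.

/** Where to put the files: supplies the filename for one destination. */
interface FileLocation {
    String filename();
}

/** How to write a single file: turns one element into one record. */
interface FileFormat<T> {
    String encode(T element);
}

/** One location plus one format, i.e. the "write operation" described above. */
final class WriteOperation<T> {
    final FileLocation location;
    final FileFormat<T> format;

    WriteOperation(FileLocation location, FileFormat<T> format) {
        this.location = location;
        this.format = format;
    }
}

final class DynamicWriteSketch {
    /** Routes each element to the write operation chosen for its destination. */
    static <T, D> Map<String, List<String>> write(
            List<T> elements,
            Function<T, D> destinationFn,
            Function<D, WriteOperation<T>> operationFn) {
        // In-memory stand-in for files: filename -> encoded records.
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (T element : elements) {
            WriteOperation<T> op = operationFn.apply(destinationFn.apply(element));
            files.computeIfAbsent(op.location.filename(), k -> new ArrayList<>())
                    .add(op.format.encode(element));
        }
        return files;
    }
}
```

Because the format is chosen per destination, a variant of `FileFormat` that also carries a schema would cover the "different schemas for different files" case without changing the routing code.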
>>
>> On Wed, May 24, 2017 at 11:24 AM Reuven Lax <[email protected]> wrote:
>>
>> > Did you see that I modified the second proposal so that users can map
>> > DestinationT to the actual PTransform (i.e. DestinationT->TextIO or
>> > DestinationT->AvroIO)? This means that users do not have to deal with
>> > FileBasedSink or even know it exists.
>> >
>> > I prefer the second approach for two reasons:
>> >
>> > 1. It allows customizing some useful things that the FilenamePolicy does
>> > not. e.g. it's very reasonable to want to customize the output directory
>> > and have a different number of output shards for each directory. If the
>> > function returns a TextIO or AvroIO they can do that. If there's simply
>> > a mapping to a FilenamePolicy, they can't do that.
>> >
>> > 2. The majority of users don't need to deal with DefaultFilenamePolicy
>> > today. Allowing them to use the TextIO etc. builders for this will be
>> > more familiar than the DefaultFilenamePolicy.Config option suggested.
>> >
>> > On Wed, May 24, 2017 at 10:59 AM, Kenneth Knowles <[email protected]>
>> > wrote:
>> >
>> > > I commented a little in the doc, but I want to reply on list because
>> > > this is a really great feature.
>> > >
>> > > The two alternatives, as I understand them, both include mapping your
>> > > elements to an intermediate DestinationT that you can group by before
>> > > writing. Then the big picture decision is whether to map each
>> > > DestinationT to a different FilenamePolicy (which may need to be made
>> > > more powerful) or map each DestinationT to a different FileBasedSink.
>> > >
>> > > I think both are reasonable, modulo pitfalls that I'm probably
>> > > glossing over. I favor the FilenamePolicy version a bit, because it is
>> > > focused just on the file names, whereas the FileBasedSink version seems
>> > > a bit overpowered for the use case. The other consideration is that
>> > > FilenamePolicy is intended for user consumption, while FileBasedSink is
>> > > not so much.
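
The FilenamePolicy alternative described above can be sketched in plain Java: group elements by an intermediate destination, then map each destination to a policy that only decides file names. The nested `Policy` interface, the class `FilenamePolicySketch`, and the `out/<destination>/part-...` layout are hypothetical stand-ins, not Beam's FilenamePolicy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; not Beam's FilenamePolicy.
final class FilenamePolicySketch {
    /** Stand-in for a filename policy: it decides file names and nothing else. */
    interface Policy {
        String filename(int shard, int numShards);
    }

    /** Step 1: map each element to its destination and group by it. */
    static <T, D> Map<D, List<T>> groupByDestination(
            List<T> elements, Function<T, D> destinationFn) {
        Map<D, List<T>> groups = new LinkedHashMap<>();
        for (T element : elements) {
            groups.computeIfAbsent(destinationFn.apply(element), k -> new ArrayList<>())
                    .add(element);
        }
        return groups;
    }

    /** Step 2: map each destination to its own policy (assumed name layout). */
    static Policy policyFor(String destination) {
        return (shard, numShards) -> String.format(
                Locale.ROOT, "out/%s/part-%d-of-%d", destination, shard, numShards);
    }
}
```

Note that the shard count is passed in to the policy rather than chosen by it, so in this shape the policy cannot vary the number of shards per destination.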
>> > >
>> > > Kenn
>> > >
>> > > On Thu, May 18, 2017 at 10:31 PM, Reuven Lax <[email protected]>
>> > > wrote:
>> > >
>> > > > While Beam now supports file-based sinks that can depend on the
>> > > > current window, I've seen interest in value-dependent sinks as well
>> > > > (and there's a long-standing JIRA for this). I wrote up a short API
>> > > > proposal for this for discussion on the list.
>> > > >
>> > > > https://docs.google.com/document/d/1Bd9mJO1YC8vOoFObJFupVURBMCl7jWt6hOgw6ClwxE4/edit?usp=sharing
>> > > >
>> > > > Reuven
>> > > >
>> > >
>> >
>>
>
>
