Beam introduced in version 2.4.0 the Wait transform to delay
processing of each window in a PCollection until signaled. This opened
new interesting patterns for example writing to a database and when
‘fully’ done write to another database.

To support this pattern an IO connector Write transform must return a
type different from PDone to signal the processing of the next step.
Some IOs have already started to implement this return type, but each
returned type has different pros and cons so I wanted to open the
discussion on this to see if we could somehow find a common pattern to
suggest IO authors to follow (Note: It may be the case that there is
not a pattern that fits certain use cases).

So far the approaches in our code base are:

1. Write returns ‘PCollection<Void>’

This is the simplest case but if subsequent transforms require more
data that could have been produced during the write it gets ‘lost’.
Used by JdbcIO and DynamoDBIO.

2. Write returns ‘PCollection<?>’

We can return whatever we want but the return type is uncertain for
the user in case he wants to use information from it. This is less
user friendly but has the maintenance advantage of not changing
signatures if we want to change the return type in the future. Used by
RabbitMQIO.

3. Write returns a `PCollectionTuple`

It is like (2) but with the advantage of returning an untyped tuple of
PCollections so we can return more things. Used by SnsIO.

4. Write returns ‘a class that implements POutput’

This class wraps inside of the PCollections that were part of the
write, e.g. SpannerWriteResult. This is useful because we can be
interested on saving inside a PCollection of failed mutations apart of
the ‘done’ signal. Used by BigQueryIO and SpannerIO. A generics case
of this one is used by FileIO for Destinations via:
‘WriteFilesResult<DestinationT>’.

5. Write returns ‘a class that implements POutput’ with specific data
(no PCollections)

This is similar to (4) but with the difference that the returned type
contains the specific data that may be needed next, for example not a
PCollection but values like the number of rows written. Used by
BigtableIO (PR in review at the moment). (This can be seen as a
simpler version of 4).

I would like to have your opinions on which approach you think it is
better or worse and arguments if you see other
advantages/disadvantages. I am probably more in the (4) camp but I
feel somehow attracted by the flexibility that the lack of strict
typing brings in (2, 3) in case of changes to the public IO API (of
course this can be contested too).

Any other ideas, preferences, issues we may be missing?

Reply via email to