Re: Return types of Write transforms (aka best way to signal)

Reza Rokni Wed, 26 Jun 2019 19:38:47 -0700

The use case of a transform  waiting for a SInk or Sinks to complete is
very interesting indeed!


Curious, if a sink internally makes use of a Global Window with processing
time triggers to push its writes, what mechanism could be used to release a
transform waiting for a signal from the Sink(s) that all processing is done
and it can move forward?

On Thu, 27 Jun 2019 at 03:58, Robert Bradshaw <rober...@google.com> wrote:

> Regarding Python, yes and no. Python doesn't distinguish at compile
> time between (1), (2), and (6), but that doesn't mean it isn't part of
> the public API and people might start counting on it, so it's in some
> sense worse. We can also do (3) (which is less cumbersome in Python,
> either returning a tuple or a dict) or (4).
>
> Good point about providing a simple solution (something that can be
> waited on at least) and allowing for with* modifiers to return more.
>
> On Wed, Jun 26, 2019 at 7:08 PM Chamikara Jayalath <chamik...@google.com>
> wrote:
> >
> > BTW regarding Python SDK, I think the answer to this question is simpler
> for Python SDK due to the lack of types. Most examples I know just return a
> PCollection from the Write transform which may or may not be ignored by
> users. If the PCollection is used, the user should be aware of the element
> type of the returned PCollection and should use it accordingly in
> subsequent transforms.
> >
> > Thanks,
> > Cham
> >
> > On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath <chamik...@google.com>
> wrote:
> >>
> >>
> >>
> >> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw <rober...@google.com>
> wrote:
> >>>
> >>> Good question.
> >>>
> >>> I'm not sure what could be done with (5) if it contains no deferred
> >>> objects (e.g there's nothing to wait on).
> >>>
> >>> There is also (6) return PCollection<SourceSpecificWriteResult>. The
> >>> advantage of (2) is that one can migrate to (1) or (6) without
> >>> changing the public API, while giving something to wait on without
> >>> promising anything about its contents.
> >>>
> >>>
> >>> I would probably lean towards (4) for anything that would want to
> >>> return multiple signals/outputs (e.g. successful vs. failed writes)
> >>> and view (3) as being a "cheap" but more cumbersome for the user way
> >>> of writing (4). In both cases, more information can be added in a
> >>> forward-compatible way. Technically (4) could extend (3) if one wants
> >>> to migrate from (3) to (4) to provide a nicer API in the future. (As
> >>> an aside, it would be interesting if any of the schema work that lets
> >>> us get rid of tuple tags for elements (e.g. join operations) could let
> >>> us get rid of tuple tags for PCollectionTuples (e.g. letting a POJO
> >>> with PCollection members be as powerful as a PCollectionTuple).
> >>>
> >>> On Wed, Jun 26, 2019 at 2:23 PM Ismaël Mejía <ieme...@gmail.com>
> wrote:
> >>> >
> >>> > Beam introduced in version 2.4.0 the Wait transform to delay
> >>> > processing of each window in a PCollection until signaled. This
> opened
> >>> > new interesting patterns for example writing to a database and when
> >>> > ‘fully’ done write to another database.
> >>> >
> >>> > To support this pattern an IO connector Write transform must return a
> >>> > type different from PDone to signal the processing of the next step.
> >>> > Some IOs have already started to implement this return type, but each
> >>> > returned type has different pros and cons so I wanted to open the
> >>> > discussion on this to see if we could somehow find a common pattern
> to
> >>> > suggest IO authors to follow (Note: It may be the case that there is
> >>> > not a pattern that fits certain use cases).
> >>> >
> >>> > So far the approaches in our code base are:
> >>> >
> >>> > 1. Write returns ‘PCollection<Void>’
> >>> >
> >>> > This is the simplest case but if subsequent transforms require more
> >>> > data that could have been produced during the write it gets ‘lost’.
> >>> > Used by JdbcIO and DynamoDBIO.
> >>> >
> >>> > 2. Write returns ‘PCollection<?>’
> >>> >
> >>> > We can return whatever we want but the return type is uncertain for
> >>> > the user in case he wants to use information from it. This is less
> >>> > user friendly but has the maintenance advantage of not changing
> >>> > signatures if we want to change the return type in the future. Used
> by
> >>> > RabbitMQIO.
> >>> >
> >>> > 3. Write returns a `PCollectionTuple`
> >>> >
> >>> > It is like (2) but with the advantage of returning an untyped tuple
> of
> >>> > PCollections so we can return more things. Used by SnsIO.
> >>> >
> >>> > 4. Write returns ‘a class that implements POutput’
> >>> >
> >>> > This class wraps inside of the PCollections that were part of the
> >>> > write, e.g. SpannerWriteResult. This is useful because we can be
> >>> > interested on saving inside a PCollection of failed mutations apart
> of
> >>> > the ‘done’ signal. Used by BigQueryIO and SpannerIO. A generics case
> >>> > of this one is used by FileIO for Destinations via:
> >>> > ‘WriteFilesResult<DestinationT>’.
> >>> >
> >>> > 5. Write returns ‘a class that implements POutput’ with specific data
> >>> > (no PCollections)
> >>> >
> >>> > This is similar to (4) but with the difference that the returned type
> >>> > contains the specific data that may be needed next, for example not a
> >>> > PCollection but values like the number of rows written. Used by
> >>> > BigtableIO (PR in review at the moment). (This can be seen as a
> >>> > simpler version of 4).
> >>
> >>
> >> Thanks Ismaël for detailing various approaches with examples.
> >>
> >> I think current PR for BigTable returns a
> PCollection<BigTableWriteResult>  from a PTransform 'WithWriteResults' that
> can be optionally invoked through a BigTableIO.Write.withWriteResults(). So
> this is more closer to (6) Robert mentioned. But (1) was also discussed as
> an option. PR is https://github.com/apache/beam/pull/7805 for anybody
> interested.
> >>
> >> I think (6) is less cumbersome to implement/use and allows us to easily
> extend the transform through more chaining or by changing the return
> transform through additional "with*" methods to the FooIO.Write class.
> >>
> >> Thanks,
> >> Cham
> >>
> >>> >
> >>> > I would like to have your opinions on which approach you think it is
> >>> > better or worse and arguments if you see other
> >>> > advantages/disadvantages. I am probably more in the (4) camp but I
> >>> > feel somehow attracted by the flexibility that the lack of strict
> >>> > typing brings in (2, 3) in case of changes to the public IO API (of
> >>> > course this can be contested too).
> >>> >
> >>> > Any other ideas, preferences, issues we may be missing?
>


-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.

Re: Return types of Write transforms (aka best way to signal)

Reply via email to