I can think of a couple of mechanisms you might use. You could drop to a
low level and use a stateful ParDo, storing the contents of panes in state
and emitting them all when the final pane arrives. You could also
re-trigger the side input with a Never.ever() trigger, which will only
fire at GC (window expiration) time - and with terminating triggers
eliminated, this is the only final pane. And if you only want
synchronization, you could just use a simple ParDo to filter out all
non-final panes.
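
For that last filtering option, a rough sketch (untested; "DonePane" is just
an illustrative element type standing in for whatever the write emits per
pane, not an existing class):

    PCollection<DonePane> finalPanesOnly =
        donePanes.apply(
            "KeepFinalPanes",
            ParDo.of(
                new DoFn<DonePane, DonePane>() {
                  @ProcessElement
                  public void process(ProcessContext c) {
                    // PaneInfo.isLast() is true only for the final pane of a window.
                    if (c.pane().isLast()) {
                      c.output(c.element());
                    }
                  }
                }));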

Kenn

On Fri, Dec 15, 2017 at 7:01 PM, Ben Chambers <[email protected]> wrote:

> Would it make more sense for the side input watermark and details about
> the pane to be made available to the dofn which can then decide how to
> handle it? Then if a dofn only wants the final pane, it is analogous to
> triggering-is-for-sinks to push that back and only produce the views the
> dofn wants.
>
> I think exposing it and letting the dofn figure it out has similarities
> to how the input punctuation for each input might be exposed in other
> systems.
>
> On Fri, Dec 15, 2017, 5:31 PM Eugene Kirpichov <[email protected]>
> wrote:
>
>> So this appears not as easy as anticipated (surprise!)
>>
>> Suppose we have a PCollection "donePanes" with an element per
>> fully-processed pane: e.g. BigQuery sink, and elements saying "a pane of
>> data has been written; this pane is: final / non-final".
>>
>> Suppose we want to use this to ensure that somePc.apply(ParDo.of(fn))
>> happens only after the final pane has been written.
>>
>> In other words: we want a.apply(ParDo.of(b).withSideInput(c)) to happen
>> when c emits a *final* pane.
>>
>> Unfortunately, using 
>> ParDo.of(fn).withSideInputs(donePanes.apply(View.asSingleton()))
>> doesn't do the trick: the side input becomes ready the moment *the first*
>> pane of data has been written.
>>
>> But neither does ParDo.of(fn).withSideInputs(donePanes.apply(...filter
>> only final panes...).apply(View.asSingleton())). It also becomes ready
>> the moment *the first* pane has been written; you just get an exception
>> if you access the side input before the *final* pane has been written.
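>>
>> Concretely, the second attempt looks roughly like this (just a sketch;
>> "PaneDone" and isFinal() stand in for whatever element type the sink
>> actually emits):
>>
>>   PCollectionView<PaneDone> doneView =
>>       donePanes
>>           .apply(Filter.by((PaneDone p) -> p.isFinal()))
>>           .apply(View.asSingleton());
>>
>>   somePc.apply(ParDo.of(fn).withSideInputs(doneView));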
>>
>> I can't think of a pure-Beam solution to this: either "donePanes" will be
>> used as a main input to something (and then everything else can only be a
>> side input, which is not general enough), or it will be used as a side
>> input (and then we can't achieve "trigger only after the final pane fires").
>>
>> It seems that we need a way to control the side input pushback, and
>> configure whether a view becomes ready when its first pane has fired or
>> when its last pane has fired. I could see this being a property on the View
>> transform itself. In terms of implementation - I tried to figure out how
>> side input readiness is determined, in the direct runner and Dataflow
>> runner, and I'm completely lost and would appreciate some help.
>>
>> On Thu, Dec 7, 2017 at 12:01 AM Reuven Lax <[email protected]> wrote:
>>
>>> This sounds great!
>>>
>>> On Mon, Dec 4, 2017 at 4:34 PM, Ben Chambers <[email protected]>
>>> wrote:
>>>
>>>> This would be absolutely great! It seems somewhat similar to the
>>>> changes that were made to the BigQuery sink to support WriteResult (
>>>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.java).
>>>>
>>>> I find it helpful to think about the different things that may come
>>>> after a sink. For instance:
>>>>
>>>> 1. It might be helpful to have a collection of failed input elements.
>>>> The type of failed elements is pretty straightforward -- just the input
>>>> elements. This allows handling such failures by directing them elsewhere or
>>>> performing additional processing.
>>>>
>>>
>>> BigQueryIO already does this as you point out.
>>>
>>
>>>> 2. For a sink that produces a series of files, it might be useful to
>>>> have a collection of the file names that have been completely written. This
>>>> allows performing additional handling on these completed segments.
>>>>
>>>
>>> In fact we already do this for FileBasedSinks. See
>>> https://github.com/apache/beam/blob/7d53878768757ef2115170a5073b99956e924ff2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFilesResult.java
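>>>
>>> For example, something along these lines should already be possible with
>>> the file-based sinks (rough sketch; error handling omitted):
>>>
>>>   writeFilesResult
>>>       .getPerDestinationOutputFilenames()  // PCollection<KV<DestinationT, String>>
>>>       .apply(Values.create())
>>>       .apply(ParDo.of(new DoFn<String, Void>() {
>>>         @ProcessElement
>>>         public void process(ProcessContext c) {
>>>           // c.element() is the name of a file that has finished writing.
>>>         }
>>>       }));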
>>>
>>
>>>> 3. For a sink that updates some destination, it would be reasonable to
>>>> have a collection that provides (periodically) output indicating how
>>>> complete the information written to that destination is. For instance, this
>>>> might be something like "<this bigquery table> has all of the elements up
>>>> to <input watermark>" complete. This allows tracking how much information
>>>> has been completely written out.
>>>>
>>>
>>> Interesting. Maybe tough to do since sinks often don't have that
>>> knowledge.
>>>
>>
>>>
>>>>
>>>> I think those concepts map to the more detailed description Eugene
>>>> provided, but I find it helpful to focus on what information comes out of
>>>> the sink and how it might be used.
>>>>
>>>> Were there any use cases the above miss? Any functionality that has
>>>> been described that doesn't map to these use cases?
>>>>
>>>> -- Ben
>>>>
>>>> On Mon, Dec 4, 2017 at 4:02 PM Eugene Kirpichov <[email protected]>
>>>> wrote:
>>>>
>>>>> It makes sense to consider how this maps onto existing kinds of sinks.
>>>>>
>>>>> E.g.:
>>>>> - Something that just makes an RPC per record, e.g. MqttIO.write():
>>>>> it will emit 1 result per bundle (either a bogus value or the number of
>>>>> records written), which will be Combine'd into 1 result per pane of input. A
>>>>> user can sequence against this and be notified when some intermediate
>>>>> amount of data has been written for a window, or (via .isFinal()) when all
>>>>> of it has been written.
>>>>> - Something that e.g. initiates an import job, such as
>>>>> BigQueryIO.write(), or an ElasticsearchIO write with a follow-up atomic
>>>>> index swap: should emit 1 result per import job, e.g. containing
>>>>> information about the job (e.g. its id and statistics). Role of panes is
>>>>> the same.
>>>>> - Something like above but that supports dynamic destinations: like in
>>>>> WriteFiles, result will be PCollection<KV<DestinationT, ResultT>> where
>>>>> ResultT may be something like a list of files that were written for this
>>>>> pane of this destination.
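>>>>>
>>>>> For the first (per-record RPC) case, a minimal sketch of the
>>>>> result-producing side - purely illustrative, since MqttIO.write() returns
>>>>> PDone today and "perBundleCounts" is a hypothetical output:
>>>>>
>>>>>   // One element per bundle: the number of records that bundle wrote.
>>>>>   PCollection<Long> perBundleCounts = ...;
>>>>>   // Combined into a single count per pane of the input window.
>>>>>   PCollection<Long> perPaneCounts =
>>>>>       perBundleCounts.apply(Sum.longsGlobally().withoutDefaults());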
>>>>>
>>>>> On Mon, Dec 4, 2017 at 3:58 PM Eugene Kirpichov <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I agree that the proper API for enabling the use case "do something
>>>>>> after the data has been written" is to return a PCollection of objects
>>>>>> where each object represents the result of writing some identifiable 
>>>>>> subset
>>>>>> of the data. Then one can apply a ParDo to this PCollection, in order to
>>>>>> "do something after this subset has been written".
>>>>>>
>>>>>> The challenging part here is *identifying* the subset of the data
>>>>>> that's been written, in a way consistent with Beam's unified
>>>>>> batch/streaming model, where saying "all data has been written" is not an
>>>>>> option because more data can arrive.
>>>>>>
>>>>>> The next choice is "a window of input has been written", but then
>>>>>> again, late data can arrive into a window as well.
>>>>>>
>>>>>> Next choice after that is "a pane of input has been written", but per
>>>>>> https://s.apache.org/beam-sink-triggers the term "pane of input" is
>>>>>> moot: triggering and panes should be something private to the sink, and 
>>>>>> the
>>>>>> same input can trigger different sinks differently. The hypothetical
>>>>>> different accumulation modes make this trickier still. I'm not sure 
>>>>>> whether
>>>>>> we intend to also challenge the idea that windowing is inherent to the
>>>>>> collection, or whether it too should be specified on a transform that
>>>>>> processes the collection. I think for the sake of this discussion we can
>>>>>> assume that it's inherent, and assume the mental model that the elements 
>>>>>> in
>>>>>> different windows of a PCollection are processed independently - "as if"
>>>>>> there were multiple pipelines processing each window.
>>>>>>
>>>>>> Overall, embracing the full picture, we end up with something like
>>>>>> this:
>>>>>> - The input PCollection is a composition of windows.
>>>>>> - If the windowing strategy is non-merging (e.g. fixed or sliding
>>>>>> windows), the below applies to the entire contents of the PCollection. If
>>>>>> it's merging (e.g. session windows), then it applies per-key, and the 
>>>>>> input
>>>>>> should be (perhaps implicitly) keyed in a way that the sink understands -
>>>>>> for example, the grouping by destination in DynamicDestinations in file 
>>>>>> and
>>>>>> bigquery writes.
>>>>>> - Each window's contents is a "changelog" - stream of elements and
>>>>>> retractions.
>>>>>> - A "sink" processes each window of the collection, deciding how to
>>>>>> handle elements and retractions (and whether to support retractions at 
>>>>>> all)
>>>>>> in a sink-specific way, and deciding *when* to perform the side effects 
>>>>>> for
>>>>>> a portion of the changelog (a "pane") based on the sink's triggering
>>>>>> strategy.
>>>>>> - If the side effect itself is parallelized, then there'll be
>>>>>> multiple results for the pane - e.g. one per bundle.
>>>>>> - Each (sink-chosen) pane produces a set of results, e.g. a list of
>>>>>> filenames that have been written, or simply a number of records that was
>>>>>> written, or a bogus void value etc. The result will implicitly include 
>>>>>> the
>>>>>> window of the input it's associated with. It will also implicitly include
>>>>>> pane information - index of the pane in this window, and whether this is
>>>>>> the first or last pane.
>>>>>> - The partitioning into bundles is an implementation detail and not
>>>>>> very useful, so before presenting the pane write results to the user, the
>>>>>> sink will probably want to Combine the bundle results so that there ends 
>>>>>> up
>>>>>> being 1 value for each pane that was written. Once again note that panes
>>>>>> may be associated with windows of the input as a whole, but if the input 
>>>>>> is
>>>>>> keyed (like with DynamicDestinations) they'll be associated with per-key
>>>>>> subsets of windows of the input.
>>>>>> - This combining requires an extra, well, combining operation, so it
>>>>>> should be optional.
>>>>>> - The user will end up getting either a PCollection<ResultT> or a
>>>>>> PCollection<KV<KeyT, ResultT>>, for sink-specific KeyT and ResultT, where
>>>>>> the elements of this collection will implicitly have window and pane
>>>>>> information, available via the implicit BoundedWindow and PaneInfo.
>>>>>> - Until "sink triggering" is implemented, we'll have to embrace the
>>>>>> fact that trigger strategy is set on the input. But in that case the user
>>>>>> will have to accept that the PaneInfo of ResultT's is not necessarily
>>>>>> directly related to panes of the input - the sink is allowed to do 
>>>>>> internal
>>>>>> aggregation as an implementation detail, which may modify the triggering
>>>>>> strategy. Basically the user will still get sink-assigned panes.
>>>>>> - In most cases, one may imagine that the user is interested in being
>>>>>> notified of "no more data associated with this window will be written", 
>>>>>> so
>>>>>> the user will ignore all ResultT's except those where the pane is marked
>>>>>> final. If a user is interested in being notified of intermediate write
>>>>>> results - they'll have to embrace the fact that they cannot identify the
>>>>>> precise subset of input associated with the intermediate result.
>>>>>>
>>>>>> I think the really key points of the above are:
>>>>>> - Sinks should support windowed input. Sinks should write different
>>>>>> windows of input independently. If the sink can write multi-destination
>>>>>> input, the destination should function as a grouping key, and in that 
>>>>>> case
>>>>>> merging windowing should be allowed.
>>>>>> - Producing a PCollection of write results should be optional.
>>>>>> - When asked to produce results, sinks produce a PCollection of
>>>>>> results that may be keyed or unkeyed (per above), and are placed in the
>>>>>> window of the input that was written, and have a PaneInfo assigned by the
>>>>>> sink, of which probably the only part useful to the user is whether it's
>>>>>> .isFinal().
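>>>>>>
>>>>>> As a rough illustration of that last point, the consuming side might look
>>>>>> like this (ResultT is a placeholder for whatever the sink returns):
>>>>>>
>>>>>>   writeResults.apply(ParDo.of(new DoFn<ResultT, Void>() {
>>>>>>     @ProcessElement
>>>>>>     public void process(ProcessContext c, BoundedWindow window) {
>>>>>>       if (c.pane().isLast()) {
>>>>>>         // No more data for this window will be written - safe to kick off
>>>>>>         // whatever must follow the write (index swap, notification, ...).
>>>>>>       }
>>>>>>     }
>>>>>>   }));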
>>>>>>
>>>>>> Does this sound reasonable?
>>>>>>
>>>>>> On Mon, Dec 4, 2017 at 11:50 AM Robert Bradshaw <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> At the very least an empty PCollection<?> could be produced with no
>>>>>>> promises about its contents but the ability to be followed (e.g. as a
>>>>>>> side input), which is forward compatible with whatever actual
>>>>>>> metadata
>>>>>>> one may decide to produce in the future.
>>>>>>>
>>>>>>> On Mon, Dec 4, 2017 at 11:06 AM, Kenneth Knowles <[email protected]>
>>>>>>> wrote:
>>>>>>> > +dev@
>>>>>>> >
>>>>>>> > I am in complete agreement with Luke. Data dependencies are easy to
>>>>>>> > understand and a good way for an IO to communicate and establish
>>>>>>> causal
>>>>>>> > dependencies. Converting an IO from PDone to real output may spur
>>>>>>> further
>>>>>>> > useful thoughts based on the design decisions about what sort of
>>>>>>> output is
>>>>>>> > most useful.
>>>>>>> >
>>>>>>> > Kenn
>>>>>>> >
>>>>>>> > On Mon, Dec 4, 2017 at 10:42 AM, Lukasz Cwik <[email protected]>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> I think all sinks actually do have valuable information to output
>>>>>>> which
>>>>>>> >> can be used after a write (file names, transaction/commit/row
>>>>>>> ids, table
>>>>>>> >> names, ...). In addition to this metadata, having a PCollection
>>>>>>> of all
>>>>>>> >> successful writes and all failed writes is useful for users so
>>>>>>> they can
>>>>>>> >> chain an action which depends on what was or wasn't successfully
>>>>>>> written.
>>>>>>> >> Users have requested adding retry/failure handling policies to
>>>>>>> sinks so that
>>>>>>> >> failed writes don't jam up the pipeline.
>>>>>>> >>
>>>>>>> >> On Fri, Dec 1, 2017 at 2:43 PM, Chet Aldrich <
>>>>>>> [email protected]>
>>>>>>> >> wrote:
>>>>>>> >>>
>>>>>>> >>> So I agree generally with the idea that returning a PCollection
>>>>>>> makes all
>>>>>>> >>> of this easier so that arbitrary additional functions can be
>>>>>>> added, but what
>>>>>>> >>> exactly would write functions be returning in a PCollection that
>>>>>>> would make
>>>>>>> >>> sense? The whole idea is that we’ve written to an external
>>>>>>> source and now
>>>>>>> >>> the collection itself is no longer needed.
>>>>>>> >>>
>>>>>>> >>> Currently, that’s represented with a PDone, but currently that
>>>>>>> doesn’t
>>>>>>> >>> allow any work to occur after it. I see a couple possible ways
>>>>>>> of handling
>>>>>>> >>> this given this conversation, and am curious which solution
>>>>>>> sounds like the
>>>>>>> >>> best way to deal with the problem:
>>>>>>> >>>
>>>>>>> >>> 1. Have output transforms always return something specific
>>>>>>> (which would
>>>>>>> >>> be the same across transforms by convention), that is in the
>>>>>>> form of a
>>>>>>> >>> PCollection, so operations can occur after it.
>>>>>>> >>>
>>>>>>> >>> 2. Make either PDone or some new type that can act as a
>>>>>>> PCollection so we
>>>>>>> >>> can run applies afterward.
>>>>>>> >>>
>>>>>>> >>> 3. Make output transforms provide the facility for a callback
>>>>>>> function
>>>>>>> >>> which runs after the transform is complete.
>>>>>>> >>>
>>>>>>> >>> I went through these gymnastics recently when I was trying to
>>>>>>> build
>>>>>>> >>> something that would move indices after writing to Algolia, and
>>>>>>> the solution
>>>>>>> >>> was to co-opt code from the old Sink class that used to exist in
>>>>>>> Beam. The
>>>>>>> >>> problem is that particular method requires the output transform
>>>>>>> in question
>>>>>>> >>> to return a PCollection, even if it is trivial or doesn’t make
>>>>>>> sense to
>>>>>>> >>> return one. This seems like a bad solution, but unfortunately
>>>>>>> there isn’t a
>>>>>>> >>> notion of a transform that has no explicit output that needs to
>>>>>>> have
>>>>>>> >>> operations occur after it.
>>>>>>> >>>
>>>>>>> >>> The three potential solutions above address this issue, but I
>>>>>>> would like
>>>>>>> >>> to hear on which would be preferable (or perhaps a different
>>>>>>> proposal
>>>>>>> >>> altogether?). Perhaps we could also start up a ticket on this,
>>>>>>> since it
>>>>>>> >>> seems like a worthwhile feature addition. I would find it really
>>>>>>> useful, for
>>>>>>> >>> one.
>>>>>>> >>>
>>>>>>> >>> Chet
>>>>>>> >>>
>>>>>>> >>> On Dec 1, 2017, at 12:19 PM, Lukasz Cwik <[email protected]>
>>>>>>> wrote:
>>>>>>> >>>
>>>>>>> >>> Instead of a callback fn, its most useful if a PCollection is
>>>>>>> returned
>>>>>>> >>> containing the result of the sink so that any arbitrary
>>>>>>> additional functions
>>>>>>> >>> can be applied.
>>>>>>> >>>
>>>>>>> >>> On Fri, Dec 1, 2017 at 7:14 AM, Jean-Baptiste Onofré <
>>>>>>> [email protected]>
>>>>>>> >>> wrote:
>>>>>>> >>>>
>>>>>>> >>>> Agree, I would prefer to do the callback in the IO more than in
>>>>>>> the
>>>>>>> >>>> main.
>>>>>>> >>>>
>>>>>>> >>>> Regards
>>>>>>> >>>> JB
>>>>>>> >>>>
>>>>>>> >>>> On 12/01/2017 03:54 PM, Steve Niemitz wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> I do something almost exactly like this, but with BigtableIO
>>>>>>> instead.
>>>>>>> >>>>> I have a pull request open here [1] (which reminds me I need
>>>>>>> to finish this
>>>>>>> >>>>> up...).  It would really be nice for most IOs to support
>>>>>>> something like
>>>>>>> >>>>> this.
>>>>>>> >>>>>
>>>>>>> >>>>> Essentially you do a GroupByKey (or some CombineFn) on the
>>>>>>> output from
>>>>>>> >>>>> the BigtableIO, and then feed that into your function which
>>>>>>> will run when
>>>>>>> >>>>> all writes finish.
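>>>>>>> >>>>>
>>>>>>> >>>>> Roughly (a sketch; "WriteResult" is a placeholder element type -
>>>>>>> >>>>> stock BigtableIO returns PDone, and the per-element write results
>>>>>>> >>>>> are what [1] adds):
>>>>>>> >>>>>
>>>>>>> >>>>>   writeResults
>>>>>>> >>>>>       .apply(Combine.globally(Count.<WriteResult>combineFn())
>>>>>>> >>>>>           .withoutDefaults())
>>>>>>> >>>>>       .apply(ParDo.of(new DoFn<Long, Void>() {
>>>>>>> >>>>>         @ProcessElement
>>>>>>> >>>>>         public void process(ProcessContext c) {
>>>>>>> >>>>>           // Fires once per pane, after that pane's writes have finished.
>>>>>>> >>>>>         }
>>>>>>> >>>>>       }));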
>>>>>>> >>>>>
>>>>>>> >>>>> You probably want to avoid doing something in the main method
>>>>>>> because
>>>>>>> >>>>> there's no guarantee it'll actually run (maybe the driver will
>>>>>>> die, get
>>>>>>> >>>>> killed, machine will explode, etc).
>>>>>>> >>>>>
>>>>>>> >>>>> [1] https://github.com/apache/beam/pull/3997
>>>>>>> >>>>>
>>>>>>> >>>>>     On Fri, Dec 1, 2017 at 9:46 AM, NerdyNick <[email protected]> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>>     Assuming you're in Java. You could just follow on in your
>>>>>>> Main
>>>>>>> >>>>> method.
>>>>>>> >>>>>     Checking the state of the Result.
>>>>>>> >>>>>
>>>>>>> >>>>>     Example:
>>>>>>> >>>>>     PipelineResult result = pipeline.run();
>>>>>>> >>>>>     try {
>>>>>>> >>>>>       result.waitUntilFinish();
>>>>>>> >>>>>       if (result.getState() == PipelineResult.State.DONE) {
>>>>>>> >>>>>         // Do ES work.
>>>>>>> >>>>>       }
>>>>>>> >>>>>     } catch (Exception e) {
>>>>>>> >>>>>       result.cancel();
>>>>>>> >>>>>       throw e;
>>>>>>> >>>>>     }
>>>>>>> >>>>>
>>>>>>> >>>>>     Otherwise you could also use Oozie to construct a work
>>>>>>> flow.
>>>>>>> >>>>>
>>>>>>> >>>>>     On Fri, Dec 1, 2017 at 2:02 AM, Jean-Baptiste Onofré
>>>>>>> >>>>> <[email protected]> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>>         Hi,
>>>>>>> >>>>>
>>>>>>> >>>>>         yes, we had a similar question some days ago.
>>>>>>> >>>>>
>>>>>>> >>>>>         We can imagine to have a user callback fn fired when
>>>>>>> the sink
>>>>>>> >>>>> batch is
>>>>>>> >>>>>         complete.
>>>>>>> >>>>>
>>>>>>> >>>>>         Let me think about that.
>>>>>>> >>>>>
>>>>>>> >>>>>         Regards
>>>>>>> >>>>>         JB
>>>>>>> >>>>>
>>>>>>> >>>>>         On 12/01/2017 09:04 AM, Philip Chan wrote:
>>>>>>> >>>>>
>>>>>>> >>>>>             Hey JB,
>>>>>>> >>>>>
>>>>>>> >>>>>             Thanks for getting back so quickly.
>>>>>>> >>>>>             I suppose in that case I would need a way of
>>>>>>> monitoring
>>>>>>> >>>>> when the ES
>>>>>>> >>>>>             transform completes successfully before I can
>>>>>>> proceed with
>>>>>>> >>>>> doing the
>>>>>>> >>>>>             swap.
>>>>>>> >>>>>             The problem with this is that I can't think of a
>>>>>>> good way
>>>>>>> >>>>> to
>>>>>>> >>>>>             determine that termination state short of polling
>>>>>>> the new
>>>>>>> >>>>> index to
>>>>>>> >>>>>             check the document count compared to the size of
>>>>>>> input
>>>>>>> >>>>> PCollection.
>>>>>>> >>>>>             That, or maybe I'd need to use an external system
>>>>>>> like you
>>>>>>> >>>>> mentioned
>>>>>>> >>>>>             to poll on the state of the pipeline (I'm using
>>>>>>> Google
>>>>>>> >>>>> Dataflow, so
>>>>>>> >>>>>             maybe there's a way to do this with some API).
>>>>>>> >>>>>             But I would have thought that there would be an
>>>>>>> easy way of
>>>>>>> >>>>> simply
>>>>>>> >>>>>             saying "do not process this transform until this
>>>>>>> other
>>>>>>> >>>>> transform
>>>>>>> >>>>>             completes".
>>>>>>> >>>>>             Is there no established way of "signaling" between
>>>>>>> >>>>> pipelines when
>>>>>>> >>>>>             some pipeline completes, or have some way of
>>>>>>> declaring a
>>>>>>> >>>>> dependency
>>>>>>> >>>>>             of 1 transform on another transform?
>>>>>>> >>>>>
>>>>>>> >>>>>             Thanks again,
>>>>>>> >>>>>             Philip
>>>>>>> >>>>>
>>>>>>> >>>>>             On Thu, Nov 30, 2017 at 11:44 PM, Jean-Baptiste
>>>>>>> Onofré
>>>>>>> >>>>>             <[email protected]> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>>                  Hi Philip,
>>>>>>> >>>>>
>>>>>>> >>>>>                  You won't be able to do (3) in the same
>>>>>>> pipeline as
>>>>>>> >>>>> the
>>>>>>> >>>>>             Elasticsearch Sink
>>>>>>> >>>>>                  PTransform ends the pipeline with PDone.
>>>>>>> >>>>>
>>>>>>> >>>>>                  So, (3) has to be done in another pipeline
>>>>>>> (using a
>>>>>>> >>>>> DoFn) or in
>>>>>>> >>>>>             another
>>>>>>> >>>>>                  "system" (like Camel for instance). I would
>>>>>>> do a check
>>>>>>> >>>>> of the
>>>>>>> >>>>>             data in the
>>>>>>> >>>>>                  index and then trigger the swap there.
>>>>>>> >>>>>
>>>>>>> >>>>>                  Regards
>>>>>>> >>>>>                  JB
>>>>>>> >>>>>
>>>>>>> >>>>>                  On 12/01/2017 08:41 AM, Philip Chan wrote:
>>>>>>> >>>>>
>>>>>>> >>>>>                      Hi,
>>>>>>> >>>>>
>>>>>>> >>>>>                      I'm pretty new to Beam, and I've been
>>>>>>> trying to
>>>>>>> >>>>> use the
>>>>>>> >>>>>             ElasticSearchIO
>>>>>>> >>>>>                      sink to write docs into ES.
>>>>>>> >>>>>                      With this, I want to be able to
>>>>>>> >>>>>                      1. ingest and transform rows from DB
>>>>>>> (done)
>>>>>>> >>>>>                      2. write JSON docs/strings into a new ES
>>>>>>> index
>>>>>>> >>>>> (done)
>>>>>>> >>>>>                      3. After (2) is complete and all
>>>>>>> documents are
>>>>>>> >>>>> written into
>>>>>>> >>>>>             a new index,
>>>>>>> >>>>>                      trigger an atomic index swap under an
>>>>>>> alias to
>>>>>>> >>>>> replace the
>>>>>>> >>>>>             current
>>>>>>> >>>>>                      aliased index with the new index
>>>>>>> generated in step
>>>>>>> >>>>> 2. This
>>>>>>> >>>>>             is basically
>>>>>>> >>>>>                      a single POST request to the ES cluster.
>>>>>>> >>>>>
>>>>>>> >>>>>                      The problem I'm facing is that I don't
>>>>>>> seem to be
>>>>>>> >>>>> able to
>>>>>>> >>>>>             find a way to
>>>>>>> >>>>>                      have a way for (3) to happen after step
>>>>>>> (2) is
>>>>>>> >>>>> complete.
>>>>>>> >>>>>
>>>>>>> >>>>>                      The ElasticSearchIO.Write transform
>>>>>>> returns a
>>>>>>> >>>>> PDone, and
>>>>>>> >>>>>             I'm not sure
>>>>>>> >>>>>                      how to proceed from there because it
>>>>>>> doesn't seem
>>>>>>> >>>>> to let me
>>>>>>> >>>>>             do another
>>>>>>> >>>>>                      apply on it to "define" a dependency.
>>>>>>> >>>>>
>>>>>>> >>>>> https://beam.apache.org/documentation/sdks/javadoc/2.1.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html
>>>>>>> >>>>>
>>>>>>> >>>>>                      Is there a recommended way to construct
>>>>>>> pipelines
>>>>>>> >>>>> workflows
>>>>>>> >>>>>             like this?
>>>>>>> >>>>>
>>>>>>> >>>>>                      Thanks in advance,
>>>>>>> >>>>>                      Philip
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>                  --     Jean-Baptiste Onofré
>>>>>>> >>>>>             [email protected]
>>>>>>> >>>>>             http://blog.nanthrax.net
>>>>>>> >>>>>                  Talend - http://www.talend.com
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>         --         Jean-Baptiste Onofré
>>>>>>> >>>>>         [email protected]
>>>>>>> >>>>>         http://blog.nanthrax.net
>>>>>>> >>>>>         Talend - http://www.talend.com
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>     --     Nick Verbeck - NerdyNick
>>>>>>> >>>>>     ----------------------------------------------------
>>>>>>> >>>>>     NerdyNick.com <http://NerdyNick.com>
>>>>>>> >>>>>     TrailsOffroad.com <http://TrailsOffroad.com>
>>>>>>> >>>>>     NoKnownBoundaries.com <http://NoKnownBoundaries.com>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>
>>>>>>> >>>> --
>>>>>>> >>>> Jean-Baptiste Onofré
>>>>>>> >>>> [email protected]
>>>>>>> >>>> http://blog.nanthrax.net
>>>>>>> >>>> Talend - http://www.talend.com
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>
>>>>>>> >
>>>>>>>
>>>>>>
