Let me first see if I understand what you're trying to accomplish (but I could be way off).
1) A single element comes in from PubSub indicating that a batch is ready to be processed. 2) You process this batch in a DoFn, resulting in many output elements, which percolate through various other stages in your pipeline. 3) What you want is to be able to detect when this batch is "finished", i.e. all produced elements have gone through the pipeline. I don't think triggers are the right tool for the job, they're (loosely speaking) used when you have a GBK but want to see data without waiting for the watermark to reach the end of the window. You could, however, still use watermarks to do this. Say the element comes in a timestamp t. By default, any processing you do with this element will result in elements that also have timestamp t. When aggregating (gbk, combine, join, etc.) you can set the windowing to use the EARLIEST [1] such that the outputs also have timestamp t. (Otherwise, the outputs will have a timestamp at the end of the window, which may also be fine but shifted forward a bit which needs to be taken into account). At the very end, emit a "done" element, flatten all these together, and do a standard GBK windowing by some (presumably small) fixed window (possibly keying by batch id if you have more than one batch flowing at a time). Due to the way watermarks advance, you can be assured that once this GBK emits to the subsequent DoFn, all upstream elements (up to timestamp t) for this batch (key) have been processed, and do whatever logic you need to declare the batch done. As I said, I might not be understanding the question, but hopefully this helps. [1] https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/TimestampCombiner.java On Fri, Sep 14, 2018 at 3:38 AM [email protected] <[email protected]> wrote: > > > On 2018/09/13 22:30:07, Lukasz Cwik <[email protected]> wrote: > > You can even change windowing strategies between group bys with > Window.into. > > > > On Thu, Sep 13, 2018 at 3:29 PM Lukasz Cwik <[email protected]> wrote: > > > > > Multiple group by are supported. > > > > > > On Thu, Sep 13, 2018 at 2:36 PM [email protected] < > [email protected]> > > > wrote: > > > > > >> Hi > > >> > > >> from documentation groupby is applied on key and window basis. > > >> > > >> If my source is Pubsub (unbounded) - does Beam support applying > multiple > > >> groupby transformations and all of applied groupby transformation > execute > > >> in a single window. Or is only one groupby operation supported for > > >> unbounded sources. > > >> > > >> > > >> Thanks > > >> Aniruddh > > >> > > > > > thanks for the revert. > > Here is the use case. Using Dataflow in streaming mode but in actual it > processes batch files. To use it in streaming mode it reads PubSub messages > (where we write a message per batch) and based on that message it should > process that batch. So although number of elements in PubSub is only 1 per > batch. But one batch could have many files and many records within those > files. Assume there will be no parallel batches and it will only be > sequential. So for first message in PubSub Dataflow will process first > batch. Once it is complete then a second message will be written in PubSub. > > Using this trigger to create a window per batch as one message in PubSub > is indicator of complete batch. So assumption here is following trigger > should create a single window for 1 message in pubsub which actually is > batch. > triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)) > > After doing some ParDos and some logical processing , we do multiple > GroupBy based on some logic. problem in GroupBy is that it is not waiting > for all records for same batch. As soon as GroubBy hits it starts emitting > details for downstream function. How to make GroupBy wait for all records > belonging to same batch (which is one record in PubSub and window Trigger > is created on element of PubSub) ? Published only one message in PubSub > which triggers the processing, but Groupby doesn't wait. > > Following are queries. > a) Not understanding Repeatedly.forever trigger. If I publish only one > message in PubSub so my understanding was it will create only 1 window and > complete all processing for 1 window. All GroupBys will wait for all data > to come (for same window) . But GroubBy is emitting multiple times. if > publish only one message then not sure how it creates multiple windows. > b) May be choosing a wrong trigger. Have to choose a trigger (logic not > dependent on time ). Only thing known for sure short is that we write a > single message in pubsub . Other messages will not be written in parallel. > Requirement is to choose a trigger independent of time which makes sure it > creates and executes multiple ParDo and multiple GroupBy in same window. > What could be best trigger for same. > c) is there any way to debug if GroupBy are not waiting and emitting data > for next function. How to debug and find out how it created windows so that > one can look at how windows were created and take a guess how probably > their logic of selecting logic of windows/triggers is wrong. Currently not > able to figure out why GroupBy is not waiting as not sure exactly how and > when it is emitting results. > > Apology for long email and thanks in advance. > > Thanks > Aniruddh >
