On 2018/09/13 22:30:07, Lukasz Cwik <[email protected]> wrote: 
> You can even change windowing strategies between group bys with Window.into.
> 
> On Thu, Sep 13, 2018 at 3:29 PM Lukasz Cwik <[email protected]> wrote:
> 
> > Multiple group by are supported.
> >
> > On Thu, Sep 13, 2018 at 2:36 PM [email protected] <[email protected]>
> > wrote:
> >
> >> Hi
> >>
> >> from documentation groupby is applied on key and window basis.
> >>
> >> If my source is Pubsub (unbounded) - does Beam support applying multiple
> >> groupby transformations and all of applied groupby transformation execute
> >> in a single window. Or is only one groupby operation supported for
> >> unbounded sources.
> >>
> >>
> >> Thanks
> >> Aniruddh
> >>
> >
> thanks for the revert.

Here is the use case. Using Dataflow in streaming mode but in actual it 
processes batch files. To use it in streaming mode it reads PubSub messages 
(where we write a message per batch) and based on that message it should 
process that batch. So although number of elements in PubSub is only 1 per 
batch. But one batch could have many files and many records within those files. 
Assume there will be no parallel batches and it will only be sequential. So for 
first message in PubSub Dataflow will process first batch. Once it is complete 
then a second message will be written in PubSub.

Using this trigger to create a window per batch as one message in PubSub is 
indicator of complete batch. So assumption here is following trigger should 
create a single window for 1 message in pubsub which actually is batch.
triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1))

After doing some ParDos and some logical processing , we do multiple GroupBy 
based on some logic. problem in GroupBy is that it is not waiting for all 
records for same batch. As soon as GroubBy hits it starts emitting details for 
downstream function.  How to make GroupBy wait for all records belonging to 
same batch (which is one record in PubSub and window Trigger is created on 
element of PubSub) ? Published only one message in PubSub which triggers the 
processing, but Groupby doesn't wait. 

Following are queries.
a) Not understanding Repeatedly.forever trigger.  If I publish only one message 
in PubSub so my understanding was it will create only 1 window and complete all 
processing for 1 window. All GroupBys will wait for all data to come (for same 
window) . But GroubBy is emitting multiple times. if publish only one message 
then not sure how it creates multiple windows.
b) May be choosing a wrong trigger. Have to choose a trigger (logic not 
dependent on time ). Only thing known for sure short is that we write a single 
message in pubsub . Other messages will not be written in parallel.  
Requirement is to choose a trigger independent of time which makes sure it 
creates and executes multiple ParDo and multiple GroupBy in same window. What 
could be best trigger for same.
c) is there any way to debug if GroupBy are not waiting and emitting data for 
next function. How to debug and find out how it created windows so that one can 
look at how windows were created and take a guess how probably their logic of 
selecting logic of windows/triggers is wrong. Currently not able to figure out 
why GroupBy is not waiting as not sure exactly how and when it is emitting 
results.

Apology for long email and thanks in advance.

Thanks
Aniruddh

Reply via email to