On 2018/09/13 22:30:07, Lukasz Cwik <[email protected]> wrote:
> You can even change windowing strategies between group bys with Window.into.
>
> On Thu, Sep 13, 2018 at 3:29 PM Lukasz Cwik <[email protected]> wrote:
>
> > Multiple group by are supported.
> >
> > On Thu, Sep 13, 2018 at 2:36 PM [email protected] <[email protected]>
> > wrote:
> >
> >> Hi
> >>
> >> from documentation groupby is applied on key and window basis.
> >>
> >> If my source is Pubsub (unbounded) - does Beam support applying multiple
> >> groupby transformations and all of applied groupby transformation execute
> >> in a single window. Or is only one groupby operation supported for
> >> unbounded sources.
> >>
> >>
> >> Thanks
> >> Aniruddh
> >>
> >
Thanks for the reply.
Here is the use case. We run Dataflow in streaming mode, but what it actually
processes is batch files. To drive it in streaming mode, the pipeline reads
PubSub messages (we write one message per batch) and processes the batch that
each message refers to. So there is only one PubSub element per batch, but a
batch can contain many files, and many records within those files. Assume
batches never run in parallel, only sequentially: the first PubSub message
makes Dataflow process the first batch, and the second message is written to
PubSub only once that batch is complete.
We use the following trigger to create a window per batch, since one PubSub
message indicates one complete batch. The assumption is that this trigger
creates a single window for one PubSub message, i.e. for one batch:
triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
After some ParDos and other processing, we apply multiple GroupBys. The
problem is that a GroupBy does not wait for all records of the same batch: as
soon as records reach the GroupBy, it starts emitting results to the downstream
function. How can we make GroupBy wait for all records belonging to the same
batch (one PubSub message, with the window trigger defined on that PubSub
element)? I published only one message to PubSub to trigger the processing,
but GroupBy still doesn't wait.
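To make the observed behaviour concrete, here is a toy model in plain Java. It is not Beam code, and the class and method names are hypothetical. It assumes that the records produced from one PubSub message reach the grouping step in several separate bundles, and that buffered elements are discarded after each firing; neither detail is confirmed above. Under those assumptions, a trigger that fires once at least 1 element is buffered emits one pane per bundle, even though every record sits in the same window.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one key in one window; illustrative only, not the Beam API. */
public class PaneSimulation {

    /**
     * Buffers elements bundle by bundle and emits a pane whenever the buffer
     * holds at least minCount elements, then clears the buffer (an assumed
     * "discarding" behaviour). Elements still buffered at the end are not
     * emitted, because this model has no notion of the window closing.
     */
    public static List<List<String>> groupWithCountTrigger(
            List<List<String>> bundles, int minCount) {
        List<List<String>> panes = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (List<String> bundle : bundles) {
            buffer.addAll(bundle);
            if (buffer.size() >= minCount) {
                panes.add(new ArrayList<>(buffer)); // the trigger fires here
                buffer.clear();
            }
        }
        return panes;
    }

    public static void main(String[] args) {
        // Five records from a single batch, arriving in three bundles.
        List<List<String>> bundles = List.of(
                List.of("r1", "r2"), List.of("r3"), List.of("r4", "r5"));

        // Count threshold 1: a pane after every bundle.
        System.out.println(groupWithCountTrigger(bundles, 1).size()); // prints 3

        // Count threshold 5: a single pane containing all five records.
        System.out.println(groupWithCountTrigger(bundles, 5).size()); // prints 1
    }
}
```

In this model the number of panes depends on the trigger threshold and on how records happen to be bundled, not on the number of windows, which would be consistent with GroupBy emitting several times for a single PubSub message.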
My questions are:
a) I don't understand the Repeatedly.forever trigger. My understanding was that
if I publish only one message to PubSub, it creates only one window and all
processing completes within that window, with every GroupBy waiting for all the
data of that window to arrive. But GroupBy emits multiple times. With only one
message published, I don't see how multiple windows get created.
b) Maybe I am choosing the wrong trigger. The trigger must not depend on time.
The only thing known for sure is that we write a single message to PubSub per
batch, and no other messages are written in parallel. The requirement is a
time-independent trigger that makes all the ParDos and GroupBys for a batch
execute in the same window. What would be the best trigger for this?
c) Is there any way to debug whether the GroupBys are emitting data to the next
function without waiting? How can one find out which windows were created and
when, to work out where the choice of windows/triggers is wrong? Currently I
can't figure out why GroupBy is not waiting, because I don't know exactly how
and when it emits results.
Apologies for the long email, and thanks in advance.
Thanks
Aniruddh