Dear Kenn,

Thanks again. That pattern was my initial plan, but there seems to be a bug in the Python API: line 42 of periodicsequence.py reads "total_outputs = math.ceil((end - start) / interval)". Here end, start, and interval are all Durations, and the / operator is not defined for the Duration class. I already wrote to this list about it, but I did not get a satisfactory answer on how to work around the issue, so I am now trying to find a way around using PeriodicImpulse. If you have any other suggestions, I would highly appreciate them.
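
For reference, a minimal sketch of the arithmetic that line 42 appears to intend. It assumes start, end, and interval have already been converted to plain seconds (how to get seconds out of a Duration is left open here). The function name total_outputs_from_seconds is hypothetical and not part of Beam; this is not a patch to periodicsequence.py.

```python
import math


def total_outputs_from_seconds(start_s, end_s, interval_s):
    """Hypothetical helper: number of impulses in [start_s, end_s).

    Assumes all three arguments are plain numbers in seconds, so the
    subtraction and division work on floats rather than on Duration objects.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return math.ceil((end_s - start_s) / interval_s)


if __name__ == "__main__":
    # Three days of impulses, one every 24 hours.
    print(total_outputs_from_seconds(0, 3 * 86400, 86400))
```

Doing the division on plain numbers sidesteps the missing / operator, at the cost of dropping the Duration type for that one expression.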

On Thu, 7 Jan 2021 at 18:29, Kenneth Knowles <[email protected]> wrote:

> Actually, if you want to actually re-read the BQ table then you need
> something more, following the pattern here:
> https://beam.apache.org/documentation/patterns/side-inputs/. There are
> two variations on the page there, and these do not use triggers but instead
> the read from BigQuery at the beginning of the 24 hours is used by all of
> the main input elements for the whole 24 hour window of the main input. The
> general pattern is PeriodImpulse --> ParDo(convert impulse to read spec)
> --> ReadAll
>
> I realize you did not specify what language you are using. The ReadAll
> transform only exists for BigQuery in Python right now, and it is not yet
> easy to use it from Java (plus you may not want to).
>
> Kenn
>
> On Thu, Jan 7, 2021 at 12:36 AM Manninger, Matyas <[email protected]> wrote:
>
>> Thanks Kenn for the clear explanation. Very helpful. I am trying to read
>> a small BQ table as side input and refresh it every 24 hours or so but I
>> still want to main stream to be processed during that time. Is there a
>> better way to do this than have a 24 hour window with 1 minute triggers on
>> the side input? Maybe just restarting the job every 24 hour and reading the
>> side input on setup would be the best option.
>>
>> On Tue, 5 Jan 2021 at 17:53, Kenneth Knowles <[email protected]> wrote:
>>
>>> You have it basically right. However, there are a couple minor
>>> clarifications:
>>>
>>> 1. A particular window on the side input is not "ready" until there has
>>> been some element output to it (or it has expired, which will make it the
>>> default value). Main input elements will wait for the side input to be
>>> ready. If you configure triggering on the side input, then the first
>>> triggering will make it "ready". Of course, this means that the value you
>>> will read will be incomplete view of the data. If you have a 24 hour window
>>> with triggering set up then the value that is read will be whatever the
>>> most recent trigger is, but with some caching delay.
>>> 2. None of the "time" that you are talking about is real time. It is all
>>> event time so it is controlled by the side input and main input watermarks.
>>> Of course in streaming these are usually close to real time so yes on
>>> average what you describe is probably right.
>>>
>>> It sounds like you want a side input with a trigger on it, if you want
>>> to read it before you have all the data. This is highly nondeterministic so
>>> you want to be sure that you do not require exact answers on the side input.
>>>
>>> Kenn
>>>
>>> On Tue, Jan 5, 2021 at 6:56 AM Manninger, Matyas <[email protected]> wrote:
>>>
>>>> Dear Beam users,
>>>>
>>>> Can someone clarify me how side input works in streaming? If I use a
>>>> stream as a side input to my main stream, each element will be paired with
>>>> a side input from the according time window. does this mean that the
>>>> element will not be processed until the appropriate window on the side
>>>> input stream is closed? So if my side input is windowed into 24 hour
>>>> windows will my elements from the main stream be processed only every 24
>>>> hour? If not, then if the window is triggered for the sideinput at 12:00
>>>> and the input actually only arrives at 12:05 then all elements from the
>>>> main stream processed between 12:00 and 12:05 will be matched with an empty
>>>> sideinput?
>>>>
>>>> Any clarification is appreciated.
>>>>
>>>> Best regards,
>>>> Matyas
>>>>
>>>
