Whoops, my bad, that would never happen. There is a check that only allows an operator's checkpoints to be purged if the operator has more than one checkpoint. :)
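To make that check concrete, here is a minimal sketch of the purge rule as described in this thread. It is not the actual Apex code: the function name `purge_checkpoints` and the plain list-of-window-ids representation are assumptions made purely for illustration.

```python
def purge_checkpoints(checkpoints, committed_window):
    """Return the checkpoints an operator keeps after a purge pass.

    Hypothetical model of the rule discussed in this thread, not Apex code:
    checkpoints at or before the committed window are dropped oldest first,
    but only while the operator still holds more than one checkpoint.
    """
    remaining = sorted(checkpoints)
    # The len() > 1 guard is the check that protects the last checkpoint.
    while len(remaining) > 1 and remaining[0] <= committed_window:
        remaining.pop(0)
    return remaining


# Heartbeat 1: committed window is 25, everything before 30 goes away.
after_first = purge_checkpoints([20, 25, 30], 25)
# Heartbeat 2: committed window is now 30, but 30 is the only checkpoint left.
after_second = purge_checkpoints(after_first, 30)
```

Under these assumptions the checkpoint for window 30 survives both heartbeats, so an operator that fails before checkpointing again still has something to recover from.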

On Tue, Dec 15, 2015 at 1:39 PM, Timothy Farkas <[email protected]> wrote:

> Siyuan, then Ashwin may be right that there is an issue. Looking at the
> code again I think this could happen:
>
> 1 - All operators reach checkpoint 30
> 2 - Checkpoints are updated on heartbeat and committed window is now 25,
> everything before window 30 is purged
> 3 - No new checkpoint is reached for any operator
> 4 - Checkpoints are updated on heartbeat again and committed window is now
> 30, now window 30 is purged.
>
> May be missing something again though.
>
> On Tue, Dec 15, 2015 at 1:32 PM, Siyuan Hua <[email protected]>
> wrote:
>
>> My understanding is the committed window could possibly be 30 as well,
>> depending on whether the container manager gets a heartbeat from the
>> containers.
>>
>> And I guess the discussion is assuming at_least_once semantics? :)
>> at_most_once should have a different recovery window.
>>
>> On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <[email protected]>
>> wrote:
>>
>> > Hi Ashwin,
>> >
>> > In your example, if A fails the recovery windows would be
>> >
>> > D - 15
>> > C - 15
>> > B - 15
>> > A - 15
>> >
>> > If C fails the recovery windows would be
>> >
>> > D - 15
>> > C - 15
>> > B - 25
>> > A - 30
>> >
>> > If every operator just reached window 30 and checkpointed, the committed
>> > window would be 25, and all the checkpoints before window 30 would be
>> > purged, but the checkpoint for window 30 would not be purged.
>> >
>> > Thanks,
>> > Tim
>> >
>> > On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
>> > [email protected]> wrote:
>> >
>> > > Tim,
>> > >
>> > > Thanks, that is pretty much in line with what I was thinking. A little
>> > > different thought though in terms of picking the checkpoint based on
>> > > downstream operators. For A, is it not going to be "the checkpoint with
>> > > the largest window id that is less than or equal to the checkpoint with
>> > > the largest common window id (instead of largest window id) among all
>> > > the operators downstream of A"?
>> > >
>> > > For example,
>> > >
>> > > If A -> B -> C -> D is the dag. And say, the checkpoint window count is 5
>> > > and the largest checkpoints are as follows.
>> > >
>> > > A - 30
>> > > B - 25
>> > > C - 20
>> > > D - 15
>> > >
>> > > Does A recover at 25 (checkpoint with largest window id) or 15
>> > > (checkpoint with largest common window id)?
>> > >
>> > > Also, regarding recovering at the committed window id: is it not
>> > > possible in the following scenario? All operators have checkpointed at
>> > > 30 and got the committed window callback, and then an operator fails
>> > > before any operator checkpoints further. In that case, the recovery
>> > > window is 30, right?
>> > >
>> > > Regards,
>> > > Ashwin.
>> > >
>> > > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <[email protected]>
>> > > wrote:
>> > >
>> > > > Hi Ashwin,
>> > > >
>> > > > The recovery checkpoint for operator A is computed by taking the
>> > > > checkpoint with the largest window id that is less than or equal to
>> > > > the checkpoint with the largest window id among all the operators
>> > > > downstream of A. The output operators in a dag will always recover to
>> > > > their most recent checkpoint. The input operator of the dag may
>> > > > recover to the earliest checkpoint. Operators between the input and
>> > > > output operators could recover to a window in between.
>> > > >
>> > > > I don't think you can ever recover to a committed window, the earliest
>> > > > I think you can recover to is the window after the committed window
>> > > > (may be wrong on this).
>> > > >
>> > > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
>> > > > [email protected]> wrote:
>> > > >
>> > > > > In the Apex architecture there is a concept of checkpointing and a
>> > > > > concept of committed when all operators have crossed a common
>> > > > > checkpoint.
>> > > > >
>> > > > > So, in which scenarios does a given operator recover at the last
>> > > > > checkpoint window vs the last committed window vs some other
>> > > > > checkpoint window in between?
>> > > > > --
>> > > > >
>> > > > > Regards,
>> > > > > Ashwin.
>> > > > >
>> > >
>> > > --
>> > >
>> > > Regards,
>> > > Ashwin.
>> > >
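
The recovery windows Tim lists for the A -> B -> C -> D example can be reproduced with a small sketch. This is one reading of the rule that matches the numbers in the thread, not Apex's implementation: an output operator restarts from its most recent checkpoint, and every other restarting operator takes its largest checkpoint that is no later than the recovery window of each operator directly downstream. The function names, the dict-based DAG, and the retained checkpoint lists are all assumptions for illustration.

```python
def downstream_closure(dag, start):
    """Return `start` plus every operator reachable downstream of it."""
    seen, stack = set(), [start]
    while stack:
        op = stack.pop()
        if op not in seen:
            seen.add(op)
            stack.extend(dag.get(op, []))
    return seen


def recovery_windows(dag, checkpoints, failed):
    """Map each operator to the window it resumes from when `failed` fails.

    dag: operator -> list of directly downstream operators (hypothetical model).
    checkpoints: operator -> window ids of the checkpoints it still holds.
    Operators upstream of the failure are assumed to keep running, so they
    are reported at their latest checkpoint.
    """
    restarting = downstream_closure(dag, failed)
    memo = {}

    def recover(op):
        if op not in memo:
            children = dag.get(op, [])
            if not children:
                # Output operator: most recent checkpoint.
                memo[op] = max(checkpoints[op])
            else:
                limit = min(recover(child) for child in children)
                # Raises ValueError if no checkpoint at or before `limit` exists.
                memo[op] = max(w for w in checkpoints[op] if w <= limit)
        return memo[op]

    return {
        op: recover(op) if op in restarting else max(held)
        for op, held in checkpoints.items()
    }


dag = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
held = {"A": [15, 20, 25, 30], "B": [15, 20, 25], "C": [15, 20], "D": [15]}
if_a_fails = recovery_windows(dag, held, "A")
if_c_fails = recovery_windows(dag, held, "C")
```

With these inputs, a failure of A sends every operator back to window 15, while a failure of C restarts only C and D at 15 and leaves B at 25 and A at 30, which agrees with the two tables in Tim's reply.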
