+1. We should not change the default, as this needs explicit user involvement. The downstream operator has accomodate this behavior.
Thks Amol E:a...@datatorrent.com | M: 510-449-2606 | Twitter: @*amolhkekre* www.datatorrent.com On Mon, Apr 24, 2017 at 10:56 AM, Sanjay Pujare <san...@datatorrent.com> wrote: > +1. > > If we didn't have to worry about backward compatibility I would even make > this new behavior as the default behavior (so may be in v4.0 of Apex ?) > > On Mon, Apr 24, 2017 at 8:57 AM, Pramod Immaneni <pra...@datatorrent.com> > wrote: > > > In a failure scenario, when a container fails, it is redeployed along > with > > all the operators in it. The operators downstream to these operators are > > also redeployed within their containers. The operators are restored from > > their checkpoint and connect to the appropriate point in the stream > > according to the processing mode. In at least once mode, for example, the > > data is replayed from the same checkpoint > > > > Restoring an operator state from checkpoint could turn out to be a costly > > operation depending on the size of the state. In some use cases, based on > > the operator logic, when there is an upstream failure, the operator state > > without being restored to the checkpoint i.e., remaining as is, will > still > > produce the same results with the data replayed from the last fully > > processed window. This is true with some operators in batch use cases. > The > > operator state can remain the same as it was before the upstream failure > by > > reusing the same operator instance from before and only the streams and > > window reset to the window after the last fully processed window to > > guarantee the at least once processing of tuples. If the container where > > the operator itself is running goes down, it would need to be restored > from > > the checkpoint of course. > > > > I would like to propose adding the ability for a user to explicitly > > identify operators to be of this type and the corresponding functionality > > in the engine to handle their recovery in the way described above by not > > restoring their state from checkpoint, reusing the instance and restoring > > the stream to the window after the last fully processed window for the > > operator. When operators are not identified to be of this type, the > default > > behavior is what it is today and nothing changes. > > > > I have done some prototyping on the engine side to ensure that this is > > possible with our current code base without requiring a massive overhaul, > > especially the restoration of the operator instance within the Node in > the > > streaming container, the re-establishment of the subscriber stream to a > > window in the buffer server where the publisher (upstream) hasn't yet > > reached as it would be restarting from checkpoint and have been able to > get > > it all working successfully. > > > > Thanks > > >