+1. If we didn't have to worry about backward compatibility I would even make this new behavior as the default behavior (so may be in v4.0 of Apex ?)
On Mon, Apr 24, 2017 at 8:57 AM, Pramod Immaneni <pra...@datatorrent.com> wrote: > In a failure scenario, when a container fails, it is redeployed along with > all the operators in it. The operators downstream to these operators are > also redeployed within their containers. The operators are restored from > their checkpoint and connect to the appropriate point in the stream > according to the processing mode. In at least once mode, for example, the > data is replayed from the same checkpoint > > Restoring an operator state from checkpoint could turn out to be a costly > operation depending on the size of the state. In some use cases, based on > the operator logic, when there is an upstream failure, the operator state > without being restored to the checkpoint i.e., remaining as is, will still > produce the same results with the data replayed from the last fully > processed window. This is true with some operators in batch use cases. The > operator state can remain the same as it was before the upstream failure by > reusing the same operator instance from before and only the streams and > window reset to the window after the last fully processed window to > guarantee the at least once processing of tuples. If the container where > the operator itself is running goes down, it would need to be restored from > the checkpoint of course. > > I would like to propose adding the ability for a user to explicitly > identify operators to be of this type and the corresponding functionality > in the engine to handle their recovery in the way described above by not > restoring their state from checkpoint, reusing the instance and restoring > the stream to the window after the last fully processed window for the > operator. When operators are not identified to be of this type, the default > behavior is what it is today and nothing changes. > > I have done some prototyping on the engine side to ensure that this is > possible with our current code base without requiring a massive overhaul, > especially the restoration of the operator instance within the Node in the > streaming container, the re-establishment of the subscriber stream to a > window in the buffer server where the publisher (upstream) hasn't yet > reached as it would be restarting from checkpoint and have been able to get > it all working successfully. > > Thanks >