Re: Specialized operator recovery

Amol Kekre Mon, 24 Apr 2017 11:18:29 -0700

+1.

We should not change the default, as this needs explicit user involvement.
The downstream operator has to accommodate this behavior.


Thanks
Amol



E:[email protected] | M: 510-449-2606 | Twitter: @*amolhkekre*

www.datatorrent.com


On Mon, Apr 24, 2017 at 10:56 AM, Sanjay Pujare <[email protected]>
wrote:

> +1.
>
> If we didn't have to worry about backward compatibility, I would even make
> this new behavior the default (so maybe in v4.0 of Apex?)
>
> On Mon, Apr 24, 2017 at 8:57 AM, Pramod Immaneni <[email protected]>
> wrote:
>
> > In a failure scenario, when a container fails, it is redeployed along with
> > all the operators in it. The operators downstream of these operators are
> > also redeployed within their containers. The operators are restored from
> > their checkpoint and connect to the appropriate point in the stream
> > according to the processing mode. In at-least-once mode, for example, the
> > data is replayed from the same checkpoint.
> >
> > Restoring an operator's state from a checkpoint can be costly, depending
> > on the size of the state. In some use cases, based on the operator logic,
> > an operator whose state is left as is after an upstream failure (that is,
> > not restored from the checkpoint) will still produce the same results
> > when the data is replayed from the last fully processed window. This is
> > true of some operators in batch use cases. The operator state can be kept
> > as it was before the upstream failure by reusing the same operator
> > instance, with only the streams and the window reset to the window after
> > the last fully processed window to guarantee at-least-once processing of
> > tuples. If the container where the operator itself is running goes down,
> > the operator would of course need to be restored from the checkpoint.
> >
> > I would like to propose adding the ability for a user to explicitly
> > identify operators of this type, along with the corresponding
> > functionality in the engine to handle their recovery as described above:
> > not restoring their state from the checkpoint, reusing the instance, and
> > restoring the stream to the window after the last fully processed window
> > for the operator. When operators are not identified as being of this
> > type, the default behavior stays as it is today and nothing changes.
> >
> > I have done some prototyping on the engine side to ensure that this is
> > possible with our current code base without requiring a massive overhaul.
> > In particular, this covers restoring the operator instance within the
> > Node in the streaming container and re-establishing the subscriber stream
> > at a window in the buffer server that the publisher (upstream) hasn't yet
> > reached, since it would be restarting from its checkpoint. I have been
> > able to get it all working successfully.
> >
> > Thanks
> >
>
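
For readers following the thread, the recovery decision proposed above can be sketched roughly as below. This is only an illustration, not engine code: the class, enum, and method names are hypothetical, window IDs are simplified to plain counters, and resuming at the window after the checkpoint in the default path is an assumption of the sketch.

```java
// Hypothetical sketch of the proposed recovery decision; none of these
// names come from the Apex code base.
public class RecoverySketch {

    enum Action { RESTORE_FROM_CHECKPOINT, REUSE_INSTANCE }

    /** What to do with the operator and which window its input resumes at. */
    static final class Plan {
        final Action action;
        final long resumeWindow;

        Plan(Action action, long resumeWindow) {
            this.action = action;
            this.resumeWindow = resumeWindow;
        }
    }

    /**
     * @param ownContainerFailed       true if the operator's own container went down
     * @param reuseOnUpstreamFailure   true if the user explicitly opted this operator in
     * @param checkpointWindow         window of the operator's last checkpoint (simplified counter)
     * @param lastFullyProcessedWindow last window the operator fully processed (simplified counter)
     */
    static Plan plan(boolean ownContainerFailed,
                     boolean reuseOnUpstreamFailure,
                     long checkpointWindow,
                     long lastFullyProcessedWindow) {
        // Opted-in operator, only upstream failed: keep the live instance and
        // reset the stream to the window after the last fully processed one.
        if (!ownContainerFailed && reuseOnUpstreamFailure) {
            return new Plan(Action.REUSE_INSTANCE, lastFullyProcessedWindow + 1);
        }
        // Default path (unchanged behavior): restore state from the checkpoint.
        // Resuming at checkpointWindow + 1 is an assumption of this sketch.
        return new Plan(Action.RESTORE_FROM_CHECKPOINT, checkpointWindow + 1);
    }

    public static void main(String[] args) {
        Plan p = plan(false, true, 100L, 140L);
        System.out.println(p.action + " from window " + p.resumeWindow);
    }
}
```

The opt-in flag defaults to the existing behavior, which matches the point made in the replies that the default should not change without explicit user involvement.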
