Enhance batch support - batch demarcation

Thomas Weise Mon, 28 Dec 2015 00:04:35 -0800

The following JIRA is open to enhance batch support:

https://issues.apache.org/jira/browse/APEXCORE-235


One of the challenges with batch on Apex today is that there isn't any
native support to identify the begin/end of a batch and associate actions
with it. For example, at the beginning we may want to fetch some data needed
for all subsequent processing and at the end perform some finalization
action or push to an external system (add a partition to a Hive table or
similar).

Absent native support, the workaround is to add a bunch of ports and extra
operators for propagation and synchronization purposes, which makes
building a batch application with standard operators, or developing
custom operators, rather difficult and inefficient.

The span of a batch can also be seen as a user-defined window, with logic
for begin and end. The current "application window" support is limited to a
multiple of the streaming window on a per-operator basis. In the batch case,
the boundary needs to be more flexible - user code needs to be able to
determine begin/endWindow based on external data (existence of files etc.).
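
To illustrate what "user code determines the boundary" could look like,
here is a minimal sketch in which an input operator polls an external
condition (for example the existence of a marker file) and turns it into
begin/end signals. BatchBoundaryDetector and BoundarySignal are hypothetical
names for illustration only; nothing like this exists in Apex today.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical signal an input operator would hand to the engine. */
enum BoundarySignal { BEGIN, END, NONE }

/**
 * Hypothetical helper: a batch is open while the external "done" condition
 * is false, and it ends the first time the condition becomes true.
 */
class BatchBoundaryDetector {
    private final BooleanSupplier batchDone; // e.g. "marker file exists"
    private boolean open;                    // true while inside a batch

    BatchBoundaryDetector(BooleanSupplier batchDone) {
        this.batchDone = batchDone;
    }

    /** Called periodically by user code, e.g. once per streaming window. */
    BoundarySignal poll() {
        boolean done = batchDone.getAsBoolean();
        if (!open && !done) {
            open = true;
            return BoundarySignal.BEGIN;
        }
        if (open && done) {
            open = false;
            return BoundarySignal.END;
        }
        return BoundarySignal.NONE;
    }
}
```

With a real file system the condition could be supplied as
`() -> java.nio.file.Files.exists(markerPath)`, where markerPath is whatever
path the upstream job writes on completion (again, an assumption).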

There is another commonality with application window, and that's alignment
of checkpointing. For batches where it is more efficient to redo the
processing instead of checkpointing potentially large amounts of
intermediate state for incremental recovery, it would be nice to be able to
say "user window == checkpoint interval".

This is to float the idea of having a window control that can be influenced
by user code. An operator that identifies the batch boundary tells the
engine about it and corresponding control tuples are submitted through the
stream, leading to callbacks on downstream operators. These control tuples
should be able to carry contextual information that can be used in
downstream operator logic (file names, schema information etc.).

I don't expect the current beginWindow/endWindow can be augmented in a
backward compatible way to accommodate this, but a similar optional
interface could be supported to enable batch aware operators and
checkpointing optimization.
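
As a rough sketch of what such an optional interface might look like:
the names BatchAware, beginBatch and endBatch are made up for this
illustration and are not part of the Apex API, and BatchDemo merely stands
in for the engine delivering control tuples in stream order. The context map
represents the contextual information (file names etc.) carried by the
control tuple.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical optional interface for batch-aware operators. */
interface BatchAware {
    void beginBatch(Map<String, String> context);
    void endBatch(Map<String, String> context);
}

/** Example downstream operator that counts tuples per batch. */
class CountingOperator implements BatchAware {
    final List<String> log = new ArrayList<>();
    private long count;

    @Override
    public void beginBatch(Map<String, String> context) {
        count = 0; // reset per-batch state
        log.add("begin " + context.get("file"));
    }

    void process(String tuple) {
        count++;
    }

    @Override
    public void endBatch(Map<String, String> context) {
        // finalization hook, e.g. push results to an external system
        log.add("end " + context.get("file") + " count=" + count);
    }
}

/** Stand-in for the engine: control tuples arrive in stream order. */
class BatchDemo {
    static List<String> run() {
        CountingOperator op = new CountingOperator();
        Map<String, String> ctx = Collections.singletonMap("file", "part-0001");
        op.beginBatch(ctx);
        op.process("a");
        op.process("b");
        op.endBatch(ctx);
        return op.log;
    }
}
```

Operators that do not implement the interface would be unaffected, which is
how this could stay backward compatible with the existing
beginWindow/endWindow callbacks.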

Thoughts?
