The following JIRA is open to enhance batch support: https://issues.apache.org/jira/browse/APEXCORE-235
One of the challenges with batch on Apex today is that there isn't any native support to identify the begin/end of a batch and associate actions with it. For example, at the beginning we may want to fetch some data needed for all subsequent processing, and at the end perform some finalization action or push to an external system (add a partition to a Hive table or similar). Absent native support, the workaround is to add a bunch of ports and extra operators for propagation and synchronization purposes, which makes building a batch application with standard operators, or developing custom operators, rather difficult and inefficient.

The span of a batch can also be seen as a user-defined window, with logic for begin and end. The current "application window" support is limited to a multiple of the streaming window on a per-operator basis. In the batch case, the boundary needs to be more flexible: user code needs to be able to determine begin/endWindow based on external data (existence of files etc.).

There is another commonality with the application window, and that's alignment of checkpointing. For batches where it is more efficient to redo the processing instead of checkpointing potentially large amounts of intermediate state for incremental recovery, it would be nice to be able to say "user window == checkpoint interval".

This is to float the idea of having a window control that can be influenced by user code. An operator that identifies the batch boundary tells the engine about it, and corresponding control tuples are submitted through the stream, leading to callbacks on downstream operators. These control tuples should be able to carry contextual information that can be used in downstream operator logic (file names, schema information etc.).

I don't expect that the current beginWindow/endWindow can be augmented in a backward compatible way to accommodate this, but a similar optional interface could be supported to enable batch-aware operators and checkpointing optimization.

Thoughts?
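
To make the proposal a bit more concrete, here is a rough sketch of what such an optional interface might look like. None of this exists in Apex; BatchAware, BatchContext, BeginBatch, EndBatch and the deliver() loop are hypothetical names made up for illustration, and the loop is only a stand-in for the engine handing control tuples to a downstream operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: every type here is hypothetical, not an Apex API.
class BatchControlSketch {

  /** Contextual information carried by a batch control tuple (file names etc.). */
  static final class BatchContext {
    final String batchId;
    final Map<String, String> attributes;

    BatchContext(String batchId, Map<String, String> attributes) {
      this.batchId = batchId;
      this.attributes = attributes;
    }
  }

  /** Optional interface an operator could implement to become batch aware. */
  interface BatchAware {
    void beginBatch(BatchContext ctx);
    void endBatch(BatchContext ctx);
  }

  /** Control tuples that travel through the stream alongside the data. */
  static final class BeginBatch {
    final BatchContext ctx;
    BeginBatch(BatchContext ctx) { this.ctx = ctx; }
  }

  static final class EndBatch {
    final BatchContext ctx;
    EndBatch(BatchContext ctx) { this.ctx = ctx; }
  }

  /** Downstream operator: resets state at batch begin, finalizes at batch end. */
  static final class CountingOperator implements BatchAware {
    final List<String> log = new ArrayList<>();
    private long count;

    @Override
    public void beginBatch(BatchContext ctx) {
      count = 0;
      log.add("begin " + ctx.batchId + " file=" + ctx.attributes.get("file"));
    }

    void process(String tuple) {
      count++;
    }

    @Override
    public void endBatch(BatchContext ctx) {
      // A real operator might add a Hive partition or push to an external system here.
      log.add("end " + ctx.batchId + " count=" + count);
    }
  }

  /** Stand-in for the engine: turns control tuples into callbacks on the operator. */
  static List<String> deliver(List<Object> stream, CountingOperator op) {
    for (Object tuple : stream) {
      if (tuple instanceof BeginBatch) {
        op.beginBatch(((BeginBatch) tuple).ctx);
      } else if (tuple instanceof EndBatch) {
        op.endBatch(((EndBatch) tuple).ctx);
      } else {
        op.process((String) tuple);
      }
    }
    return op.log;
  }

  /** One batch of three records, bracketed by begin/end control tuples. */
  static List<String> runDemo() {
    BatchContext ctx = new BatchContext("batch-1", Map.of("file", "input-1.txt"));
    List<Object> stream = List.of(new BeginBatch(ctx), "a", "b", "c", new EndBatch(ctx));
    return deliver(stream, new CountingOperator());
  }

  public static void main(String[] args) {
    runDemo().forEach(System.out::println);
  }
}
```

The point of the sketch is that the downstream operator needs no extra ports or synchronization operators: the batch boundary and its context arrive in-band, in order with the data. How the upstream operator would signal the boundary to the engine, and how this would align with checkpointing, is left open here.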
