Certainly, there are a number of patterns for which careful co-design using the pipeline structure in conjunction with the output system gives the best semantics. It may be worth looking at the (batch) Write transform to see how the sink is split into 1) single initialize, 2) parallel write, 3) single finalize, and at FileBasedSink to see how it uses temp files and an idempotent rename to get correctness.

The above are all done within the existing model's semantics, using Create, side inputs, and the startBundle/processElement/finishBundle design patterns. Admittedly, this is harder for some external systems, and the correctness guarantees can depend on the APIs they provide, e.g. whether they support atomic bulk operations such as a database bulk load or a filesystem rename.

I need to spend some time reviewing the Flink RollingSink, but it may be that we can get close within the model already, or that the right application of the forthcoming state APIs will help with the internal checkpointing that Aljoscha is probably referencing.

Dan

On Tue, May 3, 2016 at 10:15 AM, Raghu Angadi <rang...@google.com.invalid> wrote:
> agreed. finishBundle() helps but can not guarantee consistent state.
>
> On Tue, May 3, 2016 at 1:49 AM, Maximilian Michels <m...@apache.org> wrote:
>
> > Correct, Kafka doesn't support rollbacks of the producer. In Flink
> > there is the RollingSink which supports transactional rolling files.
> > Admittedly, that is the only one. Still, checkpointing sinks in Beam
> > could be useful for users who are concerned about exactly-once
> > semantics. I'm not sure whether we can implement something similar
> > with the bundle mechanism.
> >
> > On Mon, May 2, 2016 at 11:50 PM, Raghu Angadi <rang...@google.com.invalid> wrote:
> > > What are good examples of streaming sinks that support checkpointing
> > > (or transactions/rollbacks)? I don't think Kafka supports a rollback.
> > >
> > > On Mon, May 2, 2016 at 2:54 AM, Maximilian Michels <m...@apache.org> wrote:
> > >
> > >> Yes, I would expect sinks to provide additional interfaces similar
> > >> to sources, e.g. checkpointing. We could also use the
> > >> startBundle/processElement/finishBundle lifecycle methods to implement
> > >> checkpointing. I just wonder if we want to make it more explicit.
> > >> Also, does it make sense that sinks can return a PCollection? You can
> > >> return PDone but you don't have to.
> > >>
> > >> Since sinks are fundamental in streaming pipelines, it just seemed odd
> > >> to me that there is no dedicated interface. I understand a bit more
> > >> clearly now that it is not viewed as crucial because we can use
> > >> existing primitives to create sinks. In a way, that might be elegant
> > >> but also less explicit.
> > >>
> > >> On Fri, Apr 29, 2016 at 11:00 PM, Frances Perry <f...@google.com.invalid> wrote:
> > >> >>
> > >> >> @Frances Sources are not simple DoFns. They add additional
> > >> >> functionality, e.g. checkpointing, watermark generation, creating
> > >> >> splits. If we want sinks to be portable, we should think about a
> > >> >> dedicated interface. At least for the checkpointing.
> > >> >>
> > >> >
> > >> > We might be mixing sources and sinks in this conversation. ;-) Sources
> > >> > definitely provide additional functionality as you mentioned. But at
> > >> > least currently, sinks don't provide any new primitive functionality.
> > >> > Are you suggesting there needs to be a checkpointing interface for
> > >> > sinks beyond DoFn's bundle finalization? (Note that the existing Write
> > >> > for batch is just a PTransform based around ParDo.)
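For what it's worth, here is a minimal, hypothetical sketch of the "parallel write" step of the initialize/write/finalize pattern Dan describes at the top of this message, using the Beam Java SDK's DoFn bundle lifecycle. The class name TempFileBundleWriteFn, the local temp files, and the downstream finalize step it alludes to are illustrative assumptions, not actual Beam APIs: each bundle buffers its records into a uniquely named temp file and emits the file's path from finishBundle, so that a separate single finalize step (built with Create and side inputs, as in the real Write transform) could rename the surviving files into place.

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.joda.time.Instant;

/**
 * Illustrative sketch only: the "parallel write" middle step of the
 * initialize / write / finalize split. Each bundle is buffered into its
 * own temp file, and the temp file's path is emitted so a separate,
 * single finalize step (not shown) can rename the surviving files.
 */
public class TempFileBundleWriteFn extends DoFn<String, String> {

  private transient Path tempFile;
  private transient Writer writer;

  @StartBundle
  public void startBundle() throws IOException {
    // One uniquely named temp file per bundle, so concurrent or retried
    // bundles never collide. A real sink would use a shared filesystem.
    tempFile = Files.createTempFile("beam-bundle-", ".tmp");
    writer = Files.newBufferedWriter(tempFile, StandardCharsets.UTF_8);
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    writer.write(c.element());
    writer.write('\n');
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext c) throws IOException {
    writer.close();
    // Only bundles that finish successfully emit their temp path; temp
    // files from failed attempts are simply never finalized. Global
    // window is assumed here for simplicity.
    c.output(tempFile.toString(), Instant.now(), GlobalWindow.INSTANCE);
  }
}

The point is the one Dan makes above: correctness comes from doing the final, visible rename in a single finalize step that is safe to retry, not from the per-bundle writes themselves.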