In Splittable DoFn, the trySplit[1] API in RestrictionTracker is for performing checkpointing, keeping primary as current restriction and returning residuals. In the DoFn, you can do Splittable DoFn initiated checkpoint by returning ProcessContinuation.resume()[2]. Beam programming guide[3] also talks about Splittable DoFn initiated checkpoint and runner initiated checkpoint.
[1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/RestrictionTracker.java#L72-L108 [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L1297-L1333 [3] https://beam.apache.org/documentation/programming-guide/#splittable-dofns On Sun, Nov 29, 2020 at 10:28 PM Vincent Marquez <vincent.marq...@gmail.com> wrote: > Regarding checkpointing: > > I'm confused how the Splittable DoFn can make use of checkpoints to resume > and not have data loss. Unlike the old API that had a very easy to > understand method called 'getCheckpointMark' that allows me to return the > completed work, I don't see where that is done with the current API. > > I tried looking at the OffsetRangeTracker and how it is used by Kafka but > I'm failing to understand it. The process method takes the > RestrictionTracker, but there isn't a way I see for the OffsetRangeTracker > to represent half completed work (in the event of an exception/crash during > a previous 'process' method run. Is there some documentation that could > help me understand this part? Thanks in advance. > > *~Vincent* > > > On Thu, Nov 26, 2020 at 2:01 PM Ismaël Mejía <ieme...@gmail.com> wrote: > >> Just want to mention that we have been working with Vincent in the >> ReadAll implementation for Cassandra based on normal DoFn, and we >> expect to get it merged for the next release of Beam. Vincent is >> familiarized now with DoFn based IO composition, a first step towards >> SDF understanding. Vincent you can think of our Cassandra RingRange as >> a Restriction in the context of SDF. Just for reference it would be >> good to read in advance these two: >> >> https://beam.apache.org/blog/splittable-do-fn/ >> https://beam.apache.org/documentation/programming-guide/#sdf-basics >> >> Thanks Boyuan for offering your help I think it is really needed >> considering that we don't have many Unbounded SDF connectors to use as >> reference. >> >> On Thu, Nov 19, 2020 at 11:16 PM Boyuan Zhang <boyu...@google.com> wrote: >> > >> > >> > >> > On Thu, Nov 19, 2020 at 1:29 PM Vincent Marquez < >> vincent.marq...@gmail.com> wrote: >> >> >> >> >> >> >> >> >> >> On Thu, Nov 19, 2020 at 10:38 AM Boyuan Zhang <boyu...@google.com> >> wrote: >> >>> >> >>> Hi Vincent, >> >>> >> >>> Thanks for your contribution! I'm happy to work with you on this when >> you contribute the code into Beam. >> >> >> >> >> >> Should I write up a JIRA to start? I have access, I've already been >> in the process of contributing some big changes to the CassandraIO >> connector. >> > >> > >> > Yes, please create a JIRA and assign it to yourself. >> > >> >> >> >> >> >>> >> >>> >> >>> Another thing is that it would be preferable to use Splittable DoFn >> instead of using UnboundedSource to write a new IO. >> >> >> >> >> >> I would prefer to use the UnboundedSource connector, I've already >> written most of it, but also, I see some challenges using Splittable DoFn >> for Redis streams. >> >> >> >> Unlike Kafka and Kinesis, Redis Streams offsets are not simply >> monotonically increasing counters, so there is not a way to just claim a >> chunk of work and know that the chunk has any actual data in it. >> >> >> >> Since UnboundedSource is not yet deprecated, could I contribute that >> after finishing up some test aspects, and then perhaps we can implement a >> Splittable DoFn version? >> > >> > >> > It would be nice not to build new IOs on top of UnboundedSource. >> Currently we already have the wrapper class which translates the existing >> UnboundedSource into Unbounded Splittable DoFn and executes the >> UnboundedSource as the Splittable DoFn. How about you open a WIP PR and we >> go through the UnboundedSource implementation together to figure out a >> design for using Splittable DoFn? >> > >> > >> >> >> >> >> >> >> >>> >> >>> >> >>> On Thu, Nov 19, 2020 at 10:18 AM Vincent Marquez < >> vincent.marq...@gmail.com> wrote: >> >>>> >> >>>> Currently, Redis offers a streaming queue functionality similar to >> Kafka/Kinesis/Pubsub, but Beam does not currently have a connector for it. >> >>>> >> >>>> I've written an UnboundedSource connector that makes use of Redis >> Streams as a POC and it seems to work well. >> >>>> >> >>>> If someone is willing to work with me, I could write up a JIRA >> and/or open up a WIP pull request if there is interest in getting this as >> an official connector. I would mostly need guidance on naming/testing >> aspects. >> >>>> >> >>>> https://redis.io/topics/streams-intro >> >>>> >> >>>> ~Vincent >> >> >> >> >> >> ~Vincent >> >