Regarding checkpointing: I'm confused how the Splittable DoFn can make use of checkpoints to resume and not have data loss. Unlike the old API that had a very easy to understand method called 'getCheckpointMark' that allows me to return the completed work, I don't see where that is done with the current API.
I tried looking at the OffsetRangeTracker and how it is used by Kafka but I'm failing to understand it. The process method takes the RestrictionTracker, but there isn't a way I see for the OffsetRangeTracker to represent half completed work (in the event of an exception/crash during a previous 'process' method run. Is there some documentation that could help me understand this part? Thanks in advance. *~Vincent* On Thu, Nov 26, 2020 at 2:01 PM Ismaël Mejía <[email protected]> wrote: > Just want to mention that we have been working with Vincent in the > ReadAll implementation for Cassandra based on normal DoFn, and we > expect to get it merged for the next release of Beam. Vincent is > familiarized now with DoFn based IO composition, a first step towards > SDF understanding. Vincent you can think of our Cassandra RingRange as > a Restriction in the context of SDF. Just for reference it would be > good to read in advance these two: > > https://beam.apache.org/blog/splittable-do-fn/ > https://beam.apache.org/documentation/programming-guide/#sdf-basics > > Thanks Boyuan for offering your help I think it is really needed > considering that we don't have many Unbounded SDF connectors to use as > reference. > > On Thu, Nov 19, 2020 at 11:16 PM Boyuan Zhang <[email protected]> wrote: > > > > > > > > On Thu, Nov 19, 2020 at 1:29 PM Vincent Marquez < > [email protected]> wrote: > >> > >> > >> > >> > >> On Thu, Nov 19, 2020 at 10:38 AM Boyuan Zhang <[email protected]> > wrote: > >>> > >>> Hi Vincent, > >>> > >>> Thanks for your contribution! I'm happy to work with you on this when > you contribute the code into Beam. > >> > >> > >> Should I write up a JIRA to start? I have access, I've already been in > the process of contributing some big changes to the CassandraIO connector. > > > > > > Yes, please create a JIRA and assign it to yourself. > > > >> > >> > >>> > >>> > >>> Another thing is that it would be preferable to use Splittable DoFn > instead of using UnboundedSource to write a new IO. > >> > >> > >> I would prefer to use the UnboundedSource connector, I've already > written most of it, but also, I see some challenges using Splittable DoFn > for Redis streams. > >> > >> Unlike Kafka and Kinesis, Redis Streams offsets are not simply > monotonically increasing counters, so there is not a way to just claim a > chunk of work and know that the chunk has any actual data in it. > >> > >> Since UnboundedSource is not yet deprecated, could I contribute that > after finishing up some test aspects, and then perhaps we can implement a > Splittable DoFn version? > > > > > > It would be nice not to build new IOs on top of UnboundedSource. > Currently we already have the wrapper class which translates the existing > UnboundedSource into Unbounded Splittable DoFn and executes the > UnboundedSource as the Splittable DoFn. How about you open a WIP PR and we > go through the UnboundedSource implementation together to figure out a > design for using Splittable DoFn? > > > > > >> > >> > >> > >>> > >>> > >>> On Thu, Nov 19, 2020 at 10:18 AM Vincent Marquez < > [email protected]> wrote: > >>>> > >>>> Currently, Redis offers a streaming queue functionality similar to > Kafka/Kinesis/Pubsub, but Beam does not currently have a connector for it. > >>>> > >>>> I've written an UnboundedSource connector that makes use of Redis > Streams as a POC and it seems to work well. > >>>> > >>>> If someone is willing to work with me, I could write up a JIRA and/or > open up a WIP pull request if there is interest in getting this as an > official connector. I would mostly need guidance on naming/testing aspects. > >>>> > >>>> https://redis.io/topics/streams-intro > >>>> > >>>> ~Vincent > >> > >> > >> ~Vincent >
