In Splittable DoFn, the trySplit[1] API in RestrictionTracker is for
performing checkpointing, keeping primary as current restriction and
returning residuals. In the DoFn, you can do Splittable DoFn initiated
checkpoint by returning ProcessContinuation.resume()[2]. Beam programming
guide[3] also talks about Splittable DoFn initiated checkpoint and runner
initiated checkpoint.

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/RestrictionTracker.java#L72-L108
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L1297-L1333
[3]
https://beam.apache.org/documentation/programming-guide/#splittable-dofns

On Sun, Nov 29, 2020 at 10:28 PM Vincent Marquez <vincent.marq...@gmail.com>
wrote:

> Regarding checkpointing:
>
> I'm confused how the Splittable DoFn can make use of checkpoints to resume
> and not have data loss.  Unlike the old API that had a very easy to
> understand method called 'getCheckpointMark' that allows me to return the
> completed work, I don't see where that is done with the current API.
>
> I tried looking at the OffsetRangeTracker and how it is used by Kafka but
> I'm failing to understand it.  The process method takes the
> RestrictionTracker, but there isn't a way I see for the OffsetRangeTracker
> to represent half completed work (in the event of an exception/crash during
> a previous 'process' method run.   Is there some documentation that could
> help me understand this part?  Thanks in advance.
>
> *~Vincent*
>
>
> On Thu, Nov 26, 2020 at 2:01 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> Just want to mention that we have been working with Vincent in the
>> ReadAll implementation for Cassandra based on normal DoFn, and we
>> expect to get it merged for the next release of Beam. Vincent is
>> familiarized now with DoFn based IO composition, a first step towards
>> SDF understanding. Vincent you can think of our Cassandra RingRange as
>> a Restriction in the context of SDF. Just for reference it would be
>> good to read in advance these two:
>>
>> https://beam.apache.org/blog/splittable-do-fn/
>> https://beam.apache.org/documentation/programming-guide/#sdf-basics
>>
>> Thanks Boyuan for offering your help I think it is really needed
>> considering that we don't have many Unbounded SDF connectors to use as
>> reference.
>>
>> On Thu, Nov 19, 2020 at 11:16 PM Boyuan Zhang <boyu...@google.com> wrote:
>> >
>> >
>> >
>> > On Thu, Nov 19, 2020 at 1:29 PM Vincent Marquez <
>> vincent.marq...@gmail.com> wrote:
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, Nov 19, 2020 at 10:38 AM Boyuan Zhang <boyu...@google.com>
>> wrote:
>> >>>
>> >>> Hi Vincent,
>> >>>
>> >>> Thanks for your contribution! I'm happy to work with you on this when
>> you contribute the code into Beam.
>> >>
>> >>
>> >> Should I write up a JIRA to start?  I have access, I've already been
>> in the process of contributing some big changes to the CassandraIO
>> connector.
>> >
>> >
>> > Yes, please create a JIRA and assign it to yourself.
>> >
>> >>
>> >>
>> >>>
>> >>>
>> >>> Another thing is that it would be preferable to use Splittable DoFn
>> instead of using UnboundedSource to write a new IO.
>> >>
>> >>
>> >> I would prefer to use the UnboundedSource connector, I've already
>> written most of it, but also, I see some challenges using Splittable DoFn
>> for Redis streams.
>> >>
>> >> Unlike Kafka and Kinesis, Redis Streams offsets are not simply
>> monotonically increasing counters, so there is not a way  to just claim a
>> chunk of work and know that the chunk has any actual data in it.
>> >>
>> >> Since UnboundedSource is not yet deprecated, could I contribute that
>> after finishing up some test aspects, and then perhaps we can implement a
>> Splittable DoFn version?
>> >
>> >
>> > It would be nice not to build new IOs on top of UnboundedSource.
>> Currently we already have the wrapper class which translates the existing
>> UnboundedSource into Unbounded Splittable DoFn and executes the
>> UnboundedSource as the Splittable DoFn. How about you open a WIP PR and we
>> go through the UnboundedSource implementation together to figure out a
>> design for using Splittable DoFn?
>> >
>> >
>> >>
>> >>
>> >>
>> >>>
>> >>>
>> >>> On Thu, Nov 19, 2020 at 10:18 AM Vincent Marquez <
>> vincent.marq...@gmail.com> wrote:
>> >>>>
>> >>>> Currently, Redis offers a streaming queue functionality similar to
>> Kafka/Kinesis/Pubsub, but Beam does not currently have a connector for it.
>> >>>>
>> >>>> I've written an UnboundedSource connector that makes use of Redis
>> Streams as a POC and it seems to work well.
>> >>>>
>> >>>> If someone is willing to work with me, I could write up a JIRA
>> and/or open up a WIP pull request if there is interest in getting this as
>> an official connector.  I would mostly need guidance on naming/testing
>> aspects.
>> >>>>
>> >>>> https://redis.io/topics/streams-intro
>> >>>>
>> >>>> ~Vincent
>> >>
>> >>
>> >> ~Vincent
>>
>

Reply via email to