Regarding checkpointing:

I'm confused how the Splittable DoFn can make use of checkpoints to resume
and not have data loss.  Unlike the old API that had a very easy to
understand method called 'getCheckpointMark' that allows me to return the
completed work, I don't see where that is done with the current API.

I tried looking at the OffsetRangeTracker and how it is used by Kafka but
I'm failing to understand it.  The process method takes the
RestrictionTracker, but there isn't a way I see for the OffsetRangeTracker
to represent half completed work (in the event of an exception/crash during
a previous 'process' method run.   Is there some documentation that could
help me understand this part?  Thanks in advance.

*~Vincent*


On Thu, Nov 26, 2020 at 2:01 PM Ismaël Mejía <[email protected]> wrote:

> Just want to mention that we have been working with Vincent in the
> ReadAll implementation for Cassandra based on normal DoFn, and we
> expect to get it merged for the next release of Beam. Vincent is
> familiarized now with DoFn based IO composition, a first step towards
> SDF understanding. Vincent you can think of our Cassandra RingRange as
> a Restriction in the context of SDF. Just for reference it would be
> good to read in advance these two:
>
> https://beam.apache.org/blog/splittable-do-fn/
> https://beam.apache.org/documentation/programming-guide/#sdf-basics
>
> Thanks Boyuan for offering your help I think it is really needed
> considering that we don't have many Unbounded SDF connectors to use as
> reference.
>
> On Thu, Nov 19, 2020 at 11:16 PM Boyuan Zhang <[email protected]> wrote:
> >
> >
> >
> > On Thu, Nov 19, 2020 at 1:29 PM Vincent Marquez <
> [email protected]> wrote:
> >>
> >>
> >>
> >>
> >> On Thu, Nov 19, 2020 at 10:38 AM Boyuan Zhang <[email protected]>
> wrote:
> >>>
> >>> Hi Vincent,
> >>>
> >>> Thanks for your contribution! I'm happy to work with you on this when
> you contribute the code into Beam.
> >>
> >>
> >> Should I write up a JIRA to start?  I have access, I've already been in
> the process of contributing some big changes to the CassandraIO connector.
> >
> >
> > Yes, please create a JIRA and assign it to yourself.
> >
> >>
> >>
> >>>
> >>>
> >>> Another thing is that it would be preferable to use Splittable DoFn
> instead of using UnboundedSource to write a new IO.
> >>
> >>
> >> I would prefer to use the UnboundedSource connector, I've already
> written most of it, but also, I see some challenges using Splittable DoFn
> for Redis streams.
> >>
> >> Unlike Kafka and Kinesis, Redis Streams offsets are not simply
> monotonically increasing counters, so there is not a way  to just claim a
> chunk of work and know that the chunk has any actual data in it.
> >>
> >> Since UnboundedSource is not yet deprecated, could I contribute that
> after finishing up some test aspects, and then perhaps we can implement a
> Splittable DoFn version?
> >
> >
> > It would be nice not to build new IOs on top of UnboundedSource.
> Currently we already have the wrapper class which translates the existing
> UnboundedSource into Unbounded Splittable DoFn and executes the
> UnboundedSource as the Splittable DoFn. How about you open a WIP PR and we
> go through the UnboundedSource implementation together to figure out a
> design for using Splittable DoFn?
> >
> >
> >>
> >>
> >>
> >>>
> >>>
> >>> On Thu, Nov 19, 2020 at 10:18 AM Vincent Marquez <
> [email protected]> wrote:
> >>>>
> >>>> Currently, Redis offers a streaming queue functionality similar to
> Kafka/Kinesis/Pubsub, but Beam does not currently have a connector for it.
> >>>>
> >>>> I've written an UnboundedSource connector that makes use of Redis
> Streams as a POC and it seems to work well.
> >>>>
> >>>> If someone is willing to work with me, I could write up a JIRA and/or
> open up a WIP pull request if there is interest in getting this as an
> official connector.  I would mostly need guidance on naming/testing aspects.
> >>>>
> >>>> https://redis.io/topics/streams-intro
> >>>>
> >>>> ~Vincent
> >>
> >>
> >> ~Vincent
>

Reply via email to