Re: Checkpointing on Google Cloud Dataflow Runner

2022-08-30 Thread Reuven Lax via user
Snapshots are expected to happen nearly instantaneously. Although
processing is paused while a snapshot is in progress, the pause should
usually be very brief. It's true that Dataflow does not support automated
snapshots; you would have to create them yourself, for example from a
cron job.
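
As a rough sketch of the do-it-yourself scheduling, assuming the gcloud CLI
is installed and authenticated on the machine that runs cron: the script
below wraps `gcloud dataflow snapshots create`. The job ID, region, TTL,
script path, and hourly schedule are all placeholders for illustration, not
recommendations.

```python
import shlex
import subprocess


def snapshot_command(job_id, region, ttl="7d"):
    """Build the gcloud invocation that snapshots one streaming job."""
    return [
        "gcloud", "dataflow", "snapshots", "create",
        f"--job-id={job_id}",
        f"--region={region}",
        f"--snapshot-ttl={ttl}",
    ]


def create_snapshot(job_id, region, ttl="7d"):
    """Run the command; raises CalledProcessError if gcloud fails."""
    subprocess.run(snapshot_command(job_id, region, ttl), check=True)


# Hypothetical crontab entry (the script path is a placeholder):
#   0 * * * * python3 /opt/scripts/snapshot_job.py
if __name__ == "__main__":
    # Print rather than execute, so the sketch is safe to run anywhere.
    # Swap in create_snapshot(...) with a real job ID and region to use it.
    print(shlex.join(snapshot_command("my-job-id", "us-central1")))
```

Keeping the command construction separate from the subprocess call makes it
easy to check the arguments before anything touches a live job.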

Checkpoints on Flink aren't simply an automated snapshot mechanism.
Checkpoints are how Flink implements consistent, exactly-once processing.
Dataflow, on the other hand, continuously checkpoints records, so it
doesn't need global checkpoints for exactly-once processing.

Reuven

On Tue, Aug 30, 2022 at 5:10 AM Will Baker  wrote:

> I looked into snapshots and they do seem useful for providing a means
> to save state and resume, however they aren't as seamless as I was
> hoping for with the automatic checkpointing that is supported by other
> runners. It looked like snapshots would be user initiated and would
> pause the pipeline while the snapshot was being created. I could
> imagine how this would be set up on an automated schedule, but would
> still prefer something more light-weight like checkpoints.
>
> On Mon, Aug 29, 2022 at 8:11 PM Reuven Lax  wrote:
> >
> > Google Cloud Dataflow does support snapshots. Is this what you were
> > looking for?
> >
> > On Mon, Aug 29, 2022 at 4:04 PM Kenneth Knowles  wrote:
> >>
> >> Hi Will, David,
> >>
> >> I think you'll find the best source of answer for this sort of question
> >> on the user@beam list. I've put that in the To: line with a BCC: to the
> >> dev@beam list so everyone knows they can find the thread there. If I have
> >> misunderstood, and your question has to do with building Beam itself, feel
> >> free to move it back.
> >>
> >> Kenn
> >>
> >> On Mon, Aug 29, 2022 at 2:24 PM Will Baker  wrote:
> >>>
> >>> Hello!
> >>>
> >>> I am wondering about using checkpoints with Beam running on Google
> >>> Cloud Dataflow.
> >>>
> >>> The docs indicate that checkpoints are not supported by Google Cloud
> >>> Dataflow:
> >>> https://beam.apache.org/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/
> >>>
> >>> Is there a recommended approach to handling checkpointing on Google
> >>> Cloud Dataflow when using streaming sources like Kinesis and Kafka, so
> >>> that a pipeline could be resumed from where it left off if it needs to
> >>> be stopped or crashes for some reason?
> >>>
> >>> Thanks!
> >>> Will Baker
>


Re: Checkpointing on Google Cloud Dataflow Runner

2022-08-30 Thread Will Baker
I looked into snapshots and they do seem useful for providing a means
to save state and resume, however they aren't as seamless as I was
hoping for with the automatic checkpointing that is supported by other
runners. It looked like snapshots would be user initiated and would
pause the pipeline while the snapshot was being created. I could
imagine how this would be set up on an automated schedule, but would
still prefer something more light-weight like checkpoints.

On Mon, Aug 29, 2022 at 8:11 PM Reuven Lax  wrote:
>
> Google Cloud Dataflow does support snapshots. Is this what you were looking 
> for?
>
> On Mon, Aug 29, 2022 at 4:04 PM Kenneth Knowles  wrote:
>>
>> Hi Will, David,
>>
>> I think you'll find the best source of answer for this sort of question on 
>> the user@beam list. I've put that in the To: line with a BCC: to the 
>> dev@beam list so everyone knows they can find the thread there. If I have 
>> misunderstood, and your question has to do with building Beam itself, feel 
>> free to move it back.
>>
>> Kenn
>>
>> On Mon, Aug 29, 2022 at 2:24 PM Will Baker  wrote:
>>>
>>> Hello!
>>>
>>> I am wondering about using checkpoints with Beam running on Google
>>> Cloud Dataflow.
>>>
>>> The docs indicate that checkpoints are not supported by Google Cloud
>>> Dataflow:  
>>> https://beam.apache.org/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/
>>>
>>> Is there a recommended approach to handling checkpointing on Google
>>> Cloud Dataflow when using streaming sources like Kinesis and Kafka, so
>>> that a pipeline could be resumed from where it left off if it needs to
>>> be stopped or crashes for some reason?
>>>
>>> Thanks!
>>> Will Baker


Re: Checkpointing on Google Cloud Dataflow Runner

2022-08-29 Thread Reuven Lax via user
Google Cloud Dataflow does support snapshots. Is this what you were
looking for?

On Mon, Aug 29, 2022 at 4:04 PM Kenneth Knowles  wrote:

> Hi Will, David,
>
> I think you'll find the best source of answer for this sort of question on
> the user@beam list. I've put that in the To: line with a BCC: to the
> dev@beam list so everyone knows they can find the thread there. If I have
> misunderstood, and your question has to do with building Beam itself, feel
> free to move it back.
>
> Kenn
>
> On Mon, Aug 29, 2022 at 2:24 PM Will Baker  wrote:
>
>> Hello!
>>
>> I am wondering about using checkpoints with Beam running on Google
>> Cloud Dataflow.
>>
>> The docs indicate that checkpoints are not supported by Google Cloud
>> Dataflow:
>> https://beam.apache.org/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/
>>
>> Is there a recommended approach to handling checkpointing on Google
>> Cloud Dataflow when using streaming sources like Kinesis and Kafka, so
>> that a pipeline could be resumed from where it left off if it needs to
>> be stopped or crashes for some reason?
>>
>> Thanks!
>> Will Baker
>>
>


Re: Checkpointing on Google Cloud Dataflow Runner

2022-08-29 Thread Kenneth Knowles
Hi Will, David,

I think you'll find the best source of answer for this sort of question on
the user@beam list. I've put that in the To: line with a BCC: to the
dev@beam list so everyone knows they can find the thread there. If I have
misunderstood, and your question has to do with building Beam itself, feel
free to move it back.

Kenn

On Mon, Aug 29, 2022 at 2:24 PM Will Baker  wrote:

> Hello!
>
> I am wondering about using checkpoints with Beam running on Google
> Cloud Dataflow.
>
> The docs indicate that checkpoints are not supported by Google Cloud
> Dataflow:
> https://beam.apache.org/documentation/runners/capability-matrix/additional-common-features-not-yet-part-of-the-beam-model/
>
> Is there a recommended approach to handling checkpointing on Google
> Cloud Dataflow when using streaming sources like Kinesis and Kafka, so
> that a pipeline could be resumed from where it left off if it needs to
> be stopped or crashes for some reason?
>
> Thanks!
> Will Baker
>