Hi,
I have pretty much the same setup. Regarding Terraform and Dataflow on GCP:
- `terraform apply` checks whether there is already a Dataflow job running
with the same `job_name`.
- If there is not, it creates a new one and waits until it is in the
"running" state.
- If there is one already, it tries to update the job. That means it
creates a new job with the same `job_name` running the new version of
the code, and sends an "update" signal to the old one. The old job then
halts, waits for the new one to fully start, and transfers its state to
it. Once that is done, the old job goes into the "updated" state and the
new one processes messages. If the new one fails, the old one resumes
processing.
- Note: for this to work, the new code must be compatible with the old
job. If it is not, the new job will fail, and the old job will fall
slightly behind because it had to wait for the new job to fail.
- Note 2: there is a way to run only a compatibility check, so that no
new job is started but you still verify that the new code is compatible
with the old job, avoiding possible delays in the old pipeline.
- Note 3: there is an entirely separate update type called an "in-flight
update", which does not change the job itself but allows changing
autoscaler parameters (like the maximum number of workers) without
introducing any delays in the pipeline.
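To make the update flow above concrete, here is a minimal sketch of the
arguments such an update submission carries, assuming the Beam Python
SDK's standard Dataflow runner flags (`--update`, `--job_name`); the
project, region, and job names are placeholders. Terraform effectively
does the equivalent of this under the hood when the `job_name` matches:

```python
def dataflow_update_args(job_name, project, region):
    """Build pipeline arguments for an in-place Dataflow job update.

    The --update flag combined with a --job_name that matches a running
    job is what tells the Dataflow service to replace that job (the
    update flow described above) instead of starting an unrelated job.
    """
    return [
        "--runner=DataflowRunner",
        "--project=" + project,
        "--region=" + region,
        "--job_name=" + job_name,  # must match the running job's name
        "--update",                # request the update flow, not a fresh job
    ]

args = dataflow_update_args("my-streaming-job", "my-project", "us-central1")
```

Without `--update`, the same submission would simply start a second,
independent job; with it, an incompatible graph makes the new job fail
and the old one resume, as described in the notes above.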
Given the above context, more information is needed to fully diagnose
your issue, but you might be hitting the problem mentioned by Robert:
- If you point PubSubIO at a topic, each new job creates its own
subscription on that topic at graph construction time. A Pub/Sub
subscription only receives messages published after it was created, so
if there are messages in the old subscription that were not yet
processed (and acked) by the old pipeline when it gets the "update"
signal and halts, there is a window during which messages are delivered
to the old subscription but never reach the new one.
Workarounds:
- pass a subscription to PubSubIO instead of a topic, or
- use random job names in Terraform and drain the old pipelines.
Note that all of the above is just a hypothesis, but hopefully it is
helpful.
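A minimal sketch of the first workaround, mirroring the rule of the Beam
Python SDK's ReadFromPubSub (which accepts either `topic=` or
`subscription=`, but not both); the subscription path below is a
placeholder. Reading from a pre-created subscription means the
subscription outlives any individual job, so messages buffered during an
update are still there for the replacement job:

```python
def pubsub_read_kwargs(topic=None, subscription=None):
    """Choose the PubSubIO source, enforcing exactly one of topic/subscription.

    With topic=..., each job creates its own subscription at graph
    construction time (the risky case described above); with
    subscription=..., all jobs share one durable subscription.
    """
    if (topic is None) == (subscription is None):
        raise ValueError("provide exactly one of topic or subscription")
    return {"topic": topic} if topic is not None else {"subscription": subscription}

# Preferred when jobs get updated/replaced: a fixed, pre-created subscription.
kwargs = pubsub_read_kwargs(
    subscription="projects/my-project/subscriptions/my-sub"  # placeholder path
)
```

These kwargs would then be passed to ReadFromPubSub in the pipeline;
the point is simply that the subscription's lifetime is decoupled from
the job's.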
Best
Wiśniowski Piotr
On 16.04.2024 05:15, Juan Romero wrote:
The deployment of the job is done by Terraform. I verified it, and it
seems that Terraform does it incorrectly under the hood, because it
stops the current job and starts a new one. Thanks for the information!
On Mon, 15 Apr 2024 at 6:42 PM Robert Bradshaw via user
<user@beam.apache.org> wrote:
Are you draining[1] your pipeline or simply canceling it and
starting a new one? Draining should close open windows and attempt
to flush all in-flight data before shutting down. For PubSub you
may also need to read from subscriptions rather than topics to
ensure messages are processed by either one or the other.
[1]
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#drain
On Mon, Apr 15, 2024 at 9:33 AM Juan Romero <jsrf...@gmail.com> wrote:
Hi guys. Good morning.
I have been doing some tests with Apache Beam on Dataflow to
see whether I can do a hot update or hot swap while the
pipeline is processing a bunch of messages that fall within a
10-minute window. What I saw is that when I do a hot update of
the pipeline while there are still messages in the time window
(before they are sent to the target), the current job is shut
down and Dataflow creates a new one. The problem is that I seem
to be losing the messages that were being processed by the old
job; they are not picked up by the new one, which means we are
losing data.
Can you help me or recommend a strategy?
Thanks!!