Hi,
I have pretty much the same setup. Regarding Terraform and Dataflow on GCP:
- `terraform apply` checks whether there is already a Dataflow job running
with the same `job_name`.
- If there is not, it creates a new one and waits until it is in the
"running" state.
- If there is one already, it tries to update the job. That means it
creates a new job with the same `job_name` running the new version of
the code, and sends an "update" signal to the old one. The old job then
halts, waits for the new one to fully start, and transfers its state to
it. Once that is done, the old job goes into the "updated" state and the
new one processes messages. If the new one fails, the old one resumes
processing.
- Note: for this to work, the new code must be compatible with the old
job. If it is not, the new job will fail, and the old job will fall
slightly behind because it had to wait for the new job to fail.
- Note 2: there is a way to run only a compatibility check, so that no
new job is started but you still verify that the new code is compatible
with the old job, avoiding possible delays in the old pipeline.
- Note 3: there is an entirely separate update type called an "in-flight
update", which does not change the job itself but allows changing
autoscaler parameters (like the maximum number of workers) without
introducing any delays in the pipeline.
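To make the update flow above concrete, here is a minimal sketch of the
arguments such an update submission carries, assuming the Beam Python
SDK's standard Dataflow runner flags (`--update`, `--job_name`); the
project, region, and job names are placeholders. Terraform effectively
does the equivalent of this under the hood when the `job_name` matches:

```python
def dataflow_update_args(job_name, project, region):
    """Build pipeline arguments for an in-place Dataflow job update.

    The --update flag combined with a --job_name that matches a running
    job is what tells the Dataflow service to replace that job (the
    update flow described above) instead of starting an unrelated job.
    """
    return [
        "--runner=DataflowRunner",
        "--project=" + project,
        "--region=" + region,
        "--job_name=" + job_name,  # must match the running job's name
        "--update",                # request the update flow, not a fresh job
    ]

args = dataflow_update_args("my-streaming-job", "my-project", "us-central1")
```

Without `--update`, the same submission would simply start a second,
independent job; with it, an incompatible graph makes the new job fail
and the old one resume, as described in the notes above.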
Given the above context, more information is needed to fully diagnose
your issue, but you might be hitting the problem mentioned by Robert:
- If you point PubSubIO at a topic, each new job creates its own
subscription on that topic at graph construction time. A Pub/Sub
subscription only receives messages published after it was created, so
if there are messages in the old subscription that were not yet
processed (and acked) by the old pipeline when it gets the "update"
signal and halts, there is a window during which messages are delivered
to the old subscription but never reach the new one.
Workarounds:
- pass a subscription to PubSubIO instead of a topic, or
- use random job names in Terraform and drain the old pipelines.
Note that all of the above is just a hypothesis, but hopefully it is
helpful.
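A minimal sketch of the first workaround, mirroring the rule of the Beam
Python SDK's ReadFromPubSub (which accepts either `topic=` or
`subscription=`, but not both); the subscription path below is a
placeholder. Reading from a pre-created subscription means the
subscription outlives any individual job, so messages buffered during an
update are still there for the replacement job:

```python
def pubsub_read_kwargs(topic=None, subscription=None):
    """Choose the PubSubIO source, enforcing exactly one of topic/subscription.

    With topic=..., each job creates its own subscription at graph
    construction time (the risky case described above); with
    subscription=..., all jobs share one durable subscription.
    """
    if (topic is None) == (subscription is None):
        raise ValueError("provide exactly one of topic or subscription")
    return {"topic": topic} if topic is not None else {"subscription": subscription}

# Preferred when jobs get updated/replaced: a fixed, pre-created subscription.
kwargs = pubsub_read_kwargs(
    subscription="projects/my-project/subscriptions/my-sub"  # placeholder path
)
```

These kwargs would then be passed to ReadFromPubSub in the pipeline;
the point is simply that the subscription's lifetime is decoupled from
the job's.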
Best
Wiśniowski Piotr
On 16.04.2024 05:15, Juan Romero wrote:
The deployment of the job is done by Terraform. I verified it, and it
seems that Terraform does it incorrectly under the hood, because it
stops the current job and starts a new one. Thanks for the information!
On Mon, 15 Apr 2024 at 6:42 PM Robert Bradshaw via user
<user@beam.apache.org> wrote:
Are you draining[1] your pipeline or simply canceling it and
starting a new one? Draining should close open windows and attempt
to flush all in-flight data before shutting down. For PubSub you
may also need to read from subscriptions rather than topics to
ensure messages are processed by either one or the other.
[1]
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#drain
On Mon, Apr 15, 2024 at 9:33 AM Juan Romero <jsrf...@gmail.com> wrote:
Hi guys. Good morning.
I have been doing some tests with Apache Beam on Dataflow to
see whether I can do a hot update or hot swap while the
pipeline is processing a bunch of messages that fall within a
10-minute window. What I saw is that when I do a hot update of
the pipeline while there are still messages in the time window
(before they are sent to the target), the current job is shut
down and Dataflow creates a new one. The problem is that I seem
to be losing the messages that were being processed by the old
job; they are not picked up by the new one, which means we are
losing data.
Can you help me or recommend a strategy?
Thanks!!