On 2025-03-31 22:26, Jens Scheffler wrote:
Hi,

thanks for working on the bug and raising a PR to fix it.

As other committers also commented, I think that from a product view I'd
expect a different resolution. We use "Pause DAG" in most cases for
administrative or infrastructure problems, to prevent further failures
and/or to drain infra to switch some backend.

I assume that when we pause a long-running DAG that is in between task
executions, we want to really "pause" scheduling; we don't want to set it
to failed. That would also not be correct, because once we un-pause, the
running DAGs should continue to work. I see no reason to mark this as
failed and then manually chase after it to reset the state later.

My view on this, as also proposed in the discussion of the bug, is that
we should rather filter paused DAGs from cluster activity reporting,
so that paused DAGs are not reported with excessive runtime. Also,
if the DAG is later un-paused, it would be "right" that the overall DAG
runtime was longer than normal (I would not expect the paused time to be
deducted from the runtime of the DAG).
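
A minimal sketch of that filtering idea, assuming hypothetical in-memory
records rather than Airflow's actual models or the real cluster activity
code (DagRunRecord, dag_is_paused and long_running_runs are illustrative
names only). Runs of paused DAGs are skipped in the report, and their
state is left untouched:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DagRunRecord:
    # Hypothetical stand-in for a DAG run row; not Airflow's DagRun model.
    dag_id: str
    state: str
    start_date: datetime
    dag_is_paused: bool


def long_running_runs(runs, now, threshold):
    """Return running runs older than `threshold`, skipping paused DAGs."""
    return [
        run
        for run in runs
        if run.state == "running"
        and not run.dag_is_paused
        and now - run.start_date > threshold
    ]


now = datetime(2025, 3, 31, 22, 0)
runs = [
    DagRunRecord("etl_daily", "running", now - timedelta(hours=5), dag_is_paused=True),
    DagRunRecord("reporting", "running", now - timedelta(hours=5), dag_is_paused=False),
    DagRunRecord("cleanup", "running", now - timedelta(minutes=5), dag_is_paused=False),
]

reported = long_running_runs(runs, now, threshold=timedelta(hours=1))
print([run.dag_id for run in reported])  # prints ['reporting']
```

The paused "etl_daily" run keeps its running state and is only left out
of the report, so it can continue once the DAG is un-paused.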

If I (as operator/admin) really want to terminate existing running
instances, I'd rather walk through Browse -> DAG Runs -> filter for
running with the paused DAG id and mark them as failed explicitly.

Jens

On 31.03.25 20:50, Pedro Nunes Leal wrote:
Hello everyone,

Currently, I'm trying to fix this bug:
https://github.com/apache/airflow/issues/44443

Basically, the issue is that the DAGs stay stuck in the running state
even though they are paused.
Consequently, the duration of the DAG run keeps increasing even
though the DAG is paused.

My proposal to solve this problem is to change the DAG's state from
running to failed when it is paused, to stop its duration from increasing.

Since this can be an impactful change, I would like to hear what
others think about it.

Link for the Pull Request: https://github.com/apache/airflow/pull/47557


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org


That can be a better approach.

However, if I'm not mistaken, the code related to the cluster activity page doesn't exist in Airflow 3 (the version where I'm trying to make the changes).

So what should I do in this case?
Is there any other way to solve this problem that does not involve cluster activity?

Changing to the queued state instead of failed was my proposal at the beginning, and it really pauses the DAG. This is the type of solution I had in mind because, as I said before in the pull request, I feel that the cluster activity behavior is just a symptom of a bigger problem (the DAGs don't really pause; they just keep running).
