Re: [DISCUSS] The definition of the dag `run_id`

Ash Berlin-Taylor Wed, 18 Aug 2021 14:43:55 -0700

Hi Lionel,

Great questions; most of the answers come down to historic reasons.

Getting run_type from run_id: this should only be used for back-compat -- the run_type column didn't used to exist (it was only added about 6-9 months ago, from my rough memory), but going forward the "prefix" on run_id has no meaning anymore; run_type is all that matters.
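
A minimal sketch of that back-compat rule, in plain Python with no Airflow imports. The `RunType`, `Run`, `legacy_type_from_run_id` and `resolve_run_type` names, the `<type>__` prefix format, and the fallback to "manual" are all assumptions made for illustration, not Airflow's actual code:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RunType(str, Enum):
    # Illustrative values only; this enum is not imported from Airflow.
    SCHEDULED = "scheduled"
    MANUAL = "manual"
    BACKFILL = "backfill"


@dataclass
class Run:
    run_id: str
    # None models a row created before the run_type column existed.
    run_type: Optional[RunType] = None


def legacy_type_from_run_id(run_id: str) -> RunType:
    """Back-compat only: guess the type from a '<type>__' prefix."""
    prefix = run_id.split("__", 1)[0]
    for candidate in RunType:
        if prefix == candidate.value:
            return candidate
    # Assumption: an unrecognised prefix is treated as a manual run.
    return RunType.MANUAL


def resolve_run_type(run: Run) -> RunType:
    """Prefer the stored run_type; fall back to the prefix only when it is missing."""
    if run.run_type is not None:
        return run.run_type
    return legacy_type_from_run_id(run.run_id)
```

With this shape, a run whose run_id is any free-form string still resolves correctly as long as run_type is set, which is the point: the prefix is only consulted for old rows.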

run_id vs execution_date: I have plans (and I'm slowly working towards this) to make execution_date /not/ unique on the dag_run. For example, let's say you have two (or n) models you want to try out and see which performs better. To really compare them you need them to operate on the same data, so ideally that means the same execution_date.

run_id is just meant to be that -- an identifier. Its exact value holds _no_ meaning to Airflow anymore, and we are free to have it take whatever value makes most sense to a user.
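
As a rough illustration of the model-comparison idea, here is a sketch of runs keyed by (dag_id, run_id) rather than (dag_id, execution_date). The `runs` dict and `add_run` helper are hypothetical stand-ins for the dag_run table, and the dag and run names are made up; this is not how Airflow stores runs:

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for the dag_run table,
# keyed by (dag_id, run_id) instead of (dag_id, execution_date).
runs = {}


def add_run(dag_id, run_id, execution_date):
    """Register a run; only the (dag_id, run_id) pair has to be unique."""
    key = (dag_id, run_id)
    if key in runs:
        raise ValueError(f"duplicate run_id {run_id!r} for dag {dag_id!r}")
    runs[key] = execution_date


# Two runs over the same data interval, told apart only by run_id.
same_date = datetime(2021, 8, 18, tzinfo=timezone.utc)
add_run("compare_models", "model_a", same_date)
add_run("compare_models", "model_b", same_date)
```

Both runs share one execution_date and are accepted, while reusing a run_id within the same dag is rejected.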


As to your suggestions:
1. Yes, clearer docs would always be good.
2. Yes, we should avoid doing this. Do we still do this anywhere?

3. As per your PR, I think making the behaviour configurable makes sense -- some Airflow installs operate "across" more than one TZ, so having them all be UTC might be a good option there.
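
For reference, the gap between the UTC label and a +8:00 wall clock can be reproduced with the Python standard library alone. This is only a sketch: `localize_label` is a hypothetical helper, and the timestamp format is taken from the example in the quoted message, not from Airflow's real run_id format:

```python
from datetime import datetime, timedelta, timezone

LABEL_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format, copied from the example timestamps


def localize_label(utc_label: str, offset_hours: int) -> str:
    """Render a UTC timestamp label in a fixed-offset local time zone."""
    utc_dt = datetime.strptime(utc_label, LABEL_FORMAT).replace(tzinfo=timezone.utc)
    local_dt = utc_dt.astimezone(timezone(timedelta(hours=offset_hours)))
    return local_dt.strftime(LABEL_FORMAT)


print(localize_label("2020-08-18 02:10:00", 8))  # prints 2020-08-18 10:10:00
```

A configurable option would amount to choosing the offset (or leaving it at 0 for installs that span several time zones).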


Thanks,
Ash

On Wed, Aug 18 2021 at 10:33:41 +0800, Lionel Zhao<[email protected]> wrote:

Hi guys,
When I tried to use Airflow, I found that the dag `run_id` shown on the page is in UTC while my time zone is +8:00, which makes it quite hard to tell exactly which run is which.
For example, I trigger a dag run at ‘2020-08-18 10:10:00’ but the dag `run_id` is `2020-08-18 02:10:00`.
So I created a PR here: https://github.com/apache/airflow/pull/17502 to localize the dag `run_id`; the PR is WIP now.
But I think we can have a discussion about the `run_id`. Actually, its definition confused me quite a bit when I checked the sources.
There are 2 points:
First, most of the time we use the `execution_date` to query the dag_runs, and there is also a UNIQUE_KEY (dag_id + execution_date), so why do we still need another key to query by? In fact, the `execution_date` could already serve as the `run_id`, and we would not need a separate `run_id`. If the purpose of the `run_id` is to let the user know exactly when the task ran, it is in UTC, which is very hard for users to use.
Second, I saw that in some places we get the run_type from the `run_id`, but we didn't set a clear rule for the `run_id`. This will be a risk in the future because it is a hidden rule of the dag `run_id`.
For my suggestions:
1. We should clarify the definition of the `run_id` and set a clear rule for it.
2. Avoid getting the `run_type` from the `run_id` and only use the `run_type` in the dag_run.
3. Change the `run_id` to local time so that the user can easily see the exact run time.
This is just to start a wider discussion; let me know what you think.

Thanks a lot





From,

Lionel Zhao
