Re: [Discussion] Deprecate auto cleanup RenderedTaskInstanceFields and decouple k8s_pod_yaml

Kaxil Naik Wed, 01 Feb 2023 17:41:38 -0800

> >Users often asked (slack, issues, discussions) to give the ability to
> auto-maintain tables/logs and receive the usual answer: "No, we don't give
> you this opportunity, please use something for Airflow Ecosystem Page". But
> on the other hand we have auto-maintenance only for a single table.



That is not true, we added "airflow db clean" command to give this control
to users.

What about these tables: session, log, job? I expected the answer would be
> "They are not so specific."

The RTIF can vary in size based on what is stored and is really useless
beyond last X for each task. RTIF also has a foreign key constraints
<https://github.com/apache/airflow/blob/d80b583db07197c8c3d0549a805e83ceaaf10d52/airflow/models/renderedtifields.py#L63-L73>
on
TI and that is the most common table that is affected during the migrations
and can affect scheduling decisions. Log and Session tables aren't affected
by it.

I will have a detailed reply if I manage to find the time, it has just been
too difficult

On Tue, 31 Jan 2023 at 21:46, Jarek Potiuk <[email protected]> wrote:

> BTW, I would really love to hear what the original authors have to say
> here. I am merely trying to put myself in their shoes and guess what
> the reasoning is :).
>
> I think this is really a question of: Do we want to keep all rendered
> arbitrary size rendered task instance fields in our database forever
> by default, same as other fields.
> I believe the original authors answered the question to be "no". And
> the "num_row" was a way to limit it.
>
> And I really do not want to "win" this argument, I just want to
> protect our users (and environment).
>
> There is (IMHO) currently a big difference between
> session/logg/task_instance fields and rendered task_instance fields
> that justify different behaviour.
>
> The former are generally fixed in max size of rows (this is one of the
> reasons we have limited string sizes in our DB) - to be able to limit
> them growing uncontrollably large. We simply do not keep arbitrary
> size data in those tables.
> On the other hand, the rendered task instance is arbitrary in size
> (JSONField) and the need for deletion accounts for the "worst" case
> scenario.
> Until we get rid of that "property" of the rendered task instance
> table, I think "just" skipping deletion of those fields without fixing
> the "worst" case scenario is not a good idea.
>
> Maybe in your test cases (and many others) those tables are not
> bigger, but I think the protection here is implemented to account for
> the case where the rendered task instance field is "big". I think the
> protection here is done for the cases where the rendered task instance
> fields are really "big".
>
> But very interestingly - if the rendered task instance is "big" then
> likely it is next to useless to be displayed in the Web UI in its
> entirety.
>
> So maybe you are actually right Andrey, Maybe we can skip deleting
> those and maybe we could solve it differently and apply the same rules
> as other tables?
>
> Let me then - constructively - propose another idea which actually
> might solve both yours and my own concerns. Maybe we can fix the
> "worst case" scenario differently? We do not have to limit the number
> of rows, we can limit the size of the row instead.
>
> Why don't we simply limit the size of the rendered task instance JSON
> and if they are too big (we can configure the maximum size), we will
> render something like (probably a bit more unique and not
> "accidentally triggerable"):
>
> {
>    "error": "This task instance has too large rendered task instances
> to display it"
> }
>
> And implement an escape hatch in the web server to handle it properly
> when displaying such "truncated" rendered task instance field.
>
> We should be able come up with a sensible max size that we might think
> makes sense when rendering it in the web UI.  And we could make the
> max size configurable by the user if they have a lot of GB to spare.
> We could even apply it automatically. If there is a max_num_row_limit
> - we allow any size, if we have no limit on the number of rows, we
> limit the maximum row size.
>
> If we have such "upper-bounded" reasonable size of each row in that
> table - then I am perfectly happy with not deleting the rows
> automatically.
> But only if we limit that and handle the "worst" case properly.
>
> That would be my proposal how we can handle it to get both views taken
> into account.
>
> What do you Andrey (and others) think? Does it make sense? Or do we
> think we should not have any such protections in place ?
>
> J.
>
>
> On Tue, Jan 31, 2023 at 9:43 PM Andrey Anshin <[email protected]>
> wrote:
> >
> > > I think that is all something to be tested with explaining plans.
> >
> > I would be really careful with these results. DELETE in Transaction with
> Rollback usually shows more optimistic than actually executed.
> >
> >
> > > I think we would not know before we try - and possibly there are other
> > optimisation approaches. The optimisation I proposed was only first
> > that came to my mind to avoid the "not in" query. The problem  with
> > "not in query" is that there is no way to optimise it by the DB.
> > Effectively you have to get every record (or index entry) and test it.
> > Maybe it can be done better :). And yes locking the index with
> > anti-insert locks and the need to rebalance trees during the delete is
> > a concern.
> >
> > My point is still the same, I would rather remove it in the future or
> make policy about maintenance more consistent: all or nothing. Right now we
> are close to nothing rather than all.
> >
> >
> > > It's not about index size or JSON access. It is about the size of the
> > actual rows and storage it takes - i.e. general size of the database.
> >
> > I'm tired, but I'm not sure that I understand the actual point.
> > Is it not really a matter of size of the table if you always access by
> pattern:
> > 1 request which returns exactly 1 record accessed by a unique index.
> > Basically query travercial by index find reference to single CTID/rowid
> (or whatever name used in other RDBMS).
> > So at this point it really matters how fast your index grows rather than
> table size.
> >
> >
> > > The problem with it is that (especially with dynamic task mapping), it
> > might grow really, really fast. Basically you have NUM_DAGS x
> > NUM_TASKS * NUM_MAP_INDEXES * NUM_TEMPLATED_FIELDS  * NUM_RUNS number
> > of records there.
> > Back-of-the envelope calculation Assuming you have a
> > DAG with 10 dynamically mapped tasks with 100 mapped indexes with 10
> > fields, each field evaluating to 1K string.  Then you have 10 tasks *
> > 100 map indexes * 10 fields * 1K rendered string size = 10MB to store
> > per one(!) run of one(1) DAG. Run it every 10 minutes and every day
> > your database from a single DAG grows by whooping 1.44 GB of data
> > every single day (from single DAG).
> >
> > Depends on DB. If we talk about Postgres you could easily miss up to 3-4
> times (thanks for inline compression before/instead of TOASTs).
> > I have a couple questions:
> > 1. Do you know how big would be indexes in rendered_task_instance_fields
> after one day? (Spoiler alert I could give estimation in the morning)
> > 2. In this scenario with default settings always would keep up to 30 000
> TI for this DAG.
> > Could someone who wants to optimize the query make it more optimal
> rather than access the table by index (Unique Index -> ctid/rowid - record)
> and where this crosspoint?
> >
> > > This is of course an estimation that assumes a lot, but it's not at
> all unrealistic.
> >
> > 144,000 TI per/day on single DAG (initially I want to put here some
> sarcastic message).
> > How would Airflow feel with 144k Task Instances per day? How
> > I asked because right now I've always had a task_instance table bigger
> than rendered_task_instance_fields.
> >
> >
> > > This table is very specific compared with the other tables. The only
> reason for it being here is to be able
> > to show the rendered fields in the UI if you go to the specific run of a
> task. If you clean-up other tables you basically lose the history of
> execution of the tasks and you cannot really know if the data has been
> > processed, you cannot do backfills effectively, you lose all the
> > context. Cleaning this table is merely about the values that have been
> > rendered for a specific run and the assumption there is that the older
> > it gets, the less interesting it is.
> >
> > What about these tables: session, log, job? I expected the answer would
> be "They are not so specific."
> > For me every table is specific for their purpose.
> > Users often asked (slack, issues, discussions) to give the ability to
> auto-maintain tables/logs and receive the usual answer: "No, we don't give
> you this opportunity, please use something for Airflow Ecosystem Page". But
> on the other hand we have auto-maintenance only for a single table.
> >
> > >> It is opposite of what we have right now, we scan tables (maybe
> multiple times), read all records tuples which contain JSON.
> > > Not sure if I get the point here :). Yes -in my proposal I think the
> records will not be touched - only indexes. So the cleanup should be way
> faster, contentions less of problem, due to the way the delete
> > uses < ordering, deadlocks will not be possible at all (as opposed to
> the current "not in" - there is a very easy way to get into deadlocks when
> two parallel deletes are trying to delete same rows in a different
> sequence. I think my proposal improves all the characteristics of the
> "cleanup" with very little extra penalty on record creation.
> >
> > I was talking about the current solution and why it is also slow (and if
> abstract users use some DBaaS where they also pay for IOPs then it is
> expensive). Let's talk about the benefits of optimised queries for 4
> different DB backends (3 if excluding SQLite) when we have it.
> >
> > > We pay for table/index size linary more records, more size. But other
> operations vary and depend on B-Tree implementation and usually it has
> logarithmic growth. Or do we worry only about table/toast/index size on
> disk?
> > >> Yep. I (and I believe the original author had the same worry) am
> worried a lot about the size of the table and the fact that this table will
> be by far the biggest table in our DB while most of the old records will
> never be touched. And by the fact that this is the only table that our
> users will have to know about to clean up separately from all others pretty
> much always.
> >
> > Same as previous.
> >
> > > If not even worrying about money spent by our users, and performance
> degradation that comes with databases that are bigger - that's a lot of
> environmental effects that we might incur. Airflow is heavily used, if
> suddenly all our users
> > will start having 10 bigger databases that they have now because we will
> deprecate the values and keep all the history, then we have a big number of
> extra disks that will have to be used. I'd strongly prefer a solution where
> we keep the data usage lower in this case.
> >
> > Am I right that this is all about "lets don't delete by default as we do
> for other tables" rather than the current default implementation?
> > Because I get the result which is opposite what you said. And
> rendered_task_instance_fields don't grow faster than other tables that what
> I got.
> > I would like to compare it with other findings and some reproducible
> metrics rather than with hypothetical things.
> >
> > ----
> > Best Wishes
> > Andrey Anshin
> >
> >
> >
> > On Tue, 31 Jan 2023 at 11:12, Jarek Potiuk <[email protected]> wrote:
> >>
> >> COMMENT: While writing the answer here, I think I found a deeper
> >> problem (and optimisation needed)  - i.e I think the delete should be
> >> even more fine-grained than it is today and include map_index) -
> >> please take a look at the end (Also maybe TP might comment on that
> >> one).
> >>
> >> > 1. Additional indexes add additional performance degradation on
> Insert but gain potential improvements on delete and unknown on update,
> RDBMS still require rebalance index and make it consistent to the table.
> >> > 2. LIMIT x OFFSET y could easily become full seq scan, especially if
> the user set a huge number for offset (which? unknown).
> >> > 3. Mixing two indexes could improve performance in a single query but
> in concurrent execution might lead to degradation because it needs to
> create a bitmap table for comparison between these two indexes, as result
> it might lead different issues, such as OOM on DB backend, use swaps or
> optimiser decided that better not to use this indexes.
> >>
> >> I think that is all something to be tested with explain plans. I think
> >> we would not know before we try - and possibly there are other
> >> optimisation approaches. The optimisation I proposed was only first
> >> that came to my mind to avoid the "not in" query. The problem  with
> >> "not in query" is that there is no way to optimise it by the DB.
> >> Effectively you have to get every record (or index entry) and test it.
> >> Maybe it can be done better :). And yes locking the index with
> >> anti-insert locks and the need to rebalance trees during the delete is
> >> a concern.
> >>
> >> > Is it a real problem? Until we access only by indexes, which doesn't
> include this JSON, it really doesn't matter. I guess we almost always
> should make a UNIQUE INDEX SCAN for SELECT or DELETE (UPDATE) a single
> record.
> >>
> >> Yes I think so, and while. I was not the author of this "cleanup"
> >> code, I believe I know the intention.
> >>
> >> It's not about index size or JSON access. It is about the size of the
> >> actual rows and storage it takes - i.e. general size of the database.
> >> The problem with it is that (especially with dynamic task mapping), it
> >> might grow really, really fast. Basically you have NUM_DAGS x
> >> NUM_TASKS * NUM_MAP_INDEXES * NUM_TEMPLATED_FIELDS  * NUM_RUNS number
> >> of records there. Back-of-the envelope calculation Assuming you have a
> >> DAG with 10 dynamically mapped tasks with 100 mapped indexes with 10
> >> fields, each field evaluating to 1K string.  Then you have 10 tasks *
> >> 100 map indexes * 10 fields * 1K rendered string size = 10MB to store
> >> per one(!) run of one(1) DAG. Run it every 10 minutes and every day
> >> your database from a single DAG grows by whooping 1.44 GB of data
> >> every single day (from single DAG).This is of course an estimation
> >> that assumes a lot, but it's not at all unrealistic. That's a lot. And
> >> if we want the user to do the cleanup then a) they need to know it b)
> >> they need to specifically clean up this table only because all the
> >> other data is relatively small. This table is very specific compared
> >> with the other tables. The only reason for it being here is to be able
> >> to show the rendered fields in the UI if you go to the specific run of
> >> a task. If you clean-up other tables you basically lose the history of
> >> execution of the tasks and you cannot really know if the data has been
> >> processed, you cannot do backfills effectively, you lose all the
> >> context. Cleaning this table is merely about the values that have been
> >> rendered for a specific run and the assumption there is that the older
> >> it gets, the less interesting it is.
> >>
> >> > It is opposite of what we have right now, we scan tables (maybe
> multiple times), read all records tuples which contain JSON.
> >>
> >> Not sure if I get the point here :). Yes -in my proposal I think the
> >> records will not be touched - only indexes. So the cleanup should be
> >> way faster, contentions less of problem, due to the way the delete
> >> uses < ordering, deadlocks will not be possible at all (as opposed to
> >> the current "not in" - there is a very easy way to get into deadlocks
> >> when two parallel deletes are trying to delete same rows in a
> >> different sequence. I think my proposal improves all the
> >> characteristics of the "cleanup" with very little extra penalty on
> >> record creation.
> >>
> >> > We pay for table/index size linary more records, more size. But other
> operations vary and depend on B-Tree implementation and usually it has
> logarithmic growth. Or do we worry only about table/toast/index size on
> disk?
> >>
> >> Yep. I (and I believe the original author had the same worry) am
> >> worried a lot about the size of the table and the fact that this table
> >> will be by far the biggest table in our DB while most of the old
> >> records will never be touched. And by the fact that this is the only
> >> table that our users will have to know about to clean up separately
> >> from all others pretty much always. If not even worrying about money
> >> spent by our users, and performance degradation that comes with
> >> databases that are bigger - that's a lot of environmental effects that
> >> we might incur. Airflow is heavily used, if suddenly all our users
> >> will start having 10 bigger databases that they have now because we
> >> will deprecate the values and keep all the history, then we have a big
> >> number of extra disks that will have to be used. I'd strongly prefer a
> >> solution where we keep the data usage lower in this case.
> >>
> >> > If we do not want to grant users the ability to clean up rendered
> templates tables, there could be another option:
> >> > - Do not delete records on every task instance run.
> >> > - Delete once per defined period (hourly, daily, weekly, monthly). In
> this case you really could not care about locks.
> >>
> >> Yes we could come up with a different strategy as to "when" run the
> >> cleanup. This is also a viable option. If you can propose one that
> >> will be equally adaptive as the current solution, I am all ears.
> >> Basically my goal is to keep the usage of the table low, possibly
> >> controlled by the same parameter we had. How we do it - this is a
> >> different story. If we - for example add a thread in the scheduler
> >> (for example) that performs such cleanup effectively in parallel and
> >> scales, I am happy with that.
> >>
> >> But I am trying to get into the head of the author trying to
> >> understand why the original implementation was done this way. I
> >> believe (and maybe those who remember it better could confirm it) that
> >> distributing the deletion to tasks to clean up after itself is a
> >> better idea than centralising the cleanup. This makes each cleanup
> >> smaller, locks are held for a shorter time (at least that was the
> >> assumption where no full table scan was used), it is more "immediate"
> >> and you do not have to decide upfront what should be the cleanup
> >> frequency. It seems this is the best logical approach to keep the
> >> "MAX_NUM_RENDERED_TI_FIELDS_PER_TASK" promise. Simply after task is
> >> complete, you can be sure that there are no more than this number of
> >> fields per task in the DB. With a scheduled run, that would be a much
> >> more "eventual" consistency and it will be potentially fluctuating
> >> much more.
> >>
> >> But there are risks involved in having a single thread doing the
> >> cleanup. I think it has a huge risk of being a "stop-the world" and
> >> "deadlock-prone" kind of event - if in big instances there are a lot
> >> of rows to cleanup in a single pass. When you delete entries from a
> >> table, there are anti-insert locks created for existing index entries,
> >> which makes it possible to rollback the whole DELETE transaction.
> >> Which means that when you try to insert the row with the same index,
> >> the index will be held. And this means that when you run a single huge
> >> DELETE for multiple rows affected with multiple (all?) index keys
> >> matching select query, it will effectively prevent new rows with the
> >> same indexes that are matching the SELECT. It would mean that if you
> >> have some tasks running while deleting existing run_id rendered
> >> fields, then you could REALLY start having deadlocks on those tasks
> >> trying to insert rendered task instance rows. That's why I think the
> >> only viable strategy for single "cleanup" thread is to do such cleanup
> >> as separate DELETE for each of the "dag/task/map_index/run" - in order
> >> to avoid such deadlocks. Which effectively will turn into what have
> >> currently - only that currently those transactions are done by tasks,
> >> not by a single cleanup thread.
> >>
> >> Also using tasks to delete old rows is more "effective" when you have
> >> vast differences in frequency of DAGs. Naturally - when you do it in
> >> task, you will only do it "when needed" for given DAG + Task. If you
> >> try to centralize the cleanup, unless you include somehow schedule and
> >> frequency of each dag, you are going to check every DAG every time
> >> your run the cleanup - no matter if that DAG is run daily or every
> >> minute, you will have to run the cleanup frequently enough to match
> >> your most frequent dags. If you have 1000 dags that run hourly and one
> >> DAG that runs every minue, then you have to run a cleanup job that
> >> scans all DAGs every few minutes. That's a big waste.
> >>
> >> So I am not sure if we gain anything by centralizing the cleanup.
> >> Decentralising it to Task seems to be a well thought and sound
> >> decision (but I think the problem we have now is that we need to
> >> optimize it after Dynamic Task Mapping has been added).
> >>
> >> ANOTHER FINDING:
> >>
> >> While looking at the code and discussing it and looking more closely I
> >> **think** there is another problem that we have to fix regardless of a
> >> solution. I THINK a problem we might have now is that we do not
> >> include map_index in this DELETE. While we are curreently delete all
> >> the rendered task fields without including map_index - and for big
> >> dynamic tasks, it means that exacly the same DELETE query is run by
> >> every single mapped instance of that tasks and that is where a lot of
> >> contention and locking might happen (basically when single task
> >> instance does the delete, anti-insert locks held the other mapped
> >> instances of the same task from inserting rendered fields).
> >>
> >> It does not change much in the optimisation proposal of mine, other
> >> than we should include map_index in those queries. But I think this
> >> might cause a lot of delays in the current implementation.
> >>
> >> J.
> >>
> >> > ----
> >> > Best Wishes
> >> > Andrey Anshin
> >> >
> >> >
> >> >
> >> > On Mon, 30 Jan 2023 at 23:51, Jarek Potiuk <[email protected]> wrote:
> >> >>
> >> >> I think there is a good reason to clean those up automatically.
> >> >> rendered task instance fields are almost arbitrary in size. If we try
> >> >> to keep all historical values there by default, there are numerous
> >> >> cases it will grow very fast - far, far too quickly.
> >> >>
> >> >> And I am not worried at all about locks on this table if we do it the
> >> >> way I described it and it uses the indexes. The contention this way
> >> >> might only be between the two deleting tasks. and with the query I
> >> >> proposed, they will only last for a short time - the index will be
> >> >> locked when two DELETES  or SELECT DISTINCT - which should both be
> >> >> fast.
> >> >>
> >> >>
> >> >> On Mon, Jan 30, 2023 at 8:37 PM Andrey Anshin <
> [email protected]> wrote:
> >> >> >
> >> >> > I guess two things involved to reduce performance on this query
> through the time: Dynamic Task Mapping and run_id instead of execution date.
> >> >> >
> >> >> > I still personally think that changing the default value from 30
> to 0 might improve performance of multiple concurrent tasks, just because
> this query does not run and there are no locks on multiple records/pages.
> >> >> >
> >> >> > I do not have any proof (yet?) other than simple DAGs. I think
> that there is some cross point exists when keeping this table growth worse
> rather than cleanup for each TI run. But users have ability to cleanup
> table by execute airflow db clean which should improve performance again
> >> >> >
> >> >> > And also there is interesting behavior with this query: if user
> already have more that value specified by
> AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK and tried run backfill
> than rendered templates not written to table (or may be inserted and after
> that immediately deleted), the same is valid for cleanup old tasks.
> >> >> >
> >> >> > ----
> >> >> > Best Wishes
> >> >> > Andrey Anshin
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Sun, 29 Jan 2023 at 14:16, Jarek Potiuk <[email protected]>
> wrote:
> >> >> >>
> >> >> >> Yep. Agree this is not an efficient query and dynamic task mapping
> >> >> >> makes the effect much worse. Generally speaking, selecting "what
> >> >> >> should be left" and then deleting stuff where the key is "not in"
> is
> >> >> >> never an efficient way of running an sql query.  And the query not
> >> >> >> using index at all makes it rather terrible.
> >> >> >>
> >> >> >> I think we should not deprecate it though, but find a more
> efficient
> >> >> >> way of deleting the old keys. I think we could slightly
> denormalize
> >> >> >> RenderedTaskInstance + DagRun tables, and add
> DAG_RUN_EXECUTION_DATE
> >> >> >> to the RenderedTaskInstance table and that will be enough to
> optimise
> >> >> >> it.
> >> >> >>
> >> >> >> Then we could have either:
> >> >> >>
> >> >> >> * a composite B-TREE indexed (non-unique) index on DAG_ID,
> TASK_ID,
> >> >> >> RUN_ID_EXECUTION_DATE
> >> >> >> * or maybe even regular HASH index on DAG_ID, TASK_ID and separate
> >> >> >> B-TREE index (non-unique) on just RUN_ID_EXECUTION_DATE
> >> >> >>
> >> >> >> Probably the latter is better as I am not sure how < , >
> comparison
> >> >> >> looks like for composite B-TREE indexes when char + date columns
> are
> >> >> >> mixed. Also we could have hit the infamous MySQL index key length
> >> >> >> limit.
> >> >> >>
> >> >> >> Then deletion process would look roughly like:
> >> >> >>
> >> >> >> 1) dag_run_execution_date = SELECT RUN_ID_EXECUTION_DATE FROM
> >> >> >> RENDERED_TASK_INSTANCE_FIELDS WHERE DAG_ID =<DAG_ID>,
> >> >> >> TASK_ID=<TASK_ID> ORDER BY RUN_ID_EXECUTION_DATE GROUP BY
> >> >> >> RUN_ID_EXECUTION_DATE DESC LIMIT 1 OFFSET
> >> >> >> <MAX_NUM_RENDERED_TI_FIELDS_PER_TASK>
> >> >> >> 2) DELETE FROM RENDERED_TASK_INSTANCE_FIELDS WHERE DAG_ID
> =<DAG_ID>,
> >> >> >> TASK_ID=<TASK_ID> AND RENDER_TIME < dag_run_execution_date
> >> >> >>
> >> >> >> I believe that would be fast, and it would use the B-TREE index
> >> >> >> features nicely (ordering support)
> >> >> >>
> >> >> >> J
> >> >> >>
> >> >> >> On Sun, Jan 29, 2023 at 2:09 AM Andrey Anshin <
> [email protected]> wrote:
> >> >> >> >
> >> >> >> > First of all I want to highlight that this approach I guess
> worked well until Dynamic Task Mappings introduced.
> >> >> >> >
> >> >> >> > > The main reason for adding that cleanup was -- if you don't
> do that, you will have many rows, similar to the TaskInstance table
> >> >> >> >
> >> >> >> > The problem itself is not how big your table/indexes, rather
> then what kind of operation you run.
> >> >> >> >
> >> >> >> > > Do you have any data for locks or performance degradation?
> >> >> >> >
> >> >> >> > In this case if we try to clean up
> rendered_task_instance_fields table when a new TI is created/cleared we
> make almost two full/sequential scans (note: need to check) against the
> table without any index usage, so we pay here a couple times:
> >> >> >> > 1. We scan without indexes - not all parts of the composite key
> are included to query, plus we need to filter everything except 30 records
> with order and distinct
> >> >> >> > 2. After that we make another full scan for find 1 record or
> map_size records
> >> >> >> >
> >> >> >> > And I guess the situation becomes worse if you have a lot of
> tasks, even if we have a small table, we need to do ineffective operations.
> >> >> >> >
> >> >> >> > That how looks like Query Plan (please note without commit
> transaction DELETE operation doesn't have all information):
> https://gist.github.com/Taragolis/3ca7621c51b00f077aa1646401ddf31b
> >> >> >> >
> >> >> >> > In case if we do not clean up the table, we only use these
> operations:
> >> >> >> > 1. SELECT single record by index
> >> >> >> > 2. INSERT new record
> >> >> >> > 3. DELETE old record(s), which were found by index.
> >> >> >> >
> >> >> >> > I have not done any real tests yet, only synthetic DAGs (so we
> should not consider to use any findings as totally truth):
> https://gist.github.com/Taragolis/6eec9f81efdf360c4239fc6ea385a480
> >> >> >> > DAG with parallel tasks: degradation up to 2-3 times
> >> >> >> > DAG with single map tasks: degradation up to 7-10 times
> >> >> >> >
> >> >> >> > I have a plan for more complex and more close to real use cases
> with Database which do not have network latency almost 0 as I have in my
> local.
> >> >> >> > But I will not refuse if someone also does their tests with
> AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK=0 vs default value.
> >> >> >> >
> >> >> >> > About deadlock we know that it exists at least in MySQL:
> https://github.com/apache/airflow/pull/18616
> >> >> >> >
> >> >> >> > > And the larger tables create problems during database
> migrations.
> >> >> >> >
> >> >> >> > That is a very good point, so if we found that problem only
> related to migrations we could:
> >> >> >> > 1. Cleanup this table in migration
> >> >> >> > 2. Add cli command to airflow db which could cleanup only
> rendered fields, so it would be user's choice cleanup or not before
> migration, do periodical maintenance or not
> >> >> >> >
> >> >> >> >
> >> >> >> > ----
> >> >> >> > Best Wishes
> >> >> >> > Andrey Anshin
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > On Sat, 28 Jan 2023 at 23:41, Kaxil Naik <[email protected]>
> wrote:
> >> >> >> >>>
> >> >> >> >>> Potentially it is a good idea to deprecate this option and
> recommend for users to set it to 0? WDYT? Maybe someone has already tried
> or investigated this?
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> The main reason for adding that cleanup was -- if you don't do
> that, you will have many rows, similar to the TaskInstance table. And the
> RenderedTIFields were mainly added for checking rendered TI fields on the
> Webserver only because after DAG Serialization, the webserver won't have
> access to DAG files.
> >> >> >> >>
> >> >> >> >> And the larger tables create problems during database
> migrations.
> >> >> >> >>
> >> >> >> >> Do you have any data for locks or performance degradation?
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Sat, 28 Jan 2023 at 13:06, Andrey Anshin <
> [email protected]> wrote:
> >> >> >> >>>
> >> >> >> >>> Greetings!
> >> >> >> >>>
> >> >> >> >>> During migrating our ORM syntax to compatible with SQLAlchemy
> 2.0 I probably found skeletons in the closet.
> >> >> >> >>>
> >> >> >> >>> Let's start from the beginning, initially I got this warning
> >> >> >> >>>
> >> >> >> >>> airflow/models/renderedtifields.py:245
> RemovedIn20Warning('ORDER BY columns added implicitly due to DISTINCT is
> deprecated and will be removed in SQLAlchemy 2.0.  SELECT statements with
> DISTINCT should be written to explicitly include the appropriate columns in
> the columns clause (Background on SQLAlchemy 2.0 at:
> https://sqlalche.me/e/b8d9)')
> >> >> >> >>>
> >> >> >> >>> "OK let's fix it!", I thought at first and started to
> investigate RenderedTaskInstanceFields model
> >> >> >> >>>
> >> >> >> >>> Skeleton #1:
> >> >> >> >>>
> >> >> >> >>> When I first time look on the code and comments it got me to
> thinking that part which keep only latest N Rendered Task Fields
> potentially could lead different performance degradation (Locks, Dead
> Locks, Data Bloating): see code
> https://github.com/apache/airflow/blob/6c479437b1aedf74d029463bda56b42950278287/airflow/models/renderedtifields.py#L185-L245
> >> >> >> >>>
> >> >> >> >>> Also this historical part (from Airflow 1.10.10) generate
> this SQL Statement (pg backend)
> >> >> >> >>>
> >> >> >> >>> DELETE FROM rendered_task_instance_fields
> >> >> >> >>> WHERE rendered_task_instance_fields.dag_id = %(dag_id_1) s
> >> >> >> >>>   AND rendered_task_instance_fields.task_id = %(task_id_1) s
> >> >> >> >>>   AND (
> >> >> >> >>>     (
> >> >> >> >>>       rendered_task_instance_fields.dag_id,
> >> >> >> >>>       rendered_task_instance_fields.task_id,
> >> >> >> >>>       rendered_task_instance_fields.run_id
> >> >> >> >>>     ) NOT IN (
> >> >> >> >>>       SELECT
> >> >> >> >>>         anon_1.dag_id,
> >> >> >> >>>         anon_1.task_id,
> >> >> >> >>>         anon_1.run_id
> >> >> >> >>>       FROM
> >> >> >> >>>         (
> >> >> >> >>>           SELECT DISTINCT
> >> >> >> >>>             rendered_task_instance_fields.dag_id AS dag_id,
> >> >> >> >>>             rendered_task_instance_fields.task_id AS task_id,
> >> >> >> >>>             rendered_task_instance_fields.run_id AS run_id,
> >> >> >> >>>             dag_run.execution_date AS execution_date
> >> >> >> >>>           FROM rendered_task_instance_fields
> >> >> >> >>>             JOIN dag_run ON
> rendered_task_instance_fields.dag_id = dag_run.dag_id
> >> >> >> >>>             AND rendered_task_instance_fields.run_id =
> dag_run.run_id
> >> >> >> >>>           WHERE
> >> >> >> >>>             rendered_task_instance_fields.dag_id =
> %(dag_id_2) s
> >> >> >> >>>             AND rendered_task_instance_fields.task_id =
> %(task_id_2) s
> >> >> >> >>>           ORDER BY
> >> >> >> >>>             dag_run.execution_date DESC
> >> >> >> >>>           limit %(param_1) s
> >> >> >> >>>         ) AS anon_1
> >> >> >> >>>     )
> >> >> >> >>>   )
> >> >> >> >>>
> >> >> >> >>> Which is especially not effective in PostgreSQL. When IN
> SUBQUERY could be easily transform internaly into SEMI-JOIN (aka EXISTS
> clause), but it is not working for NOT IN SUBQUERY because it is not
> transformed into ANTI JOIN (aka NOT EXISTS clause) even if it possible,
> see: https://commitfest.postgresql.org/27/2023/
> >> >> >> >>>
> >> >> >> >>> I didn't do any performance benchmarks yet but I guess if
> users set AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK to 0 rather
> than default 30 it could improve performance and reduce number of
> DeadLocks, however the table size will increase but I think we don't do any
> maintenance job for other tables.
> >> >> >> >>>
> >> >> >> >>> Potentially it is a good idea to deprecate this option and
> recommend for users to set it to 0? WDYT? Maybe someone has already tried
> or investigated this?
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>> Skeleton #2:
> >> >> >> >>>
> >> >> >> >>> We have a k8s_pod_yaml field which is exclusively used by K8S
> executors.
> >> >> >> >>>
> >> >> >> >>> Should we also decouple this field as part of AIP-51?
> >> >> >> >>>
> >> >> >> >>> ----
> >> >> >> >>> Best Wishes
> >> >> >> >>> Andrey Anshin
> >> >> >> >>>
>

Re: [Discussion] Deprecate auto cleanup RenderedTaskInstanceFields and decouple k8s_pod_yaml

Reply via email to