Maybe the “auto-maintenance” point is true. Hopefully not for long though :)

On Thu, 2 Feb 2023 at 01:41, Kaxil Naik <[email protected]> wrote:

>
> > Users often asked (Slack, issues, discussions) for the ability to
> > auto-maintain tables/logs and received the usual answer: "No, we don't
> > give you this option, please use something from the Airflow Ecosystem
> > page". But on the other hand we have auto-maintenance only for a single table.
>
>
> That is not true; we added the "airflow db clean" command to give this
> control to users.
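>
> For illustration, cleaning only this table can be done with something
> like (the timestamp is just an example):
>
>     airflow db clean --tables rendered_task_instance_fields \
>         --clean-before-timestamp '2023-01-01' --dry-run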
>
> > What about these tables: session, log, job? I expected the answer would be
> > "They are not so specific."
>
> The RTIF can vary in size based on what is stored and is really useless
> beyond the last X for each task. RTIF also has foreign key constraints
> <https://github.com/apache/airflow/blob/d80b583db07197c8c3d0549a805e83ceaaf10d52/airflow/models/renderedtifields.py#L63-L73>
> on TI, which is the most common table affected during migrations, and
> that can affect scheduling decisions. The Log and Session tables aren't
> affected by it.
>
> I will send a detailed reply if I manage to find the time; it has just
> been too difficult.
>
> On Tue, 31 Jan 2023 at 21:46, Jarek Potiuk <[email protected]> wrote:
>
>> BTW, I would really love to hear what the original authors have to say
>> here. I am merely trying to put myself in their shoes and guess what
>> the reasoning is :).
>>
>> I think this is really a question of: do we want to keep all the
>> arbitrary-size rendered task instance fields in our database forever
>> by default, the same as the other fields?
>> I believe the original authors answered that question with "no". And
>> the "num_row" setting was a way to limit it.
>>
>> And I really do not want to "win" this argument, I just want to
>> protect our users (and environment).
>>
>> There is (IMHO) currently a big difference between
>> session/log/task_instance fields and rendered task_instance fields
>> that justifies different behaviour.
>>
>> The former are generally fixed in maximum row size (this is one of the
>> reasons we have limited string sizes in our DB) - to keep them from
>> growing uncontrollably large. We simply do not keep arbitrary-size
>> data in those tables.
>> The rendered task instance fields, on the other hand, are arbitrary in
>> size (JSONField), and the need for deletion accounts for the "worst"
>> case scenario.
>> Until we get rid of that "property" of the rendered task instance
>> table, I think "just" skipping deletion of those fields without fixing
>> the "worst" case scenario is not a good idea.
>>
>> Maybe in your test cases (and many others) those tables are not
>> bigger, but I think the protection here is implemented for the cases
>> where the rendered task instance fields are really "big".
>>
>> But very interestingly - if the rendered task instance is "big", then
>> it is likely next to useless to display in the Web UI in its
>> entirety.
>>
>> So maybe you are actually right, Andrey. Maybe we can skip deleting
>> those rows, solve the problem differently, and apply the same rules
>> as for the other tables?
>>
>> Let me then - constructively - propose another idea which might
>> actually address both your concerns and mine. Maybe we can fix the
>> "worst case" scenario differently? We do not have to limit the number
>> of rows; we can limit the size of each row instead.
>>
>> Why don't we simply limit the size of the rendered task instance JSON:
>> if it is too big (we can make the maximum size configurable), we
>> render something like this instead (probably a bit more unique and not
>> "accidentally triggerable"):
>>
>> {
>>    "error": "The rendered fields of this task instance are too large
>> to display"
>> }
>>
>> And implement an escape hatch in the web server to handle it properly
>> when displaying such a "truncated" rendered task instance field.
>>
>> We should be able to come up with a sensible max size for rendering in
>> the web UI, and we could make the max size configurable for users who
>> have a lot of GB to spare.
>> We could even apply it automatically: if there is a max_num_row limit,
>> we allow any size; if there is no limit on the number of rows, we
>> limit the maximum row size.
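>>
>> A minimal sketch of what I mean (the max size constant and placeholder
>> below are made up for illustration - not existing Airflow options):
>>
>>     import json
>>
>>     # Hypothetical limit; could become a new [core] config option.
>>     MAX_RENDERED_FIELD_SIZE = 4096  # bytes
>>
>>     # Unlikely-to-collide placeholder stored instead of the huge value.
>>     TRUNCATED_PLACEHOLDER = {
>>         "error": "rendered field too large to store and display",
>>     }
>>
>>     def cap_rendered_field(value):
>>         """Return value, or the placeholder if its JSON form is too big."""
>>         serialized = json.dumps(value, default=str)
>>         if len(serialized.encode("utf-8")) > MAX_RENDERED_FIELD_SIZE:
>>             return TRUNCATED_PLACEHOLDER
>>         return value
>>
>> The web server would then check for the placeholder and render a
>> friendly message instead of the raw JSON.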
>>
>> If we have such an "upper-bounded", reasonable size for each row in
>> that table - then I am perfectly happy with not deleting the rows
>> automatically.
>> But only if we limit that and handle the "worst" case properly.
>>
>> That would be my proposal for handling it in a way that takes both
>> views into account.
>>
>> What do you, Andrey (and others), think? Does it make sense? Or do we
>> think we should not have any such protections in place?
>>
>> J.
>>
>>
>> On Tue, Jan 31, 2023 at 9:43 PM Andrey Anshin <[email protected]>
>> wrote:
>> >
>> > > I think that is all something to be tested with explain plans.
>> >
>> > I would be really careful with these results. A DELETE in a
>> transaction with a rollback usually looks more optimistic than an
>> actually executed one.
>> >
>> >
>> > > I think we would not know before we try - and possibly there are
>> > other optimisation approaches. The optimisation I proposed was only
>> > the first that came to my mind to avoid the "not in" query. The
>> > problem with a "not in" query is that there is no way for the DB to
>> > optimise it. Effectively you have to fetch every record (or index
>> > entry) and test it. Maybe it can be done better :). And yes, locking
>> > the index with anti-insert locks and the need to rebalance trees
>> > during the delete is a concern.
>> >
>> > My point is still the same: I would rather remove it in the future or
>> make the maintenance policy more consistent: all or nothing. Right now
>> we are closer to nothing than to all.
>> >
>> >
>> > > It's not about index size or JSON access. It is about the size of the
>> > actual rows and the storage they take - i.e. the general size of the
>> > database.
>> >
>> > I'm tired, but I'm not sure that I understand the actual point.
>> > Does the size of the table really matter if you always access it by
>> this pattern:
>> > one request which returns exactly one record, accessed by a unique index?
>> > Basically the query traverses the index and finds a reference to a single
>> CTID/rowid (or whatever name is used in other RDBMSes).
>> > So at this point, how fast your index grows matters more than the
>> table size.
>> >
>> >
>> > > The problem with it is that (especially with dynamic task mapping), it
>> > might grow really, really fast. Basically you have NUM_DAGS x
>> > NUM_TASKS * NUM_MAP_INDEXES * NUM_TEMPLATED_FIELDS * NUM_RUNS number
>> > of records there.
>> > Back-of-the-envelope calculation: assuming you have a
>> > DAG with 10 dynamically mapped tasks with 100 mapped indexes with 10
>> > fields, each field evaluating to a 1K string, you have 10 tasks *
>> > 100 map indexes * 10 fields * 1K rendered string size = 10MB to store
>> > per one(!) run of one(1) DAG. Run it every 10 minutes, and your
>> > database grows by a whopping 1.44 GB of data every single day (from a
>> > single DAG).
>> >
>> > It depends on the DB. If we talk about Postgres, you could easily be
>> off by a factor of 3-4 (thanks to inline compression before/instead of
>> TOAST).
>> > I have a couple of questions:
>> > 1. Do you know how big the indexes in
>> rendered_task_instance_fields would be after one day? (Spoiler alert: I
>> could give an estimation in the morning.)
>> > 2. In this scenario, with default settings, we would always keep up to
>> 30,000 TIs for this DAG.
>> > Could someone who wants to optimize the query make it more optimal
>> than accessing the table by index (unique index -> ctid/rowid ->
>> record), and where is the crossover point?
>> >
>> > > This is of course an estimation that assumes a lot, but it's not at
>> all unrealistic.
>> >
>> > 144,000 TIs per day on a single DAG (initially I wanted to put some
>> sarcastic message here).
>> > How would Airflow feel with 144k Task Instances per day?
>> > I ask because right now I've always had a task_instance table bigger
>> than rendered_task_instance_fields.
>> >
>> >
>> > > This table is very specific compared with the other tables. The only
>> > reason for it being here is to be able to show the rendered fields in
>> > the UI if you go to the specific run of a task. If you clean up other
>> > tables, you basically lose the history of execution of the tasks: you
>> > cannot really know if the data has been processed, you cannot do
>> > backfills effectively, you lose all the context. Cleaning this table
>> > is merely about the values that have been rendered for a specific
>> > run, and the assumption there is that the older it gets, the less
>> > interesting it is.
>> >
>> > What about these tables: session, log, job? I expected the answer would
>> be "They are not so specific."
>> > For me, every table is specific to its purpose.
>> > Users have often asked (Slack, issues, discussions) for the ability to
>> auto-maintain tables/logs and received the usual answer: "No, we don't
>> give you this option, please use something from the Airflow Ecosystem
>> page". But on the other hand we have auto-maintenance only for a single table.
>> >
>> > >> It is the opposite of what we have right now: we scan tables (maybe
>> multiple times) and read all record tuples, which contain JSON.
>> > > Not sure if I get the point here :). Yes - in my proposal I think the
>> records will not be touched - only indexes. So the cleanup should be way
>> faster and contention less of a problem; due to the way the delete
>> uses < ordering, deadlocks will not be possible at all (as opposed to
>> the current "not in", where there is a very easy way to get into
>> deadlocks when two parallel deletes are trying to delete the same rows
>> in a different sequence). I think my proposal improves all the
>> characteristics of the "cleanup" with very little extra penalty on
>> record creation.
>> >
>> > I was talking about the current solution and why it is also slow (and
>> if, say, users use some DBaaS where they also pay for IOPS, then it is
>> expensive too). Let's talk about the benefits of the optimised queries
>> for the 4 different DB backends (3 if excluding SQLite) when we have
>> them.
>> >
>> > > We pay for table/index size linearly: more records, more size. But other
>> operations vary and depend on the B-Tree implementation, and usually the
>> cost grows logarithmically. Or do we worry only about table/toast/index
>> size on disk?
>> > >> Yep. I (and I believe the original author had the same worry) am
>> worried a lot about the size of the table and the fact that this table will
>> be by far the biggest table in our DB while most of the old records will
>> never be touched. And by the fact that this is the only table that our
>> users will have to know about to clean up separately from all others pretty
>> much always.
>> >
>> > Same as previous.
>> >
>> > > Even if we are not worrying about the money spent by our users, and
>> the performance degradation that comes with bigger databases - that's a
>> lot of environmental impact that we might incur. Airflow is heavily
>> used; if suddenly all our users start having 10x bigger databases than
>> they have now, because we deprecate the cleanup and keep all the
>> history, then that's a big number of extra disks that will have to be
>> used. I'd strongly prefer a solution where we keep the data usage lower
>> in this case.
>> >
>> > Am I right that this is all about "let's not delete by default, as we
>> do for other tables" rather than about the current default
>> implementation?
>> > Because I got the opposite result from what you said: in my tests,
>> rendered_task_instance_fields doesn't grow faster than the other
>> tables.
>> > I would like to compare it with other findings and some reproducible
>> metrics rather than with hypothetical things.
>> >
>> > ----
>> > Best Wishes
>> > Andrey Anshin
>> >
>> >
>> >
>> > On Tue, 31 Jan 2023 at 11:12, Jarek Potiuk <[email protected]> wrote:
>> >>
>> >> COMMENT: While writing the answer here, I think I found a deeper
>> >> problem (and an optimisation needed) - i.e. I think the delete should
>> >> be even more fine-grained than it is today and include map_index -
>> >> please take a look at the end (also maybe TP might comment on that
>> >> one).
>> >>
>> >> > 1. Additional indexes add performance degradation on INSERT but
>> give potential improvements on DELETE (and unknown on UPDATE); the
>> RDBMS still has to rebalance the index and keep it consistent with the
>> table.
>> >> > 2. LIMIT x OFFSET y could easily become a full seq scan, especially
>> if the user sets a huge number for the offset (which? unknown).
>> >> > 3. Mixing two indexes could improve performance in a single query,
>> but in concurrent execution it might lead to degradation, because it
>> needs to create a bitmap table for comparison between these two
>> indexes; as a result this might lead to different issues, such as OOM
>> on the DB backend, swap usage, or the optimiser deciding it is better
>> not to use these indexes.
>> >>
>> >> I think that is all something to be tested with explain plans. I
>> >> think we would not know before we try - and possibly there are other
>> >> optimisation approaches. The optimisation I proposed was only the
>> >> first that came to my mind to avoid the "not in" query. The problem
>> >> with a "not in" query is that there is no way for the DB to optimise
>> >> it. Effectively you have to fetch every record (or index entry) and
>> >> test it. Maybe it can be done better :). And yes, locking the index
>> >> with anti-insert locks and the need to rebalance trees during the
>> >> delete is a concern.
>> >>
>> >> > Is it a real problem? As long as we access the table only by
>> indexes, which don't include this JSON, it really doesn't matter. I
>> guess we should almost always make a UNIQUE INDEX SCAN to SELECT or
>> DELETE (UPDATE) a single record.
>> >>
>> >> Yes, I think so - and while I was not the author of this "cleanup"
>> >> code, I believe I know the intention.
>> >>
>> >> It's not about index size or JSON access. It is about the size of
>> >> the actual rows and the storage they take - i.e. the general size of
>> >> the database. The problem with it is that (especially with dynamic
>> >> task mapping) it might grow really, really fast. Basically you have
>> >> NUM_DAGS * NUM_TASKS * NUM_MAP_INDEXES * NUM_TEMPLATED_FIELDS *
>> >> NUM_RUNS records there. Back-of-the-envelope calculation: assuming
>> >> you have a DAG with 10 dynamically mapped tasks with 100 mapped
>> >> indexes with 10 fields, each field evaluating to a 1K string, you
>> >> have 10 tasks * 100 map indexes * 10 fields * 1K rendered string
>> >> size = 10MB to store per one(!) run of one(1) DAG. Run it every 10
>> >> minutes, and your database grows by a whopping 1.44 GB of data every
>> >> single day (from a single DAG). This is of course an estimation that
>> >> assumes a lot, but it's not at all unrealistic. That's a lot. And if
>> >> we want the user to do the cleanup, then a) they need to know about
>> >> it, b) they need to specifically clean up this table only, because
>> >> all the other data is relatively small. This table is very specific
>> >> compared with the other tables. The only reason for it being here is
>> >> to be able to show the rendered fields in the UI if you go to the
>> >> specific run of a task. If you clean up other tables, you basically
>> >> lose the history of execution of the tasks: you cannot really know
>> >> if the data has been processed, you cannot do backfills effectively,
>> >> you lose all the context. Cleaning this table is merely about the
>> >> values that have been rendered for a specific run, and the
>> >> assumption there is that the older it gets, the less interesting it
>> >> is.
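>> >>
>> >> A quick sanity check of those numbers (decimal units, same
>> >> assumptions as above):
>> >>
>> >>     tasks, map_indexes, fields = 10, 100, 10
>> >>     rendered_kb = 1  # ~1K rendered string per templated field
>> >>
>> >>     per_run_mb = tasks * map_indexes * fields * rendered_kb / 1000
>> >>     runs_per_day = 24 * 60 // 10  # one run every 10 minutes -> 144
>> >>     per_day_gb = per_run_mb * runs_per_day / 1000
>> >>
>> >>     print(per_run_mb, "MB per run;", per_day_gb, "GB per day")
>> >>     # 10.0 MB per run; 1.44 GB per day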
>> >>
>> >> > It is the opposite of what we have right now: we scan tables (maybe
>> multiple times) and read all record tuples, which contain JSON.
>> >>
>> >> Not sure if I get the point here :). Yes - in my proposal I think the
>> >> records will not be touched - only indexes. So the cleanup should be
>> >> way faster and contention less of a problem; due to the way the
>> >> delete uses < ordering, deadlocks will not be possible at all (as
>> >> opposed to the current "not in", where there is a very easy way to
>> >> get into deadlocks when two parallel deletes are trying to delete the
>> >> same rows in a different sequence). I think my proposal improves all
>> >> the characteristics of the "cleanup" with very little extra penalty
>> >> on record creation.
>> >>
>> >> > We pay for table/index size linearly: more records, more size. But
>> other operations vary and depend on the B-Tree implementation, and
>> usually the cost grows logarithmically. Or do we worry only about
>> table/toast/index size on disk?
>> >>
>> >> Yep. I (and I believe the original author had the same worry) am
>> >> worried a lot about the size of the table and the fact that this
>> >> table will be by far the biggest table in our DB while most of the
>> >> old records will never be touched. And by the fact that this is the
>> >> only table that our users will have to know to clean up separately
>> >> from all the others, pretty much always. Even if we are not worrying
>> >> about the money spent by our users and the performance degradation
>> >> that comes with bigger databases - that's a lot of environmental
>> >> impact that we might incur. Airflow is heavily used; if suddenly all
>> >> our users start having 10x bigger databases than they have now,
>> >> because we deprecate the cleanup and keep all the history, then
>> >> that's a big number of extra disks that will have to be used. I'd
>> >> strongly prefer a solution where we keep the data usage lower in
>> >> this case.
>> >>
>> >> > If we do not want to grant users the ability to clean up the
>> rendered templates table, there could be another option:
>> >> > - Do not delete records on every task instance run.
>> >> > - Delete once per defined period (hourly, daily, weekly, monthly).
>> In that case you would not really need to care about locks.
>> >>
>> >> Yes, we could come up with a different strategy as to "when" to run
>> >> the cleanup. This is also a viable option. If you can propose one
>> >> that will be equally adaptive as the current solution, I am all ears.
>> >> Basically my goal is to keep the usage of the table low, possibly
>> >> controlled by the same parameter we had. How we do it - this is a
>> >> different story. If we - for example - add a thread in the scheduler
>> >> that performs such cleanup effectively in parallel and scales, I am
>> >> happy with that.
>> >>
>> >> But I am trying to get into the head of the author to understand why
>> >> the original implementation was done this way. I believe (and maybe
>> >> those who remember it better could confirm it) that distributing the
>> >> deletion to tasks, so each cleans up after itself, is a better idea
>> >> than centralising the cleanup. This makes each cleanup smaller, locks
>> >> are held for a shorter time (at least that was the assumption when no
>> >> full table scan was used), it is more "immediate", and you do not
>> >> have to decide upfront what the cleanup frequency should be. It seems
>> >> this is the best logical approach to keep the
>> >> "MAX_NUM_RENDERED_TI_FIELDS_PER_TASK" promise. Simply: after a task
>> >> is complete, you can be sure that there are no more than this number
>> >> of fields per task in the DB. With a scheduled run, the consistency
>> >> would be much more "eventual" and it would potentially fluctuate much
>> >> more.
>> >>
>> >> But there are risks involved in having a single thread doing the
>> >> cleanup. I think it has a huge risk of being a "stop-the-world" and
>> >> "deadlock-prone" kind of event - if in big instances there are a lot
>> >> of rows to clean up in a single pass. When you delete entries from a
>> >> table, anti-insert locks are created for the existing index entries,
>> >> which makes it possible to roll back the whole DELETE transaction.
>> >> This means that when you try to insert a row with the same index key,
>> >> the index will be held. And this means that when you run a single
>> >> huge DELETE affecting multiple rows, with multiple (all?) index keys
>> >> matching the select query, it will effectively prevent new rows with
>> >> the same indexes matching the SELECT from being inserted. So if you
>> >> have some tasks running while deleting existing run_id rendered
>> >> fields, then you could REALLY start having deadlocks on those tasks
>> >> trying to insert rendered task instance rows. That's why I think the
>> >> only viable strategy for a single "cleanup" thread is to do such
>> >> cleanup as a separate DELETE for each "dag/task/map_index/run" - in
>> >> order to avoid such deadlocks. Which effectively turns into what we
>> >> have currently - only that currently those transactions are done by
>> >> tasks, not by a single cleanup thread.
>> >>
>> >> Also, using tasks to delete old rows is more "effective" when you
>> >> have vast differences in the frequency of DAGs. Naturally - when you
>> >> do it in the task, you only do it "when needed" for the given DAG +
>> >> task. If you try to centralize the cleanup, then unless you somehow
>> >> include the schedule and frequency of each dag, you are going to
>> >> check every DAG every time you run the cleanup - no matter if that
>> >> DAG runs daily or every minute, you have to run the cleanup
>> >> frequently enough to match your most frequent dags. If you have 1000
>> >> dags that run hourly and one DAG that runs every minute, then you
>> >> have to run a cleanup job that scans all DAGs every few minutes.
>> >> That's a big waste.
>> >>
>> >> So I am not sure we gain anything by centralizing the cleanup.
>> >> Decentralising it to the task seems to be a well-thought-out and
>> >> sound decision (but I think the problem we have now is that we need
>> >> to optimize it after Dynamic Task Mapping has been added).
>> >>
>> >> ANOTHER FINDING:
>> >>
>> >> While looking at the code and discussing it, and looking more
>> >> closely, I **think** there is another problem that we have to fix
>> >> regardless of the solution. I THINK the problem we might have now is
>> >> that we do not include map_index in this DELETE. We currently delete
>> >> all the rendered task fields without including map_index - and for
>> >> big dynamic tasks, it means that exactly the same DELETE query is run
>> >> by every single mapped instance of that task, and that is where a lot
>> >> of contention and locking might happen (basically, when a single task
>> >> instance does the delete, anti-insert locks keep the other mapped
>> >> instances of the same task from inserting rendered fields).
>> >>
>> >> It does not change much in my optimisation proposal, other than that
>> >> we should include map_index in those queries. But I think this might
>> >> cause a lot of delays in the current implementation.
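>> >>
>> >> Roughly, a per-task-instance cleanup scoped by map_index could look
>> >> like the sketch below (not the actual implementation - the function
>> >> name is made up, and how run_ids_to_keep is computed is exactly the
>> >> part being debated above):
>> >>
>> >>     from sqlalchemy import delete
>> >>
>> >>     from airflow.models.renderedtifields import (
>> >>         RenderedTaskInstanceFields as RTIF,
>> >>     )
>> >>
>> >>     def cleanup_rendered_fields(session, ti, run_ids_to_keep):
>> >>         # Scoping by map_index means parallel mapped instances of the
>> >>         # same task no longer issue identical DELETEs and block each
>> >>         # other with anti-insert locks.
>> >>         session.execute(
>> >>             delete(RTIF).where(
>> >>                 RTIF.dag_id == ti.dag_id,
>> >>                 RTIF.task_id == ti.task_id,
>> >>                 RTIF.map_index == ti.map_index,
>> >>                 RTIF.run_id.not_in(run_ids_to_keep),
>> >>             )
>> >>         )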
>> >>
>> >> J.
>> >>
>> >> > ----
>> >> > Best Wishes
>> >> > Andrey Anshin
>> >> >
>> >> >
>> >> >
>> >> > On Mon, 30 Jan 2023 at 23:51, Jarek Potiuk <[email protected]> wrote:
>> >> >>
>> >> >> I think there is a good reason to clean those up automatically.
>> >> >> Rendered task instance fields are almost arbitrary in size. If we
>> >> >> try to keep all historical values there by default, there are
>> >> >> numerous cases where they will grow very fast - far, far too
>> >> >> quickly.
>> >> >>
>> >> >> And I am not worried at all about locks on this table if we do it
>> >> >> the way I described and it uses the indexes. The contention this
>> >> >> way might only be between two deleting tasks, and with the query I
>> >> >> proposed they will only last for a short time - the index will be
>> >> >> locked during the two DELETEs or the SELECT DISTINCT - which should
>> >> >> both be fast.
>> >> >>
>> >> >>
>> >> >> On Mon, Jan 30, 2023 at 8:37 PM Andrey Anshin <
>> [email protected]> wrote:
>> >> >> >
>> >> >> > I guess two things were involved in reducing the performance of
>> this query over time: Dynamic Task Mapping and run_id instead of
>> execution date.
>> >> >> >
>> >> >> > I still personally think that changing the default value from 30
>> to 0 might improve the performance of multiple concurrent tasks, just
>> because this query does not run and there are no locks on multiple
>> records/pages.
>> >> >> >
>> >> >> > I do not have any proof (yet?) other than simple DAGs. I think
>> there is some crossover point where letting this table grow becomes
>> worse than cleaning up on each TI run. But users have the ability to
>> clean up the table by executing airflow db clean, which should improve
>> performance again.
>> >> >> >
>> >> >> > And also there is interesting behavior with this query: if a user
>> already has more records than the value specified by
>> AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK and tries to run a
>> backfill, then the rendered templates are not written to the table (or
>> maybe they are inserted and immediately deleted afterwards); the same is
>> valid for clearing old tasks.
>> >> >> >
>> >> >> > ----
>> >> >> > Best Wishes
>> >> >> > Andrey Anshin
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Sun, 29 Jan 2023 at 14:16, Jarek Potiuk <[email protected]>
>> wrote:
>> >> >> >>
>> >> >> >> Yep. Agree this is not an efficient query, and dynamic task
>> >> >> >> mapping makes the effect much worse. Generally speaking,
>> >> >> >> selecting "what should be left" and then deleting stuff where
>> >> >> >> the key is "not in" is never an efficient way of running an SQL
>> >> >> >> query. And the query not using an index at all makes it rather
>> >> >> >> terrible.
>> >> >> >>
>> >> >> >> I think we should not deprecate it though, but find a more
>> >> >> >> efficient way of deleting the old keys. I think we could
>> >> >> >> slightly denormalize the RenderedTaskInstance + DagRun tables
>> >> >> >> and add DAG_RUN_EXECUTION_DATE to the RenderedTaskInstance
>> >> >> >> table; that would be enough to optimise it.
>> >> >> >>
>> >> >> >> Then we could have either:
>> >> >> >>
>> >> >> >> * a composite (non-unique) B-TREE index on DAG_ID, TASK_ID,
>> >> >> >> RUN_ID_EXECUTION_DATE
>> >> >> >> * or maybe even a regular HASH index on DAG_ID, TASK_ID and a
>> >> >> >> separate (non-unique) B-TREE index on just RUN_ID_EXECUTION_DATE
>> >> >> >>
>> >> >> >> Probably the latter is better, as I am not sure what <, >
>> >> >> >> comparisons look like for composite B-TREE indexes when char +
>> >> >> >> date columns are mixed. Also, we could hit the infamous MySQL
>> >> >> >> index key length limit.
>> >> >> >>
>> >> >> >> Then the deletion process would look roughly like:
>> >> >> >>
>> >> >> >> 1) dag_run_execution_date = SELECT RUN_ID_EXECUTION_DATE FROM
>> >> >> >> RENDERED_TASK_INSTANCE_FIELDS WHERE DAG_ID = <DAG_ID>
>> >> >> >> AND TASK_ID = <TASK_ID> GROUP BY RUN_ID_EXECUTION_DATE
>> >> >> >> ORDER BY RUN_ID_EXECUTION_DATE DESC LIMIT 1 OFFSET
>> >> >> >> <MAX_NUM_RENDERED_TI_FIELDS_PER_TASK>
>> >> >> >> 2) DELETE FROM RENDERED_TASK_INSTANCE_FIELDS WHERE DAG_ID = <DAG_ID>
>> >> >> >> AND TASK_ID = <TASK_ID>
>> >> >> >> AND RUN_ID_EXECUTION_DATE < dag_run_execution_date
>> >> >> >>
>> >> >> >>
>> >> >> >> I believe that would be fast, and it would use the B-TREE index
>> >> >> >> features nicely (ordering support)
>> >> >> >>
>> >> >> >> J
>> >> >> >>
>> >> >> >> On Sun, Jan 29, 2023 at 2:09 AM Andrey Anshin <
>> [email protected]> wrote:
>> >> >> >> >
>> >> >> >> > First of all, I want to highlight that this approach, I guess,
>> worked well until Dynamic Task Mapping was introduced.
>> >> >> >> >
>> >> >> >> > > The main reason for adding that cleanup was -- if you don't
>> do that, you will have many rows, similar to the TaskInstance table
>> >> >> >> >
>> >> >> >> > The problem itself is not how big your table/indexes are, but
>> rather what kind of operations you run.
>> >> >> >> >
>> >> >> >> > > Do you have any data for locks or performance degradation?
>> >> >> >> >
>> >> >> >> > In this case, if we try to clean up the
>> rendered_task_instance_fields table when a new TI is created/cleared, we
>> make almost two full/sequential scans (note: need to check) against the
>> table without any index usage, so we pay here a couple of times:
>> >> >> >> > 1. We scan without indexes - not all parts of the composite
>> key are included in the query, plus we need to filter out everything
>> except 30 records, with ORDER BY and DISTINCT
>> >> >> >> > 2. After that we make another full scan to find 1 record or
>> map_size records
>> >> >> >> >
>> >> >> >> > And I guess the situation becomes worse if you have a lot of
>> tasks: even with a small table, we do inefficient operations.
>> >> >> >> >
>> >> >> >> > That is how the query plan looks (please note that without
>> committing the transaction, the DELETE operation doesn't show all the
>> information):
>> https://gist.github.com/Taragolis/3ca7621c51b00f077aa1646401ddf31b
>> >> >> >> >
>> >> >> >> > In case we do not clean up the table, we only use these
>> operations:
>> >> >> >> > 1. SELECT a single record by index
>> >> >> >> > 2. INSERT a new record
>> >> >> >> > 3. DELETE old record(s) found by index
>> >> >> >> >
>> >> >> >> > I have not done any real tests yet, only synthetic DAGs (so we
>> should not treat any findings as the total truth):
>> https://gist.github.com/Taragolis/6eec9f81efdf360c4239fc6ea385a480
>> >> >> >> > DAG with parallel tasks: degradation up to 2-3 times
>> >> >> >> > DAG with single mapped tasks: degradation up to 7-10 times
>> >> >> >> >
>> >> >> >> > I have a plan for more complex and closer-to-real use cases,
>> with a database whose network latency is not almost 0, as it is on my
>> local machine.
>> >> >> >> > But I would not object if someone also ran their own tests with
>> AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK=0 vs the default value.
>> >> >> >> >
>> >> >> >> > About the deadlock: we know it exists at least in MySQL:
>> https://github.com/apache/airflow/pull/18616
>> >> >> >> >
>> >> >> >> > > And the larger tables create problems during database
>> migrations.
>> >> >> >> >
>> >> >> >> > That is a very good point, so if we find that the problem is
>> only related to migrations, we could:
>> >> >> >> > 1. Clean up this table in the migration
>> >> >> >> > 2. Add a CLI command to airflow db which could clean up only
>> rendered fields, so it would be the user's choice whether to clean up
>> before a migration and whether to do periodic maintenance
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > ----
>> >> >> >> > Best Wishes
>> >> >> >> > Andrey Anshin
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Sat, 28 Jan 2023 at 23:41, Kaxil Naik <[email protected]>
>> wrote:
>> >> >> >> >>>
>> >> >> >> >>> Could it potentially be a good idea to deprecate this
>> option and recommend that users set it to 0? WDYT? Maybe someone has
>> already tried or investigated this?
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> The main reason for adding that cleanup was -- if you don't
>> do that, you will have many rows, similar to the TaskInstance table. And
>> the RenderedTIFields were mainly added for checking rendered TI fields on
>> the Webserver, because after DAG Serialization the webserver doesn't
>> have access to DAG files.
>> >> >> >> >>
>> >> >> >> >> And the larger tables create problems during database
>> migrations.
>> >> >> >> >>
>> >> >> >> >> Do you have any data for locks or performance degradation?
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Sat, 28 Jan 2023 at 13:06, Andrey Anshin <
>> [email protected]> wrote:
>> >> >> >> >>>
>> >> >> >> >>> Greetings!
>> >> >> >> >>>
>> >> >> >> >>> While migrating our ORM syntax to be compatible with
>> SQLAlchemy 2.0, I probably found some skeletons in the closet.
>> >> >> >> >>>
>> >> >> >> >>> Let's start from the beginning: initially I got this warning
>> >> >> >> >>>
>> >> >> >> >>> airflow/models/renderedtifields.py:245
>> RemovedIn20Warning('ORDER BY columns added implicitly due to DISTINCT is
>> deprecated and will be removed in SQLAlchemy 2.0.  SELECT statements with
>> DISTINCT should be written to explicitly include the appropriate columns in
>> the columns clause (Background on SQLAlchemy 2.0 at:
>> https://sqlalche.me/e/b8d9)')
>> >> >> >> >>>
>> >> >> >> >>> "OK let's fix it!", I thought at first and started to
>> investigate RenderedTaskInstanceFields model
>> >> >> >> >>>
>> >> >> >> >>> Skeleton #1:
>> >> >> >> >>>
>> >> >> >> >>> When I first looked at the code and comments, it got me
>> thinking that the part which keeps only the latest N rendered task
>> fields could potentially lead to various kinds of performance
>> degradation (locks, deadlocks, data bloat): see the code at
>> https://github.com/apache/airflow/blob/6c479437b1aedf74d029463bda56b42950278287/airflow/models/renderedtifields.py#L185-L245
>> >> >> >> >>>
>> >> >> >> >>> Also, this historical part (from Airflow 1.10.10) generates
>> the following SQL statement (Postgres backend):
>> >> >> >> >>>
>> >> >> >> >>> DELETE FROM rendered_task_instance_fields
>> >> >> >> >>> WHERE rendered_task_instance_fields.dag_id = %(dag_id_1)s
>> >> >> >> >>>   AND rendered_task_instance_fields.task_id = %(task_id_1)s
>> >> >> >> >>>   AND (
>> >> >> >> >>>     (
>> >> >> >> >>>       rendered_task_instance_fields.dag_id,
>> >> >> >> >>>       rendered_task_instance_fields.task_id,
>> >> >> >> >>>       rendered_task_instance_fields.run_id
>> >> >> >> >>>     ) NOT IN (
>> >> >> >> >>>       SELECT
>> >> >> >> >>>         anon_1.dag_id,
>> >> >> >> >>>         anon_1.task_id,
>> >> >> >> >>>         anon_1.run_id
>> >> >> >> >>>       FROM
>> >> >> >> >>>         (
>> >> >> >> >>>           SELECT DISTINCT
>> >> >> >> >>>             rendered_task_instance_fields.dag_id AS dag_id,
>> >> >> >> >>>             rendered_task_instance_fields.task_id AS task_id,
>> >> >> >> >>>             rendered_task_instance_fields.run_id AS run_id,
>> >> >> >> >>>             dag_run.execution_date AS execution_date
>> >> >> >> >>>           FROM rendered_task_instance_fields
>> >> >> >> >>>             JOIN dag_run
>> >> >> >> >>>               ON rendered_task_instance_fields.dag_id = dag_run.dag_id
>> >> >> >> >>>               AND rendered_task_instance_fields.run_id = dag_run.run_id
>> >> >> >> >>>           WHERE
>> >> >> >> >>>             rendered_task_instance_fields.dag_id = %(dag_id_2)s
>> >> >> >> >>>             AND rendered_task_instance_fields.task_id = %(task_id_2)s
>> >> >> >> >>>           ORDER BY
>> >> >> >> >>>             dag_run.execution_date DESC
>> >> >> >> >>>           LIMIT %(param_1)s
>> >> >> >> >>>         ) AS anon_1
>> >> >> >> >>>     )
>> >> >> >> >>>   )
>> >> >> >> >>>
>> >> >> >> >>> This is especially ineffective in PostgreSQL: an IN
>> subquery can easily be transformed internally into a SEMI-JOIN (aka
>> EXISTS clause), but this does not work for a NOT IN subquery, because it
>> is not transformed into an ANTI JOIN (aka NOT EXISTS clause) even when
>> it would be possible, see: https://commitfest.postgresql.org/27/2023/
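>> >> >> >> >>>
>> >> >> >> >>> For illustration, the same "keep only the newest N" cleanup
>> could be phrased as a NOT EXISTS anti-join that the planner can
>> optimise. A simplified SQLAlchemy sketch (a stand-in model, not the real
>> Airflow code; picking the run_ids to keep by recency is elided):
>> >> >> >> >>>
>> >> >> >> >>> from sqlalchemy import Column, String, delete, select
>> >> >> >> >>> from sqlalchemy.orm import declarative_base
>> >> >> >> >>>
>> >> >> >> >>> Base = declarative_base()
>> >> >> >> >>>
>> >> >> >> >>> class RTIF(Base):  # simplified stand-in for the real model
>> >> >> >> >>>     __tablename__ = "rendered_task_instance_fields"
>> >> >> >> >>>     dag_id = Column(String(250), primary_key=True)
>> >> >> >> >>>     task_id = Column(String(250), primary_key=True)
>> >> >> >> >>>     run_id = Column(String(250), primary_key=True)
>> >> >> >> >>>
>> >> >> >> >>> # run_ids we want to KEEP (recency ordering elided here).
>> >> >> >> >>> keep = (
>> >> >> >> >>>     select(RTIF.run_id)
>> >> >> >> >>>     .where(RTIF.dag_id == "my_dag", RTIF.task_id == "my_task")
>> >> >> >> >>>     .limit(30)
>> >> >> >> >>>     .subquery()
>> >> >> >> >>> )
>> >> >> >> >>>
>> >> >> >> >>> # NOT EXISTS instead of NOT IN: planned as an anti-join.
>> >> >> >> >>> stmt = delete(RTIF).where(
>> >> >> >> >>>     RTIF.dag_id == "my_dag",
>> >> >> >> >>>     RTIF.task_id == "my_task",
>> >> >> >> >>>     ~select(keep.c.run_id)
>> >> >> >> >>>     .where(keep.c.run_id == RTIF.run_id)
>> >> >> >> >>>     .exists(),
>> >> >> >> >>> )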
>> >> >> >> >>>
>> >> >> >> >>> I didn't do any performance benchmarks yet, but I guess that
>> if users set AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK to 0
>> rather than the default 30, it could improve performance and reduce the
>> number of deadlocks; the table size will increase, but we don't do any
>> maintenance jobs for the other tables either.
>> >> >> >> >>>
>> >> >> >> >>> Could it potentially be a good idea to deprecate this
>> option and recommend that users set it to 0? WDYT? Maybe someone has
>> already tried or investigated this?
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> Skeleton #2:
>> >> >> >> >>>
>> >> >> >> >>> We have a k8s_pod_yaml field which is exclusively used by
>> K8S executors.
>> >> >> >> >>>
>> >> >> >> >>> Should we also decouple this field as part of AIP-51?
>> >> >> >> >>>
>> >> >> >> >>> ----
>> >> >> >> >>> Best Wishes
>> >> >> >> >>> Andrey Anshin
>> >> >> >> >>>
>>
>
