Maybe the “auto-maintenance” is true. Hopefully for not long though :) On Thu, 2 Feb 2023 at 01:41, Kaxil Naik <[email protected]> wrote:
> > >Users often asked (slack, issues, discussions) to give the ability to >> auto-maintain tables/logs and receive the usual answer: "No, we don't give >> you this opportunity, please use something for Airflow Ecosystem Page". But >> on the other hand we have auto-maintenance only for a single table. > > > That is not true, we added "airflow db clean" command to give this control > to users. > > What about these tables: session, log, job? I expected the answer would be >> "They are not so specific." > > The RTIF can vary in size based on what is stored and is really useless > beyond last X for each task. RTIF also has a foreign key constraints > <https://github.com/apache/airflow/blob/d80b583db07197c8c3d0549a805e83ceaaf10d52/airflow/models/renderedtifields.py#L63-L73> > on > TI and that is the most common table that is affected during the migrations > and can affect scheduling decisions. Log and Session tables aren't affected > by it. > > I will have a detailed reply if I manage to find the time, it has just > been too difficult > > On Tue, 31 Jan 2023 at 21:46, Jarek Potiuk <[email protected]> wrote: > >> BTW, I would really love to hear what the original authors have to say >> here. I am merely trying to put myself in their shoes and guess what >> the reasoning is :). >> >> I think this is really a question of: Do we want to keep all rendered >> arbitrary size rendered task instance fields in our database forever >> by default, same as other fields. >> I believe the original authors answered the question to be "no". And >> the "num_row" was a way to limit it. >> >> And I really do not want to "win" this argument, I just want to >> protect our users (and environment). >> >> There is (IMHO) currently a big difference between >> session/logg/task_instance fields and rendered task_instance fields >> that justify different behaviour. >> >> The former are generally fixed in max size of rows (this is one of the >> reasons we have limited string sizes in our DB) - to be able to limit >> them growing uncontrollably large. We simply do not keep arbitrary >> size data in those tables. >> On the other hand, the rendered task instance is arbitrary in size >> (JSONField) and the need for deletion accounts for the "worst" case >> scenario. >> Until we get rid of that "property" of the rendered task instance >> table, I think "just" skipping deletion of those fields without fixing >> the "worst" case scenario is not a good idea. >> >> Maybe in your test cases (and many others) those tables are not >> bigger, but I think the protection here is implemented to account for >> the case where the rendered task instance field is "big". I think the >> protection here is done for the cases where the rendered task instance >> fields are really "big". >> >> But very interestingly - if the rendered task instance is "big" then >> likely it is next to useless to be displayed in the Web UI in its >> entirety. >> >> So maybe you are actually right Andrey, Maybe we can skip deleting >> those and maybe we could solve it differently and apply the same rules >> as other tables? >> >> Let me then - constructively - propose another idea which actually >> might solve both yours and my own concerns. Maybe we can fix the >> "worst case" scenario differently? We do not have to limit the number >> of rows, we can limit the size of the row instead. >> >> Why don't we simply limit the size of the rendered task instance JSON >> and if they are too big (we can configure the maximum size), we will >> render something like (probably a bit more unique and not >> "accidentally triggerable"): >> >> { >> "error": "This task instance has too large rendered task instances >> to display it" >> } >> >> And implement an escape hatch in the web server to handle it properly >> when displaying such "truncated" rendered task instance field. >> >> We should be able come up with a sensible max size that we might think >> makes sense when rendering it in the web UI. And we could make the >> max size configurable by the user if they have a lot of GB to spare. >> We could even apply it automatically. If there is a max_num_row_limit >> - we allow any size, if we have no limit on the number of rows, we >> limit the maximum row size. >> >> If we have such "upper-bounded" reasonable size of each row in that >> table - then I am perfectly happy with not deleting the rows >> automatically. >> But only if we limit that and handle the "worst" case properly. >> >> That would be my proposal how we can handle it to get both views taken >> into account. >> >> What do you Andrey (and others) think? Does it make sense? Or do we >> think we should not have any such protections in place ? >> >> J. >> >> >> On Tue, Jan 31, 2023 at 9:43 PM Andrey Anshin <[email protected]> >> wrote: >> > >> > > I think that is all something to be tested with explaining plans. >> > >> > I would be really careful with these results. DELETE in Transaction >> with Rollback usually shows more optimistic than actually executed. >> > >> > >> > > I think we would not know before we try - and possibly there are other >> > optimisation approaches. The optimisation I proposed was only first >> > that came to my mind to avoid the "not in" query. The problem with >> > "not in query" is that there is no way to optimise it by the DB. >> > Effectively you have to get every record (or index entry) and test it. >> > Maybe it can be done better :). And yes locking the index with >> > anti-insert locks and the need to rebalance trees during the delete is >> > a concern. >> > >> > My point is still the same, I would rather remove it in the future or >> make policy about maintenance more consistent: all or nothing. Right now we >> are close to nothing rather than all. >> > >> > >> > > It's not about index size or JSON access. It is about the size of the >> > actual rows and storage it takes - i.e. general size of the database. >> > >> > I'm tired, but I'm not sure that I understand the actual point. >> > Is it not really a matter of size of the table if you always access by >> pattern: >> > 1 request which returns exactly 1 record accessed by a unique index. >> > Basically query travercial by index find reference to single CTID/rowid >> (or whatever name used in other RDBMS). >> > So at this point it really matters how fast your index grows rather >> than table size. >> > >> > >> > > The problem with it is that (especially with dynamic task mapping), it >> > might grow really, really fast. Basically you have NUM_DAGS x >> > NUM_TASKS * NUM_MAP_INDEXES * NUM_TEMPLATED_FIELDS * NUM_RUNS number >> > of records there. >> > Back-of-the envelope calculation Assuming you have a >> > DAG with 10 dynamically mapped tasks with 100 mapped indexes with 10 >> > fields, each field evaluating to 1K string. Then you have 10 tasks * >> > 100 map indexes * 10 fields * 1K rendered string size = 10MB to store >> > per one(!) run of one(1) DAG. Run it every 10 minutes and every day >> > your database from a single DAG grows by whooping 1.44 GB of data >> > every single day (from single DAG). >> > >> > Depends on DB. If we talk about Postgres you could easily miss up to >> 3-4 times (thanks for inline compression before/instead of TOASTs). >> > I have a couple questions: >> > 1. Do you know how big would be indexes in >> rendered_task_instance_fields after one day? (Spoiler alert I could give >> estimation in the morning) >> > 2. In this scenario with default settings always would keep up to 30 >> 000 TI for this DAG. >> > Could someone who wants to optimize the query make it more optimal >> rather than access the table by index (Unique Index -> ctid/rowid - record) >> and where this crosspoint? >> > >> > > This is of course an estimation that assumes a lot, but it's not at >> all unrealistic. >> > >> > 144,000 TI per/day on single DAG (initially I want to put here some >> sarcastic message). >> > How would Airflow feel with 144k Task Instances per day? How >> > I asked because right now I've always had a task_instance table bigger >> than rendered_task_instance_fields. >> > >> > >> > > This table is very specific compared with the other tables. The only >> reason for it being here is to be able >> > to show the rendered fields in the UI if you go to the specific run of >> a task. If you clean-up other tables you basically lose the history of >> execution of the tasks and you cannot really know if the data has been >> > processed, you cannot do backfills effectively, you lose all the >> > context. Cleaning this table is merely about the values that have been >> > rendered for a specific run and the assumption there is that the older >> > it gets, the less interesting it is. >> > >> > What about these tables: session, log, job? I expected the answer would >> be "They are not so specific." >> > For me every table is specific for their purpose. >> > Users often asked (slack, issues, discussions) to give the ability to >> auto-maintain tables/logs and receive the usual answer: "No, we don't give >> you this opportunity, please use something for Airflow Ecosystem Page". But >> on the other hand we have auto-maintenance only for a single table. >> > >> > >> It is opposite of what we have right now, we scan tables (maybe >> multiple times), read all records tuples which contain JSON. >> > > Not sure if I get the point here :). Yes -in my proposal I think the >> records will not be touched - only indexes. So the cleanup should be way >> faster, contentions less of problem, due to the way the delete >> > uses < ordering, deadlocks will not be possible at all (as opposed to >> the current "not in" - there is a very easy way to get into deadlocks when >> two parallel deletes are trying to delete same rows in a different >> sequence. I think my proposal improves all the characteristics of the >> "cleanup" with very little extra penalty on record creation. >> > >> > I was talking about the current solution and why it is also slow (and >> if abstract users use some DBaaS where they also pay for IOPs then it is >> expensive). Let's talk about the benefits of optimised queries for 4 >> different DB backends (3 if excluding SQLite) when we have it. >> > >> > > We pay for table/index size linary more records, more size. But other >> operations vary and depend on B-Tree implementation and usually it has >> logarithmic growth. Or do we worry only about table/toast/index size on >> disk? >> > >> Yep. I (and I believe the original author had the same worry) am >> worried a lot about the size of the table and the fact that this table will >> be by far the biggest table in our DB while most of the old records will >> never be touched. And by the fact that this is the only table that our >> users will have to know about to clean up separately from all others pretty >> much always. >> > >> > Same as previous. >> > >> > > If not even worrying about money spent by our users, and performance >> degradation that comes with databases that are bigger - that's a lot of >> environmental effects that we might incur. Airflow is heavily used, if >> suddenly all our users >> > will start having 10 bigger databases that they have now because we >> will deprecate the values and keep all the history, then we have a big >> number of extra disks that will have to be used. I'd strongly prefer a >> solution where we keep the data usage lower in this case. >> > >> > Am I right that this is all about "lets don't delete by default as we >> do for other tables" rather than the current default implementation? >> > Because I get the result which is opposite what you said. And >> rendered_task_instance_fields don't grow faster than other tables that what >> I got. >> > I would like to compare it with other findings and some reproducible >> metrics rather than with hypothetical things. >> > >> > ---- >> > Best Wishes >> > Andrey Anshin >> > >> > >> > >> > On Tue, 31 Jan 2023 at 11:12, Jarek Potiuk <[email protected]> wrote: >> >> >> >> COMMENT: While writing the answer here, I think I found a deeper >> >> problem (and optimisation needed) - i.e I think the delete should be >> >> even more fine-grained than it is today and include map_index) - >> >> please take a look at the end (Also maybe TP might comment on that >> >> one). >> >> >> >> > 1. Additional indexes add additional performance degradation on >> Insert but gain potential improvements on delete and unknown on update, >> RDBMS still require rebalance index and make it consistent to the table. >> >> > 2. LIMIT x OFFSET y could easily become full seq scan, especially if >> the user set a huge number for offset (which? unknown). >> >> > 3. Mixing two indexes could improve performance in a single query >> but in concurrent execution might lead to degradation because it needs to >> create a bitmap table for comparison between these two indexes, as result >> it might lead different issues, such as OOM on DB backend, use swaps or >> optimiser decided that better not to use this indexes. >> >> >> >> I think that is all something to be tested with explain plans. I think >> >> we would not know before we try - and possibly there are other >> >> optimisation approaches. The optimisation I proposed was only first >> >> that came to my mind to avoid the "not in" query. The problem with >> >> "not in query" is that there is no way to optimise it by the DB. >> >> Effectively you have to get every record (or index entry) and test it. >> >> Maybe it can be done better :). And yes locking the index with >> >> anti-insert locks and the need to rebalance trees during the delete is >> >> a concern. >> >> >> >> > Is it a real problem? Until we access only by indexes, which doesn't >> include this JSON, it really doesn't matter. I guess we almost always >> should make a UNIQUE INDEX SCAN for SELECT or DELETE (UPDATE) a single >> record. >> >> >> >> Yes I think so, and while. I was not the author of this "cleanup" >> >> code, I believe I know the intention. >> >> >> >> It's not about index size or JSON access. It is about the size of the >> >> actual rows and storage it takes - i.e. general size of the database. >> >> The problem with it is that (especially with dynamic task mapping), it >> >> might grow really, really fast. Basically you have NUM_DAGS x >> >> NUM_TASKS * NUM_MAP_INDEXES * NUM_TEMPLATED_FIELDS * NUM_RUNS number >> >> of records there. Back-of-the envelope calculation Assuming you have a >> >> DAG with 10 dynamically mapped tasks with 100 mapped indexes with 10 >> >> fields, each field evaluating to 1K string. Then you have 10 tasks * >> >> 100 map indexes * 10 fields * 1K rendered string size = 10MB to store >> >> per one(!) run of one(1) DAG. Run it every 10 minutes and every day >> >> your database from a single DAG grows by whooping 1.44 GB of data >> >> every single day (from single DAG).This is of course an estimation >> >> that assumes a lot, but it's not at all unrealistic. That's a lot. And >> >> if we want the user to do the cleanup then a) they need to know it b) >> >> they need to specifically clean up this table only because all the >> >> other data is relatively small. This table is very specific compared >> >> with the other tables. The only reason for it being here is to be able >> >> to show the rendered fields in the UI if you go to the specific run of >> >> a task. If you clean-up other tables you basically lose the history of >> >> execution of the tasks and you cannot really know if the data has been >> >> processed, you cannot do backfills effectively, you lose all the >> >> context. Cleaning this table is merely about the values that have been >> >> rendered for a specific run and the assumption there is that the older >> >> it gets, the less interesting it is. >> >> >> >> > It is opposite of what we have right now, we scan tables (maybe >> multiple times), read all records tuples which contain JSON. >> >> >> >> Not sure if I get the point here :). Yes -in my proposal I think the >> >> records will not be touched - only indexes. So the cleanup should be >> >> way faster, contentions less of problem, due to the way the delete >> >> uses < ordering, deadlocks will not be possible at all (as opposed to >> >> the current "not in" - there is a very easy way to get into deadlocks >> >> when two parallel deletes are trying to delete same rows in a >> >> different sequence. I think my proposal improves all the >> >> characteristics of the "cleanup" with very little extra penalty on >> >> record creation. >> >> >> >> > We pay for table/index size linary more records, more size. But >> other operations vary and depend on B-Tree implementation and usually it >> has logarithmic growth. Or do we worry only about table/toast/index size on >> disk? >> >> >> >> Yep. I (and I believe the original author had the same worry) am >> >> worried a lot about the size of the table and the fact that this table >> >> will be by far the biggest table in our DB while most of the old >> >> records will never be touched. And by the fact that this is the only >> >> table that our users will have to know about to clean up separately >> >> from all others pretty much always. If not even worrying about money >> >> spent by our users, and performance degradation that comes with >> >> databases that are bigger - that's a lot of environmental effects that >> >> we might incur. Airflow is heavily used, if suddenly all our users >> >> will start having 10 bigger databases that they have now because we >> >> will deprecate the values and keep all the history, then we have a big >> >> number of extra disks that will have to be used. I'd strongly prefer a >> >> solution where we keep the data usage lower in this case. >> >> >> >> > If we do not want to grant users the ability to clean up rendered >> templates tables, there could be another option: >> >> > - Do not delete records on every task instance run. >> >> > - Delete once per defined period (hourly, daily, weekly, monthly). >> In this case you really could not care about locks. >> >> >> >> Yes we could come up with a different strategy as to "when" run the >> >> cleanup. This is also a viable option. If you can propose one that >> >> will be equally adaptive as the current solution, I am all ears. >> >> Basically my goal is to keep the usage of the table low, possibly >> >> controlled by the same parameter we had. How we do it - this is a >> >> different story. If we - for example add a thread in the scheduler >> >> (for example) that performs such cleanup effectively in parallel and >> >> scales, I am happy with that. >> >> >> >> But I am trying to get into the head of the author trying to >> >> understand why the original implementation was done this way. I >> >> believe (and maybe those who remember it better could confirm it) that >> >> distributing the deletion to tasks to clean up after itself is a >> >> better idea than centralising the cleanup. This makes each cleanup >> >> smaller, locks are held for a shorter time (at least that was the >> >> assumption where no full table scan was used), it is more "immediate" >> >> and you do not have to decide upfront what should be the cleanup >> >> frequency. It seems this is the best logical approach to keep the >> >> "MAX_NUM_RENDERED_TI_FIELDS_PER_TASK" promise. Simply after task is >> >> complete, you can be sure that there are no more than this number of >> >> fields per task in the DB. With a scheduled run, that would be a much >> >> more "eventual" consistency and it will be potentially fluctuating >> >> much more. >> >> >> >> But there are risks involved in having a single thread doing the >> >> cleanup. I think it has a huge risk of being a "stop-the world" and >> >> "deadlock-prone" kind of event - if in big instances there are a lot >> >> of rows to cleanup in a single pass. When you delete entries from a >> >> table, there are anti-insert locks created for existing index entries, >> >> which makes it possible to rollback the whole DELETE transaction. >> >> Which means that when you try to insert the row with the same index, >> >> the index will be held. And this means that when you run a single huge >> >> DELETE for multiple rows affected with multiple (all?) index keys >> >> matching select query, it will effectively prevent new rows with the >> >> same indexes that are matching the SELECT. It would mean that if you >> >> have some tasks running while deleting existing run_id rendered >> >> fields, then you could REALLY start having deadlocks on those tasks >> >> trying to insert rendered task instance rows. That's why I think the >> >> only viable strategy for single "cleanup" thread is to do such cleanup >> >> as separate DELETE for each of the "dag/task/map_index/run" - in order >> >> to avoid such deadlocks. Which effectively will turn into what have >> >> currently - only that currently those transactions are done by tasks, >> >> not by a single cleanup thread. >> >> >> >> Also using tasks to delete old rows is more "effective" when you have >> >> vast differences in frequency of DAGs. Naturally - when you do it in >> >> task, you will only do it "when needed" for given DAG + Task. If you >> >> try to centralize the cleanup, unless you include somehow schedule and >> >> frequency of each dag, you are going to check every DAG every time >> >> your run the cleanup - no matter if that DAG is run daily or every >> >> minute, you will have to run the cleanup frequently enough to match >> >> your most frequent dags. If you have 1000 dags that run hourly and one >> >> DAG that runs every minue, then you have to run a cleanup job that >> >> scans all DAGs every few minutes. That's a big waste. >> >> >> >> So I am not sure if we gain anything by centralizing the cleanup. >> >> Decentralising it to Task seems to be a well thought and sound >> >> decision (but I think the problem we have now is that we need to >> >> optimize it after Dynamic Task Mapping has been added). >> >> >> >> ANOTHER FINDING: >> >> >> >> While looking at the code and discussing it and looking more closely I >> >> **think** there is another problem that we have to fix regardless of a >> >> solution. I THINK a problem we might have now is that we do not >> >> include map_index in this DELETE. While we are curreently delete all >> >> the rendered task fields without including map_index - and for big >> >> dynamic tasks, it means that exacly the same DELETE query is run by >> >> every single mapped instance of that tasks and that is where a lot of >> >> contention and locking might happen (basically when single task >> >> instance does the delete, anti-insert locks held the other mapped >> >> instances of the same task from inserting rendered fields). >> >> >> >> It does not change much in the optimisation proposal of mine, other >> >> than we should include map_index in those queries. But I think this >> >> might cause a lot of delays in the current implementation. >> >> >> >> J. >> >> >> >> > ---- >> >> > Best Wishes >> >> > Andrey Anshin >> >> > >> >> > >> >> > >> >> > On Mon, 30 Jan 2023 at 23:51, Jarek Potiuk <[email protected]> wrote: >> >> >> >> >> >> I think there is a good reason to clean those up automatically. >> >> >> rendered task instance fields are almost arbitrary in size. If we >> try >> >> >> to keep all historical values there by default, there are numerous >> >> >> cases it will grow very fast - far, far too quickly. >> >> >> >> >> >> And I am not worried at all about locks on this table if we do it >> the >> >> >> way I described it and it uses the indexes. The contention this way >> >> >> might only be between the two deleting tasks. and with the query I >> >> >> proposed, they will only last for a short time - the index will be >> >> >> locked when two DELETES or SELECT DISTINCT - which should both be >> >> >> fast. >> >> >> >> >> >> >> >> >> On Mon, Jan 30, 2023 at 8:37 PM Andrey Anshin < >> [email protected]> wrote: >> >> >> > >> >> >> > I guess two things involved to reduce performance on this query >> through the time: Dynamic Task Mapping and run_id instead of execution date. >> >> >> > >> >> >> > I still personally think that changing the default value from 30 >> to 0 might improve performance of multiple concurrent tasks, just because >> this query does not run and there are no locks on multiple records/pages. >> >> >> > >> >> >> > I do not have any proof (yet?) other than simple DAGs. I think >> that there is some cross point exists when keeping this table growth worse >> rather than cleanup for each TI run. But users have ability to cleanup >> table by execute airflow db clean which should improve performance again >> >> >> > >> >> >> > And also there is interesting behavior with this query: if user >> already have more that value specified by >> AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK and tried run backfill >> than rendered templates not written to table (or may be inserted and after >> that immediately deleted), the same is valid for cleanup old tasks. >> >> >> > >> >> >> > ---- >> >> >> > Best Wishes >> >> >> > Andrey Anshin >> >> >> > >> >> >> > >> >> >> > >> >> >> > On Sun, 29 Jan 2023 at 14:16, Jarek Potiuk <[email protected]> >> wrote: >> >> >> >> >> >> >> >> Yep. Agree this is not an efficient query and dynamic task >> mapping >> >> >> >> makes the effect much worse. Generally speaking, selecting "what >> >> >> >> should be left" and then deleting stuff where the key is "not >> in" is >> >> >> >> never an efficient way of running an sql query. And the query >> not >> >> >> >> using index at all makes it rather terrible. >> >> >> >> >> >> >> >> I think we should not deprecate it though, but find a more >> efficient >> >> >> >> way of deleting the old keys. I think we could slightly >> denormalize >> >> >> >> RenderedTaskInstance + DagRun tables, and add >> DAG_RUN_EXECUTION_DATE >> >> >> >> to the RenderedTaskInstance table and that will be enough to >> optimise >> >> >> >> it. >> >> >> >> >> >> >> >> Then we could have either: >> >> >> >> >> >> >> >> * a composite B-TREE indexed (non-unique) index on DAG_ID, >> TASK_ID, >> >> >> >> RUN_ID_EXECUTION_DATE >> >> >> >> * or maybe even regular HASH index on DAG_ID, TASK_ID and >> separate >> >> >> >> B-TREE index (non-unique) on just RUN_ID_EXECUTION_DATE >> >> >> >> >> >> >> >> Probably the latter is better as I am not sure how < , > >> comparison >> >> >> >> looks like for composite B-TREE indexes when char + date columns >> are >> >> >> >> mixed. Also we could have hit the infamous MySQL index key length >> >> >> >> limit. >> >> >> >> >> >> >> >> Then deletion process would look roughly like: >> >> >> >> >> >> >> >> 1) dag_run_execution_date = SELECT RUN_ID_EXECUTION_DATE FROM >> >> >> >> RENDERED_TASK_INSTANCE_FIELDS WHERE DAG_ID =<DAG_ID>, >> >> >> >> TASK_ID=<TASK_ID> ORDER BY RUN_ID_EXECUTION_DATE GROUP BY >> >> >> >> RUN_ID_EXECUTION_DATE DESC LIMIT 1 OFFSET >> >> >> >> <MAX_NUM_RENDERED_TI_FIELDS_PER_TASK> >> >> >> >> 2) DELETE FROM RENDERED_TASK_INSTANCE_FIELDS WHERE DAG_ID >> =<DAG_ID>, >> >> >> >> TASK_ID=<TASK_ID> AND RENDER_TIME < dag_run_execution_date >> >> >> >> >> >> >> >> I believe that would be fast, and it would use the B-TREE index >> >> >> >> features nicely (ordering support) >> >> >> >> >> >> >> >> J >> >> >> >> >> >> >> >> On Sun, Jan 29, 2023 at 2:09 AM Andrey Anshin < >> [email protected]> wrote: >> >> >> >> > >> >> >> >> > First of all I want to highlight that this approach I guess >> worked well until Dynamic Task Mappings introduced. >> >> >> >> > >> >> >> >> > > The main reason for adding that cleanup was -- if you don't >> do that, you will have many rows, similar to the TaskInstance table >> >> >> >> > >> >> >> >> > The problem itself is not how big your table/indexes, rather >> then what kind of operation you run. >> >> >> >> > >> >> >> >> > > Do you have any data for locks or performance degradation? >> >> >> >> > >> >> >> >> > In this case if we try to clean up >> rendered_task_instance_fields table when a new TI is created/cleared we >> make almost two full/sequential scans (note: need to check) against the >> table without any index usage, so we pay here a couple times: >> >> >> >> > 1. We scan without indexes - not all parts of the composite >> key are included to query, plus we need to filter everything except 30 >> records with order and distinct >> >> >> >> > 2. After that we make another full scan for find 1 record or >> map_size records >> >> >> >> > >> >> >> >> > And I guess the situation becomes worse if you have a lot of >> tasks, even if we have a small table, we need to do ineffective operations. >> >> >> >> > >> >> >> >> > That how looks like Query Plan (please note without commit >> transaction DELETE operation doesn't have all information): >> https://gist.github.com/Taragolis/3ca7621c51b00f077aa1646401ddf31b >> >> >> >> > >> >> >> >> > In case if we do not clean up the table, we only use these >> operations: >> >> >> >> > 1. SELECT single record by index >> >> >> >> > 2. INSERT new record >> >> >> >> > 3. DELETE old record(s), which were found by index. >> >> >> >> > >> >> >> >> > I have not done any real tests yet, only synthetic DAGs (so we >> should not consider to use any findings as totally truth): >> https://gist.github.com/Taragolis/6eec9f81efdf360c4239fc6ea385a480 >> >> >> >> > DAG with parallel tasks: degradation up to 2-3 times >> >> >> >> > DAG with single map tasks: degradation up to 7-10 times >> >> >> >> > >> >> >> >> > I have a plan for more complex and more close to real use >> cases with Database which do not have network latency almost 0 as I have in >> my local. >> >> >> >> > But I will not refuse if someone also does their tests with >> AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK=0 vs default value. >> >> >> >> > >> >> >> >> > About deadlock we know that it exists at least in MySQL: >> https://github.com/apache/airflow/pull/18616 >> >> >> >> > >> >> >> >> > > And the larger tables create problems during database >> migrations. >> >> >> >> > >> >> >> >> > That is a very good point, so if we found that problem only >> related to migrations we could: >> >> >> >> > 1. Cleanup this table in migration >> >> >> >> > 2. Add cli command to airflow db which could cleanup only >> rendered fields, so it would be user's choice cleanup or not before >> migration, do periodical maintenance or not >> >> >> >> > >> >> >> >> > >> >> >> >> > ---- >> >> >> >> > Best Wishes >> >> >> >> > Andrey Anshin >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > On Sat, 28 Jan 2023 at 23:41, Kaxil Naik <[email protected]> >> wrote: >> >> >> >> >>> >> >> >> >> >>> Potentially it is a good idea to deprecate this option and >> recommend for users to set it to 0? WDYT? Maybe someone has already tried >> or investigated this? >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> The main reason for adding that cleanup was -- if you don't >> do that, you will have many rows, similar to the TaskInstance table. And >> the RenderedTIFields were mainly added for checking rendered TI fields on >> the Webserver only because after DAG Serialization, the webserver won't >> have access to DAG files. >> >> >> >> >> >> >> >> >> >> And the larger tables create problems during database >> migrations. >> >> >> >> >> >> >> >> >> >> Do you have any data for locks or performance degradation? >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Sat, 28 Jan 2023 at 13:06, Andrey Anshin < >> [email protected]> wrote: >> >> >> >> >>> >> >> >> >> >>> Greetings! >> >> >> >> >>> >> >> >> >> >>> During migrating our ORM syntax to compatible with >> SQLAlchemy 2.0 I probably found skeletons in the closet. >> >> >> >> >>> >> >> >> >> >>> Let's start from the beginning, initially I got this warning >> >> >> >> >>> >> >> >> >> >>> airflow/models/renderedtifields.py:245 >> RemovedIn20Warning('ORDER BY columns added implicitly due to DISTINCT is >> deprecated and will be removed in SQLAlchemy 2.0. SELECT statements with >> DISTINCT should be written to explicitly include the appropriate columns in >> the columns clause (Background on SQLAlchemy 2.0 at: >> https://sqlalche.me/e/b8d9)') >> >> >> >> >>> >> >> >> >> >>> "OK let's fix it!", I thought at first and started to >> investigate RenderedTaskInstanceFields model >> >> >> >> >>> >> >> >> >> >>> Skeleton #1: >> >> >> >> >>> >> >> >> >> >>> When I first time look on the code and comments it got me to >> thinking that part which keep only latest N Rendered Task Fields >> potentially could lead different performance degradation (Locks, Dead >> Locks, Data Bloating): see code >> https://github.com/apache/airflow/blob/6c479437b1aedf74d029463bda56b42950278287/airflow/models/renderedtifields.py#L185-L245 >> >> >> >> >>> >> >> >> >> >>> Also this historical part (from Airflow 1.10.10) generate >> this SQL Statement (pg backend) >> >> >> >> >>> >> >> >> >> >>> DELETE FROM rendered_task_instance_fields >> >> >> >> >>> WHERE rendered_task_instance_fields.dag_id = %(dag_id_1) s >> >> >> >> >>> AND rendered_task_instance_fields.task_id = %(task_id_1) s >> >> >> >> >>> AND ( >> >> >> >> >>> ( >> >> >> >> >>> rendered_task_instance_fields.dag_id, >> >> >> >> >>> rendered_task_instance_fields.task_id, >> >> >> >> >>> rendered_task_instance_fields.run_id >> >> >> >> >>> ) NOT IN ( >> >> >> >> >>> SELECT >> >> >> >> >>> anon_1.dag_id, >> >> >> >> >>> anon_1.task_id, >> >> >> >> >>> anon_1.run_id >> >> >> >> >>> FROM >> >> >> >> >>> ( >> >> >> >> >>> SELECT DISTINCT >> >> >> >> >>> rendered_task_instance_fields.dag_id AS dag_id, >> >> >> >> >>> rendered_task_instance_fields.task_id AS task_id, >> >> >> >> >>> rendered_task_instance_fields.run_id AS run_id, >> >> >> >> >>> dag_run.execution_date AS execution_date >> >> >> >> >>> FROM rendered_task_instance_fields >> >> >> >> >>> JOIN dag_run ON >> rendered_task_instance_fields.dag_id = dag_run.dag_id >> >> >> >> >>> AND rendered_task_instance_fields.run_id = >> dag_run.run_id >> >> >> >> >>> WHERE >> >> >> >> >>> rendered_task_instance_fields.dag_id = >> %(dag_id_2) s >> >> >> >> >>> AND rendered_task_instance_fields.task_id = >> %(task_id_2) s >> >> >> >> >>> ORDER BY >> >> >> >> >>> dag_run.execution_date DESC >> >> >> >> >>> limit %(param_1) s >> >> >> >> >>> ) AS anon_1 >> >> >> >> >>> ) >> >> >> >> >>> ) >> >> >> >> >>> >> >> >> >> >>> Which is especially not effective in PostgreSQL. When IN >> SUBQUERY could be easily transform internaly into SEMI-JOIN (aka EXISTS >> clause), but it is not working for NOT IN SUBQUERY because it is not >> transformed into ANTI JOIN (aka NOT EXISTS clause) even if it possible, >> see: https://commitfest.postgresql.org/27/2023/ >> >> >> >> >>> >> >> >> >> >>> I didn't do any performance benchmarks yet but I guess if >> users set AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK to 0 rather >> than default 30 it could improve performance and reduce number of >> DeadLocks, however the table size will increase but I think we don't do any >> maintenance job for other tables. >> >> >> >> >>> >> >> >> >> >>> Potentially it is a good idea to deprecate this option and >> recommend for users to set it to 0? WDYT? Maybe someone has already tried >> or investigated this? >> >> >> >> >>> >> >> >> >> >>> >> >> >> >> >>> Skeleton #2: >> >> >> >> >>> >> >> >> >> >>> We have a k8s_pod_yaml field which is exclusively used by >> K8S executors. >> >> >> >> >>> >> >> >> >> >>> Should we also decouple this field as part of AIP-51? >> >> >> >> >>> >> >> >> >> >>> ---- >> >> >> >> >>> Best Wishes >> >> >> >> >>> Andrey Anshin >> >> >> >> >>> >> >
