The thing is that it is "last count per DAG file". I do not think we can actually calculate this per DAG. We could split the total number of queries by the number of DAGs in the file, but this may be confusing.
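To illustrate what "per DAG file" means here, below is a rough sketch of counting the queries issued while one file is parsed. It is not how the DAG processor is implemented; it uses the standard library `sqlite3` trace callback as a stand-in for the Airflow database, and `count_queries`, `parse_dag_file` and the `variable` table are hypothetical names made up for the example.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def count_queries(conn):
    """Count SQL statements executed on `conn` inside the with-block."""
    counter = {"queries": 0}

    def on_statement(_sql):
        # sqlite3 calls this once for every SQL statement it runs.
        counter["queries"] += 1

    conn.set_trace_callback(on_statement)
    try:
        yield counter
    finally:
        # Always detach the callback, even if parsing raised.
        conn.set_trace_callback(None)


def parse_dag_file(conn):
    """Hypothetical stand-in for parsing one DAG file.

    The three lookups mimic top-level variable reads that run on every parse.
    """
    for key in ("env", "owner", "schedule"):
        conn.execute("SELECT val FROM variable WHERE key = ?", (key,)).fetchone()
    # Pretend the file defines two DAGs.
    return ["dag_a", "dag_b"]


# In-memory database standing in for the Airflow metadata database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variable (key TEXT PRIMARY KEY, val TEXT)")
conn.commit()

with count_queries(conn) as stats:
    dags = parse_dag_file(conn)

print(f"dags={len(dags)} last_run_queries={stats['queries']}")
```

The count belongs to the file as a whole: the sketch cannot tell which of the two DAGs caused which query, which is why dividing the total by the number of DAGs would be misleading.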
On Fri, Jun 14, 2024 at 12:24 PM Jarek Potiuk <[email protected]> wrote:

> > the cardinality of those logs is too high.
>
> I was thinking about only showing "last count per DAG" - then cardinality
> would be "good enough" I think. It could also be exposed via metrics now
> that I think of it - no real need to see it in UI or API.
>
> On Fri, Jun 14, 2024 at 12:14 PM Kaxil Naik <[email protected]> wrote:
>
>> Yeah, valuable to show it in logs. For showing it in a web server or
>> storing it in DB, the cardinality of those logs is too high.
>>
>> On Fri, 14 Jun 2024 at 11:09, Eugen Kosteev <[email protected]> wrote:
>>
>> > Yeah, I also think it is a good idea to expose it in the Airflow UI.
>> >
>> > Although, atm we do not have an entity such as DAG file (and this metric is
>> > per DAG file) in Airflow database, so we would need to design it a little
>> > bit.
>> > And attaching to the DAG model is not correct.
>> >
>> > But I totally agree, it would be good to have it in Airflow UI as well for
>> > "operation users" to have access to this information.
>> >
>> > On Fri, Jun 14, 2024 at 11:22 AM Jarek Potiuk <[email protected]> wrote:
>> >
>> > > Good idea, it would also be good if we could have access to the information
>> > > exposed in the UI - so that "operations users" can see it and maybe even
>> > > act on it + API/ CLI to check it. I think in the future of Airflow 3 where
>> > > we will have task isolation, having `0` for all the DAGs will be a
>> > > prerequisite for switching to "task isolation" mode and this could be
>> > > actually verified in a migration tool.
>> > >
>> > > On Fri, Jun 14, 2024 at 10:59 AM Eugen Kosteev <[email protected]> wrote:
>> > >
>> > > > Hi.
>> > > >
>> > > > I would like to discuss the proposal of adding a new column to the "DAG
>> > > > File Processing Stats" of DAG processor logs.
>> > > >
>> > > > Currently in the logs of DAG processor, there is following data
>> > > > (screenshot below) that includes # of DAGs, runtime, etc. per DAG file.
>> > > > [image: image.png]
>> > > >
>> > > > It seems that it would be beneficial to have also there data about the
>> > > > number of queries performed to the Airflow database during parsing of each
>> > > > file.
>> > > > It maybe convenient to have it in case of debugging issues related to high
>> > > > load on Airflow database, e.g. typical scenario is when DAG file(s) have
>> > > > a lot of queries to database done on the top level of code and those are
>> > > > executed each time during parsing of these DAG files.
>> > > > One common example is excessive usage of "Variables.get" as top-level
>> > > > statements in DAG files.
>> > > >
>> > > > Having information about "number of queries to Airflow database" per DAG
>> > > > file may help a lot during debugging issues related to high load on
>> > > > database or issues related to long parsing of the DAG files.
>> > > >
>> > > > One caveat is that due to e.g. caching enabled for Variables or because of
>> > > > other reasons (dynamic DAGs), number of queries may be very different for
>> > > > each parsing of the DAG file,
>> > > > but at least we can have it as "Last Run Number of Queries" - that would
>> > > > already give some idea and engineer can also review logs historically to
>> > > > see its data in the past.
>> > > >
>> > > > What are your thoughts?
>> > > >
>> > > > --
>> > > > Eugene
>> > > >
>> > >
>> >
>> >
>> > --
>> > Eugene
>> >
>>

--
Eugene
