I think you put your finger on it. If you have a frequent query against a
large-ish table that cannot leverage an index, that will result in a heavy
workload.
If I were in your shoes I'd run a CREATE INDEX statement against that
table/field and see how it reduces your resource consumption and make…
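A minimal sketch of what that suggestion could look like with SQLAlchemy (Airflow's own DB layer). The function name, index name, and connection URL are assumptions for illustration; point `db_url` at your actual Airflow metadata database (e.g. MySQL) before running.

```python
# Hedged sketch: add an index on task_instance.state via SQLAlchemy.
# "ti_state" and add_state_index are illustrative names, not Airflow's.
from sqlalchemy import create_engine, text

def add_state_index(db_url):
    """Create an index on task_instance.state to speed up state-filtered queries."""
    engine = create_engine(db_url)
    with engine.begin() as conn:  # begin() commits the DDL on exit
        conn.execute(text("CREATE INDEX ti_state ON task_instance (state)"))
```

On MySQL you would then compare `EXPLAIN` output for the frequent query before and after to confirm the index is actually used.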
Thanks for the reply.
We are using 1.7.1.3 and it looks like the index is not there.
https://github.com/apache/incubator-airflow/blob/1.7.1.3/airflow/models.py#L660-#L664
Is Airflow 1.8 officially released? I saw the version tag and discussion,
but haven't seen it on PyPI.
I did run Dan's SQL statement.
Wait. That field does have an index and it looks like Dan added it 8 months
ago.
https://github.com/apache/incubator-airflow/blame/master/airflow/models.py#L744
Here's the related DB migration script:
https://github.com/apache/incubator-airflow/blob/master/airflow/migrations/versions/211e584da130_
We will need to come up with a plan soon (better DB indexes and/or the
ability to rotate out old task instances according to some policy). Nothing
concrete as of yet though.
On Tue, Mar 7, 2017 at 6:18 PM, Jason Chen
wrote:
> Hi Dan,
>
> Thanks so much. This is exactly what I am looking for.
>
Hi Dan,
Thanks so much. This is exactly what I am looking for.
Is there a plan on the future Airflow roadmap to clean this up at the
Airflow system level? Say, a setting in airflow.cfg to clean up data older
than a specified age.
Your solution is to run an Airflow job to clean up the data. That'…
FWIW we use the following DAG at Airbnb to reap the task instances table
(this is a stopgap):
# DAG to delete old TIs so that UI operations on the webserver are fast.
# This DAG is a stopgap; ideally we would make the UI not query all task
# instances and add indexes to the task_instance table whe…
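The quoted DAG is truncated above; a minimal sketch of the reaper logic such a stopgap task might run is below. The function name, `db_url`, and the 90-day retention default are assumptions; the `task_instance` table and its `execution_date` column follow Airflow's schema.

```python
# Hedged sketch of a task_instance "reaper": delete rows older than a
# retention window. Not Airbnb's actual DAG, just an illustration.
from datetime import datetime, timedelta
from sqlalchemy import create_engine, text

def reap_old_task_instances(db_url, max_age_days=90):
    """Delete task_instance rows older than the cutoff; return rows deleted."""
    cutoff = datetime.utcnow() - timedelta(days=max_age_days)
    engine = create_engine(db_url)
    with engine.begin() as conn:
        result = conn.execute(
            text("DELETE FROM task_instance WHERE execution_date < :cutoff"),
            {"cutoff": cutoff},
        )
        return result.rowcount
```

In a real DAG this would be wrapped in a PythonOperator on a daily schedule, and you would batch the deletes on a large table to avoid long-running transactions.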
Hi Bolke,
Thanks, but it looks like you are actually talking about Harish's use case.
My use case is about 50 DAGs (each one with about 2-3 tasks). I feel our
run interval setting for the DAGs is too short (~15 mins). It may result in
high CPU on MySQL.
Meanwhile, I dug into MySQL and I noticed a fre…
Hi Jason
I think you need to back it up with more numbers. You assume that a load of
100% is bad and also that 16 GB of memory is a lot.
30x25 = 750 tasks per hour = 12.5 tasks per minute. For every task we
launch a couple of processes (at least 2) that do not share memory; this is
to ensure tasks…
I see.
Thanks.
Airflow team,
I noticed a frequently running SQL statement, shown below. It runs without
a proper index on column task_instance.state.
Shouldn't we index "state", given that there could be millions of rows in
task_instance?
"SELECT task_instance.task_id AS task_instance_task_id,
task_instance.dag_id A…
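One way to check the claim before filing a fix is to list the indexes on the table. A small sketch using SQLAlchemy's inspector, which works across MySQL, Postgres, and SQLite; the function name and `db_url` are assumptions:

```python
# Hedged sketch: return True if any index on `table` covers `column`.
from sqlalchemy import create_engine, inspect

def column_is_indexed(db_url, table="task_instance", column="state"):
    """Check whether `column` appears in any index on `table`."""
    insp = inspect(create_engine(db_url))
    return any(column in ix["column_names"] for ix in insp.get_indexes(table))
```

On MySQL directly, `SHOW INDEX FROM task_instance;` answers the same question.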
It does and does not.
Say scheduler heartbeat = 30 sec: you will see a spiky CPU consumption
graph every 30 seconds.
But we did not go that route and kept the scheduler heartbeat = 5 sec so
that we do not lose time when a task is ready to run (I think there is
another known bug here - tasks don't…
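For context on the knobs being discussed: the heartbeat settings live in the `[scheduler]` section of airflow.cfg. The values below are the stock defaults in the 1.7/1.8 line as I understand them; treat this as an illustration and check the defaults shipped with your version.

```ini
[scheduler]
# How often (in seconds) the scheduler wakes up to schedule task instances.
scheduler_heartbeat_sec = 5
# How often a running job heartbeats back to the metadata DB.
job_heartbeat_sec = 5
```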
Hi Harish,
Thanks for the fast response and feedback.
Yeah, I want to see the fix or more discussion!
BTW, I assume that, given your 30 DAGs, Airflow runs fine after you
increased the heartbeat?
The default is 5 secs.
Thanks.
Jason
On Tue, Mar 7, 2017 at 10:24 AM, harish singh
wrote:
> I…
I had seen similar behavior a year ago, when we had fewer than 5 DAGs. Even
then the CPU utilization was reaching 100%.
One way to deal with this is: you could play with the "heartbeat" numbers
(i.e. increase the heartbeat).
But then you are introducing more delay to start jobs that are ready to run
(ready t…