Maybe we can start with ".with_hint(TI, 'USE INDEX (PRIMARY)', dialect_name='mysql')"
and see if other databases exhibit the same query plan issue?

Best,
Arthur

On Mon, Jan 23, 2017 at 3:27 PM, Chris Riccomini <criccom...@apache.org> wrote:

> With this patch:
>
> $ git diff
> diff --git a/airflow/jobs.py b/airflow/jobs.py
> index f1de333..9d08e75 100644
> --- a/airflow/jobs.py
> +++ b/airflow/jobs.py
> @@ -544,6 +544,7 @@ class SchedulerJob(BaseJob):
>                  .query(
>                      TI.task_id,
>                      func.max(TI.execution_date).label('max_ti'))
> +                .with_hint(TI, 'USE INDEX (PRIMARY)')
>                  .filter(TI.dag_id == dag.dag_id)
>                  .filter(TI.state == State.SUCCESS)
>                  .filter(TI.task_id.in_(dag.task_ids))
>
> The db.py file parses in 90s. About 10x faster.
>
> The slow_query log is not showing any offending queries anymore. I will open a JIRA for this. Note that the with_hint stuff I put in is MySQL specific, I think. Anyone have any other ideas on how to effect the equivalent outcome?
>
> On Mon, Jan 23, 2017 at 3:08 PM, Chris Riccomini <criccom...@apache.org> wrote:
>
> > OK, it's using `state` instead of PRIMARY. Using PRIMARY with a hint, the query takes .47s. Without the hint, 10s. Going to try and patch.
> >
> > On Mon, Jan 23, 2017 at 2:57 PM, Chris Riccomini <criccom...@apache.org> wrote:
> >
> >> This inner query takes 10s:
> >>
> >> SELECT task_instance.task_id AS task_id, max(task_instance.execution_date) AS max_ti
> >> FROM task_instance
> >> WHERE task_instance.dag_id = 'dag1' AND task_instance.state = 'success'
> >> AND task_instance.task_id IN ('t1', 't2') GROUP BY task_instance.task_id
> >>
> >> Explain seems OK:
> >>
> >> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
> >> | id | select_type | table         | type | possible_keys                                      | key      | key_len | ref   | rows  | Extra                    |
> >> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
> >> |  1 | SIMPLE      | task_instance | ref  | PRIMARY,ti_dag_state,ti_pool,ti_state_lkp,ti_state | ti_state | 63      | const | 81898 | Using where; Using index |
> >> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
> >>
> >> A look at the query indicates that it's using the state field as its index lookup, which isn't good.
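
A minimal sketch of the dialect-scoped variant suggested at the top of the thread, assuming Airflow 1.8-era imports and a plain SQLAlchemy session; the dag and task ids are placeholders. Passing dialect_name='mysql' means the hint is only rendered when the query runs against MySQL, so other backends are unaffected:

    from sqlalchemy import func

    from airflow import settings
    from airflow.models import TaskInstance as TI
    from airflow.utils.state import State

    session = settings.Session()

    # Same aggregate as the patched scheduler query, but the index hint is
    # emitted only for the MySQL dialect; other databases ignore it.
    qry = (
        session.query(TI.task_id, func.max(TI.execution_date).label('max_ti'))
        .with_hint(TI, 'USE INDEX (PRIMARY)', dialect_name='mysql')
        .filter(TI.dag_id == 'dag1')            # placeholder dag id
        .filter(TI.state == State.SUCCESS)
        .filter(TI.task_id.in_(['t1', 't2']))   # placeholder task ids
        .group_by(TI.task_id)
    )
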
> >>
> >> On Mon, Jan 23, 2017 at 2:49 PM, Chris Riccomini <criccom...@apache.org> wrote:
> >>
> >>> It's this query:
> >>>
> >>> SELECT task_instance.task_id AS task_instance_task_id,
> >>> task_instance.dag_id AS task_instance_dag_id,
> >>> task_instance.execution_date AS task_instance_execution_date,
> >>> task_instance.start_date AS task_instance_start_date,
> >>> task_instance.end_date AS task_instance_end_date,
> >>> task_instance.duration AS task_instance_duration,
> >>> task_instance.state AS task_instance_state,
> >>> task_instance.try_number AS task_instance_try_number,
> >>> task_instance.hostname AS task_instance_hostname,
> >>> task_instance.unixname AS task_instance_unixname,
> >>> task_instance.job_id AS task_instance_job_id,
> >>> task_instance.pool AS task_instance_pool,
> >>> task_instance.queue AS task_instance_queue,
> >>> task_instance.priority_weight AS task_instance_priority_weight,
> >>> task_instance.operator AS task_instance_operator,
> >>> task_instance.queued_dttm AS task_instance_queued_dttm,
> >>> task_instance.pid AS task_instance_pid
> >>> FROM task_instance, (SELECT task_instance.task_id AS task_id,
> >>> max(task_instance.execution_date) AS max_ti
> >>> FROM task_instance
> >>> WHERE task_instance.dag_id = 'some_dag_id' AND task_instance.state = 'success'
> >>> AND task_instance.task_id IN ('t1', 't2', 't3', 't4', 't5', 't6')
> >>> GROUP BY task_instance.task_id) AS sq
> >>> WHERE task_instance.dag_id = 'some_dag_id' AND task_instance.task_id = sq.task_id
> >>> AND task_instance.execution_date = sq.max_ti
> >>>
> >>> Taking 10-20s
> >>>
> >>> On Mon, Jan 23, 2017 at 2:05 PM, Chris Riccomini <criccom...@apache.org> wrote:
> >>>
> >>>> Can confirm it's a slow query on the task_instance table. Still digging. Unfortunately, the query is truncated in my UI right now:
> >>>>
> >>>> SELECT task_instance.task_id AS task_instance_...
> >>>>
> >>>> On Mon, Jan 23, 2017 at 1:56 PM, Chris Riccomini <criccom...@apache.org> wrote:
> >>>>
> >>>>> Digging. Might be a bit.
> >>>>>
> >>>>> On Mon, Jan 23, 2017 at 1:32 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>
> >>>>>> Slow query log? Db load?
> >>>>>>
> >>>>>> B.
> >>>>>>
> >>>>>> Sent from my iPad
> >>>>>>
> >>>>>> > On 23 Jan 2017, at 21:59, Chris Riccomini <criccom...@apache.org> wrote the following:
> >>>>>> >
> >>>>>> > Note: 6.5 million TIs in the task_instance table.
> >>>>>> >
> >>>>>> > On Mon, Jan 23, 2017 at 12:58 PM, Chris Riccomini <criccom...@apache.org> wrote:
> >>>>>> >
> >>>>>> >> Hey Bolke,
> >>>>>> >>
> >>>>>> >> Re: system usage, it's pretty quiet, <5% CPU usage. Mem is almost all free as well.
> >>>>>> >>
> >>>>>> >> I am thinking that this is DB related, given that it's pausing when executing an update. Was looking at the update_state method in models.py, which logs right before the 15s pause.
> >>>>>> >>
> >>>>>> >> Cheers,
> >>>>>> >> Chris
> >>>>>> >>
> >>>>>> >> On Mon, Jan 23, 2017 at 12:51 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>> >>
> >>>>>> >>> Hi Chris,
> >>>>>> >>>
> >>>>>> >>> What is the load on your machine? (CPU/IO/MEM) It seems that the executor is faster in checking the state than the TaskInstance is able to exit itself.
> >>>>>> >>> No, I don't consider it normal, but it was sometimes reported (https://github.com/apache/incubator-airflow/pull/1821) though not really replicable as of yet.
> >>>>>> >>>
> >>>>>> >>> Parsing seems exceptionally slow, it might be worth looking at it with a debugger. A very faint guess might be that something in the multiprocessing part could do something with memory that is costly, but then we need to know more about what is running on the system. Some clue from the system metrics could be helpful here.
> >>>>>> >>>
> >>>>>> >>> Bolke
> >>>>>> >>>
> >>>>>> >>>> On 23 Jan 2017, at 21:34, Chris Riccomini <criccom...@apache.org> wrote:
> >>>>>> >>>>
> >>>>>> >>>> Also, seeing this in EVERY task that runs:
> >>>>>> >>>>
> >>>>>> >>>> [2017-01-23 20:26:13,777] {jobs.py:2112} WARNING - State of this instance has been externally set to queued. Taking the poison pill. So long.
> >>>>>> >>>> [2017-01-23 20:26:13,841] {jobs.py:2051} INFO - Task exited with return code 0
> >>>>>> >>>>
> >>>>>> >>>> All successful tasks are showing this at the end of their logs. Is this normal?
> >>>>>> >>>>
> >>>>>> >>>> On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini <criccom...@apache.org> wrote:
> >>>>>> >>>>
> >>>>>> >>>>> Hey all,
> >>>>>> >>>>>
> >>>>>> >>>>> I've upgraded on production. Things seem to be working so far (only been an hour), but I am seeing this in the scheduler logs:
> >>>>>> >>>>>
> >>>>>> >>>>> File Path                            PID    Runtime  Last Runtime  Last Run
> >>>>>> >>>>> -----------------------------------  -----  -------  ------------  -------------------
> >>>>>> >>>>> ...
> >>>>>> >>>>> /etc/airflow/dags/dags/elt/el/db.py  24793  43.41s   986.63s       2017-01-23T20:04:09
> >>>>>> >>>>> ...
> >>>>>> >>>>>
> >>>>>> >>>>> It seems to be taking more than 15 minutes to parse this DAG. Any idea what's causing this? Scheduler config:
> >>>>>> >>>>>
> >>>>>> >>>>> [scheduler]
> >>>>>> >>>>> job_heartbeat_sec = 5
> >>>>>> >>>>> scheduler_heartbeat_sec = 5
> >>>>>> >>>>> max_threads = 2
> >>>>>> >>>>> child_process_log_directory = /var/log/airflow/scheduler
> >>>>>> >>>>>
> >>>>>> >>>>> The db.py file itself doesn't interact with any outside systems, so I would have expected this to not take so long. It does, however, programmatically generate many DAGs within the single .py file.
> >>>>>> >>>>>
> >>>>>> >>>>> A snippet of the scheduler log is here:
> >>>>>> >>>>>
> >>>>>> >>>>> https://gist.github.com/criccomini/a2b2762763c8ba65fefcdd669e8ffd65
> >>>>>> >>>>>
> >>>>>> >>>>> Note how there are 10-15 second gaps occasionally. Any idea what's going on?
> >>>>>> >>>>>
> >>>>>> >>>>> Cheers,
> >>>>>> >>>>> Chris
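
For context on the "many DAGs in one .py file" pattern, a file like db.py typically looks roughly like the sketch below (all names here are illustrative, not taken from the real file). The whole module is re-executed on every scheduler parse, so anything slow inside the loop is multiplied by the number of generated DAGs:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    TABLES = ['t1', 't2', 't3']  # illustrative stand-in for the real table list

    for table in TABLES:
        dag_id = 'el_{}'.format(table)
        dag = DAG(dag_id,
                  start_date=datetime(2017, 1, 1),
                  schedule_interval='@daily')
        DummyOperator(task_id='load_{}'.format(table), dag=dag)
        # The scheduler only picks up DAG objects reachable at module level.
        globals()[dag_id] = dag
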
> >>>>>> >>>>>
> >>>>>> >>>>> On Sun, Jan 22, 2017 at 4:42 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>> >>>>>
> >>>>>> >>>>>> I created:
> >>>>>> >>>>>>
> >>>>>> >>>>>> - AIRFLOW-791: At start up all running dag_runs are being checked, but not fixed
> >>>>>> >>>>>> - AIRFLOW-790: DagRuns do not exist for certain tasks, but don't get fixed
> >>>>>> >>>>>> - AIRFLOW-788: Context unexpectedly added to hive conf
> >>>>>> >>>>>> - AIRFLOW-792: Allow fixing of schedule when wrong start_date / interval was specified
> >>>>>> >>>>>>
> >>>>>> >>>>>> I created AIRFLOW-789 to update UPDATING.md, it is also out as a PR.
> >>>>>> >>>>>>
> >>>>>> >>>>>> Please note that I don't consider any of these blockers for a release of 1.8.0; they can be fixed in 1.8.1, so we are still on track for an RC on Feb 2. However, if people are using a restarting scheduler (run_duration is set) and have a lot of running dag runs, they won't like AIRFLOW-791. So a workaround for this would be nice (we just updated dag_runs directly in the database to say 'finished' before a certain date, but we are also not using the run_duration).
> >>>>>> >>>>>>
> >>>>>> >>>>>> Bolke
> >>>>>> >>>>>>
> >>>>>> >>>>>>> On 20 Jan 2017, at 23:55, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>> >>>>>>>
> >>>>>> >>>>>>> Will do. And thanks.
> >>>>>> >>>>>>>
> >>>>>> >>>>>>> Adding another issue:
> >>>>>> >>>>>>>
> >>>>>> >>>>>>> * Some of our DAGs are not getting scheduled for some unknown reason. Need to investigate why.
> >>>>>> >>>>>>>
> >>>>>> >>>>>>> Related but not root cause:
> >>>>>> >>>>>>> * Logging is so chatty that it gets really hard to find the real issue
> >>>>>> >>>>>>>
> >>>>>> >>>>>>> Bolke.
> >>>>>> >>>>>>>
> >>>>>> >>>>>>>> On 20 Jan 2017, at 23:45, Dan Davydov <dan.davy...@airbnb.com.INVALID> wrote:
> >>>>>> >>>>>>>>
> >>>>>> >>>>>>>> I'd be happy to lend a hand fixing these issues and hopefully some others are too. Do you mind creating JIRAs for these since you have the full context? I have created a JIRA for (1) and have assigned it to myself:
> >>>>>> >>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-780
> >>>>>> >>>>>>>>
> >>>>>> >>>>>>>> On Fri, Jan 20, 2017 at 1:01 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>> >>>>>>>>
> >>>>>> >>>>>>>>> This is to report back on some of the (early) experiences we have with Airflow 1.8.0 (beta 1 at the moment):
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>>> 1. The UI does not show a faulty DAG, leading to confusion for developers.
> >>>>>> >>>>>>>>> When a faulty dag is placed in the dags folder the UI would report a parsing error. Now it doesn't, due to the separate parsing (but not reporting back errors).
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>>> 2. The hive hook sets 'airflow.ctx.dag_id' in hive.
> >>>>>> >>>>>>>>> We run in a secure environment which requires this variable to be whitelisted if it is modified (needs to be added to UPDATING.md).
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>>> 3. DagRuns do not exist for certain tasks, but don't get fixed.
> >>>>>> >>>>>>>>> The log gets flooded without a suggestion of what to do.
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>>> 4. At start up all running dag_runs are being checked; we seemed to have a lot of "left over" dag_runs (a couple of thousand).
> >>>>>> >>>>>>>>> - Checking was logged at INFO -> requires an fsync for every log message, making it very slow
> >>>>>> >>>>>>>>> - Checking would happen at every restart, but dag_runs' states were not being updated
> >>>>>> >>>>>>>>> - These dag_runs would never be marked anything other than running for some reason
> >>>>>> >>>>>>>>> -> Applied a workaround to update all dag_runs in SQL before a certain date to 'finished' (see the sketch at the end of this thread)
> >>>>>> >>>>>>>>> -> Need to investigate why dag_runs did not get marked finished/failed
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>>> 5. Our umask is set to 027
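
A hedged sketch of the dag_run cleanup workaround mentioned in item 4 and earlier in the thread; Bolke did it with a direct SQL UPDATE, and the ORM equivalent below is an assumption built on the 1.8-era models. The cutoff date and the choice of 'success' as the terminal state are placeholders; back up the dag_run table before running anything like this:

    from datetime import datetime

    from airflow import settings
    from airflow.models import DagRun
    from airflow.utils.state import State

    session = settings.Session()
    cutoff = datetime(2017, 1, 1)  # placeholder cutoff date

    # Mark stale 'running' dag_runs before the cutoff as finished so the
    # scheduler stops re-checking them at every restart.
    (session.query(DagRun)
            .filter(DagRun.state == State.RUNNING)
            .filter(DagRun.execution_date < cutoff)
            .update({DagRun.state: State.SUCCESS}, synchronize_session=False))
    session.commit()
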