This inner query takes 10s:

SELECT task_instance.task_id AS task_id, max(task_instance.execution_date) AS max_ti
FROM task_instance
WHERE task_instance.dag_id = 'dag1' AND task_instance.state = 'success'
  AND task_instance.task_id IN ('t1', 't2')
GROUP BY task_instance.task_id
Explain seems OK:

+----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
| id | select_type | table         | type | possible_keys                                      | key      | key_len | ref   | rows  | Extra                    |
+----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
|  1 | SIMPLE      | task_instance | ref  | PRIMARY,ti_dag_state,ti_pool,ti_state_lkp,ti_state | ti_state | 63      | const | 81898 | Using where; Using index |
+----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+

A look at the query indicates that it's using the state field as its index lookup, which isn't good.
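One possible remedy, sketched here only as an illustration (the index name and the
column order are assumptions, not something proposed in this thread): a composite
index whose leading columns match the equality predicates and whose last column
feeds the MAX(), so MySQL can answer the aggregate from the index instead of
scanning ~82k rows via ti_state:

    -- illustrative only; index name and column order are assumptions
    CREATE INDEX ti_dag_state_task_date
        ON task_instance (dag_id, state, task_id, execution_date);

Re-running EXPLAIN afterwards would show whether the optimizer actually picks the
new key and produces a much smaller rows estimate.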
On Mon, Jan 23, 2017 at 2:49 PM, Chris Riccomini <criccom...@apache.org> wrote:

> It's this query:
>
> SELECT task_instance.task_id AS task_instance_task_id,
> task_instance.dag_id AS task_instance_dag_id, task_instance.execution_date
> AS task_instance_execution_date, task_instance.start_date AS
> task_instance_start_date, task_instance.end_date AS task_instance_end_date,
> task_instance.duration AS task_instance_duration, task_instance.state AS
> task_instance_state, task_instance.try_number AS task_instance_try_number,
> task_instance.hostname AS task_instance_hostname, task_instance.unixname AS
> task_instance_unixname, task_instance.job_id AS task_instance_job_id,
> task_instance.pool AS task_instance_pool, task_instance.queue AS
> task_instance_queue, task_instance.priority_weight AS
> task_instance_priority_weight, task_instance.operator AS
> task_instance_operator, task_instance.queued_dttm AS
> task_instance_queued_dttm, task_instance.pid AS task_instance_pid
> FROM task_instance, (SELECT task_instance.task_id AS task_id,
> max(task_instance.execution_date) AS max_ti
> FROM task_instance
> WHERE task_instance.dag_id = 'some_dag_id' AND task_instance.state =
> 'success' AND task_instance.task_id IN ('t1', 't2', 't3', 't4', 't5', 't6')
> GROUP BY task_instance.task_id) AS sq
> WHERE task_instance.dag_id = 'some_dag_id' AND task_instance.task_id =
> sq.task_id AND task_instance.execution_date = sq.max_ti
>
> Taking 10-20s
>
> On Mon, Jan 23, 2017 at 2:05 PM, Chris Riccomini <criccom...@apache.org> wrote:
>
>> Can confirm it's a slow query on the task_instance table. Still digging.
>> Unfortunately, the query is truncated in my UI right now:
>>
>> SELECT task_instance.task_id AS task_instance_...
>>
>> On Mon, Jan 23, 2017 at 1:56 PM, Chris Riccomini <criccom...@apache.org> wrote:
>>
>>> Digging. Might be a bit.
>>>
>>> On Mon, Jan 23, 2017 at 1:32 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>
>>>> Slow query log? Db load?
>>>>
>>>> B.
>>>>
>>>> Sent from my iPad
>>>>
>>>> > On 23 Jan 2017, at 21:59, Chris Riccomini <criccom...@apache.org> wrote:
>>>> >
>>>> > Note: 6.5 million TIs in the task_instance table.
>>>> >
>>>> > On Mon, Jan 23, 2017 at 12:58 PM, Chris Riccomini <criccom...@apache.org> wrote:
>>>> >
>>>> >> Hey Bolke,
>>>> >>
>>>> >> Re: system usage, it's pretty quiet, <5% CPU usage. Mem is almost all
>>>> >> free as well.
>>>> >>
>>>> >> I am thinking that this is DB related, given that it's pausing when
>>>> >> executing an update. Was looking at the update_state method in
>>>> >> models.py, which logs right before the 15s pause.
>>>> >>
>>>> >> Cheers,
>>>> >> Chris
>>>> >>
>>>> >> On Mon, Jan 23, 2017 at 12:51 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>> >>
>>>> >>> Hi Chris,
>>>> >>>
>>>> >>> What is the load on your machine? (CPU/IO/MEM) It seems that the
>>>> >>> executor is faster in checking the state than the TaskInstance is
>>>> >>> able to exit itself. No, I don't consider it normal, but it has been
>>>> >>> reported before (https://github.com/apache/incubator-airflow/pull/1821),
>>>> >>> though not really replicable as of yet.
>>>> >>>
>>>> >>> Parsing seems exceptionally slow; it might be worth looking at it with
>>>> >>> a debugger. A very faint guess is that something in the multiprocessing
>>>> >>> part is doing something with memory that is costly, but then we need
>>>> >>> to know more about what is running on the system. Some view of system
>>>> >>> metrics could be helpful here.
>>>> >>>
>>>> >>> Bolke
>>>> >>>
>>>> >>>> On 23 Jan 2017, at 21:34, Chris Riccomini <criccom...@apache.org> wrote:
>>>> >>>>
>>>> >>>> Also, seeing this in EVERY task that runs:
>>>> >>>>
>>>> >>>> [2017-01-23 20:26:13,777] {jobs.py:2112} WARNING - State of this
>>>> >>>> instance has been externally set to queued. Taking the poison pill.
>>>> >>>> So long.
>>>> >>>> [2017-01-23 20:26:13,841] {jobs.py:2051} INFO - Task exited with
>>>> >>>> return code 0
>>>> >>>>
>>>> >>>> All successful tasks are showing this at the end of their logs. Is
>>>> >>>> this normal?
>>>> >>>>
>>>> >>>> On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini <criccom...@apache.org> wrote:
>>>> >>>>
>>>> >>>>> Hey all,
>>>> >>>>>
>>>> >>>>> I've upgraded on production. Things seem to be working so far (only
>>>> >>>>> been an hour), but I am seeing this in the scheduler logs:
>>>> >>>>>
>>>> >>>>> File Path                            PID    Runtime  Last Runtime  Last Run
>>>> >>>>> -----------------------------------  -----  -------  ------------  -------------------
>>>> >>>>> ...
>>>> >>>>> /etc/airflow/dags/dags/elt/el/db.py  24793  43.41s   986.63s       2017-01-23T20:04:09
>>>> >>>>> ...
>>>> >>>>>
>>>> >>>>> It seems to be taking more than 15 minutes to parse this DAG. Any
>>>> >>>>> idea what's causing this? Scheduler config:
>>>> >>>>>
>>>> >>>>> [scheduler]
>>>> >>>>> job_heartbeat_sec = 5
>>>> >>>>> scheduler_heartbeat_sec = 5
>>>> >>>>> max_threads = 2
>>>> >>>>> child_process_log_directory = /var/log/airflow/scheduler
>>>> >>>>>
>>>> >>>>> The db.py file, itself, doesn't interact with any outside systems,
>>>> >>>>> so I would have expected this to not take so long. It does, however,
>>>> >>>>> programmatically generate many DAGs within the single .py file.
>>>> >>>>>
>>>> >>>>> A snippet of the scheduler log is here:
>>>> >>>>>
>>>> >>>>> https://gist.github.com/criccomini/a2b2762763c8ba65fefcdd669e8ffd65
>>>> >>>>>
>>>> >>>>> Note how there are 10-15 second gaps occasionally. Any idea what's
>>>> >>>>> going on?
>>>> >>>>>
>>>> >>>>> Cheers,
>>>> >>>>> Chris
>>>> >>>>>
>>>> >>>>> On Sun, Jan 22, 2017 at 4:42 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>>> I created:
>>>> >>>>>>
>>>> >>>>>> - AIRFLOW-791: At start up all running dag_runs are being checked,
>>>> >>>>>> but not fixed
>>>> >>>>>> - AIRFLOW-790: DagRuns do not exist for certain tasks, but don't get fixed
>>>> >>>>>> - AIRFLOW-788: Context unexpectedly added to hive conf
>>>> >>>>>> - AIRFLOW-792: Allow fixing of schedule when wrong start_date /
>>>> >>>>>> interval was specified
>>>> >>>>>>
>>>> >>>>>> I created AIRFLOW-789 to update UPDATING.md; it is also out as a PR.
>>>> >>>>>>
>>>> >>>>>> Please note that I don't consider any of these blockers for a release
>>>> >>>>>> of 1.8.0 and they can be fixed in 1.8.1 - so we are still on track for
>>>> >>>>>> an RC on Feb 2. However, if people are using a restarting scheduler
>>>> >>>>>> (run_duration is set) and have a lot of running dag runs, they won't
>>>> >>>>>> like AIRFLOW-791. So a workaround for this would be nice (we just
>>>> >>>>>> updated dag_runs directly in the database to say 'finished' before a
>>>> >>>>>> certain date, but we are also not using the run_duration).
>>>> >>>>>>
>>>> >>>>>> Bolke
>>>> >>>>>>
>>>> >>>>>>> On 20 Jan 2017, at 23:55, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Will do. And thanks.
>>>> >>>>>>>
>>>> >>>>>>> Adding another issue:
>>>> >>>>>>>
>>>> >>>>>>> * Some of our DAGs are not getting scheduled for some unknown reason.
>>>> >>>>>>> Need to investigate why.
>>>> >>>>>>>
>>>> >>>>>>> Related but not root cause:
>>>> >>>>>>> * Logging is so chatty that it gets really hard to find the real issue
>>>> >>>>>>>
>>>> >>>>>>> Bolke.
>>>> >>>>>>>
>>>> >>>>>>>> On 20 Jan 2017, at 23:45, Dan Davydov <dan.davy...@airbnb.com.INVALID> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> I'd be happy to lend a hand fixing these issues and hopefully some
>>>> >>>>>>>> others are too. Do you mind creating JIRAs for these since you have
>>>> >>>>>>>> the full context? I have created a JIRA for (1) and have assigned it
>>>> >>>>>>>> to myself: https://issues.apache.org/jira/browse/AIRFLOW-780
>>>> >>>>>>>>
>>>> >>>>>>>> On Fri, Jan 20, 2017 at 1:01 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>> This is to report back on some of the (early) experiences we have
>>>> >>>>>>>>> with Airflow 1.8.0 (beta 1 at the moment):
>>>> >>>>>>>>>
>>>> >>>>>>>>> 1. The UI does not show a faulty DAG, leading to confusion for
>>>> >>>>>>>>> developers. When a faulty dag is placed in the dags folder, the UI
>>>> >>>>>>>>> used to report a parsing error. Now it doesn't, due to the separate
>>>> >>>>>>>>> parsing (which does not report errors back).
>>>> >>>>>>>>>
>>>> >>>>>>>>> 2. The hive hook sets 'airflow.ctx.dag_id' in hive.
>>>> >>>>>>>>> We run in a secure environment which requires this variable to be
>>>> >>>>>>>>> whitelisted if it is modified (needs to be added to UPDATING.md).
>>>> >>>>>>>>>
>>>> >>>>>>>>> 3. DagRuns do not exist for certain tasks, but don't get fixed.
>>>> >>>>>>>>> The log gets flooded without a suggestion of what to do.
>>>> >>>>>>>>>
>>>> >>>>>>>>> 4. At start up all running dag_runs are being checked; we seemed to
>>>> >>>>>>>>> have a lot of "left over" dag_runs (a couple of thousand).
>>>> >>>>>>>>> - Checking was logged to INFO -> requires an fsync for every log
>>>> >>>>>>>>> message, making it very slow
>>>> >>>>>>>>> - Checking would happen at every restart, but dag_runs' states were
>>>> >>>>>>>>> not being updated
>>>> >>>>>>>>> - These dag_runs would never be marked anything else than running
>>>> >>>>>>>>> for some reason
>>>> >>>>>>>>> -> Applied workaround to update all dag_runs in SQL before a certain
>>>> >>>>>>>>> date to -> finished
>>>> >>>>>>>>> -> need to investigate why dag_runs did not get marked "finished/failed"
>>>> >>>>>>>>>
>>>> >>>>>>>>> 5. Our umask is set to 027
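
For the AIRFLOW-791 workaround described above (closing out stale dag_runs directly
in the metadata database), a rough sketch is shown below. The dag_run table and its
state/execution_date columns match Airflow's schema, but the cutoff date and target
state are assumptions — Airflow stores DagRun state as 'running'/'success'/'failed'
rather than a literal 'finished', and the thread does not give the exact statement
that was run:

    -- illustrative only: mark stale "running" DagRuns older than a cutoff as done
    UPDATE dag_run
       SET state = 'success'
     WHERE state = 'running'
       AND execution_date < '2017-01-01 00:00:00';

Anyone applying something like this would want to back up the metadata DB and stop
the scheduler while the update runs.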