Re: Experiences with 1.8.0

2017-01-25 Thread Chris Riccomini
Hey all, I have sent in a PR and JIRA here: https://github.com/apache/incubator-airflow/pull/2021 https://issues.apache.org/jira/browse/AIRFLOW-807 Please have a look. EDIT: I see Arthur just did haha. Cheers, Chris On Tue, Jan 24, 2017 at 9:41 PM, Chris Riccomini wrote: > @Max, ran both an

Re: Experiences with 1.8.0

2017-01-24 Thread Chris Riccomini
@Max, ran both analyze/optimize. Didn't help. Explain still tries to use `state` index. :( On Tue, Jan 24, 2017 at 5:54 AM, Bolke de Bruin wrote: > I have looked into the issue and it is harmless. What happens is that a > TaskInstance writes “success” to the database and the monitoring catches >

Re: Experiences with 1.8.0

2017-01-24 Thread Bolke de Bruin
I have looked into the issue and it is harmless. What happens is that a TaskInstance writes “success” to the database and the monitoring catches this change before the process is exited. It reports wrongly (ie. queued) as self.task_instance is not updated. I have opened AIRFLOW-798, but will not

Re: Experiences with 1.8.0

2017-01-24 Thread Bolke de Bruin
Hey Chris, Could you dive into the below a bit more? I don’t like that the LocalTask job is saying the external state is set to queued, although it might just be that the monitoring just does not take into account the queued state which it should (still I am wondering why it happens after the t

Re: Experiences with 1.8.0

2017-01-23 Thread Maxime Beauchemin
Can you rebuild your indexes and recompute the table's stats and see if the optimizer is still off tracks? Assuming InnoDB and from memory: OPTIMIZE TABLE task_instances; ANALYZE TABLE task_instances; Max On Mon, Jan 23, 2017 at 3:45 PM, Arthur Wiedmer wrote: > Maybe we can start with > " .wi

Re: Experiences with 1.8.0

2017-01-23 Thread Arthur Wiedmer
Maybe we can start with " .with_hint(TI, 'USE INDEX (PRIMARY)', dialect_name='mysql')" and see if other databases exhibit the same query plan issue ? Best, Arthur On Mon, Jan 23, 2017 at 3:27 PM, Chris Riccomini wrote: > With this patch: > > $ git diff > diff --git a/airflow/jobs.py b/airflow/

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
With this patch: $ git diff diff --git a/airflow/jobs.py b/airflow/jobs.py index f1de333..9d08e75 100644 --- a/airflow/jobs.py +++ b/airflow/jobs.py @@ -544,6 +544,7 @@ class SchedulerJob(BaseJob): .query( TI.task_id, func.max(TI.execution_date).label

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
OK, it's using `state` instead of PRIMARY. Using PRIMARY with a hint, query takes .47s. Without hint, 10s. Going to try and patch. On Mon, Jan 23, 2017 at 2:57 PM, Chris Riccomini wrote: > This inner query takes 10s: > > SELECT task_instance.task_id AS task_id, max(task_instance.execution_date)

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
This inner query takes 10s: SELECT task_instance.task_id AS task_id, max(task_instance.execution_date) AS max_ti FROM task_instance WHERE task_instance.dag_id = 'dag1' AND task_instance.state = 'success' AND task_instance.task_id IN ('t1', 't2') GROUP BY task_instance.task_id Explain seems OK: +

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
It's this query: SELECT task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.execution_date AS task_instance_execution_date, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Can confirm it's a slow query on task_instance table. Still digging. Unfortunately, the query is truncated in my UI right now: SELECT task_instance.task_id AS task_instance_... On Mon, Jan 23, 2017 at 1:56 PM, Chris Riccomini wrote: > Digging. Might be a bit. > > On Mon, Jan 23, 2017 at 1:32 PM

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Digging. Might be a bit. On Mon, Jan 23, 2017 at 1:32 PM, Bolke de Bruin wrote: > Slow query log? Db load? > > B. > > Verstuurd vanaf mijn iPad > > > Op 23 jan. 2017 om 21:59 heeft Chris Riccomini > het volgende geschreven: > > > > Note: 6.5 million TIs in the task_instance table. > > > > On Mo

Re: Experiences with 1.8.0

2017-01-23 Thread Bolke de Bruin
Slow query log? Db load? B. Verstuurd vanaf mijn iPad > Op 23 jan. 2017 om 21:59 heeft Chris Riccomini het > volgende geschreven: > > Note: 6.5 million TIs in the task_instance table. > > On Mon, Jan 23, 2017 at 12:58 PM, Chris Riccomini > wrote: > >> Hey Bolke, >> >> Re: system usage, it

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Note: 6.5 million TIs in the task_instance table. On Mon, Jan 23, 2017 at 12:58 PM, Chris Riccomini wrote: > Hey Bolke, > > Re: system usage, it's pretty quiet <5% CPU usage. Mem is almost all free > as well. > > I am thinking that this is DB related, given that it's pausing when > executing an

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Hey Bolke, Re: system usage, it's pretty quiet <5% CPU usage. Mem is almost all free as well. I am thinking that this is DB related, given that it's pausing when executing an update. Was looking at the update_state method in models.py, which logs right before the 15s pause. Cheers, Chris On Mon

Re: Experiences with 1.8.0

2017-01-23 Thread Bolke de Bruin
Hi Chris, What is the load on your machine? (CPU/IO/MEM) It seems that the executor is faster in checking the state than the TaskInstance is able to exit itself. No, I don’t consider it normal, but it was sometimes reported (https://github.com/apache/incubator-airflow/pull/1821) though not real

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Oops, sorry again. I misread what you said (did I mention lack of sleep?) The parsing of the db.py file takes 15 MINUTES (900+ seconds), as shown in the `last runtime`. The log lines WITHIN the db.py.log file (from the scheduler) show 15 second gaps. On Mon, Jan 23, 2017 at 12:35 PM, Chris Riccomi

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Oops, yes, 15 seconds, sorry. Operating without much sleep. :P On Mon, Jan 23, 2017 at 12:35 PM, Arthur Wiedmer wrote: > Chris, > > Just double checking, you mean more than 15 seconds not 15 minutes, right? > > Best, > Arthur > > On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini > wrote: > > >

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Also, seeing this in EVERY task that runs: [2017-01-23 20:26:13,777] {jobs.py:2112} WARNING - State of this instance has been externally set to queued. Taking the poison pill. So long. [2017-01-23 20:26:13,841] {jobs.py:2051} INFO - Task exited with return code 0 All successful tasks are showing

Re: Experiences with 1.8.0

2017-01-23 Thread Arthur Wiedmer
Chris, Just double checking, you mean more than 15 seconds not 15 minutes, right? Best, Arthur On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini wrote: > Hey all, > > I've upgraded on production. Things seem to be working so far (only been an > hour), but I am seeing this in the scheduler logs

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Hey all, I've upgraded on production. Things seem to be working so far (only been an hour), but I am seeing this in the scheduler logs: File Path PID RuntimeLast RuntimeLast Run --

Re: Experiences with 1.8.0

2017-01-22 Thread Bolke de Bruin
I created: - AIRFLOW-791: At start up all running dag_runs are being checked, but not fixed - AIRFLOW-790: DagRuns do not exist for certain tasks, but don’t get fixed - AIRFLOW-788: Context unexpectedly added to hive conf - AIRFLOW-792: Allow fixing of schedule when wrong start_date / interval was

Re: Experiences with 1.8.0

2017-01-20 Thread Alex Van Boxel
Bolke, I will tackle logging because I started integrating with Google StackDriver logging. I have a separate branch, but this will be for after 1.8. It will be configurable and context aware and everyone will benefit (not only stackdriver logging). On Fri, Jan 20, 2017 at 11:55 PM Bolke de Bruin

Re: Experiences with 1.8.0

2017-01-20 Thread Bolke de Bruin
Will do. And thanks. Adding another issue: * Some of our DAGs are not getting scheduled for some unknown reason. Need to investigate why. Related but not root cause: * Logging is so chatty that it gets really hard to find the real issue Bolke. > On 20 Jan 2017, at 23:45, Dan Davydov wrote: >

Re: Experiences with 1.8.0

2017-01-20 Thread Dan Davydov
I'd be happy to lend a hand fixing these issues and hopefully some others are too. Do you mind creating jiras for these since you have the full context? I have created a JIRA for (1) and have assigned it to myself: https://issues.apache.org/jira/browse/AIRFLOW-780 On Fri, Jan 20, 2017 at 1:01 AM,

Experiences with 1.8.0 (updated)

2017-01-20 Thread Bolke de Bruin
— continued accidentally pressed send — This is to report back on some of the (early) experiences we have with Airflow 1.8.0 (beta 1 at the moment): 1. The UI does not show faulty DAG, leading to confusion for developers. When a faulty dag is placed in the dags folder the UI would report a pars

Experiences with 1.8.0

2017-01-20 Thread Bolke de Bruin
This is to report back on some of the (early) experiences we have with Airflow 1.8.0 (beta 1 at the moment): 1. The UI does not show faulty DAG, leading to confusion for developers. When a faulty dag is placed in the dags folder the UI would report a parsing error. Now it doesn’t due to the sep