[jira] [Created] (AIRFLOW-366) SchedulerJob gets locked up when child processes attempt to log to a single file

2016-07-26 Thread Greg Neiheisel (JIRA)
Greg Neiheisel created AIRFLOW-366:
--

 Summary: SchedulerJob gets locked up when child processes 
attempt to log to a single file
 Key: AIRFLOW-366
 URL: https://issues.apache.org/jira/browse/AIRFLOW-366
 Project: Apache Airflow
  Issue Type: Bug
  Components: scheduler
Reporter: Greg Neiheisel


After running the scheduler for a while (usually after 1 - 5 hours) it will 
eventually lock up, and nothing will get scheduled.

A `SchedulerJob` will end up getting stuck in the `while` loop around line 730 
of `airflow/jobs.py`.

From what I can tell, this is related to logging from within a forked process 
using Python's multiprocessing module.

The job will fork off some child processes to process the DAGs, but one (or 
more) will end up getting stuck and not terminating, resulting in the while loop 
getting hung up.  You can `kill -9 PID` the child process manually, and the 
loop will end and the scheduler will go on its way, until it happens again.

The issue is due to usage of the logging module from within the child 
processes.  From what I can tell, logging to a file from multiple processes is 
not supported by the multiprocessing module, but it is supported with Python 
multithreading, via a locking mechanism.

I think a child process can inherit a logger that is locked right when it is 
forked, resulting in the process completely locking up.

I went in and commented out all the logging statements that could possibly be 
hit by the child process (jobs.py, models.py), and was able to keep the 
scheduler alive.
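
For what it's worth, here is a minimal, hypothetical reproduction of the 
suspected mechanism (not Airflow code): a background thread logs to a file 
handler while the main process forks, so a child can inherit the handler's 
lock in the locked state and hang on its first log call.

{code:python}
# Hypothetical reproduction of the suspected deadlock; not taken from Airflow.
# A background thread logs constantly while children are forked. A child forked
# while the handler lock is held inherits that lock already acquired, so its
# first log call blocks forever (on platforms that fork, e.g. Linux).
import logging
import multiprocessing
import threading

logging.basicConfig(filename="shared.log", level=logging.INFO)
log = logging.getLogger(__name__)

def spam():
    while True:
        log.info("parent thread logging")

def child_work():
    log.info("hello from child")  # never returns if the lock was inherited locked

if __name__ == "__main__":
    t = threading.Thread(target=spam)
    t.daemon = True
    t.start()
    while True:
        p = multiprocessing.Process(target=child_work)
        p.start()
        p.join(5)
        if p.is_alive():
            print("child %s appears hung, as described above" % p.pid)
            break
{code}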





[jira] [Commented] (AIRFLOW-366) SchedulerJob gets locked up when child processes attempt to log to a single file

2016-07-27 Thread Greg Neiheisel (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396528#comment-15396528
 ] 

Greg Neiheisel commented on AIRFLOW-366:


Latest - 1.7.1.3

> SchedulerJob gets locked up when child processes attempt to log to 
> a single file
> ---
>
> Key: AIRFLOW-366
> URL: https://issues.apache.org/jira/browse/AIRFLOW-366
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Greg Neiheisel
>
> After running the scheduler for a while (usually after 1 - 5 hours) it will 
> eventually lock up, and nothing will get scheduled.
> A `SchedulerJob` will end up getting stuck in the `while` loop around line 
> 730 of `airflow/jobs.py`.
> From what I can tell, this is related to logging from within a forked process 
> using Python's multiprocessing module.
> The job will fork off some child processes to process the DAGs, but one (or 
> more) will end up getting stuck and not terminating, resulting in the while 
> loop getting hung up.  You can `kill -9 PID` the child process manually, and 
> the loop will end and the scheduler will go on its way, until it happens 
> again.
> The issue is due to usage of the logging module from within the child 
> processes.  From what I can tell, logging to a file from multiple processes 
> is not supported by the multiprocessing module, but it is supported with 
> Python multithreading, via a locking mechanism.
> I think a child process can inherit a logger that is locked right 
> when it is forked, resulting in the process completely locking up.
> I went in and commented out all the logging statements that could possibly be 
> hit by the child process (jobs.py, models.py), and was able to keep the 
> scheduler alive.





[jira] [Commented] (AIRFLOW-401) scheduler gets stuck without a trace

2016-09-20 Thread Greg Neiheisel (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508672#comment-15508672
 ] 

Greg Neiheisel commented on AIRFLOW-401:


Hey guys - just saw this issue and it may be related to an issue I added a 
while back.  What do you think?  
https://issues.apache.org/jira/plugins/servlet/mobile#issue/AIRFLOW-366

> scheduler gets stuck without a trace
> 
>
> Key: AIRFLOW-401
> URL: https://issues.apache.org/jira/browse/AIRFLOW-401
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor, scheduler
>Affects Versions: Airflow 1.7.1.3
>Reporter: Nadeem Ahmed Nazeer
>Assignee: Bolke de Bruin
>Priority: Minor
> Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, 
> scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU 
> usage of the scheduler service is at 100%. No jobs get submitted and everything 
> comes to a halt. It looks like it goes into some kind of infinite loop. 
> The only way I could make it run again is by manually restarting the 
> scheduler service. But again, after running some tasks it gets stuck. I've 
> tried with both the Celery and Local executors but the same issue occurs. I am 
> using the -n 3 parameter while starting the scheduler. 
> Scheduler configs:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed.





[jira] [Commented] (AIRFLOW-258) Always load plugin executors

2016-09-20 Thread Greg Neiheisel (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508702#comment-15508702
 ] 

Greg Neiheisel commented on AIRFLOW-258:


Hey guys - this seems like the closest existing issue to the one I'm seeing.

I have a custom executor and a custom operator as a plugin.  If I specify my 
custom executor in the config, Airflow takes a different code path than if I 
don't.  If I'm configured with a built-in executor like LocalExecutor, the 
plugin loads fine.  If I'm configured to use my custom executor, then it loads 
the plugins, which also loads the modules my custom operator is defined in, 
which is normal.  I'm subclassing BaseSensorOperator, and as it's importing 
modules it throws an error about not being able to import BaseOperator from 
airflow.models in airflow/operators/__init__.py.

Long story short, any ideas on how to tackle this?  I'd be glad to work on a PR 
with a bit of guidance.  Thanks!
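
For context, here is a rough sketch of the plugin layout I'm describing - the 
class and module names below are placeholders, not my actual plugin:

{code:python}
# Placeholder names only; illustrates an Airflow 1.x plugin that contributes
# both a custom executor and a custom sensor via AirflowPlugin.
from airflow.executors.base_executor import BaseExecutor
from airflow.operators.sensors import BaseSensorOperator
from airflow.plugins_manager import AirflowPlugin


class MyCustomExecutor(BaseExecutor):
    """Placeholder executor; real scheduling logic omitted."""


class MyCustomSensor(BaseSensorOperator):
    """Placeholder sensor; importing this module is what triggers the
    BaseOperator import error described above."""

    def poke(self, context):
        return True


class MyPlugin(AirflowPlugin):
    name = "my_plugin"
    executors = [MyCustomExecutor]
    operators = [MyCustomSensor]
{code}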

> Always load plugin executors
> 
>
> Key: AIRFLOW-258
> URL: https://issues.apache.org/jira/browse/AIRFLOW-258
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: plugins
>Reporter: Alexandr Nikitin
>
> At the moment we load plugin executors only when the specified `EXECUTOR` 
> field in the config isn't one of the built-in executors: 'LocalExecutor', 
> 'CeleryExecutor', 'SequentialExecutor', 'MesosExecutor'.
> If the default executor is one of the built-in executors, we won't load plugin 
> executors at all. This breaks if a plugin executor is used in a SubDAG.
> We should always load plugin executors.
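
(To illustrate the quoted behavior, a rough, self-contained sketch of the 
selection pattern - simplified names, not the actual Airflow source:)

{code:python}
# Simplified sketch of the dispatch described above; not the real Airflow code.
BUILTIN_EXECUTORS = {'LocalExecutor', 'CeleryExecutor',
                     'SequentialExecutor', 'MesosExecutor'}


def load_plugin_executors():
    # Stand-in for importing executors contributed via AirflowPlugin.
    return {'MyCustomExecutor': object()}


def get_executor(name):
    if name in BUILTIN_EXECUTORS:
        # Built-in path: plugin executors are never imported here, so a SubDAG
        # that later asks for a plugin executor by name cannot resolve it.
        return name
    # Fallback path: only now are plugin executors loaded.
    return load_plugin_executors()[name]
{code}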





[jira] [Commented] (AIRFLOW-366) SchedulerJob gets locked up when child processes attempt to log to a single file

2016-09-22 Thread Greg Neiheisel (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514960#comment-15514960
 ] 

Greg Neiheisel commented on AIRFLOW-366:


Using Python 2.7.

> SchedulerJob gets locked up when child processes attempt to log to 
> a single file
> ---
>
> Key: AIRFLOW-366
> URL: https://issues.apache.org/jira/browse/AIRFLOW-366
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Greg Neiheisel
>Assignee: Bolke de Bruin
>
> After running the scheduler for a while (usually after 1 - 5 hours) it will 
> eventually lock up, and nothing will get scheduled.
> A `SchedulerJob` will end up getting stuck in the `while` loop around line 
> 730 of `airflow/jobs.py`.
> From what I can tell, this is related to logging from within a forked process 
> using Python's multiprocessing module.
> The job will fork off some child processes to process the DAGs, but one (or 
> more) will end up getting stuck and not terminating, resulting in the while 
> loop getting hung up.  You can `kill -9 PID` the child process manually, and 
> the loop will end and the scheduler will go on its way, until it happens 
> again.
> The issue is due to usage of the logging module from within the child 
> processes.  From what I can tell, logging to a file from multiple processes 
> is not supported by the multiprocessing module, but it is supported with 
> Python multithreading, via a locking mechanism.
> I think a child process can inherit a logger that is locked right 
> when it is forked, resulting in the process completely locking up.
> I went in and commented out all the logging statements that could possibly be 
> hit by the child process (jobs.py, models.py), and was able to keep the 
> scheduler alive.





[jira] [Created] (AIRFLOW-613) Add Astronomer as Airflow user

2016-11-02 Thread Greg Neiheisel (JIRA)
Greg Neiheisel created AIRFLOW-613:
--

 Summary: Add Astronomer as Airflow user
 Key: AIRFLOW-613
 URL: https://issues.apache.org/jira/browse/AIRFLOW-613
 Project: Apache Airflow
  Issue Type: Task
Reporter: Greg Neiheisel
Priority: Trivial








[jira] [Updated] (AIRFLOW-613) Add Astronomer as Airflow user

2016-11-02 Thread Greg Neiheisel (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Neiheisel updated AIRFLOW-613:
---
External issue URL: https://github.com/apache/incubator-airflow/pull/1864

> Add Astronomer as Airflow user
> --
>
> Key: AIRFLOW-613
> URL: https://issues.apache.org/jira/browse/AIRFLOW-613
> Project: Apache Airflow
>  Issue Type: Task
>Reporter: Greg Neiheisel
>Priority: Trivial
>






[jira] [Created] (AIRFLOW-3177) Change scheduler_heartbeat metric from gauge to counter

2018-10-09 Thread Greg Neiheisel (JIRA)
Greg Neiheisel created AIRFLOW-3177:
---

 Summary: Change scheduler_heartbeat metric from gauge to counter
 Key: AIRFLOW-3177
 URL: https://issues.apache.org/jira/browse/AIRFLOW-3177
 Project: Apache Airflow
  Issue Type: Improvement
  Components: scheduler
Reporter: Greg Neiheisel
Assignee: Greg Neiheisel


Currently, the scheduler_heartbeat metric exposed with the statsd integration 
is a gauge. I'm proposing to change the gauge to a counter for better 
integration with Prometheus via the 
[statsd_exporter|https://github.com/prometheus/statsd_exporter].

Rather than pointing Airflow at an actual statsd server, you can point it at 
this exporter, which will accumulate the metrics and expose them to be scraped 
by Prometheus at /metrics. The problem is that once this value is set when the 
scheduler runs its first loop, it will always be exposed to Prometheus as 1. 
The scheduler can crash or be turned off, and the statsd exporter will keep 
reporting a 1 until it is restarted and rebuilds its internal state.

By turning this metric into a counter, we can detect an issue with the 
scheduler by graphing and alerting on its rate. If the rate of change of the 
counter drops below what it should be (determined by the 
scheduler_heartbeat_sec setting), we can fire an alert.
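
A minimal sketch of the emission change, using the statsd client interface the 
integration wraps (the prefix and call sites below are illustrative 
assumptions, not the actual patch):

{code:python}
# Illustrative only; the real change lives in Airflow's scheduler loop.
from statsd import StatsClient

stats = StatsClient(host='localhost', port=8125, prefix='airflow')

# Current behavior described above: a gauge pinned to 1 after the first
# scheduler loop, so Prometheus keeps seeing 1 even if the scheduler dies.
stats.gauge('scheduler_heartbeat', 1)

# Proposed behavior: increment a counter once per scheduler loop, so a stalled
# scheduler shows up as a zero rate on the Prometheus side,
# e.g. rate(airflow_scheduler_heartbeat[5m]) == 0 (the exact metric name
# depends on the statsd_exporter mapping).
stats.incr('scheduler_heartbeat', 1)
{code}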

This should be helpful for adoption in Kubernetes environments where Prometheus 
is pretty much the standard.





[jira] [Commented] (AIRFLOW-1899) Airflow Kubernetes Executor [basic]

2018-04-10 Thread Greg Neiheisel (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432659#comment-16432659
 ] 

Greg Neiheisel commented on AIRFLOW-1899:
-

Hey [~dimberman], glad to see this issue here. My company is currently using 
the Mesos executor, as well as running the Celery executor on Kubernetes. While 
digging into what it would look like to have a native Kubernetes executor, we 
found this. What's the latest on this issue? Have you made much progress down 
this path yet? Any interest in collaboration on it?

> Airflow Kubernetes Executor [basic]
> ---
>
> Key: AIRFLOW-1899
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1899
> Project: Apache Airflow
>  Issue Type: Sub-task
>  Components: contrib
>Reporter: Daniel Imberman
>Assignee: Daniel Imberman
>Priority: Major
> Fix For: Airflow 2.0
>
>
> The basic Kubernetes Executor PR should launch basic pods using the same pod 
> launcher as the kubernetes operator. This PR should not concern itself with a 
> lot of the extra features, which can be added in future PRs. A successful 
> PR for this issue should be able to launch a pod, watch it using the watcher 
> API, and track failures/successes. It should also include basic testing for 
> the executor using [~grantnicholas]'s testing library. cc: [~benjigoldberg] 
> [~bolke]
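
(For illustration, a hedged sketch of that basic flow with the official 
kubernetes Python client - launch a pod, follow it with the watch API, and 
record success or failure. The pod spec, image, and command below are 
placeholders, not the executor's actual template:)

{code:python}
# Placeholder pod spec and names; shows launch + watch + success/failure only.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="airflow-task-example", namespace="default"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="base",
            image="some-airflow-image",  # placeholder image
            command=["airflow", "run", "example_dag", "example_task", "2018-04-10"],
        )],
    ),
)
v1.create_namespaced_pod(namespace="default", body=pod)

w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace="default"):
    obj = event["object"]
    if obj.metadata.name == "airflow-task-example" and obj.status.phase in ("Succeeded", "Failed"):
        print(obj.metadata.name, obj.status.phase)  # executor would record the result here
        w.stop()
{code}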





[jira] [Issue Comment Deleted] (AIRFLOW-1899) Airflow Kubernetes Executor [basic]

2018-04-10 Thread Greg Neiheisel (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Neiheisel updated AIRFLOW-1899:

Comment: was deleted

(was: Hey [~dimberman], glad to see this issue here. My company is currently 
using the Mesos executor, as well as running the Celery executor on Kubernetes. 
While digging into what it would look like to have a native Kubernetes 
executor, we found this. What's the latest on this issue? Have you made much 
progress down this path yet? Any interest in collaboration on it?)

> Airflow Kubernetes Executor [basic]
> ---
>
> Key: AIRFLOW-1899
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1899
> Project: Apache Airflow
>  Issue Type: Sub-task
>  Components: contrib
>Reporter: Daniel Imberman
>Assignee: Daniel Imberman
>Priority: Major
> Fix For: Airflow 2.0
>
>
> The basic Kubernetes Executor PR should launch basic pods using the same pod 
> launcher as the kubernetes operator. This PR should not concern itself with a 
> lot of the extra features, which can be added in future PRs. A successful 
> PR for this issue should be able to launch a pod, watch it using the watcher 
> API, and track failures/successes. It should also include basic testing for 
> the executor using [~grantnicholas]'s testing library. cc: [~benjigoldberg] 
> [~bolke]


