[
https://issues.apache.org/jira/browse/AIRFLOW-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518211#comment-16518211
]
Ksenia Stroykova commented on AIRFLOW-972:
--
This is a joblib issue.
With joblib 0.11 I can reproduce it with the following code:
{code}
from joblib import Parallel, delayed
import multiprocessing
import signal

def f(i):
    return multiprocessing.current_process().pid

def f3():
    print('f3')
    for r in Parallel(n_jobs=5, verbose=1000)(delayed(f)(i) for i in range(20)):
        print(r)

def signal_handler(signum, frame):
    raise ValueError("received SIGTERM signal")

signal.signal(signal.SIGTERM, signal_handler)
{code}
And when I call f3(), I sometimes (!) see an exception:
{code}
In [3]: f3()
f3
[Parallel(n_jobs=5)]: Done 1 tasks | elapsed:0.0s
[Parallel(n_jobs=5)]: Batch computation too fast (0.0011s.) Setting batch_size=374.
[Parallel(n_jobs=5)]: Done 2 tasks | elapsed:0.0s
[Parallel(n_jobs=5)]: Done 3 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 4 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 5 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 6 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 7 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 8 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 18 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 20 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 20 out of 20 | elapsed:0.0s finished
5260
5261
5263
5264
5262
5260
5261
5262
5261
5263
5260
5260
5260
5260
5260
5260
5260
5260
5260
5260
In [4]: f3()
f3
[Parallel(n_jobs=5)]: Done 1 tasks | elapsed:0.0s
[Parallel(n_jobs=5)]: Batch computation too fast (0.0013s.) Setting batch_size=304.
[Parallel(n_jobs=5)]: Done 2 tasks | elapsed:0.0s
[Parallel(n_jobs=5)]: Done 3 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 4 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 5 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 6 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 7 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 8 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 9 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 20 out of 20 | elapsed:0.0s remaining:0.0s
[Parallel(n_jobs=5)]: Done 20 out of 20 | elapsed:0.0s finished
Process ForkPoolWorker-15:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/opt/conda/lib/python3.6/site-packages/joblib/pool.py", line 364, in get
rrelease()
File "", line 13, in signal_handler
raise ValueError("received SIGTERM signal")
ValueError: received SIGTERM signal
{code}
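The traceback above shows the handler firing inside a pool worker, not in the main process: fork-based workers inherit the parent's signal table, so the SIGTERM handler installed in the parent also runs in every worker. A minimal sketch of that inheritance (Unix-only, since it relies on the fork start method; the names `demo`, `_child`, and `_handler` are illustrative, not from joblib or Airflow):

```python
import multiprocessing
import os
import signal
import time

def _handler(signum, frame):
    raise ValueError("received SIGTERM signal")

def _child(queue):
    # The forked child inherits the parent's raising SIGTERM handler,
    # so a SIGTERM delivered here raises ValueError in the child.
    try:
        time.sleep(10)
    except ValueError as exc:
        queue.put(str(exc))

def demo():
    signal.signal(signal.SIGTERM, _handler)
    ctx = multiprocessing.get_context("fork")  # Unix-only start method
    queue = ctx.Queue()
    proc = ctx.Process(target=_child, args=(queue,))
    proc.start()
    time.sleep(0.5)                    # let the child reach time.sleep()
    os.kill(proc.pid, signal.SIGTERM)  # signal the child only, not the parent
    proc.join()
    return queue.get(timeout=5)

if __name__ == "__main__":
    print(demo())
```

This is the same mechanism by which Airflow's SIGTERM, sent to the task's process group, ends up raising inside joblib's worker loop.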
Airflow catches the exception and kills the task. If the task is fast, it finishes successfully; if it is slow, Airflow tries to kill it and the task eventually dies.
I also checked the master version of joblib and the error seems to be fixed there. I will verify it with Airflow.
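Until a fixed joblib release is available, one common workaround is to reset the inherited SIGTERM handler in each worker at startup. `Parallel` in joblib 0.11 does not expose an initializer hook, so this sketch uses a plain `multiprocessing.Pool`; the names `_reset_sigterm`, `square`, and `run` are illustrative:

```python
import multiprocessing
import signal

def _reset_sigterm():
    # Runs once in each worker at startup: drop the handler inherited
    # from the parent back to the default disposition, so a stray
    # SIGTERM terminates the worker instead of raising ValueError
    # inside the task-queue machinery.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)

def square(x):
    return x * x

def run():
    with multiprocessing.Pool(processes=2, initializer=_reset_sigterm) as pool:
        return pool.map(square, range(5))

if __name__ == "__main__":
    print(run())
```

The trade-off is that workers then die silently on SIGTERM rather than surfacing an error, which is usually acceptable for Airflow tasks that are being killed anyway.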
> Airflow kills subprocesses created by task instances
>
>
> Key: AIRFLOW-972
> URL: https://issues.apache.org/jira/browse/AIRFLOW-972
> Project: Apache Airflow
> Issue Type: Bug
> Components: scheduler
> Affects Versions: Airflow 1.7.1
> Reporter: Richard Moorhead
> Priority: Minor
>
> We have a task which creates multiple subprocesses via
> [joblib|https://pythonhosted.org/joblib/parallel.html]; we're noticing that
> airflow seems to kill the subprocesses prior to their completion. Is there
> any way around this behavior?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)