[ https://issues.apache.org/jira/browse/AIRFLOW-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523708#comment-16523708 ]
James Meickle commented on AIRFLOW-1463:
----------------------------------------

We ran into this in production last night. Our worker instance ran out of memory; we suspect that it pulled messages from Celery but then could not fork new worker processes. This resulted in a state where the task didn't exist in Celery, but the scheduler thought it did. I would have expected this check to result in the `SCHEDULED`-but-missing-from-Celery tasks eventually getting reset: [https://github.com/apache/incubator-airflow/blob/1.9.0/airflow/jobs.py#L213] But it looks like this only runs on scheduler startup, and not periodically?

> Scheduler does not reschedule tasks in QUEUED state
> ---------------------------------------------------
>
>                 Key: AIRFLOW-1463
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1463
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: cli
>         Environment: Ubuntu 14.04
>                      Airflow 1.8.0
>                      SQS-backed task queue, AWS RDS-backed meta storage
>                      The DAG folder is synced by a script on code push: an archive is
>                      downloaded from S3, unpacked, and moved, and an install script is run.
>                      The airflow executable is replaced with a symlink pointing to the latest
>                      version of the code; no airflow processes are restarted.
>            Reporter: Stanislav Pak
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Our pipelines-related code is deployed almost simultaneously on all airflow
> boxes: the scheduler+webserver box and the worker boxes. A common Python package
> is deployed on those boxes on every other code push (3-5 deployments per hour).
> Due to installation specifics, a DAG that imports a module from that package
> might fail to import. If the DAG import fails when a worker runs a task, the
> task is still removed from the queue, but the task's state is not changed, so
> in this case the task stays in the QUEUED state forever.
> Besides the described case, there is a scenario where this happens because of
> DAG update lag in the scheduler. A task can be scheduled with an old DAG, and a
> worker can run the task with a new DAG that fails to be imported. There might
> be other scenarios where this happens.
>
> Proposal:
> Catch errors when importing the DAG on task run, and clear the task instance
> state if the import fails. This should fix transient issues of this kind.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
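The proposed fix can be sketched roughly as follows. This is a minimal, self-contained illustration of the idea only: `State`, `TaskInstance`, `run_task`, and `broken_import` are simplified stand-ins invented for this sketch, not the real Airflow models or APIs.

```python
# Hypothetical sketch of the proposal: if the DAG fails to import when the
# worker picks up a task, clear the task instance's state so the scheduler can
# reschedule it, instead of leaving it stuck in QUEUED forever.

class State:
    """Simplified stand-in for Airflow's task states."""
    QUEUED = "queued"
    NONE = None  # cleared state: eligible for (re)scheduling


class TaskInstance:
    """Simplified stand-in for Airflow's TaskInstance model."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.state = State.QUEUED


def run_task(ti, import_dag):
    """Import the DAG and run the task.

    On import failure, clear the task instance state rather than dropping the
    message and leaving the state as QUEUED.
    """
    try:
        dag = import_dag()  # may raise ImportError during a mid-deploy window
    except ImportError:
        ti.state = State.NONE  # clear state so the scheduler retries the task
        return None
    # ... normally: execute the task from `dag` here ...
    return dag


def broken_import():
    """Simulates a DAG import that fails because a dependency is mid-install."""
    raise ImportError("common package is mid-install")


ti = TaskInstance("example_task")
run_task(ti, broken_import)
print(ti.state)  # cleared, not "queued"
```

Under this sketch, a task whose DAG import fails transiently (e.g. during one of the 3-5 hourly deployments described above) is returned to a schedulable state instead of being stranded.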