ROOBALJINDAL opened a new issue, #67272:
URL: https://github.com/apache/airflow/issues/67272
### Under which category would you file this issue?
Providers
### Apache Airflow version
3.0.6
### What happened and how to reproduce it?
We upgraded AWS MWAA Airflow from 2.7.2 to 3.0.6 and noticed an intermittent
issue. When we submit jobs to EMR Serverless from our DAGs via
EmrServerlessStartJobOperator, the jobs are submitted to EMR Serverless and
finish successfully there, but the corresponding Airflow task is marked as
failed. Out of 100 tasks, 98-99 proceed fine and 1 or 2 fail at random. We saw
one pattern: the affected tasks fail after 20-21 seconds. The failures are
completely random and not tied to a particular task.
Either something is wrong with the new version of Airflow, or some
configuration is missing on our end.
requirements.txt for both Airflow versions:
**Airflow 3.0.6**
```
--constraint "/usr/local/airflow/dags/constraints-3.11_spark_trino.txt"
apache-airflow-providers-apache-spark==5.3.2
apache-airflow-providers-amazon==9.12.0
apache-airflow-providers-ssh==4.1.3
types-paramiko==3.5.0.20250801
sshtunnel==0.4.0
requests==2.32.5
orjson==3.11.2
cachetools==5.5.2
Authlib==1.6.2
apache-airflow-providers-apache-livy==4.4.2
apache-airflow-providers-http==5.3.3
confluent-kafka==2.11.1
apache-airflow-providers-apache-kafka==1.10.2
fastavro==1.12.0
```
**Airflow 2.7.2**
```
--constraint "/usr/local/airflow/dags/constraints-3.7_spark_trino.txt"
apache-airflow-providers-apache-spark==3.0.0
apache-airflow-providers-amazon==6.0.0
apache-airflow-providers-ssh==3.2.0
types-paramiko==2.11.6
sshtunnel==0.4.0
requests==2.28.1
apache-airflow-providers-apache-livy==3.1.0
apache-airflow-providers-http==4.0.0
```
The following are the logs of a task that fails randomly:
```
Reading remote log from Cloudwatch log_group:
arn:aws:logs:xxxxx:log-group:airflow-abc-MwaaEnvironment-Task log_stream:
dag_id=xxx/run_id=manual__2026-05-19T10_35_27.159729+00_00/task_id=mytaskid/attempt=1.log
An error occurred (ResourceNotFoundException) when calling the GetLogEvents
operation: The specified log stream does not exist.
```
I don't think the task is failing because of the missing log stream in
CloudWatch; if that were the cause, this error should show up for other tasks
as well. The failed task also never logged that the job was submitted to EMR
successfully, as the other tasks do.
I logged a similar issue earlier. The Airflow team fixed the waiter error codes
for throttling and asked me to open a separate issue, since this is a task
management issue.
Reference to the original issue:
https://github.com/apache/airflow/issues/67178
I still face the same issue with the fix the Airflow team provided in
https://github.com/apache/airflow/issues/67178 . In addition to the case
described above, there is another task for which we do get some task logs.
Below are the task logs for the same table from a run that worked and a run
that failed after the fix was applied. For the failed one, the job was
submitted and succeeded in EMR.
**Passed:**
```
Reading remote log from Cloudwatch log_group:
arn:aws:logs:us-west-2:xxx:log-group:airflow-abc-MwaaEnvironment-Task
log_stream:
dag_id=mynamespace_xxxxx/run_id=manual__2026-05-20T06_53_06.800846+00_00/task_id=KP.mynamespace_csv_ingest_mytable/attempt=1.log
[2026-05-20, 12:35:55] WARNING -
/usr/local/airflow/.local/lib/python3.12/site-packages/flask_sqlalchemy/model.py:121:
SAWarning: This declarative base already contains a class with the same class
name and module name as iam.MWAASession, and will be replaced in the
string-lookup table. super(BindMetaMixin, cls).__init__(name, bases, d):
source="py.warnings"
[2026-05-20, 12:35:55] INFO - DAG bundles loaded: dags-folder:
source="airflow.dag_processing.bundles.manager.DagBundlesManager"
[2026-05-20, 12:35:55] INFO - Filling up the DagBag from
/usr/local/airflow/dags/mynamespace_ns/csv_load_dags/xxxxx.py:
source="airflow.models.dagbag.DagBag"
[2026-05-20, 12:35:55] WARNING -
/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/models/connection.py:471:
DeprecationWarning: Using Connection.get_connection_from_secrets from
`airflow.models` is deprecated.Please use `get` on Connection from
sdk(`airflow.sdk.Connection`) instead warnings.warn(: source="py.warnings"
[2026-05-20, 12:35:56] INFO - Connection Retrieved 'aws_default':
source="airflow.hooks.base"
[2026-05-20, 12:35:56] INFO - Starting job on Application: myappid:
source="airflow.task.operators.edfx_emr_serverless_operator.EdfxEmrServerlessStartJobOperator"
[2026-05-20, 12:35:56] INFO - EMR serverless job started: 00g5ql0rdccnpg0n:
source="airflow.task.operators.edfx_emr_serverless_operator.EdfxEmrServerlessStartJobOperator"
[2026-05-20, 12:35:56] INFO - Serverless Job status is: SUBMITTED -
SUBMITTED: source="waiter_with_logging"
[2026-05-20, 12:36:56] INFO - Serverless Job status is: RUNNING:
source="waiter_with_logging"
[2026-05-20, 12:37:56] INFO - Pushing xcom:
ti="RuntimeTaskInstance(id=UUID('019cc78ed-941e-7f0f656105c6'),
task_id='KP.mynamespace_csv_ingest_mytable', dag_id='mynamespace_xxxxx',
run_id='manual__2026-05-20T06:53:06.800846+00:00', try_number=1, map_index=-1,
hostname='ip-10-151-47-166.us-west-2.compute.internal', context_carrier={},
task=<Task(EdfxEmrServerlessStartJobOperator):
KP.mynamespace_csv_ingest_mytable>,
bundle_instance=LocalDagBundle(name=dags-folder), max_tries=0,
start_date=datetime.datetime(2026, 5, 20, 7, 5, 55, 318443,
tzinfo=datetime.timezone.utc), end_date=None, state=<TaskInstanceState.RUNNING:
'running'>, is_mapped=False, rendered_map_index=None,
log_url='https://a5cca3ac-1398-448f-a42f-1e87b05867a4-vpce.c29.airflow.us-west-2.on.awsdags/mynamespace_xxxxx/runs/manual__2026-05-20T06%3A53%3A06.800846%2B00%3A00/tasks/KP.mynamespace_csv_ingest_mytable?try_number=1%27)%22:
source="task"
[2026-05-20, 12:37:56] WARNING - No XCom value found; defaulting to None.:
key="emr_serverless_s3_logs": dag_id="mynamespace_xxxxx":
task_id="KP.mynamespace_csv_ingest_mytable":
run_id="manual__2026-05-20T06:53:06.800846+00:00": map_index=-1: source="task"
[2026-05-20, 12:37:56] WARNING - No XCom value found; defaulting to None.:
key="emr_serverless_cloudwatch_logs": dag_id="mynamespace_xxxxx":
task_id="KP.mynamespace_csv_ingest_mytable":
run_id="manual__2026-05-20T06:53:06.800846+00:00": map_index=-1: source="task"
[2026-05-20, 12:37:56] WARNING - No XCom value found; defaulting to None.:
key="emr_serverless_dashboard": dag_id="mynamespace_xxxxx":
task_id="KP.mynamespace_csv_ingest_mytable":
run_id="manual__2026-05-20T06:53:06.800846+00:00": map_index=-1: source="task"
[2026-05-20, 12:37:56] WARNING - No XCom value found; defaulting to None.:
key="emr_serverless_logs": dag_id="mynamespace_xxxxx":
task_id="KP.mynamespace_csv_ingest_mytable":
run_id="manual__2026-05-20T06:53:06.800846+00:00": map_index=-1: source="task"
```
**Failed:**
```
[2026-05-20, 16:01:25] INFO - Starting job on Application: myappid:
source="airflow.task.operators.edfx_emr_serverless_operator.EdfxEmrServerlessStartJobOperator"
[2026-05-20, 16:01:25] INFO - EMR serverless job started: jobid:
source="airflow.task.operators.edfx_emr_serverless_operator.EdfxEmrServerlessStartJobOperator"
[2026-05-20, 16:01:25] INFO - Using backported waiter_with_logging.wait
(module=waiter_with_logging,
file=/usr/local/airflow/dags/mynamespace_ns/_commonutil/waiter_with_logging.py,
max_attempts=480, delay=60s, args={'applicationId': 'myappid', 'jobRunId':
'jobid'}): source="waiter_with_logging"
[2026-05-20, 16:01:25] INFO - Serverless Job status is [attempt 1/480]:
SUBMITTED - SUBMITTED: source="waiter_with_logging"
[2026-05-20, 16:01:41] ERROR - Server indicated the task shouldn't be
running anymore. Terminating process:
detail={"detail":{"reason":"not_running","message":"TI is no longer in the
running state and task should terminate","current_state":"failed"}}:
source="task"
[2026-05-20, 16:01:41] INFO - Stopping job run with jobId - jobid:
source="airflow.task.operators.edfx_emr_serverless_operator.EdfxEmrServerlessStartJobOperator"
[2026-05-20, 16:01:41] ERROR - Task failed with exception:
source="task"ClientError: An error occurred (AccessDeniedException) when
calling the CancelJobRun operation: User:
arn:aws:sts::accid:assumed-role/abc-MwaaEnvRole/AmazonMWAA-iamrole is not
authorized to perform: emr-serverless:CancelJobRun on resource:
arn:aws:emr-serverless:us-west-2:accid:/applications/myappid/jobruns/jobid
because no identity-based policy allows the emr-serverless:CancelJobRun action
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py",
line 920 in run
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py",
line 1215 in _execute_task
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/sdk/bases/operator.py",
line 397 in wrapper
File
"/usr/local/airflow/dags/mynamespace_ns/_commonutil/edfx_emr_serverless_operator.py",
line 101 in execute
File
"/usr/local/airflow/dags/mynamespace_ns/_commonutil/waiter_with_logging.py",
line 101 in wait
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py",
line 891 in _on_term
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/providers/amazon/aws/operators/emr.py",
line 1294 in on_kill
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/botocore/client.py",
line 601 in _api_call
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/botocore/context.py",
line 123 in wrapper
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/botocore/client.py",
line 1074 in _make_api_call
[2026-05-20, 16:01:41] WARNING - No XCom value found; defaulting to None.:
key="emr_serverless_s3_logs": dag_id="mynamespace_xxxxx":
task_id="KP.mynamespace_csv_ingest_endispositionreason":
run_id="manual__2026-05-20T09:55:43.497627+00:00": map_index=-1: source="task"
[2026-05-20, 16:01:41] WARNING - No XCom value found; defaulting to None.:
key="emr_serverless_cloudwatch_logs": dag_id="mynamespace_xxxxx":
task_id="KP.mynamespace_csv_ingest_endispositionreason":
run_id="manual__2026-05-20T09:55:43.497627+00:00": map_index=-1: source="task"
[2026-05-20, 16:01:41] WARNING - No XCom value found; defaulting to None.:
key="emr_serverless_dashboard": dag_id="mynamespace_xxxxx":
task_id="KP.mynamespace_csv_ingest_endispositionreason":
run_id="manual__2026-05-20T09:55:43.497627+00:00": map_index=-1: source="task"
[2026-05-20, 16:01:41] WARNING - No XCom value found; defaulting to None.:
key="emr_serverless_logs": dag_id="mynamespace_xxxxx":
task_id="KP.mynamespace_csv_ingest_endispositionreason":
run_id="manual__2026-05-20T09:55:43.497627+00:00": map_index=-1: source="task"
[2026-05-20, 16:01:41] ERROR - Top level error: source="task"UndefinedError:
'airflow.sdk.execution_time.task_runner.RuntimeTaskInstance object' has no
attribute 'mark_success_url'
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py",
line 1353 in main
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py",
line 1330 in finalize
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py",
line 1161 in _send_task_error_email
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py",
line 411 in _get_email_subject_content
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py",
line 408 in render
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/utils/helpers.py",
line 244 in render_template_to_string
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/airflow/utils/helpers.py",
line 239 in render_template
File "<template>", line 26 in root
File
"/usr/local/airflow/.local/lib/python3.12/site-packages/jinja2/runtime.py",
line 859 in _fail_with_undefined_error
[2026-05-20, 16:01:41] WARNING - Process exited abnormally: exit_code=1:
source="task"
[2026-05-20, 16:01:41] ERROR - Task killed!: source="task"
```
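To compare the termination message across failed runs, the following is a minimal sketch that pulls the `detail` payload out of a log line like the one in the failed log above. The function name `parse_termination_detail` is hypothetical and not part of Airflow, and the sample line is the wrapped log entry joined back into a single line.

```python
import json
from typing import Optional


def parse_termination_detail(log_line: str) -> Optional[dict]:
    """Return the inner `detail` dict from a task log line, or None if absent."""
    marker = "detail="
    start = log_line.find(marker)
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever trails it (here: `: source="task"`).
        payload, _ = json.JSONDecoder().raw_decode(log_line, start + len(marker))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    # In our logs the fields are nested under a second "detail" key.
    inner = payload.get("detail", payload)
    return inner if isinstance(inner, dict) else None


# Sample line copied from the failed log, with the e-mail line wraps removed.
LINE = (
    "[2026-05-20, 16:01:41] ERROR - Server indicated the task shouldn't be "
    "running anymore. Terminating process: "
    'detail={"detail":{"reason":"not_running","message":"TI is no longer in the '
    'running state and task should terminate","current_state":"failed"}}: '
    'source="task"'
)

detail = parse_termination_detail(LINE)
print(detail["reason"], detail["current_state"])  # prints: not_running failed
```

In this run the payload says the task instance was already in the `failed` state on the server side when the termination was requested, so the `CancelJobRun` call in `on_kill` and its `AccessDeniedException` appear to be a consequence of the termination rather than its cause.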
### What you think should happen instead?
If the job was submitted to EMR successfully, the task should reflect that and
proceed without failing.
### Operating System
_No response_
### Deployment
Amazon (AWS) MWAA
### Apache Airflow Provider(s)
amazon
### Versions of Apache Airflow Providers
apache-airflow-providers-amazon==9.12.0
### Official Helm Chart version
Not Applicable
### Kubernetes Version
_No response_
### Helm Chart configuration
_No response_
### Docker Image customizations
_No response_
### Anything else?
We tried setting the waiter max attempts to 500 and the waiter delay to 60s, but it did not help.
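Simple arithmetic on the timestamps from the failed log above suggests why the waiter settings are unlikely to matter here. The sketch below is only an illustration: the constant names are ours, the two timestamps are the first waiter poll and the termination message from that log, and the delay and attempt count are the values we tried.

```python
from datetime import datetime

# Waiter settings we tried (see above).
WAITER_DELAY_S = 60
WAITER_MAX_ATTEMPTS = 500

# Timestamps copied from the failed task log.
FMT = "%Y-%m-%d, %H:%M:%S"
first_poll = datetime.strptime("2026-05-20, 16:01:25", FMT)
terminated = datetime.strptime("2026-05-20, 16:01:41", FMT)

# Seconds between the first status poll and the termination message.
elapsed_s = (terminated - first_poll).total_seconds()

# Total time the waiter would be allowed to keep polling.
budget_s = WAITER_DELAY_S * WAITER_MAX_ATTEMPTS

# The first poll happens immediately; each further poll needs a full delay.
polls_done = 1 + int(elapsed_s // WAITER_DELAY_S)

print(
    f"terminated after {elapsed_s:.0f}s, "
    f"polls completed: {polls_done}, waiter budget: {budget_s}s"
)
# prints: terminated after 16s, polls completed: 1, waiter budget: 30000s
```

The task is terminated before the waiter reaches its second poll, so in this run the waiter budget was never close to being exhausted.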
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)