Nishieee commented on issue #67178: URL: https://github.com/apache/airflow/issues/67178#issuecomment-4500117201
> I looked into this bug and found out why the EMR tasks are failing randomly. The problem is that our AWS waiter fails immediately when it gets a throttling error from the API. To fix this, I will make the waiter ignore these temporary API throttling errors and keep retrying. It will still fail right away for real errors like wrong permissions, so that is safe. I am working on this fix right now and will open a pull request soon to solve it.

I think the fix is heading in the right direction, but one thing worth thinking about: waiter_max_attempts still decrements on every iteration of the loop, including the throttled ones. So on a long-running EMR job with sustained throttling, in theory you could exhaust max_attempts not because the job is actually stuck but because too many polls got throttled. Probably rare in practice given typical max_attempts values, but it's the same failure mode the original issue is reporting, just shifted later in time. It might be worth tracking throttle retries separately, or at least logging when retries are eating into the attempts budget.
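
To make the "track throttle retries separately" idea concrete, here is a rough sketch in plain Python. It is not the Airflow or botocore waiter code: `ThrottledError`, `wait_with_separate_budgets`, and `max_throttle_retries` are hypothetical names invented for illustration, and the real fix would have to detect throttling from the actual AWS error response. The point is only that throttled polls draw from their own budget and get logged, so they never eat into max_attempts.

```python
import logging
import time

log = logging.getLogger(__name__)


class ThrottledError(Exception):
    """Hypothetical stand-in for an API throttling error raised by a poll."""


def wait_with_separate_budgets(poll, max_attempts, max_throttle_retries,
                               delay=0.0, sleep=time.sleep):
    """Call poll() until it returns True.

    Only polls that got a real answer count against max_attempts.
    Throttled polls are counted against max_throttle_retries instead.
    Returns (attempts_used, throttled_polls).
    """
    attempts = 0
    throttled = 0
    while attempts < max_attempts:
        try:
            done = poll()
        except ThrottledError:
            throttled += 1
            if throttled > max_throttle_retries:
                # Sustained throttling: surface it as a throttling failure,
                # not as "job never finished".
                raise
            log.warning(
                "poll throttled (%d/%d), not counted against max_attempts",
                throttled, max_throttle_retries,
            )
            sleep(delay)
            continue
        attempts += 1
        if done:
            return attempts, throttled
        sleep(delay)
    raise TimeoutError(
        f"not done after {attempts} attempts ({throttled} throttled polls)"
    )
```

With this shape, a run that hits three throttled polls and two real ones uses only two of its attempts, and a job that times out reports how many polls were throttled along the way. Any other exception from poll() (for example a permissions error) still propagates immediately.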
