[ https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949736#comment-16949736 ]

ASF subversion and git services commented on AIRFLOW-5218:
----------------------------------------------------------

Commit a198969b5e3acaee67479ebab145d29866607453 in airflow's branch refs/heads/v1-10-stable from Darren Weber
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=a198969 ]

[AIRFLOW-5218] Less polling of AWS Batch job status (#5825)

https://issues.apache.org/jira/browse/AIRFLOW-5218

- avoid the AWS API throttle limits for highly concurrent tasks
- a small increase in the backoff factor could avoid excessive polling
- random sleep before polling to allow the batch task to spin-up
- the random sleep helps to avoid API throttling
- revise the retry logic slightly to avoid unnecessary pause when there are no more retries required

(cherry picked from commit fc972fb6c82010f9809a437eb6b9772918a584d2)

> AWS Batch Operator - status polling too often, esp. for high concurrency
> ------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5218
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: aws, contrib
>    Affects Versions: 1.10.4
>            Reporter: Darren Weber
>            Assignee: Darren Weber
>            Priority: Major
>             Fix For: 2.0.0, 1.10.6
>
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available and has not been merged in years; see
> - [https://github.com/boto/botocore/pull/1307]
> - see also [https://github.com/broadinstitute/cromwell/issues/4303]
>
> This is a curious case of premature optimization. In the meantime, the fallback is the exponential-backoff routine for the status checks on the batch job. Unfortunately, when the concurrency of Airflow tasks is very high (hundreds of tasks), this fallback polling hits the AWS Batch API too hard, the AWS API throttle throws an error, and the Airflow task fails simply because the status is polled too frequently.
>
> Check the output from the retry algorithm: within the first 10 retries, the status of an AWS Batch job is checked about 10 times at roughly 1 poll/sec. When an Airflow instance is running tens or hundreds of concurrent batch jobs, this hits the API too frequently and crashes the Airflow task (and it occupies a worker in too much busy work).
>
> {code:python}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)]
> Out[4]:
> [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1600000000000001,
>  1.25,
>  1.36,
>  1.4900000000000002,
>  1.6400000000000001,
>  1.81,
>  2.0,
>  2.21,
>  2.4400000000000004,
>  2.6900000000000004,
>  2.9600000000000004,
>  3.25,
>  3.5600000000000005,
>  3.8900000000000006,
>  4.24,
>  4.61]
> {code}
>
> Possible solutions are to introduce an initial sleep (say 60 sec?) right after issuing the request, so that the batch job has some time to spin up. The job progresses through a sequence of phases before it reaches the RUNNING state, and polling for each phase of that sequence might help. Since batch jobs tend to be long-running (rather than near-real-time) jobs, it might also help to poll less frequently once the job is in the RUNNING state; something on the order of tens of seconds might be reasonable for batch jobs. Maybe the class could expose a parameter for the rate of polling (or a callable)?
>
> Another option is to use something like the sensor-poke approach, with rescheduling, e.g.
> - [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
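
To make the mitigation concrete, here is a minimal sketch of the polling schedule described above: a randomized startup delay before the first status check, exponential backoff with jitter between checks, and no pause after the final attempt. This is an illustration, not the code merged in #5825; the `poll_delay_seconds` and `wait_for_batch_job` helpers, their constants, and the `get_status` callable are hypothetical stand-ins for the operator's calls to the AWS Batch describe-jobs API.

{code:python}
import random
import time


def poll_delay_seconds(tries: int, base: float = 2.0, cap: float = 30.0) -> float:
    """Exponential backoff with random jitter, capped at `cap` seconds."""
    delay = min(cap, base * (2 ** tries))
    # jitter spreads concurrent pollers apart so they do not hit the API in lockstep
    return random.uniform(delay / 2.0, delay)


def wait_for_batch_job(get_status, max_tries: int = 20) -> str:
    """Poll `get_status()` until the job reaches a terminal state.

    `get_status` is a hypothetical callable that would wrap
    boto3.client("batch").describe_jobs(jobs=[job_id]) and return the
    job's current status string.
    """
    # random sleep before the first poll, so the batch job has time to spin up
    # and so highly concurrent tasks do not all poll at the same moment
    time.sleep(random.uniform(1.0, 10.0))

    for tries in range(max_tries):
        status = get_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        if tries < max_tries - 1:
            # skip the pause when there are no retries left
            time.sleep(poll_delay_seconds(tries))

    raise RuntimeError("AWS Batch job did not reach a terminal state in time")
{code}

With these example defaults the delays grow roughly 2 s, 4 s, 8 s, 16 s, then cap at 30 s (before jitter), which keeps the aggregate request rate well below the roughly one-call-per-second schedule shown in the quoted output, at the cost of slightly later detection of job completion.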