[jira] [Commented] (HADOOP-17377) ABFS: MsiTokenProvider doesn't retry HTTP 429 from the Instance Metadata Service

ASF GitHub Bot (Jira) Tue, 14 Nov 2023 21:38:06 -0800


    [ 
https://issues.apache.org/jira/browse/HADOOP-17377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786175#comment-17786175
 ]


ASF GitHub Bot commented on HADOOP-17377:
-----------------------------------------

anmolanmol1234 commented on PR #5273:
URL: https://github.com/apache/hadoop/pull/5273#issuecomment-1811839555

   > I think this PR is great; however, there's still one related open 
problem: the default value (2) for 
`fs.azure.oauth.token.fetch.retry.delta.backoff` is incorrect. The value of 2 
is consistent with the MS recommendation 
(https://docs.microsoft.com/en-us/azure/active-directory/managed-service-identity/how-to-use-vm-token#retry-guidance),
 which assumes **seconds**, but because the value is passed to Thread.sleep 
[here](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/oauth2/AzureADAuthenticator.java#L326),
 it is interpreted as **milliseconds**. I think we should change the default 
to 2000. @steveloughran @anmolanmol1234 do you think we can implement this 
minimal change in this PR, or should we open a separate one?
   
   Will make this change in an iteration of this PR, but will need some time for 
the mockito upgrade PR.
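
   The unit mismatch raised in the quoted comment can be illustrated with a 
short sketch. `Thread.sleep(long)` takes milliseconds, so a value of 2 pauses 
for roughly 2 ms rather than 2 s. The class, constant, and method names below 
are hypothetical and are not taken from `AzureADAuthenticator`; this is only 
an illustration under that assumption, not the change proposed for the PR.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the unit mismatch; not the actual ABFS code.
public class BackoffUnits {
    // The default discussed in the comment: 2, intended as seconds.
    static final long DELTA_BACKOFF = 2;

    // Convert a delay expressed in seconds to the milliseconds
    // that Thread.sleep(long) expects.
    static long toSleepMillis(long seconds) {
        return TimeUnit.SECONDS.toMillis(seconds);
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        // Passing the raw value sleeps for about 2 milliseconds, not 2 seconds.
        Thread.sleep(DELTA_BACKOFF);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("Raw value slept for about " + elapsedMs + " ms");
        System.out.println("Intended delay in ms: " + toSleepMillis(DELTA_BACKOFF));
    }
}
```

   Either changing the default to 2000 or converting at the call site would 
make the delay match the intent; which of the two the PR adopts is not 
settled in this thread.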




> ABFS: MsiTokenProvider doesn't retry HTTP 429 from the Instance Metadata 
> Service
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-17377
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17377
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/azure
>    Affects Versions: 3.2.1
>            Reporter: Brandon
>            Priority: Major
>              Labels: pull-request-available
>
> *Summary*
>  The instance metadata service has its own guidance for error handling and 
> retry, which differs from that of the Blob store. 
> [https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#error-handling]
> In particular, it responds with HTTP 429 if the request rate is too high, 
> whereas the Blob store responds with HTTP 503. The retry policy in use only 
> accounts for the latter, as it retries any status >=500. This can result in 
> job instability when running multiple processes on the same host.
> *Environment*
>  * Spark talking to an ABFS store
>  * Hadoop 3.2.1
>  * Running on an Azure VM with user-assigned identity, ABFS configured to use 
> MsiTokenProvider
>  * 6 executor processes on each VM
> *Example*
>  Here's an example error message and stack trace. It's always the same stack 
> trace. This appears in logs a few hundred to low thousands of times a day. 
> It's luckily skating by since the download operation is wrapped in 3 retries.
> {noformat}
> AADToken: HTTP connection failed for getting token from AzureAD. Http 
> response: 429 null
> Content-Type: application/json; charset=utf-8 Content-Length: 90 Request ID:  
> Proxies: none
> First 1K of Body: {"error":"invalid_request","error_description":"Temporarily 
> throttled, too many requests"}
>       at 
> org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:190)
>       at 
> org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:125)
>       at 
> org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:506)
>       at 
> org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:489)
>       at 
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:208)
>       at 
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:473)
>       at 
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:437)
>       at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1717)
>       at org.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:747)
>       at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:724)
>       at org.apache.spark.util.Utils$.fetchFile(Utils.scala:496)
>       at 
> org.apache.spark.executor.Executor.$anonfun$updateDependencies$7(Executor.scala:812)
>       at 
> org.apache.spark.executor.Executor.$anonfun$updateDependencies$7$adapted(Executor.scala:803)
>       at 
> scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:792)
>       at 
> scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
>       at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
>       at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
>       at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
>       at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
>       at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:791)
>       at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:803)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:375)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748){noformat}
>  CC [~mackrorysd], [~ste...@apache.org]
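
The gap described in the issue summary can be sketched as two status-code 
predicates: one that retries only statuses >=500, as the summary describes, 
and one that also treats HTTP 429 as retryable. The class and method names 
are hypothetical and do not correspond to the ABFS retry policy classes; this 
is a minimal sketch, not the fix implemented in the PR.

```java
// Hypothetical sketch; names do not come from the hadoop-azure code base.
public class RetryableStatus {
    static final int HTTP_TOO_MANY_REQUESTS = 429;

    // Behaviour described in the issue: only statuses >= 500 are retried,
    // so a throttling response from the metadata service fails immediately.
    static boolean retriesServerErrorsOnly(int status) {
        return status >= 500;
    }

    // Variant that also retries throttling responses (HTTP 429).
    static boolean retriesThrottlingToo(int status) {
        return status == HTTP_TOO_MANY_REQUESTS || status >= 500;
    }

    public static void main(String[] args) {
        int[] statuses = {200, 429, 503};
        for (int status : statuses) {
            System.out.println(status
                + " serverErrorsOnly=" + retriesServerErrorsOnly(status)
                + " throttlingToo=" + retriesThrottlingToo(status));
        }
    }
}
```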



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
