[ 
https://issues.apache.org/jira/browse/HADOOP-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007978#comment-18007978
 ] 

Anuj Modi commented on HADOOP-19620:
------------------------------------

Yes, that is true, we have to fetch the token for every request, even for 
requests that are retried due to some server error.
But the retry loops are not completely nested.

For every request, we first try to fetch the token, and if all the retries for 
the token fetch are exhausted, we simply fail the whole request; that request 
is not retried by AbfsRestOperation.

What you might be observing is the case where the token fetch for the original 
request failed with UnknownHostException for a few retries but eventually 
succeeded; the request for which the token was fetched then failed with some 
retriable error and went for a retry. For this retried request we will again 
try to fetch the token. If all the retries of the token fetch fail this time 
as well, we won't retry the original request any further and will fail the 
operation.

I can see how this can be confusing but that's the way network errors are 
handled today.
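
To make the control flow concrete, here is a minimal sketch of the two retry 
loops described above. All names (executeWithRetries, fetchTokenWithRetries, 
MAX_TOKEN_RETRIES, MAX_REQUEST_RETRIES) are illustrative, not the actual ABFS 
identifiers:

{code:java}
import java.io.IOException;

// Illustrative sketch only; names are hypothetical, not the ABFS source.
public final class RetryFlowSketch {

  static final int MAX_TOKEN_RETRIES = 3;
  static final int MAX_REQUEST_RETRIES = 3;

  // Outer loop: one attempt of the storage request per iteration.
  String executeWithRetries() throws IOException {
    for (int attempt = 0; ; attempt++) {
      // The token fetch sits OUTSIDE the try/catch: if it exhausts its
      // own retry budget and throws, the whole operation fails right
      // here and the outer loop does not retry it.
      String token = fetchTokenWithRetries();
      try {
        return sendStorageRequest(token);
      } catch (IOException e) {
        if (attempt >= MAX_REQUEST_RETRIES || !isRetriable(e)) {
          throw e; // request retry budget exhausted, or not retriable
        }
        // Otherwise loop again; the retried request fetches a fresh
        // token, with a fresh inner retry budget.
      }
    }
  }

  // Inner loop: the token fetch has its own independent retry budget.
  String fetchTokenWithRetries() throws IOException {
    IOException last = null;
    for (int i = 0; i <= MAX_TOKEN_RETRIES; i++) {
      try {
        return requestTokenOnce();
      } catch (IOException e) {
        last = e; // e.g. UnknownHostException from a DNS failure
      }
    }
    throw last;
  }

  // Placeholders standing in for the real HTTP calls.
  String requestTokenOnce() throws IOException { return "token"; }
  String sendStorageRequest(String token) throws IOException { return "ok"; }
  boolean isRetriable(IOException e) { return true; }
}
{code}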

> [ABFS] AzureADAuthenticator should be able to retry on UnknownHostException
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-19620
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19620
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/azure
>    Affects Versions: 3.4.1
>            Reporter: Serhii Nesterov
>            Priority: Minor
>
> When Hadoop is requested to perform operations against ADLS Gen2 storage, 
> *AbfsRestOperation* attempts to obtain an access token from Microsoft. Under 
> the hood, it uses a simple *java.net.HttpURLConnection* HTTP client.
> Occasionally, environments may run into intermittent network issues, 
> including DNS-related {*}UnknownHostException{*}. Technically, the HTTP 
> client throws an *IOException* whose cause is {*}UnknownHostException{*}. 
> *AzureADAuthenticator* in its turn catches the {*}IOException{*}, sets 
> *httperror = -1*, and then checks whether the error is recoverable and can 
> be retried. However, it is neither an instance of 
> {*}MalformedURLException{*}, nor an instance of {*}FileNotFoundException{*}, 
> nor a recoverable status code ({*}< 100 || == 408 || >= 500 && != 501 && != 
> 505{*}), hence a retry never occurs. This is sensitive for our project 
> because it causes problems with state recovery.
> The final exception stack trace on the client side looks as follows (Apache 
> Spark application, tenant ID is redacted):
> {code:java}
> Job aborted due to stage failure: Task 14 in stage 384.0 failed 4 times, most recent failure: Lost task 14.3 in stage 384.0 (TID 3087) (10.244.91.7 executor 29): Status code: -1 error code: null error message: Auth failure: HTTP Error -1; url='https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token' AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException: login.microsoftonline.com
>   at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:321)
>   at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:263)
>   at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:235)
>   at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation(IOStatisticsBinding.java:494)
>   at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:465)
>   at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:233)
>   at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus(AbfsClient.java:1099)
>   at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:1164)
>   at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:766)
>   at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:756)
>   at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:211)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1(ParquetFileFormat.scala:210)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:213)
> ...{code}
> I can see that this exception is recovered from in other parts of the Hadoop 
> project (e.g., {*}DefaultAMSProcessor{*}).
> We would like to have a similar retry mechanism for fetching tokens. 
> Moreover, *AbfsRestOperation* already handles and retries 
> {*}UnknownHostException{*}, but that part seems to apply only to storage 
> communication, not token retrieval. I suppose the solution would be simple: 
> check whether the cause of the *IOException* is an instance of 
> *UnknownHostException* and apply the same retry policies as for other types 
> of recoverable errors (see the sketch after the link below).
> The link to the code where I believe *UnknownHostException* would be checked 
> for:
> [https://github.com/apache/hadoop/blob/61096793f6368d16a21cde8b1c8f8dce41a4c102/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/oauth2/AzureADAuthenticator.java#L354]
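> A minimal sketch of that proposal (hypothetical helper name and structure, 
> not a patch against the actual file):
> {code:java}
> import java.io.FileNotFoundException;
> import java.io.IOException;
> import java.net.MalformedURLException;
> import java.net.UnknownHostException;
> 
> // Hypothetical illustration of the proposed check; not the actual
> // AzureADAuthenticator code.
> final class RecoverableCheckSketch {
> 
>   static boolean shouldRetryTokenFetch(IOException e, int httperror) {
>     if (e instanceof MalformedURLException
>         || e instanceof FileNotFoundException) {
>       return false; // never retriable, as today
>     }
>     // Proposed addition: treat a DNS hiccup as transient and retriable,
>     // whether it is the exception itself or the cause of a wrapper.
>     if (e instanceof UnknownHostException
>         || e.getCause() instanceof UnknownHostException) {
>       return true;
>     }
>     // Existing status-code predicate quoted in the description above.
>     return httperror < 100 || httperror == 408
>         || (httperror >= 500 && httperror != 501 && httperror != 505);
>   }
> 
>   public static void main(String[] args) {
>     IOException dns = new IOException(
>         new UnknownHostException("login.microsoftonline.com"));
>     System.out.println(shouldRetryTokenFetch(dns, -1)); // true
>   }
> }
> {code}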



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
