[ 
https://issues.apache.org/jira/browse/HADOOP-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-19620:
------------------------------------
    Summary: [ABFS] AzureADAuthenticator should be able to retry on 
UnknownHostException  (was: AzureADAuthenticator should be able to retry on 
UnknownHostException)

> [ABFS] AzureADAuthenticator should be able to retry on UnknownHostException
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-19620
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19620
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/azure
>    Affects Versions: 3.4.1
>            Reporter: Serhii Nesterov
>            Priority: Minor
>
> When Hadoop performs operations against ADLS Gen2 storage,
> *AbfsRestOperation* attempts to obtain an access token from Microsoft.
> Under the hood, it uses a plain *java.net.HttpURLConnection* HTTP client.
> Occasionally, environments run into intermittent network issues,
> including DNS-related *UnknownHostException*. Technically, the HTTP
> client throws an *IOException* whose cause is *UnknownHostException*.
> *AzureADAuthenticator* in turn catches the *IOException*, sets *httperror
> = -1*, and then checks whether the error is recoverable and can be retried.
> However, it is neither an instance of *MalformedURLException*, nor an
> instance of *FileNotFoundException*, nor a recoverable status code (*<
> 100 || == 408 || >= 500 && != 501 && != 505*), hence a retry never occurs.
> This matters for our project because it causes problems with state recovery.
> The final exception stack trace on the client side looks as follows (Apache 
> Spark application, tenant ID is redacted):
> {code:java}
> Job aborted due to stage failure: Task 14 in stage 384.0 failed 4 times, most 
> recent failure: Lost task 14.3 in stage 384.0 TID 3087 10.244.91.7 executor 
> 29 : Status code: -1 error code: null error message: Auth failure: HTTP Error 
> -1; url='https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token' 
> AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException: 
> login.microsoftonline.com
> at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation AbfsRestOperation.java:321
> at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute AbfsRestOperation.java:263
> at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0 AbfsRestOperation.java:235
> at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation IOStatisticsBinding.java:494
> at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation IOStatisticsBinding.java:465
> at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute AbfsRestOperation.java:233
> at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus AbfsClient.java:1099
> at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus AzureBlobFileSystemStore.java:1164
> at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:766
> at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:756
> at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath HadoopInputFile.java:39
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter ParquetFooterReader.java:39
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1 ParquetFileFormat.scala:211
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1 ParquetFileFormat.scala:210
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2 ParquetFileFormat.scala:213
> ...{code}
> I can see that this exception is recovered from in other parts of the
> Hadoop project (e.g., *DefaultAMSProcessor*).
> We would like a similar retry mechanism for fetching tokens. Moreover,
> *AbfsRestOperation* already handles and retries *UnknownHostException*, but
> that part seems to apply only to storage communication, not token
> retrieval. I suppose the solution would be simple: check whether the cause
> of the *IOException* is an instance of *UnknownHostException* and, if so,
> apply the same retry policies as for other types of recoverable errors.
> The link to the code where I believe UnknownHostException should be checked
> for:
> https://github.com/apache/hadoop/blob/61096793f6368d16a21cde8b1c8f8dce41a4c102/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/oauth2/AzureADAuthenticator.java#L354
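
The check proposed in the description could look roughly like the sketch
below. This is an illustration only, not the actual AzureADAuthenticator
code: the class UnknownHostRetrySketch, the TokenCall interface, and the
methods causedByUnknownHost and callWithRetry are hypothetical names, and
the fixed delay stands in for whatever retry policy the token fetch
already uses.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.UnknownHostException;

/** Hypothetical sketch; none of these names exist in Hadoop. */
final class UnknownHostRetrySketch {

  /** Stand-in for the token request; assumed to throw IOException on failure. */
  interface TokenCall<T> {
    T get() throws IOException;
  }

  /** Returns true if the failure or any of its causes is an UnknownHostException. */
  static boolean causedByUnknownHost(Throwable failure) {
    Throwable t = failure;
    // Bound the walk so a cyclic cause chain cannot loop forever.
    for (int depth = 0; t != null && depth < 16; depth++) {
      if (t instanceof UnknownHostException) {
        return true;
      }
      t = t.getCause();
    }
    return false;
  }

  /** Retries the call only when the failure traces back to a DNS lookup error. */
  static <T> T callWithRetry(TokenCall<T> call, int maxAttempts, long delayMillis)
      throws IOException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        return call.get();
      } catch (IOException e) {
        // Give up on the last attempt or on any failure unrelated to DNS.
        if (attempt >= maxAttempts || !causedByUnknownHost(e)) {
          throw e;
        }
        try {
          Thread.sleep(delayMillis); // assumption: fixed delay, no backoff
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new InterruptedIOException("interrupted while waiting to retry");
        }
      }
    }
  }
}
```

Walking the cause chain, rather than testing only the caught exception,
covers the case from the description where the HTTP client wraps the
UnknownHostException in another IOException.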



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
