[ 
https://issues.apache.org/jira/browse/HADOOP-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Serhii Nesterov updated HADOOP-19620:
-------------------------------------
    Component/s:     (was: auth)

> AzureADAuthenticator should be able to retry on UnknownHostException
> --------------------------------------------------------------------
>
>                 Key: HADOOP-19620
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19620
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 3.4.1
>            Reporter: Serhii Nesterov
>            Priority: Minor
>
> When Hadoop performs operations against ADLS Gen2 storage, 
> AbfsRestOperation attempts to obtain an access token from Microsoft. 
> Under the hood, it uses a plain java.net.HttpURLConnection HTTP client.
> Occasionally, environments run into intermittent network issues, including 
> DNS-related UnknownHostException. Technically, the HTTP client throws an 
> IOException whose cause is UnknownHostException. AzureADAuthenticator in turn 
> catches the IOException, sets httperror = -1, and then checks whether the 
> error is recoverable and can be retried. However, it is neither an instance 
> of MalformedURLException, nor an instance of FileNotFoundException, nor a 
> recoverable status code (< 100 || == 408 || >= 500 && != 501 && != 505), 
> hence a retry never occurs. This matters for our project because it causes 
> problems with state recovery.
> The final exception stack trace on the client side looks as follows (Apache 
> Spark application, tenant ID is redacted):
> {code:java}
> Job aborted due to stage failure: Task 14 in stage 384.0 failed 4 times, most 
> recent failure: Lost task 14.3 in stage 384.0 TID 3087 10.244.91.7 executor 
> 29 : Status code: -1 error code: null error message: Auth failure: HTTP Error 
> -1; url='https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token' 
> AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException: 
> login.microsoftonline.com
> at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation AbfsRestOperation.java:321
> at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute AbfsRestOperation.java:263
> at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0 AbfsRestOperation.java:235
> at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation IOStatisticsBinding.java:494
> at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation IOStatisticsBinding.java:465
> at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute AbfsRestOperation.java:233
> at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus AbfsClient.java:1099
> at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus AzureBlobFileSystemStore.java:1164
> at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:766
> at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:756
> at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath HadoopInputFile.java:39
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter ParquetFooterReader.java:39
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1 ParquetFileFormat.scala:211
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1 ParquetFileFormat.scala:210
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2 ParquetFileFormat.scala:213
> ...{code}
> I can see that this exception is recovered from in other parts of the 
> Hadoop project (e.g., DefaultAMSProcessor).
> We would like to have a similar retry mechanism for fetching tokens. 
> Moreover, AbfsRestOperation already handles and retries 
> UnknownHostException, but that part seems to apply only to storage 
> communication, not token retrieval.
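
To make the request concrete, here is a minimal sketch of the kind of check and retry loop described above. It is not the AzureADAuthenticator code: the class name TokenFetchRetrySketch, the helpers causedByUnknownHost and callWithRetry, and the attempt and delay parameters are all hypothetical, chosen only for illustration. The idea is to walk the cause chain of the caught IOException and treat an UnknownHostException anywhere in it as retryable, with a bounded exponential backoff.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical illustration only; not part of Hadoop.
final class TokenFetchRetrySketch {

  private TokenFetchRetrySketch() {
  }

  /** True if t, or any exception in its cause chain, is an UnknownHostException. */
  static boolean causedByUnknownHost(Throwable t) {
    int depth = 0;
    // The depth limit guards against a cyclic cause chain.
    while (t != null && depth++ < 16) {
      if (t instanceof UnknownHostException) {
        return true;
      }
      t = t.getCause();
    }
    return false;
  }

  /**
   * Runs the call, retrying only IOExceptions caused by an
   * UnknownHostException, up to maxAttempts attempts in total.
   * The delay doubles after each failed attempt.
   */
  static <T> T callWithRetry(Callable<T> call, int maxAttempts, long baseDelayMs)
      throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        return call.call();
      } catch (IOException e) {
        if (attempt >= maxAttempts || !causedByUnknownHost(e)) {
          throw e; // not a DNS failure, or attempts exhausted
        }
        Thread.sleep(baseDelayMs * (1L << (attempt - 1)));
      }
    }
  }
}
```

In this sketch a token fetch would be wrapped as callWithRetry(() -> fetchToken(), 3, 500L), where fetchToken stands in for whatever performs the HTTP call. Any other IOException is rethrown immediately, so the existing handling of non-DNS failures would be unaffected.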
