[ https://issues.apache.org/jira/browse/HADOOP-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Serhii Nesterov updated HADOOP-19620:
-------------------------------------
    Component/s:     (was: auth)

AzureADAuthenticator should be able to retry on UnknownHostException
--------------------------------------------------------------------

                Key: HADOOP-19620
                URL: https://issues.apache.org/jira/browse/HADOOP-19620
            Project: Hadoop Common
         Issue Type: Improvement
   Affects Versions: 3.4.1
           Reporter: Serhii Nesterov
           Priority: Minor

When Hadoop is asked to perform operations against ADLS Gen2 storage, AbfsRestOperation first attempts to obtain an access token from Microsoft. Under the hood, it uses a simple java.net.HttpURLConnection HTTP client.

Occasionally, environments run into intermittent network issues, including the DNS-related UnknownHostException. Technically, the HTTP client throws an IOException whose cause is an UnknownHostException. AzureADAuthenticator in turn catches the IOException, sets httperror = -1, and then checks whether the error is recoverable and can be retried. However, the exception is neither an instance of MalformedURLException, nor an instance of FileNotFoundException, nor a recoverable status code (< 100 || == 408 || (>= 500 && != 501 && != 505)), so a retry never occurs. This is a serious issue for our project because it causes problems with state recovery.

The final exception stack trace on the client side looks as follows (Apache Spark application, tenant ID redacted):
{code:java}
Job aborted due to stage failure: Task 14 in stage 384.0 failed 4 times, most recent failure: Lost task 14.3 in stage 384.0 (TID 3087) (10.244.91.7 executor 29): Status code: -1 error code: null error message: Auth failure: HTTP Error -1; url='https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token' AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException: login.microsoftonline.com
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:321)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:263)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:235)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation(IOStatisticsBinding.java:494)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:465)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:233)
	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus(AbfsClient.java:1099)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:1164)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:766)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:756)
	at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:211)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1(ParquetFileFormat.scala:210)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:213)
	...
{code}

I can see that this exception is recovered from in other parts of the Hadoop project (e.g., DefaultAMSProcessor). We would like to have a similar retry mechanism for fetching tokens.
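A minimal sketch of the requested behaviour, assuming the token fetch can be wrapped in a `java.util.concurrent.Callable`: treat an IOException as transient when its cause chain contains a `java.net.UnknownHostException`, and retry it with exponential backoff. `TokenRetrySketch`, `causedByUnknownHost`, `callWithRetry`, `maxRetries` and `initialBackoffMs` are hypothetical names chosen for illustration; this is not AzureADAuthenticator's actual code, API, or configuration, and a real patch would more likely feed such a check into the existing token-fetch retry policy.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

/**
 * Hypothetical sketch, NOT Hadoop's AzureADAuthenticator code: retries a
 * token-fetch call when the failure was caused by a DNS lookup failure.
 */
public class TokenRetrySketch {

    // Upper bound on cause-chain depth, in case of a pathological cycle.
    private static final int MAX_CAUSE_DEPTH = 32;

    /** True if t, or any throwable in its cause chain, is an UnknownHostException. */
    static boolean causedByUnknownHost(Throwable t) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < MAX_CAUSE_DEPTH; cur = cur.getCause(), depth++) {
            if (cur instanceof UnknownHostException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Invokes call, retrying up to maxRetries extra times with exponential
     * backoff, but only for IOExceptions caused by UnknownHostException.
     * Other IOExceptions, and the final failed attempt, are rethrown; exceptions
     * that are not IOExceptions are not caught and propagate unchanged.
     */
    static <T> T callWithRetry(Callable<T> call, int maxRetries, long initialBackoffMs)
            throws Exception {
        long backoffMs = initialBackoffMs;
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                if (attempt >= maxRetries || !causedByUnknownHost(e)) {
                    throw e;
                }
                Thread.sleep(backoffMs);
                backoffMs *= 2; // double the wait before the next attempt
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String token = callWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                // Simulate the reported failure: an IOException wrapping UnknownHostException.
                throw new IOException("token request failed",
                        new UnknownHostException("login.microsoftonline.com"));
            }
            return "fake-token";
        }, 4, 10);
        System.out.println(token + " after " + calls[0] + " attempts");
    }
}
```

Running `main` prints `fake-token after 3 attempts`. The cause chain is walked rather than checking only the top-level exception because, as described above, the DNS failure can surface as an IOException whose cause is the UnknownHostException.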
Note also that AbfsRestOperation already handles and retries UnknownHostException, but that handling seems to apply only to storage communication, not to token retrieval.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)