[ 
https://issues.apache.org/jira/browse/HADOOP-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Serhii Nesterov updated HADOOP-19620:
-------------------------------------
    Description: 
When Hadoop performs operations against ADLS Gen2 storage, 
*AbfsRestOperation* attempts to obtain an access token from Microsoft. 
Under the hood, it uses a simple *java.net.HttpURLConnection* HTTP client.

Occasionally, environments run into intermittent network issues, including 
DNS-related {*}UnknownHostException{*}. Technically, the HTTP client throws an 
*IOException* whose cause is {*}UnknownHostException{*}. *AzureADAuthenticator* 
in turn catches the {*}IOException{*}, sets *httperror = -1* and then checks 
whether the error is recoverable and can be retried. However, it is neither an 
instance of {*}MalformedURLException{*}, nor an instance of 
{*}FileNotFoundException{*}, nor a recoverable status code ({*}< 100 || == 408 
|| >= 500 && != 501 && != 505{*}), hence a retry never occurs. This matters 
for our project because it causes problems with state recovery.

The final exception stack trace on the client side looks as follows (Apache 
Spark application, tenant ID is redacted):
{code:java}
Job aborted due to stage failure: Task 14 in stage 384.0 failed 4 times, most 
recent failure: Lost task 14.3 in stage 384.0 TID 3087 10.244.91.7 executor 29 
: Status code: -1 error code: null error message: Auth failure: HTTP Error -1; 
url='https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token' 
AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException: 
login.microsoftonline.com
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation AbfsRestOperation.java:321
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute AbfsRestOperation.java:263
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0 AbfsRestOperation.java:235
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation IOStatisticsBinding.java:494
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation IOStatisticsBinding.java:465
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute AbfsRestOperation.java:233
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus AbfsClient.java:1099
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus AzureBlobFileSystemStore.java:1164
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:766
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:756
at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath HadoopInputFile.java:39
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter ParquetFooterReader.java:39
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1 ParquetFileFormat.scala:211
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1 ParquetFileFormat.scala:210
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2 ParquetFileFormat.scala:213
...{code}
I can see that this exception is recovered from in other parts of the Hadoop 
project (e.g., {*}DefaultAMSProcessor{*}).

We would like to have a similar retry mechanism for fetching tokens. Moreover, 
*AbfsRestOperation* already handles and retries *UnknownHostException*, but that 
part seems to apply only to storage communication, not to token retrieval. 
I suppose the solution would be simple: check whether the cause of the 
*IOException* is an instance of *UnknownHostException* and, if so, apply the 
same retry policies as for other types of recoverable errors.
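
To illustrate the idea, here is a minimal standalone sketch. It is not the actual *AzureADAuthenticator* code: the class name *TokenFetchRetrySketch*, the helper names *causedByUnknownHost* and *fetchWithRetry*, and the linear backoff are assumptions made only for this example. It walks the cause chain of the caught exception and retries only when an *UnknownHostException* is found.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical illustration only; none of these names exist in hadoop-azure.
final class TokenFetchRetrySketch {
  private TokenFetchRetrySketch() {}

  /** True if t, or any exception in its cause chain, is an UnknownHostException. */
  static boolean causedByUnknownHost(Throwable t) {
    int depth = 0;
    // Bound the walk so a cyclic cause chain cannot loop forever.
    for (Throwable c = t; c != null && depth < 16; c = c.getCause(), depth++) {
      if (c instanceof UnknownHostException) {
        return true;
      }
    }
    return false;
  }

  /**
   * Calls fetch up to maxAttempts times, retrying only on IOExceptions
   * that are caused by an UnknownHostException. Any other failure, or
   * the failure of the last attempt, is rethrown unchanged.
   */
  static <T> T fetchWithRetry(Callable<T> fetch, int maxAttempts, long backoffMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        return fetch.call();
      } catch (IOException e) {
        if (!causedByUnknownHost(e) || attempt >= maxAttempts) {
          throw e;
        }
        // Simple linear backoff; a real fix would reuse the existing retry policy.
        Thread.sleep(backoffMillis * attempt);
      }
    }
  }
}
```

In the real fix the check would presumably feed into the existing recoverability decision in *AzureADAuthenticator* rather than a separate retry loop, so that the configured token fetch retry policy still governs the number of attempts and the backoff.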

The link to the code where I believe *UnknownHostException* should be checked 
for:

[https://github.com/apache/hadoop/blob/61096793f6368d16a21cde8b1c8f8dce41a4c102/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/oauth2/AzureADAuthenticator.java#L354]

  was:
When Hadoop performs operations against ADLS Gen2 storage, 
*AbfsRestOperation* attempts to obtain an access token from Microsoft. 
Under the hood, it uses a simple *java.net.HttpURLConnection* HTTP client.

Occasionally, environments run into intermittent network issues, including 
DNS-related {*}UnknownHostException{*}. Technically, the HTTP client throws an 
*IOException* whose cause is {*}UnknownHostException{*}. *AzureADAuthenticator* 
in turn catches the {*}IOException{*}, sets *httperror = -1* and then checks 
whether the error is recoverable and can be retried. However, it is neither an 
instance of {*}MalformedURLException{*}, nor an instance of 
{*}FileNotFoundException{*}, nor a recoverable status code ({*}< 100 || == 408 
|| >= 500 && != 501 && != 505{*}), hence a retry never occurs. This matters 
for our project because it causes problems with state recovery.

The final exception stack trace on the client side looks as follows (Apache 
Spark application, tenant ID is redacted):
{code:java}
Job aborted due to stage failure: Task 14 in stage 384.0 failed 4 times, most 
recent failure: Lost task 14.3 in stage 384.0 TID 3087 10.244.91.7 executor 29 
: Status code: -1 error code: null error message: Auth failure: HTTP Error -1; 
url='https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token' 
AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException: 
login.microsoftonline.com
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation AbfsRestOperation.java:321
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute AbfsRestOperation.java:263
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0 AbfsRestOperation.java:235
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation IOStatisticsBinding.java:494
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation IOStatisticsBinding.java:465
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute AbfsRestOperation.java:233
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus AbfsClient.java:1099
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus AzureBlobFileSystemStore.java:1164
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:766
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus AzureBlobFileSystem.java:756
at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath HadoopInputFile.java:39
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter ParquetFooterReader.java:39
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1 ParquetFileFormat.scala:211
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1 ParquetFileFormat.scala:210
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2 ParquetFileFormat.scala:213
...{code}
I can see this exception is recovered in other parts of the Hadoop project 
(e.g., {*}DefaultAMSProcessor{*})

We would like to have a similar retry mechanism for fetching tokens. Moreover, 
*AbfsRestOperation* already handles and retries *UnknownHostException*, but that 
part seems to apply only to storage communication, not to token retrieval. 
I suppose the solution would be simple: check whether the cause of the 
*IOException* is an instance of *UnknownHostException* and, if so, apply the 
same retry policies as for other types of recoverable errors.

The link to the code where I believe UnknownHostException should be checked for:

https://github.com/apache/hadoop/blob/61096793f6368d16a21cde8b1c8f8dce41a4c102/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/oauth2/AzureADAuthenticator.java#L354


> [ABFS] AzureADAuthenticator should be able to retry on UnknownHostException
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-19620
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19620
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/azure
>    Affects Versions: 3.4.1
>            Reporter: Serhii Nesterov
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
