(spark) branch master updated: [SPARK-47113][CORE] Revert S3A endpoint fixup logic of SPARK-35878

dongjoon Tue, 20 Feb 2024 23:13:03 -0800

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new b3e34629080b [SPARK-47113][CORE] Revert S3A endpoint fixup logic of SPARK-35878
b3e34629080b is described below

commit b3e34629080bfbbc0615bb16a961b9298c5d4756
Author: Steve Loughran <ste...@cloudera.com>
AuthorDate: Tue Feb 20 23:12:11 2024 -0800

    [SPARK-47113][CORE] Revert S3A endpoint fixup logic of SPARK-35878
    
    ### What changes were proposed in this pull request?
    
    Revert [SPARK-35878][CORE] Add fs.s3a.endpoint if unset and fs.s3a.endpoint.region is null
    
    Removing the region/endpoint patching code of SPARK-35878 avoids authentication problems with versions of the S3A connector built with the AWS v2 SDK, as is the case in Hadoop 3.4.0.
    
    That is: if fs.s3a.endpoint is unset it will stay unset.
    
    The v2 SDK binds to AWS services differently, in what can be described as "region first" binding. Spark setting the endpoint blocks S3 Express support and is incompatible with HADOOP-18975 (S3A: Add option fs.s3a.endpoint.fips to use AWS FIPS endpoints).
    
    - https://github.com/apache/hadoop/pull/6277
    
    The change is compatible with all releases of the S3A connector other than Hadoop 3.3.1 binaries deployed outside EC2 without the endpoint explicitly set.
    
    ### Why are the changes needed?
    
    The AWS v2 SDK has a different, more complex binding mechanism; it doesn't need the endpoint to be set if the region (fs.s3a.endpoint.region) is set. This means the Spark code to fix up the endpoint is not only unneeded, it also causes problems when trying to use specific storage options (S3 Express) or security options (FIPS).
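    
    As a rough illustration of "region first" binding (not part of this patch), a job can declare only the bucket's region and leave fs.s3a.endpoint unset. The object name and the region value are placeholder assumptions; the spark.hadoop. prefix is the one Spark already uses to copy options into the Hadoop configuration.
    
    ```scala
    import org.apache.spark.SparkConf

    // Hypothetical example object, not taken from the Spark code base.
    object RegionFirstExample {
      // Sketch: set only the region and leave fs.s3a.endpoint unset.
      // "eu-west-1" is a placeholder; use the region of your own bucket.
      def regionOnlyConf(): SparkConf =
        new SparkConf()
          .set("spark.hadoop.fs.s3a.endpoint.region", "eu-west-1")
    }
    ```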
    
    ### Does this PR introduce _any_ user-facing change?
    
    Only visible with the Hadoop 3.3.1 S3A connector when deployed outside of EC2, the situation the original patch was added to work around. All other 3.3.x releases are unaffected.
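    
    For deployments still in that situation, one possible workaround (a sketch, not something this patch adds) is to set the endpoint explicitly in the job configuration. The object name is hypothetical; the endpoint value is the one the removed Spark code used to apply.
    
    ```scala
    import org.apache.spark.SparkConf

    // Hypothetical example object, not taken from the Spark code base.
    object ExplicitEndpointExample {
      // Sketch for Hadoop 3.3.1 outside EC2: set the endpoint by hand,
      // using the same value the removed fixup code applied.
      def explicitEndpointConf(): SparkConf =
        new SparkConf()
          .set("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
    }
    ```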
    
    ### How was this patch tested?
    
    Removed some obsolete tests. Relying on GitHub and Jenkins to do the testing, so marking this PR as WiP until they are happy.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes #45193 from dongjoon-hyun/SPARK-47113.
    
    Authored-by: Steve Loughran <ste...@cloudera.com>
    Signed-off-by: Dongjoon Hyun <dh...@apple.com>
---
 .../org/apache/spark/deploy/SparkHadoopUtil.scala  | 10 -------
 .../apache/spark/deploy/SparkHadoopUtilSuite.scala | 33 ----------------------
 2 files changed, 43 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
index 628b688dedba..2edd80db2637 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
@@ -529,16 +529,6 @@ private[spark] object SparkHadoopUtil extends Logging {
     if (conf.getOption("spark.hadoop.fs.s3a.downgrade.syncable.exceptions").isEmpty) {
       hadoopConf.set("fs.s3a.downgrade.syncable.exceptions", "true", setBySpark)
     }
-    // In Hadoop 3.3.1, AWS region handling with the default "" endpoint only works
-    // in EC2 deployments or when the AWS CLI is installed.
-    // The workaround is to set the name of the S3 endpoint explicitly,
-    // if not already set. See HADOOP-17771.
-    if (hadoopConf.get("fs.s3a.endpoint", "").isEmpty &&
-      hadoopConf.get("fs.s3a.endpoint.region") == null) {
-      // set to US central endpoint which can also connect to buckets
-      // in other regions at the expense of a HEAD request during fs creation
-      hadoopConf.set("fs.s3a.endpoint", "s3.amazonaws.com", setBySpark)
-    }
   }
 
   private def appendSparkHiveConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
diff --git a/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala b/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala
index 2326d10d4164..9a81cb947257 100644
--- a/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala
+++ b/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala
@@ -39,19 +39,6 @@ class SparkHadoopUtilSuite extends SparkFunSuite {
     assertConfigMatches(hadoopConf, "orc.filterPushdown", "true", SOURCE_SPARK_HADOOP)
     assertConfigMatches(hadoopConf, "fs.s3a.downgrade.syncable.exceptions", "true",
       SET_TO_DEFAULT_VALUES)
-    assertConfigMatches(hadoopConf, "fs.s3a.endpoint", "s3.amazonaws.com", SET_TO_DEFAULT_VALUES)
-  }
-
-  /**
-   * An empty S3A endpoint will be overridden just as a null value
-   * would.
-   */
-  test("appendSparkHadoopConfigs with S3A endpoint set to empty string") {
-    val sc = new SparkConf()
-    val hadoopConf = new Configuration(false)
-    sc.set("spark.hadoop.fs.s3a.endpoint", "")
-    new SparkHadoopUtil().appendSparkHadoopConfigs(sc, hadoopConf)
-    assertConfigMatches(hadoopConf, "fs.s3a.endpoint", "s3.amazonaws.com", SET_TO_DEFAULT_VALUES)
   }
 
   /**
@@ -61,28 +48,8 @@ class SparkHadoopUtilSuite extends SparkFunSuite {
     val sc = new SparkConf()
     val hadoopConf = new Configuration(false)
     sc.set("spark.hadoop.fs.s3a.downgrade.syncable.exceptions", "false")
-    sc.set("spark.hadoop.fs.s3a.endpoint", "s3-eu-west-1.amazonaws.com")
     new SparkHadoopUtil().appendSparkHadoopConfigs(sc, hadoopConf)
     assertConfigValue(hadoopConf, "fs.s3a.downgrade.syncable.exceptions", "false")
-    assertConfigValue(hadoopConf, "fs.s3a.endpoint",
-      "s3-eu-west-1.amazonaws.com")
-  }
-
-  /**
-   * If the endpoint region is set (even to a blank string) in
-   * "spark.hadoop.fs.s3a.endpoint.region" then the endpoint is not set,
-   * even when the s3a endpoint is "".
-   * This supports a feature in hadoop 3.3.1 where this configuration
-   * pair triggers a revert to the "SDK to work out the region" algorithm,
-   * which works on EC2 deployments.
-   */
-  test("appendSparkHadoopConfigs with S3A endpoint region set to an empty string") {
-    val sc = new SparkConf()
-    val hadoopConf = new Configuration(false)
-    sc.set("spark.hadoop.fs.s3a.endpoint.region", "")
-    new SparkHadoopUtil().appendSparkHadoopConfigs(sc, hadoopConf)
-    // the endpoint value will not have been set
-    assertConfigValue(hadoopConf, "fs.s3a.endpoint", null)
   }
 
   /**


