tuanit03 opened a new issue, #3640:
URL: https://github.com/apache/polaris/issues/3640

   ### Describe the bug
   
   I am using Polaris to connect to an **internal S3** service. The internal S3 does **not** provide credential vending / STS. I have tried to configure Polaris to connect to the internal S3 without STS, but with my current setup Polaris still tries to reach the internet and resolve the internal S3 domain through external DNS.
   
   Below are the exact configurations I used (Docker Compose, Spark image 
Dockerfile, catalog payload and Spark configuration).
   
   ---
   
   ## Docker Compose (relevant parts)
   
   ```yaml
   services:
     postgres:
       image: postgres:17
       container_name: polaris-postgres
       ports:
         - "5433:5432"
       shm_size: 128mb
       environment:
         POSTGRES_USER: postgres
         POSTGRES_PASSWORD: postgres
         POSTGRES_DB: polaris
         POSTGRES_INITDB_ARGS: "--encoding UTF8 --data-checksums"
       volumes:
         - type: bind
           source: ./pg-config/postgresql.conf
           target: /etc/postgresql/postgresql.conf
       command:
         - "postgres"
         - "-c"
         - "config_file=/etc/postgresql/postgresql.conf"
       healthcheck:
         test: "pg_isready -U postgres"
         interval: 5s
         timeout: 2s
         retries: 15
       networks:
         - polaris_net
   
     polaris-bootstrap:
       image: apache/polaris-admin-tool:latest
       container_name: polaris-bootstrap
       depends_on:
         postgres:
           condition: service_healthy
       environment:
         POLARIS_PERSISTENCE_TYPE: relational-jdbc
         QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://postgres:5432/polaris
         QUARKUS_DATASOURCE_USERNAME: postgres
         QUARKUS_DATASOURCE_PASSWORD: postgres
       command:
         - "bootstrap"
         - "--realm=polaris"
         - "--credential=polaris,root,s3cr3t"
       networks:
         - polaris_net
   
     polaris:
       image: apache/polaris:latest
       container_name: polaris-container
       ports:
         - "8181:8181"   # API
         - "8182:8182"   # management (metrics/health)
         - "5005:5005"   # optional JVM debugger
       environment:
         # Skip credential subscoping indirection
          polaris.features."SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION": "true"
          polaris.features.realm-overrides."polaris"."SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION": "true"
   
         # Java debug
         JAVA_DEBUG: true
         JAVA_DEBUG_PORT: "*:5005"
   
         POLARIS_PERSISTENCE_TYPE: relational-jdbc
         POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_RETRIES: 5
         POLARIS_PERSISTENCE_RELATIONAL_JDBC_INITIAL_DELAY_IN_MS: 100
         POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_DURATION_IN_MS: 5000
         QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://postgres:5432/polaris
         QUARKUS_DATASOURCE_USERNAME: postgres
         QUARKUS_DATASOURCE_PASSWORD: postgres
         POLARIS_REALM_CONTEXT_REALMS: polaris
         QUARKUS_OTEL_SDK_DISABLED: true
         POLARIS_BOOTSTRAP_CREDENTIALS: polaris,root,s3cr3t
         polaris.features."ALLOW_INSECURE_STORAGE_TYPES": "false"
          polaris.features."SUPPORTED_CATALOG_STORAGE_TYPES": "[\"S3\",\"GCS\",\"AZURE\"]"
         polaris.readiness.ignore-severe-issues: "true"
   
         # S3 Configuration
         polaris.features."ALLOW_SETTING_S3_ENDPOINTS": "true"
         polaris.storage.aws.access-key: "abcd"
         polaris.storage.aws.secret-key: "abcd"
         AWS_REGION: "us-east-1"
         AWS_ACCESS_KEY_ID: "abcd"
         AWS_SECRET_ACCESS_KEY: "abcd"
          AWS_ENDPOINT_URL: "http://s3-it-hn.company.net"
   
         # Proxy / no-proxy
         # HTTP_PROXY: http://proxy.company.vn:80
         # HTTPS_PROXY: http://proxy.company.vn:80
          JAVA_OPTS_APPEND: '-Dhttp.nonProxyHosts="localhost|127.0.0.1|s3-it-hn.company.net|*.s3-it-hn.company.net" -Dhttps.nonProxyHosts="localhost|127.0.0.1|s3-it-hn.company.net|*.s3-it-hn.company.net"'
          NO_PROXY: localhost,127.0.0.1,polaris-postgres,polaris-container,.s3-it-hn.company.net,s3-it-hn.company.net
       depends_on:
         polaris-bootstrap:
           condition: service_completed_successfully
       healthcheck:
          test: ["CMD", "curl", "http://localhost:8182/q/health"]
         interval: 2s
         timeout: 10s
         retries: 10
         start_period: 10s
       networks:
         - polaris_net
   
     polaris-setup:
       image: alpine/curl
       container_name: polaris-setup
       depends_on:
         polaris:
           condition: service_healthy
       environment:
         HTTP_PROXY: http://proxy.company.vn:80
         HTTPS_PROXY: http://proxy.company.vn:80
          NO_PROXY: localhost,127.0.0.1,polaris-postgres,polaris-container,.s3-it-hn.company.net,s3-it-hn.company.net
         STORAGE_LOCATION: "s3://it-demo-data/testxxx/"
       volumes:
         - ./setup-polaris:/polaris
        entrypoint: '/bin/sh -c "chmod +x /polaris/create-catalog.sh && /polaris/create-catalog.sh"'
       networks:
         - polaris_net
   
     spark-jupyter:
       build: .
       container_name: polaris-spark-jupyter
       depends_on:
         polaris-setup:
           condition: service_completed_successfully
       ports:
         - "8888:8888"
         - "4040:4040"
       volumes:
         - ./notebooks:/home/jovyan/work
       healthcheck:
         test: "curl localhost:8888"
         interval: 5s
         retries: 15
       networks:
         - polaris_net
   
     trino:
       image: trinodb/trino:latest
       container_name: polaris-trino
       depends_on:
         polaris-setup:
           condition: service_completed_successfully
       stdin_open: true
       tty: true
       ports:
         - "8081:8080"
       volumes:
         - ./trino-config/:/etc/trino/catalog
       networks:
         - polaris_net
   
   networks:
     polaris_net:
       external: true
       name: open-metadata_app_net
   ```
   
   ---
   
   ## Dockerfile (Spark image)
   
   ```dockerfile
   FROM apache/spark:3.5.8-java17-python3
   
   USER root
   
   ENV HTTP_PROXY=http://proxy.company.vn:80/
   ENV HTTPS_PROXY=http://proxy.company.vn:80/
   ENV NO_PROXY=localhost,127.0.0.1,polaris-postgres,polaris-container
   
   RUN pip install --no-cache-dir \
       jupyterlab \
       notebook \
       pyspark==3.5.8 \
       && mkdir -p /home/jovyan/work
   
    RUN mkdir -p /opt/spark/jars && \
        wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
            https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.10.0/iceberg-spark-runtime-3.5_2.12-1.10.0.jar \
            -O /opt/spark/jars/iceberg-spark-runtime-3.5_2.12-1.10.0.jar && \
        wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
            https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.10.0/iceberg-aws-bundle-1.10.0.jar \
            -O /opt/spark/jars/iceberg-aws-bundle-1.10.0.jar && \
        wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
            https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar \
            -O /opt/spark/jars/hadoop-aws-3.3.4.jar && \
        wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
            https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.797/aws-java-sdk-bundle-1.12.797.jar \
            -O /opt/spark/jars/aws-java-sdk-bundle-1.12.797.jar
   
    # Reset the proxy environment variables so later layers and the runtime see them empty
   ENV HTTP_PROXY=""
   ENV HTTPS_PROXY=""
   ENV NO_PROXY=""
   
   ENV JUPYTER_TOKEN=polaris
   ENV JUPYTER_ENABLE_LAB=yes
   RUN useradd -m -s /bin/bash jovyan && chown -R jovyan /home/jovyan
   USER jovyan
   WORKDIR /home/jovyan/work
   
   EXPOSE 8888 4040
   
    CMD jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token=$JUPYTER_TOKEN
   ```
   
   ---
   
   ## Catalog creation payload (`create-catalog.sh` / jq payload)
   
   ```bash
   PAYLOAD=$(jq -n \
     --arg name "quickstart_catalog" \
     --arg defaultBase "$STORAGE_LOCATION" \
     '{
       catalog: {
         name: $name,
         type: "INTERNAL",
         properties: {
            "default-base-location": "s3://it-demo-data/testxxx/",
            "s3.endpoint": "http://s3-it-hn.company.net",
            "s3.path-style-access": "true",
            "s3.access-key-id": "abcd",
            "s3.secret-access-key": "abcd",
            "s3.region": "us-east-1",
            "s3.sts-unavailable": "true"
          },
          storageConfigInfo: {
            storageType: "S3",
            stsUnavailable: true,
            pathStyleAccess: true,
            endpoint: "http://s3-it-hn.company.net",
            endpointInternal: "http://s3-it-hn.company.net"
          }
       }
     }'
   )
   ```
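For completeness, here is a plain-Python stand-in for the jq invocation above that I use to sanity-check the JSON shape before POSTing it (the values mirror the payload in this report; this is not an official schema):

```python
import json

# Build the same catalog-creation payload as the jq command above.
payload = {
    "catalog": {
        "name": "quickstart_catalog",
        "type": "INTERNAL",
        "properties": {
            "default-base-location": "s3://it-demo-data/testxxx/",
            "s3.endpoint": "http://s3-it-hn.company.net",
            "s3.path-style-access": "true",
            "s3.access-key-id": "abcd",
            "s3.secret-access-key": "abcd",
            "s3.region": "us-east-1",
            "s3.sts-unavailable": "true",
        },
        "storageConfigInfo": {
            "storageType": "S3",
            "stsUnavailable": True,
            "pathStyleAccess": True,
            "endpoint": "http://s3-it-hn.company.net",
            "endpointInternal": "http://s3-it-hn.company.net",
        },
    }
}
# Serialized body, ready to POST to the Polaris management API.
body = json.dumps(payload, indent=2)
```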
   
   ---
   
   ## Spark configuration (Python snippet)
   
   ```python
    import os
    import sys

    from pyspark.sql import SparkSession

    S3_HOST = "s3-it-hn.company.net"

    current_python = sys.executable
    os.environ["PYSPARK_PYTHON"] = current_python
    os.environ["PYSPARK_DRIVER_PYTHON"] = current_python
   
   os.environ["NO_PROXY"] = S3_HOST
   os.environ["no_proxy"] = S3_HOST
   
   # ---- JVM proxy settings (driver + executor) ----
    proxy_host = "proxy.company.vn"  # JVM proxyHost properties take a bare hostname, no scheme
   proxy_port = "80"
   
   java_gc_opts = (
       " -XX:+UseG1GC"
       " -XX:InitiatingHeapOccupancyPercent=50"
       " -XX:OnOutOfMemoryError='kill -9 %p'"
   )
   
    jvm_opts = (
        f"-Dhttp.proxyHost={proxy_host} -Dhttp.proxyPort={proxy_port} "
        f"-Dhttps.proxyHost={proxy_host} -Dhttps.proxyPort={proxy_port} "
        f"-Dhttp.nonProxyHosts={S3_HOST} -Dhttps.nonProxyHosts={S3_HOST} "
        "-Djava.net.useSystemProxies=false "
        "-Daws.java.v1.disableDeprecationAnnouncement=true"
        f"{java_gc_opts}"
    )
   
   spark = (
       SparkSession.builder
       .appName("SparkIceberg")
       .master("local[12]")
       .config("spark.driver.memory", "12g")
       .config("spark.executor.memory", "12g")
       .config("spark.driver.memoryOverhead", "1024m")
       # JARs and JVM options
       .config("spark.driver.extraJavaOptions", jvm_opts)
       .config("spark.executor.extraJavaOptions", jvm_opts)
   
       # Iceberg Catalog
        .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
        .config("spark.sql.iceberg.vectorization.enabled", "false")
        .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.polaris.type", "rest")
        .config("spark.sql.catalog.polaris.uri", "http://polaris:8181/api/catalog")
        .config("spark.sql.catalog.polaris.token-refresh-enabled", "false")
        .config("spark.sql.catalog.polaris.credential", f"{polaris_client_id}:{polaris_client_secret}")
        .config("spark.sql.catalog.polaris.warehouse", catalog_name)
        .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
   
       # S3A settings
       .config("spark.hadoop.fs.s3a.endpoint", S3_HOST)
       .config("spark.hadoop.fs.s3a.access.key", S3_ACCESS_KEY)
       .config("spark.hadoop.fs.s3a.secret.key", S3_SECRET_KEY)
       .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
       .getOrCreate()
   )
   print("Spark Session created successfully!")
   ```
   
   
   
   
   ### To Reproduce
   
   When I run these commands:
   
   ```python
   spark.sql("USE polaris")
   spark.sql("SHOW NAMESPACES").show()
   
   spark.sql("CREATE NAMESPACE IF NOT EXISTS COLLADO_TEST")
   spark.sql("CREATE NAMESPACE IF NOT EXISTS COLLADO_TEST.PUBLIC")
   spark.sql("SHOW NAMESPACES IN COLLADO_TEST").show()
   
   spark.sql("USE NAMESPACE COLLADO_TEST.PUBLIC")
   spark.sql("""CREATE TABLE IF NOT EXISTS TEST_TABLE (
       id bigint NOT NULL COMMENT 'unique id',
       data string)
   USING iceberg;
   """)
   ```
   
   With this Spark configuration I can connect directly to S3 without problems. However, when I go through Polaris, creating the namespace succeeds, but creating a table in that namespace fails with:
   
   ```
   Py4JJavaError                             Traceback (most recent call last)
   Cell In[5], line 2
         1 spark.sql("USE NAMESPACE COLLADO_TEST.PUBLIC")
   ----> 2 spark.sql("""CREATE TABLE IF NOT EXISTS TEST_TABLE (
         3     id bigint NOT NULL COMMENT 'unique id',
         4     data string)
         5 USING iceberg;
         6 """)
    : org.apache.iceberg.exceptions.ServiceFailureException: Server error: SdkClientException: Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.
    Caused by: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: it-demo-data.s3-it-hn.company.net: Name or service not known (SDK Attempt Count: 6)
   ```
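The failing hostname `it-demo-data.s3-it-hn.company.net` is the bucket name prepended as a DNS label to the endpoint host, i.e. virtual-hosted-style addressing; with path-style access the bucket stays in the URL path and only `s3-it-hn.company.net` itself needs to resolve. A minimal sketch of the two URL forms (this is an illustration, not actual AWS SDK code; the real SDK also factors in regions, dualstack, etc.):

```python
from urllib.parse import urlparse

def s3_request_url(endpoint: str, bucket: str, key: str, path_style: bool) -> str:
    """Sketch of the URL an S3 client builds for a request on bucket/key."""
    u = urlparse(endpoint)
    if path_style:
        # Path-style: bucket stays in the path; only the endpoint host must resolve.
        return f"{u.scheme}://{u.netloc}/{bucket}/{key}"
    # Virtual-hosted style: bucket becomes an extra DNS label on the endpoint host.
    return f"{u.scheme}://{bucket}.{u.netloc}/{key}"

endpoint = "http://s3-it-hn.company.net"
s3_request_url(endpoint, "it-demo-data", "testxxx/t", path_style=True)
# → "http://s3-it-hn.company.net/it-demo-data/testxxx/t"
s3_request_url(endpoint, "it-demo-data", "testxxx/t", path_style=False)
# → "http://it-demo-data.s3-it-hn.company.net/testxxx/t"  (the host in the stack trace)
```

The bucket-prefixed hostname in the error suggests the server-side S3 client in Polaris is not applying path-style access, so that synthetic hostname never resolves on the internal network.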
   
   When I uncomment the proxy section (HTTP_PROXY, HTTPS_PROXY) in 
`polaris-container`, I also encounter a similar error:
   ```yaml
         # HTTP_PROXY: http://proxy.company.vn:80
         # HTTPS_PROXY: http://proxy.company.vn:80
   ```
   
   
   ### Actual Behavior
   
   Despite the configuration above, Polaris (or a component in the stack) still attempts to reach the internet and resolve the internal S3 domain through external DNS. This fails because the internal S3 host is only resolvable and reachable from inside our internal network, and the setup should not require STS.
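To narrow down where resolution fails, I run a quick DNS check from inside the Polaris container (via `docker exec`); a small sketch, assuming a Python interpreter is available in the container (otherwise `getent hosts <name>` works too):

```python
import socket

def can_resolve(host: str) -> bool:
    """Return True if this process's configured DNS can resolve `host`."""
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False

can_resolve("s3-it-hn.company.net")                # the configured endpoint host
can_resolve("it-demo-data.s3-it-hn.company.net")   # the bucket-prefixed host from the error
```

If the first resolves inside the container but the second does not, the problem is the virtual-hosted hostname rather than general network access.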
   
   ### Expected Behavior
   
   Polaris should connect to the internal S3 endpoint 
`http://s3-it-hn.company.net` directly using the provided static access/secret 
keys and `s3.sts-unavailable = true`, without attempting to resolve or contact 
anything on the public internet (or using an STS / credential vending service).
   
   `NO_PROXY` / `no_proxy` and the JVM `-Dhttp.nonProxyHosts` / `-Dhttps.nonProxyHosts` options are set wherever applicable to exclude the internal S3 host from the corporate proxy.
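For reference, `http.nonProxyHosts` is a `|`-separated pattern list where `*` acts as a wildcard. A simplified Python model of that matching (the JVM's actual rules differ in details such as case handling, so treat this as an approximation):

```python
import fnmatch

def is_non_proxy_host(host: str, non_proxy_hosts: str) -> bool:
    """Approximate model of JVM -Dhttp.nonProxyHosts matching:
    patterns are '|'-separated and '*' is a wildcard."""
    return any(fnmatch.fnmatch(host, pat) for pat in non_proxy_hosts.split("|"))

patterns = "localhost|127.0.0.1|s3-it-hn.company.net|*.s3-it-hn.company.net"
is_non_proxy_host("s3-it-hn.company.net", patterns)              # True
is_non_proxy_host("it-demo-data.s3-it-hn.company.net", patterns) # True, matches *.s3-it-hn.company.net
is_non_proxy_host("repo1.maven.org", patterns)                   # False
```

Note that the bucket-prefixed virtual-hosted host also matches `*.s3-it-hn.company.net`, so the proxy exclusion above should already cover it; the failure therefore looks like DNS resolution rather than proxy routing.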
   
   ### Additional context
   
   Is there any documentation that covers this scenario? I have read through all the related issues but have not been able to resolve the problem.
   
   ### System information
   
   Docker / Docker Compose

