tuanit03 opened a new issue, #3640:
URL: https://github.com/apache/polaris/issues/3640
### Describe the bug
I am using Polaris to connect to an **internal S3** service. This internal S3 does **not** provide credential vending / STS. I tried to configure Polaris so that it connects to the internal S3 without STS, but with the current setup Polaris still tries to go out to the internet and resolves the internal S3 domain via external DNS.
Below are the exact configurations I used (Docker Compose, Spark image Dockerfile, catalog-creation payload, and Spark configuration).
---
## Docker Compose (relevant parts)
```yaml
services:
  postgres:
    image: postgres:17
    container_name: polaris-postgres
    ports:
      - "5433:5432"
    shm_size: 128mb
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: polaris
      POSTGRES_INITDB_ARGS: "--encoding UTF8 --data-checksums"
    volumes:
      - type: bind
        source: ./pg-config/postgresql.conf
        target: /etc/postgresql/postgresql.conf
    command:
      - "postgres"
      - "-c"
      - "config_file=/etc/postgresql/postgresql.conf"
    healthcheck:
      test: "pg_isready -U postgres"
      interval: 5s
      timeout: 2s
      retries: 15
    networks:
      - polaris_net

  polaris-bootstrap:
    image: apache/polaris-admin-tool:latest
    container_name: polaris-bootstrap
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      POLARIS_PERSISTENCE_TYPE: relational-jdbc
      QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://postgres:5432/polaris
      QUARKUS_DATASOURCE_USERNAME: postgres
      QUARKUS_DATASOURCE_PASSWORD: postgres
    command:
      - "bootstrap"
      - "--realm=polaris"
      - "--credential=polaris,root,s3cr3t"
    networks:
      - polaris_net

  polaris:
    image: apache/polaris:latest
    container_name: polaris-container
    ports:
      - "8181:8181"  # API
      - "8182:8182"  # management (metrics/health)
      - "5005:5005"  # optional JVM debugger
    environment:
      # Skip credential subscoping indirection
      polaris.features."SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION": "true"
      polaris.features.realm-overrides."polaris"."SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION": "true"
      # Java debug
      JAVA_DEBUG: true
      JAVA_DEBUG_PORT: "*:5005"
      POLARIS_PERSISTENCE_TYPE: relational-jdbc
      POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_RETRIES: 5
      POLARIS_PERSISTENCE_RELATIONAL_JDBC_INITIAL_DELAY_IN_MS: 100
      POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_DURATION_IN_MS: 5000
      QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://postgres:5432/polaris
      QUARKUS_DATASOURCE_USERNAME: postgres
      QUARKUS_DATASOURCE_PASSWORD: postgres
      POLARIS_REALM_CONTEXT_REALMS: polaris
      QUARKUS_OTEL_SDK_DISABLED: true
      POLARIS_BOOTSTRAP_CREDENTIALS: polaris,root,s3cr3t
      polaris.features."ALLOW_INSECURE_STORAGE_TYPES": "false"
      polaris.features."SUPPORTED_CATALOG_STORAGE_TYPES": "[\"S3\",\"GCS\",\"AZURE\"]"
      polaris.readiness.ignore-severe-issues: "true"
      # S3 configuration
      polaris.features."ALLOW_SETTING_S3_ENDPOINTS": "true"
      polaris.storage.aws.access-key: "abcd"
      polaris.storage.aws.secret-key: "abcd"
      AWS_REGION: "us-east-1"
      AWS_ACCESS_KEY_ID: "abcd"
      AWS_SECRET_ACCESS_KEY: "abcd"
      AWS_ENDPOINT_URL: "http://s3-it-hn.company.net"
      # Proxy / no-proxy
      # HTTP_PROXY: http://proxy.company.vn:80
      # HTTPS_PROXY: http://proxy.company.vn:80
      JAVA_OPTS_APPEND: '-Dhttp.nonProxyHosts="localhost|127.0.0.1|s3-it-hn.company.net|*.s3-it-hn.company.net" -Dhttps.nonProxyHosts="localhost|127.0.0.1|s3-it-hn.company.net|*.s3-it-hn.company.net"'
      NO_PROXY: localhost,127.0.0.1,polaris-postgres,polaris-container,.s3-it-hn.company.net,s3-it-hn.company.net
    depends_on:
      polaris-bootstrap:
        condition: service_completed_successfully
    healthcheck:
      test: ["CMD", "curl", "http://localhost:8182/q/health"]
      interval: 2s
      timeout: 10s
      retries: 10
      start_period: 10s
    networks:
      - polaris_net

  polaris-setup:
    image: alpine/curl
    container_name: polaris-setup
    depends_on:
      polaris:
        condition: service_healthy
    environment:
      HTTP_PROXY: http://proxy.company.vn:80
      HTTPS_PROXY: http://proxy.company.vn:80
      NO_PROXY: localhost,127.0.0.1,polaris-postgres,polaris-container,.s3-it-hn.company.net,s3-it-hn.company.net
      STORAGE_LOCATION: "s3://it-demo-data/testxxx/"
    volumes:
      - ./setup-polaris:/polaris
    entrypoint: '/bin/sh -c "chmod +x /polaris/create-catalog.sh && /polaris/create-catalog.sh"'
    networks:
      - polaris_net

  spark-jupyter:
    build: .
    container_name: polaris-spark-jupyter
    depends_on:
      polaris-setup:
        condition: service_completed_successfully
    ports:
      - "8888:8888"
      - "4040:4040"
    volumes:
      - ./notebooks:/home/jovyan/work
    healthcheck:
      test: "curl localhost:8888"
      interval: 5s
      retries: 15
    networks:
      - polaris_net

  trino:
    image: trinodb/trino:latest
    container_name: polaris-trino
    depends_on:
      polaris-setup:
        condition: service_completed_successfully
    stdin_open: true
    tty: true
    ports:
      - "8081:8080"
    volumes:
      - ./trino-config/:/etc/trino/catalog
    networks:
      - polaris_net

networks:
  polaris_net:
    external: true
    name: open-metadata_app_net
```
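
As a connectivity sanity check, here is a minimal sketch (assuming Python 3 is available, e.g. in the Jupyter container; the Polaris image itself may not ship Python) that checks whether the internal S3 host resolves inside the compose network and is reachable without going through the proxy:

```python
# Minimal connectivity check for the internal S3 endpoint, bypassing any proxy
# configured via environment variables. Host name taken from the compose file above.
import socket
import urllib.request

S3_HOST = "s3-it-hn.company.net"

# 1) DNS: can this container resolve the internal hostname at all?
try:
    print("resolved:", socket.getaddrinfo(S3_HOST, 80))
except socket.gaierror as exc:
    print("DNS resolution failed:", exc)

# 2) HTTP: hit the endpoint directly, explicitly ignoring HTTP(S)_PROXY.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
try:
    resp = opener.open(f"http://{S3_HOST}/", timeout=5)
    print("endpoint reachable, HTTP status:", resp.status)
except Exception as exc:
    print("endpoint not reachable:", exc)
```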
---
## Dockerfile (Spark image)
```dockerfile
FROM apache/spark:3.5.8-java17-python3
USER root
ENV HTTP_PROXY=http://proxy.company.vn:80/
ENV HTTPS_PROXY=http://proxy.company.vn:80/
ENV NO_PROXY=localhost,127.0.0.1,polaris-postgres,polaris-container
RUN pip install --no-cache-dir \
        jupyterlab \
        notebook \
        pyspark==3.5.8 \
    && mkdir -p /home/jovyan/work

RUN mkdir -p /opt/spark/jars && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
      https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.10.0/iceberg-spark-runtime-3.5_2.12-1.10.0.jar \
      -O /opt/spark/jars/iceberg-spark-runtime-3.5_2.12-1.10.0.jar && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
      https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.10.0/iceberg-aws-bundle-1.10.0.jar \
      -O /opt/spark/jars/iceberg-aws-bundle-1.10.0.jar && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
      https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar \
      -O /opt/spark/jars/hadoop-aws-3.3.4.jar && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
      https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.797/aws-java-sdk-bundle-1.12.797.jar \
      -O /opt/spark/jars/aws-java-sdk-bundle-1.12.797.jar

# Clear proxy environment variables so they won't be present in later layers / at runtime
ENV HTTP_PROXY=""
ENV HTTPS_PROXY=""
ENV NO_PROXY=""
ENV JUPYTER_TOKEN=polaris
ENV JUPYTER_ENABLE_LAB=yes
RUN useradd -m -s /bin/bash jovyan && chown -R jovyan /home/jovyan
USER jovyan
WORKDIR /home/jovyan/work
EXPOSE 8888 4040
CMD jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token=$JUPYTER_TOKEN
```
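
One thing worth noting here: `ENV HTTP_PROXY=""` leaves the proxy variables *defined* as empty strings rather than unset, and HTTP clients can treat the two cases differently. A small sketch for inspecting what the notebook/Spark runtime actually inherits:

```python
# Print the proxy-related environment as the notebook/Spark JVM will see it.
# Variables cleared with `ENV HTTP_PROXY=""` show up as '' rather than None.
import os

for var in ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY",
            "http_proxy", "https_proxy", "no_proxy"):
    print(f"{var}={os.environ.get(var)!r}")
```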
---
## Catalog creation payload (`create-catalog.sh` / jq payload)
```bash
PAYLOAD=$(jq -n \
  --arg name "quickstart_catalog" \
  --arg defaultBase "$STORAGE_LOCATION" \
  '{
    catalog: {
      name: $name,
      type: "INTERNAL",
      properties: {
        "default-base-location": "s3://it-demo-data/testxxx/",
        "s3.endpoint": "http://s3-it-hn.company.net",
        "s3.path-style-access": "true",
        "s3.access-key-id": "abcd",
        "s3.secret-access-key": "abcd",
        "s3.region": "us-east-1",
        "s3.sts-unavailable": "true"
      },
      storageConfigInfo: {
        storageType: "S3",
        stsUnavailable: true,
        pathStyleAccess: true,
        endpoint: "http://s3-it-hn.company.net",
        endpointInternal: "http://s3-it-hn.company.net"
      }
    }
  }'
)
```
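
For reference, a Python sketch of the token and catalog-creation calls that this payload is POSTed with (endpoint paths as in the Polaris quickstart; this is a standard-library sketch, not a verbatim copy of `create-catalog.sh`):

```python
# Sketch: obtain a bearer token with the bootstrap credentials from the compose
# file, then POST the catalog payload to the Polaris management API.
# Run from inside the compose network so the "polaris" hostname resolves.
import json
import urllib.parse
import urllib.request

POLARIS = "http://polaris:8181"

token_req = urllib.request.Request(
    f"{POLARIS}/api/catalog/v1/oauth/tokens",
    data=urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": "root",
        "client_secret": "s3cr3t",
        "scope": "PRINCIPAL_ROLE:ALL",
    }).encode(),
)
token = json.load(urllib.request.urlopen(token_req))["access_token"]

# Same JSON as the jq payload above.
payload = {
    "catalog": {
        "name": "quickstart_catalog",
        "type": "INTERNAL",
        "properties": {
            "default-base-location": "s3://it-demo-data/testxxx/",
            "s3.endpoint": "http://s3-it-hn.company.net",
            "s3.path-style-access": "true",
            "s3.access-key-id": "abcd",
            "s3.secret-access-key": "abcd",
            "s3.region": "us-east-1",
            "s3.sts-unavailable": "true",
        },
        "storageConfigInfo": {
            "storageType": "S3",
            "stsUnavailable": True,
            "pathStyleAccess": True,
            "endpoint": "http://s3-it-hn.company.net",
            "endpointInternal": "http://s3-it-hn.company.net",
        },
    }
}

create_req = urllib.request.Request(
    f"{POLARIS}/api/management/v1/catalogs",
    data=json.dumps(payload).encode(),
    headers={"Authorization": f"Bearer {token}",
             "Content-Type": "application/json"},
)
print("create catalog HTTP status:", urllib.request.urlopen(create_req).status)
```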
---
## Spark configuration (Python snippet)
```python
import os
import sys

from pyspark.sql import SparkSession

# polaris_client_id, polaris_client_secret, catalog_name, S3_ACCESS_KEY and
# S3_SECRET_KEY are defined earlier in the notebook.
S3_HOST = "s3-it-hn.company.net"

current_python = sys.executable
os.environ["PYSPARK_PYTHON"] = current_python
os.environ["PYSPARK_DRIVER_PYTHON"] = current_python
os.environ["NO_PROXY"] = S3_HOST
os.environ["no_proxy"] = S3_HOST

# ---- JVM proxy settings (driver + executor) ----
proxy_host = "http://proxy.company.vn"
proxy_port = "80"

java_gc_opts = (
    " -XX:+UseG1GC"
    " -XX:InitiatingHeapOccupancyPercent=50"
    " -XX:OnOutOfMemoryError='kill -9 %p'"
)

jvm_opts = (
    f"-Dhttp.proxyHost={proxy_host} -Dhttp.proxyPort={proxy_port} "
    f"-Dhttps.proxyHost={proxy_host} -Dhttps.proxyPort={proxy_port} "
    f"-Dhttp.nonProxyHosts={S3_HOST} -Dhttps.nonProxyHosts={S3_HOST} "
    "-Djava.net.useSystemProxies=false "
    "-Daws.java.v1.disableDeprecationAnnouncement=true"
    f"{java_gc_opts}"
)

spark = (
    SparkSession.builder
    .appName("SparkIceberg")
    .master("local[12]")
    .config("spark.driver.memory", "12g")
    .config("spark.executor.memory", "12g")
    .config("spark.driver.memoryOverhead", "1024m")
    # JARs and JVM options
    .config("spark.driver.extraJavaOptions", jvm_opts)
    .config("spark.executor.extraJavaOptions", jvm_opts)
    # Iceberg catalog
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.iceberg.vectorization.enabled", "false")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.uri", "http://polaris:8181/api/catalog")
    .config("spark.sql.catalog.polaris.token-refresh-enabled", "false")
    .config("spark.sql.catalog.polaris.credential", f"{polaris_client_id}:{polaris_client_secret}")
    .config("spark.sql.catalog.polaris.warehouse", catalog_name)
    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
    # S3A settings
    .config("spark.hadoop.fs.s3a.endpoint", S3_HOST)
    .config("spark.hadoop.fs.s3a.access.key", S3_ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", S3_SECRET_KEY)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

print("Spark Session created successfully!")
```
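
Note that the REST catalog typically resolves `s3://` paths to Iceberg's S3FileIO rather than S3A, so the `fs.s3a.*` settings above may not apply to the catalog's own I/O. Below is a sketch of a variant that also sets the endpoint, path-style flag, and static keys on the catalog itself (reusing the variables from the snippet above; the property names come from the Iceberg AWS module, and since the failure reported below is a server-side error inside Polaris, client-side settings alone may not resolve it):

```python
# Variant: configure the REST catalog's own FileIO (Iceberg S3FileIO) for the
# internal, non-STS endpoint instead of relying only on fs.s3a.* settings.
# Reuses S3_HOST, catalog_name, credential, and key variables defined above.
spark = (
    SparkSession.builder
    .appName("SparkIcebergS3FileIO")
    .master("local[12]")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "http://polaris:8181/api/catalog")
    .config("spark.sql.catalog.polaris.credential",
            f"{polaris_client_id}:{polaris_client_secret}")
    .config("spark.sql.catalog.polaris.warehouse", catalog_name)
    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
    # S3FileIO settings for the internal endpoint (Iceberg AWS module properties)
    .config("spark.sql.catalog.polaris.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.polaris.s3.endpoint", f"http://{S3_HOST}")
    .config("spark.sql.catalog.polaris.s3.path-style-access", "true")
    .config("spark.sql.catalog.polaris.s3.access-key-id", S3_ACCESS_KEY)
    .config("spark.sql.catalog.polaris.s3.secret-access-key", S3_SECRET_KEY)
    .config("spark.sql.catalog.polaris.client.region", "us-east-1")
    .getOrCreate()
)
```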
### To Reproduce
When I run these commands:
```python
spark.sql("USE polaris")
spark.sql("SHOW NAMESPACES").show()
spark.sql("CREATE NAMESPACE IF NOT EXISTS COLLADO_TEST")
spark.sql("CREATE NAMESPACE IF NOT EXISTS COLLADO_TEST.PUBLIC")
spark.sql("SHOW NAMESPACES IN COLLADO_TEST").show()
spark.sql("USE NAMESPACE COLLADO_TEST.PUBLIC")
spark.sql("""CREATE TABLE IF NOT EXISTS TEST_TABLE (
id bigint NOT NULL COMMENT 'unique id',
data string)
USING iceberg;
""")
```
With this Spark configuration I can connect to S3 directly without any problem. However, when I go through Polaris, creating the namespaces succeeds, but creating a table in the namespace fails with:
```
Py4JJavaError Traceback (most recent call last)
Cell In[5], line 2
1 spark.sql("USE NAMESPACE COLLADO_TEST.PUBLIC")
----> 2 spark.sql("""CREATE TABLE IF NOT EXISTS TEST_TABLE (
3 id bigint NOT NULL COMMENT 'unique id',
4 data string)
5 USING iceberg;
6 """)
: org.apache.iceberg.exceptions.ServiceFailureException: Server error:
SdkClientException: Received an UnknownHostException when attempting to
interact with a service. See cause for the exact endpoint that is failing to
resolve. If this is happening on an endpoint that previously worked, there may
be a network connectivity issue or your DNS cache could be storing endpoints
for too long.
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Unable
to execute HTTP request: it-demo-data.s3-it-hn.company.net: Name or service not
known (SDK Attempt Count: 6)
```
When I uncomment the proxy settings (HTTP_PROXY, HTTPS_PROXY) for the `polaris` service, I run into a similar error:
```yaml
# HTTP_PROXY: http://proxy.company.vn:80
# HTTPS_PROXY: http://proxy.company.vn:80
```
### Actual Behavior
Despite the configuration above, Polaris (or some component in the stack) still attempts to reach the internet and to resolve the internal S3 domain via external DNS. The failing lookup is for `it-demo-data.s3-it-hn.company.net`, i.e. the bucket name prepended to the endpoint host (virtual-hosted-style addressing), even though path-style access is configured. This fails because the internal S3 host is only resolvable/accessible from within our internal network, and no STS should be required.
### Expected Behavior
Polaris should connect to the internal S3 endpoint `http://s3-it-hn.company.net` directly, using the provided static access/secret keys and `s3.sts-unavailable = true`, without attempting to resolve or contact anything on the public internet and without using an STS / credential-vending service. `NO_PROXY` / `no_proxy` and the JVM `-Dhttp.nonProxyHosts` / `-Dhttps.nonProxyHosts` options are set wherever applicable to exclude the internal S3 host from the corporate proxy.
### Additional context
Is there any documentation covering this scenario? I have read through the related issues but have not been able to resolve the problem.
### System information
Docker / Docker Compose