Copilot commented on code in PR #15034:
URL: https://github.com/apache/iceberg/pull/15034#discussion_r2681812049
##########
site/docs/spark-quickstart.md:
##########
@@ -72,41 +94,76 @@ services:
- AWS_REGION=us-east-1
- CATALOG_WAREHOUSE=s3://warehouse/
- CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
- - CATALOG_S3_ENDPOINT=http://minio:9000
- minio:
- image: minio/minio
- container_name: minio
+ - CATALOG_S3_ENDPOINT=http://s3.ozone:9878
+
+ ozone-datanode:
+ <<: *ozone-image
+ ports:
+ - 9864
+ command: ["ozone", "datanode"]
environment:
- - MINIO_ROOT_USER=admin
- - MINIO_ROOT_PASSWORD=password
- - MINIO_DOMAIN=minio
+ <<: *ozone-common-config
Review Comment:
The ozone-datanode service is missing a depends_on clause to ensure proper
startup order. The datanode requires both ozone-scm (Storage Container Manager)
and ozone-om (Object Manager) to be running before it can start successfully.
Without this dependency, the datanode may fail to start or experience startup
issues.
```suggestion
<<: *ozone-common-config
depends_on:
- ozone-scm
- ozone-om
```
##########
site/docs/spark-quickstart.md:
##########
@@ -72,41 +94,76 @@ services:
- AWS_REGION=us-east-1
- CATALOG_WAREHOUSE=s3://warehouse/
- CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
- - CATALOG_S3_ENDPOINT=http://minio:9000
- minio:
- image: minio/minio
- container_name: minio
+ - CATALOG_S3_ENDPOINT=http://s3.ozone:9878
+
+ ozone-datanode:
+ <<: *ozone-image
+ ports:
+ - 9864
+ command: ["ozone", "datanode"]
environment:
- - MINIO_ROOT_USER=admin
- - MINIO_ROOT_PASSWORD=password
- - MINIO_DOMAIN=minio
+ <<: *ozone-common-config
networks:
iceberg_net:
- aliases:
- - warehouse.minio
+
+ ozone-om:
+ <<: *ozone-image
ports:
- - 9001:9001
- - 9000:9000
- command: ["server", "/data", "--console-address", ":9001"]
- mc:
- depends_on:
- - minio
- image: minio/mc
- container_name: mc
+ - 9874:9874
+ environment:
+ <<: *ozone-common-config
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
+ ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
+ WAITFOR: ozone-scm:9876
+ command: ["ozone", "om"]
networks:
iceberg_net:
+
+ ozone-scm:
+ <<: *ozone-image
+ ports:
+ - 9876:9876
environment:
- - AWS_ACCESS_KEY_ID=admin
- - AWS_SECRET_ACCESS_KEY=password
- - AWS_REGION=us-east-1
- entrypoint: |
- /bin/sh -c "
- until (/usr/bin/mc alias set minio http://minio:9000 admin password) do
echo '...waiting...' && sleep 1; done;
- /usr/bin/mc rm -r --force minio/warehouse;
- /usr/bin/mc mb minio/warehouse;
- /usr/bin/mc policy set public minio/warehouse;
- tail -f /dev/null
- "
+ <<: *ozone-common-config
+ ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
+ command: ["ozone", "scm"]
+ networks:
+ iceberg_net:
+
+ ozone-recon:
+ <<: *ozone-image
+ ports:
+ - 9888:9888
+ environment:
+ <<: *ozone-common-config
Review Comment:
The ozone-recon service should have a depends_on clause to ensure the
ozone-scm and ozone-om services are available before it starts. Recon is the
monitoring and management service for Ozone and needs the other core services
to be operational to function properly.
```suggestion
<<: *ozone-common-config
depends_on:
- ozone-scm
- ozone-om
```
##########
site/docs/spark-quickstart.md:
##########
@@ -72,41 +94,76 @@ services:
- AWS_REGION=us-east-1
- CATALOG_WAREHOUSE=s3://warehouse/
- CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
- - CATALOG_S3_ENDPOINT=http://minio:9000
- minio:
- image: minio/minio
- container_name: minio
+ - CATALOG_S3_ENDPOINT=http://s3.ozone:9878
+
+ ozone-datanode:
+ <<: *ozone-image
+ ports:
+ - 9864
+ command: ["ozone", "datanode"]
environment:
- - MINIO_ROOT_USER=admin
- - MINIO_ROOT_PASSWORD=password
- - MINIO_DOMAIN=minio
+ <<: *ozone-common-config
networks:
iceberg_net:
- aliases:
- - warehouse.minio
+
+ ozone-om:
+ <<: *ozone-image
ports:
- - 9001:9001
- - 9000:9000
- command: ["server", "/data", "--console-address", ":9001"]
- mc:
- depends_on:
- - minio
- image: minio/mc
- container_name: mc
+ - 9874:9874
+ environment:
+ <<: *ozone-common-config
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
+ ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
+ WAITFOR: ozone-scm:9876
Review Comment:
The WAITFOR configuration on line 118 specifies 'ozone-scm:9876' to wait for
the SCM service, but there's no corresponding WAITFOR configuration in the
ozone-scm service to wait for its own initialization. The ozone-scm service is
the first in the dependency chain but doesn't have any startup coordination.
Consider verifying that the apache/ozone image handles the
ENSURE_SCM_INITIALIZED check properly without needing an explicit WAITFOR, or
document this dependency ordering requirement.
##########
site/docs/spark-quickstart.md:
##########
@@ -72,41 +94,76 @@ services:
- AWS_REGION=us-east-1
- CATALOG_WAREHOUSE=s3://warehouse/
- CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
- - CATALOG_S3_ENDPOINT=http://minio:9000
- minio:
- image: minio/minio
- container_name: minio
+ - CATALOG_S3_ENDPOINT=http://s3.ozone:9878
+
+ ozone-datanode:
+ <<: *ozone-image
+ ports:
+ - 9864
+ command: ["ozone", "datanode"]
environment:
- - MINIO_ROOT_USER=admin
- - MINIO_ROOT_PASSWORD=password
- - MINIO_DOMAIN=minio
+ <<: *ozone-common-config
networks:
iceberg_net:
- aliases:
- - warehouse.minio
+
+ ozone-om:
+ <<: *ozone-image
ports:
- - 9001:9001
- - 9000:9000
- command: ["server", "/data", "--console-address", ":9001"]
- mc:
- depends_on:
- - minio
- image: minio/mc
- container_name: mc
+ - 9874:9874
+ environment:
+ <<: *ozone-common-config
Review Comment:
The proxy user configuration allows any host and any group to proxy as the
hadoop user, which is overly permissive for a production environment. While
this may be acceptable for a quickstart/demo environment, consider adding a
comment in the documentation to warn users not to use these settings in
production, or at minimum document that these are demo-only settings for ease
of use.
```suggestion
<<: *ozone-common-config
# WARNING: The following proxyuser settings are for quickstart/demo
use only.
# Do not use this configuration as-is in production; restrict hosts
and groups appropriately.
```
##########
site/docs/spark-quickstart.md:
##########
@@ -72,41 +94,76 @@ services:
- AWS_REGION=us-east-1
- CATALOG_WAREHOUSE=s3://warehouse/
- CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
- - CATALOG_S3_ENDPOINT=http://minio:9000
- minio:
- image: minio/minio
- container_name: minio
+ - CATALOG_S3_ENDPOINT=http://s3.ozone:9878
+
+ ozone-datanode:
+ <<: *ozone-image
+ ports:
+ - 9864
+ command: ["ozone", "datanode"]
environment:
- - MINIO_ROOT_USER=admin
- - MINIO_ROOT_PASSWORD=password
- - MINIO_DOMAIN=minio
+ <<: *ozone-common-config
networks:
iceberg_net:
- aliases:
- - warehouse.minio
+
+ ozone-om:
+ <<: *ozone-image
ports:
- - 9001:9001
- - 9000:9000
- command: ["server", "/data", "--console-address", ":9001"]
- mc:
- depends_on:
- - minio
- image: minio/mc
- container_name: mc
+ - 9874:9874
+ environment:
+ <<: *ozone-common-config
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
+ ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
+ WAITFOR: ozone-scm:9876
+ command: ["ozone", "om"]
networks:
iceberg_net:
+
+ ozone-scm:
+ <<: *ozone-image
+ ports:
+ - 9876:9876
environment:
- - AWS_ACCESS_KEY_ID=admin
- - AWS_SECRET_ACCESS_KEY=password
- - AWS_REGION=us-east-1
- entrypoint: |
- /bin/sh -c "
- until (/usr/bin/mc alias set minio http://minio:9000 admin password) do
echo '...waiting...' && sleep 1; done;
- /usr/bin/mc rm -r --force minio/warehouse;
- /usr/bin/mc mb minio/warehouse;
- /usr/bin/mc policy set public minio/warehouse;
- tail -f /dev/null
- "
+ <<: *ozone-common-config
+ ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
+ command: ["ozone", "scm"]
+ networks:
+ iceberg_net:
+
+ ozone-recon:
+ <<: *ozone-image
+ ports:
+ - 9888:9888
+ environment:
+ <<: *ozone-common-config
+ command: ["ozone", "recon"]
+ networks:
+ iceberg_net:
+
+ ozone-s3g:
+ <<: *ozone-image
+ ports:
+ - 9878:9878
+ environment:
+ <<: *ozone-common-config
Review Comment:
The Ozone S3 Gateway (ozone-s3g) configuration is missing AWS credential
environment variables, while the spark-iceberg and rest services are configured
with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. For the S3 API to work
properly with authentication, the Ozone S3 Gateway needs to be configured to
either accept these credentials or operate in anonymous mode. Without proper
credential configuration, there may be authentication mismatches when Spark
tries to access the S3 endpoint with the provided credentials.
```suggestion
<<: *ozone-common-config
- AWS_ACCESS_KEY_ID=admin
- AWS_SECRET_ACCESS_KEY=password
- AWS_REGION=us-east-1
```
##########
site/docs/spark-quickstart.md:
##########
@@ -72,41 +94,76 @@ services:
- AWS_REGION=us-east-1
- CATALOG_WAREHOUSE=s3://warehouse/
- CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
- - CATALOG_S3_ENDPOINT=http://minio:9000
- minio:
- image: minio/minio
- container_name: minio
+ - CATALOG_S3_ENDPOINT=http://s3.ozone:9878
+
+ ozone-datanode:
+ <<: *ozone-image
+ ports:
+ - 9864
+ command: ["ozone", "datanode"]
environment:
- - MINIO_ROOT_USER=admin
- - MINIO_ROOT_PASSWORD=password
- - MINIO_DOMAIN=minio
+ <<: *ozone-common-config
networks:
iceberg_net:
- aliases:
- - warehouse.minio
+
+ ozone-om:
+ <<: *ozone-image
ports:
- - 9001:9001
- - 9000:9000
- command: ["server", "/data", "--console-address", ":9001"]
- mc:
- depends_on:
- - minio
- image: minio/mc
- container_name: mc
+ - 9874:9874
+ environment:
+ <<: *ozone-common-config
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
+ ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
+ WAITFOR: ozone-scm:9876
+ command: ["ozone", "om"]
networks:
iceberg_net:
+
+ ozone-scm:
+ <<: *ozone-image
+ ports:
+ - 9876:9876
environment:
- - AWS_ACCESS_KEY_ID=admin
- - AWS_SECRET_ACCESS_KEY=password
- - AWS_REGION=us-east-1
- entrypoint: |
- /bin/sh -c "
- until (/usr/bin/mc alias set minio http://minio:9000 admin password) do
echo '...waiting...' && sleep 1; done;
- /usr/bin/mc rm -r --force minio/warehouse;
- /usr/bin/mc mb minio/warehouse;
- /usr/bin/mc policy set public minio/warehouse;
- tail -f /dev/null
- "
+ <<: *ozone-common-config
+ ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
+ command: ["ozone", "scm"]
+ networks:
+ iceberg_net:
+
+ ozone-recon:
+ <<: *ozone-image
+ ports:
+ - 9888:9888
+ environment:
+ <<: *ozone-common-config
+ command: ["ozone", "recon"]
+ networks:
+ iceberg_net:
+
+ ozone-s3g:
+ <<: *ozone-image
+ ports:
+ - 9878:9878
+ environment:
+ <<: *ozone-common-config
+ command:
+ - sh
+ - -c
+ - |
+ set -e
+ ozone s3g &
+ s3g_pid=$$!
+ until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' &&
sleep 1; done;
+ ozone sh bucket delete /s3v/warehouse || true
+ ozone sh bucket create /s3v/warehouse
+ wait "$$s3g_pid"
+ networks:
+ iceberg_net:
+ aliases:
+ - s3.ozone
+ - warehouse.s3.ozone
Review Comment:
The ozone-s3g service is missing a depends_on clause to ensure the
prerequisite services are started first. The S3 Gateway requires ozone-scm,
ozone-om, and ozone-datanode to be available before it can successfully handle
S3 requests. While the shell script waits for ozone-om to respond, the
depends_on clause is still needed to prevent the container from starting
prematurely and ensure proper orchestration.
##########
site/docs/spark-quickstart.md:
##########
@@ -72,41 +94,76 @@ services:
- AWS_REGION=us-east-1
- CATALOG_WAREHOUSE=s3://warehouse/
- CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
- - CATALOG_S3_ENDPOINT=http://minio:9000
- minio:
- image: minio/minio
- container_name: minio
+ - CATALOG_S3_ENDPOINT=http://s3.ozone:9878
+
+ ozone-datanode:
+ <<: *ozone-image
+ ports:
+ - 9864
+ command: ["ozone", "datanode"]
environment:
- - MINIO_ROOT_USER=admin
- - MINIO_ROOT_PASSWORD=password
- - MINIO_DOMAIN=minio
+ <<: *ozone-common-config
networks:
iceberg_net:
- aliases:
- - warehouse.minio
+
+ ozone-om:
+ <<: *ozone-image
ports:
- - 9001:9001
- - 9000:9000
- command: ["server", "/data", "--console-address", ":9001"]
- mc:
- depends_on:
- - minio
- image: minio/mc
- container_name: mc
+ - 9874:9874
+ environment:
+ <<: *ozone-common-config
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
+ ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
+ WAITFOR: ozone-scm:9876
+ command: ["ozone", "om"]
networks:
iceberg_net:
+
+ ozone-scm:
+ <<: *ozone-image
+ ports:
+ - 9876:9876
environment:
- - AWS_ACCESS_KEY_ID=admin
- - AWS_SECRET_ACCESS_KEY=password
- - AWS_REGION=us-east-1
- entrypoint: |
- /bin/sh -c "
- until (/usr/bin/mc alias set minio http://minio:9000 admin password) do
echo '...waiting...' && sleep 1; done;
- /usr/bin/mc rm -r --force minio/warehouse;
- /usr/bin/mc mb minio/warehouse;
- /usr/bin/mc policy set public minio/warehouse;
- tail -f /dev/null
- "
+ <<: *ozone-common-config
+ ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
+ command: ["ozone", "scm"]
+ networks:
+ iceberg_net:
+
+ ozone-recon:
+ <<: *ozone-image
+ ports:
+ - 9888:9888
+ environment:
+ <<: *ozone-common-config
+ command: ["ozone", "recon"]
+ networks:
+ iceberg_net:
Review Comment:
The Ozone services (ozone-datanode, ozone-om, ozone-scm, ozone-recon) store
data in directories like /data/hdds and /data/metadata, but these directories
are not mounted to host volumes. This means that when the containers are
stopped or removed, all data will be lost. While this might be acceptable for a
quickstart/demo environment, consider documenting this behavior or adding
optional volume mounts for users who want to persist data across container
restarts.
##########
site/docs/spark-quickstart.md:
##########
@@ -72,41 +94,76 @@ services:
- AWS_REGION=us-east-1
- CATALOG_WAREHOUSE=s3://warehouse/
- CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
- - CATALOG_S3_ENDPOINT=http://minio:9000
- minio:
- image: minio/minio
- container_name: minio
+ - CATALOG_S3_ENDPOINT=http://s3.ozone:9878
+
+ ozone-datanode:
+ <<: *ozone-image
+ ports:
+ - 9864
+ command: ["ozone", "datanode"]
environment:
- - MINIO_ROOT_USER=admin
- - MINIO_ROOT_PASSWORD=password
- - MINIO_DOMAIN=minio
+ <<: *ozone-common-config
networks:
iceberg_net:
- aliases:
- - warehouse.minio
+
+ ozone-om:
+ <<: *ozone-image
ports:
- - 9001:9001
- - 9000:9000
- command: ["server", "/data", "--console-address", ":9001"]
- mc:
- depends_on:
- - minio
- image: minio/mc
- container_name: mc
+ - 9874:9874
+ environment:
+ <<: *ozone-common-config
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
+ ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
+ WAITFOR: ozone-scm:9876
+ command: ["ozone", "om"]
networks:
iceberg_net:
+
+ ozone-scm:
+ <<: *ozone-image
+ ports:
+ - 9876:9876
environment:
- - AWS_ACCESS_KEY_ID=admin
- - AWS_SECRET_ACCESS_KEY=password
- - AWS_REGION=us-east-1
- entrypoint: |
- /bin/sh -c "
- until (/usr/bin/mc alias set minio http://minio:9000 admin password) do
echo '...waiting...' && sleep 1; done;
- /usr/bin/mc rm -r --force minio/warehouse;
- /usr/bin/mc mb minio/warehouse;
- /usr/bin/mc policy set public minio/warehouse;
- tail -f /dev/null
- "
+ <<: *ozone-common-config
+ ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
+ command: ["ozone", "scm"]
+ networks:
+ iceberg_net:
+
+ ozone-recon:
+ <<: *ozone-image
+ ports:
+ - 9888:9888
+ environment:
+ <<: *ozone-common-config
+ command: ["ozone", "recon"]
+ networks:
+ iceberg_net:
+
+ ozone-s3g:
+ <<: *ozone-image
+ ports:
+ - 9878:9878
+ environment:
+ <<: *ozone-common-config
+ command:
+ - sh
+ - -c
+ - |
+ set -e
+ ozone s3g &
+ s3g_pid=$$!
+ until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' &&
sleep 1; done;
+ ozone sh bucket delete /s3v/warehouse || true
+ ozone sh bucket create /s3v/warehouse
Review Comment:
The script starts the S3 gateway in the background and then performs
administrative operations. However, there's no check to ensure the S3 gateway
itself is fully operational before proceeding with bucket operations. The wait
only checks if ozone-om responds to 'volume list', but doesn't verify that the
S3 gateway is ready. This could lead to race conditions where bucket operations
might fail if the S3 gateway hasn't fully initialized. Consider adding a health
check or retry logic for the bucket operations.
```suggestion
max_retries=30
retry_delay=2
attempt=0
until ozone sh bucket create /s3v/warehouse >/dev/null 2>&1; do
attempt=$$((attempt + 1))
if [ "$$attempt" -ge "$$max_retries" ]; then
echo "Failed to create bucket /s3v/warehouse after $$max_retries
attempts"
exit 1
fi
echo '...waiting for S3 gateway to be ready for bucket
operations...'
sleep "$$retry_delay"
done
```
##########
site/docs/spark-quickstart.md:
##########
@@ -72,41 +94,76 @@ services:
- AWS_REGION=us-east-1
- CATALOG_WAREHOUSE=s3://warehouse/
- CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
- - CATALOG_S3_ENDPOINT=http://minio:9000
- minio:
- image: minio/minio
- container_name: minio
+ - CATALOG_S3_ENDPOINT=http://s3.ozone:9878
+
+ ozone-datanode:
+ <<: *ozone-image
+ ports:
+ - 9864
+ command: ["ozone", "datanode"]
environment:
- - MINIO_ROOT_USER=admin
- - MINIO_ROOT_PASSWORD=password
- - MINIO_DOMAIN=minio
+ <<: *ozone-common-config
networks:
iceberg_net:
- aliases:
- - warehouse.minio
+
+ ozone-om:
+ <<: *ozone-image
ports:
- - 9001:9001
- - 9000:9000
- command: ["server", "/data", "--console-address", ":9001"]
- mc:
- depends_on:
- - minio
- image: minio/mc
- container_name: mc
+ - 9874:9874
+ environment:
+ <<: *ozone-common-config
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
+ CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
+ ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
+ WAITFOR: ozone-scm:9876
+ command: ["ozone", "om"]
networks:
iceberg_net:
Review Comment:
The container name for the ozone-om service is not explicitly set, while
other services in the docker-compose file (like spark-iceberg and iceberg-rest)
have container_name specified. For consistency and easier debugging, consider
adding explicit container names for all Ozone services, such as
'container_name: ozone-om' for the ozone-om service.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]