mengw15 commented on PR #4273:
URL: https://github.com/apache/texera/pull/4273#issuecomment-4202897637

   > ## Review: bootstrap-lakekeeper.sh & parse-storage-config.py
   > Thanks for putting this together! The overall structure is solid — 
idempotent steps, good error handling, clear output. I have some suggestions to 
improve cross-platform portability and reduce dependencies.
   > 
   > ### 1. Remove `parse-storage-config.py` — use environment variables instead
   > The Python script + `pyhocon` dependency adds significant friction (not 
everyone has `pyhocon` installed, and it's a niche package). This can be 
eliminated entirely because `storage.conf` already supports env var overrides 
via `${?STORAGE_*}` syntax. The bootstrap script can read the **same env vars** 
with the **same defaults**:
   > 
   > ```shell
   > 
LAKEKEEPER_BASE_URI="${STORAGE_ICEBERG_CATALOG_REST_URI:-http://localhost:8181/catalog}";
   > WAREHOUSE_NAME="${STORAGE_ICEBERG_CATALOG_REST_WAREHOUSE_NAME:-texera}"
   > S3_REGION="${STORAGE_ICEBERG_CATALOG_REST_REGION:-us-west-2}"
   > S3_BUCKET="${STORAGE_ICEBERG_CATALOG_REST_S3_BUCKET:-texera-iceberg}"
   > S3_ENDPOINT="${STORAGE_S3_ENDPOINT:-http://localhost:9000}";
   > S3_USERNAME="${STORAGE_S3_AUTH_USERNAME:-texera_minio}"
   > S3_PASSWORD="${STORAGE_S3_AUTH_PASSWORD:-password}"
   > ```
   > 
   > This way, if a developer customizes their setup via env vars, both the 
Scala app and this script pick up the same values. No parsing needed.
   > 
   > ### 2. Wrong credential field names in warehouse creation API
   > Lines 389-390 use `aws-access-key-id` / `aws-secret-access-key`, but the 
Lakekeeper API expects `access-key-id` / `secret-access-key`. With the current 
field names, warehouse creation silently creates a warehouse without 
credentials, which causes auth failures later.
   > 
   > ### 3. Remove `awscli` dependency for MinIO bucket operations
   > Requiring `awscli` just to create a bucket is heavy. Alternatives:
   > 
   > * Use `curl` with S3 API directly (a simple `PUT` to 
`http://endpoint/bucket-name`)
   > * Use `mc` (MinIO Client) which is purpose-built and lightweight
   > * Or skip bucket creation entirely — Lakekeeper with `sts-enabled: true` 
and `flavor: "minio"` handles this
   > 
   > ### 4. Remove the separate `check_warehouse_exists` function
   > The `create_warehouse` function already handles HTTP 409 (already exists) 
as success. The separate check (lines 312-358) adds ~45 lines and an extra API 
call for no benefit. Just call create directly and handle the response.
   > 
   > ### 5. `flavor` and `sts-enabled` values
   > The script uses `"flavor": "s3-compat"` with `"sts-enabled": false`. In my 
testing, this combination causes a `FileDecompressionError` during warehouse 
creation when Lakekeeper tries to validate the S3 endpoint. Using `"flavor": 
"minio"` with `"sts-enabled": true` works correctly.
   > 
   > ### 6. Hardcoded binary path requires editing the script
   > `LAKEKEEPER_BINARY_PATH=""` at line 37 forces users to edit the script 
source. Consider:
   > 
   > * Accept it as a CLI argument or env var
   > * Default to `lakekeeper` (assuming it's on `$PATH`)
   > * Mention `brew install lakekeeper` in the setup instructions (works on 
macOS)
   > 
   > ### 7. Cross-platform considerations
   > * The script works on macOS and Linux (both have bash, curl)
   > * Windows users would need WSL2 — might be worth noting in the PR 
description
   > * Lakekeeper provides pre-built binaries for macOS ARM, Linux x86_64, and 
Linux ARM (no macOS Intel or Windows native)
   > 
   > ### Summary
   > The script could be simplified from ~525 lines to ~150 lines by:
   > 
   > 1. Dropping `parse-storage-config.py` (use env vars with defaults)
   > 2. Dropping `awscli` dependency (use curl or skip bucket creation)
   > 3. Removing the redundant warehouse-exists check
   > 4. Fixing the credential field names and flavor/sts settings
   > 
   > Happy to help with a simplified version if that would be useful!
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to