sandabot opened a new issue, #13064:
URL: https://github.com/apache/cloudstack/issues/13064

   ### problem
   
   After a fresh system VM (ConsoleProxy / SecondaryStorageVm) boots, its agent 
cannot complete the TLS handshake with the management server on port 8250. The 
console proxy never binds its public HTTP/HTTPS listener (browser gets 
`ERR_CONNECTION_REFUSED` on the public IP), the SSVM never registers as a host, 
and ACS keeps destroying and re-creating the system VMs — each replacement 
exhibits the same failure.
   
   Root cause: on the VMware SSH dispatch path, `SetupCertificateCommand` 
invokes `scripts/util/keystore-cert-import` **without** positional arguments 
`$7` (CACERT_FILE path) and `$8` (CACERT content). With both empty, the script 
falls through to the `elif [ ! -f "$CACERT_FILE" ]` branch and calls a bare 
`exit`, which returns the status of the preceding `echo`, i.e. **exit code 0**. 
The management server receives `SetupCertificateAnswer: result=true` and 
believes the push succeeded. In reality the `awk ... "$CACERT_FILE"` split and 
the subsequent `keytool -import -trustcacerts` loop never run, so the CA 
certificate is never added to `cloud.jks` as a `trustedCertEntry`.
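
   The exit-status behaviour can be reproduced in isolation. The sketch below is a reduced, hypothetical copy of the CA branch (the function name `check_ca` and the placeholder final `echo` are not part of the real script); it only illustrates that a bare `exit` reuses the status of the last command that ran.

   ```bash
   # Reduced, hypothetical copy of the CA branch; the subshell keeps `exit` contained.
   check_ca() {
       (
           CACERT_FILE="$1"
           CACERT="$2"
           if [ -n "$CACERT" ]; then
               echo "$CACERT" > "$CACERT_FILE"
           elif [ ! -f "$CACERT_FILE" ]; then
               echo "Cannot find ca certificate file: $CACERT_FILE, exiting!"
               exit    # bare exit: returns the status of the echo above
           fi
           echo "keytool import would run here"
       )
   }

   # Both arguments empty, as on the VMware SSH dispatch path.
   check_ca "" "" > /dev/null
   bare_status=$?

   # For contrast: a bare exit right after a failing command propagates the failure.
   after_false=0
   ( false; exit ) || after_false=$?

   echo "bare exit after echo: $bare_status, bare exit after false: $after_false"
   ```

   With both arguments empty the function reports status 0, which is what the management server interprets as success.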
   
   Evidence observed on the affected deployment.
   
   Management-server log, every ~10s while the system VM is Running:
   ```
   ERROR [c.c.u.n.Link] (AgentManager-SSLHandshakeHandler-1:[])
     SSL error caught during unwrap data: (certificate_unknown)
     Received fatal alert: certificate_unknown,
     for local address=/<MS>:8250, remote address=/<SYSVM>:xxxxx.
     The client may have invalid ca-certificates.
   ```
   
   System VM `/var/log/cloud.log`, reciprocally:
   ```
   ERROR [utils.nio.Link] SSL error caught during wrap data:
     No trusted certificate found, for local address=/<SYSVM>:xxxxx,
     remote address=/<MS>:8250.
   ```
   
   Keystore inspection inside the VM (using the passphrase stored in 
agent.properties):
   ```
   $ keytool -list -keystore /usr/local/cloud/systemvm/conf/cloud.jks \
       -storepass "$(sed -n 's/^keystore.passphrase=//p' \
         /usr/local/cloud/systemvm/conf/agent.properties)"
   Your keystore contains 1 entry
   cloud, <date>, PrivateKeyEntry, ...
   ```
   
   Only the leaf `PrivateKeyEntry` is present — no `trustedCertEntry` for the 
root CA. The file `/usr/local/cloud/systemvm/conf/cloud.ca.crt` is correctly 
written on disk (byte-for-byte identical to the live MS root CA), but Java does 
not read loose PEM files on disk; it only trusts entries inside `cloud.jks`.
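
   A quick way to turn that inspection into a pass/fail check is to count `trustedCertEntry` lines in the listing. This is a minimal sketch: the helper name `count_trusted_entries` is hypothetical, and the sample text is modelled on the listing above rather than captured from a real `keytool` run.

   ```bash
   # Hypothetical helper: counts trustedCertEntry lines in `keytool -list` output on stdin.
   count_trusted_entries() {
       # grep -c prints 0 but exits non-zero when nothing matches, hence `|| true`.
       grep -c 'trustedCertEntry' || true
   }

   # Sample listing modelled on the affected keystore (not real keytool output).
   sample_listing=$(printf '%s\n' \
       'Your keystore contains 1 entry' \
       'cloud, <date>, PrivateKeyEntry, ...')

   trusted=$(printf '%s\n' "$sample_listing" | count_trusted_entries)
   if [ "$trusted" -eq 0 ]; then
       verdict="missing CA trust entry"
   else
       verdict="CA trust entry present"
   fi
   echo "$verdict"
   ```

   On a system VM, the real `keytool -list ...` output would be piped into the helper in place of the sample text.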
   
   The relevant piece of `scripts/util/keystore-cert-import` in 4.22.0.0:
   ```bash
   CACERT_FILE="$7"
   CACERT=$(echo "$8" | tr '^' '\n' | tr '~' ' ')
   ...
   # Import ca certs
   if [ ! -z "${CACERT// }" ]; then
       echo "$CACERT" > "$CACERT_FILE"
   elif [ ! -f "$CACERT_FILE" ]; then
       echo "Cannot find ca certificate file: $CACERT_FILE, exiting!"
       exit                    # <-- bare exit -> exit 0 -> MS sees success
   fi
   
   awk '/-----BEGIN CERTIFICATE-----?/{n++}{print > "cloudca." n }' "$CACERT_FILE"
   for caChain in $(ls cloudca.*); do
       keytool -delete -noprompt -alias "$caChain" -keystore "$KS_FILE" -storepass "$KS_PASS" > /dev/null 2>&1 || true
       keytool -import -noprompt -storepass "$KS_PASS" -trustcacerts -alias "$caChain" -file "$caChain" -keystore "$KS_FILE" > /dev/null 2>&1
   done
   done
   ```
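
   For readers unfamiliar with the `awk` step, the sketch below shows how it splits a CA bundle into one `cloudca.N` file per certificate, which is what the `keytool` loop then iterates over. The certificate bodies are placeholders, the temp directory is an assumption for the demo, and the redirect target is parenthesized here because unparenthesized concatenation after `>` is not reliably portable across awk implementations.

   ```bash
   # Scratch directory for the demo (assumption; the real script writes to its cwd).
   tmp=$(mktemp -d)

   # Two placeholder "certificates" standing in for a root plus intermediate bundle.
   printf '%s\n' \
       '-----BEGIN CERTIFICATE-----' \
       'PLACEHOLDER-ROOT' \
       '-----END CERTIFICATE-----' \
       '-----BEGIN CERTIFICATE-----' \
       'PLACEHOLDER-INTERMEDIATE' \
       '-----END CERTIFICATE-----' > "$tmp/bundle.pem"

   # Each BEGIN line bumps n; every line is appended to cloudca.<n>.
   ( cd "$tmp" && awk '/-----BEGIN CERTIFICATE-----/{n++}{print > ("cloudca." n)}' bundle.pem )

   count=$(ls "$tmp"/cloudca.* | wc -l | tr -d ' ')
   echo "split into $count files"
   ```

   When `$CACERT_FILE` is empty or missing, this split never produces any `cloudca.N` files, so there is nothing for the import loop to add.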
   
   Example arguments actually logged by MS for a `v-26-VM` 
`SetupCertificateCommand` dispatch (6 positional args, not 10):
   ```
   Run command on VR: <SYSVM_IP>, script: keystore-cert-import with args:
     /usr/local/cloud/systemvm/conf/agent.properties <KS_PASS>
     /usr/local/cloud/systemvm/conf/cloud.jks
     ssh
     /usr/local/cloud/systemvm/conf/cloud.crt
     "-----BEGIN~CERTIFICATE-----^...leaf cert content..."
   ```
   
   `$7` (CACERT_FILE) and `$8` (CACERT content) are absent.
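
   For reference, the `^`/`~` encoding visible in the leaf cert argument can be round-tripped as follows. The decoder is the `tr` pipeline from the script; the encoder is simply its assumed inverse (the function names and the placeholder certificate body are hypothetical), shown to illustrate what a correctly populated `$8` would look like.

   ```bash
   # Assumed inverse of the script's decoder: spaces become '~', newlines become '^'.
   encode_pem() { tr ' ' '~' | tr '\n' '^'; }

   # Decoder as used in keystore-cert-import.
   decode_pem() { tr '^' '\n' | tr '~' ' '; }

   # Placeholder PEM-shaped text, not a real certificate.
   pem=$(printf '%s\n' \
       '-----BEGIN CERTIFICATE-----' \
       'PLACEHOLDERBASE64' \
       '-----END CERTIFICATE-----')

   encoded=$(printf '%s\n' "$pem" | encode_pem)
   decoded=$(printf '%s' "$encoded" | decode_pem)

   echo "$encoded"
   ```

   The encoded form is a single shell word, which is why it can travel as one positional argument over SSH.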
   
   
   ### versions
   
   - Apache CloudStack: 4.22.0.0 (package 
`cloudstack-management-4.22.0.0-shapeblue0`, ShapeBlue RPM build)
   - Management server host OS: Rocky Linux 10.0
   - Management server JDK: OpenJDK 21.0.10 (for 
`cloudstack-management.service`)
   - SystemVM template: `systemvmtemplate-4.22.0-x86_64-vmware.ova` (official 
4.22.0 vSphere OVA, built Tue Oct 14 11:01:03 UTC 2025)
   - SystemVM guest OS: Debian 12 (bookworm), OpenJDK 17.0.16 (`keytool` as 
shipped in the image)
   - Hypervisor: VMware vSphere (ESXi cluster, 6 hosts)
   - Primary storage: VMFS
   - Secondary storage: NFS
   - Network: VMware DVS, shared Public VLAN for system VMs, Basic zone 
networking
   - CA framework: `ca.framework.provider.plugin=root`, 
`ca.plugin.root.auth.strictness=false`
   
   
   ### The steps to reproduce the bug
   
   1. Install Apache CloudStack 4.22.0.0 management server on a clean host 
(e.g. Rocky 10).
   2. Register a VMware zone and upload the official 
`systemvmtemplate-4.22.0-x86_64-vmware.ova` as the SystemVM template (type 
`SYSTEM`).
   3. Let ACS auto-provision the ConsoleProxy and SecondaryStorageVm for that 
zone.
   4. Wait ~1 minute after both system VMs reach `state=Running`, 
`power_state=PowerOn`.
   5. Observe in `management-server.log` that `SetupCertificateCommand` is 
dispatched via the VMware SSH path and returns `SetupCertificateAnswer: 
result=true`.
   6. Despite that "success", `management-server.log` keeps emitting `SSL error 
caught during unwrap data: (certificate_unknown) Received fatal alert: 
certificate_unknown, ... remote address=/<SYSVM>:xxxxx` every ~10 s.
   7. SSH into the system VM on port 3922 using the MS key 
(`/var/cloudstack/management/.ssh/id_rsa`) and run:
      ```
      PASS=$(sed -n 's/^keystore.passphrase=//p' /usr/local/cloud/systemvm/conf/agent.properties)
      keytool -list -keystore /usr/local/cloud/systemvm/conf/cloud.jks -storepass "$PASS"
      ```
      Output on an affected VM: the keystore contains exactly 1 entry (`cloud`, 
`PrivateKeyEntry`) and no `trustedCertEntry`.
   8. Verify `/usr/local/cloud/systemvm/conf/cloud.ca.crt` exists on disk and 
matches the MS root CA (same serial as `ca.plugin.root.ca.certificate` in the 
`configuration` table), but it was never imported into the keystore.
   9. Result: the agent never completes TLS, the host never reaches `Up`, and 
the console proxy never opens its public HTTP listener — browser hits 
`ERR_CONNECTION_REFUSED` on the system VM public IP.
   
   
   ### What to do about it?
   
   Two fixes, ideally both.
   
   **A. Primary — ensure the VMware SSH dispatch of `SetupCertificateCommand` 
passes the CA args.**
   
   The dispatch path that invokes `scripts/util/keystore-cert-import` needs to 
also pass positional arguments `$7` 
(`/usr/local/cloud/systemvm/conf/cloud.ca.crt`) and `$8` (CA certificate 
content, same `^`/`~`-encoded form used for the leaf cert in `$6`). The CA 
material is already available in the `Certificate` object used by 
`CAManagerImpl` (see 
`server/src/main/java/org/apache/cloudstack/ca/CAManagerImpl.java` around the 
`SetupCertificateCommand cmd = new SetupCertificateCommand(certificate)` call) 
and in `ca.plugin.root.ca.certificate` in the `configuration` table — it just 
isn't being carried into the script invocation on the VMware SSH path.
   
   Relevant files:
   - `core/src/main/java/org/apache/cloudstack/ca/SetupCertificateCommand.java`: 
command payload; may need to explicitly carry the CA certificate to the VR 
path.
   - `server/src/main/java/org/apache/cloudstack/ca/CAManagerImpl.java`: where 
the command is built; has the `Certificate` object with CA material.
   - `plugins/hypervisors/vmware/src/main/java/com/cloud/hypervisor/vmware/resource/VmwareResource.java`: 
VMware `executeInVR` dispatch; check the branch that handles 
`SetupCertificateCommand` and routes it to `keystore-cert-import`.
   
   **B. Secondary — make `scripts/util/keystore-cert-import` fail loud, with a 
safe disk fallback.**
   
   Current (buggy) behaviour:
   ```bash
   elif [ ! -f "$CACERT_FILE" ]; then
       echo "Cannot find ca certificate file: $CACERT_FILE, exiting!"
       exit
   fi
   ```
   
   Suggested patch:
   ```diff
   - elif [ ! -f "$CACERT_FILE" ]; then
   -     echo "Cannot find ca certificate file: $CACERT_FILE, exiting!"
   -     exit
   - fi
   + elif [ ! -f "$CACERT_FILE" ]; then
   +     # Fall back to the CA cert that cloud-early-config placed on disk.
   +     if [ -f /usr/local/cloud/systemvm/conf/cloud.ca.crt ]; then
   +         CACERT_FILE=/usr/local/cloud/systemvm/conf/cloud.ca.crt
   +     else
   +         echo "Cannot find ca certificate file; neither arg \$7 nor /usr/local/cloud/systemvm/conf/cloud.ca.crt available, aborting!" >&2
   +         exit 1
   +     fi
   + fi
   ```
   A non-zero exit lets the management server see the failure (today, 
`SetupCertificateAnswer.result=true` completely masks it). The disk fallback is 
defense-in-depth against future dispatch-side regressions of this same shape.
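
   The intended behaviour of the patched branch can be exercised outside the system VM. The sketch below is not the patch itself: `resolve_ca_file` is a hypothetical function, and the fallback path is passed in as a parameter (rather than hard-coded to `/usr/local/cloud/systemvm/conf/cloud.ca.crt`) so it can run against a temp directory.

   ```bash
   # Hypothetical stand-in for the patched branch.
   # $1 = CACERT_FILE (arg $7), $2 = decoded CACERT (arg $8), $3 = on-disk fallback path.
   resolve_ca_file() {
       ca_file="$1"; ca_content="$2"; fallback="$3"
       if [ -n "$ca_content" ]; then
           printf '%s\n' "$ca_content" > "$ca_file"
       elif [ ! -f "$ca_file" ]; then
           if [ -f "$fallback" ]; then
               ca_file="$fallback"        # use the CA cert already on disk
           else
               echo "Cannot find ca certificate file, aborting!" >&2
               return 1                   # explicit non-zero status
           fi
       fi
       printf '%s\n' "$ca_file"
   }

   tmp=$(mktemp -d)

   # Case 1: no args and no fallback file -> failure is visible to the caller.
   missing_status=0
   resolve_ca_file "" "" "$tmp/cloud.ca.crt" > /dev/null 2>&1 || missing_status=$?

   # Case 2: no args, but the fallback file exists -> it is selected.
   echo "placeholder CA" > "$tmp/cloud.ca.crt"
   resolved=$(resolve_ca_file "" "" "$tmp/cloud.ca.crt")

   echo "missing: $missing_status, resolved: $resolved"
   ```

   The first case is the one that matters for the management server: with a non-zero status, `SetupCertificateAnswer` would no longer report success for a push that imported nothing.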
   
   **Workaround currently in use on the affected deployment (for anyone hitting 
this before a fix is merged):**
   For each system VM, extract the existing leaf key+cert with `openssl 
pkcs12`, rebuild the PKCS12 with `openssl pkcs12 -export ... -certfile 
cloud.ca.crt`, then `keytool -import -noprompt -trustcacerts -alias cloudca 
-file cloud.ca.crt -keystore cloud.jks -storepass <agent.properties 
passphrase>`, then `systemctl restart cloud`. Note: the `keytool` shipped in 
the 4.22 systemvm template (OpenJDK 17.0.16, Debian 12) rejected `-import` of 
both PEM and DER forms of the CA with `CertificateParsingException: signed 
overrun, bytes = 115`, so the rebuilt keystore had to be generated on the 
management server (OpenJDK 21) and copied back. That broken `keytool -import` 
behaviour on the systemvm image may deserve a separate issue.
   

