sandabot opened a new issue, #13064:
URL: https://github.com/apache/cloudstack/issues/13064
### problem
After a fresh system VM (ConsoleProxy / SecondaryStorageVm) boots, its agent
cannot complete the TLS handshake with the management server on port 8250. The
console proxy never binds its public HTTP/HTTPS listener (browser gets
`ERR_CONNECTION_REFUSED` on the public IP), the SSVM never registers as a host,
and ACS keeps destroying and re-creating the system VMs — each replacement
exhibits the same failure.
Root cause: on the VMware SSH dispatch path, `SetupCertificateCommand`
invokes `scripts/util/keystore-cert-import` **without** positional arguments
`$7` (CACERT_FILE path) and `$8` (CACERT content). With both empty, the script
falls through to the `elif [ ! -f "$CACERT_FILE" ]` branch and calls a bare
`exit`, which reuses the status of the preceding `echo` and therefore returns
**exit code 0**. The management server receives
`SetupCertificateAnswer: result=true` and believes the push succeeded. In
reality the `awk ... "$CACERT_FILE"` split and the subsequent `keytool -import
-trustcacerts -alias "$caChain"` calls never run, so the CA certificate is never
added to `cloud.jks` as a `trustedCertEntry`.
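The exit-status behaviour described above can be reproduced in isolation. This is a minimal sketch, not the CloudStack script itself: each child shell only mimics the failing branch (a message followed by `exit`), and the variable names are made up for the demonstration.

```bash
#!/bin/sh
# Child shell mimics the script's failing branch: print a message, then a bare `exit`.
# A bare `exit` reuses the status of the last command run (the successful echo).
bare_status=0
sh -c 'echo "Cannot find ca certificate file, exiting!"; exit' || bare_status=$?

# Same branch, but with an explicit non-zero status.
explicit_status=0
sh -c 'echo "Cannot find ca certificate file, aborting!" >&2; exit 1' || explicit_status=$?

echo "bare exit: $bare_status, explicit exit 1: $explicit_status"
```

Only the second form gives the caller a status it can treat as a failure.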
Evidence observed on the affected deployment:
Management-server log, every ~10 s while the system VM is Running:
```
ERROR [c.c.u.n.Link] (AgentManager-SSLHandshakeHandler-1:[])
SSL error caught during unwrap data: (certificate_unknown)
Received fatal alert: certificate_unknown,
for local address=/<MS>:8250, remote address=/<SYSVM>:xxxxx.
The client may have invalid ca-certificates.
```
System VM `/var/log/cloud.log`, reciprocally:
```
ERROR [utils.nio.Link] SSL error caught during wrap data:
No trusted certificate found, for local address=/<SYSVM>:xxxxx,
remote address=/<MS>:8250.
```
Keystore inspection inside the VM (using the passphrase stored in
agent.properties):
```
$ keytool -list -keystore /usr/local/cloud/systemvm/conf/cloud.jks \
    -storepass "$(sed -n 's/^keystore.passphrase=//p' /usr/local/cloud/systemvm/conf/agent.properties)"
Your keystore contains 1 entry
cloud, <date>, PrivateKeyEntry, ...
```
Only the leaf `PrivateKeyEntry` is present — no `trustedCertEntry` for the
root CA. The file `/usr/local/cloud/systemvm/conf/cloud.ca.crt` is correctly
written on disk (byte-for-byte identical to the live MS root CA), but Java does
not read loose PEM files on disk; it only trusts entries inside `cloud.jks`.
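For checking several system VMs, the inspection above can be wrapped in a small helper. This is a sketch under assumptions: `has_trusted_ca` is a hypothetical name, and it relies on `keytool -list` printing the literal entry type `trustedCertEntry`, as in the default English output.

```bash
# Reads `keytool -list` output on stdin.
# Succeeds only if at least one trustedCertEntry is listed.
has_trusted_ca() {
    grep -q 'trustedCertEntry'
}

# Usage inside the system VM (paths as in this report):
#   PASS=$(sed -n 's/^keystore.passphrase=//p' /usr/local/cloud/systemvm/conf/agent.properties)
#   keytool -list -keystore /usr/local/cloud/systemvm/conf/cloud.jks -storepass "$PASS" \
#       | has_trusted_ca || echo "CA missing from cloud.jks"
```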
The relevant piece of `scripts/util/keystore-cert-import` in 4.22.0.0:
```bash
CACERT_FILE="$7"
CACERT=$(echo "$8" | tr '^' '\n' | tr '~' ' ')
...
# Import ca certs
if [ ! -z "${CACERT// }" ]; then
echo "$CACERT" > "$CACERT_FILE"
elif [ ! -f "$CACERT_FILE" ]; then
echo "Cannot find ca certificate file: $CACERT_FILE, exiting!"
exit # <-- bare exit -> exit 0 -> MS sees success
fi
awk '/-----BEGIN CERTIFICATE-----?/{n++}{print > "cloudca." n }' "$CACERT_FILE"
for caChain in $(ls cloudca.*); do
    keytool -delete -noprompt -alias "$caChain" -keystore "$KS_FILE" -storepass "$KS_PASS" > /dev/null 2>&1 || true
    keytool -import -noprompt -storepass "$KS_PASS" -trustcacerts -alias "$caChain" -file "$caChain" -keystore "$KS_FILE" > /dev/null 2>&1
done
```
Example arguments actually logged by MS for a `v-26-VM`
`SetupCertificateCommand` dispatch (6 positional args, not 10):
```
Run command on VR: <SYSVM_IP>, script: keystore-cert-import with args:
/usr/local/cloud/systemvm/conf/agent.properties <KS_PASS>
/usr/local/cloud/systemvm/conf/cloud.jks
ssh
/usr/local/cloud/systemvm/conf/cloud.crt
"-----BEGIN~CERTIFICATE-----^...leaf cert content..."
```
`$7` (CACERT_FILE) and `$8` (CACERT content) are absent.
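To see which branch a six-argument call ends up in, the CA branch can be modelled as a standalone function. This is an illustrative sketch, not the shipped script: `ca_branch` and the labels it prints are hypothetical, and the blank check is a POSIX equivalent of the script's `${CACERT// }` test.

```bash
# Models the CA branch of keystore-cert-import for a given argument list.
# Prints which path the script would take instead of acting on it.
ca_branch() {
    CACERT_FILE="${7-}"
    # Same decoding as the script: '^' becomes newline, '~' becomes space.
    CACERT=$(printf '%s\n' "${8-}" | tr '^' '\n' | tr '~' ' ')
    if [ -n "$(printf '%s' "$CACERT" | tr -d ' ')" ]; then
        echo "write-ca-from-arg"      # $8 carried CA content
    elif [ ! -f "$CACERT_FILE" ]; then
        echo "missing-ca"             # the branch that ends in a bare exit
    else
        echo "use-existing-file"      # $7 points at a file already on disk
    fi
}

# Six arguments, as in the logged dispatch: $7 and $8 expand to empty strings.
ca_branch props pass cloud.jks ssh cloud.crt 'leaf~cert^content'
```

With only six arguments, `[ ! -f "" ]` is true, so the call lands in the `missing-ca` path.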
### versions
- Apache CloudStack: 4.22.0.0 (package
`cloudstack-management-4.22.0.0-shapeblue0`, ShapeBlue RPM build)
- Management server host OS: Rocky Linux 10.0
- Management server JDK: OpenJDK 21.0.10 (for
`cloudstack-management.service`)
- SystemVM template: `systemvmtemplate-4.22.0-x86_64-vmware.ova` (official
4.22.0 vSphere OVA, built Tue Oct 14 11:01:03 UTC 2025)
- SystemVM guest OS: Debian 12 (bookworm), OpenJDK 17.0.16 (`keytool` as
shipped in the image)
- Hypervisor: VMware vSphere (ESXi cluster, 6 hosts)
- Primary storage: VMFS
- Secondary storage: NFS
- Network: VMware DVS, shared Public VLAN for system VMs, Basic zone
networking
- CA framework: `ca.framework.provider.plugin=root`,
`ca.plugin.root.auth.strictness=false`
### The steps to reproduce the bug
1. Install Apache CloudStack 4.22.0.0 management server on a clean host
(e.g. Rocky 10).
2. Register a VMware zone and upload the official
`systemvmtemplate-4.22.0-x86_64-vmware.ova` as the SystemVM template (type
`SYSTEM`).
3. Let ACS auto-provision the ConsoleProxy and SecondaryStorageVm for that
zone.
4. Wait ~1 minute after both system VMs reach `state=Running`,
`power_state=PowerOn`.
5. Observe in `management-server.log` that `SetupCertificateCommand` is
dispatched via the VMware SSH path and returns `SetupCertificateAnswer:
result=true`.
6. Despite that "success", `management-server.log` keeps emitting `SSL error
caught during unwrap data: (certificate_unknown) Received fatal alert:
certificate_unknown, ... remote address=/<SYSVM>:xxxxx` every ~10 s.
7. SSH into the system VM on port 3922 using the MS key
(`/var/cloudstack/management/.ssh/id_rsa`) and run:
```
PASS=$(sed -n 's/^keystore.passphrase=//p' /usr/local/cloud/systemvm/conf/agent.properties)
keytool -list -keystore /usr/local/cloud/systemvm/conf/cloud.jks -storepass "$PASS"
```
Observed (buggy) output: the keystore contains exactly 1 entry (`cloud`,
`PrivateKeyEntry`) and no `trustedCertEntry`.
8. Verify `/usr/local/cloud/systemvm/conf/cloud.ca.crt` exists on disk and
matches the MS root CA (same serial as `ca.plugin.root.ca.certificate` in the
`configuration` table), but it was never imported into the keystore.
9. Result: the agent never completes TLS, the host never reaches `Up`, and
the console proxy never opens its public HTTP/HTTPS listener, so the browser
gets `ERR_CONNECTION_REFUSED` on the system VM public IP.
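Steps 5 and 6 can be confirmed from a shell on the management server by counting the handshake alerts. This is a small sketch: `count_cert_unknown` is a hypothetical helper, and the log path in the usage comment is assumed to be the default package location.

```bash
# Counts log lines on stdin that carry the TLS alert seen in this report.
# `grep -c` exits 1 when the count is 0, so the status is masked.
count_cert_unknown() {
    grep -c 'Received fatal alert: certificate_unknown' || true
}

# Usage (log path is an assumption; adjust to the local installation):
#   count_cert_unknown < /var/log/cloudstack/management/management-server.log
```

A count that keeps growing while `SetupCertificateAnswer` reports `result=true` matches the failure described here.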
### What to do about it?
Two fixes, ideally both.
**A. Primary — ensure the VMware SSH dispatch of `SetupCertificateCommand`
passes the CA args.**
The dispatch path that invokes `scripts/util/keystore-cert-import` needs to
also pass positional arguments `$7`
(`/usr/local/cloud/systemvm/conf/cloud.ca.crt`) and `$8` (CA certificate
content, same `^`/`~`-encoded form used for the leaf cert in `$6`). The CA
material is already available in the `Certificate` object used by
`CAManagerImpl` (see
`server/src/main/java/org/apache/cloudstack/ca/CAManagerImpl.java` around the
`SetupCertificateCommand cmd = new SetupCertificateCommand(certificate)` call)
and in `ca.plugin.root.ca.certificate` in the `configuration` table — it just
isn't being carried into the script invocation on the VMware SSH path.
Relevant files:
- `core/src/main/java/org/apache/cloudstack/ca/SetupCertificateCommand.java`
— command payload; may need to explicitly carry the CA certificate to the VR
path.
- `server/src/main/java/org/apache/cloudstack/ca/CAManagerImpl.java` — where
the command is built; has the `Certificate` object with CA material.
- `plugins/hypervisors/vmware/src/main/java/com/cloud/hypervisor/vmware/resource/VmwareResource.java`
— VMware `executeInVR` dispatch; check the branch that handles
`SetupCertificateCommand` and routes it to `keystore-cert-import`.
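For reference, the `^`/`~` form expected in `$8` can be sketched in shell. The decoder is the same `tr` pipeline the script applies to `$8`. The encoder is an assumption inferred from the logged leaf-cert argument (`-----BEGIN~CERTIFICATE-----^...`), not taken from the Java dispatch code, and both function names are hypothetical.

```bash
# Encodes PEM text on stdin into a single shell-safe token:
# newline becomes '^', space becomes '~' (inferred from the logged $6 argument).
encode_pem() {
    tr '\n' '^' | tr ' ' '~'
}

# Decodes such a token back to PEM, as keystore-cert-import does for $8.
decode_pem() {
    printf '%s\n' "$1" | tr '^' '\n' | tr '~' ' '
}

# Usage, e.g. to build the missing $8 by hand while testing:
#   CA_ARG=$(encode_pem < /usr/local/cloud/systemvm/conf/cloud.ca.crt)
```

The round trip only works because base64 PEM bodies never contain `^` or `~`.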
**B. Secondary — make `scripts/util/keystore-cert-import` fail loud, with a
safe disk fallback.**
Current (buggy) behaviour:
```bash
elif [ ! -f "$CACERT_FILE" ]; then
echo "Cannot find ca certificate file: $CACERT_FILE, exiting!"
exit
fi
```
Suggested patch:
```diff
- elif [ ! -f "$CACERT_FILE" ]; then
- echo "Cannot find ca certificate file: $CACERT_FILE, exiting!"
- exit
- fi
+ elif [ ! -f "$CACERT_FILE" ]; then
+ # Fall back to the CA cert that cloud-early-config placed on disk.
+ if [ -f /usr/local/cloud/systemvm/conf/cloud.ca.crt ]; then
+ CACERT_FILE=/usr/local/cloud/systemvm/conf/cloud.ca.crt
+ else
+ echo "Cannot find ca certificate file; neither arg \$7 nor /usr/local/cloud/systemvm/conf/cloud.ca.crt available, aborting!" >&2
+ exit 1
+ fi
+ fi
```
A non-zero exit lets the management server see the failure (today,
`SetupCertificateAnswer.result=true` completely masks it). The disk fallback is
defense-in-depth against future dispatch-side regressions of this same shape.
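The intended semantics of fix B can be exercised outside the script with a standalone function. This is a sketch only: `resolve_cacert_file` is a hypothetical helper, and the fallback path is made a parameter purely so the logic can be tried on any machine.

```bash
# Picks the CA file to import, or fails loudly.
#   $1: CACERT_FILE as received in arg $7 (may be empty)
#   $2: on-disk fallback (defaults to the path used on system VMs)
resolve_cacert_file() {
    cacert_file="${1-}"
    fallback="${2:-/usr/local/cloud/systemvm/conf/cloud.ca.crt}"
    if [ -n "$cacert_file" ] && [ -f "$cacert_file" ]; then
        printf '%s\n' "$cacert_file"
    elif [ -f "$fallback" ]; then
        printf '%s\n' "$fallback"
    else
        echo "Cannot find ca certificate file; neither arg \$7 nor $fallback available, aborting!" >&2
        return 1   # non-zero, so the caller can report the failure
    fi
}

# Usage inside a script:
#   CACERT_FILE=$(resolve_cacert_file "$7") || exit 1
```

Keeping the decision in one place makes both outcomes easy to test: a usable path on stdout, or a non-zero status with the reason on stderr.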
**Workaround currently in use on the affected deployment (for anyone hitting
this before a fix is merged):**
For each system VM:
1. Extract the existing leaf key+cert with `openssl pkcs12`.
2. Rebuild the PKCS12 with `openssl pkcs12 -export ... -certfile
cloud.ca.crt`.
3. Run `keytool -import -noprompt -trustcacerts -alias cloudca
-file cloud.ca.crt -keystore cloud.jks -storepass <agent.properties
passphrase>`.
4. Run `systemctl restart cloud`.

Note: the `keytool` shipped in
the 4.22 systemvm template (OpenJDK 17.0.16, Debian 12) rejected `-import` of
both PEM and DER forms of the CA with `CertificateParsingException: signed
overrun, bytes = 115`, so the rebuilt keystore had to be generated on the
management server (OpenJDK 21) and copied back. That broken `keytool -import`
behaviour on the systemvm image may deserve a separate issue.