Hi,

I'm trying to bootstrap a disconnected (air-gapped) 4.2 cluster using the bare
metal method
<https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-bare-metal.html>.
The environment is technically VMware, but I'm following the bare metal
instructions because our VMware cluster wasn't quite compatible with the
VMware ones.

After a few false starts I managed to get bootstrapping started.  One
strange thing was that it tried to download images from
"quay.io/openshift-release-dev/ocp-v4.0-art-dev" instead of the documented
"quay.io/openshift-release-dev/ocp-release".  I found this rather odd, and
I couldn't find many references to "ocp-v4.0-art-dev" on the internet, so
I'm not sure exactly where it came from.  I ran
"strings openshift-install | grep ocp-v4.0-art-dev" but that returned
nothing, so it's a bit of a strange one.

So my image content sources ended up being:

imageContentSources:
- mirrors:
  - <bastion_host_name>:5000/<repo_name>/release
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - <bastion_host_name>:5000/<repo_name>/release
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - <bastion_host_name>:5000/<repo_name>/release
  source: registry.svc.ci.openshift.org/ocp/release
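
A minimal sketch of how a pull spec seen in the logs could be checked
against those three sources.  The is_mirrored helper is a hypothetical name,
and it assumes a simple exact match on the repository part of the pull spec,
which may not be exactly how the cluster applies the mirrors:

```shell
#!/bin/sh
# Sources copied from the imageContentSources stanza above.
SOURCES="quay.io/openshift-release-dev/ocp-release
quay.io/openshift-release-dev/ocp-v4.0-art-dev
registry.svc.ci.openshift.org/ocp/release"

# is_mirrored PULLSPEC (hypothetical helper)
# Succeeds if the repository part of PULLSPEC equals one of SOURCES.
# Assumption: exact repository match, no prefix or wildcard handling.
is_mirrored() {
    repo=${1%%@*}            # drop an "@sha256:..." digest, if present
    case ${repo##*/} in      # only look for a tag in the last path element,
        *:*) repo=${repo%:*} ;;  # so a registry port such as ":5000" is kept
    esac
    for src in $SOURCES; do
        [ "$repo" = "$src" ] && return 0
    done
    return 1
}
```

Calling is_mirrored with a pull spec copied from the journal and checking the
exit status shows whether that repository is one of the listed sources.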

I was watching the journal on the bootstrap server, and I saw each etcd
server join one by one.  Once they had all joined, the apiserver on the
bootstrap server seemed to lock up: when I tried to connect to
https://localhost:6443 the connections would hang.  Initially, I thought
this meant that bootstrap had completed, but then I noticed that none of
the master nodes were listening on 6443.  They were all trying to look
themselves up in etcd at "api-int.<cluster_name>.<base_domain>", but nothing
was listening there.

I then scoured the journal on the bootstrap node, but I struggled to find
logs related to why the apiserver had disappeared.  The journal was mostly
full of the bootstrap node trying to connect to https://localhost:6443,
which suggested to me that bootstrap was not yet complete.
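
A rough sketch of one way to cut that noise down, assuming the journal is
piped in as plain text.  The journal_errors name and the keyword list are
assumptions for illustration, not anything taken from the installer:

```shell
#!/bin/sh
# journal_errors (hypothetical helper)
# Reads journal text on stdin, drops the repeated lines that mention
# localhost:6443, and keeps only lines that look like failures.
# The keyword list is a guess and will need tuning.
journal_errors() {
    grep -v 'localhost:6443' | grep -iE 'error|fail|denied|panic'
}
```

For example, "journalctl -b --no-pager | journal_errors" would show only the
remaining failure-looking lines from the current boot.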

I tried rebooting the bootstrap node, but I think that made it worse: it
seemed to be in a crash loop, whinging about files in /etc/kubernetes
already existing or something like that.  I had a look through /var/log
and found this error message in some pod logs:

    exiting because of error: log: unable to create log: open /var/log/bootstrap-control-plane/kube-apiserver.log: permission denied

I'm not sure if that error is because I restarted before bootstrap was
successful, or if that is actually some sort of problem.

I tried reinstalling from scratch a few times, and it always got stuck in
the same place, so it doesn't seem to be transient.

Where can I look for errors? Is "ocp-v4.0-art-dev" an indication of a
problem? Because this is an air-gapped environment, it's difficult to get
logs out of the system, so I don't know if I'll be able to use must-gather.
In any case, if I understand correctly, must-gather can only be used after
bootstrap has succeeded.

Thoughts?
_______________________________________________
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users