To close the loop on this, we chatted offline a bit and I'm pretty sure this is related to a bug in the rkt api-service wherein rkt fly pods result in an increased latency for `ListPods` calls, leading to the kubelet timing out contacting the api-service.
I've filed a rkt issue upstream and am working on a fix for that bug: https://github.com/coreos/rkt/issues/3060 Minimzing the number of running fly pods should reduce the impact of this issue, though if it doesn't we might be dealing with a difference issue entirely. In this case, it did seem to help. On Thu, Aug 11, 2016 at 1:27 PM, Stefan Vetter <[email protected]> wrote: > Update: looks like it is not working on all machines :-( > > And I don't see any additional log entries there: > Aug 11 20:24:39 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:24:39 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:25:43 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811 > 20:25:43.291881 32877 generic.go:197] GenericPLEG: Unable to retrieve > pods: couldn't list pods: rpc error: code = 4 desc = "context deadline > exceeded" > Aug 11 20:25:44 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811 > 20:25:44.286288 32877 pod_workers.go:183] Error syncing pod > 7ce192e0-5ffe-11e6-9211-001999c214fd, skipping: failed to SyncPod: > couldn't list pods: rpc error: code = 4 desc = "context deadline exceeded" > Aug 11 20:25:44 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:25:44 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:25:45 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811 > 20:25:45.285416 32877 kubelet.go:2249] Error listing containers: > &errors.errorString{s:"couldn't list pods: rpc error: code = 4 desc = > \"context deadline exceeded\""} > Aug 11 20:25:45 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811 > 20:25:45.285506 32877 kubelet.go:2596] Failed cleaning pods: couldn't > list pods: rpc error: code = 4 desc = "context deadline exceeded" > Aug 11 20:25:45 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:25:45 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:26:28 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:26:28 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:26:34 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:26:34 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:26:39 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811 > 20:26:39.285875 32877 rkt.go:1859] rkt: Failed to list pods: rpc error: > code = 4 desc = "context deadline exceeded" > Aug 11 20:26:39 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811 > 20:26:39.285913 32877 kubelet.go:927] Container garbage collection > failed: rpc error: code = 4 desc = "context deadline exceeded" > Aug 11 20:26:48 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:26:48 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:26:48 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:26:48 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:26:49 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:26:49 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:26:50 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:26:50 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:26:50 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:26:50 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:27:03 node3.cluster1.kubernetes.cluster.int sudo[32845]: W0811 > 20:27:03.369172 32877 manager.go:1045] Failed to process watch event > {EventType:0 Name:/user.slice/user-0.slice/session-c57.scope > WatchSource:1}: this should be impossible!, new handler failing, but > factory allowed, name = /user.slice/user-0.slice/session-c57.scope > Aug 11 20:27:03 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:27:03 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:27:03 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:27:03 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:27:05 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:27:05 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > Aug 11 20:27:09 node3.cluster1.kubernetes.cluster.int sudo[32844]: > api-service: checking for machined registration failed > Aug 11 20:27:09 node3.cluster1.kubernetes.cluster.int sudo[32844]: > └─pod not found > > > > Am Donnerstag, 11. August 2016 22:14:09 UTC+2 schrieb Stefan Vetter: >> >> Current status: >> >> Problems still existing: >> 1. Time until a container is created is very long (at least 2-3minutes) >> 2. Time until the new status of the container can be seen in the >> master/apiserver: several minutes >> >> Problems solved: >> 1. Containers can be created/managed now >> >> Actions to get the problem solved: >> 1. Create directory /var/lib/polkit-1 >> 2. Add Slice=machine.slice to rkt-api.service >> >> >> If you have any ideas on this problem, please any time send me an answer. >> >> >> Errors in log: >> Aug 11 20:12:11 node1.cluster1.kubernetes.cluster.int sudo[5671]: E0811 >> 20:12:11.665421 5687 generic.go:197] GenericPLEG: Unable to retrieve >> pods: couldn't list pods: rpc error: code = 4 desc = "context deadline >> exceeded" >> Aug 11 20:12:13 node1.cluster1.kubernetes.cluster.int sudo[5613]: >> api-service: checking for machined registration failed >> Aug 11 20:12:13 node1.cluster1.kubernetes.cluster.int sudo[5613]: >> └─pod not found >> Aug 11 20:12:33 node1.cluster1.kubernetes.cluster.int sudo[5671]: E0811 >> 20:12:33.596852 5687 kubelet.go:2249] Error listing containers: >> &errors.errorString{s:"couldn't list pods: rpc error: code = 4 desc = >> \"context deadline exceeded\""} >> Aug 11 20:12:33 node1.cluster1.kubernetes.cluster.int sudo[5671]: E0811 >> 20:12:33.596903 5687 kubelet.go:2596] Failed cleaning pods: couldn't >> list pods: rpc error: code = 4 desc = "context deadline exceeded" >> Aug 11 20:12:34 node1.cluster1.kubernetes.cluster.int sudo[5613]: >> api-service: checking for machined registration failed >> Aug 11 20:12:34 node1.cluster1.kubernetes.cluster.int sudo[5613]: >> └─pod not found >> Aug 11 20:12:57 node1.cluster1.kubernetes.cluster.int sudo[5671]: W0811 >> 20:12:57.606672 5687 image_manager.go:135] [ImageManager] Failed to >> monitor images: couldn't list pods: rpc error: code = 4 desc = "context >> deadline exceeded" >> Aug 11 20:12:58 node1.cluster1.kubernetes.cluster.int sudo[5613]: >> api-service: checking for machined registration failed >> Aug 11 20:12:58 node1.cluster1.kubernetes.cluster.int sudo[5613]: >> └─pod not found >> >> >> Thanks for any help! >> >> >> Am Mittwoch, 10. August 2016 18:36:52 UTC+2 schrieb Euan Kemp: >>> >>> The logs of the "rkt-api" unit are a good place to check when you get >>> rpc errors because that's the service it's trying to communicate with. >>> >>> I recall getting a similar error with older versions of rkt due to a bug >>> in the api-service leading to a panic. Updating to a newer rkt version >>> (which can be done by updating to a newer CoreOS release) might help, but >>> if not the api-service logs would help us figure out what's going on. >>> >>> Best, >>> Euan >>> >>> >>> On Tuesday, August 9, 2016 at 7:42:09 PM UTC-7, Stefan Vetter wrote: >>>> >>>> Maybe I should just switch back to docker as it seams rkt does not work >>>> with kubelet properly... :-( >>>> >>>> Am Dienstag, 9. August 2016 18:09:00 UTC+2 schrieb Stefan Vetter: >>>>> >>>>> Hi all, >>>>> >>>>> first of all, I still have error messages when using a downloaded >>>>> kubelet, but pods get deployed. >>>>> >>>>> >>>>> Using kubernetes 1.3.2 and rkt 1.10.1, coreos 1068.8.0 >>>>> >>>>> >>>>> Here are my findings (kubelet-wrapper): >>>>> >>>>> Aug 09 14:35:09 node3.cluster1.kubernetes.cluster.int sudo[4487]: >>>>> E0809 14:35:09.942522 4523 kubelet.go:2594] Failed cleaning pods: >>>>> couldn't list pods: rpc error: code = 4 desc = "context deadline exce >>>>> >>>>> eded" >>>>> >>>>> Aug 09 14:35:09 node3.cluster1.kubernetes.cluster.int sudo[4487]: >>>>> E0809 14:35:09.942455 4523 kubelet.go:2247] Error listing containers: >>>>> &errors.errorString{s:"couldn't list pods: rpc error: code = 4 de >>>>> >>>>> sc = \"context deadline exceeded\""} >>>>> >>>>> Aug 09 14:35:04 node3.cluster1.kubernetes.cluster.int sudo[4487]: >>>>> E0809 14:35:04.942120 4523 generic.go:197] GenericPLEG: Unable to >>>>> retrieve pods: couldn't list pods: rpc error: code = 4 desc = "contex >>>>> >>>>> t deadline exceeded" >>>>> >>>>> Aug 09 14:35:04 node3.cluster1.kubernetes.cluster.int sudo[4487]: >>>>> E0809 14:35:04.941377 4523 kubelet.go:927] Container garbage collection >>>>> failed: rpc error: code = 4 desc = "context deadline exceeded" >>>>> >>>>> Aug 09 14:35:04 node3.cluster1.kubernetes.cluster.int sudo[4487]: >>>>> E0809 14:35:04.941322 4523 rkt.go:1866] rkt: Failed to list pods: rpc >>>>> error: code = 4 desc = "context deadline exceeded" >>>>> >>>>> Aug 09 14:33:04 node3.cluster1.kubernetes.cluster.int sudo[4487]: >>>>> E0809 14:33:04.944400 4523 manager.go:235] Docker container factory >>>>> registration failed: failed to validate Docker info: failed to dete >>>>> >>>>> ct Docker info: Cannot connect to the Docker daemon. Is the docker >>>>> daemon running on this host?. >>>>> >>>>> Aug 09 14:33:04 node3.cluster1.kubernetes.cluster.int sudo[4487]: >>>>> E0809 14:33:04.940714 4523 kubelet.go:933] Image garbage collection >>>>> failed: unable to find data for container / >>>>> >>>>> >>>>> >>>>> >>>>> Here are my findings (without wrapper): >>>>> >>>>> >>>>> E0809 16:00:35.278281 46910 pod_workers.go:183] Error syncing pod >>>>> ff3cfae6-5e48-11e6-9c00-001999c214fd, skipping: failed to SyncPod: >>>>> couldn't list pods: rpc error: code = 4 desc = "context deadline exceeded" >>>>> >>>>> E0809 16:02:34.972125 46910 kubelet.go:2247] Error listing >>>>> containers: &errors.errorString{s:"couldn't list pods: rpc error: >>>>> code = 4 desc = \"context deadline exceeded\""} >>>>> >>>>> E0809 16:02:34.972156 46910 kubelet.go:2594] Failed cleaning pods: >>>>> couldn't list pods: rpc error: code = 4 desc = "context deadline exceeded" >>>>> >>>>> E0809 16:02:37.278562 46910 generic.go:197] GenericPLEG: Unable to >>>>> retrieve pods: couldn't list pods: rpc error: code = 4 desc = "context >>>>> deadline exceeded" >>>>> >>>>> E0809 16:02:39.183965 46910 rkt.go:65] detectRktContainers: >>>>> listRunningPods failed: rpc error: code = 2 desc = "grpc: the client >>>>> connection is closing" >>>>> >>>>> >>>>> >>>>> >>>>> Thanks for any very appreaciated help! >>>>> >>>>> >>>>> >>>>> Stefan >>>>> >>>>
