To close the loop on this, we chatted offline a bit and I'm pretty sure
this is related to a bug in the rkt api-service wherein rkt fly pods result
in an increased latency for `ListPods` calls, leading to the kubelet timing
out contacting the api-service.

I've filed a rkt issue upstream and am working on a fix for that bug:
https://github.com/coreos/rkt/issues/3060

Minimzing the number of running fly pods should reduce the impact of this
issue, though if it doesn't we might be dealing with a difference issue
entirely. In this case, it did seem to help.


On Thu, Aug 11, 2016 at 1:27 PM, Stefan Vetter <[email protected]> wrote:

> Update: looks like it is not working on all machines :-(
>
> And I don't see any additional log entries there:
> Aug 11 20:24:39 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:24:39 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:25:43 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811
> 20:25:43.291881   32877 generic.go:197] GenericPLEG: Unable to retrieve
> pods: couldn't list pods: rpc error: code = 4 desc = "context deadline
> exceeded"
> Aug 11 20:25:44 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811
> 20:25:44.286288   32877 pod_workers.go:183] Error syncing pod
> 7ce192e0-5ffe-11e6-9211-001999c214fd, skipping: failed to SyncPod:
> couldn't list pods: rpc error: code = 4 desc = "context deadline exceeded"
> Aug 11 20:25:44 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:25:44 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:25:45 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811
> 20:25:45.285416   32877 kubelet.go:2249] Error listing containers:
> &errors.errorString{s:"couldn't list pods: rpc error: code = 4 desc =
> \"context deadline exceeded\""}
> Aug 11 20:25:45 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811
> 20:25:45.285506   32877 kubelet.go:2596] Failed cleaning pods: couldn't
> list pods: rpc error: code = 4 desc = "context deadline exceeded"
> Aug 11 20:25:45 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:25:45 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:26:28 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:26:28 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:26:34 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:26:34 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:26:39 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811
> 20:26:39.285875   32877 rkt.go:1859] rkt: Failed to list pods: rpc error:
> code = 4 desc = "context deadline exceeded"
> Aug 11 20:26:39 node3.cluster1.kubernetes.cluster.int sudo[32845]: E0811
> 20:26:39.285913   32877 kubelet.go:927] Container garbage collection
> failed: rpc error: code = 4 desc = "context deadline exceeded"
> Aug 11 20:26:48 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:26:48 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:26:48 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:26:48 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:26:49 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:26:49 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:26:50 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:26:50 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:26:50 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:26:50 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:27:03 node3.cluster1.kubernetes.cluster.int sudo[32845]: W0811
> 20:27:03.369172   32877 manager.go:1045] Failed to process watch event
> {EventType:0 Name:/user.slice/user-0.slice/session-c57.scope
> WatchSource:1}: this should be impossible!, new handler failing, but
> factory allowed, name = /user.slice/user-0.slice/session-c57.scope
> Aug 11 20:27:03 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:27:03 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:27:03 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:27:03 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:27:05 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:27:05 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
> Aug 11 20:27:09 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> api-service: checking for machined registration failed
> Aug 11 20:27:09 node3.cluster1.kubernetes.cluster.int sudo[32844]:
> └─pod not found
>
>
>
> Am Donnerstag, 11. August 2016 22:14:09 UTC+2 schrieb Stefan Vetter:
>>
>> Current status:
>>
>> Problems still existing:
>> 1. Time until a container is created is very long (at least 2-3minutes)
>> 2. Time until the new status of the container can be seen in the
>> master/apiserver: several minutes
>>
>> Problems solved:
>> 1. Containers can be created/managed now
>>
>> Actions to get the problem solved:
>> 1. Create directory /var/lib/polkit-1
>> 2. Add Slice=machine.slice to rkt-api.service
>>
>>
>> If you have any ideas on this problem, please any time send me an answer.
>>
>>
>> Errors in log:
>> Aug 11 20:12:11 node1.cluster1.kubernetes.cluster.int sudo[5671]: E0811
>> 20:12:11.665421    5687 generic.go:197] GenericPLEG: Unable to retrieve
>> pods: couldn't list pods: rpc error: code = 4 desc = "context deadline
>> exceeded"
>> Aug 11 20:12:13 node1.cluster1.kubernetes.cluster.int sudo[5613]:
>> api-service: checking for machined registration failed
>> Aug 11 20:12:13 node1.cluster1.kubernetes.cluster.int sudo[5613]:
>> └─pod not found
>> Aug 11 20:12:33 node1.cluster1.kubernetes.cluster.int sudo[5671]: E0811
>> 20:12:33.596852    5687 kubelet.go:2249] Error listing containers:
>> &errors.errorString{s:"couldn't list pods: rpc error: code = 4 desc =
>> \"context deadline exceeded\""}
>> Aug 11 20:12:33 node1.cluster1.kubernetes.cluster.int sudo[5671]: E0811
>> 20:12:33.596903    5687 kubelet.go:2596] Failed cleaning pods: couldn't
>> list pods: rpc error: code = 4 desc = "context deadline exceeded"
>> Aug 11 20:12:34 node1.cluster1.kubernetes.cluster.int sudo[5613]:
>> api-service: checking for machined registration failed
>> Aug 11 20:12:34 node1.cluster1.kubernetes.cluster.int sudo[5613]:
>> └─pod not found
>> Aug 11 20:12:57 node1.cluster1.kubernetes.cluster.int sudo[5671]: W0811
>> 20:12:57.606672    5687 image_manager.go:135] [ImageManager] Failed to
>> monitor images: couldn't list pods: rpc error: code = 4 desc = "context
>> deadline exceeded"
>> Aug 11 20:12:58 node1.cluster1.kubernetes.cluster.int sudo[5613]:
>> api-service: checking for machined registration failed
>> Aug 11 20:12:58 node1.cluster1.kubernetes.cluster.int sudo[5613]:
>> └─pod not found
>>
>>
>> Thanks for any help!
>>
>>
>> Am Mittwoch, 10. August 2016 18:36:52 UTC+2 schrieb Euan Kemp:
>>>
>>> The logs of the "rkt-api" unit are a good place to check when you get
>>> rpc errors because that's the service it's trying to communicate with.
>>>
>>> I recall getting a similar error with older versions of rkt due to a bug
>>> in the api-service leading to a panic. Updating to a newer rkt version
>>> (which can be done by updating to a newer CoreOS release) might help, but
>>> if not the api-service logs would help us figure out what's going on.
>>>
>>> Best,
>>> Euan
>>>
>>>
>>> On Tuesday, August 9, 2016 at 7:42:09 PM UTC-7, Stefan Vetter wrote:
>>>>
>>>> Maybe I should just switch back to docker as it seams rkt does not work
>>>> with kubelet properly... :-(
>>>>
>>>> Am Dienstag, 9. August 2016 18:09:00 UTC+2 schrieb Stefan Vetter:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> first of all, I still have error messages when using a downloaded
>>>>> kubelet, but pods get deployed.
>>>>>
>>>>>
>>>>> Using kubernetes 1.3.2 and rkt 1.10.1, coreos 1068.8.0
>>>>>
>>>>>
>>>>> Here are my findings (kubelet-wrapper):
>>>>>
>>>>> Aug 09 14:35:09 node3.cluster1.kubernetes.cluster.int sudo[4487]:
>>>>> E0809 14:35:09.942522    4523 kubelet.go:2594] Failed cleaning pods:
>>>>> couldn't list pods: rpc error: code = 4 desc = "context deadline exce
>>>>>
>>>>> eded"
>>>>>
>>>>> Aug 09 14:35:09 node3.cluster1.kubernetes.cluster.int sudo[4487]:
>>>>> E0809 14:35:09.942455    4523 kubelet.go:2247] Error listing containers:
>>>>> &errors.errorString{s:"couldn't list pods: rpc error: code = 4 de
>>>>>
>>>>> sc = \"context deadline exceeded\""}
>>>>>
>>>>> Aug 09 14:35:04 node3.cluster1.kubernetes.cluster.int sudo[4487]:
>>>>> E0809 14:35:04.942120    4523 generic.go:197] GenericPLEG: Unable to
>>>>> retrieve pods: couldn't list pods: rpc error: code = 4 desc = "contex
>>>>>
>>>>> t deadline exceeded"
>>>>>
>>>>> Aug 09 14:35:04 node3.cluster1.kubernetes.cluster.int sudo[4487]:
>>>>> E0809 14:35:04.941377    4523 kubelet.go:927] Container garbage collection
>>>>> failed: rpc error: code = 4 desc = "context deadline exceeded"
>>>>>
>>>>> Aug 09 14:35:04 node3.cluster1.kubernetes.cluster.int sudo[4487]:
>>>>> E0809 14:35:04.941322    4523 rkt.go:1866] rkt: Failed to list pods: rpc
>>>>> error: code = 4 desc = "context deadline exceeded"
>>>>>
>>>>> Aug 09 14:33:04 node3.cluster1.kubernetes.cluster.int sudo[4487]:
>>>>> E0809 14:33:04.944400    4523 manager.go:235] Docker container factory
>>>>> registration failed: failed to validate Docker info: failed to dete
>>>>>
>>>>> ct Docker info: Cannot connect to the Docker daemon. Is the docker
>>>>> daemon running on this host?.
>>>>>
>>>>> Aug 09 14:33:04 node3.cluster1.kubernetes.cluster.int sudo[4487]:
>>>>> E0809 14:33:04.940714    4523 kubelet.go:933] Image garbage collection
>>>>> failed: unable to find data for container /
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Here are my findings (without wrapper):
>>>>>
>>>>>
>>>>> E0809 16:00:35.278281   46910 pod_workers.go:183] Error syncing pod
>>>>> ff3cfae6-5e48-11e6-9c00-001999c214fd, skipping: failed to SyncPod:
>>>>> couldn't list pods: rpc error: code = 4 desc = "context deadline exceeded"
>>>>>
>>>>> E0809 16:02:34.972125   46910 kubelet.go:2247] Error listing
>>>>> containers: &errors.errorString{s:"couldn't list pods: rpc error:
>>>>> code = 4 desc = \"context deadline exceeded\""}
>>>>>
>>>>> E0809 16:02:34.972156   46910 kubelet.go:2594] Failed cleaning pods:
>>>>> couldn't list pods: rpc error: code = 4 desc = "context deadline exceeded"
>>>>>
>>>>> E0809 16:02:37.278562   46910 generic.go:197] GenericPLEG: Unable to
>>>>> retrieve pods: couldn't list pods: rpc error: code = 4 desc = "context
>>>>> deadline exceeded"
>>>>>
>>>>> E0809 16:02:39.183965   46910 rkt.go:65] detectRktContainers:
>>>>> listRunningPods failed: rpc error: code = 2 desc = "grpc: the client
>>>>> connection is closing"
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks for any very appreaciated help!
>>>>>
>>>>>
>>>>>
>>>>> Stefan
>>>>>
>>>>

Reply via email to