One problem with this (delegating to ContainerFactory to share prewarm/warm containers with other cluster nodes) is that ContainerFactory is currently ignorant of container state, and making use of the shared containers requires sharing at least some of their state (besides paused/running state). Specifically:
- when creating a prewarm, the kind needs to be shared
- when pausing a warm, the action needs to be shared
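As a rough illustration of the state involved, here is a minimal Scala sketch. It is not taken from the PR or from OpenWhisk: `SharedState`, `Prewarmed`, `WarmPaused` and `SharedStateRegistry` are hypothetical names, and the in-memory map only stands in for whatever replicated store would actually be used.

```scala
// Hypothetical sketch: none of these names exist in OpenWhisk as written.
sealed trait SharedState
// A prewarm container: only the runtime kind is known.
final case class Prewarmed(kind: String) extends SharedState
// A paused warm container: the action it was initialized with is known too.
final case class WarmPaused(kind: String, action: String) extends SharedState

// In-memory stand-in for a store that other cluster nodes could read.
final class SharedStateRegistry {
  private var states = Map.empty[String, SharedState]

  // Called when a prewarm is created or a warm container is paused.
  def publish(containerId: String, state: SharedState): Unit =
    states += (containerId -> state)

  // Called when a container is resumed or destroyed and is no longer shareable.
  def remove(containerId: String): Unit =
    states -= containerId

  // What a newly active node would read to rebuild (freePool, prewarmPool).
  def revive(): (Map[String, WarmPaused], Map[String, Prewarmed]) = {
    val free = states.collect { case (id, w: WarmPaused) => id -> w }
    val prewarm = states.collect { case (id, p: Prewarmed) => id -> p }
    (free, prewarm)
  }
}
```

In this sketch, publishing would happen on create and suspend and removal on resume or destroy, so a standby node calling `revive()` sees only containers it could safely attach to.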
To handle this, ContainerFactory.createContainer(), Container.suspend() and Container.resume() would have to change to propagate this state. This seems slightly awkward to me, so I want to put it out for feedback. WDYT?

On Mar 30, 2018, at 2:31 PM, David P Grove <gro...@us.ibm.com> wrote:

+1. I like this design.

--dave

Tyson Norris <tnor...@adobe.com.INVALID> wrote on 03/30/2018 01:37:43 PM:

From: Tyson Norris <tnor...@adobe.com.INVALID>
To: "dev@openwhisk.apache.org" <dev@openwhisk.apache.org>
Date: 03/30/2018 01:37 PM
Subject: Re: Invoker HA on Mesos

Hooking into pause/unpause/destroy of containers seems plausible, instead of hooking into the Maps in ContainerPool.

So in the existing PR, the ContainerPool uses an alternate impl for Map to store freePool and prewarmPool, and that alternate impl initiates the attach to existing containers when it becomes active. The ContainerPool could instead potentially delegate to the ContainerFactory, e.g. a ContainerFactory.reviveContainers(childFactory) => (freePool, prewarmPool). We will still need a way to trigger this on demand (e.g. when the standby pool becomes active, in our case, but I think that is a minor detail).

I can try it out; I will be out next week, but if you test any of this in the meantime, let me know.

Thanks
Tyson

On Mar 30, 2018, at 9:58 AM, David P Grove <gro...@us.ibm.com> wrote:

Tyson Norris <tnor...@adobe.com.INVALID> wrote on 03/27/2018 06:25:59 PM:

Do you have an example of the labels working? I guess the labels are changed over time through the lifecycle of the container?

Apologies for brutally chopping the email chain; my mail client made a horrible hash of it.

Right now, all we are doing with Kube labels is to label each action container with its owning invoker on startup.
This lets us delete orphaned containers if the invoker crashes and needs to be restarted. The labeling happens at [1] and the removal of orphans using the labels at [2].

I think the Kube-native version of part of what you are doing with the DistributedData for Mesos would be to add and remove additional labels to give us the option of attaching a new invoker instance to orphaned containers instead of just destroying them.

Interacting with the Kubernetes API server to do a labeling operation takes around 10ms, so we couldn't do this on a truly hot path. But we could probably afford to update container labels in parallel with pause/unpause operations, which could enable re-attachment to any paused containers.

--dave

[1] https://github.com/apache/incubator-openwhisk/blob/0b20df0f725a671f8e51c9e8793116476fd22f76/core/invoker/src/main/scala/whisk/core/containerpool/kubernetes/KubernetesContainerFactory.scala#L81
[2] https://github.com/apache/incubator-openwhisk/blob/0b20df0f725a671f8e51c9e8793116476fd22f76/core/invoker/src/main/scala/whisk/core/containerpool/kubernetes/KubernetesContainerFactory.scala#L57
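
The label-based approach discussed in this thread (label each container with its owning invoker, then select by label to find orphans or re-attachable paused containers) can be modelled with a small in-memory Scala sketch. This is only an illustration under assumptions: `Pod`, `OrphanLabels` and the label keys `invoker` and `state` are hypothetical, and a real implementation would query the Kubernetes API rather than filter a local list.

```scala
// Hypothetical in-memory model; a real implementation would use the Kubernetes API.
final case class Pod(name: String, labels: Map[String, String])

object OrphanLabels {
  val InvokerLabel = "invoker" // assumed label key for the owning invoker
  val StateLabel = "state"     // assumed label key, e.g. "paused"

  // All containers labeled as belonging to the given invoker instance.
  def ownedBy(pods: Seq[Pod], invoker: String): Seq[Pod] =
    pods.filter(_.labels.get(InvokerLabel).contains(invoker))

  // Of those, the ones a replacement invoker could re-attach to
  // instead of deleting, because they are labeled as paused.
  def reattachable(pods: Seq[Pod], invoker: String): Seq[Pod] =
    ownedBy(pods, invoker).filter(_.labels.get(StateLabel).contains("paused"))
}
```

Keeping the `state` label up to date only around pause/unpause matches the point above that a labeling call is too slow for a truly hot path.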