The etherpad for this session is here [1]. The goal of the session was to get answers from operators to questions the developers had about cellsv2.
The bulk of the time was spent discussing ways to limit instance scheduling retries in a cellsv2 world where placement eliminates resource-reservation races. Reschedules would be upcalls from the cell, which we are trying to avoid. Because placement claims resources before booting, it should eliminate 95% (or more) of reschedules, but there will still be cases where we may want to reschedule due to unexpected transient failures. How many of those remain, and whether rescheduling for them is really useful, is an open question.

The compromise that seemed popular in the room was to select more than one host at scheduling time, claim resources on the first one, and pass the rest to the cell as alternates. If the cell needs to reschedule, the cell conductor would try one of the alternates that came as part of the original boot request, instead of asking the scheduler again.

During the discussion of this, an operator raised the concern that without reschedules, a single compute host that fails to boot 100% of the time ends up becoming a magnet for all future builds, looking like an excellent target for the scheduler, but failing anything that is sent to it. If we don't reschedule, that situation could be very problematic. An idea came out that we should really have the compute host monitor itself and disable itself once the number of _consecutive_ build failures crosses a threshold. That would mitigate/eliminate the "fail magnet" behavior and further reduce the need for retries. A patch has been proposed for this, and so far enjoys wide support [2].

We also discussed the transition to counting quotas, and what that means for operators. The room seemed in favor of this, and discussion was brief.

Finally, I made the call for people with reasonably-sized pre-prod environments to begin testing cellsv2 to help prove it out and find the gremlins. CERN and NeCTAR specifically volunteered for this effort.
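To make the two ideas above a bit more concrete, here is a rough Python sketch of how they could fit together. It is not the code from [2] and not Nova's actual implementation: ComputeHost, BuildFailed, build_with_alternates, and the failure_threshold parameter are all hypothetical names made up for illustration, and treating "crosses a threshold" as "greater than or equal to" is an assumption.

```python
class BuildFailed(Exception):
    """Raised when a (simulated) compute host fails to boot an instance."""


class ComputeHost:
    """Hypothetical stand-in for a compute host that tracks its own failures."""

    def __init__(self, name, boot, failure_threshold=3):
        self.name = name
        self._boot = boot  # callable(instance); raises BuildFailed on failure
        self.failure_threshold = failure_threshold  # assumed tunable
        self.consecutive_failures = 0
        self.enabled = True

    def build(self, instance):
        try:
            self._boot(instance)
        except BuildFailed:
            self.consecutive_failures += 1
            # Assumption: the host disables itself once the count reaches
            # the threshold, so it stops attracting new builds.
            if self.consecutive_failures >= self.failure_threshold:
                self.enabled = False
            raise
        # Any successful build resets the count; only consecutive
        # failures matter.
        self.consecutive_failures = 0


def build_with_alternates(instance, hosts):
    """Try the claimed host first, then each alternate in order.

    `hosts` is the list chosen at scheduling time. The scheduler is never
    asked again, which is the point of passing alternates to the cell.
    Returns the host that built the instance, or None if all failed.
    """
    for host in hosts:
        if not host.enabled:
            continue  # a disabled host is no longer a valid target
        try:
            host.build(instance)
            return host
        except BuildFailed:
            continue  # fall through to the next alternate
    return None


def always_fails(instance):
    raise BuildFailed(instance)


def always_works(instance):
    pass


# A "fail magnet" host listed first, with one healthy alternate.
bad = ComputeHost("bad", always_fails, failure_threshold=2)
good = ComputeHost("good", always_works)
for i in range(3):
    chosen = build_with_alternates(f"vm-{i}", [bad, good])
```

In this toy run the failing host absorbs two attempts, disables itself, and is skipped for the third build, while every instance still lands on the alternate. A real implementation would also have to deal with claiming resources on whichever alternate is used, which the sketch ignores.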
[1] https://etherpad.openstack.org/p/BOS-forum-cellsv2-developer-community-coordination
[2] https://review.openstack.org/#/c/463597/

--Dan