I think these are two adjacent discussions. Let's focus on getting the CI in a state where we can make it deployable first.
Pedro Larroy <pedro.larroy.li...@gmail.com> wrote on Sat, Sep 28, 2019, 00:01:

> We will address the shortcomings that Marco outlined by using a pipeline to deploy the CI infrastructure, which will allow for contributions and easy redeployment and rollback in the case of issues.
>
> I would recommend planning a migration towards Drone IO or similar, with an initial prototype to validate that the main use cases are covered.
>
> Pedro.
>
> On Thu, Sep 19, 2019 at 2:29 PM Sheng Zha <zhash...@apache.org> wrote:
>
> > Hi Marco,
> >
> > Thank you for sharing the insights. The discussion is intended for setting goals so that future design improvements to the CI can take these goals into consideration. Thus, while I fully recognize that there could be difficulty in implementation, I'd still like to confirm with the community whether the outlined access control recommendation is at the right level.
> >
> > To summarize your concerns:
> > - opening up access control should be conditioned on having good version control and a roll-back mechanism to ease the operational burden from breakage, which is more likely given a larger user base.
> > - upgrades to the system would be better managed as planned and collective efforts instead of ad hoc tasks performed by uncoordinated individuals.
> >
> > You also mentioned that "changes to the system should only be done by the administrators". The intention of this thread is exactly to define who would qualify as administrators. Currently, such qualification is opaque, and only happens within a group in Amazon.
> >
> > On the other hand, the current approach can cause, and already has caused, friction. When this project's daily activity of validating and merging code is affected due to the system's instability, the community members have no choice but to wait for the issues to be resolved by the current system administrators.
> > Other affected community members have no way to help even if they wish to.
> >
> > Given the existing Apache project governance model, I'd recommend that the goal for CI access control be set so that any committer or PMC member who wishes to be involved has the right to help.
> >
> > -sz
> >
> > On 2019/09/17 12:49:20, Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > > Ah, with regards to #1 and #2: Currently, we don't have any plugins that control the actions of a single user and allow us to monitor and rate limit them. Just giving trigger permission (which is also tied to abort permission, if I recall correctly) would allow a malicious user to start a huge number of jobs and thus either create immense costs or bring down the system. Also, we'd have to check how we can restrict the trigger permission to specific jobs.
> > >
> > > -Marco
> > >
> > > On Tue, Sep 17, 2019 at 2:47 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > >
> > > > Hi Sheng,
> > > >
> > > > while I'm in general all in favour of widening the access to distribute the tasks, the situation around the CI system in particular is a bit more difficult.
> > > >
> > > > As far as I know, the creation of the CI system is neither automated, versioned nor backed up or safeguarded. This means that if somebody makes a change that breaks something, we're left with a broken system we can't recover from. Thus, I preferred in the past to restrict the access as much as possible (at least to Prod) to prevent these situations from happening. While #1 and #2 are already possible today (we have two roles for committers and regular users that allow this already), #3 and #4 come with a significant risk for the stability of the system.
> > > >
> > > > As soon as a job is added or changed, a lot of things happen in Jenkins - one of these tasks is the SCM scan, which tries to determine the branches the job should run on. For somebody who is inexperienced, the first pitfall is that suddenly hundreds of jobs are being spawned, which will certainly overload Jenkins and render it unusable. There are a lot of tricks and I could elaborate on them, but basically the bottom line is that the configuration interface of Jenkins is far from fail-proof and poses a significant risk if accessed by somebody who doesn't exactly know what they're doing - that is, we would need to design some kind of training, and even that would not safeguard us from these fatal events.
> > > >
> > > > There's the whole security aspect around user-facing artifact generation of CI/CD and the possibility of them being tampered with, but I don't think I have to elaborate on that.
> > > >
> > > > With regards to #4 especially, I'd say that somebody just upgrading the system or changing plugins carries an even bigger risk. Plugins are notoriously unsafe, and system updates have also been shown not to really go like a breeze. I'd argue that changes to the system should only be done by the administrators of it, since they have a better overview of all the things that are currently going on while also having the full access (backups before making changes, SSH access, log access, metric access, etc.) to debug errors. In the end we shouldn't forget that this is a production system - usually, you'd have nobody being able to touch it at all, but we're not in a perfect world, so I'd say we should restrict it to a bare minimum in the form of admins.
> > > >
> > > > So while I certainly understand and encourage distributing the access, I don't feel comfortable widening the access to such a critical production system. It being down means that the GitHub development is fully halted, which is really problematic since we don't have rollback mechanisms.
> > > >
> > > > Best regards,
> > > > marco
> > > >
> > > > On Sun, Sep 15, 2019 at 6:40 AM Sheng Zha <zhash...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'd like to initiate discussion on how access control should be managed for the CI system. The hope is that we can present the conclusion of this discussion as the recommendation and request to the donors of the CI system from Amazon.
> > > > >
> > > > > The specific aspects I'd like to discuss are the abilities to:
> > > > > 1. trigger PR validation and nightly jobs.
> > > > > 2. trigger continuous delivery jobs, such as for binary releases in pip, maven, and dockerhub.
> > > > > 3. add jobs to the CI system.
> > > > > 4. maintain and manage the CI system, such as system upgrades and Jenkins plugin installation.
> > > > >
> > > > > Given that we already have GitHub SSO enabled on the Jenkins CI, I suggest the following authentication levels for these items:
> > > > > 1. all authenticated GitHub users.
> > > > > 2-4. all MXNet committers
> > > > >
> > > > > What do you think? If you have more aspects that you wish to discuss, feel free to propose.
> > > > >
> > > > > -sz