We will address the shortcomings that Marco outlined by using a pipeline to
deploy the CI infrastructure, which will allow for contributions and easy
redeployment and rollback in case of issues.

I would recommend planning a migration towards Drone IO or similar, with an
initial prototype to validate that the main use cases are covered.

Pedro.

On Thu, Sep 19, 2019 at 2:29 PM Sheng Zha <zhash...@apache.org> wrote:

> Hi Marco,
>
> Thank you for sharing the insights. The discussion is intended for setting
> goals so that future design improvement to the CI can take these goals into
> consideration. Thus, while I fully recognize that there could be difficulty
> in implementation, I'd still like to confirm with the community if the
> outlined access control recommendation is at the right level.
>
> To summarize your concerns:
> - opening up access control should be conditioned on having good version
> control and a rollback mechanism to ease the operational burden from breakage,
> which is more likely given a larger user base.
> - upgrades to the system would be better managed as planned and collective
> efforts instead of ad hoc tasks performed by uncoordinated individuals.
>
> You also mentioned that "changes to the system should only be done by the
> administrators". The intention of this thread is exactly to define who
> would qualify as administrators. Currently, such qualification is opaque,
> and only happens within a group in Amazon.
>
> On the other hand, the current arrangement can cause friction, and already has.
> When this project's daily activity of validating and merging code is
> affected due to the system's instability, the community members have no
> choice but to wait for the issues to be resolved by the current system
> administrators. Other affected community members have no way to help even
> if they wish to.
>
> Given the existing Apache project governance model, I'd recommend that the
> goal for CI access control be set so that any committer or PMC member who
> wishes to be involved has the right to help.
>
> -sz
>
> On 2019/09/17 12:49:20, Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > Ah, with regards to #1 and #2: Currently, we don't have any plugins that
> > control the actions of a single user and allow us to monitor and rate
> > limit them. Just giving trigger permission (which is also tied with
> > abort-permission if I recall correctly), would allow a malicious user to
> > start a huge number of jobs and thus either create immense costs or bring
> > down the system. Also, we'd have to check how we can restrict the trigger
> > permission to specific jobs.
> >
> > -Marco
> >
> > On Tue, Sep 17, 2019 at 2:47 PM Marco de Abreu <marco.g.ab...@gmail.com>
> > wrote:
> >
> > > Hi Sheng,
> > >
> > > while I'm in general all in favour of widening the access to distribute the
> > > tasks, the situation around the CI system in particular is a bit more
> > > difficult.
> > >
> > > As far as I know, the creation of the CI system is not automated,
> > > versioned, backed up, or safeguarded. This means that if somebody makes a
> > > change that breaks something, we're left with a broken system we can't
> > > recover from. Thus, in the past I preferred to restrict the access as
> > > much as possible (at least to Prod) to avoid these situations. While #1
> > > and #2 are already possible today (we have two roles for committers and
> > > regular users that allow this already), #3 and #4 come with a significant
> > > risk for the stability of the system.
> > >
> > > As soon as a job is added or changed, a lot of things happen in Jenkins -
> > > one of these tasks is the SCM scan which tries to determine the branches
> > > the job should run on. For somebody who is inexperienced, the first pitfall
> > > is that suddenly hundreds of jobs are being spawned which will certainly
> > > overload Jenkins and render it unusable. There are a lot of tricks and I
> > > could elaborate on them, but basically the bottom line is that the
> > > configuration interface of Jenkins is far from fail-proof and poses a
> > > significant risk if accessed by somebody who doesn't exactly know what
> > > they're doing - that is, we would need to design some kind of training and
> > > even that would not safeguard us from these fatal events.
> > >
> > > There's the whole security aspect around user-facing artifact generation
> > > of CI/CD and the possibility of them being tampered with, but I don't think I
> > > have to elaborate on that.
> > >
> > > With regards to #4 especially, I'd say that somebody just upgrading the
> > > system or changing plugins carries an even bigger risk. Plugins are
> > > notoriously unsafe and system updates have also been shown not to go
> > > smoothly. I'd argue that changes to the system should only be done by
> > > the administrators of it since they have a better overview of all the
> > > things that are currently going on while also having full access (backups
> > > before making changes, SSH access, log access, metric access, etc.) to
> > > debug errors. In the end we shouldn't forget that this is a production
> > > system - usually, you'd have nobody being able to touch it at all, but
> > > we're not in a perfect world, so I'd say we should restrict it to a bare
> > > minimum in the form of admins.
> > >
> > > So while I certainly understand and encourage distributing the access, I
> > > don't feel comfortable widening the access to such a critical production
> > > system. It being down means that the GitHub development is fully halted,
> > > which is really problematic since we don't have rollback mechanisms.
> > >
> > > Best regards,
> > > marco
> > >
> > > On Sun, Sep 15, 2019 at 6:40 AM Sheng Zha <zhash...@apache.org> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'd like to initiate discussion on how access control should be managed
> > >> for the CI system. The hope is that we can present the conclusion of this
> > >> discussion as the recommendation and request to the donors of the CI system
> > >> from Amazon.
> > >>
> > >> The specific aspects I'd like to discuss are the abilities to:
> > >> 1. trigger PR validation and nightly jobs.
> > >> 2. trigger continuous delivery jobs, such as for binary releases in pip,
> > >> maven, and dockerhub.
> > >> 3. add jobs to the CI system.
> > >> 4. maintain and manage the CI system, such as system upgrades and jenkins
> > >> plugin installation.
> > >>
> > >> Given that we already have GitHub SSO enabled on the Jenkins CI, I
> > >> suggest the following authentication levels for these items:
> > >> 1. all authenticated GitHub users.
> > >> 2-4. all MXNet committers
> > >>
> > >> What do you think? If you have more aspects that you wish to discuss,
> > >> feel free to propose.
> > >>
> > >> -sz
> > >>
> > >
> >
>
