Re: Kubernetes Operator: Can We Preserve CassKop's Flexibility?

2020-10-10 Thread Christopher Bradford
Hey Tom,

You make some great points. I agree that there is an ecosystem of tooling
surrounding cass-operator, but those tools are narrowly focused on
particular tasks. Ideally, as community initiatives like CEP-1
graduate, they can replace these components where appropriate. Right now
the documentation for cass-operator exists in a few separate directories
under the GitHub repo. Guidance around specific topics and some more
reference material is sorely needed. There is currently an effort underway
to organize existing documentation and build out a platform for new
content.

In your message you reference support for a pre-run script. We have this
functionality in cass-operator via the spec.podTemplateSpec.initContainers
field. The cass-config-builder container is prepended to this list, ensuring
that all rendered configuration files are available on disk for any
subsequent containers to modify / leverage. There are some cass-operator
users that schedule their own containers to pull secrets from external
systems like Vault and inject them into the configs.
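
For readers unfamiliar with that field, here is a minimal sketch of what it looks like in a CassandraDatacenter manifest. The init container name, image, and the secret-injection step are hypothetical placeholders (exact volume and mount names may vary by cass-operator version):

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: example-cluster
  serverType: cassandra
  serverVersion: "3.11.7"
  size: 3
  podTemplateSpec:
    spec:
      initContainers:
        # cass-operator prepends its cass-config-builder container ahead
        # of this list, so rendered configuration files are already on
        # disk by the time any container defined here runs.
        - name: inject-secrets                # hypothetical example
          image: registry.example.com/secret-injector:latest
          volumeMounts:
            - name: server-config             # volume holding rendered configs
              mountPath: /config
```

Anything such a container modifies under the rendered-config mount is then picked up by the Cassandra container when it starts.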

~Chris

Christopher Bradford



On Wed, Oct 7, 2020 at 11:00 PM Cyril Scetbon  wrote:

> Thank you, Tom, for your support. As one of the main contributors of
> CassKop, I’m happy to see that the efforts we put into it to support as
> many configurations as possible are well appreciated.
>
> When we first started to talk about creating a Kubernetes operator, we
> always mentioned the features that we added and the importance of trying to
> fulfill the needs of every user. All those choices have a reason: a
> situation that happened in production, a configuration that we used to
> apply to some of our clusters, or a situation that could potentially happen
> and that we needed to overcome. One example is the fact that IPs can
> change when a Kubernetes node restarts, and IPs can even be exchanged
> between 2 nodes of the same cluster. We implemented a detection algorithm
> that, when it sees this happening, restarts the affected pods so they get
> new IPs, which solves the problem:
> https://orange-opensource.github.io/casskop/docs/3_configuration_deployment/9_advanced_configuration#cross-ip-management
>
> The features we added solved use cases that happened in production or that
> could happen due to the environment, and we tried to make them as simple
> and intuitive as possible. We also put a lot of effort into the
> documentation, which is not perfect but serves the purpose of explaining
> and detailing how to use CassKop.
>
> Soon, when we start talking about porting our features, we’ll of course
> stress the importance of making the operator as open as possible (tbh we
> had in mind to make it work with any recent Cassandra version, and even
> ScyllaDB), as well as simple, configurable, and adaptable. Of course, not
> all versions are supported even by CassKop, because we make some Jolokia
> calls, and if the JMX beans change, some important operations could stop
> working. (We check that a datacenter has no data replicated to it before
> decommissioning it, for instance:
> <https://orange-opensource.github.io/casskop/docs/5_operations/1_cluster_operations#updatescaledown>.)
>
> I had a few discussions with some of the cass-operator developers, and I
> think we understood each other. We know that in order for it to be adopted
> and for the work to be fruitful, no feature should be lost along the way,
> and if there is a better way to do things we’ll find it together. Orange
> also uses CassKop and will keep using it as long as the crucial features
> are not available elsewhere. We’ll also have to find a way to migrate from
> CassKop to cass-operator without breaking everything. But let’s start
> walking before running.
>
> —
> Cyril Scetbon
>
> > On Oct 7, 2020, at 2:23 PM, Tom Offermann 
> wrote:
> >
> > I've been following the discussion about Kubernetes operators with a
> great
> > deal of interest. At New Relic, we're about to move our Cassandra
> Clusters
> > from bare-metal hosts in our datacenters to Kubernetes clusters in AWS,
> so
> > we've been looking closely at the current operators.
> >
> > Our goals:
> >
> > * Don't write our own operator.
> >
> > * Choose the community standard, if possible. If not possible, choose an
> > operator with active development, usage, and community.
> >
> > * Choose an operator that can work with our existing way of managing
> > clusters. Most significantly, at New Relic we do not use virtual nodes in
> > our Cassandra clusters. Instead, we continue to assign initial_tokens to
> > individual nodes. While we certainly don't expect an operator to support
> > this use case by default, we do hope that an operator will make it
> > possible.
> >
> > * Don't run a forked version of the operator.
> >
> > Both [cass-operator][1] and [CassKop][2] worked very well and we were
> > really impressed with both of them. Heading into the evaluation, we
> > expected to 

Re: Supported upgrade path for 4.0

2020-10-10 Thread Mick Semb Wever
> "3.11 performs close to parity with 2.1/2.2. 3.0 does not. If we recommend
> people upgrade from 2.1 -> 3.0 -> 4.0, we are asking them to have a cluster
> in a regressed performance state for potentially months as they execute
> their upgrade."
>
> Did I get anything wrong here Mick? ^
>


That's correct, Josh.

From tickets like those listed, and from experience, we recommend folk
avoid 3.0 altogether. This has only been made more evident by witnessing
the benefits from 3.0 → 3.11 upgrades.

My recommendation remains 2.*→3.11→4.0. And I don't believe I'm alone.
Though if a user was already on 3.0, then I would (of course) recommend an
upgrade directly to 4.0.

I feel like I'm just splitting hairs at this point, since we have accepted
(folk willing to help with) both paths to 4.0, and I can't see how we stop
recommending 2.*→3.11 upgrades.


Re: Supported upgrade path for 4.0

2020-10-10 Thread Benedict Elliott Smith
This sounds eminently sensible to me.

On 09/10/2020, 19:42, "Joshua McKenzie"  wrote:

Fair point on uncertainties and delaying decisions until strictly required
so we have more data.

I want to nuance my earlier proposal and what we document (sorry for the
multiple messages; my time is fragmented enough these days that I only have
thin slices to engage w/stuff like this).

I think we should do a "From → To" model for both testing and supporting
upgrades and have a point of view as a project for each currently supported
version of C* in the "From" list. Specifically - we test and recommend the
following paths:

   1. 2.1 → 3.0 → 4.0
   2. 3.0 → 4.0 (subset of 1)
   3. 3.11 → 4.0

There's no value whatsoever in hopping through an interim version if a
leapfrog is expected to be as tested and stable. The only other alternative
would be to recommend 2.1 → 3.11 → 4.0 (as Mick alluded to), but that just
exposes users to more deltas from the tick-tock .X line for no added value,
as you mentioned.

We could re-apply the "from-to" testing and support model in future
releases w/whatever is supported at that time. That way users will be able
to have a single source of truth on what the project recommends and vets
for going from wherever they are to the latest.


On Fri, Oct 09, 2020 at 12:05 PM, Benedict Elliott Smith <
bened...@apache.org> wrote:

> There is a sizeable cohort of us who I expect to be primarily focused on
> 3.0->4.0, so if you have a cohort focusing primarily on 3.11->4.0 I think
> we'll be in good shape.
>
> For all subsequent major releases, we test and officially support only 1
> major back
>
> I think we should wait to see what happens before committing ourselves to
> something like this - things like release cadence etc will matter a lot.
> That is *definitely* not to say that I disagree with you, just that I 
think
> more project future-context is needed to make a decision like this. I
> expect we'll have lots more fun (hopefully positive) conversations around
> topics like this in the coming year, as I have no doubt we all want to
> evolve our approach to releases, and there's no knowing what we'll end up
> deciding (we have done some crazy things in the past).
>
> On 09/10/2020, 16:46, "Joshua McKenzie"  wrote:
>
> I think it's a clean and simple heuristic for the project to say "you can
> safely upgrade to adjacent major versions".
>
> The difficulty we face with 3.0 is that it has made many contributors very
> wary of pre 4.0 code and with good reason. Your point about conservative
> users upgrading later in a cycle resonates Benedict, and reflects on the
> confidence we should or should not have in 3.11. I think it's also
> important to realize that many cluster upgrades can take months, so it's
> not a transient exposure to unknowns in a release.
>
> I propose the following compromise:
>
> 1. For 4.0 GA, we consider the following upgrade paths "tested and
> supported": 2.1 → 3.0 → 3.11 → 4.0, and 2.1 → 3.0 → 4.0
> 2. For all subsequent major releases, we test and officially support only
> 1 major back
> 3. Any contributor can optionally meet whatever bar we set for "tested and
> supported" to allow leapfrogging versions, but we won't constrain GA on
> that.
>
> We have to pay down our debt right now, but if we have to continue to do
> this in the future we're not learning from our mistakes.
>
> Speaking for DataStax, we don't have enough resources to work through the
> new testing work on 40_quality_test, the defects that David is surfacing
> like crazy (well done!), and validating 2 major upgrade paths. If you and 
a
> set of contributors could take on the 3.0 → 4.0 path Benedict, that'd be a
> great help. I also assume we could all collaborate on the tooling / infra 
/
> approaches we use for this validation so it wouldn't be a complete 
re-work.
>
> On Fri, Oct 09, 2020 at 11:02 AM, Benedict Elliott Smith < benedict@
> apache.org> wrote:
>
> Since email is very unclear and context gets lost, I'm personally OK with
> officially supporting all of these upgrade paths, but the spectre was
> raised that this might lead to lost labour due to an increased support
> burden. My view is that 3.0->4.0 is probably a safer upgrade path for 
users
> and as a result a lower support cost to the project, so I would be happy 
to
> deprecate 3.0->3.11 if this helps alleviate the concerns of others that
> this would be costly to the project. Alternatively, if we want to support
> both but some feel 3.0->4.0 is burdensome, I would be happy to focus on
> 3.0->4.0 while they focus on the paths I would be happy to deprecate.
>
> On 09/10/2020, 15:49, "Benedict Elliott Smith"