Re: Keeping on top of test failures

2021-09-13 Thread Joshua McKenzie
Closed out in bulk with a comment (liking that Auto Closed resolution),
looks like I managed not to accidentally email everyone on each update, and
will be looking to get the process into the website soon.

~Josh

On Sat, Sep 11, 2021 at 2:52 AM Berenguer Blasi 
wrote:

> +100 to closing anything that old after the big 4.0 push
>
> On 10/9/21 18:21, Joshua McKenzie wrote:
> > Thanks for the feedback everyone. Drafting site changes now and I'll pull
> > the trigger on JIRA probably Monday; give people the weekend to chew on
> > this.
> >
> > If I open up the window to 52 weeks, we still only have 13 of the test
> > failure tickets being created in that window. Figure it's probably safe
> > to close out year-old flaky failure tickets.
> >
> > ~Josh
> >
> > On Thu, Sep 9, 2021 at 5:01 PM David Capwell wrote:
> >
> >> +1
> >>
> >>> On Sep 9, 2021, at 10:27 AM, Mick Semb Wever  wrote:
> >>>
> >>> +1, much appreciated.
> >>>
> >>>
> >>> On 2021/09/09 16:03:31, Andrés de la Peña wrote:
> >>>> +1, thanks for the proposal.
> >>>>
> >>>> On Thu, 9 Sept 2021 at 16:45, Brandon Williams wrote:
> >>>>> +1
> >>>>>
> >>>>> On Thu, Sep 9, 2021 at 10:39 AM Joshua McKenzie <jmcken...@apache.org>
> >>>>> wrote:
> >>>>>> (Taking #cassandra-dev slack chat to here)
> >>>>>>
> >>>>>> For context, we have a long history of an ebb and flow of flaky test
> >>>>>> failures building up and getting burned down, but don't really have a
> >>>>>> workflow or discipline around having a clean snapshot of where we are or
> >>>>>> attempting to stay at some kind of steady state. We have thousands of
> >>>>>> tests executing in a wide variety of environments: this state is to be
> >>>>>> expected, but I argue needs to be actively managed so we don't get into
> >>>>>> the kind of situation we did with 4.0 again.
> >>>>>>
> >>>>>> I threw together a couple of JIRA queries that paint a pretty navigable
> >>>>>> picture IMO:
> >>>>>>
> >>>>>> Total JIRA for test failures:
> >>>>>> https://issues.apache.org/jira/issues/?filter=12350869=project%20%3D%20Cassandra%20AND%20resolution%20%3D%20unresolved%20AND%20(summary%20~%20flaky%20OR%20summary%20~%20test%20OR%20component%20%3D%20%22Test%2Funit%22)%20AND%20type%20%3D%20bug%20AND%20issuekey%20not%20in%20(CASSANDRA-16010%2C%20CASSANDRA-16024%2C%20CASSANDRA-16022%2C%20CASSANDRA-16021%2C%20CASSANDRA-16025%2C%20CASSANDRA-16023)%20AND%20summary%20!~%20hardening%20ORDER%20BY%20cf%5B12313825%5D%20ASC
> >>>>>> (sorry for the URL) - 112 failures
> >>>>>>
> >>>>>> # of failures more recent than 6 months:
> >>>>>> https://issues.apache.org/jira/issues/?filter=12350869
> >>>>>> 10 failures.
> >>>>>>
> >>>>>> In the interest of tidying this up and staying on top of it going
> >>>>>> forward, I propose the following:
> >>>>>> 1. We close as won't fix all test failures created >= 6 months ago (We
> >>>>>> had a big push for 4.0 and a lot of this JIRA content is stale)
> >>>>>> 2. We switch the "Bug Category" for these 10 more recent to
> >>>>>> "Correctness - Test Failure"
> >>>>>> 3. We document a "canonical" workflow around test failures that links
> >>>>>> to a saved JIRA filter query that includes:
> >>>>>> 4. When you're working on something and you see a test failure that
> >>>>>> isn't related to your patch, check that filter, see if the test name is
> >>>>>> there, and if not create a new ticket w/that Bug Category
> >>>>>>
> >>>>>> In theory this should give us a single source of truth for documented
> >>>>>> test failures as well as an entry point for new contributors.
> >>>>>>
> >>>>>> Thoughts?
> >>>>>>
> >>>>>> ~Josh
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: dev-h...@cassandra.apache.org
> >
> >
>
>
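For readers of the archive: the percent-encoded filter URL quoted in the original proposal is hard to read, and step 4 of the proposal is essentially a lookup. The following is a minimal sketch only, using Python's standard library. The encoded string is copied from the quoted URL as-is; the helper names and the example ticket summaries are hypothetical and not part of any JIRA API.

```python
from urllib.parse import unquote

# Percent-encoded query portion copied from the JIRA URL quoted in the thread.
ENCODED_JQL = (
    "project%20%3D%20Cassandra%20AND%20resolution%20%3D%20unresolved%20AND%20"
    "(summary%20~%20flaky%20OR%20summary%20~%20test%20OR%20"
    "component%20%3D%20%22Test%2Funit%22)%20AND%20type%20%3D%20bug%20AND%20"
    "issuekey%20not%20in%20(CASSANDRA-16010%2C%20CASSANDRA-16024%2C%20"
    "CASSANDRA-16022%2C%20CASSANDRA-16021%2C%20CASSANDRA-16025%2C%20"
    "CASSANDRA-16023)%20AND%20summary%20!~%20hardening%20"
    "ORDER%20BY%20cf%5B12313825%5D%20ASC"
)


def decode_jql(encoded: str) -> str:
    """Turn the percent-encoded query string into readable JQL."""
    return unquote(encoded)


def needs_new_ticket(test_name: str, open_summaries: list[str]) -> bool:
    """Step 4 of the proposal (hypothetical helper): file a new ticket
    only if no open test-failure ticket already mentions the test name."""
    needle = test_name.lower()
    return not any(needle in summary.lower() for summary in open_summaries)


if __name__ == "__main__":
    print(decode_jql(ENCODED_JQL))
    # Hypothetical summaries, standing in for the saved filter's results.
    known = ["Flaky ExampleRepairTest.testFoo", "Test failure in ExampleReadTest"]
    print(needs_new_ticket("ExampleRepairTest", known))
    print(needs_new_ticket("ExampleWriteTest", known))
```

In practice the list of open summaries would come from the saved filter itself; the substring match here is only a stand-in for checking "see if the test name is there" by eye.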


Re: Defining which code changes target which release types

2021-09-13 Thread Joshua McKenzie
>
> Where I think there becomes a grey area is on refactoring

ISTM we have two types of refactors:
1) Improving logic, tracking, state machines, etc (behavior, invasive)
2) Should be opaque to end user with zero change to behavior (hygiene, code
organization)

Feature flag on #1 as it's kind of a new feature / new way of operating,
deferred to major on #2 as we have a complex code-base where refactors
often have side-effects that should be more thoroughly vetted like a major?

Take as a loose set of ideas; potentially bad ones.

~Josh
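
One way to picture the "feature flag on #1" idea: a behavior-changing refactor ships next to the old code path, and an operator-controlled switch picks between them at call time. This is a rough sketch under assumptions; `RefactorFlags`, `use_new_repair_coordination` and the two `coordinate_repair_*` functions are hypothetical names for illustration, not Cassandra settings or APIs.

```python
from dataclasses import dataclass


@dataclass
class RefactorFlags:
    # Hypothetical operator-controlled switch; NOT a real Cassandra setting.
    # Defaults to the old behavior so an upgrade alone changes nothing.
    use_new_repair_coordination: bool = False


def coordinate_repair_legacy(ranges):
    # Stand-in for the existing, well-exercised code path.
    return [("legacy", r) for r in ranges]


def coordinate_repair_refactored(ranges):
    # Stand-in for the behavior-changing refactor (type #1 above).
    return [("refactored", r) for r in ranges]


def coordinate_repair(ranges, flags):
    # The flag is read on every call, so flipping it changes behavior
    # without a code change.
    if flags.use_new_repair_coordination:
        return coordinate_repair_refactored(ranges)
    return coordinate_repair_legacy(ranges)


if __name__ == "__main__":
    flags = RefactorFlags()
    print(coordinate_repair(["r1", "r2"], flags))
    flags.use_new_repair_coordination = True
    print(coordinate_repair(["r1", "r2"], flags))
```

The cost David describes in the quoted message still applies: both paths have to be kept working and tested for as long as the flag exists.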

On Mon, Sep 13, 2021 at 10:28 AM Ekaterina Dimitrova 
wrote:

> I don’t think we can or we should cover every particular case but this is a
> good baseline/guideline and we should encourage people to hit the mailing
> list when there is uncertainty.
> My understanding is that this document will support also the initially
> mentioned one where I saw something that probably partially addresses
> David’s concerns but it is said as something tbd as far as I understand:
> https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle
>
> “Compatibility between major versions - Content we have so far based on the
> feedback - Developer community will try not to make any backwards
> incompatible changes as much as possible, except in extreme cases like to
> ensure correctness. Introducing a backward incompatibility change needs dev
> community approval via voting [voting open for 48 hours].”
>
> On Fri, 10 Sep 2021 at 14:29, David Capwell 
> wrote:
>
> > > I believe that always having a feature flag for every new feature might
> > > be too complicated in practice for different reasons.
> > > Some new features might be low impact like new nodetool commands or new
> > > virtual tables and adding flags for those might simply be extra
> > > complication for the developers and users.
> > > For some other features it might be simply too hard to hide them behind
> > > feature flags.
> > >
> > > Feature flag basically means "experimental" so it would be good when a
> > > feature flag is introduced to also have a clear plan on when and how the
> > > flag will be removed.
> > >
> > > I would personally limit the feature flag to significant new features. As
> > > those types of features now require a CEP, we could make the feature flag
> > > discussion part of the CEP discussion.
> > >
> > > What do you think?
> >
> > Personally I run with the idea we should default to “you need a feature
> > flag” and special case places which do not need one; if we start with
> > “significant new feature” every feature will be argued that it isn’t
> > “significant enough” or that offering one would be “too complex”.  I would
> > argue tables/nodetool act more like a feature flag so these examples
> > shouldn’t cause us to weaken the notion of a feature flag, as they do not
> > impact you unless you opt into them… which is what a feature flag does.
> >
> > > For some other features it might be simply too hard to hide them behind
> > > feature flags.
> >
> > In my experience these types of features get a feature flag after the fact
> > or operators/users are warned not to use them… While working on
> > CASSANDRA-16850 it was really annoying to support flags as I need to keep
> > track of state both at the coordinator and the replica to support this, and
> > at each check’s level (we also do not have a notion of a query context or
> > what actor is doing the action… which makes this even more painful to do);
> > this drastically increased my testing scope.  This was still important to
> > do as after it is deployed it could cause a negative impact to operators or
> > users, so being able to act without code changes is important.
> >
> > Where I think there becomes a grey area is on refactoring… for example I
> > have put in a lot of work refactoring repair coordination and I plan to do
> > a lot more… do I support falling back to old logic or old behavior?  In
> > CASSANDRA-16909 I document a lot of places which are buggy and have shown
> > to cause production issues… is the “fix” actually a “new feature” (fun
> > example that happens on prod from time to time… we drop the merkle tree and
> > hang forever… we could make this recoverable but is that a feature or a bug
> > fix)?  Should this go into a prior release?
> >
> >
> > > On Sep 10, 2021, at 9:25 AM, Joshua McKenzie wrote:
> > >
> > > I put together a gdoc documenting what was in this thread - should be
> > > open to comment for everyone:
> > > https://docs.google.com/document/d/1LhCNcbuhtqTkv_aKx1TQUgWEcq022fsAZs1C_oOxEJw/edit
> > >
> > > I'll let this thread sit to early next week and assuming no major
> > > concerns we'll get that into either the wiki or the site or both.
> > >
> > > Thanks everyone for the feedback!
> > >
> > > ~Josh
> > >
> > > On Thu, Sep 2, 2021 at 9:57 AM Joshua McKenzie wrote:
> > >
> > >> Feature flag basically means "experimental"
> > >>
> > >> I'm thinking of feature 

Re: [VOTE] CEP-13: Denylisting partitions

2021-09-13 Thread Sumanth Pasupuleti
With 6 (six) +1 votes and no -1 votes, the vote passes. Thanks everyone!
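
For reference, the passing rule stated in the call for votes quoted below (at least three binding +1s and no binding vetoes) can be written as a small check. This is a sketch only: the thread does not say which of the six votes were binding, so treating all six as binding in the example is an assumption, and `vote_passes` is a hypothetical helper, not an ASF tool.

```python
def vote_passes(votes, min_binding_plus_ones=3):
    """votes: iterable of (value, is_binding) pairs, value being +1, 0 or -1.

    Passes with at least `min_binding_plus_ones` binding +1s and no
    binding -1 (veto). Non-binding votes do not affect the outcome.
    """
    binding = [value for value, is_binding in votes if is_binding]
    return binding.count(1) >= min_binding_plus_ones and -1 not in binding


if __name__ == "__main__":
    # Assumption for illustration: all six +1 votes are binding.
    print(vote_passes([(1, True)] * 6))
    # Three binding +1s plus one binding veto does not pass.
    print(vote_passes([(1, True)] * 3 + [(-1, True)]))
```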

On Sat, Sep 11, 2021 at 11:41 PM Jordan West  wrote:

> +1
>
> On Wed, Sep 8, 2021 at 11:38 AM Chris Lohfink 
> wrote:
>
> > +1
> >
> > On Wed, Sep 8, 2021 at 11:58 AM bened...@apache.org wrote:
> >
> > > +1
> > >
> > > From: Brandon Williams 
> > > Date: Wednesday, 8 September 2021 at 17:57
> > > To: dev@cassandra.apache.org 
> > > Subject: Re: [VOTE] CEP-13: Denylisting partitions
> > > +1
> > >
> > > On Wed, Sep 8, 2021 at 11:31 AM Sumanth Pasupuleti
> > >  wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > > I’m proposing this CEP for approval.
> > > >
> > > > > Proposal:
> > > > > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-13%3A+Denylisting+partitions
> > > > > Discussion:
> > > > > https://lists.apache.org/thread.html/r1547c5f2fb8548e2f7dcbe1a26da8c2a95ebec81adeeb2ea0545924d%40%3Cdev.cassandra.apache.org%3E
> > > > >
> > > > > The vote will be open for 72 hours.
> > > > > Votes by committers are considered binding.
> > > > > A vote passes if there are at least three binding +1s and no binding
> > > > > vetoes.
> > > >
> > > > Thanks,
> > > > Sumanth
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > > For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >
> >
>


Re: [DISCUSS] Diagnostic events in virtual tables

2021-09-13 Thread Chris Lohfink
Perhaps re-add the settings virtual table mutability. That way the same
place can be used to update settings at runtime for multiple things instead
of creating a new virtual table per service we want to make hot props for.

Might be kinda nice to allow REGISTER and EVENT CQL events to be created
with virtual tables as well for some extended functionality.

Chris

On Fri, Sep 10, 2021 at 2:36 AM Stefan Miklosovic <
stefan.mikloso...@instaclustr.com> wrote:

> Hi Mick,
>
> I returned to this after some time and here are my questions about this.
>
> I am waiting for 16806 to be merged which introduces abstract mutable
> vtables (1) on top of which I want to build what you have proposed.
> I do not think we need a non-virtual table for this and this is
> actually super handy in this case because we can react on updates /
> inserts / deletes and
> subscribe / unsubscribe an event to DiagnosticsService while modifying
> that vtable. Otherwise, I do not see an easy and straightforward way
> to react
> to our modifications to that table (maybe via QueryHandler but that is
> quite cumbersome and not too performance friendly).
>
> The question I have is a rather semantic one. If we enable / disable
> events via this table, a user will suddenly have two ways to subscribe
> - via JMX and via CQL. Is this ok?
> If one subscribes via JMX, this subscription is not propagated to the
> underlying CQL table. So she might subscribe to 5 events but there
> would be none in vtable. On the other hand,
> if we subscribe via CQL, that will populate some maps in
> DiagnosticsService / DiagnosticsPersistence. Hence, my concern is
> about having this discrepancy between what we see
> in vtable and what is enabled via JMX path. How would you address this?
>
> Regards
>
> (1) https://github.com/apache/cassandra/pull/1117/files
>
> On Sat, 24 Jul 2021 at 18:59, Mick Semb Wever  wrote:
> >
> > >
> > > I am not sure yet how the implementation in case of virtual tables
> > > fits into the overall picture of "pluggability".
> >
> >
> >
> > Yeah, it was a goal of the design to make writing new types as easy as
> > possible, so having to wire up a new vtable for each new event type works
> > against that.
> >
> > I'd be inclined to start with just two tables: one "diagnostic_events" that
> > lists all events, and another "diagnostic_service" which lists what's
> > enabled (and allows you to enable them live).  With the design of
> > Diagnostic Events as subscribe and pull, it would make sense (to me) if
> > enabling an event class in the diagnostic_service table ('update
> > diagnostic_service set enabled = true where class = '') then
> > also does a subscribesAll with a noop consumer.
> >
> > Extending/configuring the limit makes sense as part of the binlog
> > implementation. I'm hesitant about allowing users to increase the
> > in-memory store.
>
>
>
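
The JMX-versus-CQL discrepancy Stefan describes above can be reproduced with a toy model. This is only an illustrative sketch: `ToyDiagnosticService`, `StoredRowsTable`, `derived_rows` and the event class names are hypothetical and do not correspond to Cassandra's actual classes or virtual table API. The last function shows one possible way to avoid the mismatch (deriving rows from the service state instead of storing them), which is an assumption of this sketch and not something proposed in the thread.

```python
class ToyDiagnosticService:
    """Hypothetical stand-in for the service that tracks subscriptions."""

    def __init__(self):
        self._subscribed = set()

    def subscribe(self, event_class):
        self._subscribed.add(event_class)

    def unsubscribe(self, event_class):
        self._subscribed.discard(event_class)

    def subscribed(self):
        return set(self._subscribed)


class StoredRowsTable:
    """Toy mutable table that keeps its own rows and reacts to updates."""

    def __init__(self, service):
        self._service = service
        self._rows = {}

    def update(self, event_class, enabled):
        # Mirrors "update ... set enabled = true": store the row, then
        # subscribe or unsubscribe in the service as a side effect.
        self._rows[event_class] = enabled
        if enabled:
            self._service.subscribe(event_class)
        else:
            self._service.unsubscribe(event_class)

    def select(self):
        return dict(self._rows)


def derived_rows(service):
    # Alternative: compute rows from the service on every read, so both
    # subscription paths show up in the same view.
    return {event_class: True for event_class in service.subscribed()}


if __name__ == "__main__":
    service = ToyDiagnosticService()
    table = StoredRowsTable(service)
    table.update("ExampleEventA", True)   # CQL-style path
    service.subscribe("ExampleEventB")    # JMX-style path bypasses the table
    print(table.select())
    print(derived_rows(service))
```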


Re: Defining which code changes target which release types

2021-09-13 Thread Ekaterina Dimitrova
I don’t think we can or we should cover every particular case but this is a
good baseline/guideline and we should encourage people to hit the mailing
list when there is uncertainty.
My understanding is that this document will support also the initially
mentioned one where I saw something that probably partially addresses
David’s concerns but it is said as something tbd as far as I understand:
https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle

“Compatibility between major versions - Content we have so far based on the
feedback - Developer community will try not to make any backwards
incompatible changes as much as possible, except in extreme cases like to
ensure correctness. Introducing a backward incompatibility change needs dev
community approval via voting [voting open for 48 hours].”

On Fri, 10 Sep 2021 at 14:29, David Capwell 
wrote:

> > I believe that always having a feature flag for every new feature might
> > be too complicated in practice for different reasons.
> > Some new features might be low impact like new nodetool commands or new
> > virtual tables and adding flags for those might simply be extra
> > complication for the developers and users.
> > For some other features it might be simply too hard to hide them behind
> > feature flags.
> >
> > Feature flag basically means "experimental" so it would be good when a
> > feature flag is introduced to also have a clear plan on when and how the
> > flag will be removed.
> >
> > I would personally limit the feature flag to significant new features. As
> > those types of features now require a CEP, we could make the feature flag
> > discussion part of the CEP discussion.
> >
> > What do you think?
>
> Personally I run with the idea we should default to “you need a feature
> flag” and special case places which do not need one; if we start with
> “significant new feature” every feature will be argued that it isn’t
> “significant enough” or that offering one would be “too complex”.  I would
> argue tables/nodetool act more like a feature flag so these examples
> shouldn’t cause us to weaken the notion of a feature flag, as they do not
> impact you unless you opt into them… which is what a feature flag does.
>
> > For some other features it might be simply too hard to hide them behind
> > feature flags.
>
> In my experience these types of features get a feature flag after the fact
> or operators/users are warned not to use them… While working on
> CASSANDRA-16850 it was really annoying to support flags as I need to keep
> track of state both at the coordinator and the replica to support this, and
> at each check’s level (we also do not have a notion of a query context or
> what actor is doing the action… which makes this even more painful to do);
> this drastically increased my testing scope.  This was still important to
> do as after it is deployed it could cause a negative impact to operators or
> users, so being able to act without code changes is important.
>
> Where I think there becomes a grey area is on refactoring… for example I
> have put in a lot of work refactoring repair coordination and I plan to do
> a lot more… do I support falling back to old logic or old behavior?  In
> CASSANDRA-16909 I document a lot of places which are buggy and have shown
> to cause production issues… is the “fix” actually a “new feature” (fun
> example that happens on prod from time to time… we drop the merkle tree and
> hang forever… we could make this recoverable but is that a feature or a bug
> fix)?  Should this go into a prior release?
>
>
> > On Sep 10, 2021, at 9:25 AM, Joshua McKenzie wrote:
> >
> > I put together a gdoc documenting what was in this thread - should be
> > open to comment for everyone:
> > https://docs.google.com/document/d/1LhCNcbuhtqTkv_aKx1TQUgWEcq022fsAZs1C_oOxEJw/edit
> >
> > I'll let this thread sit to early next week and assuming no major
> > concerns we'll get that into either the wiki or the site or both.
> >
> > Thanks everyone for the feedback!
> >
> > ~Josh
> >
> > On Thu, Sep 2, 2021 at 9:57 AM Joshua McKenzie wrote:
> >
> >> Feature flag basically means "experimental"
> >>
> >> I'm thinking of feature flags more as giving the power to operators to
> >> decide what they do and don't allow users of the database access to. Even
> >> if a feature is very stable and non-experimental, it can have negative
> >> effects on other use-cases on a shared cluster, be incompatible with the
> >> underlying execution environment, be outside compliance policies of an
> >> organization, require greater expertise to use correctly, etc.
> >>
> >> That said, I 100% agree w/you on the "limit it to significant new
> >> features". I don't think feature flagging nodetool commands makes a lot of
> >> sense. :)
> >>
> >> Adding it to the CEP template as something to yes/no on would be a simple
> >> clarification for this. +1
> >>
> >> ~Josh
> >>
> >>
> >> On Thu, Sep 2, 2021 at 3:14 AM Benjamin Lerer wrote:
>