Re: [DISCUSS] What to do about P0/P1/flake automation Was: P1 issues report (70)

2022-06-27 Thread Kenneth Knowles
Regarding P2s and P3s getting resolved as well: pretty much every healthy
project has a backlog that grows without bound. So we do need a place to
put that backlog. I think P3 is where things tend to end up, because P2s
that do not receive a comment are automatically downgraded to P3. These may
still be resolved, but there isn't any hope that _all_ of them get resolved.
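
A minimal sketch of the P2 to P3 downgrade rule described above. The `Issue` type, the `downgrade_stale_p2` helper and the 60-day `COMMENT_WINDOW` are all assumptions for illustration; the thread does not say how long a P2 may go without a comment, nor how the real automation is written.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical window: the thread does not state the actual value.
COMMENT_WINDOW = timedelta(days=60)


@dataclass
class Issue:
    number: int
    priority: str  # "P0", "P1", "P2" or "P3"
    last_comment_at: Optional[datetime]  # None if nobody has commented
    created_at: datetime


def downgrade_stale_p2(issue: Issue, now: datetime) -> Issue:
    """Return the issue, moved to P3 if it is a P2 with no recent comment."""
    if issue.priority != "P2":
        return issue
    # Fall back to the creation time when the issue has no comments at all.
    last_activity = issue.last_comment_at or issue.created_at
    if now - last_activity > COMMENT_WINDOW:
        return Issue(issue.number, "P3", issue.last_comment_at, issue.created_at)
    return issue
```

Under these assumptions P3 only ever grows by attrition from P2, which matches the point that P3 is where the unbounded backlog ends up.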

Kenn

On Fri, Jun 24, 2022 at 10:32 AM Alexey Romanenko 
wrote:

> Thanks, Danny!
>
> On 24 Jun 2022, at 19:23, Danny McCormick 
> wrote:
>
> Sure, I put up a fix - https://github.com/apache/beam/pull/22048
>
> On Fri, Jun 24, 2022 at 1:20 PM Alexey Romanenko 
> wrote:
>
>>
>>
>> > 2. The links in this report start with api.github.* and don’t take us
>> directly to the issues.
>>
>> > Yeah Danny pointed that out as well. I'm assuming he knows how to fix
>> it?
>>
>> This is already fixed - Pablo actually beat me to it!
>> 
>>
>>
>> It also adds a colon after the URL, and some mail clients treat it as
>> part of the URL, which leads to a broken link.
>> Should we just remove the colon or add a space before it?
>>
>> —
>> Alexey
>>
>>
>> Thanks,
>> Danny
>>
>> On Thu, Jun 23, 2022 at 8:30 PM Brian Hulette 
>> wrote:
>>
>>> +1 for that proposal!
>>>
>>> > 1. P2 and P3 issues should be noticed and resolved as well. Shall we
>>> have a longer time window for the remaining untriaged or stagnant issues
>>> and include them?
>>>
>>> I worry these lists would get _very_ long and wouldn't be actionable.
>>> But maybe it's worth reporting something like "There are 376 P2's with no
>>> update in the last 6 months" with a link to a query?
>>>
>>> > 2. The links in this report start with api.github.* and don’t take us
>>> directly to the issues.
>>>
>>> Yeah Danny pointed that out as well. I'm assuming he knows how to fix it?
>>>
>>> On Thu, Jun 23, 2022 at 2:37 PM Pablo Estrada 
>>> wrote:
>>>
 Thanks. I like the proposal, and I've found the emails useful.
 Best
 -P.

 On Thu, Jun 23, 2022 at 2:33 PM Manu Zhang 
 wrote:

> Sounds good! It’s like our internal reports of JIRA tickets exceeding
> SLA time and having no response from engineers. We either resolve them or
> downgrade the priority to extend the time window.
>
> Besides,
> 1. P2 and P3 issues should be noticed and resolved as well. Shall we
> have a longer time window for the remaining untriaged or stagnant issues
> and include them?
> 2. The links in this report start with api.github.* and don’t take us
> directly to the issues.
>
>
> On Fri, Jun 24, 2022 at 04:48, Danny McCormick wrote:
>
>> That generally sounds right to me - I also would vote that we
>> consolidate to 1 email and stop distinguishing between flaky P1s and
>> normal P1s.
>>
>> So the single daily report would be:
>>
>> - Unassigned P0s
>> - P0s with no update in the last 36 hours
>> - Unassigned P1s
>> - P1s with no update in the last 7 days
>>
>> I think that will generate a pretty good list of issues that require
>> some kind of action.
>>
>> On Thu, Jun 23, 2022 at 4:43 PM Kenneth Knowles 
>> wrote:
>>
>>> Sounds good to me. Perhaps P0s > 36 hours ago (presumably they are
>>> more like ~hours for true outages of CI/website/etc) and P1s > 7 days?
>>>
>>> On Thu, Jun 23, 2022 at 1:27 PM Brian Hulette 
>>> wrote:
>>>
 I think that Danny's alternate proposal (a daily email that shows
 only issues last updated >7 days ago, and those with no assignee) fits
 well with the two goals you describe, if we include "triage needed"
 issues in the latter category. Maybe we also explicitly separate these
 two concerns in the report?


 On Thu, Jun 23, 2022 at 1:14 PM Kenneth Knowles 
 wrote:

> Forking thread because lots of people may just ignore this topic,
> per the discussion :-)
>
> (sometimes gmail doesn't fork thread properly, but here's
> hoping...)
>
> I'll add some other outcomes of these emails:
>
>  - people file P0s that are not outages and P1s that are not data
> loss and I downgrade them
>  - I randomly open up a few flaky test bugs and see if I can fix
> them really quick
>  - people file legit P0s and P1s and I subscribe and follow them
>
> Of these, only the last one seems important (not just that *I*
> follow them, but that new P0s and P1s get immediate attention from
> many eyes)
>
> So maybe one take on the goal is to:
>
>  - have new P0s and P1s evaluated quickly: P0s are an outage or
> outage-like occurrence that needs immediate remedy, and P1s need to be
> evaluated for release blocking, etc.
>  - make 

Re: [DISCUSS] What to do about P0/P1/flake automation Was: P1 issues report (70)

2022-06-24 Thread Alexey Romanenko
Thanks, Danny!

> On 24 Jun 2022, at 19:23, Danny McCormick  wrote:
> 
> Sure, I put up a fix - https://github.com/apache/beam/pull/22048 
> 
> On Fri, Jun 24, 2022 at 1:20 PM Alexey Romanenko wrote:
> 
> 
>> > 2. The links in this report start with api.github.* and don’t take us 
>> > directly to the issues.
>> 
>> > Yeah Danny pointed that out as well. I'm assuming he knows how to fix it?
>> 
>> This is already fixed - Pablo actually beat me to it! 
>> 
> It also adds a colon after the URL, and some mail clients treat it as part
> of the URL, which leads to a broken link.
> Should we just remove the colon or add a space before it?
> 
> —
> Alexey
> 
>> 
>> Thanks,
>> Danny
>> 
>> On Thu, Jun 23, 2022 at 8:30 PM Brian Hulette wrote:
>> +1 for that proposal!
>>
>> > 1. P2 and P3 issues should be noticed and resolved as well. Shall we have
>> > a longer time window for the remaining untriaged or stagnant issues and
>> > include them?
>> 
>> I worry these lists would get _very_ long and wouldn't be actionable. But 
>> maybe it's worth reporting something like "There are 376 P2's with no update 
>> in the last 6 months" with a link to a query?
>> 
>> > 2. The links in this report start with api.github.* and don’t take us 
>> > directly to the issues.
>> 
>> Yeah Danny pointed that out as well. I'm assuming he knows how to fix it?
>> 
>> On Thu, Jun 23, 2022 at 2:37 PM Pablo Estrada wrote:
>> Thanks. I like the proposal, and I've found the emails useful.
>> Best
>> -P.
>>
>> On Thu, Jun 23, 2022 at 2:33 PM Manu Zhang wrote:
>> Sounds good! It’s like our internal reports of JIRA tickets exceeding SLA
>> time and having no response from engineers. We either resolve them or
>> downgrade the priority to extend the time window.
>>
>> Besides,
>> 1. P2 and P3 issues should be noticed and resolved as well. Shall we have a
>> longer time window for the remaining untriaged or stagnant issues and
>> include them?
>> 2. The links in this report start with api.github.* and don’t take us 
>> directly to the issues.
>> 
>> 
>> On Fri, Jun 24, 2022 at 04:48, Danny McCormick wrote:
>> That generally sounds right to me - I also would vote that we consolidate to 
>> 1 email and stop distinguishing between flaky P1s and normal P1s.
>> 
>> So the single daily report would be:
>> 
>> - Unassigned P0s
>> - P0s with no update in the last 36 hours
>> - Unassigned P1s
>> - P1s with no update in the last 7 days
>> 
>> I think that will generate a pretty good list of issues that require some 
>> kind of action.
>> 
>> On Thu, Jun 23, 2022 at 4:43 PM Kenneth Knowles wrote:
>> Sounds good to me. Perhaps P0s > 36 hours ago (presumably they are more like
>> ~hours for true outages of CI/website/etc) and P1s > 7 days?
>>
>> On Thu, Jun 23, 2022 at 1:27 PM Brian Hulette wrote:
>> I think that Danny's alternate proposal (a daily email that shows only issues
>> last updated >7 days ago, and those with no assignee) fits well with the two
>> goals you describe, if we include "triage needed" issues in the latter
>> category. Maybe we also explicitly separate these two concerns in the report?
>> 
>> 
>> On Thu, Jun 23, 2022 at 1:14 PM Kenneth Knowles wrote:
>> Forking thread because lots of people may just ignore this topic, per the 
>> discussion :-)
>> 
>> (sometimes gmail doesn't fork thread properly, but here's hoping...)
>> 
>> I'll add some other outcomes of these emails:
>> 
>>  - people file P0s that are not outages and P1s that are not data loss and I 
>> downgrade them
>>  - I randomly open up a few flaky test bugs and see if I can fix them really 
>> quick
>>  - people file legit P0s and P1s and I subscribe and follow them
>> 
>> Of these, only the last one seems important (not just that *I* follow them, 
>> but that new P0s and P1s get immediate attention from many eyes)
>> 
>> So maybe one take on the goal is to:
>> 
>>  - have new P0s and P1s evaluated quickly: P0s are an outage or outage-like 
>> occurrence that needs immediate remedy, and P1s need to be evaluated for 
>> release blocking, etc.
>>  - make sure P0s and P1s get attention appropriate to their priority
>> 
>> It can also be helpful to just state the failure modes which would happen by 
>> default if we don't have a good process or automation:
>> 
>>  - Real P0 gets filed and not noticed or fixed in a timely manner, blocking 
>> users and/or community in real time
>>  - Real P1 gets filed and not noticed, so release goes out with known data 
>> loss bug or other total loss of functionality
>>  - Non-real P0s and P1s accumulate, throwing off our data and making it hard 
>> to find the real problems
>>  - Flakes are never fixed
>> 
>> WDYT?

Re: [DISCUSS] What to do about P0/P1/flake automation Was: P1 issues report (70)

2022-06-24 Thread Alexey Romanenko


> > 2. The links in this report start with api.github.* and don’t take us 
> > directly to the issues.
> 
> > Yeah Danny pointed that out as well. I'm assuming he knows how to fix it?
> 
> This is already fixed - Pablo actually beat me to it! 
> 
It also adds a colon after the URL, and some mail clients treat it as part
of the URL, which leads to a broken link.
Should we just remove the colon or add a space before it?
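
A rough sketch of both link fixes discussed in this thread: rewriting api.github.com issue links to github.com links, and putting a space between the URL and the colon so mail clients do not swallow the colon into the link. The helper names `to_web_url` and `report_line` and the line layout are assumptions for illustration, not the code of the actual automation.

```python
API_PREFIX = "https://api.github.com/repos/"
WEB_PREFIX = "https://github.com/"


def to_web_url(url: str) -> str:
    """Rewrite an api.github.com issue URL to its github.com equivalent."""
    if url.startswith(API_PREFIX):
        return WEB_PREFIX + url[len(API_PREFIX):]
    return url  # already a web URL, or something we do not recognize


def report_line(url: str, title: str) -> str:
    """Format one report entry with a space between the URL and the colon."""
    return f"{to_web_url(url)} : {title}"
```

Dropping the colon entirely would work just as well; the space is kept here only because it preserves the existing "URL, then title" layout.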

—
Alexey

> 
> Thanks,
> Danny
> 
> On Thu, Jun 23, 2022 at 8:30 PM Brian Hulette wrote:
> +1 for that proposal!
>
> > 1. P2 and P3 issues should be noticed and resolved as well. Shall we have a
> > longer time window for the remaining untriaged or stagnant issues and
> > include them?
> 
> I worry these lists would get _very_ long and wouldn't be actionable. But 
> maybe it's worth reporting something like "There are 376 P2's with no update 
> in the last 6 months" with a link to a query?
> 
> > 2. The links in this report start with api.github.* and don’t take us 
> > directly to the issues.
> 
> Yeah Danny pointed that out as well. I'm assuming he knows how to fix it?
> 
> On Thu, Jun 23, 2022 at 2:37 PM Pablo Estrada wrote:
> Thanks. I like the proposal, and I've found the emails useful.
> Best
> -P.
>
> On Thu, Jun 23, 2022 at 2:33 PM Manu Zhang wrote:
> Sounds good! It’s like our internal reports of JIRA tickets exceeding SLA
> time and having no response from engineers. We either resolve them or
> downgrade the priority to extend the time window.
>
> Besides,
> 1. P2 and P3 issues should be noticed and resolved as well. Shall we have a
> longer time window for the remaining untriaged or stagnant issues and
> include them?
> 2. The links in this report start with api.github.* and don’t take us 
> directly to the issues.
> 
> 
> On Fri, Jun 24, 2022 at 04:48, Danny McCormick wrote:
> That generally sounds right to me - I also would vote that we consolidate to 
> 1 email and stop distinguishing between flaky P1s and normal P1s.
> 
> So the single daily report would be:
> 
> - Unassigned P0s
> - P0s with no update in the last 36 hours
> - Unassigned P1s
> - P1s with no update in the last 7 days
> 
> I think that will generate a pretty good list of issues that require some 
> kind of action.
> 
> On Thu, Jun 23, 2022 at 4:43 PM Kenneth Knowles wrote:
> Sounds good to me. Perhaps P0s > 36 hours ago (presumably they are more like
> ~hours for true outages of CI/website/etc) and P1s > 7 days?
>
> On Thu, Jun 23, 2022 at 1:27 PM Brian Hulette wrote:
> I think that Danny's alternate proposal (a daily email that shows only issues
> last updated >7 days ago, and those with no assignee) fits well with the two
> goals you describe, if we include "triage needed" issues in the latter
> category. Maybe we also explicitly separate these two concerns in the report?
> 
> 
> On Thu, Jun 23, 2022 at 1:14 PM Kenneth Knowles wrote:
> Forking thread because lots of people may just ignore this topic, per the 
> discussion :-)
> 
> (sometimes gmail doesn't fork thread properly, but here's hoping...)
> 
> I'll add some other outcomes of these emails:
> 
>  - people file P0s that are not outages and P1s that are not data loss and I 
> downgrade them
>  - I randomly open up a few flaky test bugs and see if I can fix them really 
> quick
>  - people file legit P0s and P1s and I subscribe and follow them
> 
> Of these, only the last one seems important (not just that *I* follow them, 
> but that new P0s and P1s get immediate attention from many eyes)
> 
> So maybe one take on the goal is to:
> 
>  - have new P0s and P1s evaluated quickly: P0s are an outage or outage-like 
> occurrence that needs immediate remedy, and P1s need to be evaluated for 
> release blocking, etc.
>  - make sure P0s and P1s get attention appropriate to their priority
> 
> It can also be helpful to just state the failure modes which would happen by 
> default if we don't have a good process or automation:
> 
>  - Real P0 gets filed and not noticed or fixed in a timely manner, blocking 
> users and/or community in real time
>  - Real P1 gets filed and not noticed, so release goes out with known data 
> loss bug or other total loss of functionality
>  - Non-real P0s and P1s accumulate, throwing off our data and making it hard 
> to find the real problems
>  - Flakes are never fixed
> 
> WDYT?
> 
> If we have P0s and P1s in the "awaiting triage" state, those are the ones we 
> need to notice. Then for a P0 or P1 outside of that state, we just need some 
> way of making sure it doesn't stagnate. Or if it does stagnate, that 
> empirically demonstrates it isn't really P1 (just like our P2 to P3 downgrade 
> automation). If everything is P1, nothing is, as they say.
> 
> Kenn
> 
> On Thu, Jun 23, 

Re: [DISCUSS] What to do about P0/P1/flake automation Was: P1 issues report (70)

2022-06-23 Thread Manu Zhang
Sounds good! It’s like our internal reports of JIRA tickets exceeding SLA
time and having no response from engineers. We either resolve them or
downgrade the priority to extend the time window.

Besides,
1. P2 and P3 issues should be noticed and resolved as well. Shall we have a
longer time window for the remaining untriaged or stagnant issues and
include them?
2. The links in this report start with api.github.* and don’t take us
directly to the issues.


On Fri, Jun 24, 2022 at 04:48, Danny McCormick wrote:

> That generally sounds right to me - I also would vote that we consolidate
> to 1 email and stop distinguishing between flaky P1s and normal P1s.
>
> So the single daily report would be:
>
> - Unassigned P0s
> - P0s with no update in the last 36 hours
> - Unassigned P1s
> - P1s with no update in the last 7 days
>
> I think that will generate a pretty good list of issues that require some
> kind of action.
>
> On Thu, Jun 23, 2022 at 4:43 PM Kenneth Knowles  wrote:
>
>> Sounds good to me. Perhaps P0s > 36 hours ago (presumably they are more
>> like ~hours for true outages of CI/website/etc) and P1s > 7 days?
>>
>> On Thu, Jun 23, 2022 at 1:27 PM Brian Hulette 
>> wrote:
>>
>>> I think that Danny's alternate proposal (a daily email that shows only
>>> issues last updated >7 days ago, and those with no assignee) fits well with
>>> the two goals you describe, if we include "triage needed" issues in the
>>> latter category. Maybe we also explicitly separate these two concerns in
>>> the report?
>>>
>>>
>>> On Thu, Jun 23, 2022 at 1:14 PM Kenneth Knowles  wrote:
>>>
 Forking thread because lots of people may just ignore this topic, per
 the discussion :-)

 (sometimes gmail doesn't fork thread properly, but here's hoping...)

 I'll add some other outcomes of these emails:

  - people file P0s that are not outages and P1s that are not data loss
 and I downgrade them
  - I randomly open up a few flaky test bugs and see if I can fix them
 really quick
  - people file legit P0s and P1s and I subscribe and follow them

 Of these, only the last one seems important (not just that *I* follow
 them, but that new P0s and P1s get immediate attention from many eyes)

 So maybe one take on the goal is to:

  - have new P0s and P1s evaluated quickly: P0s are an outage or
 outage-like occurrence that needs immediate remedy, and P1s need to be
 evaluated for release blocking, etc.
  - make sure P0s and P1s get attention appropriate to their priority

 It can also be helpful to just state the failure modes which would
 happen by default if we don't have a good process or automation:

  - Real P0 gets filed and not noticed or fixed in a timely manner,
 blocking users and/or community in real time
  - Real P1 gets filed and not noticed, so release goes out with known
 data loss bug or other total loss of functionality
  - Non-real P0s and P1s accumulate, throwing off our data and making it
 hard to find the real problems
  - Flakes are never fixed

 WDYT?

 If we have P0s and P1s in the "awaiting triage" state, those are the
 ones we need to notice. Then for a P0 or P1 outside of that state, we just
 need some way of making sure it doesn't stagnate. Or if it does stagnate,
 that empirically demonstrates it isn't really P1 (just like our P2 to P3
 downgrade automation). If everything is P1, nothing is, as they say.

 Kenn

 On Thu, Jun 23, 2022 at 10:01 AM Danny McCormick <
 dannymccorm...@google.com> wrote:

> > Maybe it would be helpful to sort these by last update time (and
> potentially include that information in the email). Then we can at least
> prioritize them instead of looking at a big wall of issues.
>
> I agree that this is a good idea (and pretty trivial to do). I'll
> update the automation to do that once we get consensus on an approach.
>
> > I think the motivation for daily emails is that per the priorities
> guide [1] P1 issues should be getting "continuous status updates". If
> these issues aren't actually that important, I think the noise is good as it
> should motivate us to prioritize them correctly. In practice that hasn't
> been happening though...
>
> I guess the questions here are:
>
> 1) What is the goal of this email?
> 2) Is it effective at accomplishing that goal?
>
> I think you're saying that the goal (or a goal) is to highlight issues
> that aren't getting the attention they need; if that's our goal, then I
> don't think this is a particularly effective mechanism for it because (a)
> it's very unclear which issues fall into that category and (b) there are
> too many to manually go through on a daily basis. From the email alone, it's
> not clear to me that any of the issues above "shouldn't" be P1s (though I'd

Re: [DISCUSS] What to do about P0/P1/flake automation Was: P1 issues report (70)

2022-06-23 Thread Kenneth Knowles
Sounds good to me. Perhaps P0s > 36 hours ago (presumably they are more
like ~hours for true outages of CI/website/etc) and P1s > 7 days?
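
A minimal sketch of how the single daily report could select issues, combining the 36-hour and 7-day thresholds above with the "no assignee" criterion from the quoted proposal. The `Issue` type and the `needs_attention` and `daily_report` helpers are hypothetical and only illustrate the selection rule; sorting oldest-first follows the earlier suggestion to order by last update time, but the direction is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Staleness thresholds proposed in the thread.
STALE_AFTER = {"P0": timedelta(hours=36), "P1": timedelta(days=7)}


@dataclass
class Issue:
    number: int
    priority: str  # "P0", "P1", "P2" or "P3"
    assignee: Optional[str]  # None if unassigned
    updated_at: datetime


def needs_attention(issue: Issue, now: datetime) -> bool:
    """True for P0/P1 issues that are unassigned or past their threshold."""
    threshold = STALE_AFTER.get(issue.priority)
    if threshold is None:
        return False  # P2 and P3 are out of scope for this report
    return issue.assignee is None or now - issue.updated_at > threshold


def daily_report(issues: List[Issue], now: datetime) -> List[Issue]:
    """Issues for the daily email, least recently updated first."""
    flagged = [i for i in issues if needs_attention(i, now)]
    return sorted(flagged, key=lambda i: i.updated_at)
```

With this rule, an assigned P1 that someone touched this week never appears in the email, which is what keeps the list short enough to act on.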

On Thu, Jun 23, 2022 at 1:27 PM Brian Hulette  wrote:

> I think that Danny's alternate proposal (a daily email that shows only
> issues last updated >7 days ago, and those with no assignee) fits well with
> the two goals you describe, if we include "triage needed" issues in the
> latter category. Maybe we also explicitly separate these two concerns in
> the report?
>
>
> On Thu, Jun 23, 2022 at 1:14 PM Kenneth Knowles  wrote:
>
>> Forking thread because lots of people may just ignore this topic, per the
>> discussion :-)
>>
>> (sometimes gmail doesn't fork thread properly, but here's hoping...)
>>
>> I'll add some other outcomes of these emails:
>>
>>  - people file P0s that are not outages and P1s that are not data loss
>> and I downgrade them
>>  - I randomly open up a few flaky test bugs and see if I can fix them
>> really quick
>>  - people file legit P0s and P1s and I subscribe and follow them
>>
>> Of these, only the last one seems important (not just that *I* follow
>> them, but that new P0s and P1s get immediate attention from many eyes)
>>
>> So maybe one take on the goal is to:
>>
>>  - have new P0s and P1s evaluated quickly: P0s are an outage or
>> outage-like occurrence that needs immediate remedy, and P1s need to be
>> evaluated for release blocking, etc.
>>  - make sure P0s and P1s get attention appropriate to their priority
>>
>> It can also be helpful to just state the failure modes which would happen
>> by default if we don't have a good process or automation:
>>
>>  - Real P0 gets filed and not noticed or fixed in a timely manner,
>> blocking users and/or community in real time
>>  - Real P1 gets filed and not noticed, so release goes out with known
>> data loss bug or other total loss of functionality
>>  - Non-real P0s and P1s accumulate, throwing off our data and making it
>> hard to find the real problems
>>  - Flakes are never fixed
>>
>> WDYT?
>>
>> If we have P0s and P1s in the "awaiting triage" state, those are the ones
>> we need to notice. Then for a P0 or P1 outside of that state, we just need
>> some way of making sure it doesn't stagnate. Or if it does stagnate, that
>> empirically demonstrates it isn't really P1 (just like our P2 to P3
>> downgrade automation). If everything is P1, nothing is, as they say.
>>
>> Kenn
>>
>> On Thu, Jun 23, 2022 at 10:01 AM Danny McCormick <
>> dannymccorm...@google.com> wrote:
>>
>>> > Maybe it would be helpful to sort these by last update time (and
>>> potentially include that information in the email). Then we can at least
>>> prioritize them instead of looking at a big wall of issues.
>>>
>>> I agree that this is a good idea (and pretty trivial to do). I'll update
>>> the automation to do that once we get consensus on an approach.
>>>
>>> > I think the motivation for daily emails is that per the priorities
>>> guide [1] P1 issues should be getting "continuous status updates". If these
>>> issues aren't actually that important, I think the noise is good as it
>>> should motivate us to prioritize them correctly. In practice that hasn't
>>> been happening though...
>>>
>>> I guess the questions here are:
>>>
>>> 1) What is the goal of this email?
>>> 2) Is it effective at accomplishing that goal?
>>>
>>> I think you're saying that the goal (or a goal) is to highlight issues
>>> that aren't getting the attention they need; if that's our goal, then I
>>> don't think this is a particularly effective mechanism for it because (a)
>>> it's very unclear which issues fall into that category and (b) there are too
>>> many to manually go through on a daily basis. From the email alone, it's
>>> not clear to me that any of the issues above "shouldn't" be P1s (though I'd
>>> guess you're right that some/many of them don't belong since most were
>>> created before the Jira -> GH migration based on the titles). I'd also
>>> argue that a daily email just desensitizes us to them since there almost
>>> always will be *some* valid P1s that don't need extra attention.
>>>
>>> I do still think this could have value as a weekly email, with the goal
>>> being "it's probably a good idea for someone to take a look at each of
>>> these". Another option would be to only include issues with no action in
>>> the last 7 days and/or no assignees and keep it daily.
>>>
>>> A couple side notes:
>>> - No matter what we do, if we keep the current automation in any form we
>>> should fix the url from
>>> https://api.github.com/repos/apache/beam/issues/# to
>>> https://github.com/apache/beam/issues/# - the current links are very
>>> annoying.
>>> - After I send this, I will do a pass of the current P1s since it does
>>> indeed seem like too many are P1s and many should actually be P2s (or
>>> lower).
>>>
>>> Thanks,
>>> Danny
>>>
>>> On Thu, Jun 23, 2022 at 12:21 PM Brian Hulette 
>>> wrote:
>>>
 I think the motivation for daily emails is