Sounds good! It's like our internal reports of JIRA tickets that exceed their SLA window with no response from engineers. We either resolve them or downgrade the priority to extend the time window.
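On the links: a minimal sketch of the fix, assuming the report job fetches issues from the GitHub REST API. Every issue object the API returns carries both `url` (the api.github.com endpoint) and `html_url` (the browser link), so the report only needs to print the latter. The function names and query parameters here are illustrative, not the actual automation:

    import requests

    def fetch_open_p1s(repo="apache/beam"):
        # Hypothetical fetch; the real report job may query differently.
        # Note: this endpoint also returns open PRs, which we filter out
        # via the "pull_request" key.
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/issues",
            params={"labels": "P1", "state": "open", "per_page": 100},
        )
        resp.raise_for_status()
        return [i for i in resp.json() if "pull_request" not in i]

    def report_line(issue):
        # Print html_url (https://github.com/apache/beam/issues/<n>)
        # rather than url (the api.github.com endpoint the current
        # report links to by accident).
        return f"{issue['html_url']}: {issue['title']}"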
Besides:

1. P2 and P3 issues should be noticed and resolved as well. Shall we use a longer time window for the remaining untriaged or stagnant issues and include them too?
2. The links in this report start with api.github.* and don't take us directly to the issues (a possible fix is sketched above).

Danny McCormick <dannymccorm...@google.com> wrote on Fri, Jun 24, 2022 at 04:48:

> That generally sounds right to me - I also would vote that we consolidate to 1 email and stop distinguishing between flaky P1s and normal P1s.
>
> So the single daily report would be:
>
> - Unassigned P0s
> - P0s with no update in the last 36 hours
> - Unassigned P1s
> - P1s with no update in the last 7 days
>
> I think that will generate a pretty good list of issues that require some kind of action.
>
> On Thu, Jun 23, 2022 at 4:43 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> Sounds good to me. Perhaps P0s > 36 hours ago (presumably they are more like ~hours for true outages of CI/website/etc) and P1s > 7 days?
>>
>> On Thu, Jun 23, 2022 at 1:27 PM Brian Hulette <bhule...@google.com> wrote:
>>
>>> I think that Danny's alternate proposal (a daily email that shows only issues last updated >7 days ago, and those with no assignee) fits well with the two goals you describe, if we include "triage needed" issues in the latter category. Maybe we also explicitly separate these two concerns in the report?
>>>
>>> On Thu, Jun 23, 2022 at 1:14 PM Kenneth Knowles <k...@apache.org> wrote:
>>>
>>>> Forking thread because lots of people may just ignore this topic, per the discussion :-)
>>>>
>>>> (sometimes gmail doesn't fork thread properly, but here's hoping...)
>>>>
>>>> I'll add some other outcomes of these emails:
>>>>
>>>> - people file P0s that are not outages and P1s that are not data loss and I downgrade them
>>>> - I randomly open up a few flaky test bugs and see if I can fix them really quick
>>>> - people file legit P0s and P1s and I subscribe and follow them
>>>>
>>>> Of these, only the last one seems important (not just that *I* follow them, but that new P0s and P1s get immediate attention from many eyes).
>>>>
>>>> So maybe one take on the goal is to:
>>>>
>>>> - have new P0s and P1s evaluated quickly: P0s are an outage or outage-like occurrence that needs immediate remedy, and P1s need to be evaluated for release blocking, etc.
>>>> - make sure P0s and P1s get attention appropriate to their priority
>>>>
>>>> It can also be helpful to just state the failure modes which would happen by default if we don't have a good process or automation:
>>>>
>>>> - Real P0 gets filed and not noticed or fixed in a timely manner, blocking users and/or community in real time
>>>> - Real P1 gets filed and not noticed, so a release goes out with a known data loss bug or other total loss of functionality
>>>> - Non-real P0s and P1s accumulate, throwing off our data and making it hard to find the real problems
>>>> - Flakes are never fixed
>>>>
>>>> WDYT?
>>>>
>>>> If we have P0s and P1s in the "awaiting triage" state, those are the ones we need to notice. Then for a P0 or P1 outside of that state, we just need some way of making sure it doesn't stagnate. Or if it does stagnate, that empirically demonstrates it isn't really P1 (just like our P2 to P3 downgrade automation). If everything is P1, nothing is, as they say.
>>>>
>>>> Kenn
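A minimal sketch of the single daily report Danny proposes above, assuming issues carry `P0`/`P1` labels and using the GitHub API's `assignees` and `updated_at` fields; the helper name and category strings are illustrative:

    from datetime import datetime, timedelta, timezone

    def attention_category(issue, now=None):
        # Classify an issue into the proposed single daily report, or
        # return None if it needs no escalation. Label names and the
        # 36-hour/7-day windows follow the thread above.
        now = now or datetime.now(timezone.utc)
        labels = {label["name"] for label in issue["labels"]}
        updated = datetime.fromisoformat(
            issue["updated_at"].replace("Z", "+00:00"))
        stale = now - updated
        if "P0" in labels:
            if not issue["assignees"]:
                return "Unassigned P0"
            if stale > timedelta(hours=36):
                return "P0 with no update in the last 36 hours"
        elif "P1" in labels:
            if not issue["assignees"]:
                return "Unassigned P1"
            if stale > timedelta(days=7):
                return "P1 with no update in the last 7 days"
        return None

Unassigned issues are reported regardless of staleness, matching the four bullets above; an issue matching two categories lands in the first.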
>>>> On Thu, Jun 23, 2022 at 10:01 AM Danny McCormick <dannymccorm...@google.com> wrote:
>>>>
>>>>> > Maybe it would be helpful to sort these by last update time (and potentially include that information in the email). Then we can at least prioritize them instead of looking at a big wall of issues.
>>>>>
>>>>> I agree that this is a good idea (and pretty trivial to do). I'll update the automation to do that once we get consensus on an approach.
>>>>>
>>>>> > I think the motivation for daily emails is that per the priorities guide [1] P1 issues should be getting "continuous status updates". If these issues aren't actually that important, I think the noise is good as it should motivate us to prioritize them correctly. In practice that hasn't been happening though...
>>>>>
>>>>> I guess the questions here are:
>>>>>
>>>>> 1) What is the goal of this email?
>>>>> 2) Is it effective at accomplishing that goal?
>>>>>
>>>>> I think you're saying that the goal (or a goal) is to highlight issues that aren't getting the attention they need; if that's our goal, then I don't think this is a particularly effective mechanism for it because (a) it's very unclear which issues fall into that category and (b) there are too many to manually go through on a daily basis. From the email alone, it's not clear to me that any of the issues above "shouldn't" be P1s (though I'd guess you're right that some/many of them don't belong, since most were created before the Jira -> GH migration based on the titles). I'd also argue that a daily email just desensitizes us to them since there almost always will be *some* valid P1s that don't need extra attention.
>>>>>
>>>>> I do still think this could have value as a weekly email, with the goal being "it's probably a good idea for someone to take a look at each of these". Another option would be to only include issues with no action in the last 7 days and/or no assignees and keep it daily.
>>>>>
>>>>> A couple of side notes:
>>>>> - No matter what we do, if we keep the current automation in any form we should fix the URL from https://api.github.com/repos/apache/beam/issues/# to https://github.com/apache/beam/issues/# - the current links are very annoying.
>>>>> - After I send this, I will do a pass over the current P1s since it does indeed seem like too many are P1s and many should actually be P2s (or lower).
>>>>>
>>>>> Thanks,
>>>>> Danny
>>>>>
>>>>> On Thu, Jun 23, 2022 at 12:21 PM Brian Hulette <bhule...@google.com> wrote:
>>>>>
>>>>>> I think the motivation for daily emails is that per the priorities guide [1] P1 issues should be getting "continuous status updates". If these issues aren't actually that important, I think the noise is good as it should motivate us to prioritize them correctly. In practice that hasn't been happening though...
>>>>>>
>>>>>> Maybe it would be helpful to sort these by last update time (and potentially include that information in the email). Then we can at least prioritize them instead of looking at a big wall of issues.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> [1] https://beam.apache.org/contribute/issue-priorities/
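A sketch of the sort-by-last-update idea from Brian's message above, relying on the fact that ISO 8601 `updated_at` strings sort chronologically as plain strings; the line format is illustrative:

    def report_lines(issues):
        # Oldest-updated first, so the most neglected issues lead the
        # report, and surface the date so readers can triage by eye.
        ordered = sorted(issues, key=lambda i: i["updated_at"])
        return [
            f"{i['html_url']} (last updated {i['updated_at'][:10]}): {i['title']}"
            for i in ordered
        ]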
>>>>>> On Thu, Jun 23, 2022 at 6:07 AM Danny McCormick <dannymccorm...@google.com> wrote:
>>>>>>
>>>>>>> I think a weekly summary seems like a good idea for the P1 issues and flaky tests, though daily still seems appropriate for P0 issues. I put up https://github.com/apache/beam/pull/22017 to just send the P1/flaky test reports on Wednesdays; if anyone objects please let me know - I'll wait on merging til tomorrow to leave time for feedback (and it's always reversible 🙂).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Danny
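One hedged sketch of that weekly cadence: the job could keep running daily and simply skip the P1/flaky sections except on Wednesdays. PR 22017 may implement this differently, e.g., in the workflow's cron schedule itself:

    from datetime import date

    def should_send_p1_flaky_report(today=None):
        # Monday is weekday 0, so Wednesday is 2; the P0 report
        # stays daily regardless.
        today = today or date.today()
        return today.weekday() == 2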
>>>>>>> On Wed, Jun 22, 2022 at 7:05 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> What is this daily summary intended for? Not all of these issues look like P1s. And would a weekly summary be less noisy?
>>>>>>>>
>>>>>>>> <beamacti...@gmail.com> wrote on Wed, Jun 22, 2022 at 23:45:
>>>>>>>>
>>>>>>>>> This is your daily summary of Beam's current P1 issues, not including flaky tests.
>>>>>>>>>
>>>>>>>>> See https://beam.apache.org/contribute/issue-priorities/#p1-critical for the meaning and expectations around P1 issues.
>>>>>>>>>
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21978: [Playground] Implement Share Any Code feature on the frontend
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21946: [Bug]: No way to read or write to file when running Beam in Flink
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21935: [Bug]: Reject illformed GBK Coders
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21897: [Feature Request]: Flink runner savepoint backward compatibility
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21893: [Bug]: BigQuery Storage Write API implementation does not support table partitioning
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21794: Dataflow runner creates a new timer whenever the output timestamp is change
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21763: [Playground Task]: Migrate from Google Analytics to Matomo Cloud
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21715: Data missing when using CassandraIO.Read
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21713: 404s in BigQueryIO don't get output to Failed Inserts PCollection
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21711: Python Streaming job failing to drain with BigQueryIO write errors
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21703: pubsublite.ReadWriteIT failing in beam_PostCommit_Java_DataflowV1 and V2
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21702: SpannerWriteIT failing in beam PostCommit Java V1
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21700: --dataflowServiceOptions=use_runner_v2 is broken
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21695: DataflowPipelineResult does not raise exception for unsuccessful states.
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21694: BigQuery Storage API insert with writeResult retry and write to error table
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21479: Install Python wheel and dependencies to local venv in SDK harness
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21478: KafkaIO.read.withDynamicRead() doesn't pick up new TopicPartitions
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21477: Add integration testing for BQ Storage API write modes
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21476: WriteToBigQuery Dynamic table destinations returns wrong tableId
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21475: Beam x-lang Dataflow tests failing due to _InactiveRpcError
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21473: PVR_Spark2_Streaming perma-red
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21466: Simplify version override for Dev versions of the Go SDK.
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21465: Kafka commit offset drop data on failure for runners that have non-checkpointing shuffle
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21269: Delete orphaned files
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21268: Race between member variable being accessed due to leaking uninitialized state via OutboundObserverFactory
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21267: WriteToBigQuery submits a duplicate BQ load job if a 503 error code is returned from googleapi
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21265: apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21263: (Broken Pipe induced) Bricked Dataflow Pipeline
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21262: Python AfterAny, AfterAll do not follow spec
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21260: Python DirectRunner does not emit data at GC time
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21259: Consumer group with random prefix
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21258: Dataflow error in CombinePerKey operation
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21257: Either Create or DirectRunner fails to produce all elements to the following transform
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21123: Multiple jobs running on Flink session cluster reuse the persistent Python environment.
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21119: Migrate to the next version of Python `requests` when released
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21117: "Java IO IT Tests" - missing data in grafana
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21115: JdbcIO date conversion is sensitive to OS
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21112: Dataflow SocketException (SSLException) error while trying to send message from Cloud Pub/Sub to BigQuery
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21111: Java creates an incorrect pipeline proto when core-construction-java jar is not in the CLASSPATH
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21110: codecov/patch has poor behavior
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21109: SDF BoundedSource seems to execute significantly slower than 'normal' BoundedSource
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/21108: java.io.InvalidClassException With Flink Kafka
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20979: Portable runners should be able to issue checkpoints to Splittable DoFn
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20978: PubsubIO.readAvroGenericRecord creates SchemaCoder that fails to decode some Avro logical types
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20973: Python Beam SDK Harness hangs when installing pip packages
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20818: XmlIO.Read does not handle XML encoding per spec
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20814: JmsIO is not acknowledging messages correctly
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20813: No trigger early repeatedly for session windows
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20812: Cross-language consistency (RequiresStableInputs) is quietly broken (at least on portable flink runner)
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20692: Timer with dataflow runner can be set multiple times (dataflow runner)
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20691: Beam metrics should be displayed in Flink UI "Metrics" tab
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20689: Kafka commitOffsetsInFinalize OOM on Flink
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20532: Support for coder argument in WriteToBigQuery
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20531: FileBasedSink: allow setting temp directory provider per dynamic destination
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20530: Make non-portable Splittable DoFn the only option when executing Java "Read" transforms
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20529: SpannerIO tests don't actually assert anything.
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20528: python CombineGlobally().with_fanout() cause duplicate combine results for sliding windows
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20333: beam_PerformanceTests_Kafka_IO failing due to " provided port is already allocated"
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20332: FileIO writeDynamic with AvroIO.sink not writing all data
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20330: Remove insecure ssl options from MongoDBIO
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20109: SortValues should fail if SecondaryKey coder is not deterministic
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20108: Python direct runner doesn't emit empty pane when it should
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/20009: Environment-sensitive provisioning for Dataflow
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/19971: [SQL] Some Hive tests throw NullPointerException, but get marked as passing (Direct Runner)
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/19817: datetime and decimal should be logical types
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/19815: Add support for remaining data types in python RowCoder
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/19813: PubsubIO returns empty message bodies for all messages read
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/19556: User reports protobuf ClassChangeError running against 2.6.0 or above
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/19369: KafkaIO doesn't commit offsets while being used as bounded source
>>>>>>>>> https://api.github.com/repos/apache/beam/issues/17950: [Bug]: Java Precommit permared