Sounds good to me. Perhaps flag P0s last updated > 36 hours ago (presumably
true outages of CI/website/etc. are resolved within ~hours) and P1s last
updated > 7 days ago?
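
Something like this could implement those cutoffs - a rough, untested
sketch assuming the report script is Python and the issues come from the
GitHub REST API (the names here are illustrative, not the actual
automation):

    # Hypothetical staleness check: P0s with no update in 36 hours,
    # P1s with no update in 7 days.
    from datetime import datetime, timedelta, timezone

    STALE_AFTER = {"P0": timedelta(hours=36), "P1": timedelta(days=7)}

    def is_stale(issue, priority):
        # "updated_at" is an ISO 8601 timestamp like "2022-06-22T15:45:00Z"
        updated = datetime.fromisoformat(
            issue["updated_at"].replace("Z", "+00:00"))
        return datetime.now(timezone.utc) - updated > STALE_AFTER[priority]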

On Thu, Jun 23, 2022 at 1:27 PM Brian Hulette <bhule...@google.com> wrote:

> I think that Danny's alternate proposal (a daily email that shows only
> issues last updated >7 days ago, and those with no assignee) fits well with
> the two goals you describe, if we include "triage needed" issues in the
> latter category. Maybe we also explicitly separate these two concerns in
> the report?
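>
> A minimal sketch of what separating them could look like (untested;
> assumes a list of issue dicts from the GitHub API, an "awaiting triage"
> label, and some is_stale() predicate implementing whatever cutoff we
> pick - all names illustrative):
>
>     def build_report(issues):
>         # Concern 1: issues nobody has picked up or triaged yet
>         needs_attention = [i for i in issues
>                            if i["assignee"] is None
>                            or any(l["name"] == "awaiting triage"
>                                   for l in i["labels"])]
>         # Concern 2: owned issues that have gone quiet
>         stagnating = [i for i in issues
>                       if is_stale(i) and i not in needs_attention]
>         lines = ["Needs triage / owner:"]
>         lines += [f'  {i["html_url"]}: {i["title"]}' for i in needs_attention]
>         lines += ["", "No recent updates:"]
>         lines += [f'  {i["html_url"]}: {i["title"]}' for i in stagnating]
>         return "\n".join(lines)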
>
>
> On Thu, Jun 23, 2022 at 1:14 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> Forking thread because lots of people may just ignore this topic, per the
>> discussion :-)
>>
>> (sometimes gmail doesn't fork thread properly, but here's hoping...)
>>
>> I'll add some other outcomes of these emails:
>>
>>  - people file P0s that are not outages and P1s that are not data loss
>> and I downgrade them
>>  - I randomly open up a few flaky test bugs and see if I can fix them
>> really quick
>>  - people file legit P0s and P1s and I subscribe and follow them
>>
>> Of these, only the last one seems important (not just that *I* follow
>> them, but that new P0s and P1s get immediate attention from many eyes)
>>
>> So maybe one take on the goal is to:
>>
>>  - have new P0s and P1s evaluated quickly: P0s are an outage or
>> outage-like occurrence that needs immediate remedy, and P1s need to be
>> evaluated for release blocking, etc.
>>  - make sure P0s and P1s get attention appropriate to their priority
>>
>> It can also be helpful to just state the failure modes which would happen
>> by default if we don't have a good process or automation:
>>
>>  - Real P0 gets filed and not noticed or fixed in a timely manner,
>> blocking users and/or community in real time
>>  - Real P1 gets filed and not noticed, so release goes out with known
>> data loss bug or other total loss of functionality
>>  - Non-real P0s and P1s accumulate, throwing off our data and making it
>> hard to find the real problems
>>  - Flakes are never fixed
>>
>> WDYT?
>>
>> If we have P0s and P1s in the "awaiting triage" state, those are the ones
>> we need to notice. Then for a P0 or P1 outside of that state, we just need
>> some way of making sure it doesn't stagnate. Or if it does stagnate, that
>> empirically demonstrates it isn't really P1 (just like our P2 to P3
>> downgrade automation). If everything is P1, nothing is, as they say.
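>>
>> For the "awaiting triage" piece, a query like this would surface them -
>> just an untested sketch against the GitHub search API (the label names
>> are assumptions about our setup):
>>
>>     import requests
>>
>>     def awaiting_triage(priority):
>>         # e.g. priority = "P1"; open issues still awaiting triage
>>         q = (f'repo:apache/beam is:issue is:open '
>>              f'label:{priority} label:"awaiting triage"')
>>         resp = requests.get("https://api.github.com/search/issues",
>>                             params={"q": q, "per_page": 100})
>>         resp.raise_for_status()
>>         return resp.json()["items"]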
>>
>> Kenn
>>
>> On Thu, Jun 23, 2022 at 10:01 AM Danny McCormick <
>> dannymccorm...@google.com> wrote:
>>
>>> > Maybe it would be helpful to sort these by last update time (and
>>> potentially include that information in the email). Then we can at least
>>> prioritize them instead of looking at a big wall of issues.
>>>
>>> I agree that this is a good idea (and pretty trivial to do). I'll update
>>> the automation to do that once we get consensus on an approach.
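>>>
>>> For reference, the sort itself is tiny - something like this (untested):
>>>
>>>     # ISO 8601 timestamps sort correctly as strings; oldest update first
>>>     issues.sort(key=lambda i: i["updated_at"])
>>>     for i in issues:
>>>         print(f'{i["updated_at"][:10]} {i["html_url"]}: {i["title"]}')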
>>>
>>> > I think the motivation for daily emails is that per the priorities
>>> guide [1] P1 issues should be getting "continuous status updates". If these
>>> issues aren't actually that important, I think the noise is good as it
>>> should motivate us to prioritize them correctly. In practice that hasn't
>>> been happening though...
>>>
>>> I guess the questions here are:
>>>
>>> 1) What is the goal of this email?
>>> 2) Is it effective at accomplishing that goal?
>>>
>>> I think you're saying that the goal (or a goal) is to highlight issues
>>> that aren't getting the attention they need; if that's our goal, then I
>>> don't think this is a particularly effective mechanism for it because (a)
>>> it's very unclear which issues fall into that category and (b) there are too
>>> many to manually go through on a daily basis. From the email alone, it's
>>> not clear to me that any of the issues above "shouldn't" be P1s (though I'd
>>> guess you're right that some/many of them don't belong, since based on the
>>> titles most were created before the Jira -> GH migration). I'd also
>>> argue that a daily email just desensitizes us to them, since there almost
>>> always will be *some* valid P1s that don't need extra attention.
>>>
>>> I do still think this could have value as a weekly email, with the goal
>>> being "it's probably a good idea for someone to take a look at each of
>>> these". Another option would be to only include issues with no action in
>>> the last 7 days and/or no assignees and keep it daily.
>>>
>>> A couple side notes:
>>> - No matter what we do, if we keep the current automation in any form we
>>> should fix the URL from
>>> https://api.github.com/repos/apache/beam/issues/# to
>>> https://github.com/apache/beam/issues/# - the current links are very
>>> annoying. (A sketch of the fix is below these notes.)
>>> - After I send this, I will do a pass of the current P1s since it does
>>> indeed seem like too many are P1s and many should actually be P2s (or
>>> lower).
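>>>
>>> For the URL fix, the issue objects from the API already carry an
>>> html_url field we could use directly; failing that, a one-line rewrite
>>> would do (untested sketch):
>>>
>>>     def to_html_url(api_url):
>>>         # https://api.github.com/repos/apache/beam/issues/123
>>>         #   -> https://github.com/apache/beam/issues/123
>>>         return api_url.replace("https://api.github.com/repos/",
>>>                                "https://github.com/", 1)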
>>>
>>> Thanks,
>>> Danny
>>>
>>> On Thu, Jun 23, 2022 at 12:21 PM Brian Hulette <bhule...@google.com>
>>> wrote:
>>>
>>>> I think the motivation for daily emails is that per the priorities
>>>> guide [1] P1 issues should be getting "continuous status updates". If these
>>>> issues aren't actually that important, I think the noise is good as it
>>>> should motivate us to prioritize them correctly. In practice that hasn't
>>>> been happening though...
>>>>
>>>> Maybe it would be helpful to sort these by last update time (and
>>>> potentially include that information in the email). Then we can at least
>>>> prioritize them instead of looking at a big wall of issues.
>>>>
>>>> Brian
>>>>
>>>> [1] https://beam.apache.org/contribute/issue-priorities/
>>>>
>>>> On Thu, Jun 23, 2022 at 6:07 AM Danny McCormick <
>>>> dannymccorm...@google.com> wrote:
>>>>
>>>>> I think a weekly summary seems like a good idea for the P1 issues and
>>>>> flaky tests, though daily still seems appropriate for P0 issues. I put up
>>>>> https://github.com/apache/beam/pull/22017 to just send the P1/flaky
>>>>> test reports on Wednesdays; if anyone objects, please let me know - I'll
>>>>> wait on merging until tomorrow to leave time for feedback (and it's always
>>>>> reversible 🙂).
>>>>>
>>>>> Thanks,
>>>>> Danny
>>>>>
>>>>> On Wed, Jun 22, 2022 at 7:05 PM Manu Zhang <owenzhang1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> What is this daily summary intended for? Not all of these issues look
>>>>>> like P1s. And would a weekly summary be less noisy?
>>>>>>
>>>>>> On Wed, Jun 22, 2022 at 11:45 PM <beamacti...@gmail.com> wrote:
>>>>>>
>>>>>>> This is your daily summary of Beam's current P1 issues, not
>>>>>>> including flaky tests.
>>>>>>>
>>>>>>>     See
>>>>>>> https://beam.apache.org/contribute/issue-priorities/#p1-critical
>>>>>>> for the meaning and expectations around P1 issues.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> https://api.github.com/repos/apache/beam/issues/21978: [Playground]
>>>>>>> Implement Share Any Code feature on the frontend
>>>>>>> https://api.github.com/repos/apache/beam/issues/21946: [Bug]: No
>>>>>>> way to read or write to file when running Beam in Flink
>>>>>>> https://api.github.com/repos/apache/beam/issues/21935: [Bug]:
>>>>>>> Reject illformed GBK Coders
>>>>>>> https://api.github.com/repos/apache/beam/issues/21897: [Feature
>>>>>>> Request]: Flink runner savepoint backward compatibility
>>>>>>> https://api.github.com/repos/apache/beam/issues/21893: [Bug]:
>>>>>>> BigQuery Storage Write API implementation does not support table
>>>>>>> partitioning
>>>>>>> https://api.github.com/repos/apache/beam/issues/21794: Dataflow
>>>>>>> runner creates a new timer whenever the output timestamp is change
>>>>>>> https://api.github.com/repos/apache/beam/issues/21763: [Playground
>>>>>>> Task]: Migrate from Google Analytics to Matomo Cloud
>>>>>>> https://api.github.com/repos/apache/beam/issues/21715: Data missing
>>>>>>> when using CassandraIO.Read
>>>>>>> https://api.github.com/repos/apache/beam/issues/21713: 404s in
>>>>>>> BigQueryIO don't get output to Failed Inserts PCollection
>>>>>>> https://api.github.com/repos/apache/beam/issues/21711: Python
>>>>>>> Streaming job failing to drain with BigQueryIO write errors
>>>>>>> https://api.github.com/repos/apache/beam/issues/21703:
>>>>>>> pubsublite.ReadWriteIT failing in beam_PostCommit_Java_DataflowV1 and V2
>>>>>>> https://api.github.com/repos/apache/beam/issues/21702:
>>>>>>> SpannerWriteIT failing in beam PostCommit Java V1
>>>>>>> https://api.github.com/repos/apache/beam/issues/21700:
>>>>>>> --dataflowServiceOptions=use_runner_v2 is broken
>>>>>>> https://api.github.com/repos/apache/beam/issues/21695:
>>>>>>> DataflowPipelineResult does not raise exception for unsuccessful states.
>>>>>>> https://api.github.com/repos/apache/beam/issues/21694: BigQuery
>>>>>>> Storage API insert with writeResult retry and write to error table
>>>>>>> https://api.github.com/repos/apache/beam/issues/21479: Install
>>>>>>> Python wheel and dependencies to local venv in SDK harness
>>>>>>> https://api.github.com/repos/apache/beam/issues/21478:
>>>>>>> KafkaIO.read.withDynamicRead() doesn't pick up new TopicPartitions
>>>>>>> https://api.github.com/repos/apache/beam/issues/21477: Add
>>>>>>> integration testing for BQ Storage API  write modes
>>>>>>> https://api.github.com/repos/apache/beam/issues/21476:
>>>>>>> WriteToBigQuery Dynamic table destinations returns wrong tableId
>>>>>>> https://api.github.com/repos/apache/beam/issues/21475: Beam x-lang
>>>>>>> Dataflow tests failing due to _InactiveRpcError
>>>>>>> https://api.github.com/repos/apache/beam/issues/21473:
>>>>>>> PVR_Spark2_Streaming perma-red
>>>>>>> https://api.github.com/repos/apache/beam/issues/21466: Simplify
>>>>>>> version override for Dev versions of the Go SDK.
>>>>>>> https://api.github.com/repos/apache/beam/issues/21465: Kafka commit
>>>>>>> offset drop data on failure for runners that have non-checkpointing 
>>>>>>> shuffle
>>>>>>> https://api.github.com/repos/apache/beam/issues/21269: Delete
>>>>>>> orphaned files
>>>>>>> https://api.github.com/repos/apache/beam/issues/21268: Race between
>>>>>>> member variable being accessed due to leaking uninitialized state via
>>>>>>> OutboundObserverFactory
>>>>>>> https://api.github.com/repos/apache/beam/issues/21267:
>>>>>>> WriteToBigQuery submits a duplicate BQ load job if a 503 error code is
>>>>>>> returned from googleapi
>>>>>>> https://api.github.com/repos/apache/beam/issues/21265:
>>>>>>> apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally
>>>>>>> 'apache_beam.coders.coder_impl._AbstractIterable' object is not 
>>>>>>> reversible
>>>>>>> https://api.github.com/repos/apache/beam/issues/21263: (Broken Pipe
>>>>>>> induced) Bricked Dataflow Pipeline
>>>>>>> https://api.github.com/repos/apache/beam/issues/21262: Python
>>>>>>> AfterAny, AfterAll do not follow spec
>>>>>>> https://api.github.com/repos/apache/beam/issues/21260: Python
>>>>>>> DirectRunner does not emit data at GC time
>>>>>>> https://api.github.com/repos/apache/beam/issues/21259: Consumer
>>>>>>> group with random prefix
>>>>>>> https://api.github.com/repos/apache/beam/issues/21258: Dataflow
>>>>>>> error in CombinePerKey operation
>>>>>>> https://api.github.com/repos/apache/beam/issues/21257: Either
>>>>>>> Create or DirectRunner fails to produce all elements to the following
>>>>>>> transform
>>>>>>> https://api.github.com/repos/apache/beam/issues/21123: Multiple
>>>>>>> jobs running on Flink session cluster reuse the persistent Python
>>>>>>> environment.
>>>>>>> https://api.github.com/repos/apache/beam/issues/21119: Migrate to
>>>>>>> the next version of Python `requests` when released
>>>>>>> https://api.github.com/repos/apache/beam/issues/21117: "Java IO IT
>>>>>>> Tests" - missing data in grafana
>>>>>>> https://api.github.com/repos/apache/beam/issues/21115: JdbcIO date
>>>>>>> conversion is sensitive to OS
>>>>>>> https://api.github.com/repos/apache/beam/issues/21112: Dataflow
>>>>>>> SocketException (SSLException) error while trying to send message from
>>>>>>> Cloud Pub/Sub to BigQuery
>>>>>>> https://api.github.com/repos/apache/beam/issues/21111: Java creates
>>>>>>> an incorrect pipeline proto when core-construction-java jar is not in 
>>>>>>> the
>>>>>>> CLASSPATH
>>>>>>> https://api.github.com/repos/apache/beam/issues/21110:
>>>>>>> codecov/patch has poor behavior
>>>>>>> https://api.github.com/repos/apache/beam/issues/21109: SDF
>>>>>>> BoundedSource seems to execute significantly slower than 'normal'
>>>>>>> BoundedSource
>>>>>>> https://api.github.com/repos/apache/beam/issues/21108:
>>>>>>> java.io.InvalidClassException With Flink Kafka
>>>>>>> https://api.github.com/repos/apache/beam/issues/20979: Portable
>>>>>>> runners should be able to issue checkpoints to Splittable DoFn
>>>>>>> https://api.github.com/repos/apache/beam/issues/20978:
>>>>>>> PubsubIO.readAvroGenericRecord creates SchemaCoder that fails to decode
>>>>>>> some Avro logical types
>>>>>>> https://api.github.com/repos/apache/beam/issues/20973: Python Beam
>>>>>>> SDK Harness hangs when installing pip packages
>>>>>>> https://api.github.com/repos/apache/beam/issues/20818: XmlIO.Read
>>>>>>> does not handle XML encoding per spec
>>>>>>> https://api.github.com/repos/apache/beam/issues/20814: JmsIO is not
>>>>>>> acknowledging messages correctly
>>>>>>> https://api.github.com/repos/apache/beam/issues/20813: No trigger
>>>>>>> early repeatedly for session windows
>>>>>>> https://api.github.com/repos/apache/beam/issues/20812:
>>>>>>> Cross-language consistency (RequiresStableInputs) is quietly broken (at
>>>>>>> least on portable flink runner)
>>>>>>> https://api.github.com/repos/apache/beam/issues/20692: Timer with
>>>>>>> dataflow runner can be set multiple times (dataflow runner)
>>>>>>> https://api.github.com/repos/apache/beam/issues/20691: Beam metrics
>>>>>>> should be displayed in Flink UI "Metrics" tab
>>>>>>> https://api.github.com/repos/apache/beam/issues/20689: Kafka
>>>>>>> commitOffsetsInFinalize OOM on Flink
>>>>>>> https://api.github.com/repos/apache/beam/issues/20532: Support for
>>>>>>> coder argument in WriteToBigQuery
>>>>>>> https://api.github.com/repos/apache/beam/issues/20531:
>>>>>>> FileBasedSink: allow setting temp directory provider per dynamic 
>>>>>>> destination
>>>>>>> https://api.github.com/repos/apache/beam/issues/20530: Make
>>>>>>> non-portable Splittable DoFn the only option when executing Java "Read"
>>>>>>> transforms
>>>>>>> https://api.github.com/repos/apache/beam/issues/20529: SpannerIO
>>>>>>> tests don't actually assert anything.
>>>>>>> https://api.github.com/repos/apache/beam/issues/20528: python
>>>>>>> CombineGlobally().with_fanout() cause duplicate combine results for 
>>>>>>> sliding
>>>>>>> windows
>>>>>>> https://api.github.com/repos/apache/beam/issues/20333:
>>>>>>> beam_PerformanceTests_Kafka_IO failing due to " provided port is already
>>>>>>> allocated"
>>>>>>> https://api.github.com/repos/apache/beam/issues/20332: FileIO
>>>>>>> writeDynamic with AvroIO.sink not writing all data
>>>>>>> https://api.github.com/repos/apache/beam/issues/20330: Remove
>>>>>>> insecure ssl options from MongoDBIO
>>>>>>> https://api.github.com/repos/apache/beam/issues/20109: SortValues
>>>>>>> should fail if SecondaryKey coder is not deterministic
>>>>>>> https://api.github.com/repos/apache/beam/issues/20108: Python
>>>>>>> direct runner doesn't emit empty pane when it should
>>>>>>> https://api.github.com/repos/apache/beam/issues/20009:
>>>>>>> Environment-sensitive provisioning for Dataflow
>>>>>>> https://api.github.com/repos/apache/beam/issues/19971: [SQL] Some
>>>>>>> Hive tests throw NullPointerException, but get marked as passing (Direct
>>>>>>> Runner)
>>>>>>> https://api.github.com/repos/apache/beam/issues/19817: datetime and
>>>>>>> decimal should be logical types
>>>>>>> https://api.github.com/repos/apache/beam/issues/19815: Add support
>>>>>>> for remaining data types in python RowCoder
>>>>>>> https://api.github.com/repos/apache/beam/issues/19813: PubsubIO
>>>>>>> returns empty message bodies for all messages read
>>>>>>> https://api.github.com/repos/apache/beam/issues/19556: User reports
>>>>>>> protobuf ClassChangeError running against 2.6.0 or above
>>>>>>> https://api.github.com/repos/apache/beam/issues/19369: KafkaIO
>>>>>>> doesn't commit offsets while being used as bounded source
>>>>>>> https://api.github.com/repos/apache/beam/issues/17950: [Bug]: Java
>>>>>>> Precommit permared
>>>>>>>
>>>>>>
