Forking the thread because lots of people may just ignore this topic, per the
discussion :-)

(sometimes Gmail doesn't fork threads properly, but here's hoping...)

I'll add some other outcomes of these emails:

 - people file P0s that are not outages and P1s that are not data loss and
I downgrade them
 - I randomly open up a few flaky test bugs and see if I can fix them
really quick
 - people file legit P0s and P1s and I subscribe and follow them

Of these, only the last one seems important (not just that *I* follow them,
but that new P0s and P1s get immediate attention from many eyes).

So maybe one take on the goal is to:

 - have new P0s and P1s evaluated quickly: a P0 is an outage or outage-like
occurrence that needs immediate remedy, while a P1 needs to be evaluated for
release blocking, etc.
 - make sure P0s and P1s get attention appropriate to their priority

It can also be helpful to just state the failure modes which would happen
by default if we don't have a good process or automation:

 - Real P0 gets filed and not noticed or fixed in a timely manner, blocking
users and/or the community in real time
 - Real P1 gets filed and not noticed, so release goes out with known data
loss bug or other total loss of functionality
 - Non-real P0s and P1s accumulate, throwing off our data and making it
hard to find the real problems
 - Flakes are never fixed

WDYT?

If we have P0s and P1s in the "awaiting triage" state, those are the ones
we need to notice. Then for a P0 or P1 outside of that state, we just need
some way of making sure it doesn't stagnate. Or if it does stagnate, that
empirically demonstrates it isn't really P1 (just like our P2 to P3
downgrade automation). If everything is P1, nothing is, as they say.
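
To make the stagnation check concrete, here's a rough sketch against the
GitHub REST API (untested; the 14-day window is just a placeholder, and this
isn't the actual automation code):

    # Sketch only: the 14-day window is a guess at a reasonable downgrade
    # threshold, not something we've agreed on.
    from datetime import datetime, timedelta, timezone

    import requests

    STALE_AFTER = timedelta(days=14)  # hypothetical threshold

    resp = requests.get(
        "https://api.github.com/repos/apache/beam/issues",
        params={"labels": "P1", "state": "open", "per_page": 100},
    )
    resp.raise_for_status()

    now = datetime.now(timezone.utc)
    for issue in resp.json():
        updated = datetime.fromisoformat(issue["updated_at"].replace("Z", "+00:00"))
        if now - updated > STALE_AFTER:
            # Stagnant P1: candidate for downgrade, analogous to the
            # existing P2 -> P3 downgrade automation.
            print(f'Downgrade candidate: {issue["html_url"]} ({issue["title"]})')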

Kenn

On Thu, Jun 23, 2022 at 10:01 AM Danny McCormick <dannymccorm...@google.com>
wrote:

> > Maybe it would be helpful to sort these by last update time (and
> potentially include that information in the email). Then we can at least
> prioritize them instead of looking at a big wall of issues.
>
> I agree that this is a good idea (and pretty trivial to do). I'll update
> the automation to do that once we get consensus on an approach.
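>
> Roughly, the change amounts to asking the API for least-recently-updated
> issues first and echoing the timestamp; a sketch, not the real report
> script:
>
>     import requests
>
>     # "sort" and "direction" are standard list-issues parameters; asking
>     # for ascending update time puts the most-neglected issues on top.
>     issues = requests.get(
>         "https://api.github.com/repos/apache/beam/issues",
>         params={"labels": "P1", "state": "open",
>                 "sort": "updated", "direction": "asc"},
>     ).json()
>     for issue in issues:
>         print(f'{issue["html_url"]}: {issue["title"]} '
>               f'(last updated {issue["updated_at"]})')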
>
> > I think the motivation for daily emails is that per the priorities guide
> [1] P1 issues should be getting "continuous status updates". If these
> issues aren't actually that important, I think the noise is good as it
> should motivate us to prioritize them correctly. In practice that hasn't
> been happening though...
>
> I guess the questions here are:
>
> 1) What is the goal of this email?
> 2) Is it effective at accomplishing that goal?
>
> I think you're saying that the goal (or a goal) is to highlight issues
> that aren't getting the attention they need; if that's our goal, then I
> don't think this is a particularly effective mechanism for it because (a)
> it's very unclear which issues fall into that category and (b) there are too
> many to manually go through on a daily basis. From the email alone, it's
> not clear to me that any of the issues above "shouldn't" be P1s (though I'd
> guess you're right that some/many of them don't belong since, based on the
> titles, most were created before the Jira -> GH migration). I'd also
> argue that a daily email just desensitizes us to them since there almost
> always will be *some* valid P1s that don't need extra attention.
>
> I do still think this could have value as a weekly email, with the goal
> being "it's probably a good idea for someone to take a look at each of
> these". Another option would be to only include issues with no action in
> the last 7 days and/or no assignees and keep it daily.
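>
> If we went that route, the filter itself is simple; a sketch using the same
> issues endpoint (field names are from the standard API response, and the
> 7-day window is per the above):
>
>     import requests
>     from datetime import datetime, timedelta, timezone
>
>     WINDOW = timedelta(days=7)
>     now = datetime.now(timezone.utc)
>     issues = requests.get(
>         "https://api.github.com/repos/apache/beam/issues",
>         params={"labels": "P1", "state": "open"},
>     ).json()
>     # Keep only P1s that look neglected: quiet for a week or unassigned.
>     neglected = [
>         i for i in issues
>         if now - datetime.fromisoformat(i["updated_at"].replace("Z", "+00:00")) > WINDOW
>         or not i["assignees"]
>     ]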
>
> A couple side notes:
> - No matter what we do, if we keep the current automation in any form we
> should fix the URL format from https://api.github.com/repos/apache/beam/issues/#
> to https://github.com/apache/beam/issues/# - the current links are very
> annoying (see the sketch below these notes).
> - After I send this, I will do a pass of the current P1s since it does
> indeed seem like too many are P1s and many should actually be P2s (or
> lower).
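>
> For the link fix specifically: the API response already includes the
> browser-facing link as "html_url" alongside the machine-facing "url", so
> it's likely just a one-line change in whatever builds the email, e.g.:
>
>     # `issues` as fetched in the sketch above; "html_url" is the
>     # github.com link, "url" is the api.github.com one we show today.
>     for issue in issues:
>         print(issue["html_url"])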
>
> Thanks,
> Danny
>
> On Thu, Jun 23, 2022 at 12:21 PM Brian Hulette <bhule...@google.com>
> wrote:
>
>> I think the motivation for daily emails is that per the priorities guide
>> [1] P1 issues should be getting "continuous status updates". If these
>> issues aren't actually that important, I think the noise is good as it
>> should motivate us to prioritize them correctly. In practice that hasn't
>> been happening though...
>>
>> Maybe it would be helpful to sort these by last update time (and
>> potentially include that information in the email). Then we can at least
>> prioritize them instead of looking at a big wall of issues.
>>
>> Brian
>>
>> [1] https://beam.apache.org/contribute/issue-priorities/
>>
>> On Thu, Jun 23, 2022 at 6:07 AM Danny McCormick <
>> dannymccorm...@google.com> wrote:
>>
>>> I think a weekly summary seems like a good idea for the P1 issues and
>>> flaky tests, though daily still seems appropriate for P0 issues. I put up
>>> https://github.com/apache/beam/pull/22017 to just send the P1/flaky
>>> test reports on Wednesdays; if anyone objects, please let me know - I'll
>>> wait on merging until tomorrow to leave time for feedback (and it's always
>>> reversible 🙂).
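>>>
>>> (The gist is just gating the P1/flaky reports on the day of the week; a
>>> sketch of one way to do it, not necessarily how the PR implements it, and
>>> send_report is a hypothetical stand-in for the existing report logic:)
>>>
>>>     from datetime import date
>>>
>>>     # Send the P1/flaky-test reports only on Wednesdays (weekday() == 2);
>>>     # the P0 report would stay daily.
>>>     if date.today().weekday() == 2:
>>>         send_report()  # hypothetical stand-in for the existing report step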
>>>
>>> Thanks,
>>> Danny
>>>
>>> On Wed, Jun 22, 2022 at 7:05 PM Manu Zhang <owenzhang1...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> What is this daily summary intended for? Not all of these issues look
>>>> like P1s, and would a weekly summary be less noisy?
>>>>
>>>> <beamacti...@gmail.com> wrote on Wed, Jun 22, 2022 at 23:45:
>>>>
>>>>> This is your daily summary of Beam's current P1 issues, not including
>>>>> flaky tests.
>>>>>
>>>>> See
>>>>> https://beam.apache.org/contribute/issue-priorities/#p1-critical for
>>>>> the meaning and expectations around P1 issues.
>>>>>
>>>>>
>>>>>
>>>>> https://api.github.com/repos/apache/beam/issues/21978: [Playground]
>>>>> Implement Share Any Code feature on the frontend
>>>>> https://api.github.com/repos/apache/beam/issues/21946: [Bug]: No way
>>>>> to read or write to file when running Beam in Flink
>>>>> https://api.github.com/repos/apache/beam/issues/21935: [Bug]: Reject
>>>>> illformed GBK Coders
>>>>> https://api.github.com/repos/apache/beam/issues/21897: [Feature
>>>>> Request]: Flink runner savepoint backward compatibility
>>>>> https://api.github.com/repos/apache/beam/issues/21893: [Bug]:
>>>>> BigQuery Storage Write API implementation does not support table
>>>>> partitioning
>>>>> https://api.github.com/repos/apache/beam/issues/21794: Dataflow
>>>>> runner creates a new timer whenever the output timestamp is change
>>>>> https://api.github.com/repos/apache/beam/issues/21763: [Playground
>>>>> Task]: Migrate from Google Analytics to Matomo Cloud
>>>>> https://api.github.com/repos/apache/beam/issues/21715: Data missing
>>>>> when using CassandraIO.Read
>>>>> https://api.github.com/repos/apache/beam/issues/21713: 404s in
>>>>> BigQueryIO don't get output to Failed Inserts PCollection
>>>>> https://api.github.com/repos/apache/beam/issues/21711: Python
>>>>> Streaming job failing to drain with BigQueryIO write errors
>>>>> https://api.github.com/repos/apache/beam/issues/21703:
>>>>> pubsublite.ReadWriteIT failing in beam_PostCommit_Java_DataflowV1 and V2
>>>>> https://api.github.com/repos/apache/beam/issues/21702: SpannerWriteIT
>>>>> failing in beam PostCommit Java V1
>>>>> https://api.github.com/repos/apache/beam/issues/21700:
>>>>> --dataflowServiceOptions=use_runner_v2 is broken
>>>>> https://api.github.com/repos/apache/beam/issues/21695:
>>>>> DataflowPipelineResult does not raise exception for unsuccessful states.
>>>>> https://api.github.com/repos/apache/beam/issues/21694: BigQuery
>>>>> Storage API insert with writeResult retry and write to error table
>>>>> https://api.github.com/repos/apache/beam/issues/21479: Install Python
>>>>> wheel and dependencies to local venv in SDK harness
>>>>> https://api.github.com/repos/apache/beam/issues/21478:
>>>>> KafkaIO.read.withDynamicRead() doesn't pick up new TopicPartitions
>>>>> https://api.github.com/repos/apache/beam/issues/21477: Add
>>>>> integration testing for BQ Storage API  write modes
>>>>> https://api.github.com/repos/apache/beam/issues/21476:
>>>>> WriteToBigQuery Dynamic table destinations returns wrong tableId
>>>>> https://api.github.com/repos/apache/beam/issues/21475: Beam x-lang
>>>>> Dataflow tests failing due to _InactiveRpcError
>>>>> https://api.github.com/repos/apache/beam/issues/21473:
>>>>> PVR_Spark2_Streaming perma-red
>>>>> https://api.github.com/repos/apache/beam/issues/21466: Simplify
>>>>> version override for Dev versions of the Go SDK.
>>>>> https://api.github.com/repos/apache/beam/issues/21465: Kafka commit
>>>>> offset drop data on failure for runners that have non-checkpointing 
>>>>> shuffle
>>>>> https://api.github.com/repos/apache/beam/issues/21269: Delete
>>>>> orphaned files
>>>>> https://api.github.com/repos/apache/beam/issues/21268: Race between
>>>>> member variable being accessed due to leaking uninitialized state via
>>>>> OutboundObserverFactory
>>>>> https://api.github.com/repos/apache/beam/issues/21267:
>>>>> WriteToBigQuery submits a duplicate BQ load job if a 503 error code is
>>>>> returned from googleapi
>>>>> https://api.github.com/repos/apache/beam/issues/21265:
>>>>> apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally
>>>>> 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible
>>>>> https://api.github.com/repos/apache/beam/issues/21263: (Broken Pipe
>>>>> induced) Bricked Dataflow Pipeline
>>>>> https://api.github.com/repos/apache/beam/issues/21262: Python
>>>>> AfterAny, AfterAll do not follow spec
>>>>> https://api.github.com/repos/apache/beam/issues/21260: Python
>>>>> DirectRunner does not emit data at GC time
>>>>> https://api.github.com/repos/apache/beam/issues/21259: Consumer group
>>>>> with random prefix
>>>>> https://api.github.com/repos/apache/beam/issues/21258: Dataflow error
>>>>> in CombinePerKey operation
>>>>> https://api.github.com/repos/apache/beam/issues/21257: Either Create
>>>>> or DirectRunner fails to produce all elements to the following transform
>>>>> https://api.github.com/repos/apache/beam/issues/21123: Multiple jobs
>>>>> running on Flink session cluster reuse the persistent Python environment.
>>>>> https://api.github.com/repos/apache/beam/issues/21119: Migrate to the
>>>>> next version of Python `requests` when released
>>>>> https://api.github.com/repos/apache/beam/issues/21117: "Java IO IT
>>>>> Tests" - missing data in grafana
>>>>> https://api.github.com/repos/apache/beam/issues/21115: JdbcIO date
>>>>> conversion is sensitive to OS
>>>>> https://api.github.com/repos/apache/beam/issues/21112: Dataflow
>>>>> SocketException (SSLException) error while trying to send message from
>>>>> Cloud Pub/Sub to BigQuery
>>>>> https://api.github.com/repos/apache/beam/issues/21111: Java creates
>>>>> an incorrect pipeline proto when core-construction-java jar is not in the
>>>>> CLASSPATH
>>>>> https://api.github.com/repos/apache/beam/issues/21110: codecov/patch
>>>>> has poor behavior
>>>>> https://api.github.com/repos/apache/beam/issues/21109: SDF
>>>>> BoundedSource seems to execute significantly slower than 'normal'
>>>>> BoundedSource
>>>>> https://api.github.com/repos/apache/beam/issues/21108:
>>>>> java.io.InvalidClassException With Flink Kafka
>>>>> https://api.github.com/repos/apache/beam/issues/20979: Portable
>>>>> runners should be able to issue checkpoints to Splittable DoFn
>>>>> https://api.github.com/repos/apache/beam/issues/20978:
>>>>> PubsubIO.readAvroGenericRecord creates SchemaCoder that fails to decode
>>>>> some Avro logical types
>>>>> https://api.github.com/repos/apache/beam/issues/20973: Python Beam
>>>>> SDK Harness hangs when installing pip packages
>>>>> https://api.github.com/repos/apache/beam/issues/20818: XmlIO.Read
>>>>> does not handle XML encoding per spec
>>>>> https://api.github.com/repos/apache/beam/issues/20814: JmsIO is not
>>>>> acknowledging messages correctly
>>>>> https://api.github.com/repos/apache/beam/issues/20813: No trigger
>>>>> early repeatedly for session windows
>>>>> https://api.github.com/repos/apache/beam/issues/20812: Cross-language
>>>>> consistency (RequiresStableInputs) is quietly broken (at least on portable
>>>>> flink runner)
>>>>> https://api.github.com/repos/apache/beam/issues/20692: Timer with
>>>>> dataflow runner can be set multiple times (dataflow runner)
>>>>> https://api.github.com/repos/apache/beam/issues/20691: Beam metrics
>>>>> should be displayed in Flink UI "Metrics" tab
>>>>> https://api.github.com/repos/apache/beam/issues/20689: Kafka
>>>>> commitOffsetsInFinalize OOM on Flink
>>>>> https://api.github.com/repos/apache/beam/issues/20532: Support for
>>>>> coder argument in WriteToBigQuery
>>>>> https://api.github.com/repos/apache/beam/issues/20531: FileBasedSink:
>>>>> allow setting temp directory provider per dynamic destination
>>>>> https://api.github.com/repos/apache/beam/issues/20530: Make
>>>>> non-portable Splittable DoFn the only option when executing Java "Read"
>>>>> transforms
>>>>> https://api.github.com/repos/apache/beam/issues/20529: SpannerIO
>>>>> tests don't actually assert anything.
>>>>> https://api.github.com/repos/apache/beam/issues/20528: python
>>>>> CombineGlobally().with_fanout() cause duplicate combine results for 
>>>>> sliding
>>>>> windows
>>>>> https://api.github.com/repos/apache/beam/issues/20333:
>>>>> beam_PerformanceTests_Kafka_IO failing due to " provided port is already
>>>>> allocated"
>>>>> https://api.github.com/repos/apache/beam/issues/20332: FileIO
>>>>> writeDynamic with AvroIO.sink not writing all data
>>>>> https://api.github.com/repos/apache/beam/issues/20330: Remove
>>>>> insecure ssl options from MongoDBIO
>>>>> https://api.github.com/repos/apache/beam/issues/20109: SortValues
>>>>> should fail if SecondaryKey coder is not deterministic
>>>>> https://api.github.com/repos/apache/beam/issues/20108: Python direct
>>>>> runner doesn't emit empty pane when it should
>>>>> https://api.github.com/repos/apache/beam/issues/20009:
>>>>> Environment-sensitive provisioning for Dataflow
>>>>> https://api.github.com/repos/apache/beam/issues/19971: [SQL] Some
>>>>> Hive tests throw NullPointerException, but get marked as passing (Direct
>>>>> Runner)
>>>>> https://api.github.com/repos/apache/beam/issues/19817: datetime and
>>>>> decimal should be logical types
>>>>> https://api.github.com/repos/apache/beam/issues/19815: Add support
>>>>> for remaining data types in python RowCoder
>>>>> https://api.github.com/repos/apache/beam/issues/19813: PubsubIO
>>>>> returns empty message bodies for all messages read
>>>>> https://api.github.com/repos/apache/beam/issues/19556: User reports
>>>>> protobuf ClassChangeError running against 2.6.0 or above
>>>>> https://api.github.com/repos/apache/beam/issues/19369: KafkaIO
>>>>> doesn't commit offsets while being used as bounded source
>>>>> https://api.github.com/repos/apache/beam/issues/17950: [Bug]: Java
>>>>> Precommit permared
>>>>>
>>>>
