> > 2. The links in this report start with api.github.* and don't take us directly to the issues.
>
> Yeah Danny pointed that out as well. I'm assuming he knows how to fix it?

> This is already fixed - Pablo actually beat me to it!
> https://github.com/apache/beam/pull/22033

It also adds a colon right after the URL, and some mail clients treat the colon as part of the URL, which leads to a broken link. Should we just remove the colon there, or add a space in between?

— Alexey
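For concreteness, the two options amount to a one-character change in how each report line is formatted - a minimal sketch in Python (the values and helper here are hypothetical, not the actual report code):

    # Hypothetical formatting sketch, not the actual report generator.
    url = "https://github.com/apache/beam/issues/21978"
    title = "[Playground] Implement Share Any Code feature on the frontend"

    line_a = f"{url} {title}"    # option 1: drop the colon entirely
    line_b = f"{url} : {title}"  # option 2: keep it, with a space before it,
                                 # so mail clients don't fold it into the URL
    print(line_a)
    print(line_b)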
> Thanks,
> Danny
>
> On Thu, Jun 23, 2022 at 8:30 PM Brian Hulette <bhule...@google.com> wrote:
> +1 for that proposal!
>
> > 1. P2 and P3 issues should be noticed and resolved as well. Shall we have a longer time window for the rest of the untriaged or stagnant issues and include them?
>
> I worry these lists would get _very_ long and wouldn't be actionable. But maybe it's worth reporting something like "There are 376 P2's with no update in the last 6 months" with a link to a query?
>
> > 2. The links in this report start with api.github.* and don't take us directly to the issues.
>
> Yeah Danny pointed that out as well. I'm assuming he knows how to fix it?
>
> On Thu, Jun 23, 2022 at 2:37 PM Pablo Estrada <pabl...@google.com> wrote:
> Thanks. I like the proposal, and I've found the emails useful.
> Best
> -P.
>
> On Thu, Jun 23, 2022 at 2:33 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
> Sounds good! It's like our internal reports of JIRA tickets that exceed SLA time with no response from engineers. We either resolve them or downgrade the priority to extend the time window.
>
> Besides,
> 1. P2 and P3 issues should be noticed and resolved as well. Shall we have a longer time window for the rest of the untriaged or stagnant issues and include them?
> 2. The links in this report start with api.github.* and don't take us directly to the issues.
>
> On Fri, Jun 24, 2022 at 04:48, Danny McCormick <dannymccorm...@google.com> wrote:
> That generally sounds right to me - I'd also vote that we consolidate to 1 email and stop distinguishing between flaky P1s and normal P1s.
>
> So the single daily report would be:
>
> - Unassigned P0s
> - P0s with no update in the last 36 hours
> - Unassigned P1s
> - P1s with no update in the last 7 days
>
> I think that will generate a pretty good list of issues that require some kind of action.
>
> On Thu, Jun 23, 2022 at 4:43 PM Kenneth Knowles <k...@apache.org> wrote:
> Sounds good to me. Perhaps P0s > 36 hours ago (presumably they are more like ~hours for true outages of CI/website/etc) and P1s > 7 days?
>
> On Thu, Jun 23, 2022 at 1:27 PM Brian Hulette <bhule...@google.com> wrote:
> I think Danny's alternate proposal (a daily email that shows only issues last updated >7 days ago, and those with no assignee) fits well with the two goals you describe, if we include "triage needed" issues in the latter category. Maybe we should also explicitly separate these two concerns in the report?
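The consolidated report proposed above maps directly onto GitHub issue-search queries. A minimal sketch of how such links could be built, assuming Python automation (the queries and cutoffs are illustrative, not the actual workflow code):

    # Illustrative GitHub search queries for the proposed consolidated
    # report, plus Brian's stale-P2 count reported as a linked query.
    from datetime import datetime, timedelta, timezone
    from urllib.parse import quote_plus

    def cutoff(delta: timedelta) -> str:
        # Date-level granularity: fine for 7 days / 6 months, coarse for 36 hours.
        return (datetime.now(timezone.utc) - delta).strftime("%Y-%m-%d")

    queries = {
        "Unassigned P0s":
            "repo:apache/beam is:issue is:open label:P0 no:assignee",
        "P0s with no update in the last 36 hours":
            f"repo:apache/beam is:issue is:open label:P0 updated:<{cutoff(timedelta(hours=36))}",
        "Unassigned P1s":
            "repo:apache/beam is:issue is:open label:P1 no:assignee",
        "P1s with no update in the last 7 days":
            f"repo:apache/beam is:issue is:open label:P1 updated:<{cutoff(timedelta(days=7))}",
        # Brian's stale-P2 idea: report a count with a link to the query.
        "P2s with no update in the last 6 months":
            f"repo:apache/beam is:issue is:open label:P2 updated:<{cutoff(timedelta(days=182))}",
    }

    for name, q in queries.items():
        print(f"{name}: https://github.com/search?type=issues&q={quote_plus(q)}")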
> On Thu, Jun 23, 2022 at 1:14 PM Kenneth Knowles <k...@apache.org> wrote:
> Forking thread because lots of people may just ignore this topic, per the discussion :-)
>
> (sometimes gmail doesn't fork thread properly, but here's hoping...)
>
> I'll add some other outcomes of these emails:
>
> - people file P0s that are not outages and P1s that are not data loss, and I downgrade them
> - I randomly open up a few flaky test bugs and see if I can fix them really quickly
> - people file legit P0s and P1s, and I subscribe and follow them
>
> Of these, only the last one seems important (not just that *I* follow them, but that new P0s and P1s get immediate attention from many eyes).
>
> So maybe one take on the goal is to:
>
> - have new P0s and P1s evaluated quickly: P0s are an outage or outage-like occurrence that needs immediate remedy, and P1s need to be evaluated for release blocking, etc.
> - make sure P0s and P1s get attention appropriate to their priority
>
> It can also be helpful to just state the failure modes that would happen by default if we don't have a good process or automation:
>
> - A real P0 gets filed and isn't noticed or fixed in a timely manner, blocking users and/or the community in real time
> - A real P1 gets filed and isn't noticed, so a release goes out with a known data loss bug or other total loss of functionality
> - Non-real P0s and P1s accumulate, throwing off our data and making it hard to find the real problems
> - Flakes are never fixed
>
> WDYT?
>
> If we have P0s and P1s in the "awaiting triage" state, those are the ones we need to notice. Then for a P0 or P1 outside of that state, we just need some way of making sure it doesn't stagnate. Or if it does stagnate, that empirically demonstrates it isn't really a P1 (just like our P2-to-P3 downgrade automation). If everything is P1, nothing is, as they say.
>
> Kenn
>
> On Thu, Jun 23, 2022 at 10:01 AM Danny McCormick <dannymccorm...@google.com> wrote:
>
> > Maybe it would be helpful to sort these by last update time (and potentially include that information in the email). Then we can at least prioritize them instead of looking at a big wall of issues.
>
> I agree that this is a good idea (and pretty trivial to do). I'll update the automation to do that once we get consensus on an approach.
>
> > I think the motivation for daily emails is that per the priorities guide [1] P1 issues should be getting "continuous status updates". If these issues aren't actually that important, I think the noise is good as it should motivate us to prioritize them correctly. In practice that hasn't been happening though...
>
> I guess the questions here are:
>
> 1) What is the goal of this email?
> 2) Is it effective at accomplishing that goal?
>
> I think you're saying that the goal (or a goal) is to highlight issues that aren't getting the attention they need. If that's our goal, then I don't think this is a particularly effective mechanism for it, because (a) it's very unclear which issues fall into that category, and (b) there are too many to manually go through on a daily basis. From the email alone, it's not clear to me that any of the issues above "shouldn't" be P1s (though, judging by the titles, I'd guess you're right that some/many of them don't belong, since most were created before the Jira -> GH migration). I'd also argue that a daily email just desensitizes us to them, since there will almost always be some valid P1s that don't need extra attention.
>
> I do still think this could have value as a weekly email, with the goal being "it's probably a good idea for someone to take a look at each of these".
>
> Another option would be to only include issues with no action in the last 7 days and/or no assignees, and keep it daily.
>
> A couple of side notes:
> - No matter what we do, if we keep the current automation in any form we should fix the url from https://api.github.com/repos/apache/beam/issues/# to https://github.com/apache/beam/issues/# - the current links are very annoying.
> - After I send this, I will do a pass over the current P1s, since it does indeed seem like too many are P1s and many should actually be P2s (or lower).
>
> Thanks,
> Danny
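The "sort by last update time" idea is supported natively by GitHub's REST list-issues endpoint (sort=updated). A minimal sketch, assuming a Python report generator (illustrative, not the actual automation):

    # Open P1s, stalest first, with the last-update time in each line.
    import requests

    resp = requests.get(
        "https://api.github.com/repos/apache/beam/issues",
        params={"labels": "P1", "state": "open",
                "sort": "updated", "direction": "asc", "per_page": 100},
    )
    resp.raise_for_status()
    for issue in resp.json():
        if "pull_request" in issue:
            continue  # the issues endpoint also returns PRs; skip them
        # html_url is the browser-facing link; updated_at shows staleness.
        print(f"{issue['updated_at']}  {issue['html_url']}: {issue['title']}")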
> On Thu, Jun 23, 2022 at 12:21 PM Brian Hulette <bhule...@google.com> wrote:
> I think the motivation for daily emails is that, per the priorities guide [1], P1 issues should be getting "continuous status updates". If these issues aren't actually that important, I think the noise is good, as it should motivate us to prioritize them correctly. In practice that hasn't been happening though...
>
> Maybe it would be helpful to sort these by last update time (and potentially include that information in the email). Then we can at least prioritize them instead of looking at a big wall of issues.
>
> Brian
>
> [1] https://beam.apache.org/contribute/issue-priorities/
>
> On Thu, Jun 23, 2022 at 6:07 AM Danny McCormick <dannymccorm...@google.com> wrote:
> I think a weekly summary seems like a good idea for the P1 issues and flaky tests, though daily still seems appropriate for P0 issues. I put up https://github.com/apache/beam/pull/22017 to send the P1/flaky-test reports only on Wednesdays. If anyone objects, please let me know - I'll wait until tomorrow to merge, to leave time for feedback (and it's always reversible 🙂).
>
> Thanks,
> Danny
>
> On Wed, Jun 22, 2022 at 7:05 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
> Hi all,
>
> What is this daily summary intended for? Not all issues look like P1. And would a weekly summary be less noisy?
>
> On Wed, Jun 22, 2022 at 23:45, <beamacti...@gmail.com> wrote:
> This is your daily summary of Beam's current P1 issues, not including flaky tests.
>
> See https://beam.apache.org/contribute/issue-priorities/#p1-critical for the meaning and expectations around P1 issues.
>
> https://api.github.com/repos/apache/beam/issues/21978: [Playground] Implement Share Any Code feature on the frontend
> https://api.github.com/repos/apache/beam/issues/21946: [Bug]: No way to read or write to file when running Beam in Flink
> https://api.github.com/repos/apache/beam/issues/21935: [Bug]: Reject illformed GBK Coders
> https://api.github.com/repos/apache/beam/issues/21897: [Feature Request]: Flink runner savepoint backward compatibility
> https://api.github.com/repos/apache/beam/issues/21893: [Bug]: BigQuery Storage Write API implementation does not support table partitioning
> https://api.github.com/repos/apache/beam/issues/21794: Dataflow runner creates a new timer whenever the output timestamp is change
> https://api.github.com/repos/apache/beam/issues/21763: [Playground Task]: Migrate from Google Analytics to Matomo Cloud
> https://api.github.com/repos/apache/beam/issues/21715: Data missing when using CassandraIO.Read
> https://api.github.com/repos/apache/beam/issues/21713: 404s in BigQueryIO don't get output to Failed Inserts PCollection
> https://api.github.com/repos/apache/beam/issues/21711: Python Streaming job failing to drain with BigQueryIO write errors
> https://api.github.com/repos/apache/beam/issues/21703: pubsublite.ReadWriteIT failing in beam_PostCommit_Java_DataflowV1 and V2
> https://api.github.com/repos/apache/beam/issues/21702: SpannerWriteIT failing in beam PostCommit Java V1
> https://api.github.com/repos/apache/beam/issues/21700: --dataflowServiceOptions=use_runner_v2 is broken
> https://api.github.com/repos/apache/beam/issues/21695: DataflowPipelineResult does not raise exception for unsuccessful states.
> https://api.github.com/repos/apache/beam/issues/21694: BigQuery Storage API insert with writeResult retry and write to error table
> https://api.github.com/repos/apache/beam/issues/21479: Install Python wheel and dependencies to local venv in SDK harness
> https://api.github.com/repos/apache/beam/issues/21478: KafkaIO.read.withDynamicRead() doesn't pick up new TopicPartitions
> https://api.github.com/repos/apache/beam/issues/21477: Add integration testing for BQ Storage API write modes
> https://api.github.com/repos/apache/beam/issues/21476: WriteToBigQuery Dynamic table destinations returns wrong tableId
> https://api.github.com/repos/apache/beam/issues/21475: Beam x-lang Dataflow tests failing due to _InactiveRpcError
> https://api.github.com/repos/apache/beam/issues/21473: PVR_Spark2_Streaming perma-red
> https://api.github.com/repos/apache/beam/issues/21466: Simplify version override for Dev versions of the Go SDK.
> https://api.github.com/repos/apache/beam/issues/21465: Kafka commit offset drop data on failure for runners that have non-checkpointing shuffle
> https://api.github.com/repos/apache/beam/issues/21269: Delete orphaned files
> https://api.github.com/repos/apache/beam/issues/21268: Race between member variable being accessed due to leaking uninitialized state via OutboundObserverFactory
> https://api.github.com/repos/apache/beam/issues/21267: WriteToBigQuery submits a duplicate BQ load job if a 503 error code is returned from googleapi
> https://api.github.com/repos/apache/beam/issues/21265: apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible
> https://api.github.com/repos/apache/beam/issues/21263: (Broken Pipe induced) Bricked Dataflow Pipeline
> https://api.github.com/repos/apache/beam/issues/21262: Python AfterAny, AfterAll do not follow spec
> https://api.github.com/repos/apache/beam/issues/21260: Python DirectRunner does not emit data at GC time
> https://api.github.com/repos/apache/beam/issues/21259: Consumer group with random prefix
> https://api.github.com/repos/apache/beam/issues/21258: Dataflow error in CombinePerKey operation
> https://api.github.com/repos/apache/beam/issues/21257: Either Create or DirectRunner fails to produce all elements to the following transform
> https://api.github.com/repos/apache/beam/issues/21123: Multiple jobs running on Flink session cluster reuse the persistent Python environment.
> https://api.github.com/repos/apache/beam/issues/21119: Migrate to the next version of Python `requests` when released
> https://api.github.com/repos/apache/beam/issues/21117: "Java IO IT Tests" - missing data in grafana
> https://api.github.com/repos/apache/beam/issues/21115: JdbcIO date conversion is sensitive to OS
> https://api.github.com/repos/apache/beam/issues/21112: Dataflow SocketException (SSLException) error while trying to send message from Cloud Pub/Sub to BigQuery
> https://api.github.com/repos/apache/beam/issues/21111: Java creates an incorrect pipeline proto when core-construction-java jar is not in the CLASSPATH
> https://api.github.com/repos/apache/beam/issues/21110: codecov/patch has poor behavior
> https://api.github.com/repos/apache/beam/issues/21109: SDF BoundedSource seems to execute significantly slower than 'normal' BoundedSource
> https://api.github.com/repos/apache/beam/issues/21108: java.io.InvalidClassException With Flink Kafka
> https://api.github.com/repos/apache/beam/issues/20979: Portable runners should be able to issue checkpoints to Splittable DoFn
> https://api.github.com/repos/apache/beam/issues/20978: PubsubIO.readAvroGenericRecord creates SchemaCoder that fails to decode some Avro logical types
> https://api.github.com/repos/apache/beam/issues/20973: Python Beam SDK Harness hangs when installing pip packages
> https://api.github.com/repos/apache/beam/issues/20818: XmlIO.Read does not handle XML encoding per spec
> https://api.github.com/repos/apache/beam/issues/20814: JmsIO is not acknowledging messages correctly
> https://api.github.com/repos/apache/beam/issues/20813: No trigger early repeatedly for session windows
> https://api.github.com/repos/apache/beam/issues/20812: Cross-language consistency (RequiresStableInputs) is quietly broken (at least on portable flink runner)
> https://api.github.com/repos/apache/beam/issues/20692: Timer with dataflow runner can be set multiple times (dataflow runner)
> https://api.github.com/repos/apache/beam/issues/20691: Beam metrics should be displayed in Flink UI "Metrics" tab
> https://api.github.com/repos/apache/beam/issues/20689: Kafka commitOffsetsInFinalize OOM on Flink
> https://api.github.com/repos/apache/beam/issues/20532: Support for coder argument in WriteToBigQuery
> https://api.github.com/repos/apache/beam/issues/20531: FileBasedSink: allow setting temp directory provider per dynamic destination
> https://api.github.com/repos/apache/beam/issues/20530: Make non-portable Splittable DoFn the only option when executing Java "Read" transforms
> https://api.github.com/repos/apache/beam/issues/20529: SpannerIO tests don't actually assert anything.
> https://api.github.com/repos/apache/beam/issues/20528: python CombineGlobally().with_fanout() cause duplicate combine results for sliding windows
> https://api.github.com/repos/apache/beam/issues/20333: beam_PerformanceTests_Kafka_IO failing due to " provided port is already allocated"
> https://api.github.com/repos/apache/beam/issues/20332: FileIO writeDynamic with AvroIO.sink not writing all data
> https://api.github.com/repos/apache/beam/issues/20330: Remove insecure ssl options from MongoDBIO
> https://api.github.com/repos/apache/beam/issues/20109: SortValues should fail if SecondaryKey coder is not deterministic
> https://api.github.com/repos/apache/beam/issues/20108: Python direct runner doesn't emit empty pane when it should
> https://api.github.com/repos/apache/beam/issues/20009: Environment-sensitive provisioning for Dataflow
> https://api.github.com/repos/apache/beam/issues/19971: [SQL] Some Hive tests throw NullPointerException, but get marked as passing (Direct Runner)
> https://api.github.com/repos/apache/beam/issues/19817: datetime and decimal should be logical types
> https://api.github.com/repos/apache/beam/issues/19815: Add support for remaining data types in python RowCoder
> https://api.github.com/repos/apache/beam/issues/19813: PubsubIO returns empty message bodies for all messages read
> https://api.github.com/repos/apache/beam/issues/19556: User reports protobuf ClassChangeError running against 2.6.0 or above
> https://api.github.com/repos/apache/beam/issues/19369: KafkaIO doesn't commit offsets while being used as bounded source
> https://api.github.com/repos/apache/beam/issues/17950: [Bug]: Java Precommit permared
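A footnote on the links above: the api.github.* URLs are the issues' API `url` field, while the browser-facing link is the `html_url` field on the same objects. Where only the API URL is at hand, it can also be rewritten - a minimal sketch (hypothetical helper, not necessarily how PR 22033 fixes it):

    import re

    def to_html_url(api_url: str) -> str:
        # https://api.github.com/repos/apache/beam/issues/21978
        #   -> https://github.com/apache/beam/issues/21978
        return re.sub(
            r"^https://api\.github\.com/repos/(.+?)/issues/(\d+)$",
            r"https://github.com/\1/issues/\2",
            api_url,
        )

    assert (to_html_url("https://api.github.com/repos/apache/beam/issues/21978")
            == "https://github.com/apache/beam/issues/21978")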