Thanks, Danny!

> On 24 Jun 2022, at 19:23, Danny McCormick <dannymccorm...@google.com> wrote:
> 
> Sure, I put up a fix - https://github.com/apache/beam/pull/22048
> On Fri, Jun 24, 2022 at 1:20 PM Alexey Romanenko <aromanenko....@gmail.com> wrote:
> 
> 
>> > 2. The links in this report start with api.github.* and don’t take us 
>> > directly to the issues.
>> 
>> > Yeah Danny pointed that out as well. I'm assuming he knows how to fix it?
>> 
>> This is already fixed - Pablo actually beat me to it! 
>> https://github.com/apache/beam/pull/22033
> It also adds a colon right after the URL, and some mail clients treat the 
> colon as part of the URL, which leads to a broken link.
> Should we just remove the colon there or add a space in between?
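> 
> To make it concrete, a minimal sketch of the two options, assuming the report 
> script assembles each line with Python f-strings (the variable names are 
> hypothetical):
> 
>     # Current: the colon is glued to the URL, so some clients hyperlink ".../22033:"
>     line = f"{url}: {title}"
>     # Option 1: drop the colon entirely
>     line = f"{url} {title}"
>     # Option 2: keep the colon but separate it from the URL with a space
>     line = f"{url} : {title}"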
> 
> —
> Alexey
> 
>> 
>> Thanks,
>> Danny
>> 
>> On Thu, Jun 23, 2022 at 8:30 PM Brian Hulette <bhule...@google.com> wrote:
>> +1 for that proposal!
>> 
>> > 1. P2 and P3 issues should be noticed and resolved as well. Shall we have 
>> > a longer time window for the rest of the untriaged or stagnant issues and 
>> > include them?
>> 
>> I worry these lists would get _very_ long and wouldn't be actionable. But 
>> maybe it's worth reporting something like "There are 376 P2s with no update 
>> in the last 6 months" with a link to a query?
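>> 
>> For what it's worth, that query could be a plain GitHub issue search - a 
>> sketch, assuming we use a literal "P2" label; the date is whatever "6 months 
>> ago" works out to at report time:
>> 
>>     repo:apache/beam is:issue is:open label:P2 updated:<2021-12-23
>> 
>> The report would then only need to print the result count plus a link to 
>> that search.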
>> 
>> > 2. The links in this report start with api.github.* and don’t take us 
>> > directly to the issues.
>> 
>> Yeah Danny pointed that out as well. I'm assuming he knows how to fix it?
>> 
>> On Thu, Jun 23, 2022 at 2:37 PM Pablo Estrada <pabl...@google.com> wrote:
>> Thanks. I like the proposal, and I've found the emails useful.
>> Best
>> -P.
>> 
>> On Thu, Jun 23, 2022 at 2:33 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>> Sounds good! It's like our internal reports of JIRA tickets that exceed the 
>> SLA time with no response from engineers. We either resolve them or 
>> downgrade the priority to extend the time window.
>> 
>> Besides:
>> 1. P2 and P3 issues should be noticed and resolved as well. Shall we have a 
>> longer time window for the rest of the untriaged or stagnant issues and 
>> include them?
>> 2. The links in this report start with api.github.* and don’t take us 
>> directly to the issues.
>> 
>> 
>> Danny McCormick <dannymccorm...@google.com> wrote on Fri, 24 Jun 2022 at 04:48:
>> That generally sounds right to me - I also would vote that we consolidate to 
>> 1 email and stop distinguishing between flaky P1s and normal P1s.
>> 
>> So the single daily report would be:
>> 
>> - Unassigned P0s
>> - P0s with no update in the last 36 hours
>> - Unassigned P1s
>> - P1s with no update in the last 7 days
>> 
>> I think that will generate a pretty good list of issues that require some 
>> kind of action.
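>> 
>> For reference, each bucket maps onto a GitHub issue search - a sketch, 
>> assuming literal "P0"/"P1" labels and approximating "no update" with the 
>> updated: qualifier (the {...} placeholders would be filled in by the 
>> automation at send time):
>> 
>>     repo:apache/beam is:issue is:open label:P0 no:assignee
>>     repo:apache/beam is:issue is:open label:P0 updated:<{now minus 36 hours}
>>     repo:apache/beam is:issue is:open label:P1 no:assignee
>>     repo:apache/beam is:issue is:open label:P1 updated:<{now minus 7 days}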
>> 
>> On Thu, Jun 23, 2022 at 4:43 PM Kenneth Knowles <k...@apache.org> wrote:
>> Sounds good to me. Perhaps P0s > 36 hours ago (presumably they are more like 
>> ~hours for true outages of CI/website/etc) and P1s > 7 days?
>> 
>> On Thu, Jun 23, 2022 at 1:27 PM Brian Hulette <bhule...@google.com> wrote:
>> I think that Danny's alternate proposal (a daily email that shows only issues 
>> last updated >7 days ago, and those with no assignee) fits well with the two 
>> goals you describe, if we include "triage needed" issues in the latter 
>> category. Maybe we also explicitly separate these two concerns in the report?
>> 
>> 
>> On Thu, Jun 23, 2022 at 1:14 PM Kenneth Knowles <k...@apache.org> wrote:
>> Forking thread because lots of people may just ignore this topic, per the 
>> discussion :-)
>> 
>> (sometimes gmail doesn't fork thread properly, but here's hoping...)
>> 
>> I'll add some other outcomes of these emails:
>> 
>>  - people file P0s that are not outages and P1s that are not data loss and I 
>> downgrade them
>>  - I randomly open up a few flaky test bugs and see if I can fix them really 
>> quick
>>  - people file legit P0s and P1s and I subscribe and follow them
>> 
>> Of these, only the last one seems important (not just that *I* follow them, 
>> but that new P0s and P1s get immediate attention from many eyes).
>> 
>> So maybe one take on the goal is to:
>> 
>>  - have new P0s and P1s evaluated quickly: P0s are an outage or outage-like 
>> occurrence that needs immediate remedy, and P1s need to be evaluated for 
>> release blocking, etc.
>>  - make sure P0s and P1s get attention appropriate to their priority
>> 
>> It can also be helpful to just state the failure modes which would happen by 
>> default if we don't have a good process or automation:
>> 
>>  - Real P0 gets filed and not noticed or fixed in a timely manner, blocking 
>> users and/or community in real time
>>  - Real P1 gets filed and not noticed, so release goes out with known data 
>> loss bug or other total loss of functionality
>>  - Non-real P0s and P1s accumulate, throwing off our data and making it hard 
>> to find the real problems
>>  - Flakes are never fixed
>> 
>> WDYT?
>> 
>> If we have P0s and P1s in the "awaiting triage" state, those are the ones we 
>> need to notice. Then for a P0 or P1 outside of that state, we just need some 
>> way of making sure it doesn't stagnate. Or if it does stagnate, that 
>> empirically demonstrates it isn't really P1 (just like our P2 to P3 
>> downgrade automation). If everything is P1, nothing is, as they say.
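>> 
>> In search terms, that split is easy to express (a sketch, assuming the label 
>> is literally "awaiting triage"):
>> 
>>     # Needs noticing: new high-priority issues nobody has looked at yet
>>     repo:apache/beam is:issue is:open label:P1 label:"awaiting triage"
>>     # Needs a stagnation check: everything already triaged
>>     repo:apache/beam is:issue is:open label:P1 -label:"awaiting triage"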
>> 
>> Kenn
>> 
>> On Thu, Jun 23, 2022 at 10:01 AM Danny McCormick <dannymccorm...@google.com> wrote:
>> > Maybe it would be helpful to sort these by last update time (and 
>> > potentially include that information in the email). Then we can at least 
>> > prioritize them instead of looking at a big wall of issues.
>> 
>> I agree that this is a good idea (and pretty trivial to do). I'll update the 
>> automation to do that once we get consensus on an approach.
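>> 
>> (The sort itself is a one-liner once the issues are in hand - a sketch, 
>> assuming the automation fetches them as the JSON dicts the GitHub REST API 
>> returns:
>> 
>>     # Oldest update first, so the most neglected issues top the email
>>     issues.sort(key=lambda issue: issue["updated_at"])
>> 
>> "updated_at" is an ISO-8601 timestamp, so string order matches time order.)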
>> 
>> > I think the motivation for daily emails is that per the priorities guide 
>> > [1] P1 issues should be getting "continuous status updates". If these 
>> > issues aren't actually that important, I think the noise is good as it 
>> > should motivate us to prioritize them correctly. In practice that hasn't 
>> > been happening though...
>> 
>> I guess the questions here are:
>> 
>> 1) What is the goal of this email?
>> 2) Is it effective at accomplishing that goal?
>> 
>> I think you're saying that the goal (or a goal) is to highlight issues that 
>> aren't getting the attention they need; if that's our goal, then I don't 
>> think this is a particularly effective mechanism for it because (a) it's very 
>> unclear which issues fall into that category and (b) there are too many to 
>> manually go through on a daily basis. From the email alone, it's not clear 
>> to me that any of the issues above "shouldn't" be P1s (though I'd guess 
>> you're right that some/many of them don't belong, since, judging by the 
>> titles, most were created before the Jira -> GH migration). I'd also argue 
>> that a daily email just desensitizes us to them, since there will almost 
>> always be some valid P1s that don't need extra attention.
>> 
>> I do still think this could have value as a weekly email, with the goal 
>> being "it's probably a good idea for someone to take a look at each of 
>> these". Another option would be to only include issues with no action in the 
>> last 7 days and/or no assignees and keep it daily.
>> 
>> A couple of side notes:
>> - No matter what we do, if we keep the current automation in any form we 
>> should fix the URL from https://api.github.com/repos/apache/beam/issues/# to 
>> https://github.com/apache/beam/issues/# - the current links are very 
>> annoying (see the sketch after these notes).
>> - After I send this, I will do a pass over the current P1s, since it does 
>> indeed seem like too many are P1s and many should actually be P2s (or lower).
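>> 
>> Sketch of the URL fix from the first note, assuming the automation reads 
>> issues from the REST API as JSON - the response already carries both forms, 
>> so no string surgery is needed:
>> 
>>     # "url" is the API endpoint (api.github.com); "html_url" is the browser link
>>     link = issue["html_url"]  # -> https://github.com/apache/beam/issues/<number>
>> 
>> If only the API URL were available, 
>> issue["url"].replace("api.github.com/repos", "github.com") would get the 
>> same result.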
>> 
>> Thanks,
>> Danny
>> 
>> On Thu, Jun 23, 2022 at 12:21 PM Brian Hulette <bhule...@google.com> wrote:
>> I think the motivation for daily emails is that per the priorities guide [1] 
>> P1 issues should be getting "continuous status updates". If these issues 
>> aren't actually that important, I think the noise is good as it should 
>> motivate us to prioritize them correctly. In practice that hasn't been 
>> happening though...
>> 
>> Maybe it would be helpful to sort these by last update time (and potentially 
>> include that information in the email). Then we can at least prioritize them 
>> instead of looking at a big wall of issues.
>> 
>> Brian
>> 
>> [1] https://beam.apache.org/contribute/issue-priorities/
>> On Thu, Jun 23, 2022 at 6:07 AM Danny McCormick <dannymccorm...@google.com> wrote:
>> I think a weekly summary seems like a good idea for the P1 issues and flaky 
>> tests, though daily still seems appropriate for P0 issues. I put up 
>> https://github.com/apache/beam/pull/22017 to just send the P1/flaky test 
>> reports on Wednesdays. If anyone objects, please let me know - I'll wait on 
>> merging till tomorrow to leave time for feedback (and it's always reversible 
>> 🙂).
>> 
>> Thanks,
>> Danny
>> 
>> On Wed, Jun 22, 2022 at 7:05 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>> Hi all,
>> 
>> What is this daily summary intended for? Not all issues look like P1s. And 
>> would a weekly summary be less noisy?
>> 
>> <beamacti...@gmail.com> wrote on Wed, 22 Jun 2022 at 23:45:
>> This is your daily summary of Beam's current P1 issues, not including flaky 
>> tests.
>> 
>>     See https://beam.apache.org/contribute/issue-priorities/#p1-critical for 
>> the meaning and expectations around P1 issues.
>> 
>> 
>> 
>> https://api.github.com/repos/apache/beam/issues/21978: [Playground] Implement Share Any Code feature on the frontend
>> https://api.github.com/repos/apache/beam/issues/21946: [Bug]: No way to read or write to file when running Beam in Flink
>> https://api.github.com/repos/apache/beam/issues/21935: [Bug]: Reject illformed GBK Coders
>> https://api.github.com/repos/apache/beam/issues/21897: [Feature Request]: Flink runner savepoint backward compatibility
>> https://api.github.com/repos/apache/beam/issues/21893: [Bug]: BigQuery Storage Write API implementation does not support table partitioning
>> https://api.github.com/repos/apache/beam/issues/21794: Dataflow runner creates a new timer whenever the output timestamp is change
>> https://api.github.com/repos/apache/beam/issues/21763: [Playground Task]: Migrate from Google Analytics to Matomo Cloud
>> https://api.github.com/repos/apache/beam/issues/21715: Data missing when using CassandraIO.Read
>> https://api.github.com/repos/apache/beam/issues/21713: 404s in BigQueryIO don't get output to Failed Inserts PCollection
>> https://api.github.com/repos/apache/beam/issues/21711: Python Streaming job failing to drain with BigQueryIO write errors
>> https://api.github.com/repos/apache/beam/issues/21703: pubsublite.ReadWriteIT failing in beam_PostCommit_Java_DataflowV1 and V2
>> https://api.github.com/repos/apache/beam/issues/21702: SpannerWriteIT failing in beam PostCommit Java V1
>> https://api.github.com/repos/apache/beam/issues/21700: --dataflowServiceOptions=use_runner_v2 is broken
>> https://api.github.com/repos/apache/beam/issues/21695: DataflowPipelineResult does not raise exception for unsuccessful states.
>> https://api.github.com/repos/apache/beam/issues/21694: BigQuery Storage API insert with writeResult retry and write to error table
>> https://api.github.com/repos/apache/beam/issues/21479: Install Python wheel and dependencies to local venv in SDK harness
>> https://api.github.com/repos/apache/beam/issues/21478: KafkaIO.read.withDynamicRead() doesn't pick up new TopicPartitions
>> https://api.github.com/repos/apache/beam/issues/21477: Add integration testing for BQ Storage API write modes
>> https://api.github.com/repos/apache/beam/issues/21476: WriteToBigQuery Dynamic table destinations returns wrong tableId
>> https://api.github.com/repos/apache/beam/issues/21475: Beam x-lang Dataflow tests failing due to _InactiveRpcError
>> https://api.github.com/repos/apache/beam/issues/21473: PVR_Spark2_Streaming perma-red
>> https://api.github.com/repos/apache/beam/issues/21466: Simplify version override for Dev versions of the Go SDK.
>> https://api.github.com/repos/apache/beam/issues/21465: Kafka commit offset drop data on failure for runners that have non-checkpointing shuffle
>> https://api.github.com/repos/apache/beam/issues/21269: Delete orphaned files
>> https://api.github.com/repos/apache/beam/issues/21268: Race between member variable being accessed due to leaking uninitialized state via OutboundObserverFactory
>> https://api.github.com/repos/apache/beam/issues/21267: WriteToBigQuery submits a duplicate BQ load job if a 503 error code is returned from googleapi
>> https://api.github.com/repos/apache/beam/issues/21265: apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible
>> https://api.github.com/repos/apache/beam/issues/21263: (Broken Pipe induced) Bricked Dataflow Pipeline
>> https://api.github.com/repos/apache/beam/issues/21262: Python AfterAny, AfterAll do not follow spec
>> https://api.github.com/repos/apache/beam/issues/21260: Python DirectRunner does not emit data at GC time
>> https://api.github.com/repos/apache/beam/issues/21259: Consumer group with random prefix
>> https://api.github.com/repos/apache/beam/issues/21258: Dataflow error in CombinePerKey operation
>> https://api.github.com/repos/apache/beam/issues/21257: Either Create or DirectRunner fails to produce all elements to the following transform
>> https://api.github.com/repos/apache/beam/issues/21123: Multiple jobs running on Flink session cluster reuse the persistent Python environment.
>> https://api.github.com/repos/apache/beam/issues/21119: Migrate to the next version of Python `requests` when released
>> https://api.github.com/repos/apache/beam/issues/21117: "Java IO IT Tests" - missing data in grafana
>> https://api.github.com/repos/apache/beam/issues/21115: JdbcIO date conversion is sensitive to OS
>> https://api.github.com/repos/apache/beam/issues/21112: Dataflow SocketException (SSLException) error while trying to send message from Cloud Pub/Sub to BigQuery
>> https://api.github.com/repos/apache/beam/issues/21111: Java creates an incorrect pipeline proto when core-construction-java jar is not in the CLASSPATH
>> https://api.github.com/repos/apache/beam/issues/21110: codecov/patch has poor behavior
>> https://api.github.com/repos/apache/beam/issues/21109: SDF BoundedSource seems to execute significantly slower than 'normal' BoundedSource
>> https://api.github.com/repos/apache/beam/issues/21108: java.io.InvalidClassException With Flink Kafka
>> https://api.github.com/repos/apache/beam/issues/20979: Portable runners should be able to issue checkpoints to Splittable DoFn
>> https://api.github.com/repos/apache/beam/issues/20978: PubsubIO.readAvroGenericRecord creates SchemaCoder that fails to decode some Avro logical types
>> https://api.github.com/repos/apache/beam/issues/20973: Python Beam SDK Harness hangs when installing pip packages
>> https://api.github.com/repos/apache/beam/issues/20818: XmlIO.Read does not handle XML encoding per spec
>> https://api.github.com/repos/apache/beam/issues/20814: JmsIO is not acknowledging messages correctly
>> https://api.github.com/repos/apache/beam/issues/20813: No trigger early repeatedly for session windows
>> https://api.github.com/repos/apache/beam/issues/20812: Cross-language consistency (RequiresStableInputs) is quietly broken (at least on portable flink runner)
>> https://api.github.com/repos/apache/beam/issues/20692: Timer with dataflow runner can be set multiple times (dataflow runner)
>> https://api.github.com/repos/apache/beam/issues/20691: Beam metrics should be displayed in Flink UI "Metrics" tab
>> https://api.github.com/repos/apache/beam/issues/20689: Kafka commitOffsetsInFinalize OOM on Flink
>> https://api.github.com/repos/apache/beam/issues/20532: Support for coder argument in WriteToBigQuery
>> https://api.github.com/repos/apache/beam/issues/20531: FileBasedSink: allow setting temp directory provider per dynamic destination
>> https://api.github.com/repos/apache/beam/issues/20530: Make non-portable Splittable DoFn the only option when executing Java "Read" transforms
>> https://api.github.com/repos/apache/beam/issues/20529: SpannerIO tests don't actually assert anything.
>> https://api.github.com/repos/apache/beam/issues/20528: python CombineGlobally().with_fanout() cause duplicate combine results for sliding windows
>> https://api.github.com/repos/apache/beam/issues/20333: beam_PerformanceTests_Kafka_IO failing due to " provided port is already allocated"
>> https://api.github.com/repos/apache/beam/issues/20332: FileIO writeDynamic with AvroIO.sink not writing all data
>> https://api.github.com/repos/apache/beam/issues/20330: Remove insecure ssl options from MongoDBIO
>> https://api.github.com/repos/apache/beam/issues/20109: SortValues should fail if SecondaryKey coder is not deterministic
>> https://api.github.com/repos/apache/beam/issues/20108: Python direct runner doesn't emit empty pane when it should
>> https://api.github.com/repos/apache/beam/issues/20009: Environment-sensitive provisioning for Dataflow
>> https://api.github.com/repos/apache/beam/issues/19971: [SQL] Some Hive tests throw NullPointerException, but get marked as passing (Direct Runner)
>> https://api.github.com/repos/apache/beam/issues/19817: datetime and decimal should be logical types
>> https://api.github.com/repos/apache/beam/issues/19815: Add support for remaining data types in python RowCoder
>> https://api.github.com/repos/apache/beam/issues/19813: PubsubIO returns empty message bodies for all messages read
>> https://api.github.com/repos/apache/beam/issues/19556: User reports protobuf ClassChangeError running against 2.6.0 or above
>> https://api.github.com/repos/apache/beam/issues/19369: KafkaIO doesn't commit offsets while being used as bounded source
>> https://api.github.com/repos/apache/beam/issues/17950: [Bug]: Java Precommit permared
> 
