Streaming update compatibility
Dataflow (among other runners) has the ability to "upgrade" running pipelines with new code, e.g. to pick up bug fixes, dependency updates, and limited topology changes. Unfortunately, some improvements (new and improved ways of writing to BigQuery, optimized use of side inputs, a change in algorithm, sometimes entirely internal and not visible to the user) are not sufficiently backwards compatible. To avoid breaking users, we either don't make these changes at all or guard them behind a parallel opt-in mode, which is a significant drain on developer productivity and means new pipelines run in obsolete modes by default.

I created https://github.com/apache/beam/pull/29140, which adds a new pipeline option, update_compatibility_version, that lets the SDK move forward while allowing users with previously launched pipelines to manually request the "old" way of doing things to preserve update compatibility. (We should still attempt backwards compatibility when it makes sense, and the old code paths would remain until actually deprecated and removed, but this means we won't be constrained by them, especially when it comes to default settings.)

Any objections or other thoughts on this approach?

- Robert

P.S. Separately, I think it'd be valuable to elevate the vague notion of update compatibility to a first-class Beam concept and put it on firm footing, but that's a larger conversation outside the scope of this smaller (and, I think, still useful in such a future world) change.
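For concreteness, here is a minimal sketch (an illustration, not code from the thread) of how a pipeline author might use the new option when submitting a replacement job. The update_compatibility_version option is the one added by the PR above; the project, region, topic, table, job name, and pinned SDK version are all hypothetical placeholders.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",          # hypothetical
        region="us-central1",          # hypothetical
        streaming=True,
        update=True,                   # request an in-place update of a running job
        job_name="my-streaming-job",   # hypothetical: the job being updated
        # Generate the pipeline the way the 2.50.0 SDK did, so the new graph
        # stays update-compatible with the job launched on that version.
        update_compatibility_version="2.50.0",
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
         | "Parse" >> beam.Map(lambda msg: {"payload": msg.decode("utf-8")})
         | "Write" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.events",  # hypothetical table
               schema="payload:STRING"))

The idea is that the newer SDK keeps its improved defaults for fresh pipelines, while a job launched on an older version can keep requesting the old behavior across updates.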
Re: [Discuss] Idea to increase RC voting participation
> One easy and standard way to make it more resilient would be to make it idempotent instead of counting on uptime or receiving any particular event.

Yep, agreed that this wouldn't be super hard if someone wants to take it on. Basically, it would just mean updating the tool to run on a schedule and look for issues that have been closed as completed in the last N days (more or less this query - https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01+).

I have seen some milestones intentionally removed from issues after the bot adds them (probably because it's non-obvious that you can mark an issue as not planned instead), so we'd probably want to account for that and no-op if a milestone was removed post-close.

One downside of this approach is that you significantly increase the chances of an issue getting assigned to the wrong milestone if it comes in around the cut; you'd need to either account for this by checking out the repo to get the version at the time the issue was closed (expensive/non-trivial) or live with this downside. It's probably an ok downside to live with.

You could also do a hybrid approach where you run on issue close and run a scheduled or manual pre-release step to clean up any stragglers. This would be the most robust option. (A rough sketch of the scheduled cleanup step follows this message.)

On Wed, Oct 25, 2023 at 7:43 AM Kenneth Knowles wrote:

> Agree. As long as we are getting enough of them, then our records as well as any automation depending on it are fine. One easy and standard way to make it more resilient would be to make it idempotent instead of counting on uptime or receiving any particular event.
>
> Kenn
>
> On Tue, Oct 24, 2023 at 2:58 PM Danny McCormick wrote:
>
>> Looks like for some reason the workflow didn't trigger. This is running on GitHub's hosted runners, so my best guess is an outage.
>>
>> Looking at a more refined query, this year there have been 14 issues that were missed by the automation (3 had their milestone manually removed) - https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01 - out of 605 total - https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01+ - as best I can tell there were a small number of workflow flakes and then GHA didn't correctly trigger a few.
>>
>> If we wanted, we could set up some recurring automation to go through and try to pick up the ones without milestones (or modify our existing automation to be more tolerant to failures), but it doesn't seem super urgent to me (feel free to disagree). I don't think this piece needs to be perfect.
>>
>> On Tue, Oct 24, 2023 at 2:40 PM Kenneth Knowles wrote:
>>
>>> Just grabbing one at random for an example, https://github.com/apache/beam/issues/28635 seems like it was closed as completed but not tagged.
>>>
>>> I'm happy to see that the bot reads the version from the repo to find the appropriate milestone, rather than using the nearest open one. Just recording that for the thread since I first read the description as the latter.
>>>
>>> Kenn
>>>
>>> On Tue, Oct 24, 2023 at 2:34 PM Danny McCormick via dev <dev@beam.apache.org> wrote:
>>>
>>>> We do tag issues to milestones when the issue is marked as "completed" (as opposed to "not planned") - https://github.com/apache/beam/blob/master/.github/workflows/assign_milestone.yml. So I think using issues is probably about as accurate as using commits.
>>>>
>>>>> It looks like we have 820 with no milestone https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>>>>
>>>> Most predate the automation, though maybe not all? Some of those may have been closed as "not planned".
>>>>
>>>>> This could (should) be automatically discoverable. A (closed) issue is associated with commits which are associated with a release.
>>>>
>>>> Today, we just tag issues to the upcoming milestone when they're closed. In theory you could do something more sophisticated using linked commits, but in practice people aren't clean enough about linking commits to issues. Again, this is fixable by automation/enforcement, but I don't think it actually gives us much value beyond what we have today.
>>>>
>>>> On Tue, Oct 24, 2023 at 1:54 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>>>>
>>>>> On Tue, Oct 24, 2023 at 10:35 AM Kenneth Knowles wrote:
>>>>>
>>>>>> Tangentially related:
>>>>>>
>>>>>> Long ago, attaching an issue to a release was a mandatory step as part of closing. Now I think it is not. Is it automatically happening? It looks like we have 820 with no milestone https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>>>>>
>>>>> This could (should) be automatically discoverable. A (closed) issue is associated with commits which are associated with a release.
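As a concrete illustration of the idempotent cleanup Danny describes, here is a rough sketch (an assumption-laden sketch, not the actual Beam workflow): run on a schedule, search for issues closed as completed in the last N days that still lack a milestone, skip any whose milestone was deliberately removed after close, and assign the current release milestone to the rest. The GITHUB_TOKEN variable and milestone number are hypothetical configuration.

    import datetime
    import os
    import requests

    API = "https://api.github.com"
    REPO = "apache/beam"
    HEADERS = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }

    def milestone_was_removed(issue_number: int) -> bool:
        """True if someone removed a milestone post-close (the no-op case)."""
        events = requests.get(
            f"{API}/repos/{REPO}/issues/{issue_number}/events", headers=HEADERS)
        events.raise_for_status()
        return any(e.get("event") == "demilestoned" for e in events.json())

    def assign_stragglers(milestone_number: int, days: int = 7) -> None:
        since = (datetime.date.today() - datetime.timedelta(days=days)).isoformat()
        # Essentially the query from the thread, restricted to a trailing
        # window and to issues that still have no milestone.
        q = (f"repo:{REPO} is:issue is:closed reason:completed "
             f"no:milestone closed:>={since}")
        search = requests.get(f"{API}/search/issues", headers=HEADERS,
                              params={"q": q, "per_page": 100})
        search.raise_for_status()
        for issue in search.json()["items"]:
            if milestone_was_removed(issue["number"]):
                continue  # respect a deliberate removal
            requests.patch(f"{API}/repos/{REPO}/issues/{issue['number']}",
                           headers=HEADERS,
                           json={"milestone": milestone_number}).raise_for_status()

Because each run re-derives its work from a search rather than from delivered events, a missed workflow trigger or runner outage only delays milestone assignment until the next scheduled run.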
Re: [Discuss] Idea to increase RC voting participation
Agree. As long as we are getting enough of them, then our records as well as any automation depending on it are fine. One easy and standard way to make it more resilient would be to make it idempotent instead of counting on uptime or receiving any particular event.

Kenn

On Tue, Oct 24, 2023 at 2:58 PM Danny McCormick wrote:

> Looks like for some reason the workflow didn't trigger. This is running on GitHub's hosted runners, so my best guess is an outage.
>
> Looking at a more refined query, this year there have been 14 issues that were missed by the automation (3 had their milestone manually removed) - https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01 - out of 605 total - https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01+ - as best I can tell there were a small number of workflow flakes and then GHA didn't correctly trigger a few.
>
> If we wanted, we could set up some recurring automation to go through and try to pick up the ones without milestones (or modify our existing automation to be more tolerant to failures), but it doesn't seem super urgent to me (feel free to disagree). I don't think this piece needs to be perfect.
>
> On Tue, Oct 24, 2023 at 2:40 PM Kenneth Knowles wrote:
>
>> Just grabbing one at random for an example, https://github.com/apache/beam/issues/28635 seems like it was closed as completed but not tagged.
>>
>> I'm happy to see that the bot reads the version from the repo to find the appropriate milestone, rather than using the nearest open one. Just recording that for the thread since I first read the description as the latter.
>>
>> Kenn
>>
>> On Tue, Oct 24, 2023 at 2:34 PM Danny McCormick via dev <dev@beam.apache.org> wrote:
>>
>>> We do tag issues to milestones when the issue is marked as "completed" (as opposed to "not planned") - https://github.com/apache/beam/blob/master/.github/workflows/assign_milestone.yml. So I think using issues is probably about as accurate as using commits.
>>>
>>>> It looks like we have 820 with no milestone https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>>>
>>> Most predate the automation, though maybe not all? Some of those may have been closed as "not planned".
>>>
>>>> This could (should) be automatically discoverable. A (closed) issue is associated with commits which are associated with a release.
>>>
>>> Today, we just tag issues to the upcoming milestone when they're closed. In theory you could do something more sophisticated using linked commits, but in practice people aren't clean enough about linking commits to issues. Again, this is fixable by automation/enforcement, but I don't think it actually gives us much value beyond what we have today.
>>>
>>> On Tue, Oct 24, 2023 at 1:54 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>>>
>>>> On Tue, Oct 24, 2023 at 10:35 AM Kenneth Knowles wrote:
>>>>
>>>>> Tangentially related:
>>>>>
>>>>> Long ago, attaching an issue to a release was a mandatory step as part of closing. Now I think it is not. Is it automatically happening? It looks like we have 820 with no milestone https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>>>>
>>>> This could (should) be automatically discoverable. A (closed) issue is associated with commits which are associated with a release.
>>>>
>>>> On Tue, Oct 24, 2023 at 1:25 PM Chamikara Jayalath via dev <dev@beam.apache.org> wrote:
>>>>
>>>>> +1 for going by the commits since this is what matters at the end of the day. Also, many issues may not get tagged correctly for a given release due to either the contributor not tagging the issue or due to commits for the issue spanning multiple Beam releases.
>>>>>
>>>>> For example,
>>>>>
>>>>> For all commits in a given release RC:
>>>>> * If we find a Github issue for the commit: add a notice to the Github issue
>>>>> * Else: add the notice to a generic issue for the release including tags for the commit ID, PR author, and the committer who merged the PR.
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> On Mon, Oct 23, 2023 at 11:49 AM Danny McCormick via dev <dev@beam.apache.org> wrote:
>>>>>
>>>>>> I'd probably vote to include both the issue filer and the contributor. It is pretty equally straightforward - one way to do this would be using all issues related to that release's milestone and extracting the issue author and the issue closer.
>>>>>>
>>>>>> This does leave out the (unfortunately sizable) set of contributions that don't have an associated issue; if we're worried about that, we could always fall back to anyone with a commit in the last release who doesn't have an associated issue.
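Chamikara's per-commit notification idea above can also be sketched. The following is a rough, assumption-laden sketch (not code from the thread): it walks the commits new in a release RC via git log, comments on a linked GitHub issue when one can be found, and otherwise records the commit on a generic per-release issue. Commit-to-issue linkage is approximated here by scanning the commit subject for a "#NNNN" reference (in practice that usually names a PR, so a real tool would resolve the PR to its author, merger, and closing issue); the tag names and the generic issue number are hypothetical.

    import os
    import re
    import subprocess
    import requests

    API = "https://api.github.com"
    REPO = "apache/beam"
    HEADERS = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }
    ISSUE_REF = re.compile(r"#(\d+)")

    def commits_between(prev_tag: str, rc_tag: str):
        """Yield (sha, subject) for each commit new in the RC, via git log."""
        out = subprocess.check_output(
            ["git", "log", "--format=%H %s", f"{prev_tag}..{rc_tag}"], text=True)
        for line in out.splitlines():
            sha, _, subject = line.partition(" ")
            yield sha, subject

    def comment(issue_number: int, body: str) -> None:
        requests.post(f"{API}/repos/{REPO}/issues/{issue_number}/comments",
                      headers=HEADERS, json={"body": body}).raise_for_status()

    def notify_release(prev_tag: str, rc_tag: str, generic_issue: int) -> None:
        # e.g. notify_release("v2.50.0", "v2.51.0-RC1", 12345) -- all hypothetical
        for sha, subject in commits_between(prev_tag, rc_tag):
            ref = ISSUE_REF.search(subject)
            if ref:
                comment(int(ref.group(1)),
                        f"This change is included in {rc_tag}; please help "
                        f"validate the release candidate.")
            else:
                comment(generic_issue,
                        f"Commit {sha[:10]} ({subject}) has no linked issue; "
                        f"please help validate it in {rc_tag}.")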
Beam High Priority Issue Report (46)
This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/29099 [Bug]: FnAPI Java SDK Harness doesn't update user counters in OnTimer callback functions
https://github.com/apache/beam/issues/29076 [Failing Test]: Python ARM PostCommit failing after #28385
https://github.com/apache/beam/issues/29022 [Failing Test]: Python Github actions tests are failing due to update of pip
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28703 [Failing Test]: Building a wheel for integration tests sometimes times out
https://github.com/apache/beam/issues/28383 [Failing Test]: org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing "beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not working when using CreateDisposition.CREATE_IF_NEEDED
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. PeriodicImpulse) running in Flink and polling using tracker.defer_remainder have checkpoint size growing indefinitely
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26981 [Bug]: Getting an error related to SchemaCoder after upgrading to 2.48
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944 beam_PreCommit_Python_Cron regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: beam_PostCommit_Java_ValidatesRunner_Flink is flakes in org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit test action StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial (order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit data at GC time
https://github.com/apache/beam/issues/21121 apache_beam.examples.s