PR 17274 - beam on non lts jvm

2022-06-28 Thread Clément Guillaume
Hi all

Could I get a review on https://github.com/apache/beam/pull/17274?
It's about running the direct runner (or the "Java client") on non-LTS JVMs.

Thank you.


Re: Questions regarding contribution: Support for reading Kafka topics from any startReadTime in Java

2022-06-28 Thread Balázs Németh
Do you mean whether "startReadTime" as a configuration on its own already
shifts the watermark, even if we receive an older entry first? That I do not
know, but based on the code I would say it does not. So if I understand
correctly: if it did, my original idea would work fine, and we only have to
think further if that does not happen?

For SDF:
@GetInitialWatermarkEstimatorState is the one that has to consider
startReadTime(), similarly to
https://github.com/apache/beam/blob/25ca4f0fddd011bfc593e72eea6d32c040808b29/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/changestreams/dofn/ReadChangeStreamPartitionDoFn.java#L103-L106
?

For legacy read:
Setting the PartitionState.lastWatermark (not only the nextOffset) in
org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.setupInitialOffset(PartitionState)
and/or modifying the return value
of org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.getWatermark() should
then be enough?
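In both cases the core of the idea is the same: the initial watermark should never start earlier than startReadTime. A minimal sketch of that clamping logic in plain Java (the class and method names here are made up for illustration; they are not Beam API):

```java
import java.time.Instant;

// Sketch only: helper name is hypothetical, not part of the Beam SDK.
class InitialWatermark {
    // Clamp the initial watermark so it never starts before startReadTime.
    static Instant initialWatermark(Instant startReadTime, Instant defaultStart) {
        if (startReadTime == null) {
            return defaultStart; // no startReadTime configured: keep the usual initial value
        }
        return startReadTime.isAfter(defaultStart) ? startReadTime : defaultStart;
    }

    public static void main(String[] args) {
        System.out.println(initialWatermark(Instant.ofEpochSecond(100), Instant.EPOCH));
    }
}
```

In the SDF case this value would be what @GetInitialWatermarkEstimatorState returns; in the legacy case it would seed PartitionState.lastWatermark.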



Kenneth Knowles  ezt írta (időpont: 2022. jún. 16., Cs,
23:57):

> These last two emails make sense to me, from both of you.
>
> Does startReadTime impact the watermark in the right way so that late data
> is dropped, or whatnot? Because even if you have a partition and you _do_
> find an offset that works, late data could come in. So if you restrict the
> time to be in the past, and you treat late data consistently across
> partitions, then it will not be observable.
>
> I think it would be good to try to come up with some way of describing
> this as an invariant like "the observed behavior of startReadTime is the
> same even when X Y Z are different".
>
> Kenn
>
> On Wed, Jun 15, 2022 at 3:43 PM Balázs Németh 
> wrote:
>
>> Well, a sanity check to only allow startReadTime in the past seemed
>> obvious to me. I can't really think of any common use case where you want
>> to start processing after a specific time in the future and just idle until
>> then, but the other direction has multiple.
>>
>> The question here, IMO, is how much of a "safety delay" from "now" - if
>> any - we have to keep if we were unable to find any newer offset. A message
>> older than startReadTime might still be in flight and not yet available in
>> the Kafka cluster when we look up the offset. Either we skip any such check
>> and let developers handle this themselves, or we need a filter that drops
>> messages with timestamps before startReadTime.
>>
>> Daniel Collins  ezt írta (időpont: 2022. jún. 16.,
>> Cs, 0:19):
>>
>>> This may or may not be reasonable. Let's assume I pick startReadTime =
>>> now + 1 minute. Your logic will start returning a lot of records with
>>> values == now + 1ms. Is that reasonable for users of this API? Maybe, maybe
>>> not.
>>>
>>> -Daniel
>>>
>>> On Wed, Jun 15, 2022 at 6:16 PM Balázs Németh 
>>> wrote:
>>>
 Not a single objection means my idea seems okay then? :)

 Balázs Németh  ezt írta (időpont: 2022. máj.
 26., Cs, 5:59):

> https://issues.apache.org/jira/browse/BEAM-14518
>
>
> https://github.com/apache/beam/blob/fd8546355523f67eaddc22249606fdb982fe4938/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ConsumerSpEL.java#L180-L198
>
> Right now the 'startReadTime' config for KafkaIO.Read looks up an
> offset in every topic partition that is newer than or equal to that
> timestamp. The problem is that if we use a timestamp so new that a
> partition has no newer/equal message, the code fails with an exception.
> In certain cases that makes no sense, as we could actually make it work.
>
> If we don't get an offset from calling `consumer.offsetsForTimes`, we
> should call `endOffsets` and use the returned offset + 1. That is actually
> the offset we will have to read next time.
>
> Even if `endOffsets` can't return an offset we could use 0 as the
> offset to read from.
>
>
>
> Am I missing something here? Is it okay to contribute this?
>
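The fallback chain proposed in that last message could be sketched in plain Java as below (the class and method names are hypothetical, purely for illustration; real code would feed in the results of consumer.offsetsForTimes and consumer.endOffsets on the Kafka consumer):

```java
// Sketch of the proposed fallback; names are made up, not KafkaIO code.
class StartOffsetFallback {
    // offsetForTime: offset from consumer.offsetsForTimes for the partition (null if none found).
    // endOffset: offset from consumer.endOffsets for the partition (null if unavailable).
    static long resolve(Long offsetForTime, Long endOffset) {
        if (offsetForTime != null) {
            return offsetForTime; // a message at/after startReadTime exists
        }
        if (endOffset != null) {
            return endOffset + 1; // as proposed in the thread: the next offset to read
        }
        return 0L; // empty/unknown partition: start from the beginning
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, 42L)); // prints 43
    }
}
```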



Re: Update PR description template

2022-06-28 Thread Kenneth Knowles
Thanks for all the super useful info. I can add some experience from the
early days of Beam: Because the GitHub "merge" button did not exist, we
rolled our own "merge bot". The pros/cons below are not general to the
concept, but to our experience with it:

PROs:
 - It single-threaded commits to the repo so there weren't race conditions
on test results. I think this will be OK for our scale for the foreseeable
future.
 - It re-ran tests after merge but before pushing to master, which is the
second half of eliminating that race condition.

CONs:
 - It was very flaky and often just didn't stay up. We danced for joy when
it was gone.
 - It ran a kind of arbitrary set of tests that didn't match the PR
statuses. It did not have any filter on which tests were run during merge.
We were small enough that mostly just "run the tests" was specific enough.
 - It always squashed the commits and then pushed the squashed commit with
a comment "this closes #". This messes up PRs that have multiple
commits that should remain separate. But more importantly it made it
impossible to easily distinguish PRs that were merged versus those that
were closed without merge. And of course it is way harder to navigate
history when the commits on master have different hashes than the commits
that were authored.
 - Getting reasonable logs with error messages when a merge failed for
whatever reason was hard or impossible IIRC.

So I think in what you have said there is an option to get the best of all
worlds, something like:

 - Do merges through an action. Merge commit or squash-and-merge would be
separate labels. Or not bother with squash and merge, instead using
heuristics to block PRs with bad commit histories.
 - Have the workflow check that all PR statuses are green before continuing
to merge.

Kenn
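A label-triggered merge workflow along those lines could be sketched roughly as follows (a hypothetical sketch only: the workflow name, the "ready to merge" label, and the squash choice are assumptions from this thread, not existing Beam configuration):

```yaml
# Hypothetical sketch, not real Beam config.
name: merge-on-label
on:
  pull_request_target:
    types: [labeled]

jobs:
  merge:
    if: github.event.label.name == 'ready to merge'
    runs-on: ubuntu-latest
    steps:
      - name: Verify all PR checks are green, then merge
        uses: actions/github-script@v6
        with:
          script: |
            const { owner, repo } = context.repo;
            const ref = context.payload.pull_request.head.sha;
            // Look at every check run on the PR's head commit.
            const checks = await github.rest.checks.listForRef({ owner, repo, ref });
            const bad = checks.data.check_runs.filter(
              c => c.conclusion !== 'success' && c.conclusion !== 'neutral');
            if (bad.length > 0) {
              throw new Error('Not all checks are green: ' + bad.map(c => c.name).join(', '));
            }
            await github.rest.pulls.merge({
              owner, repo,
              pull_number: context.payload.pull_request.number,
              merge_method: 'squash',
            });
```

A sketch like this would also be the natural place for the override logic Robert suggested (checking the wording of the last comment) before merging.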

On Tue, Jun 28, 2022 at 8:03 AM Danny McCormick 
wrote:

> After looking into this a little bit more, I need to revise my opinion on
> how we would do this; I don't think it's practical to have all required
> status checks via the .asf.yml file because those required checks can't be
> filtered by path (for example, if we want to require Python precommit on
> Python PRs, it would need to be required on *all* PRs). That's a GitHub
> limitation, not an ASF one.
>
> One option is to write an action that makes sure no checks are failing
> (except maybe codecov?) and put a single required check on that. That would
> also make it easy to build in logic to override required checks like Robert
> suggested ("specific wording would have to be in the last comment"). We
> already have logic in the PR bot that does some of this.
>
> The downside to that approach is that it's not clear what the best way to
> trigger that workflow is since it has to run after all other checks have
> completed. We could have it trigger on some label (e.g. "ready to merge")
> and then automatically merge the PR when it's done or comment and remove
> the label if checks are failing/incomplete. This changes the workflow for
> committers from "click the merge button" to "add a label", but doesn't
> require significantly more action or oversight and is pretty similar to how
> Kubernetes and
> some other large repos run things.
>
> Another option would be to trigger that check at the end of every Actions
> run and use the check_suite trigger for external runs (unfortunately
> actions doesn't trigger that). That would require a lot of boilerplate on
> every Actions workflow though. There are other similar options which may be
> cleaner that also require modifying every Actions workflow, but I don't
> love that option.
>
> I'm still in favor of doing this via the "ready to merge" label option I
> just described, but this has dampened my enthusiasm a little bit since we'd
> need to build out/maintain our own tooling.
>
> On Tue, Jun 28, 2022 at 8:49 AM Danny McCormick 
> wrote:
>
>> Regarding code cov - it does check overall % coverage as well as the
>> coverage of the diff, but I don't think it's designed to be a blocking
>> metric. A good example of where this is not helpful is this PR to remove
>> dead code. Because it
>> removes tested code, the pr lowers our coverage percentage and fails the
>> check, but it's a totally reasonable PR that doesn't need additional
>> testing (because it only removes functionality). Other classes of changes
>> that are problematic include things like improving hard to cover error
>> messages or tough to test proto changes which can only be covered in an
>> integration test that doesn't make it into codecov (these are both common
>> at 

Re: Beam Go 2.40 Blog Post

2022-06-28 Thread Robert Burke
As the Beam Go Busybody, I am all for this :D, though I should probably
re-read the post before the week is out.

On Tue, Jun 28, 2022, 8:49 AM Danny McCormick 
wrote:

> Hey everyone, 2.40 was a significant release for the Go SDK, with lots of
> big streaming features added (Process Continuation, Watermark Estimation,
> Pipeline Drain/Truncation, Bundle Finalization) as well as the addition of
> generic registration functions (which drastically improve performance). I'd
> like to add a blog post highlighting some of those changes and wanted to
> make sure everyone had a chance to review it if desired since it's very
> public facing. Here's the PR with the post -
> https://github.com/apache/beam/pull/17723 - if there are no objections by
> next Monday I'll merge it.
>
> Thanks,
> Danny
>


Beam High Priority Issue Report

2022-06-28 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/22073 [Bug]: Golang SDK doesn't work with 
bigquery.NullDate
https://github.com/apache/beam/issues/22041 What is the version of 
beam-sdks-java-core which support JDK 17
https://github.com/apache/beam/issues/22040 What is the version of 
beam-runners-flink-1.11 which support JDK 17
https://github.com/apache/beam/issues/21946 [Bug]: No way to read or write to 
file when running Beam in Flink
https://github.com/apache/beam/issues/21935 [Bug]: Reject illformed GBK Coders
https://github.com/apache/beam/issues/21897 [Feature Request]: Flink runner 
savepoint backward compatibility 
https://github.com/apache/beam/issues/21893 [Bug]: BigQuery Storage Write API 
implementation does not support table partitioning
https://github.com/apache/beam/issues/21794 Dataflow runner creates a new timer 
whenever the output timestamp is change
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output 
to Failed Inserts PCollection
https://github.com/apache/beam/issues/21704 beam_PostCommit_Java_DataflowV2 
failures parent bug
https://github.com/apache/beam/issues/21703 pubsublite.ReadWriteIT failing in 
beam_PostCommit_Java_DataflowV1 and V2
https://github.com/apache/beam/issues/21702 SpannerWriteIT failing in beam 
PostCommit Java V1
https://github.com/apache/beam/issues/21701 beam_PostCommit_Java_DataflowV1 
failing with a variety of flakes and errors
https://github.com/apache/beam/issues/21700 
--dataflowServiceOptions=use_runner_v2 is broken
https://github.com/apache/beam/issues/21698 Docker Snapshots failing to be 
published since April 14th
https://github.com/apache/beam/issues/21696 Flink Tests failure :  
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.beam.runners.core.construction.SerializablePipelineOptions 
https://github.com/apache/beam/issues/21695 DataflowPipelineResult does not 
raise exception for unsuccessful states.
https://github.com/apache/beam/issues/21694 BigQuery Storage API insert with 
writeResult retry and write to error table
https://github.com/apache/beam/issues/21480 flake: 
FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
https://github.com/apache/beam/issues/21477 Add integration testing for BQ 
Storage API  write modes
https://github.com/apache/beam/issues/21475 Beam x-lang Dataflow tests failing 
due to _InactiveRpcError
https://github.com/apache/beam/issues/21474 Flaky tests: Gradle build daemon 
disappeared unexpectedly
https://github.com/apache/beam/issues/21473 PVR_Spark2_Streaming perma-red
https://github.com/apache/beam/issues/21472 Dataflow streaming tests failing 
new AfterSynchronizedProcessingTime test
https://github.com/apache/beam/issues/21471 Flakes: Failed to load cache entry
https://github.com/apache/beam/issues/21470 Test flake: test_split_half_sdf
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21468 
beam_PostCommit_Python_Examples_Dataflow failing
https://github.com/apache/beam/issues/21467 GBK and CoGBK streaming Java load 
tests failing
https://github.com/apache/beam/issues/21465 Kafka commit offset drop data on 
failure for runners that have non-checkpointing shuffle
https://github.com/apache/beam/issues/21463 NPE in Flink Portable 
ValidatesRunner streaming suite
https://github.com/apache/beam/issues/21462 Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use
https://github.com/apache/beam/issues/21271 pubsublite.ReadWriteIT flaky in 
beam_PostCommit_Java_DataflowV2  
https://github.com/apache/beam/issues/21270 
org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView
 flaky on Dataflow Runner V2
https://github.com/apache/beam/issues/21268 Race between member variable being 
accessed due to leaking uninitialized state via OutboundObserverFactory
https://github.com/apache/beam/issues/21267 WriteToBigQuery submits a duplicate 
BQ load job if a 503 error code is returned from googleapi
https://github.com/apache/beam/issues/21266 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
 is flaky in Java ValidatesRunner Flink suite.
https://github.com/apache/beam/issues/21265 
apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally
 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible
https://github.com/apache/beam/issues/21264 beam_PostCommit_Python36 - 
CrossLanguageSpannerIOTest - flakey failing
https://github.com/apache/beam/issues/21263 (Broken Pipe induced) Bricked 
Dataflow Pipeline 
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
fol