[Release] 2.41.0 release update

2022-07-27 Thread Kiley Sok via dev
Hi all,

I've cut the release branch:
https://github.com/apache/beam/tree/release-2.41.0

There's one known issue that needs to be cherry-picked. Please let me know
if you have a change that needs to go in.

Thanks,
Kiley


Re: BigTable reader for Python?

2022-07-27 Thread Chamikara Jayalath via dev
On Wed, Jul 27, 2022 at 1:39 PM Chamikara Jayalath 
wrote:

>
>
> On Wed, Jul 27, 2022 at 11:10 AM Lina Mårtensson 
> wrote:
>
>> Thanks Cham!
>>
>> Could you provide some more detail on your preference for developing a
>> Python wrapper rather than implementing a source purely in Python?
>>
>
> I've mentioned the main advantages of developing a cross-language
> transform over natively implementing this in Python below.
>
> * Reduced cost of development
>
> It's much easier to develop a cross-language wrapper of the Java source
> than to re-implement the source in Python. Sources are some of the most
> complex code we have in Beam, and they control the parallelization of the
> pipeline (for example, splitting and dynamic work rebalancing for supported
> runners), so getting this code wrong can result in hard-to-track data
> loss/duplication issues.
> Additionally, in my experience it's very hard to get a source
> implementation correct and performant on the first try. It can take
> additional benchmarks and user feedback over time to get a source
> production-ready.
> The Java BT source is already well battle-tested (in fact, we currently
> have two Java implementations [1][2]), so I would rather use the Java BT
> connector as a cross-language transform than re-implement the source for
> other SDKs.
>
> * Minimal maintenance cost
>
> Developing a source/sink is just part of the story. We (as a community)
> have to maintain it over time and make sure that ongoing issues and feature
> requests are adequately handled. In the past, we have had cases where
> sources/sinks are available for multiple SDKs but one is significantly
> better than the others when it comes to the feature set (for example,
> BigQuery). Cross-language makes this easier and allows us to maintain the
> key logic in a single place.
>

Also, a shameless plug for my Beam Summit video on the subject :) -
https://www.youtube.com/watch?v=bt5DMP9Cwz0


>
>
>>
>> If I look at the instructions for using the x-language Spanner
>> connector, then using this - from the user's perspective - would
>> involve installing a Java runtime.
>> That's not terrible, but I fear that getting this to work with bazel
>> might end up being more trouble than expected. (That has often
>> happened here, and we have enough trouble with getting Python 3.9 and
>> 3.10 to co-exist.)
>>
>
> From an end user perspective, all they should have to do is make sure that
> Java is available on the machine the job is submitted from. Beam can
> automatically start the cross-language expansion service (which is needed
> during job submission), so users should not have to do anything other than
> that.
>
> At job execution, Beam (portable) uses Docker-based SDK harness containers
> and we already release appropriate containers for each SDK. The runners
> should seamlessly download containers needed to execute the job.
>
> That said, the main downside of cross-language today is runner support.
> Cross-language transform support is only available for portable Beam
> runners (for example, Dataflow Runner v2) but this is the direction Beam
> runners are going anyway.
>
>
>>
>> There are a few of us at our small start-up that have written
>> MapReduces and similar in the past and are completely convinced by the
>> Beam/Dataflow model. But many others have no previous experience and
>> are skeptical, and see this new tool we're introducing as something
>> that's more trouble than it's worth, and something they'd rather avoid
>> - even when we see how lots of their use cases could be made much
>> easier using Beam. I'm worried that every extra hoop to jump through
>> will make it less likely to be widely used for us. Because of that, my
>> bias would be towards having a Python connector rather than
>> x-language, and I would find it really helpful to learn about why you
>> both favor the x-language option.
>>
>
> I understand your concerns. It's certainly possible to develop the same
> connector in multiple SDKs (and we provide SDF source framework support in
> all SDK languages). But hopefully my comments above will give you an idea
> of the downsides of this approach :).
>
> Thanks,
> Cham
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java
> [2] https://cloud.google.com/bigtable/docs/hbase-dataflow-java
>
>
>>
>> Thanks!
>> -Lina
>>
>> On Tue, Jul 26, 2022 at 6:11 PM Chamikara Jayalath 
>> wrote:
>> >
>> >
>> >
>> > On Mon, Jul 25, 2022 at 12:53 PM Lina Mårtensson via dev <
>> dev@beam.apache.org> wrote:
>> >>
>> >> Hi dev,
>> >>
>> >> We're starting to incorporate BigTable in our stack and I've delighted
>> >> my co-workers with how easy it was to create some BigTables with
>> >> Beam... but there doesn't appear to be a reader for BigTable in
>> >> Python.
>> >>
>> >> First off, is there a good reason why not/any reason why it would be
>> 
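The splitting and dynamic work rebalancing described above boil down to a reader that can have part of its remaining range taken away mid-read. A toy sketch in plain Python of that idea (this is not Beam's actual RangeTracker/SDF API; the class and method names here are illustrative only):

```python
class ToyOffsetRangeTracker:
    """Toy tracker over offsets [start, stop) that supports dynamic splits."""

    def __init__(self, start: int, stop: int):
        self.start = start
        self.stop = stop
        self.last_claimed = None

    def try_claim(self, offset: int) -> bool:
        # A record may only be processed if its offset is still inside the
        # (possibly shrunk-by-split) range; otherwise the reader must stop.
        if offset >= self.stop:
            return False
        self.last_claimed = offset
        return True

    def try_split(self, fraction: float):
        # Hand off `fraction` of the *remaining* work as a residual range
        # that a runner could give to another worker. Off-by-one errors here
        # are exactly the kind of bug that duplicates or drops records.
        done = self.start if self.last_claimed is None else self.last_claimed + 1
        split_pos = done + int((self.stop - done) * fraction)
        if split_pos <= done or split_pos >= self.stop:
            return None  # nothing meaningful to split off
        residual = (split_pos, self.stop)
        self.stop = split_pos
        return residual
```

A correct implementation must keep the primary and residual ranges disjoint and exhaustive, which is why this logic is hard to get right on the first try.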

Re: BigTable reader for Python?

2022-07-27 Thread Lina Mårtensson via dev
Thanks Cham!

Could you provide some more detail on your preference for developing a
Python wrapper rather than implementing a source purely in Python?

If I look at the instructions for using the x-language Spanner
connector, then using this - from the user's perspective - would
involve installing a Java runtime.
That's not terrible, but I fear that getting this to work with bazel
might end up being more trouble than expected. (That has often
happened here, and we have enough trouble with getting Python 3.9 and
3.10 to co-exist.)

There are a few of us at our small start-up that have written
MapReduces and similar in the past and are completely convinced by the
Beam/Dataflow model. But many others have no previous experience and
are skeptical, and see this new tool we're introducing as something
that's more trouble than it's worth, and something they'd rather avoid
- even when we see how lots of their use cases could be made much
easier using Beam. I'm worried that every extra hoop to jump through
will make it less likely to be widely used for us. Because of that, my
bias would be towards having a Python connector rather than
x-language, and I would find it really helpful to learn about why you
both favor the x-language option.

Thanks!
-Lina

On Tue, Jul 26, 2022 at 6:11 PM Chamikara Jayalath  wrote:
>
>
>
> On Mon, Jul 25, 2022 at 12:53 PM Lina Mårtensson via dev 
>  wrote:
>>
>> Hi dev,
>>
>> We're starting to incorporate BigTable in our stack and I've delighted
>> my co-workers with how easy it was to create some BigTables with
>> Beam... but there doesn't appear to be a reader for BigTable in
>> Python.
>>
>> First off, is there a good reason why not/any reason why it would be 
>> difficult?
>
>
> There was a previous effort to implement a Python BT source but it was
> not completed:
> https://github.com/apache/beam/pull/11295#issuecomment-646378304
>
>>
>>
>> I could write one, but before I start, I'd love some input to make it easier.
>>
>> It appears that there would be two options: either write one in
>> Python, or try to set one up with x-language from Java which I see is
>> done e.g. with the Spanner IO Connector.
>> Any recommendation on which one to pick or potential pitfalls in either 
>> choice?
>>
>> If I write one in Python, what should I think about?
>> It is not obvious to me how to achieve parallelization, so any tips
>> here would be welcome.
>
>
> I would strongly prefer developing a Python wrapper for the existing Java BT 
> source using Beam's Multi-language Pipelines framework over developing a new 
> Python source.
> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>
> Thanks,
> Cham
>
>
>>
>>
>> Thanks!
>> -Lina


Re: Beam gRPC dependency tracing

2022-07-27 Thread Kenneth Knowles
The vendored gRPC is built by transforming the released gRPC jar. Here is
where in the Beam git history you can find the source for the
transformation:
https://github.com/apache/beam/tree/40293eb52ca914acbbbae51e4b24fa280f2b44f0/vendor/grpc-1_26_0

Kenn

On Wed, Jul 27, 2022 at 9:24 AM JDW J  wrote:

> Team,
>
> Consider me a newbie to Beam and Java world in general. I am trying to
> trace Beam vendor dependency to gRPC-upstream.
>
>   <modelVersion>4.0.0</modelVersion>
>   <groupId>org.apache.beam</groupId>
>   <artifactId>beam-vendor-grpc-1_26_0</artifactId>
>   <version>0.3</version>
>   <name>Apache Beam :: Vendored Dependencies :: gRPC :: 1.26.0</name>
>   <url>http://beam.apache.org</url>
>
>
> How can I tell what is the exact upstream repo for "Apache Beam ::
> Vendored Dependencies :: gRPC :: 1.26.0" ?
>
> -joji
>
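As the POM above shows, the vendored artifact's own version (0.3) is Beam's, while the upstream gRPC version is encoded in the artifactId suffix with underscores in place of dots. A quick shell sketch of recovering it:

```shell
# Recover the upstream gRPC version from a Beam vendored artifactId:
# the suffix after "beam-vendor-grpc-" is the upstream version with
# dots replaced by underscores (1_26_0 -> 1.26.0).
artifact_id="beam-vendor-grpc-1_26_0"
upstream_version=$(printf '%s\n' "$artifact_id" \
  | sed -e 's/^beam-vendor-grpc-//' -e 's/_/./g')
echo "$upstream_version"   # prints 1.26.0
```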


Re: Checkpoints timing out upgrading from Beam version 2.29 with Flink 1.12 to Beam 2.38 and Flink 1.14

2022-07-27 Thread John Casey via dev
Would it be possible to recreate the experiments to try to isolate
variables? Right now all three cases change both the Beam and Flink versions.



On Tue, Jul 26, 2022 at 11:35 PM Kenneth Knowles  wrote:

> Bumping this and adding +John Casey  who knows
> about KafkaIO and unbounded sources, though probably less about the
> FlinkRunner. It seems you have isolated it to the Flink translation logic.
> I'm not sure who would be the best expert to evaluate if that logic is
> still OK.
>
> Kenn
>
> On Wed, Jun 29, 2022 at 11:07 AM Kathula, Sandeep <
> sandeep_kath...@intuit.com> wrote:
>
>> Hi,
>>
>>We have a stateless application which
>>
>>
>>
>>1. Reads from kafka
>>2. Doing some stateless transformations by reading from in memory
>>databases and updating the records
>>3. Writing back to Kafka.
>>
>>
>>
>>
>>
>>
>>
>> *With Beam 2.23 and Flink 1.9, we are seeing checkpoints are working fine
>> (it takes max 1 min).*
>>
>>
>>
>> *With Beam 2.29 and Flink 1.12, we are seeing checkpoints taking longer
>> time (it takes max 6-7 min sometimes)*
>>
>>
>>
>> *With Beam 2.38 and Flink 1.14, we are seeing checkpoints timing out
>> after 10 minutes.*
>>
>>
>>
>>
>>
>> I am checking Beam code and after some logging and analysis found the
>> problem is at
>> https://github.com/apache/beam/blob/release-2.38.0/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedSourceWrapper.java#L287-L307
>>
>>
>>
>>
>>
>> We are still using the old API to read from Kafka and not yet using KafkaIO
>> based on SplittableDoFn.
>>
>>
>>
>> There are two threads
>>
>>1. Legacy source thread reading from kafka and doing entire
>>processing.
>>2. Thread which emits watermark on timer
>>
>> https://github.com/apache/beam/blob/release-2.38.0/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedSourceWrapper.java#L454-L474
>>
>>
>>
>> Both of these code blocks are in synchronized blocks waiting on the same
>> checkpoint lock. Under heavy load, the thread reading from Kafka runs
>> forever in the while loop, so the thread emitting the watermarks waits
>> forever for the lock; no watermarks are emitted and the
>> checkpoint times out.
>>
>>
>>
>>
>>
>> Is this a known issue, and is there a solution? For now we are adding a
>> Thread.sleep(1) once every 10 seconds after the synchronized block so that
>> the thread emitting the watermarks can be unblocked and run.
>>
>>
>>
>> One of my colleagues tried to follow up on this (attaching the previous
>> email here) but we didn’t get any reply. Any help on this would be
>> appreciated.
>>
>>
>>
>> Thanks,
>>
>> Sandeep
>>
>
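The contention Sandeep describes can be modeled with two plain Python threads sharing one lock. This is a toy sketch (not the actual Beam/Flink classes) of why the periodic sleep workaround helps: releasing the lock briefly between batches gives the watermark thread a chance to acquire it.

```python
import threading
import time

def run_pipeline(reader_yields: bool, duration: float = 0.3) -> int:
    """Toy model: a 'reader' thread holds a shared checkpoint lock in a
    tight loop, while a 'watermark' thread needs the same lock to emit
    watermarks. Returns how many watermarks were emitted."""
    lock = threading.Lock()
    emitted = 0
    deadline = time.monotonic() + duration

    def reader():
        while time.monotonic() < deadline:
            with lock:
                time.sleep(0.005)  # simulate processing records under the lock
            if reader_yields:
                time.sleep(0.001)  # the Thread.sleep workaround: let others in

    def watermark():
        nonlocal emitted
        while time.monotonic() < deadline:
            if lock.acquire(timeout=0.02):
                emitted += 1  # "emit" a watermark while holding the lock
                lock.release()
            time.sleep(0.005)

    threads = [threading.Thread(target=reader),
               threading.Thread(target=watermark)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return emitted

# With the yield in place, the watermark thread reliably makes progress;
# without it, progress depends on lock-acquisition fairness and can stall.
```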


Beam High Priority Issue Report

2022-07-27 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/22463 [Bug]: Upgrade pip before 
installing Beam for Python default expansion service
https://github.com/apache/beam/issues/22454 [Bug]: Spanner ITs are failing in 
Py37 PostCommits
https://github.com/apache/beam/issues/22440 [Bug]: Python Batch Dataflow 
SideInput LoadTests failing
https://github.com/apache/beam/issues/22401 [Bug]: BigQueryIO getFailedInserts 
fails when using Storage APIs 
https://github.com/apache/beam/issues/22321 
PortableRunnerTestWithExternalEnv.test_pardo_large_input is regularly failing 
on jenkins
https://github.com/apache/beam/issues/22303 [Task]: Add tests to Kafka SDF and 
fix known and discovered issues
https://github.com/apache/beam/issues/22299 [Bug]: JDBCIO Write freeze at 
getConnection() in WriteFn
https://github.com/apache/beam/issues/22188 BigQuery Storage API sink sometimes 
gets stuck outputting to an invalid timestamp
https://github.com/apache/beam/issues/21935 [Bug]: Reject illformed GBK Coders
https://github.com/apache/beam/issues/21794 Dataflow runner creates a new timer 
whenever the output timestamp is change
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output 
to Failed Inserts PCollection
https://github.com/apache/beam/issues/21704 beam_PostCommit_Java_DataflowV2 
failures parent bug
https://github.com/apache/beam/issues/21703 pubsublite.ReadWriteIT failing in 
beam_PostCommit_Java_DataflowV1 and V2
https://github.com/apache/beam/issues/21702 SpannerWriteIT failing in beam 
PostCommit Java V1
https://github.com/apache/beam/issues/21701 beam_PostCommit_Java_DataflowV1 
failing with a variety of flakes and errors
https://github.com/apache/beam/issues/21700 
--dataflowServiceOptions=use_runner_v2 is broken
https://github.com/apache/beam/issues/21696 Flink Tests failure :  
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.beam.runners.core.construction.SerializablePipelineOptions 
https://github.com/apache/beam/issues/21695 DataflowPipelineResult does not 
raise exception for unsuccessful states.
https://github.com/apache/beam/issues/21694 BigQuery Storage API insert with 
writeResult retry and write to error table
https://github.com/apache/beam/issues/21480 flake: 
FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
https://github.com/apache/beam/issues/21472 Dataflow streaming tests failing 
new AfterSynchronizedProcessingTime test
https://github.com/apache/beam/issues/21471 Flakes: Failed to load cache entry
https://github.com/apache/beam/issues/21470 Test flake: test_split_half_sdf
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21468 
beam_PostCommit_Python_Examples_Dataflow failing
https://github.com/apache/beam/issues/21467 GBK and CoGBK streaming Java load 
tests failing
https://github.com/apache/beam/issues/21465 Kafka commit offset drop data on 
failure for runners that have non-checkpointing shuffle
https://github.com/apache/beam/issues/21463 NPE in Flink Portable 
ValidatesRunner streaming suite
https://github.com/apache/beam/issues/21462 Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use
https://github.com/apache/beam/issues/21271 pubsublite.ReadWriteIT flaky in 
beam_PostCommit_Java_DataflowV2  
https://github.com/apache/beam/issues/21270 
org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView
 flaky on Dataflow Runner V2
https://github.com/apache/beam/issues/21268 Race between member variable being 
accessed due to leaking uninitialized state via OutboundObserverFactory
https://github.com/apache/beam/issues/21267 WriteToBigQuery submits a duplicate 
BQ load job if a 503 error code is returned from googleapi
https://github.com/apache/beam/issues/21266 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
 is flaky in Java ValidatesRunner Flink suite.
https://github.com/apache/beam/issues/21265 
apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally
 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible
https://github.com/apache/beam/issues/21263 (Broken Pipe induced) Bricked 
Dataflow Pipeline 
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21261 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21257 Either Create or DirectRunner