Re: [VOTE] Vendored Dependencies Release

2022-08-05 Thread Luke Cwik via dev
+1

I verified the signatures of the artifacts, confirmed that the jar doesn't
contain classes outside of the org/apache/beam/vendor/grpc/v1p48p1 package,
and tested the artifact against our precommits using
https://github.com/apache/beam/pull/22595
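
A check like the class-location one above can be scripted. The following is only a minimal sketch, not the procedure actually used for this vote: the helper name `classes_outside_package` is hypothetical, and it relies only on the fact that a jar is a zip archive, so Python's standard `zipfile` module can list its entries.

```python
import zipfile


def classes_outside_package(jar, allowed_prefix):
    """Return the .class entries in the jar that are not under allowed_prefix.

    Hypothetical helper for illustration. `jar` may be a path or a
    file-like object, since zipfile.ZipFile accepts both.
    """
    prefix = allowed_prefix.rstrip("/") + "/"
    with zipfile.ZipFile(jar) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Only class files matter here; META-INF and similar are ignored.
            if name.endswith(".class") and not name.startswith(prefix)
        )
```

An empty list from `classes_outside_package(path_to_jar, "org/apache/beam/vendor/grpc/v1p48p1")` would mean every class sits under the vendored package; the jar path itself is whatever the staged artifact is called locally.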

On Fri, Aug 5, 2022 at 1:42 PM Luke Cwik  wrote:

> Please review the release of the following artifacts that we vendor:
>  * beam-vendor-grpc-1_48_1
>
> Hi everyone,
> Please review and vote on the release candidate #1 for the version 0.1, as
> follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
> * the official Apache source release to be deployed to dist.apache.org
> [1], which is signed with the key with fingerprint
> EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
> * all artifacts to be deployed to the Maven Central Repository [3],
> * commit hash "db8db0b6ed0fe1e4891f207f0f7f811798e54db1" [4],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Release Manager
>
> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
> [3] https://repository.apache.org/content/repositories/orgapachebeam-1277/
> [4]
> https://github.com/apache/beam/commit/db8db0b6ed0fe1e4891f207f0f7f811798e54db1
>


Re: Easy Multi-language via a SchemaTransform-aware Expansion Service

2022-08-05 Thread Byron Ellis via dev
Indeed, there's nothing stopping you from doing codegen where it's useful
but I think it's probably easier to implement codegen from dynamic than it
is to go the other way around (Avro vs Proto)

On Fri, Aug 5, 2022 at 1:15 PM Chamikara Jayalath 
wrote:

>
>
> On Fri, Aug 5, 2022 at 12:00 PM Byron Ellis  wrote:
>
>> I think there are some practical advantages to having the ability to
>> support a dynamic version---at previous places where I've worked having
>> Kafka's Schema Service was incredibly useful for data processing (it was a
>> Java/Scala shop and we mostly used a "decode to POJO" approach rather than
>> codegen.)
>>
>
> Yeah, that's my thought as well. I think it will be pretty useful during
> development/testing cycles, especially if we push code generation to the
> release time. Also, it will be useful for trying out any SchemaTransforms
> developed/released by third parties where generated stubs might not be
> available.
>
>
>>
>> On Fri, Aug 5, 2022 at 10:08 AM Chamikara Jayalath via dev <
>> dev@beam.apache.org> wrote:
>>
>>>
>>>
>>> On Fri, Aug 5, 2022 at 9:44 AM Brian Hulette 
>>> wrote:
>>>
>>>> Thanks Cham! I really like the proposal, I left a few comments. I also
>>>> had one higher-level point I wanted to elevate here:
>>>>
>>>> > Pipeline SDKs can generate user-friendly stub-APIs based on
>>>> transforms registered with an expansion service, eliminating the need to
>>>> develop language-specific wrappers.
>>>> This would be great! I think one point to consider is whether we can do
>>>> this statically. We could package up these stubs with releases and include
>>>> them in API docs for each language, making them much more discoverable.
>>>> That could be an extension on top of your proposal (e.g. as part of its
>>>> build, each SDK spins up other known expansion services and generates code
>>>> based on the discovery responses), but maybe it could be cleaner if we
>>>> don't really need the dynamic version?
>>>>
>>>
>>> So my proposal suggested two solutions for wrappers.
>>> * A higher level (dynamic) API (SchemaAwareExternalTransform) that can
>>> be used to discover/initialize/use any SchemaTransform.
>>> * Developing tooling to generate stubs for each language. This is
>>> possible since SchemaTransform gives a cleaner way to define/interpret the
>>> construction API of a transform.
>>>
>>> I think both can be useful. For example, the former might be useful to
>>> quickly test/try out new SchemaTransforms without going through code
>>> generation.
>>>
>>> Also, I agree with you that it might be good to generate such stubs (and
>>> corresponding docs) during release time instead of generating and
>>> committing stubs to the repo.
>>>
>>> Thanks,
>>> Cham
>>>
>>>

>>>> Brian
>>>>
>>>>
>>>> On Thu, Aug 4, 2022 at 6:51 PM Chamikara Jayalath via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I believe we can make the multi-language pipelines offering [1] much
>>>>> easier to use by updating the expansion service to be fully aware of
>>>>> SchemaTransforms. Additionally this will make it easy to
>>>>> register/discover/use transforms defined in one SDK from all other SDKs.
>>>>> Specifically we could add the following features.
>>>>>
>>>>>    - Expansion service can be used to easily initialize and expand
>>>>>      transforms without need for additional code.
>>>>>    - Expansion service can be used to easily discover already
>>>>>      registered transforms.
>>>>>    - Pipeline SDKs can generate user-friendly stub-APIs based on
>>>>>      transforms registered with an expansion service, eliminating the
>>>>>      need to develop language-specific wrappers.
>>>>>
>>>>> Please see here for my proposal:
>>>>> https://s.apache.org/easy-multi-language
>>>>>
>>>>> Lemme know if you have any comments/questions/suggestions :)
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> [1]
>>>>> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>>>>>
>>>>>


[VOTE] Vendored Dependencies Release

2022-08-05 Thread Luke Cwik via dev
Please review the release of the following artifacts that we vendor:
 * beam-vendor-grpc-1_48_1

Hi everyone,
Please review and vote on the release candidate #1 for the version 0.1, as
follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* the official Apache source release to be deployed to dist.apache.org [1],
which is signed with the key with fingerprint
EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* commit hash "db8db0b6ed0fe1e4891f207f0f7f811798e54db1" [4],

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Release Manager

[1] https://dist.apache.org/repos/dist/dev/beam/vendor/
[2] https://dist.apache.org/repos/dist/release/beam/KEYS
[3] https://repository.apache.org/content/repositories/orgapachebeam-1277/
[4]
https://github.com/apache/beam/commit/db8db0b6ed0fe1e4891f207f0f7f811798e54db1
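
The adoption rule stated above (majority approval with at least 3 affirmative PMC votes) can be written as a small predicate. This is a sketch under assumptions: it counts the majority over binding (PMC) votes only, which is one reading of the rule rather than something the email spells out, it does not model the 72-hour minimum, and the `Vote` type and `vote_passes` name are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    value: int    # +1 to approve, -1 to reject
    is_pmc: bool  # True if the voter is a PMC member (binding vote)


def vote_passes(votes, min_pmc_approvals=3):
    """Return True if the binding votes satisfy the stated adoption rule.

    Assumption: only PMC votes are counted toward the majority.
    """
    binding = [v for v in votes if v.is_pmc]
    approvals = sum(1 for v in binding if v.value > 0)
    rejections = sum(1 for v in binding if v.value < 0)
    return approvals >= min_pmc_approvals and approvals > rejections
```

Under this reading, three PMC +1 votes with no PMC -1 votes are enough, while any number of non-PMC +1 votes cannot substitute for a missing PMC approval.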


Re: Easy Multi-language via a SchemaTransform-aware Expansion Service

2022-08-05 Thread Chamikara Jayalath via dev
On Fri, Aug 5, 2022 at 12:00 PM Byron Ellis  wrote:

> I think there are some practical advantages to having the ability to
> support a dynamic version---at previous places where I've worked having
> Kafka's Schema Service was incredibly useful for data processing (it was a
> Java/Scala shop and we mostly used a "decode to POJO" approach rather than
> codegen.)
>

Yeah, that's my thought as well. I think it will be pretty useful during
development/testing cycles, especially if we push code generation to the
release time. Also, it will be useful for trying out any SchemaTransforms
developed/released by third parties where generated stubs might not be
available.


>
> On Fri, Aug 5, 2022 at 10:08 AM Chamikara Jayalath via dev <
> dev@beam.apache.org> wrote:
>
>>
>>
>> On Fri, Aug 5, 2022 at 9:44 AM Brian Hulette  wrote:
>>
>>> Thanks Cham! I really like the proposal, I left a few comments. I also
>>> had one higher-level point I wanted to elevate here:
>>>
>>> > Pipeline SDKs can generate user-friendly stub-APIs based on transforms
>>> registered with an expansion service, eliminating the need to develop
>>> language-specific wrappers.
>>> This would be great! I think one point to consider is whether we can do
>>> this statically. We could package up these stubs with releases and include
>>> them in API docs for each language, making them much more discoverable.
>>> That could be an extension on top of your proposal (e.g. as part of its
>>> build, each SDK spins up other known expansion services and generates code
>>> based on the discovery responses), but maybe it could be cleaner if we
>>> don't really need the dynamic version?
>>>
>>
>> So my proposal suggested two solutions for wrappers.
>> * A higher level (dynamic) API (SchemaAwareExternalTransform) that can be
>> used to discover/initialize/use any SchemaTransform.
>> * Developing tooling to generate stubs for each language. This is
>> possible since SchemaTransform gives a cleaner way to define/interpret the
>> construction API of a transform.
>>
>> I think both can be useful. For example, the former might be useful to
>> quickly test/try out new SchemaTransforms without going through code
>> generation.
>>
>> Also, I agree with you that it might be good to generate such stubs (and
>> corresponding docs) during release time instead of generating and
>> committing stubs to the repo.
>>
>> Thanks,
>> Cham
>>
>>
>>>
>>> Brian
>>>
>>>
>>> On Thu, Aug 4, 2022 at 6:51 PM Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I believe we can make the multi-language pipelines offering [1] much
>>>> easier to use by updating the expansion service to be fully aware of
>>>> SchemaTransforms. Additionally this will make it easy to
>>>> register/discover/use transforms defined in one SDK from all other SDKs.
>>>> Specifically we could add the following features.
>>>>
>>>>    - Expansion service can be used to easily initialize and expand
>>>>      transforms without need for additional code.
>>>>    - Expansion service can be used to easily discover already
>>>>      registered transforms.
>>>>    - Pipeline SDKs can generate user-friendly stub-APIs based on
>>>>      transforms registered with an expansion service, eliminating the
>>>>      need to develop language-specific wrappers.
>>>>
>>>> Please see here for my proposal:
>>>> https://s.apache.org/easy-multi-language
>>>>
>>>> Lemme know if you have any comments/questions/suggestions :)
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>> [1]
>>>> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>>>>



Re: Easy Multi-language via a SchemaTransform-aware Expansion Service

2022-08-05 Thread Byron Ellis via dev
I think there are some practical advantages to having the ability to
support a dynamic version---at previous places where I've worked having
Kafka's Schema Service was incredibly useful for data processing (it was a
Java/Scala shop and we mostly used a "decode to POJO" approach rather than
codegen.)
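
To make the "decode to POJO" idea concrete, here is a rough Python analogue: the record class is built at runtime from a schema instead of being generated ahead of time. Everything in it is an assumption for illustration. `USER_SCHEMA` is hard-coded where a real system would fetch the schema from a schema service, `decode` is a made-up helper, and a dataclass stands in loosely for a POJO. It does not use any Kafka or schema-registry API.

```python
import json
from dataclasses import make_dataclass

# Hypothetical runtime schema; in practice this would be looked up from a
# schema service rather than hard-coded.
USER_SCHEMA = {"name": "User", "fields": [("id", int), ("email", str)]}


def decode(schema, payload):
    """Decode a JSON payload into a plain object whose class is built at runtime."""
    cls = make_dataclass(schema["name"], schema["fields"])
    raw = json.loads(payload)
    # Coerce each value to the type named in the schema; a missing field
    # raises KeyError.
    return cls(**{name: typ(raw[name]) for name, typ in schema["fields"]})
```

The trade-off is the one discussed in this thread: nothing has to be regenerated when a schema changes, but field names and types are only checked when a record is decoded.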

On Fri, Aug 5, 2022 at 10:08 AM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

>
>
> On Fri, Aug 5, 2022 at 9:44 AM Brian Hulette  wrote:
>
>> Thanks Cham! I really like the proposal, I left a few comments. I also
>> had one higher-level point I wanted to elevate here:
>>
>> > Pipeline SDKs can generate user-friendly stub-APIs based on transforms
>> registered with an expansion service, eliminating the need to develop
>> language-specific wrappers.
>> This would be great! I think one point to consider is whether we can do
>> this statically. We could package up these stubs with releases and include
>> them in API docs for each language, making them much more discoverable.
>> That could be an extension on top of your proposal (e.g. as part of its
>> build, each SDK spins up other known expansion services and generates code
>> based on the discovery responses), but maybe it could be cleaner if we
>> don't really need the dynamic version?
>>
>
> So my proposal suggested two solutions for wrappers.
> * A higher level (dynamic) API (SchemaAwareExternalTransform) that can be
> used to discover/initialize/use any SchemaTransform.
> * Developing tooling to generate stubs for each language. This is possible
> since SchemaTransform gives a cleaner way to define/interpret the
> construction API of a transform.
>
> I think both can be useful. For example, the former might be useful to
> quickly test/try out new SchemaTransforms without going through code
> generation.
>
> Also, I agree with you that it might be good to generate such stubs (and
> corresponding docs) during release time instead of generating and
> committing stubs to the repo.
>
> Thanks,
> Cham
>
>
>>
>> Brian
>>
>>
>> On Thu, Aug 4, 2022 at 6:51 PM Chamikara Jayalath via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi All,
>>>
>>> I believe we can make the multi-language pipelines offering [1] much
>>> easier to use by updating the expansion service to be fully aware of
>>> SchemaTransforms. Additionally this will make it easy to
>>> register/discover/use transforms defined in one SDK from all other SDKs.
>>> Specifically we could add the following features.
>>>
>>>    - Expansion service can be used to easily initialize and expand
>>>      transforms without need for additional code.
>>>    - Expansion service can be used to easily discover already
>>>      registered transforms.
>>>    - Pipeline SDKs can generate user-friendly stub-APIs based on
>>>      transforms registered with an expansion service, eliminating the
>>>      need to develop language-specific wrappers.
>>>
>>> Please see here for my proposal:
>>> https://s.apache.org/easy-multi-language
>>>
>>> Lemme know if you have any comments/questions/suggestions :)
>>>
>>> Thanks,
>>> Cham
>>>
>>> [1]
>>> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>>>
>>>


Re: [Release] 2.41.0 release update

2022-08-05 Thread Ahmet Altay via dev
Kiley, do we still have the same blockers? Do you need any help?

On Thu, Aug 4, 2022 at 12:18 PM Kiley Sok via dev 
wrote:

> Last remaining issue was cherry-picked. There may be one last issue with
> gRPC that's being investigated.
>
> https://github.com/apache/beam/issues/22283
>
> On Thu, Jul 28, 2022 at 2:20 PM Kiley Sok  wrote:
>
>> Quick update for today:
>>
>> I'm still working through the validation tests, but we currently have 2
>> open issues:
>> https://github.com/apache/beam/issues/22454
>> https://github.com/apache/beam/issues/22188
>>
>>
>>
>> On Wed, Jul 27, 2022 at 5:03 PM Kiley Sok  wrote:
>>
>>> Hi all,
>>>
>>> I've cut the release branch:
>>> https://github.com/apache/beam/tree/release-2.41.0
>>>
>>> There's one known issue that needs to be cherry picked. Please let me
>>> know if you have a change that needs to go in.
>>>
>>> Thanks,
>>> Kiley
>>>
>>>


Re: Easy Multi-language via a SchemaTransform-aware Expansion Service

2022-08-05 Thread Chamikara Jayalath via dev
On Fri, Aug 5, 2022 at 9:44 AM Brian Hulette  wrote:

> Thanks Cham! I really like the proposal, I left a few comments. I also had
> one higher-level point I wanted to elevate here:
>
> > Pipeline SDKs can generate user-friendly stub-APIs based on transforms
> registered with an expansion service, eliminating the need to develop
> language-specific wrappers.
> This would be great! I think one point to consider is whether we can do
> this statically. We could package up these stubs with releases and include
> them in API docs for each language, making them much more discoverable.
> That could be an extension on top of your proposal (e.g. as part of its
> build, each SDK spins up other known expansion services and generates code
> based on the discovery responses), but maybe it could be cleaner if we
> don't really need the dynamic version?
>

So my proposal suggested two solutions for wrappers.
* A higher level (dynamic) API (SchemaAwareExternalTransform) that can be
used to discover/initialize/use any SchemaTransform.
* Developing tooling to generate stubs for each language. This is possible
since SchemaTransform gives a cleaner way to define/interpret the
construction API of a transform.

I think both can be useful. For example, the former might be useful to
quickly test/try out new SchemaTransforms without going through code
generation.
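
As a rough illustration of the dynamic path, the sketch below configures a transform purely from discovered metadata, with no per-transform stub. It is not the SchemaAwareExternalTransform API from the proposal: the `DISCOVERED` table, the transform identifiers, the `configure` function, and the returned payload shape are all hypothetical stand-ins.

```python
# Hypothetical stand-in for what an expansion service might report during
# discovery: transform identifier -> configuration fields and their types.
DISCOVERED = {
    "example:read_csv:v1": {"path": str, "delimiter": str},
    "example:generate_sequence:v1": {"start": int, "end": int},
}


def configure(identifier, **config):
    """Validate config against the discovered schema and return a request payload.

    This mimics the dynamic path: names and types are only checked at call
    time, because no generated stub exists for the transform.
    """
    try:
        schema = DISCOVERED[identifier]
    except KeyError:
        raise ValueError(f"unknown transform: {identifier}") from None
    unknown = set(config) - set(schema)
    if unknown:
        raise TypeError(f"unexpected fields: {sorted(unknown)}")
    for name, typ in schema.items():
        if name not in config:
            raise TypeError(f"missing field: {name}")
        if not isinstance(config[name], typ):
            raise TypeError(f"{name} must be {typ.__name__}")
    return {"identifier": identifier, "config": dict(config)}
```

A call such as `configure("example:generate_sequence:v1", start=0, end=10)` works for any transform the service reports, which is what makes this path convenient for trying out new SchemaTransforms before stubs exist.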

Also, I agree with you that it might be good to generate such stubs (and
corresponding docs) during release time instead of generating and
committing stubs to the repo.

Thanks,
Cham


>
> Brian
>
>
> On Thu, Aug 4, 2022 at 6:51 PM Chamikara Jayalath via dev <
> dev@beam.apache.org> wrote:
>
>> Hi All,
>>
>> I believe we can make the multi-language pipelines offering [1] much
>> easier to use by updating the expansion service to be fully aware of
>> SchemaTransforms. Additionally this will make it easy to
>> register/discover/use transforms defined in one SDK from all other SDKs.
>> Specifically we could add the following features.
>>
>>    - Expansion service can be used to easily initialize and expand
>>      transforms without need for additional code.
>>    - Expansion service can be used to easily discover already registered
>>      transforms.
>>    - Pipeline SDKs can generate user-friendly stub-APIs based on
>>      transforms registered with an expansion service, eliminating the
>>      need to develop language-specific wrappers.
>>
>> Please see here for my proposal: https://s.apache.org/easy-multi-language
>>
>> Lemme know if you have any comments/questions/suggestions :)
>>
>> Thanks,
>> Cham
>>
>> [1]
>> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>>
>>


Re: Easy Multi-language via a SchemaTransform-aware Expansion Service

2022-08-05 Thread Brian Hulette via dev
Thanks Cham! I really like the proposal, I left a few comments. I also had
one higher-level point I wanted to elevate here:

> Pipeline SDKs can generate user-friendly stub-APIs based on transforms
registered with an expansion service, eliminating the need to develop
language-specific wrappers.
This would be great! I think one point to consider is whether we can do
this statically. We could package up these stubs with releases and include
them in API docs for each language, making them much more discoverable.
That could be an extension on top of your proposal (e.g. as part of its
build, each SDK spins up other known expansion services and generates code
based on the discovery responses), but maybe it could be cleaner if we
don't really need the dynamic version?
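
The static option described above (generate code from discovery responses at build time) could look roughly like the sketch below. It is only an illustration: `generate_stub` is a made-up name, the `namespace:name:version` identifier format is an assumption, and the generated function merely returns a dict where a real stub would build an actual transform.

```python
def generate_stub(identifier, fields):
    """Return Python source for a typed wrapper function around one transform.

    `identifier` and `fields` stand in for one entry of a discovery
    response; the identifier is assumed to look like "namespace:name:version".
    """
    func_name = identifier.split(":")[1]
    params = ", ".join(f"{name}: {typ.__name__}" for name, typ in fields.items())
    config = ", ".join(f"{name!r}: {name}" for name in fields)
    return (
        f"def {func_name}({params}):\n"
        f'    """Generated stub for {identifier}."""\n'
        f"    return {{'identifier': {identifier!r}, 'config': {{{config}}}}}\n"
    )
```

Writing the returned source to a module during the release build would give users a discoverable, documented function such as `generate_sequence(start: int, end: int)` without anyone hand-writing the wrapper.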

Brian


On Thu, Aug 4, 2022 at 6:51 PM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> Hi All,
>
> I believe we can make the multi-language pipelines offering [1] much
> easier to use by updating the expansion service to be fully aware of
> SchemaTransforms. Additionally this will make it easy to
> register/discover/use transforms defined in one SDK from all other SDKs.
> Specifically we could add the following features.
>
>    - Expansion service can be used to easily initialize and expand
>      transforms without need for additional code.
>    - Expansion service can be used to easily discover already registered
>      transforms.
>    - Pipeline SDKs can generate user-friendly stub-APIs based on
>      transforms registered with an expansion service, eliminating the
>      need to develop language-specific wrappers.
>
> Please see here for my proposal: https://s.apache.org/easy-multi-language
>
> Lemme know if you have any comments/questions/suggestions :)
>
> Thanks,
> Cham
>
> [1]
> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>
>


Re: [idea] A new IO connector named DataLakeIO, which support to connect Beam and data lake, such as Delta Lake, Apache Hudi, Apache iceberg.

2022-08-05 Thread Sachin Agarwal via dev
This is wonderful to hear -
https://beam.apache.org/contribute/get-started-contributing/#contribute-code
has the process to contribute; we're very much looking forward to seeing
your DataLakeIO!

On Fri, Aug 5, 2022 at 9:02 AM 张涛  wrote:

>
> Hi, we developed a new IO connector named DataLakeIO to connect Beam with
> data lakes such as Delta Lake, Apache Hudi, and Apache Iceberg. Beam can use
> DataLakeIO to read data from and write data to a data lake. We did not find
> a data lake IO listed at https://beam.apache.org/documentation/io/built-in/,
> and we would like to contribute this new IO connector to Beam. What should
> we do next? Thank you very much!
>


[idea] A new IO connector named DataLakeIO, which support to connect Beam and data lake, such as Delta Lake, Apache Hudi, Apache iceberg.

2022-08-05 Thread 张涛

Hi, we developed a new IO connector named DataLakeIO to connect Beam with
data lakes such as Delta Lake, Apache Hudi, and Apache Iceberg. Beam can use
DataLakeIO to read data from and write data to a data lake. We did not find
a data lake IO listed at https://beam.apache.org/documentation/io/built-in/,
and we would like to contribute this new IO connector to Beam. What should
we do next? Thank you very much!


Beam High Priority Issue Report

2022-08-05 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/22440 [Bug]: Python Batch Dataflow SideInput LoadTests failing
https://github.com/apache/beam/issues/22321 PortableRunnerTestWithExternalEnv.test_pardo_large_input is regularly failing on jenkins
https://github.com/apache/beam/issues/22303 [Task]: Add tests to Kafka SDF and fix known and discovered issues
https://github.com/apache/beam/issues/22299 [Bug]: JDBCIO Write freeze at getConnection() in WriteFn
https://github.com/apache/beam/issues/22283 [Bug]: Python Lots of fn runner test items cost exactly 5 seconds to run
https://github.com/apache/beam/issues/22188 BigQuery Storage API sink sometimes gets stuck outputting to an invalid timestamp
https://github.com/apache/beam/issues/21794 Dataflow runner creates a new timer whenever the output timestamp is change
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output to Failed Inserts PCollection
https://github.com/apache/beam/issues/21704 beam_PostCommit_Java_DataflowV2 failures parent bug
https://github.com/apache/beam/issues/21703 pubsublite.ReadWriteIT failing in beam_PostCommit_Java_DataflowV1 and V2
https://github.com/apache/beam/issues/21702 SpannerWriteIT failing in beam PostCommit Java V1
https://github.com/apache/beam/issues/21701 beam_PostCommit_Java_DataflowV1 failing with a variety of flakes and errors
https://github.com/apache/beam/issues/21700 --dataflowServiceOptions=use_runner_v2 is broken
https://github.com/apache/beam/issues/21696 Flink Tests failure : java.lang.NoClassDefFoundError: Could not initialize class org.apache.beam.runners.core.construction.SerializablePipelineOptions
https://github.com/apache/beam/issues/21695 DataflowPipelineResult does not raise exception for unsuccessful states.
https://github.com/apache/beam/issues/21694 BigQuery Storage API insert with writeResult retry and write to error table
https://github.com/apache/beam/issues/21480 flake: FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
https://github.com/apache/beam/issues/21472 Dataflow streaming tests failing new AfterSynchronizedProcessingTime test
https://github.com/apache/beam/issues/21471 Flakes: Failed to load cache entry
https://github.com/apache/beam/issues/21470 Test flake: test_split_half_sdf
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: Connection refused
https://github.com/apache/beam/issues/21468 beam_PostCommit_Python_Examples_Dataflow failing
https://github.com/apache/beam/issues/21467 GBK and CoGBK streaming Java load tests failing
https://github.com/apache/beam/issues/21465 Kafka commit offset drop data on failure for runners that have non-checkpointing shuffle
https://github.com/apache/beam/issues/21463 NPE in Flink Portable ValidatesRunner streaming suite
https://github.com/apache/beam/issues/21462 Flake in org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use
https://github.com/apache/beam/issues/21271 pubsublite.ReadWriteIT flaky in beam_PostCommit_Java_DataflowV2
https://github.com/apache/beam/issues/21270 org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView flaky on Dataflow Runner V2
https://github.com/apache/beam/issues/21268 Race between member variable being accessed due to leaking uninitialized state via OutboundObserverFactory
https://github.com/apache/beam/issues/21267 WriteToBigQuery submits a duplicate BQ load job if a 503 error code is returned from googleapi
https://github.com/apache/beam/issues/21266 org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful is flaky in Java ValidatesRunner Flink suite.
https://github.com/apache/beam/issues/21265 apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible
https://github.com/apache/beam/issues/21263 (Broken Pipe induced) Bricked Dataflow Pipeline
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not follow spec
https://github.com/apache/beam/issues/21261 org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer is flaky
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit data at GC time
https://github.com/apache/beam/issues/21257 Either Create or DirectRunner fails to produce all elements to the following transform
https://github.com/apache/beam/issues/21123 Multiple jobs running on Flink session cluster reuse the persistent Python environment.
https://github.com/apache/beam/issues/21121