Re: [DISCUSS] Dependency management in Apache Beam Python SDK

2022-08-25 Thread Brian Hulette via dev
Thanks for writing this up Valentyn!

I'm curious, Jarek: does Airflow take any dependencies on popular libraries
like pandas, numpy, pyarrow, scipy, etc., which users are likely to have
their own dependency on? I think these dependencies are challenging in a
different way than the client libraries - ideally we would support a wide
version range so as not to require users to upgrade those libraries in
lockstep with Beam. However in some cases our dependency is pretty tight
(e.g. the DataFrame API's dependency on pandas), so we need to make sure to
explicitly test with multiple different versions. Does Airflow have any
similar issues?
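Brian's point about wide version ranges can be sketched with a small stdlib-only check. This is illustrative only: real tools implement full PEP 440 semantics (pre-releases, epochs, etc.), e.g. via the `packaging` library.

```python
# Toy version-range check illustrating wide dependency bounds such as
# "pandas>=1.0,<1.5". Illustrative only; not a real resolver.

def parse(version: str) -> tuple:
    """Turn '1.3.5' into (1, 3, 5) for lexicographic comparison."""
    return tuple(int(part) for part in version.split("."))

def satisfies(version: str, lower: str, upper: str) -> bool:
    """True if lower <= version < upper (inclusive lower, exclusive upper)."""
    return parse(lower) <= parse(version) < parse(upper)

# A wide range lets users keep their own pandas without lockstep upgrades.
print(satisfies("1.3.5", "1.0", "1.5"))  # True: inside the range
print(satisfies("1.5.0", "1.0", "1.5"))  # False: upper bound is exclusive
```

The wider the accepted range, the more pandas versions Beam must explicitly test against, which is the trade-off Brian raises.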

Thanks!
Brian

On Thu, Aug 25, 2022 at 5:36 PM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> Hi Jarek,
>
> Thanks a lot for the detailed feedback and for sharing the Airflow story -
> this is exactly what I was hoping to hear in response from the mailing list!
>
> 600+ dependencies is very impressive, so I'd be happy to chat more and
> learn from your experience.
>
> On Wed, Aug 24, 2022 at 5:50 AM Jarek Potiuk  wrote:
>
>> Comment (from a bit of an outsider)
>>
>> Fantastic document Valentyn.
>>
>> Very, very insightful and interesting. We feel a lot of the same pain in
>> Apache Airflow (actually even more, because we have not 20 but 620+
>> dependencies), but we are also a bit more advanced in how we manage our
>> dependencies - some of the ideas you had there have already been tried and
>> tested in Airflow, some of them are a bit different, but we can
>> definitely share "principles", and we are a little higher in the "supply
>> chain" (i.e. the Apache Beam Python SDK is our dependency).
>>
>> I left some suggestions and some comments describing in detail how the
>> same problems look in Airflow and how we addressed them (if we did),
>> and I am happy to participate in further discussions. I am "the dependency
>> guy" in Airflow and happy to share my experiences and help to work out some
>> problems - especially problems coming from using multiple
>> google-client libraries and diamond dependencies (we are just now dealing
>> with a similar issue, where we will likely have to do a massive update of
>> several of our clients - hopefully with the involvement of the Composer
>> team). And I'd love to be involved in a joint discussion with the google
>> client team to work out some common expectations that we can rely on when
>> we define our future upgrade strategy for google clients.
>>
>> I will watch it here and be happy to spend quite some time on helping to
>> hash it out.
>>
>> BTW, you can also watch the talk I gave last year at PyWaw, "Managing
>> Python dependencies at Scale":
>> https://www.youtube.com/watch?v=_SjMdQLP30s&t=2549s where I explain the
>> approach we took, the reasoning behind it, etc.
>>
>> J.
>>
>>
>> On Wed, Aug 24, 2022 at 2:45 AM Valentyn Tymofieiev via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> Recently, several issues [1-3] have highlighted outage risks and
>>> developer inconveniences due to dependency management practices in Beam
>>> Python.
>>>
>>> With dependabot and other tooling that we have integrated with Beam,
>>> one of the missing pieces seems to be having a clear guideline for how we
>>> should specify requirements for our dependencies, and when and how we
>>> should update them, to have a sustainable process.
>>>
>>> As a conversation starter, I put together a retrospective [4]
>>> covering a recent incident and would like to get community opinions on the
>>> open questions.
>>>
>>> In particular, if you have experience managing dependencies for other
>>> Python libraries with rich dependency chains, knowledge of available
>>> tooling or first hand experience dealing with other dependency issues in
>>> Beam, your input would be greatly appreciated.
>>>
>>> Thanks,
>>> Valentyn
>>>
>>> [1] https://github.com/apache/beam/issues/22218
>>> [2] https://github.com/apache/beam/pull/22550#issuecomment-1217348455
>>> [3] https://github.com/apache/beam/issues/22533
>>> [4]
>>> https://docs.google.com/document/d/1gxQF8mciRYgACNpCy1wlR7TBa8zN-Tl6PebW-U8QvBk/edit
>>>
>>
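The diamond-dependency problem Jarek mentions can be sketched with a toy feasibility check. The package names and version bounds below are made up for illustration; real resolvers (e.g. pip's) handle far richer constraint languages.

```python
# Toy diamond-dependency check: two packages constrain a shared library,
# and an installable version exists only if the constraint intervals
# intersect. Bounds are (lower, upper) version tuples, lower inclusive,
# upper exclusive.

def feasible(constraints):
    """True iff the intersection of all (lower, upper) intervals
    is non-empty, i.e. some shared-library version satisfies everyone."""
    lower = max(lo for lo, _ in constraints)
    upper = min(hi for _, hi in constraints)
    return lower < upper

# One package needs the shared lib >=1.0,<3.0; another needs >=2.0,<4.0: OK.
print(feasible([((1, 0), (3, 0)), ((2, 0), (4, 0))]))  # True

# But if one needs >=2.0,<3.0 while the other pins <2.0: conflict.
print(feasible([((2, 0), (3, 0)), ((1, 0), (2, 0))]))  # False
```

This is why massive, coordinated client upgrades (as Jarek describes) become necessary: tight, non-overlapping bounds across the dependency graph leave no feasible version at all.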


Re: [DISCUSS] Dependency management in Apache Beam Python SDK

2022-08-25 Thread Valentyn Tymofieiev via dev
Hi Jarek,

Thanks a lot for the detailed feedback and for sharing the Airflow story -
this is exactly what I was hoping to hear in response from the mailing list!

600+ dependencies is very impressive, so I'd be happy to chat more and
learn from your experience.

On Wed, Aug 24, 2022 at 5:50 AM Jarek Potiuk  wrote:

> Comment (from a bit of an outsider)
>
> Fantastic document Valentyn.
>
> Very, very insightful and interesting. We feel a lot of the same pain in
> Apache Airflow (actually even more, because we have not 20 but 620+
> dependencies), but we are also a bit more advanced in how we manage our
> dependencies - some of the ideas you had there have already been tried and
> tested in Airflow, some of them are a bit different, but we can
> definitely share "principles", and we are a little higher in the "supply
> chain" (i.e. the Apache Beam Python SDK is our dependency).
>
> I left some suggestions and some comments describing in detail how the
> same problems look in Airflow and how we addressed them (if we did),
> and I am happy to participate in further discussions. I am "the dependency
> guy" in Airflow and happy to share my experiences and help to work out some
> problems - especially problems coming from using multiple
> google-client libraries and diamond dependencies (we are just now dealing
> with a similar issue, where we will likely have to do a massive update of
> several of our clients - hopefully with the involvement of the Composer
> team). And I'd love to be involved in a joint discussion with the google
> client team to work out some common expectations that we can rely on when
> we define our future upgrade strategy for google clients.
>
> I will watch it here and be happy to spend quite some time on helping to
> hash it out.
>
> BTW, you can also watch the talk I gave last year at PyWaw, "Managing
> Python dependencies at Scale":
> https://www.youtube.com/watch?v=_SjMdQLP30s&t=2549s where I explain the
> approach we took, the reasoning behind it, etc.
>
> J.
>
>
> On Wed, Aug 24, 2022 at 2:45 AM Valentyn Tymofieiev via dev <
> dev@beam.apache.org> wrote:
>
>> Hi everyone,
>>
>> Recently, several issues [1-3] have highlighted outage risks and
>> developer inconveniences due to dependency management practices in Beam
>> Python.
>>
>> With dependabot and other tooling that we have integrated with Beam, one
>> of the missing pieces seems to be having a clear guideline for how we
>> should specify requirements for our dependencies, and when and how we
>> should update them, to have a sustainable process.
>>
>> As a conversation starter, I put together a retrospective [4]
>> covering a recent incident and would like to get community opinions on the
>> open questions.
>>
>> In particular, if you have experience managing dependencies for other
>> Python libraries with rich dependency chains, knowledge of available
>> tooling or first hand experience dealing with other dependency issues in
>> Beam, your input would be greatly appreciated.
>>
>> Thanks,
>> Valentyn
>>
>> [1] https://github.com/apache/beam/issues/22218
>> [2] https://github.com/apache/beam/pull/22550#issuecomment-1217348455
>> [3] https://github.com/apache/beam/issues/22533
>> [4]
>> https://docs.google.com/document/d/1gxQF8mciRYgACNpCy1wlR7TBa8zN-Tl6PebW-U8QvBk/edit
>>
>


Re: [ANNOUNCE] Apache Beam 2.41.0 Released

2022-08-25 Thread Pablo Estrada via dev
Thank you Kiley!

On Thu, Aug 25, 2022 at 10:55 AM Kiley Sok  wrote:

> The Apache Beam team is pleased to announce the release of version 2.41.0.
>
> Apache Beam is an open source unified programming model to define and
> execute data processing pipelines, including ETL, batch and stream
> (continuous) processing. See https://beam.apache.org
>
> You can download the release here:
>
> https://beam.apache.org/get-started/downloads/
>
> This release includes bug fixes, features, and improvements detailed on
> the Beam blog: https://beam.apache.org/blog/beam-2.41.0/
>
> Thanks to everyone who contributed to this release, and we hope you enjoy
> using Beam 2.41.0.
>
> -- Kiley, on behalf of The Apache Beam team
>


[ANNOUNCE] Apache Beam 2.41.0 Released

2022-08-25 Thread Kiley Sok
The Apache Beam team is pleased to announce the release of version 2.41.0.

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See https://beam.apache.org

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on the
Beam blog: https://beam.apache.org/blog/beam-2.41.0/

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.41.0.

-- Kiley, on behalf of The Apache Beam team


Beam Dependency Check Report (2022-08-25)

2022-08-25 Thread Apache Jenkins Server


Re: SingleStore IO

2022-08-25 Thread John Casey via dev
Hi Adalbert,

The nature of scheduling work with splittable DoFns is such that trying to
start all splits at the same time isn't really supported. In addition, the
general assumption of splitting work in Beam is that a split can be retried
in isolation from other splits, which doesn't appear to be supported by
SingleStore's parallel read.

That said, this looks really promising, so I'd be happy to get on a call to
help better understand your design, and see if we can find a solution.

John
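The split-isolation assumption John describes can be modeled with a toy runner (plain Python, not the Beam API; all names here are illustrative): each split retries independently, which is exactly what a source requiring a coordinated all-partition restart cannot satisfy.

```python
# Toy model of Beam's per-split retry assumption (NOT the Beam API).
# Each "split" is processed and, on failure, retried in isolation;
# other splits are never restarted.

def run_splits(split_ids, read_fn, max_attempts=3):
    attempts = {s: 0 for s in split_ids}
    results = {}
    for s in split_ids:            # a real runner schedules these in parallel
        while attempts[s] < max_attempts:
            attempts[s] += 1
            try:
                results[s] = read_fn(s, attempts[s])
                break
            except RuntimeError:
                continue           # retry only this split, in isolation
    return results, attempts

def flaky_read(split, attempt):
    """Split 2 fails once, then succeeds; the others succeed immediately."""
    if split == 2 and attempt == 1:
        raise RuntimeError("transient failure")
    return f"rows-from-split-{split}"

results, attempts = run_splits([1, 2, 3], flaky_read)
print(attempts)  # {1: 1, 2: 2, 3: 1} - only split 2 was retried
```

Under SingleStore's single-read mode, a failure on one partition invalidates the others, so splits 1 and 3 would also need to rerun - a coordinated restart that this model (and Beam's splitting contract) does not provide for.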

On Thu, Aug 25, 2022 at 10:16 AM Adalbert Makarovych <
amakarovych...@singlestore.com> wrote:

> Hello,
>
> I'm working on the SingleStore IO connector and would like to discuss it
> with Beam developers.
> It would be great if the connector could use SingleStore parallel read.
> In the ideal case, the connector should use Single-read mode as it is
> faster than Multiple-read and consumes much less memory.
>
> One of the problems is that in Single-read mode, each reader must initiate
> its read query before any reader receives data. Is it possible to
> somehow configure Beam to start all DoFns at the same time? Or to get the
> number of started DoFns at runtime?
>
> The other problem is that Single-read allows reading data from a partition
> only once, so if one reading thread fails, all the others must be restarted
> to retry. Is it possible to achieve this behavior? Or to at least
> gracefully fail without additional retries?
>
> Here are the first drafts of the design documentation.
> I would appreciate any help with this stuff :)
>
> --
> Adalbert Makarovych
> Software Engineer at SingleStore
>
>
> 
>


SingleStore IO

2022-08-25 Thread Adalbert Makarovych
Hello,

I'm working on the SingleStore IO connector and would like to discuss it
with Beam developers.
It would be great if the connector could use SingleStore parallel read.
In the ideal case, the connector should use Single-read mode as it is
faster than Multiple-read and consumes much less memory.

One of the problems is that in Single-read mode, each reader must initiate
its read query before any reader receives data. Is it possible to
somehow configure Beam to start all DoFns at the same time? Or to get the
number of started DoFns at runtime?

The other problem is that Single-read allows reading data from a partition
only once, so if one reading thread fails, all the others must be restarted
to retry. Is it possible to achieve this behavior? Or to at least
gracefully fail without additional retries?

Here are the first drafts of the design documentation.
I would appreciate any help with this stuff :)

-- 
Adalbert Makarovych
Software Engineer at SingleStore




Java object serialization error, java.io.InvalidClassException: org.apache.spark.deploy.ApplicationDescription; local class incompatible

2022-08-25 Thread Elliot Metsger
Howdy folks, super-new to Beam, and attempting to get a simple example
working with Go, using the portable runner and Spark. There seems to be an
incompatibility between Java components, and I’m not quite sure where the
disconnect is, but at the root it seems to be an incompatibility with
object serializations.

When I submit the job via the go sdk, it errors out on the Spark side with:
[8:59 AM] 22/08/25 12:45:59 ERROR TransportRequestHandler: Error while
invoking RpcHandler#receive() for one-way message.
java.io.InvalidClassException:
org.apache.spark.deploy.ApplicationDescription; local class incompatible:
stream classdesc serialVersionUID = 6543101073799644159, local class
serialVersionUID = 1574364215946805297
I’m using apache/beam_spark_job_server:2.41.0 and apache/spark:latest
(docker-compose[0], hello world wordcount example pipeline[1]).

Any ideas on where to look?  It looks like the Beam JobService is using
Java 8 (?) and Spark is using Java 11.  I’ve tried downgrading Spark from
3.3.0 to 3.1.3 (the earliest version for which Docker images are
available), and downgrading to Beam 2.40.0 with no luck.

This simple repo[2] should demonstrate the issue.  Any pointers would be
appreciated!

[0]: https://github.com/emetsger/beam-test/blob/develop/docker-compose.yml
[1]:
https://github.com/emetsger/beam-test/blob/develop/debugging_wordcount.go
[2]: https://github.com/emetsger/beam-test


Beam High Priority Issue Report (70)

2022-08-25 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/22854 [Bug]: Type inference failing for 
Python SDK with External transforms and beam.Row
https://github.com/apache/beam/issues/22779 [Bug]: SpannerIO.readChangeStream() 
stops forwarding change records and starts continuously throwing (large number) 
of Operation ongoing errors 
https://github.com/apache/beam/issues/22749 [Bug]: Bytebuddy version update 
causes Invisible parameter type error
https://github.com/apache/beam/issues/22743 [Bug]: Test flake: 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImplTest.testInsertWithinRowCountLimits
https://github.com/apache/beam/issues/22440 [Bug]: Python Batch Dataflow 
SideInput LoadTests failing
https://github.com/apache/beam/issues/22321 
PortableRunnerTestWithExternalEnv.test_pardo_large_input is regularly failing 
on jenkins
https://github.com/apache/beam/issues/22303 [Task]: Add tests to Kafka SDF and 
fix known and discovered issues
https://github.com/apache/beam/issues/22299 [Bug]: JDBCIO Write freeze at 
getConnection() in WriteFn
https://github.com/apache/beam/issues/22283 [Bug]: Python Lots of fn runner 
test items cost exactly 5 seconds to run
https://github.com/apache/beam/issues/21794 Dataflow runner creates a new timer 
whenever the output timestamp is change
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output 
to Failed Inserts PCollection
https://github.com/apache/beam/issues/21704 beam_PostCommit_Java_DataflowV2 
failures parent bug
https://github.com/apache/beam/issues/21703 pubsublite.ReadWriteIT failing in 
beam_PostCommit_Java_DataflowV1 and V2
https://github.com/apache/beam/issues/21702 SpannerWriteIT failing in beam 
PostCommit Java V1
https://github.com/apache/beam/issues/21701 beam_PostCommit_Java_DataflowV1 
failing with a variety of flakes and errors
https://github.com/apache/beam/issues/21700 
--dataflowServiceOptions=use_runner_v2 is broken
https://github.com/apache/beam/issues/21696 Flink Tests failure :  
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.beam.runners.core.construction.SerializablePipelineOptions 
https://github.com/apache/beam/issues/21695 DataflowPipelineResult does not 
raise exception for unsuccessful states.
https://github.com/apache/beam/issues/21694 BigQuery Storage API insert with 
writeResult retry and write to error table
https://github.com/apache/beam/issues/21480 flake: 
FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
https://github.com/apache/beam/issues/21472 Dataflow streaming tests failing 
new AfterSynchronizedProcessingTime test
https://github.com/apache/beam/issues/21471 Flakes: Failed to load cache entry
https://github.com/apache/beam/issues/21470 Test flake: test_split_half_sdf
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21468 
beam_PostCommit_Python_Examples_Dataflow failing
https://github.com/apache/beam/issues/21467 GBK and CoGBK streaming Java load 
tests failing
https://github.com/apache/beam/issues/21465 Kafka commit offset drop data on 
failure for runners that have non-checkpointing shuffle
https://github.com/apache/beam/issues/21463 NPE in Flink Portable 
ValidatesRunner streaming suite
https://github.com/apache/beam/issues/21462 Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use
https://github.com/apache/beam/issues/21271 pubsublite.ReadWriteIT flaky in 
beam_PostCommit_Java_DataflowV2  
https://github.com/apache/beam/issues/21270 
org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView
 flaky on Dataflow Runner V2
https://github.com/apache/beam/issues/21268 Race between member variable being 
accessed due to leaking uninitialized state via OutboundObserverFactory
https://github.com/apache/beam/issues/21267 WriteToBigQuery submits a duplicate 
BQ load job if a 503 error code is returned from googleapi
https://github.com/apache/beam/issues/21266 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
 is flaky in Java ValidatesRunner Flink suite.
https://github.com/apache/beam/issues/21265 
apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally
 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible
https://github.com/apache/beam/issues/21263 (Broken Pipe induced) Bricked 
Dataflow Pipeline 
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21261 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky