tvalentyn opened a new issue, #28246:
URL: https://github.com/apache/beam/issues/28246
### What happened?
We have identified a memory leak that affects Beam Python SDK versions
2.47.0 and above. We expect the leak to be fully remediated in Beam 2.51.0.
**Mitigation**
Until Beam 2.51.0 is released, consider any of the following workarounds:
* Use `apache-beam==2.46.0` or an earlier version.
* Install protobuf 3.x in both the submission and runtime environments. For
example, you can use the `--requirements_file` pipeline option with a file that
includes:
```
protobuf==3.20.3
grpcio-status==1.48.2
```
For more information, see:
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
* Use the pure-Python implementation of protobuf by setting the
`PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` environment variable in the
runtime environment. For example, you could create a custom Beam SDK container
from a Dockerfile that looks like the following:
```
FROM gcr.io/cloud-dataflow/v1beta3/python310-fnapi:2.47.0
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
```
For more information, see:
https://beam.apache.org/documentation/runtime/environments/
* Install an updated version of the protobuf dependency once it is released
(see the additional details below).
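
As a rough aid for deciding whether an environment needs one of these workarounds, the version conditions described in this issue can be sketched as a small check. The helper names `parse_version` and `may_be_affected` are hypothetical, the upper bound is only the expected fix release (2.51.0), and the check does not account for the `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION` override or for a patched protobuf build:

```python
import re

# Assumption: the affected range is [2.47.0, 2.51.0), per the issue text.
# The upper bound is the expected fix release, not a confirmed one.
FIRST_AFFECTED = (2, 47, 0)
EXPECTED_FIX = (2, 51, 0)


def parse_version(version):
    """Return the leading numeric (major, minor, patch) tuple of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def may_be_affected(beam_version, protobuf_version):
    """Hypothetical helper: True if the version pair matches the leak conditions."""
    beam_in_range = FIRST_AFFECTED <= parse_version(beam_version) < EXPECTED_FIX
    # protobuf 3.x is one of the listed workarounds, so only 4.x and later count.
    protobuf_not_3x = parse_version(protobuf_version)[0] >= 4
    return beam_in_range and protobuf_not_3x
```

The installed versions could be read with `importlib.metadata.version("apache-beam")` and `importlib.metadata.version("protobuf")` and passed to `may_be_affected`.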
**Additional details**
The leak can be reproduced with the following pipeline:
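```
import apache_beam as beam
from apache_beam.io import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Assumed options: Pub/Sub reads need streaming mode; add your runner flags.
pipeline_options = PipelineOptions(streaming=True)

with beam.Pipeline(options=pipeline_options) as p:
  # duplicate reads to increase throughput
  inputs = []
  for i in range(32):
    inputs.append(
        p | f"Read pubsub{i}" >> ReadFromPubSub(
            topic='projects/pubsub-public-data/topics/taxirides-realtime',
            with_attributes=True))
  # merge the 32 reads into a single PCollection
  inputs | beam.Flatten()
```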
The leak was triggered by Beam switching the default `protobuf` package version
from 3.19.x to 4.22.x in https://github.com/apache/beam/pull/24599. The new
versions of `protobuf` also switched the default protobuf implementation to the
`upb` implementation. The `upb` implementation had two known leaks that have
since been mitigated by the protobuf team in:
https://github.com/protocolbuffers/protobuf/issues/10088,
https://github.com/protocolbuffers/upb/issues/1243. The latest available
`protobuf==4.24.2` does not yet have the fix, but we have confirmed that using
a patched version built in
https://github.com/protocolbuffers/upb/actions/runs/6028136812 fixes the leak.
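
To confirm which protobuf implementation a process is actually using (for example, after setting `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION`), a minimal sketch along these lines may help. The function name `protobuf_backend_report` is hypothetical, and `google.protobuf.internal.api_implementation` is an internal protobuf module, so treat its availability as an assumption:

```python
import os


def protobuf_backend_report():
    """Report the requested and the active protobuf implementation, if discoverable."""
    requested = os.environ.get("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION")
    try:
        # Internal protobuf module; its location may change between releases.
        from google.protobuf.internal import api_implementation
        active = api_implementation.Type()
    except (ImportError, AttributeError):
        active = None  # protobuf is not installed, or the module has moved
    return {"requested": requested, "active": active}


if __name__ == "__main__":
    print(protobuf_backend_report())
```

An `active` value of `upb` on an affected Beam version would suggest the workarounds above are still needed. The environment variable generally needs to be set before protobuf is first imported.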
### Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
### Issue Components
- [X] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner