This appears to be a recently reported issue that others have also hit (e.g. https://github.com/apache/beam/issues/28142), and it is being actively investigated. Given that, memory fragmentation is unlikely to be the cause.
On Tue, Aug 22, 2023 at 5:21 PM Valentyn Tymofieiev <valen...@google.com> wrote:

> Hi, thanks for reaching out.
>
> I'd be curious to see whether the memory consumption patterns you observe
> change if you switch the memory allocator library.
>
> For example, you could try to use a custom container, install jemalloc,
> and enable it. See: https://beam.apache.org/documentation/runtime/environments,
> https://cloud.google.com/dataflow/docs/guides/using-custom-containers
>
> Your Dockerfile might look like the following:
>
> FROM apache/beam_python3.10_sdk:2.49.0
>
> # Preinstall other dependencies
> RUN apt-get update \
>     && apt-get install -y libjemalloc-dev
>
> ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so
>
> # Set the entrypoint to the Apache Beam SDK launcher.
> ENTRYPOINT ["/opt/apache/beam/boot"]
>
> On Tue, Aug 22, 2023 at 10:42 AM Cheng Han Lee <le...@allium.so> wrote:
>
>> Hello!
>>
>> I'm an avid Apache Beam user (on Dataflow), and we use Beam to stream
>> blockchain data to various sinks. I recently noticed some memory issues
>> across all our pipelines but have not yet been able to find the root
>> cause, and was hoping someone on your team might be able to help. If this
>> isn't the right avenue for it, please let me know how I should reach out.
>>
>> The details are here on Stack Overflow:
>>
>> https://stackoverflow.com/questions/76950068/memory-leak-in-apache-beam-python-readfrompubsub-io
>>
>> Thanks,
>> Chenghan
>> CTO | Allium
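For reference, a custom container like the one sketched in the quoted Dockerfile would typically be built, pushed to a registry, and passed to the Dataflow job via the `--sdk_container_image` pipeline option. A minimal sketch; the image path, project, region, and pipeline filename below are placeholders, not values from this thread:

```shell
# Build the custom SDK image from the Dockerfile quoted above
# (IMAGE_URI is a placeholder; substitute your own registry path).
IMAGE_URI=us-docker.pkg.dev/MY_PROJECT/MY_REPO/beam-jemalloc:2.49.0
docker build -t "$IMAGE_URI" .
docker push "$IMAGE_URI"

# Launch the pipeline with the custom container so the workers
# run with jemalloc preloaded via LD_PRELOAD.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=MY_PROJECT \
  --region=us-central1 \
  --sdk_container_image="$IMAGE_URI"
```

The key point is that `LD_PRELOAD` in the image causes every worker process to allocate through jemalloc instead of glibc malloc, which changes how freed memory is returned to the OS and can distinguish allocator fragmentation from a genuine leak.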