Has anyone looked into out of band pickling for Beam's Python SDK, i.e., Pickle protocol version 5? https://www.python.org/dev/peps/pep-0574/ https://docs.python.org/3/library/pickle.html#out-of-band-buffers
For Beam pipelines passing around NumPy arrays (or collections of NumPy arrays, like pandas or Xarray) I've noticed that serialization costs can be significant. Beam seems to currently incur at least one one (maybe two) unnecessary memory copies. Pickle protocol version 5 exists for solving exactly this problem. You can serialize collections of arbitrary Python objects in a fully streaming fashion using memory buffers. This is a Python 3.8 feature, but the "pickle5" library provides a backport to Python 3.6 and 3.7. It has been supported by NumPy since version 1.16, released in January 2019. Cheers, Stephan