Has anyone looked into out of band pickling for Beam's Python SDK, i.e.,
Pickle protocol version 5?
https://www.python.org/dev/peps/pep-0574/
https://docs.python.org/3/library/pickle.html#out-of-band-buffers

For Beam pipelines passing around NumPy arrays (or collections of NumPy
arrays, like pandas or Xarray) I've noticed that serialization costs can be
significant. Beam seems to currently incur at least one one (maybe two)
unnecessary memory copies.

Pickle protocol version 5 exists for solving exactly this problem. You can
serialize collections of arbitrary Python objects in a fully streaming
fashion using memory buffers. This is a Python 3.8 feature, but the
"pickle5" library provides a backport to Python 3.6 and 3.7. It has been
supported by NumPy since version 1.16, released in January 2019.

Cheers,
Stephan

Reply via email to