+1. FWIW, I recently ran into the exact case you described (high serialization cost). In my case the solution was to implement some not-so-intuitive alternative transforms, but I would have very much appreciated faster serialization.
Thanks,
Evan

On Tue, May 25, 2021 at 15:26 Stephan Hoyer <sho...@google.com> wrote:

> Has anyone looked into out of band pickling for Beam's Python SDK, i.e.,
> Pickle protocol version 5?
> https://www.python.org/dev/peps/pep-0574/
> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>
> For Beam pipelines passing around NumPy arrays (or collections of NumPy
> arrays, like pandas or Xarray) I've noticed that serialization costs can be
> significant. Beam seems to currently incur at least one (maybe two)
> unnecessary memory copies.
>
> Pickle protocol version 5 exists for solving exactly this problem. You can
> serialize collections of arbitrary Python objects in a fully streaming
> fashion using memory buffers. This is a Python 3.8 feature, but the
> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
> supported by NumPy since version 1.16, released in January 2019.
>
> Cheers,
> Stephan
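
For anyone following along, here is a minimal sketch of what out-of-band pickling looks like with only the standard library (Python 3.8+). It is not Beam code and says nothing about how Beam's coders would need to change. The `payload` bytearray and the `record` dict are made-up stand-ins for an array's memory and a pipeline element; NumPy arrays hand their buffers to `buffer_callback` on their own, so the explicit `PickleBuffer` wrapper is only there to keep the example free of dependencies.

```python
import pickle

# Stand-in for a large array's memory (hypothetical data, not from Beam).
payload = bytearray(b"\x01" * 1_000_000)

# Wrapping the buffer in PickleBuffer marks it as eligible for
# out-of-band transfer under protocol 5.
record = {"key": "tile-0", "data": pickle.PickleBuffer(payload)}

# In-band: the payload bytes are copied into the pickle stream itself.
in_band = pickle.dumps(record, protocol=5)

# Out-of-band: the stream holds only a placeholder, and the buffer is
# handed to the callback instead of being written into the stream.
buffers = []
out_of_band = pickle.dumps(record, protocol=5, buffer_callback=buffers.append)

# The receiver must supply the same buffers, in the same order.
restored = pickle.loads(out_of_band, buffers=buffers)
restored_view = memoryview(restored["data"])

print(len(in_band), len(out_of_band), len(buffers))
# Check whether the restored view is still backed by the original bytearray.
print(restored_view.obj is payload)
```

The first print compares the two stream sizes, which is where the saving shows up: the out-of-band stream carries metadata only, and the buffers can be shipped separately by whatever transport is available.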