+1

FWIW I recently ran into the exact case you described (high serialization
cost). The solution was to implement some not-so-intuitive alternative
transforms in my case, but I would have very much appreciated faster
serialization performance.

Thanks,
Evan

On Tue, May 25, 2021 at 15:26 Stephan Hoyer <sho...@google.com> wrote:

> Has anyone looked into out of band pickling for Beam's Python SDK, i.e.,
> Pickle protocol version 5?
> https://www.python.org/dev/peps/pep-0574/
> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>
> For Beam pipelines passing around NumPy arrays (or collections of NumPy
> arrays, like pandas or Xarray) I've noticed that serialization costs can be
> significant. Beam seems to currently incur at least one one (maybe two)
> unnecessary memory copies.
>
> Pickle protocol version 5 exists for solving exactly this problem. You can
> serialize collections of arbitrary Python objects in a fully streaming
> fashion using memory buffers. This is a Python 3.8 feature, but the
> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
> supported by NumPy since version 1.16, released in January 2019.
>
> Cheers,
> Stephan
>

Reply via email to