Beam's PickleCoder would need to be updated to pass the "buffer_callback" argument into pickle.dumps() and the "buffers" argument into pickle.loads(). I expect this would be relatively straightforward.
Then it should "just work", assuming that data is stored in objects (like NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band Pickle protocol. On Tue, May 25, 2021 at 2:50 PM Brian Hulette <bhule...@google.com> wrote: > I'm not aware of anyone looking at it. > > Will out-of-band pickling "just work" in Beam for types that implement the > correct interface in Python 3.8? > > On Tue, May 25, 2021 at 2:43 PM Evan Galpin <evan.gal...@gmail.com> wrote: > >> +1 >> >> FWIW I recently ran into the exact case you described (high serialization >> cost). The solution was to implement some not-so-intuitive alternative >> transforms in my case, but I would have very much appreciated faster >> serialization performance. >> >> Thanks, >> Evan >> >> On Tue, May 25, 2021 at 15:26 Stephan Hoyer <sho...@google.com> wrote: >> >>> Has anyone looked into out of band pickling for Beam's Python SDK, i.e., >>> Pickle protocol version 5? >>> https://www.python.org/dev/peps/pep-0574/ >>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers >>> >>> For Beam pipelines passing around NumPy arrays (or collections of NumPy >>> arrays, like pandas or Xarray) I've noticed that serialization costs can be >>> significant. Beam seems to currently incur at least one one (maybe two) >>> unnecessary memory copies. >>> >>> Pickle protocol version 5 exists for solving exactly this problem. You >>> can serialize collections of arbitrary Python objects in a fully streaming >>> fashion using memory buffers. This is a Python 3.8 feature, but the >>> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been >>> supported by NumPy since version 1.16, released in January 2019. >>> >>> Cheers, >>> Stephan >>> >>