I got to thinking about this again and ran some benchmarks. The result is documented in the GitHub issue [1].
tl;dr: we can't realize a huge benefit since we don't actually have an out-of-band path for exchanging the buffers. However, pickle 5 can yield improved in-band performance as well, and I think we can take advantage of this with some relatively simple adjustments to PickleCoder and OutputStream. [1] https://github.com/apache/beam/issues/20900#issuecomment-1251658001 [2] https://peps.python.org/pep-0574/#improved-in-band-performance On Thu, May 27, 2021 at 5:15 PM Stephan Hoyer <sho...@google.com> wrote: > I'm unlikely to have bandwidth to take this one on, but I do think it > would be quite valuable! > > On Thu, May 27, 2021 at 4:42 PM Brian Hulette <bhule...@google.com> wrote: > >> I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would >> you have any interest in taking it on? >> >> On Tue, May 25, 2021 at 3:09 PM Brian Hulette <bhule...@google.com> >> wrote: >> >>> Hm this would definitely be of interest for the DataFrame API, which is >>> shuffling pandas objects. This issue [1] confirms what you suggested above, >>> that pandas supports out-of-band pickling since DataFrames are mostly just >>> collections of numpy arrays. >>> >>> Brian >>> >>> [1] https://github.com/pandas-dev/pandas/issues/34244 >>> >>> On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer <sho...@google.com> wrote: >>> >>>> Beam's PickleCoder would need to be updated to pass the >>>> "buffer_callback" argument into pickle.dumps() and the "buffers" argument >>>> into pickle.loads(). I expect this would be relatively straightforward. >>>> >>>> Then it should "just work", assuming that data is stored in objects >>>> (like NumPy arrays or wrappers of NumPy arrays) that implement the >>>> out-of-band Pickle protocol. >>>> >>>> >>>> On Tue, May 25, 2021 at 2:50 PM Brian Hulette <bhule...@google.com> >>>> wrote: >>>> >>>>> I'm not aware of anyone looking at it. >>>>> >>>>> Will out-of-band pickling "just work" in Beam for types that implement >>>>> the correct interface in Python 3.8? >>>>> >>>>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin <evan.gal...@gmail.com> >>>>> wrote: >>>>> >>>>>> +1 >>>>>> >>>>>> FWIW I recently ran into the exact case you described (high >>>>>> serialization cost). The solution was to implement some not-so-intuitive >>>>>> alternative transforms in my case, but I would have very much appreciated >>>>>> faster serialization performance. >>>>>> >>>>>> Thanks, >>>>>> Evan >>>>>> >>>>>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer <sho...@google.com> >>>>>> wrote: >>>>>> >>>>>>> Has anyone looked into out of band pickling for Beam's Python SDK, >>>>>>> i.e., Pickle protocol version 5? >>>>>>> https://www.python.org/dev/peps/pep-0574/ >>>>>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers >>>>>>> >>>>>>> For Beam pipelines passing around NumPy arrays (or collections of >>>>>>> NumPy arrays, like pandas or Xarray) I've noticed that serialization >>>>>>> costs >>>>>>> can be significant. Beam seems to currently incur at least one one >>>>>>> (maybe >>>>>>> two) unnecessary memory copies. >>>>>>> >>>>>>> Pickle protocol version 5 exists for solving exactly this problem. >>>>>>> You can serialize collections of arbitrary Python objects in a fully >>>>>>> streaming fashion using memory buffers. This is a Python 3.8 feature, >>>>>>> but >>>>>>> the "pickle5" library provides a backport to Python 3.6 and 3.7. It has >>>>>>> been supported by NumPy since version 1.16, released in January 2019. >>>>>>> >>>>>>> Cheers, >>>>>>> Stephan >>>>>>> >>>>>>