Beam's PickleCoder would need to be updated to pass the "buffer_callback"
argument into pickle.dumps() and the "buffers" argument into
pickle.loads(). I expect this would be relatively straightforward.

Then it should "just work", assuming that data is stored in objects (like
NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band
Pickle protocol.


On Tue, May 25, 2021 at 2:50 PM Brian Hulette <bhule...@google.com> wrote:

> I'm not aware of anyone looking at it.
>
> Will out-of-band pickling "just work" in Beam for types that implement the
> correct interface in Python 3.8?
>
> On Tue, May 25, 2021 at 2:43 PM Evan Galpin <evan.gal...@gmail.com> wrote:
>
>> +1
>>
>> FWIW I recently ran into the exact case you described (high serialization
>> cost). The solution was to implement some not-so-intuitive alternative
>> transforms in my case, but I would have very much appreciated faster
>> serialization performance.
>>
>> Thanks,
>> Evan
>>
>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer <sho...@google.com> wrote:
>>
>>> Has anyone looked into out of band pickling for Beam's Python SDK, i.e.,
>>> Pickle protocol version 5?
>>> https://www.python.org/dev/peps/pep-0574/
>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>
>>> For Beam pipelines passing around NumPy arrays (or collections of NumPy
>>> arrays, like pandas or Xarray) I've noticed that serialization costs can be
>>> significant. Beam seems to currently incur at least one one (maybe two)
>>> unnecessary memory copies.
>>>
>>> Pickle protocol version 5 exists for solving exactly this problem. You
>>> can serialize collections of arbitrary Python objects in a fully streaming
>>> fashion using memory buffers. This is a Python 3.8 feature, but the
>>> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
>>> supported by NumPy since version 1.16, released in January 2019.
>>>
>>> Cheers,
>>> Stephan
>>>
>>

Reply via email to