Re: Out of band pickling in Python (pickle5)

2022-09-19 Thread Brian Hulette via dev
I got to thinking about this again and ran some benchmarks. The result is
documented in the GitHub issue [1].

tl;dr: we can't realize a huge benefit since we don't actually have an
out-of-band path for exchanging the buffers. However, pickle 5 can yield
improved in-band performance as well, and I think we can take advantage of
this with some relatively simple adjustments to PickleCoder and
OutputStream.

[1] https://github.com/apache/beam/issues/20900#issuecomment-1251658001
[2] https://peps.python.org/pep-0574/#improved-in-band-performance

On Thu, May 27, 2021 at 5:15 PM Stephan Hoyer  wrote:

> I'm unlikely to have bandwidth to take this one on, but I do think it
> would be quite valuable!
>
> On Thu, May 27, 2021 at 4:42 PM Brian Hulette  wrote:
>
>> I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would
>> you have any interest in taking it on?
>>
>> On Tue, May 25, 2021 at 3:09 PM Brian Hulette 
>> wrote:
>>
>>> Hm this would definitely be of interest for the DataFrame API, which is
>>> shuffling pandas objects. This issue [1] confirms what you suggested above,
>>> that pandas supports out-of-band pickling since DataFrames are mostly just
>>> collections of numpy arrays.
>>>
>>> Brian
>>>
>>> [1] https://github.com/pandas-dev/pandas/issues/34244
>>>
>>> On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer  wrote:
>>>
 Beam's PickleCoder would need to be updated to pass the
 "buffer_callback" argument into pickle.dumps() and the "buffers" argument
 into pickle.loads(). I expect this would be relatively straightforward.

 Then it should "just work", assuming that data is stored in objects
 (like NumPy arrays or wrappers of NumPy arrays) that implement the
 out-of-band Pickle protocol.


 On Tue, May 25, 2021 at 2:50 PM Brian Hulette 
 wrote:

> I'm not aware of anyone looking at it.
>
> Will out-of-band pickling "just work" in Beam for types that implement
> the correct interface in Python 3.8?
>
> On Tue, May 25, 2021 at 2:43 PM Evan Galpin 
> wrote:
>
>> +1
>>
>> FWIW I recently ran into the exact case you described (high
>> serialization cost). The solution was to implement some not-so-intuitive
>> alternative transforms in my case, but I would have very much appreciated
>> faster serialization performance.
>>
>> Thanks,
>> Evan
>>
>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer 
>> wrote:
>>
>>> Has anyone looked into out of band pickling for Beam's Python SDK,
>>> i.e., Pickle protocol version 5?
>>> https://www.python.org/dev/peps/pep-0574/
>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>
>>> For Beam pipelines passing around NumPy arrays (or collections of
>>> NumPy arrays, like pandas or Xarray) I've noticed that serialization 
>>> costs
>>> can be significant. Beam seems to currently incur at least one one 
>>> (maybe
>>> two) unnecessary memory copies.
>>>
>>> Pickle protocol version 5 exists for solving exactly this problem.
>>> You can serialize collections of arbitrary Python objects in a fully
>>> streaming fashion using memory buffers. This is a Python 3.8 feature, 
>>> but
>>> the "pickle5" library provides a backport to Python 3.6 and 3.7. It has
>>> been supported by NumPy since version 1.16, released in January 2019.
>>>
>>> Cheers,
>>> Stephan
>>>
>>


Re: Out of band pickling in Python (pickle5)

2021-05-27 Thread Stephan Hoyer
I'm unlikely to have bandwidth to take this one on, but I do think it would
be quite valuable!

On Thu, May 27, 2021 at 4:42 PM Brian Hulette  wrote:

> I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would
> you have any interest in taking it on?
>
> On Tue, May 25, 2021 at 3:09 PM Brian Hulette  wrote:
>
>> Hm this would definitely be of interest for the DataFrame API, which is
>> shuffling pandas objects. This issue [1] confirms what you suggested above,
>> that pandas supports out-of-band pickling since DataFrames are mostly just
>> collections of numpy arrays.
>>
>> Brian
>>
>> [1] https://github.com/pandas-dev/pandas/issues/34244
>>
>> On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer  wrote:
>>
>>> Beam's PickleCoder would need to be updated to pass the
>>> "buffer_callback" argument into pickle.dumps() and the "buffers" argument
>>> into pickle.loads(). I expect this would be relatively straightforward.
>>>
>>> Then it should "just work", assuming that data is stored in objects
>>> (like NumPy arrays or wrappers of NumPy arrays) that implement the
>>> out-of-band Pickle protocol.
>>>
>>>
>>> On Tue, May 25, 2021 at 2:50 PM Brian Hulette 
>>> wrote:
>>>
 I'm not aware of anyone looking at it.

 Will out-of-band pickling "just work" in Beam for types that implement
 the correct interface in Python 3.8?

 On Tue, May 25, 2021 at 2:43 PM Evan Galpin 
 wrote:

> +1
>
> FWIW I recently ran into the exact case you described (high
> serialization cost). The solution was to implement some not-so-intuitive
> alternative transforms in my case, but I would have very much appreciated
> faster serialization performance.
>
> Thanks,
> Evan
>
> On Tue, May 25, 2021 at 15:26 Stephan Hoyer  wrote:
>
>> Has anyone looked into out of band pickling for Beam's Python SDK,
>> i.e., Pickle protocol version 5?
>> https://www.python.org/dev/peps/pep-0574/
>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>
>> For Beam pipelines passing around NumPy arrays (or collections of
>> NumPy arrays, like pandas or Xarray) I've noticed that serialization 
>> costs
>> can be significant. Beam seems to currently incur at least one one (maybe
>> two) unnecessary memory copies.
>>
>> Pickle protocol version 5 exists for solving exactly this problem.
>> You can serialize collections of arbitrary Python objects in a fully
>> streaming fashion using memory buffers. This is a Python 3.8 feature, but
>> the "pickle5" library provides a backport to Python 3.6 and 3.7. It has
>> been supported by NumPy since version 1.16, released in January 2019.
>>
>> Cheers,
>> Stephan
>>
>


Re: Out of band pickling in Python (pickle5)

2021-05-27 Thread Brian Hulette
I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would
you have any interest in taking it on?

On Tue, May 25, 2021 at 3:09 PM Brian Hulette  wrote:

> Hm this would definitely be of interest for the DataFrame API, which is
> shuffling pandas objects. This issue [1] confirms what you suggested above,
> that pandas supports out-of-band pickling since DataFrames are mostly just
> collections of numpy arrays.
>
> Brian
>
> [1] https://github.com/pandas-dev/pandas/issues/34244
>
> On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer  wrote:
>
>> Beam's PickleCoder would need to be updated to pass the "buffer_callback"
>> argument into pickle.dumps() and the "buffers" argument into
>> pickle.loads(). I expect this would be relatively straightforward.
>>
>> Then it should "just work", assuming that data is stored in objects (like
>> NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band
>> Pickle protocol.
>>
>>
>> On Tue, May 25, 2021 at 2:50 PM Brian Hulette 
>> wrote:
>>
>>> I'm not aware of anyone looking at it.
>>>
>>> Will out-of-band pickling "just work" in Beam for types that implement
>>> the correct interface in Python 3.8?
>>>
>>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin 
>>> wrote:
>>>
 +1

 FWIW I recently ran into the exact case you described (high
 serialization cost). The solution was to implement some not-so-intuitive
 alternative transforms in my case, but I would have very much appreciated
 faster serialization performance.

 Thanks,
 Evan

 On Tue, May 25, 2021 at 15:26 Stephan Hoyer  wrote:

> Has anyone looked into out of band pickling for Beam's Python SDK,
> i.e., Pickle protocol version 5?
> https://www.python.org/dev/peps/pep-0574/
> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>
> For Beam pipelines passing around NumPy arrays (or collections of
> NumPy arrays, like pandas or Xarray) I've noticed that serialization costs
> can be significant. Beam seems to currently incur at least one one (maybe
> two) unnecessary memory copies.
>
> Pickle protocol version 5 exists for solving exactly this problem. You
> can serialize collections of arbitrary Python objects in a fully streaming
> fashion using memory buffers. This is a Python 3.8 feature, but the
> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
> supported by NumPy since version 1.16, released in January 2019.
>
> Cheers,
> Stephan
>



Re: Out of band pickling in Python (pickle5)

2021-05-25 Thread Brian Hulette
Hm this would definitely be of interest for the DataFrame API, which is
shuffling pandas objects. This issue [1] confirms what you suggested above,
that pandas supports out-of-band pickling since DataFrames are mostly just
collections of numpy arrays.

Brian

[1] https://github.com/pandas-dev/pandas/issues/34244

On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer  wrote:

> Beam's PickleCoder would need to be updated to pass the "buffer_callback"
> argument into pickle.dumps() and the "buffers" argument into
> pickle.loads(). I expect this would be relatively straightforward.
>
> Then it should "just work", assuming that data is stored in objects (like
> NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band
> Pickle protocol.
>
>
> On Tue, May 25, 2021 at 2:50 PM Brian Hulette  wrote:
>
>> I'm not aware of anyone looking at it.
>>
>> Will out-of-band pickling "just work" in Beam for types that implement
>> the correct interface in Python 3.8?
>>
>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin 
>> wrote:
>>
>>> +1
>>>
>>> FWIW I recently ran into the exact case you described (high
>>> serialization cost). The solution was to implement some not-so-intuitive
>>> alternative transforms in my case, but I would have very much appreciated
>>> faster serialization performance.
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer  wrote:
>>>
 Has anyone looked into out of band pickling for Beam's Python SDK,
 i.e., Pickle protocol version 5?
 https://www.python.org/dev/peps/pep-0574/
 https://docs.python.org/3/library/pickle.html#out-of-band-buffers

 For Beam pipelines passing around NumPy arrays (or collections of NumPy
 arrays, like pandas or Xarray) I've noticed that serialization costs can be
 significant. Beam seems to currently incur at least one one (maybe two)
 unnecessary memory copies.

 Pickle protocol version 5 exists for solving exactly this problem. You
 can serialize collections of arbitrary Python objects in a fully streaming
 fashion using memory buffers. This is a Python 3.8 feature, but the
 "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
 supported by NumPy since version 1.16, released in January 2019.

 Cheers,
 Stephan

>>>


Re: Out of band pickling in Python (pickle5)

2021-05-25 Thread Stephan Hoyer
Beam's PickleCoder would need to be updated to pass the "buffer_callback"
argument into pickle.dumps() and the "buffers" argument into
pickle.loads(). I expect this would be relatively straightforward.

Then it should "just work", assuming that data is stored in objects (like
NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band
Pickle protocol.


On Tue, May 25, 2021 at 2:50 PM Brian Hulette  wrote:

> I'm not aware of anyone looking at it.
>
> Will out-of-band pickling "just work" in Beam for types that implement the
> correct interface in Python 3.8?
>
> On Tue, May 25, 2021 at 2:43 PM Evan Galpin  wrote:
>
>> +1
>>
>> FWIW I recently ran into the exact case you described (high serialization
>> cost). The solution was to implement some not-so-intuitive alternative
>> transforms in my case, but I would have very much appreciated faster
>> serialization performance.
>>
>> Thanks,
>> Evan
>>
>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer  wrote:
>>
>>> Has anyone looked into out of band pickling for Beam's Python SDK, i.e.,
>>> Pickle protocol version 5?
>>> https://www.python.org/dev/peps/pep-0574/
>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>
>>> For Beam pipelines passing around NumPy arrays (or collections of NumPy
>>> arrays, like pandas or Xarray) I've noticed that serialization costs can be
>>> significant. Beam seems to currently incur at least one one (maybe two)
>>> unnecessary memory copies.
>>>
>>> Pickle protocol version 5 exists for solving exactly this problem. You
>>> can serialize collections of arbitrary Python objects in a fully streaming
>>> fashion using memory buffers. This is a Python 3.8 feature, but the
>>> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
>>> supported by NumPy since version 1.16, released in January 2019.
>>>
>>> Cheers,
>>> Stephan
>>>
>>


Re: Out of band pickling in Python (pickle5)

2021-05-25 Thread Brian Hulette
I'm not aware of anyone looking at it.

Will out-of-band pickling "just work" in Beam for types that implement the
correct interface in Python 3.8?

On Tue, May 25, 2021 at 2:43 PM Evan Galpin  wrote:

> +1
>
> FWIW I recently ran into the exact case you described (high serialization
> cost). The solution was to implement some not-so-intuitive alternative
> transforms in my case, but I would have very much appreciated faster
> serialization performance.
>
> Thanks,
> Evan
>
> On Tue, May 25, 2021 at 15:26 Stephan Hoyer  wrote:
>
>> Has anyone looked into out of band pickling for Beam's Python SDK, i.e.,
>> Pickle protocol version 5?
>> https://www.python.org/dev/peps/pep-0574/
>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>
>> For Beam pipelines passing around NumPy arrays (or collections of NumPy
>> arrays, like pandas or Xarray) I've noticed that serialization costs can be
>> significant. Beam seems to currently incur at least one one (maybe two)
>> unnecessary memory copies.
>>
>> Pickle protocol version 5 exists for solving exactly this problem. You
>> can serialize collections of arbitrary Python objects in a fully streaming
>> fashion using memory buffers. This is a Python 3.8 feature, but the
>> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
>> supported by NumPy since version 1.16, released in January 2019.
>>
>> Cheers,
>> Stephan
>>
>


Re: Out of band pickling in Python (pickle5)

2021-05-25 Thread Evan Galpin
+1

FWIW I recently ran into the exact case you described (high serialization
cost). The solution was to implement some not-so-intuitive alternative
transforms in my case, but I would have very much appreciated faster
serialization performance.

Thanks,
Evan

On Tue, May 25, 2021 at 15:26 Stephan Hoyer  wrote:

> Has anyone looked into out of band pickling for Beam's Python SDK, i.e.,
> Pickle protocol version 5?
> https://www.python.org/dev/peps/pep-0574/
> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>
> For Beam pipelines passing around NumPy arrays (or collections of NumPy
> arrays, like pandas or Xarray) I've noticed that serialization costs can be
> significant. Beam seems to currently incur at least one one (maybe two)
> unnecessary memory copies.
>
> Pickle protocol version 5 exists for solving exactly this problem. You can
> serialize collections of arbitrary Python objects in a fully streaming
> fashion using memory buffers. This is a Python 3.8 feature, but the
> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
> supported by NumPy since version 1.16, released in January 2019.
>
> Cheers,
> Stephan
>


Out of band pickling in Python (pickle5)

2021-05-25 Thread Stephan Hoyer
Has anyone looked into out of band pickling for Beam's Python SDK, i.e.,
Pickle protocol version 5?
https://www.python.org/dev/peps/pep-0574/
https://docs.python.org/3/library/pickle.html#out-of-band-buffers

For Beam pipelines passing around NumPy arrays (or collections of NumPy
arrays, like pandas or Xarray) I've noticed that serialization costs can be
significant. Beam seems to currently incur at least one one (maybe two)
unnecessary memory copies.

Pickle protocol version 5 exists for solving exactly this problem. You can
serialize collections of arbitrary Python objects in a fully streaming
fashion using memory buffers. This is a Python 3.8 feature, but the
"pickle5" library provides a backport to Python 3.6 and 3.7. It has been
supported by NumPy since version 1.16, released in January 2019.

Cheers,
Stephan