Re: Out of band pickling in Python (pickle5)
I got to thinking about this again and ran some benchmarks. The results are documented in the GitHub issue [1].

tl;dr: we can't realize a huge benefit, since we don't actually have an out-of-band path for exchanging the buffers. However, pickle 5 can yield improved in-band performance as well [2], and I think we can take advantage of this with some relatively simple adjustments to PickleCoder and OutputStream.

[1] https://github.com/apache/beam/issues/20900#issuecomment-1251658001
[2] https://peps.python.org/pep-0574/#improved-in-band-performance

On Thu, May 27, 2021 at 5:15 PM Stephan Hoyer wrote:
> I'm unlikely to have bandwidth to take this one on, but I do think it
> would be quite valuable!
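To illustrate the point above about lacking an out-of-band path: when everything has to travel over one byte stream, the buffers collected by "buffer_callback" can still be appended to that stream with length prefixes. The sketch below is only an assumption about how that could look. The names encode_framed and decode_framed and the 8-byte length prefix are hypothetical, and this is not Beam's PickleCoder or OutputStream, nor necessarily the change proposed in the issue.

```python
import pickle
import struct
from typing import Any

# 8-byte little-endian length prefix (an arbitrary choice for this sketch).
_LEN = struct.Struct("<Q")


def encode_framed(value: Any) -> bytes:
    """Pickle with protocol 5, then append the collected buffers to one stream."""
    buffers = []
    payload = pickle.dumps(value, protocol=5, buffer_callback=buffers.append)
    parts = [_LEN.pack(len(buffers)), _LEN.pack(len(payload)), payload]
    for buf in buffers:
        view = buf.raw()  # flat, byte-oriented memoryview of the buffer
        parts.append(_LEN.pack(view.nbytes))
        parts.append(view)
    # bytes.join accepts any bytes-like object, so the views are copied once here.
    return b"".join(parts)


def decode_framed(data: bytes) -> Any:
    """Split the stream back into the pickle payload and its buffers."""
    view = memoryview(data)
    offset = 0

    def read_len() -> int:
        nonlocal offset
        (n,) = _LEN.unpack_from(view, offset)
        offset += _LEN.size
        return n

    def read_chunk(n: int) -> memoryview:
        nonlocal offset
        chunk = view[offset:offset + n]  # slicing a memoryview does not copy
        offset += n
        return chunk

    n_buffers = read_len()
    payload = read_chunk(read_len())
    buffers = [read_chunk(read_len()) for _ in range(n_buffers)]
    return pickle.loads(payload, buffers=buffers)


# Usage: a raw PickleBuffer stands in for a NumPy array to keep this stdlib-only.
value = {"name": "frame", "data": pickle.PickleBuffer(bytearray(range(256)))}
roundtrip = decode_framed(encode_framed(value))
assert bytes(roundtrip["data"]) == bytes(range(256))
```

On the decode side the buffers are slices of the incoming bytes rather than copies, which is where a single-stream design could still save work.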
Re: Out of band pickling in Python (pickle5)
I'm unlikely to have bandwidth to take this one on, but I do think it would be quite valuable!

On Thu, May 27, 2021 at 4:42 PM Brian Hulette wrote:
> I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would
> you have any interest in taking it on?
Re: Out of band pickling in Python (pickle5)
I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would you have any interest in taking it on?

On Tue, May 25, 2021 at 3:09 PM Brian Hulette wrote:
> Hm this would definitely be of interest for the DataFrame API, which is
> shuffling pandas objects. This issue [1] confirms what you suggested above,
> that pandas supports out-of-band pickling since DataFrames are mostly just
> collections of numpy arrays.
>
> Brian
>
> [1] https://github.com/pandas-dev/pandas/issues/34244
Re: Out of band pickling in Python (pickle5)
Hm, this would definitely be of interest for the DataFrame API, which is shuffling pandas objects. This issue [1] confirms what you suggested above: pandas supports out-of-band pickling, since DataFrames are mostly just collections of NumPy arrays.

Brian

[1] https://github.com/pandas-dev/pandas/issues/34244

On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer wrote:
> Beam's PickleCoder would need to be updated to pass the "buffer_callback"
> argument into pickle.dumps() and the "buffers" argument into
> pickle.loads(). I expect this would be relatively straightforward.
>
> Then it should "just work", assuming that data is stored in objects (like
> NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band
> Pickle protocol.
Re: Out of band pickling in Python (pickle5)
Beam's PickleCoder would need to be updated to pass the "buffer_callback" argument into pickle.dumps() and the "buffers" argument into pickle.loads(). I expect this would be relatively straightforward.

Then it should "just work", assuming that data is stored in objects (like NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band Pickle protocol.

On Tue, May 25, 2021 at 2:50 PM Brian Hulette wrote:
> I'm not aware of anyone looking at it.
>
> Will out-of-band pickling "just work" in Beam for types that implement the
> correct interface in Python 3.8?
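A minimal sketch of the change described above, assuming a coder whose encode and decode can hand buffers around next to the pickle payload. OutOfBandPickleCoder is a hypothetical stand-in written for illustration and is not Beam's actual PickleCoder. Only pickle.dumps(..., buffer_callback=...) and pickle.loads(..., buffers=...) are real standard-library calls.

```python
import pickle
from typing import Any, Iterable, List, Tuple


class OutOfBandPickleCoder:
    """Hypothetical coder for illustration; not Beam's PickleCoder."""

    def encode(self, value: Any) -> Tuple[bytes, List[pickle.PickleBuffer]]:
        buffers: List[pickle.PickleBuffer] = []
        # Each buffer-providing object is passed to the callback instead of
        # being copied into the pickle stream.
        payload = pickle.dumps(value, protocol=5, buffer_callback=buffers.append)
        return payload, buffers

    def decode(self, payload: bytes, buffers: Iterable[Any]) -> Any:
        # The buffers must be supplied in the same order they were collected.
        return pickle.loads(payload, buffers=buffers)


# Usage: a raw PickleBuffer stands in for a NumPy array so the example
# needs nothing beyond the standard library.
coder = OutOfBandPickleCoder()
value = {"id": 7, "samples": pickle.PickleBuffer(bytearray(4096))}
payload, buffers = coder.encode(value)
decoded = coder.decode(payload, buffers)

print(len(payload) < 4096)  # the 4096 data bytes are not in the payload
print(len(buffers))         # one buffer was collected
```

The open design question is how the buffers travel between encode and decode, since a coder interface that returns a single bytes object has no obvious place for them.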
Re: Out of band pickling in Python (pickle5)
I'm not aware of anyone looking at it.

Will out-of-band pickling "just work" in Beam for types that implement the correct interface in Python 3.8?

On Tue, May 25, 2021 at 2:43 PM Evan Galpin wrote:
> +1
>
> FWIW I recently ran into the exact case you described (high serialization
> cost). The solution was to implement some not-so-intuitive alternative
> transforms in my case, but I would have very much appreciated faster
> serialization performance.
>
> Thanks,
> Evan
Re: Out of band pickling in Python (pickle5)
+1

FWIW I recently ran into the exact case you described (high serialization cost). The solution was to implement some not-so-intuitive alternative transforms in my case, but I would have very much appreciated faster serialization performance.

Thanks,
Evan

On Tue, May 25, 2021 at 15:26 Stephan Hoyer wrote:
> Has anyone looked into out of band pickling for Beam's Python SDK, i.e.,
> Pickle protocol version 5?
> https://www.python.org/dev/peps/pep-0574/
> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>
> For Beam pipelines passing around NumPy arrays (or collections of NumPy
> arrays, like pandas or Xarray) I've noticed that serialization costs can be
> significant. Beam seems to currently incur at least one (maybe two)
> unnecessary memory copies.
>
> Pickle protocol version 5 exists for solving exactly this problem. You can
> serialize collections of arbitrary Python objects in a fully streaming
> fashion using memory buffers. This is a Python 3.8 feature, but the
> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
> supported by NumPy since version 1.16, released in January 2019.
>
> Cheers,
> Stephan
Out of band pickling in Python (pickle5)
Has anyone looked into out-of-band pickling for Beam's Python SDK, i.e., Pickle protocol version 5?
https://www.python.org/dev/peps/pep-0574/
https://docs.python.org/3/library/pickle.html#out-of-band-buffers

For Beam pipelines passing around NumPy arrays (or collections of NumPy arrays, like pandas or Xarray), I've noticed that serialization costs can be significant. Beam currently seems to incur at least one (maybe two) unnecessary memory copies.

Pickle protocol version 5 exists to solve exactly this problem. You can serialize collections of arbitrary Python objects in a fully streaming fashion using memory buffers. This is a Python 3.8 feature, but the "pickle5" library provides a backport to Python 3.6 and 3.7. It has been supported by NumPy since version 1.16, released in January 2019.

Cheers,
Stephan
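For readers unfamiliar with protocol 5, here is a small standard-library illustration of what "out-of-band" means, adapted from the example in the pickle documentation linked above. The class name ZeroCopyBytes is made up for this sketch. NumPy arrays implement the same mechanism themselves, so NumPy is not needed to see the effect.

```python
import pickle


class ZeroCopyBytes(bytearray):
    """Illustrative bytearray subclass that opts in to out-of-band pickling."""

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Hand the pickler a view of our memory instead of a copy.
            return type(self)._rebuild, (pickle.PickleBuffer(self),), None
        # PickleBuffer is not allowed with protocols <= 4, so copy instead.
        return type(self)._rebuild, (bytearray(self),)

    @classmethod
    def _rebuild(cls, obj):
        with memoryview(obj) as m:
            original = m.obj  # the object that actually owns the memory
            # Reuse the original object when it is already the right type.
            return original if type(original) is cls else cls(original)


data = ZeroCopyBytes(b"x" * 1_000_000)

# In-band: the megabyte of data is written into the pickle stream.
in_band = pickle.dumps(data, protocol=5)

# Out-of-band: the stream only holds a placeholder; the data stays in `buffers`.
buffers = []
out_of_band = pickle.dumps(data, protocol=5, buffer_callback=buffers.append)

restored = pickle.loads(out_of_band, buffers=buffers)

print(len(in_band) > 1_000_000)  # the in-band stream contains the data
print(len(out_of_band) < 1_000)  # the out-of-band stream is tiny
print(restored is data)          # no copy was made in this same-process round trip
```

The same-process round trip is the best case. Across processes the buffers still have to be transmitted somehow, but the pickler itself no longer copies them.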