Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-26 Thread Olivier Grisel
+1 for not adding in-pickle compression as it is already very easy to
handle compression externally (for instance by passing a compressing file
object as an argument to the pickler). Furthermore, as PEP 574 makes it
possible to stream the buffer bytes directly to the file-object without any
temporary memory copy I don't see any benefit in including the compression
into the pickle protocol.
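
For readers following along, external compression through a file object can look roughly like the sketch below. It uses only the standard library (gzip rather than LZ4), and the helper names dump_compressed and load_compressed are made up for illustration.

```python
import gzip
import os
import pickle
import tempfile

def dump_compressed(obj, path):
    """Pickle obj straight into a gzip-compressed file."""
    # gzip.open returns a file object; the pickler writes to it like any file
    with gzip.open(path, "wb", compresslevel=1) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_compressed(path):
    """Read back an object written by dump_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

data = {"values": list(range(1000)), "label": "example"}
path = os.path.join(tempfile.mkdtemp(), "data.pkl.gz")
dump_compressed(data, path)
restored = load_compressed(path)
```

Any object with a compatible write() method could stand in for the gzip file, which is what keeps the choice of compressor outside the pickle format.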

However adding lz4.LZ4File to the standard library in addition to
gzip.GzipFile and lzma.LZMAFile is probably a good idea as LZ4 is really
fast compared to zlib/gzip. But this is not related to PEP 574.

-- 
Olivier
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-26 Thread Matthew Rocklin
Hi all,

I agree that compression is often a good idea when moving serialized
objects around on a network, but for what it's worth I as a library author
would always set compress=False and then handle it myself as a separate
step.  There are a few reasons for this:

   1. Bandwidth is often pretty good, especially intra-node, on high
   performance networks, or on decent modern discs (NVMe)
   2. I often use different compression technologies in different
   situations.  LZ4 is a great all-around default, but often snappy, blosc, or
   zstandard are better suited.  This depends strongly on the characteristics
   of the data.
   3. Very often data isn't compressible, or is already in some
   compressed form, such as in images, and so compressing only hurts you.
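
The third point is easy to check with a small experiment. The sketch below uses random bytes as a stand-in for already-compressed data (an assumption made for illustration) and zlib at level 1 as the compressor.

```python
import os
import pickle
import zlib

# A highly redundant payload versus one that looks like compressed data
repetitive = pickle.dumps([0] * 100_000)
random_like = pickle.dumps(os.urandom(100_000))

def ratio(raw):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(raw, 1)) / len(raw)

repetitive_ratio = ratio(repetitive)
random_ratio = ratio(random_like)
print(f"repetitive: {repetitive_ratio:.3f}  random: {random_ratio:.3f}")
```

The random payload comes out at essentially its original size, so the time spent compressing it is pure overhead.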

In general, my thought is that compression is a complex topic with enough
intricacies that setting a single sane default that works 70+% of the time
probably isn't possible (at least not with the applications that I get
exposed to).

Instead of baking a particular method into pickle.dumps I would recommend
trying to solve this problem through documentation, pointing users to the
various compression libraries within the broader Python ecosystem, and
perhaps pointing to one of the many blogposts that discuss their strengths
and weaknesses.

Best,
-matt


Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-26 Thread Stefan Behnel
Antoine Pitrou wrote on 25.05.2018 at 23:11:
> On Fri, 25 May 2018 14:50:57 -0600
> Neil Schemenauer wrote:
>> On 2018-05-25, Antoine Pitrou wrote:
>>> Do you have something specific in mind?  
>>
>> I think compressed by default is a good idea.  My quick proposal:
>>
>> - Use fast compression like lz4 or zlib with Z_BEST_SPEED
>>
>> - Add a 'compress' keyword argument with a default of None.  For
>>   protocol 5, None means to compress.  Providing 'compress' != None
>>   for older protocols will raise an error.
> 
> The question is what purpose does it serve for pickle to do it rather
> than for the user to compress the pickle themselves.  You're basically
> saving one line of code.  Am I missing some other advantage?

Regarding the pickling side, if the pickle is large, then it can save
memory to compress while pickling, rather than compressing after pickling.
But that can also be done with file-like objects, so the advantage is small
here.

I think a major advantage is on the unpickling side rather than the
pickling side. Sure, users can compress a pickle after the fact, but if
there's a (set of) standard algorithms that unpickle can handle
automatically, then it's enough to pass "something pickled" into unpickle,
rather than having to know (or figure out) if and how that pickle was
originally compressed, and build up the decompression pipeline for it to
get everything uncompressed efficiently without accidentally wasting memory
or processing time.
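
To make the "figure out if and how that pickle was compressed" step concrete, here is a minimal sketch of what a user has to write today. It sniffs the magic bytes of the standard library's container formats. The function name loads_any is hypothetical, and the check relies on pickles of protocol 2 and later starting with the byte 0x80, so they cannot be mistaken for any of these formats.

```python
import bz2
import gzip
import lzma
import pickle

# Leading magic bytes of the stdlib container formats
_DECOMPRESSORS = (
    (b"\x1f\x8b", gzip.decompress),      # gzip
    (b"BZh", bz2.decompress),            # bzip2
    (b"\xfd7zXZ\x00", lzma.decompress),  # xz
)

def loads_any(blob):
    """Unpickle blob, transparently undoing gzip, bz2 or xz compression."""
    for magic, decompress in _DECOMPRESSORS:
        if blob.startswith(magic):
            blob = decompress(blob)
            break
    return pickle.loads(blob)

obj = {"a": [1, 2, 3]}
raw = pickle.dumps(obj)
results = [loads_any(b) for b in
           (raw, gzip.compress(raw), bz2.compress(raw), lzma.compress(raw))]
```

This version decompresses the whole payload in memory, so it does not address the memory concern raised above.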

Obviously, auto-decompression opens up a gate for compression bombs, but
then, unpickling data from untrusted sources is discouraged anyway, so...
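
As an aside on compression bombs: when decompression is done as a separate step, the output size can at least be capped. A minimal sketch with zlib follows; the helper and exception names are made up, and this does nothing to make unpickling untrusted data safe.

```python
import pickle
import zlib

class DecompressionLimitExceeded(Exception):
    """Raised when a stream would inflate beyond the allowed size."""

def bounded_decompress(blob, max_size):
    """Inflate a zlib stream, refusing to produce more than max_size bytes."""
    d = zlib.decompressobj()
    # Ask for one byte more than allowed so that an overflow is detectable
    out = d.decompress(blob, max_size + 1)
    if len(out) > max_size:
        raise DecompressionLimitExceeded(f"more than {max_size} bytes")
    return out

blob = zlib.compress(pickle.dumps(list(range(10))))
obj = pickle.loads(bounded_decompress(blob, max_size=1_000_000))
```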

Stefan



Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-25 Thread Nathaniel Smith
On Fri, May 25, 2018 at 3:35 PM, Neil Schemenauer wrote:
> This discussion can easily lead into bikeshedding (e.g. relative
> merits of different compression schemes).  Since I'm not
> volunteering to implement anything, I will stop responding at this
> point. ;-)

I think the bikeshedding -- or more to the point, the fact that
there's a wide variety of options for compressing pickles, and none of
them are appropriate in all circumstances -- means that this is
something that should remain a separate layer.

Even super-fast algorithms like lz4 are inefficient when you're
transmitting pickles between two processes on the same system – they
still add extra memory copies. And that's a very common use case.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-25 Thread Neil Schemenauer
On 2018-05-25, Antoine Pitrou wrote:
> The question is what purpose does it serve for pickle to do it rather
> than for the user to compress the pickle themselves.  You're basically
> saving one line of code.

It's one line of code everywhere pickling or unpickling happens.  And
you probably need to import a compression module, so at least two
lines.  Then maybe you need to figure out if the pickle is
compressed and what kind of compression is used.  So, add a few more
lines.
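
For reference, the extra lines being counted here look roughly like this when zlib with Z_BEST_SPEED is layered on top of pickle. The helper names dumps_z and loads_z are invented for this sketch and are not part of any proposal.

```python
import pickle
import zlib

def dumps_z(obj):
    """Pickle, then compress; Z_BEST_SPEED (level 1) favours speed over ratio."""
    return zlib.compress(pickle.dumps(obj), zlib.Z_BEST_SPEED)

def loads_z(blob):
    """Inverse of dumps_z; the caller must already know blob is zlib data."""
    return pickle.loads(zlib.decompress(blob))

original = {"key": ["some", "values"] * 100}
roundtripped = loads_z(dumps_z(original))
```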

It seems logical to me that users of pickle want it to be fast and
produce small pickles.  Compressing by default seems the right
choice, even though it complicates the implementation.  Ivan brings
up a valid point that compressed pickles are harder to debug.
However, I think that's much less important than being small.

> it requires us to ship the lz4 library with Python

Yeah, that's not so great.  I think zlib with Z_BEST_SPEED would be
fine.  However, some people might worry it is too slow or doesn't
compress enough.  Having lz4 as a battery included seems like a good
idea anyhow.  I understand that it is pretty well established as a
useful compression method.  Obviously requiring a new C library to
be included expands the effort of implementation a lot.

This discussion can easily lead into bikeshedding (e.g. relative
merits of different compression schemes).  Since I'm not
volunteering to implement anything, I will stop responding at this
point. ;-)

Regards,

  Neil


Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-25 Thread Antoine Pitrou
On Fri, 25 May 2018 14:50:57 -0600
Neil Schemenauer wrote:
> On 2018-05-25, Antoine Pitrou wrote:
> > Do you have something specific in mind?  
> 
> I think compressed by default is a good idea.  My quick proposal:
> 
> - Use fast compression like lz4 or zlib with Z_BEST_SPEED
> 
> - Add a 'compress' keyword argument with a default of None.  For
>   protocol 5, None means to compress.  Providing 'compress' != None
>   for older protocols will raise an error.

The question is what purpose does it serve for pickle to do it rather
than for the user to compress the pickle themselves.  You're basically
saving one line of code.  Am I missing some other advantage?

(also note that it requires us to ship the lz4 library with Python, or
another modern compression library such as zstd; zlib's performance
characteristics are outdated)

Regards

Antoine.




Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-25 Thread Neil Schemenauer
On 2018-05-25, Antoine Pitrou wrote:
> Do you have something specific in mind?

I think compressed by default is a good idea.  My quick proposal:

- Use fast compression like lz4 or zlib with Z_BEST_SPEED

- Add a 'compress' keyword argument with a default of None.  For
  protocol 5, None means to compress.  Providing 'compress' != None
  for older protocols will raise an error.

The compression overhead will be small compared to the
pickle/unpickle costs.  If someone wants to apply their own (e.g.
better) compression, they can set compress=False.

An alternative idea is to have two different protocol formats.  E.g.
5 and 6.  One is "pickle 5" with compression, one without
compression.  I don't like that as much since it breaks the idea
that higher protocol numbers are "better".

Regards,

  Neil


Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-25 Thread Ivan Pozdeev via Python-Dev

On 25.05.2018 20:36, Raymond Hettinger wrote:
> > On May 24, 2018, at 10:57 AM, Antoine Pitrou wrote:
> >
> > While PEP 574 (pickle protocol 5 with out-of-band data) is still in
> > draft status, I've made available an implementation in branch "pickle5"
> > in my GitHub fork of CPython:
> > https://github.com/pitrou/cpython/tree/pickle5
> >
> > Also I've published an experimental backport on PyPI, for Python 3.6
> > and 3.7.  This should help people play with the new API and features
> > without having to compile Python:
> > https://pypi.org/project/pickle5/
> >
> > Any feedback is welcome.
>
> Thanks for doing this.
>
> Hope it isn't too late, but I would like to suggest that protocol 5 support
> fast compression by default.  We normally pickle objects so that they can be
> transported (saved to a file or sent over a socket). Transport costs (reading
> and writing a file or socket) are generally proportional to size, so
> compression is likely to be a net win (much as it was for header compression
> in HTTP/2).
>
> The PEP lists compression as a possible refinement only for large objects,
> but I expect it will be a win for most pickles to compress them in their
> entirety.

I would advise against that. Pickle format is unreadable as it is, 
compression will make it literally impossible to diagnose problems.

Python supports transparent compression, e.g. with the 'zlib' codec.
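
A short sketch of the 'zlib' codec route, using the bytes-to-bytes codec exposed through the codecs module (the sample object is arbitrary):

```python
import codecs
import pickle

obj = {"numbers": list(range(100))}

# 'zlib' is a bytes-to-bytes codec, so it goes through codecs.encode/decode
packed = codecs.encode(pickle.dumps(obj), "zlib")
restored = pickle.loads(codecs.decode(packed, "zlib"))
```

The output is an ordinary zlib stream, so zlib.decompress can read it as well.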



> Raymond


--
Regards,
Ivan



Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-25 Thread Antoine Pitrou
On Fri, 25 May 2018 10:36:08 -0700
Raymond Hettinger wrote:
> > On May 24, 2018, at 10:57 AM, Antoine Pitrou wrote:
> > 
> > While PEP 574 (pickle protocol 5 with out-of-band data) is still in
> > draft status, I've made available an implementation in branch "pickle5"
> > in my GitHub fork of CPython:
> > https://github.com/pitrou/cpython/tree/pickle5
> > 
> > Also I've published an experimental backport on PyPI, for Python 3.6
> > and 3.7.  This should help people play with the new API and features
> > without having to compile Python:
> > https://pypi.org/project/pickle5/
> > 
> > Any feedback is welcome.  
> 
> Thanks for doing this.
> 
> Hope it isn't too late, but I would like to suggest that protocol 5 support 
> fast compression by default.  We normally pickle objects so that they can be 
> transported (saved to a file or sent over a socket). Transport costs (reading 
> and writing a file or socket) are generally proportional to size, so 
> compression is likely to be a net win (much as it was for header compression 
> in HTTP/2).
> 
> The PEP lists compression as a possible refinement only for large objects,
> but I expect it will be a win for most pickles to compress them in their
> entirety.

It's not too late (the PEP is still a draft, and there's a lot of time
before 3.8), but I wonder what would be the benefit of making it a part
of the pickle specification, rather than compressing independently.

Whether and how to compress is generally a compromise between
transmission (or storage) speed and computation speed.  Also, there are
specialized compressors for higher efficiency (for example, Blosc has
datatype-specific compression for Numpy arrays).  Such knowledge can be
embodied in domain-specific libraries such as Dask/distributed, but it
cannot really be incorporated in pickle itself.
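
The size-versus-time compromise can be observed with a rough measurement like the sketch below. It is limited to standard-library compressors (Blosc and other specialised libraries are left out), the sample data is arbitrary, and the timings depend entirely on the machine and the payload.

```python
import bz2
import lzma
import pickle
import time
import zlib

def measure(raw):
    """Return {name: (compressed_size, seconds)} for a few stdlib codecs."""
    compressors = {
        "zlib-1": lambda b: zlib.compress(b, 1),
        "zlib-9": lambda b: zlib.compress(b, 9),
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    results = {}
    for name, compress in compressors.items():
        start = time.perf_counter()
        size = len(compress(raw))
        results[name] = (size, time.perf_counter() - start)
    return results

raw = pickle.dumps([("row", i, i * 0.5) for i in range(20_000)])
results = measure(raw)
for name, (size, seconds) in results.items():
    print(f"{name:7s} {size:8d} bytes  {seconds * 1000:7.1f} ms")
```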

Do you have something specific in mind?

Regards

Antoine.


Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-25 Thread Raymond Hettinger


> On May 24, 2018, at 10:57 AM, Antoine Pitrou wrote:
> 
> While PEP 574 (pickle protocol 5 with out-of-band data) is still in
> draft status, I've made available an implementation in branch "pickle5"
> in my GitHub fork of CPython:
> https://github.com/pitrou/cpython/tree/pickle5
> 
> Also I've published an experimental backport on PyPI, for Python 3.6
> and 3.7.  This should help people play with the new API and features
> without having to compile Python:
> https://pypi.org/project/pickle5/
> 
> Any feedback is welcome.

Thanks for doing this.

Hope it isn't too late, but I would like to suggest that protocol 5 support 
fast compression by default.  We normally pickle objects so that they can be 
transported (saved to a file or sent over a socket). Transport costs (reading 
and writing a file or socket) are generally proportional to size, so 
compression is likely to be a net win (much as it was for header compression in 
HTTP/2).

The PEP lists compression as a possible refinement only for large objects,
but I expect it will be a win for most pickles to compress them in their
entirety.


Raymond


Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-25 Thread Olivier Grisel
I tried this implementation to add no-copy pickling for large numpy arrays
and it seems to work as expected (for a simple contiguous array). I took some
notes on the numpy tracker to advertise this PEP to the numpy developers:

https://github.com/numpy/numpy/issues/11161
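
For readers without numpy at hand, the out-of-band mechanism can be sketched with a plain bytearray. This assumes the API as it later shipped in the standard pickle module in Python 3.8 (PickleBuffer, buffer_callback, buffers); the draft and backport discussed in this thread may have differed in details.

```python
import pickle

# Stands in for the memory of a large contiguous array
payload = bytearray(b"x" * 1_000_000)

buffers = []
# With a buffer_callback, the PickleBuffer's contents are handed to the
# callback out-of-band instead of being copied into the pickle stream
stream = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                      buffer_callback=buffers.append)

# The consumer supplies the same buffers, in order, when loading
restored = pickle.loads(stream, buffers=buffers)
print(len(stream), len(buffers))
```

The pickle stream itself stays tiny because the megabyte of data travels separately in the buffers list.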

-- 
Olivier


Re: [Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-24 Thread Victor Stinner
Link to the PEP:

"PEP 574 -- Pickle protocol 5 with out-of-band data"
https://www.python.org/dev/peps/pep-0574/

Victor

2018-05-24 19:57 GMT+02:00 Antoine Pitrou:
>
> Hi,
>
> While PEP 574 (pickle protocol 5 with out-of-band data) is still in
> draft status, I've made available an implementation in branch "pickle5"
> in my GitHub fork of CPython:
> https://github.com/pitrou/cpython/tree/pickle5
>
> Also I've published an experimental backport on PyPI, for Python 3.6
> and 3.7.  This should help people play with the new API and features
> without having to compile Python:
> https://pypi.org/project/pickle5/
>
> Any feedback is welcome.
>
> Regards
>
> Antoine.


[Python-Dev] PEP 574 (pickle 5) implementation and backport available

2018-05-24 Thread Antoine Pitrou

Hi,

While PEP 574 (pickle protocol 5 with out-of-band data) is still in
draft status, I've made available an implementation in branch "pickle5"
in my GitHub fork of CPython:
https://github.com/pitrou/cpython/tree/pickle5

Also I've published an experimental backport on PyPI, for Python 3.6
and 3.7.  This should help people play with the new API and features
without having to compile Python:
https://pypi.org/project/pickle5/

Any feedback is welcome.

Regards

Antoine.

