Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Brett Cannon
On Fri, Apr 23, 2010 at 11:11, Dan Gindikin  wrote:

> We were having performance problems unpickling a large pickle file, we were
> getting 170s running time (which was fine), but 1100mb memory usage. Memory
> usage ought to have been about 300mb, this was happening because of memory
> fragmentation, due to many unnecessary "puts" in the pickle stream.
>
> We made a pickletools.optimize inspired tool that could run directly on a
> pickle file and used pickletools.genops. This solved the unpickling problem
> (84s, 382mb).
>
> However the tool itself was using too much memory and time (1100s, 470mb),
> so
> I recoded it to scan through the pickle stream directly, without going
> through
> pickletools.genops, giving (240s, 130mb).
>
> Other people that deal with large pickle files are probably having similar
> problems, and since this comes up when dealing with large data it is
> precisely
> in this situation that you probably can't use pickletools.optimize or
> pickletools.genops. It feels like functionality that ought to be added to
> pickletools, is there some way I can contribute this?
>

The best next step is to open an issue at bugs.python.org and upload the
patch. I can't make any guarantees on when someone will look at it or if it
will get accepted, but putting the code there is your best bet for
acceptance.

-Brett
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Alexandre Vassalotti
On Fri, Apr 23, 2010 at 2:11 PM, Dan Gindikin  wrote:
> We were having performance problems unpickling a large pickle file, we were
> getting 170s running time (which was fine), but 1100mb memory usage. Memory
> usage ought to have been about 300mb, this was happening because of memory
> fragmentation, due to many unnecessary "puts" in the pickle stream.
>
> We made a pickletools.optimize inspired tool that could run directly on a
> pickle file and used pickletools.genops. This solved the unpickling problem
> (84s, 382mb).
>
> However the tool itself was using too much memory and time (1100s, 470mb), so
> I recoded it to scan through the pickle stream directly, without going through
> pickletools.genops, giving (240s, 130mb).
>

Collin Winter wrote a simple optimization pass for cPickle in Unladen
Swallow [1]. The code reads through the stream and remove all the
unnecessary PUTs in-place.

[1]: 
http://code.google.com/p/unladen-swallow/source/browse/trunk/Modules/cPickle.c#735

> Other people that deal with large pickle files are probably having similar
> problems, and since this comes up when dealing with large data it is precisely
> in this situation that you probably can't use pickletools.optimize or
> pickletools.genops. It feels like functionality that ought to be added to
> pickletools, is there some way I can contribute this?
>

Just put your code on bugs.python.org and I will take a look.

-- Alexandre
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Alexandre Vassalotti
On Fri, Apr 23, 2010 at 2:38 PM, Alexandre Vassalotti
 wrote:
> Collin Winter wrote a simple optimization pass for cPickle in Unladen
> Swallow [1]. The code reads through the stream and remove all the
> unnecessary PUTs in-place.
>

I just noticed the code removes *all* PUT opcodes, regardless if they
are needed or not. So, this code can only be used if there's no GET in
the stream (which is unlikely for a large stream). I believe Collin
made this trade-off for performance reasons. However, it wouldn't be
hard to make the current code to work like pickletools.optimize().

-- Alexandre
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Dan Gindikin
Alexandre Vassalotti  peadrop.com> writes:

> Just put your code on bugs.python.org and I will take a look.
> 

Thanks, I'll put it in there.




___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Collin Winter
On Fri, Apr 23, 2010 at 11:49 AM, Alexandre Vassalotti
 wrote:
> On Fri, Apr 23, 2010 at 2:38 PM, Alexandre Vassalotti
>  wrote:
>> Collin Winter wrote a simple optimization pass for cPickle in Unladen
>> Swallow [1]. The code reads through the stream and remove all the
>> unnecessary PUTs in-place.
>>
>
> I just noticed the code removes *all* PUT opcodes, regardless if they
> are needed or not. So, this code can only be used if there's no GET in
> the stream (which is unlikely for a large stream). I believe Collin
> made this trade-off for performance reasons. However, it wouldn't be
> hard to make the current code to work like pickletools.optimize().

The optimization pass is only run if you don't use any GETs. The
optimization is also disabled if you're writing to a file-like object.
These tradeoffs were appropriate for the workload I was optimizing
against.

Collin Winter
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Collin Winter
On Fri, Apr 23, 2010 at 11:53 AM, Collin Winter  wrote:
> On Fri, Apr 23, 2010 at 11:49 AM, Alexandre Vassalotti
>  wrote:
>> On Fri, Apr 23, 2010 at 2:38 PM, Alexandre Vassalotti
>>  wrote:
>>> Collin Winter wrote a simple optimization pass for cPickle in Unladen
>>> Swallow [1]. The code reads through the stream and remove all the
>>> unnecessary PUTs in-place.
>>>
>>
>> I just noticed the code removes *all* PUT opcodes, regardless if they
>> are needed or not. So, this code can only be used if there's no GET in
>> the stream (which is unlikely for a large stream). I believe Collin
>> made this trade-off for performance reasons. However, it wouldn't be
>> hard to make the current code to work like pickletools.optimize().
>
> The optimization pass is only run if you don't use any GETs. The
> optimization is also disabled if you're writing to a file-like object.
> These tradeoffs were appropriate for the workload I was optimizing
> against.

I should add that, adding the necessary bookkeeping to remove only
unused PUTs (instead of the current all-or-nothing scheme) should not
be hard. I'd watch out for a further performance/memory hit; the
pickling benchmarks in the benchmark suite should help assess this.
The current optimization penalizes pickling to speed up unpickling,
which made sense when optimizing pickles that would go into memcache
and be read out 13-15x more often than they were written.

Collin Winter
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Alexandre Vassalotti
On Fri, Apr 23, 2010 at 3:07 PM, Collin Winter  wrote:
> I should add that, adding the necessary bookkeeping to remove only
> unused PUTs (instead of the current all-or-nothing scheme) should not
> be hard. I'd watch out for a further performance/memory hit; the
> pickling benchmarks in the benchmark suite should help assess this.

I was thinking about this too. A simple boolean table could be fast,
while keeping the space requirement down. This scheme would be nice to
caches as well.

> The current optimization penalizes pickling to speed up unpickling,
> which made sense when optimizing pickles that would go into memcache
> and be read out 13-15x more often than they were written.

This is my current impression of how pickle is most often used. Are
you aware of a use case of pickle where you do more writes than reads?
I can't think of any.

-- Alexandre
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Dan Gindikin
Collin Winter  google.com> writes:
> I should add that, adding the necessary bookkeeping to remove only
> unused PUTs (instead of the current all-or-nothing scheme) should not
> be hard. I'd watch out for a further performance/memory hit; the
> pickling benchmarks in the benchmark suite should help assess this.
> The current optimization penalizes pickling to speed up unpickling,
> which made sense when optimizing pickles that would go into memcache
> and be read out 13-15x more often than they were written.

This wouldn't help our use case, your code needs the entire pickle
stream to be in memory, which in our case would be about 475mb, this
is on top of the 300mb+ data structures that generated the pickle
stream.



___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Alexandre Vassalotti
On Fri, Apr 23, 2010 at 3:57 PM, Dan Gindikin  wrote:
> This wouldn't help our use case, your code needs the entire pickle
> stream to be in memory, which in our case would be about 475mb, this
> is on top of the 300mb+ data structures that generated the pickle
> stream.
>

In that case, the best we could do is a two-pass algorithm to remove
the unused PUTs. That won't be efficient, but it will satisfy the
memory constraint. Another solution is to not generate the PUTs at all
by setting the 'fast' attribute on Pickler. But that won't work if you
have a recursive structure, or have code that requires that the
identity of objects to be preserved.

>>> import io, pickle
>>> x=[1,2]
>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, protocol=-1)
>>> p.dump([x,x])
>>> pickletools.dis(f.getvalue())
0: \x80 PROTO  2
2: ]EMPTY_LIST
3: qBINPUT 0
5: (MARK
6: ]EMPTY_LIST
7: qBINPUT 1
9: (MARK
   10: KBININT11
   12: KBININT12
   14: eAPPENDS(MARK at 9)
   15: hBINGET 1
   17: eAPPENDS(MARK at 5)
   18: .STOP
highest protocol among opcodes = 2
>>> [id(x) for x in pickle.loads(f.getvalue())]
[20966504, 20966504]

Now with the 'fast' mode enabled:

>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, protocol=-1)
>>> p.fast = True
>>> p.dump([x,x])
>>> pickletools.dis(f.getvalue())
0: \x80 PROTO  2
2: ]EMPTY_LIST
3: (MARK
4: ]EMPTY_LIST
5: (MARK
6: KBININT11
8: KBININT12
   10: eAPPENDS(MARK at 5)
   11: ]EMPTY_LIST
   12: (MARK
   13: KBININT11
   15: KBININT12
   17: eAPPENDS(MARK at 12)
   18: eAPPENDS(MARK at 3)
   19: .STOP
highest protocol among opcodes = 2
>>> [id(x) for x in pickle.loads(f.getvalue())]
[20966504, 21917992]

As you can observe, the pickle stream generated with the fast mode
might actually be bigger.

By the way, it is weird that the total memory usage of the data
structure is smaller than the size of its respective pickle stream.
What pickle protocol are you using?

-- Alexandre
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Dan Gindikin
Alexandre Vassalotti  peadrop.com> writes:

> 
> On Fri, Apr 23, 2010 at 3:57 PM, Dan Gindikin  gmail.com> 
> wrote:
> > This wouldn't help our use case, your code needs the entire pickle
> > stream to be in memory, which in our case would be about 475mb, this
> > is on top of the 300mb+ data structures that generated the pickle
> > stream.
> >
> 
> In that case, the best we could do is a two-pass algorithm to remove
> the unused PUTs. That won't be efficient, but it will satisfy the
> memory constraint.

That is for what I'm doing for us right now.

> Another solution is to not generate the PUTs at all
> by setting the 'fast' attribute on Pickler. But that won't work if you
> have a recursive structure, or have code that requires that the
> identity of objects to be preserved.

We definitely have some cross links amongst the objects, so we need PUTs.

> By the way, it is weird that the total memory usage of the data
> structure is smaller than the size of its respective pickle stream.
> What pickle protocol are you using?

Its highest protocol, but we have a bunch of extension types that
get expanded into python tuples for pickling.




___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Collin Winter
On Fri, Apr 23, 2010 at 1:53 PM, Alexandre Vassalotti
 wrote:
> On Fri, Apr 23, 2010 at 3:57 PM, Dan Gindikin  wrote:
>> This wouldn't help our use case, your code needs the entire pickle
>> stream to be in memory, which in our case would be about 475mb, this
>> is on top of the 300mb+ data structures that generated the pickle
>> stream.
>>
>
> In that case, the best we could do is a two-pass algorithm to remove
> the unused PUTs. That won't be efficient, but it will satisfy the
> memory constraint. Another solution is to not generate the PUTs at all
> by setting the 'fast' attribute on Pickler. But that won't work if you
> have a recursive structure, or have code that requires that the
> identity of objects to be preserved.

I don't think it's possible in general to remove any PUTs if the
pickle is being written to a file-like object. It is possible to reuse
a single Pickler to pickle multiple objects: this causes the Pickler's
memo dict to be shared between the objects being pickled. If you
pickle foo, bar, and baz, foo may not have any GETs, but bar and baz
may have GETs that reference data added to the memo by foo's PUT
operations. Because you can't know what will be written to the
file-like object later, you can't remove any of the PUT instructions
in this scenario.

This kind of thing is done in real-world code like cvs2svn (which I
broke when I was optimizing cPickle; don't break cvs2svn, it's not fun
to fix :). I added some basic tests for this support in cPython's
Lib/test/pickletester.py.

There might be room for app-specific optimizations that do this, but
I'm not sure it would work for a general-usage cPickle that needs to
stay compatible with the current system.

Collin Winter
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Antoine Pitrou
Collin Winter  google.com> writes:
> 
> I don't think it's possible in general to remove any PUTs if the
> pickle is being written to a file-like object.

Does cPickle bytecode have some kind of NOP instruction?
You could keep track of which PUTs weren't necessary and zero them out at the
end. It would be much cheaper than writing a whole other "optimized" stream.

(of course it only works on a seekable stream :-))

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Dan Gindikin
Collin Winter  google.com> writes:
> I don't think it's possible in general to remove any PUTs if the
> pickle is being written to a file-like object. It is possible to reuse
> a single Pickler to pickle multiple objects: this causes the Pickler's
> memo dict to be shared between the objects being pickled. If you
> pickle foo, bar, and baz, foo may not have any GETs, but bar and baz
> may have GETs that reference data added to the memo by foo's PUT
> operations. Because you can't know what will be written to the
> file-like object later, you can't remove any of the PUT instructions
> in this scenario.

Hmm, that is a good point. A possible solution would be for the
two-pass optimizer to scan through the entire file, going right
through '.' opcodes. That would deal with the case you are
describing, but not if the user "maliciously" wrote some other
stuff into the file in between pickle dumps, all the while reusing
the same pickler.

I think a better solution would be to make sure that the '.' is
the last thing in the file and die otherwise. This would at least
ensure correctness and detection of cases that this thing could
not handle.

> don't break cvs2svn, it's not fun
> to fix :). I added some basic tests for this support in cPython's
> Lib/test/pickletester.py.

Thanks for the warning :)

> There might be room for app-specific optimizations that do this, but
> I'm not sure it would work for a general-usage cPickle that needs to
> stay compatible with the current system.

That may well be true. Still, when trying to deal with large data
you really need something like this. Our situation was made worse because
we had a extension types. As they were allocated they got interspersed
with temporaries generated by the spurious PUTs, and that is what
really fragmented the memory. However its probably not a stretch to
assume that if you are dealing with large stuff through python you are
going to have extension types in the mix.




___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Dan Gindikin
Antoine Pitrou  pitrou.net> writes:
> Does cPickle bytecode have some kind of NOP instruction?
> You could keep track of which PUTs weren't necessary and zero them out at the
> end. It would be much cheaper than writing a whole other "optimized" stream.

For a large file, I'm not sure it is much faster to edit it in place
than to rewrite it, and also since it's a large file, you are going
to probably have it compressed, in which case you are out of luck anyway.

> (of course it only works on a seekable stream )

Indeed... since I was dealing with zipped files, I had to pass in
a "seek0" func, like so:

   file_in_seek0_func = lambda x: os.popen('zcat ' + file_in)



___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution

2010-04-23 Thread Antoine Pitrou
Dan Gindikin  gmail.com> writes:
> 
> Antoine Pitrou  pitrou.net> writes:
> > Does cPickle bytecode have some kind of NOP instruction?
> > You could keep track of which PUTs weren't necessary and zero them out at 
> > the
> > end. It would be much cheaper than writing a whole other "optimized" stream.
> 
> For a large file, I'm not sure it is much faster to edit it in place
> than to rewrite it, and also since it's a large file, you are going
> to probably have it compressed, in which case you are out of luck anyway.

Depends whether you really care about disk occupation or not. Disk space is
often cheap (much more so than RAM).
Also, I'm quite sure overwriting a couple of blocks in a file is cheaper than
rewriting it entirely. Of course, if you must overwrite every other block, it
might not be true anymore :-)

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com