Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
On Fri, Apr 23, 2010 at 11:11, Dan Gindikin wrote:
> We were having performance problems unpickling a large pickle file, we were
> getting 170s running time (which was fine), but 1100mb memory usage. Memory
> usage ought to have been about 300mb; this was happening because of memory
> fragmentation, due to many unnecessary "puts" in the pickle stream.
>
> We made a pickletools.optimize-inspired tool that could run directly on a
> pickle file and used pickletools.genops. This solved the unpickling problem
> (84s, 382mb).
>
> However, the tool itself was using too much memory and time (1100s, 470mb),
> so I recoded it to scan through the pickle stream directly, without going
> through pickletools.genops, giving (240s, 130mb).
>
> Other people that deal with large pickle files are probably having similar
> problems, and since this comes up when dealing with large data, it is
> precisely in this situation that you probably can't use pickletools.optimize
> or pickletools.genops. It feels like functionality that ought to be added to
> pickletools; is there some way I can contribute this?

The best next step is to open an issue at bugs.python.org and upload the
patch. I can't make any guarantees on when someone will look at it or if it
will get accepted, but putting the code there is your best bet for acceptance.

-Brett

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
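For pickles small enough to hold in memory, the stdlib already ships the kind of PUT-stripping pass the original post describes: pickletools.optimize() removes PUT opcodes that no GET ever references. A quick sketch (protocol 2, illustrative data) of what it buys:

```python
import pickle
import pickletools

# Many distinct sub-lists: each one gets a PUT, but nothing ever GETs them.
data = pickle.dumps([[i] for i in range(1000)], protocol=2)

# pickletools.optimize() drops the PUTs that no GET references.
small = pickletools.optimize(data)

assert pickle.loads(small) == pickle.loads(data)
assert len(small) < len(data)
```

As the thread notes, this helper (and genops, which it is built on) needs the whole stream in memory, which is exactly what becomes impractical for very large pickle files.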
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
On Fri, Apr 23, 2010 at 2:11 PM, Dan Gindikin wrote:
> We were having performance problems unpickling a large pickle file, we were
> getting 170s running time (which was fine), but 1100mb memory usage. Memory
> usage ought to have been about 300mb; this was happening because of memory
> fragmentation, due to many unnecessary "puts" in the pickle stream.
>
> We made a pickletools.optimize-inspired tool that could run directly on a
> pickle file and used pickletools.genops. This solved the unpickling problem
> (84s, 382mb).
>
> However, the tool itself was using too much memory and time (1100s, 470mb),
> so I recoded it to scan through the pickle stream directly, without going
> through pickletools.genops, giving (240s, 130mb).

Collin Winter wrote a simple optimization pass for cPickle in Unladen
Swallow [1]. The code reads through the stream and removes all the
unnecessary PUTs in-place.

[1]: http://code.google.com/p/unladen-swallow/source/browse/trunk/Modules/cPickle.c#735

> Other people that deal with large pickle files are probably having similar
> problems, and since this comes up when dealing with large data, it is
> precisely in this situation that you probably can't use pickletools.optimize
> or pickletools.genops. It feels like functionality that ought to be added to
> pickletools; is there some way I can contribute this?

Just put your code on bugs.python.org and I will take a look.

-- 
Alexandre
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
On Fri, Apr 23, 2010 at 2:38 PM, Alexandre Vassalotti wrote:
> Collin Winter wrote a simple optimization pass for cPickle in Unladen
> Swallow [1]. The code reads through the stream and removes all the
> unnecessary PUTs in-place.

I just noticed the code removes *all* PUT opcodes, regardless of whether they
are needed or not. So, this code can only be used if there is no GET in the
stream (which is unlikely for a large stream). I believe Collin made this
trade-off for performance reasons. However, it wouldn't be hard to make the
current code work like pickletools.optimize().

-- 
Alexandre
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
Alexandre Vassalotti <...@peadrop.com> writes:
> Just put your code on bugs.python.org and I will take a look.

Thanks, I'll put it in there.
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
On Fri, Apr 23, 2010 at 11:49 AM, Alexandre Vassalotti wrote:
> On Fri, Apr 23, 2010 at 2:38 PM, Alexandre Vassalotti wrote:
>> Collin Winter wrote a simple optimization pass for cPickle in Unladen
>> Swallow [1]. The code reads through the stream and removes all the
>> unnecessary PUTs in-place.
>
> I just noticed the code removes *all* PUT opcodes, regardless of whether
> they are needed or not. So, this code can only be used if there is no GET
> in the stream (which is unlikely for a large stream). I believe Collin
> made this trade-off for performance reasons. However, it wouldn't be hard
> to make the current code work like pickletools.optimize().

The optimization pass is only run if you don't use any GETs. The optimization
is also disabled if you're writing to a file-like object. These tradeoffs were
appropriate for the workload I was optimizing against.

Collin Winter
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
On Fri, Apr 23, 2010 at 11:53 AM, Collin Winter wrote:
> On Fri, Apr 23, 2010 at 11:49 AM, Alexandre Vassalotti wrote:
>> On Fri, Apr 23, 2010 at 2:38 PM, Alexandre Vassalotti wrote:
>>> Collin Winter wrote a simple optimization pass for cPickle in Unladen
>>> Swallow [1]. The code reads through the stream and removes all the
>>> unnecessary PUTs in-place.
>>
>> I just noticed the code removes *all* PUT opcodes, regardless of whether
>> they are needed or not. So, this code can only be used if there is no GET
>> in the stream (which is unlikely for a large stream). I believe Collin
>> made this trade-off for performance reasons. However, it wouldn't be hard
>> to make the current code work like pickletools.optimize().
>
> The optimization pass is only run if you don't use any GETs. The
> optimization is also disabled if you're writing to a file-like object.
> These tradeoffs were appropriate for the workload I was optimizing
> against.

I should add that adding the necessary bookkeeping to remove only unused PUTs
(instead of the current all-or-nothing scheme) should not be hard. I'd watch
out for a further performance/memory hit; the pickling benchmarks in the
benchmark suite should help assess this.

The current optimization penalizes pickling to speed up unpickling, which made
sense when optimizing pickles that would go into memcache and be read out
13-15x more often than they were written.

Collin Winter
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
On Fri, Apr 23, 2010 at 3:07 PM, Collin Winter wrote:
> I should add that, adding the necessary bookkeeping to remove only
> unused PUTs (instead of the current all-or-nothing scheme) should not
> be hard. I'd watch out for a further performance/memory hit; the
> pickling benchmarks in the benchmark suite should help assess this.

I was thinking about this too. A simple boolean table could be fast, while
keeping the space requirement down. This scheme would be nice to caches as
well.

> The current optimization penalizes pickling to speed up unpickling,
> which made sense when optimizing pickles that would go into memcache
> and be read out 13-15x more often than they were written.

This is my current impression of how pickle is most often used. Are you aware
of a use case of pickle where you do more writes than reads? I can't think of
any.

-- 
Alexandre
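The bookkeeping discussed above can be sketched with pickletools.genops: one pass records which memo ids any GET fetches, a second pass copies the stream minus the unreferenced PUTs. This is a rough illustration only (helper name is mine), valid for protocols 2 and below, where no FRAME opcodes complicate the byte offsets; it also still holds the whole stream in memory, so it shows the correctness bookkeeping, not a fix for the memory constraint.

```python
import pickle
import pickletools

def strip_unused_puts(data):
    # Pass 1: record every memo id that a GET/BINGET/LONG_BINGET fetches.
    used = set()
    for opcode, arg, pos in pickletools.genops(data):
        if 'GET' in opcode.name:
            used.add(arg)

    # Pass 2: copy the stream, skipping PUT opcodes whose id is never fetched.
    # Each opcode spans from its own offset to the next opcode's offset.
    ops = list(pickletools.genops(data))
    out = []
    for i, (opcode, arg, pos) in enumerate(ops):
        end = ops[i + 1][2] if i + 1 < len(ops) else len(data)
        if 'PUT' in opcode.name and arg not in used:
            continue
        out.append(data[pos:end])
    return b''.join(out)

x = [1, 2]
stream = pickle.dumps([x, x], protocol=2)  # inner list is shared -> one GET
slim = strip_unused_puts(stream)

result = pickle.loads(slim)
assert result == [[1, 2], [1, 2]]
assert result[0] is result[1]      # identity survives: the needed PUT stayed
assert len(slim) < len(stream)     # the outer list's unused PUT is gone
```

Surviving PUTs keep their original memo ids, so the remaining GETs stay valid without renumbering; a file-based variant would stream the second pass instead of accumulating `out`.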
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
Collin Winter <...@google.com> writes:
> I should add that, adding the necessary bookkeeping to remove only
> unused PUTs (instead of the current all-or-nothing scheme) should not
> be hard. I'd watch out for a further performance/memory hit; the
> pickling benchmarks in the benchmark suite should help assess this.
> The current optimization penalizes pickling to speed up unpickling,
> which made sense when optimizing pickles that would go into memcache
> and be read out 13-15x more often than they were written.

This wouldn't help our use case: your code needs the entire pickle stream to
be in memory, which in our case would be about 475mb, on top of the 300mb+
data structures that generated the pickle stream.
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
On Fri, Apr 23, 2010 at 3:57 PM, Dan Gindikin wrote:
> This wouldn't help our use case, your code needs the entire pickle
> stream to be in memory, which in our case would be about 475mb, this
> is on top of the 300mb+ data structures that generated the pickle
> stream.

In that case, the best we could do is a two-pass algorithm to remove the
unused PUTs. That won't be efficient, but it will satisfy the memory
constraint. Another solution is to not generate the PUTs at all by setting
the 'fast' attribute on Pickler. But that won't work if you have a recursive
structure, or have code that requires the identity of objects to be preserved.

>>> import io, pickle, pickletools
>>> x = [1, 2]
>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, protocol=-1)
>>> p.dump([x, x])
>>> pickletools.dis(f.getvalue())
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: ]        EMPTY_LIST
    7: q        BINPUT     1
    9: (        MARK
   10: K            BININT1    1
   12: K            BININT1    2
   14: e            APPENDS    (MARK at 9)
   15: h        BINGET     1
   17: e        APPENDS    (MARK at 5)
   18: .    STOP
highest protocol among opcodes = 2
>>> [id(x) for x in pickle.loads(f.getvalue())]
[20966504, 20966504]

Now with the 'fast' mode enabled:

>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, protocol=-1)
>>> p.fast = True
>>> p.dump([x, x])
>>> pickletools.dis(f.getvalue())
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: (    MARK
    4: ]        EMPTY_LIST
    5: (        MARK
    6: K            BININT1    1
    8: K            BININT1    2
   10: e            APPENDS    (MARK at 5)
   11: ]        EMPTY_LIST
   12: (        MARK
   13: K            BININT1    1
   15: K            BININT1    2
   17: e            APPENDS    (MARK at 12)
   18: e        APPENDS    (MARK at 3)
   19: .    STOP
highest protocol among opcodes = 2
>>> [id(x) for x in pickle.loads(f.getvalue())]
[20966504, 21917992]

As you can observe, the pickle stream generated with the fast mode might
actually be bigger. By the way, it is weird that the total memory usage of
the data structure is smaller than the size of its respective pickle stream.
What pickle protocol are you using?

-- 
Alexandre
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
Alexandre Vassalotti <...@peadrop.com> writes:
> On Fri, Apr 23, 2010 at 3:57 PM, Dan Gindikin <...@gmail.com> wrote:
>> This wouldn't help our use case, your code needs the entire pickle
>> stream to be in memory, which in our case would be about 475mb, this
>> is on top of the 300mb+ data structures that generated the pickle
>> stream.
>
> In that case, the best we could do is a two-pass algorithm to remove
> the unused PUTs. That won't be efficient, but it will satisfy the
> memory constraint.

That is what I'm doing for us right now.

> Another solution is to not generate the PUTs at all by setting the 'fast'
> attribute on Pickler. But that won't work if you have a recursive
> structure, or have code that requires the identity of objects to be
> preserved.

We definitely have some cross links amongst the objects, so we need PUTs.

> By the way, it is weird that the total memory usage of the data
> structure is smaller than the size of its respective pickle stream.
> What pickle protocol are you using?

It's the highest protocol, but we have a bunch of extension types that get
expanded into Python tuples for pickling.
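A hypothetical illustration of why such types can make the stream bigger than the live objects: a type (Point here is a made-up stand-in for a C extension type) whose __reduce__ expands every instance into a class reference plus an argument tuple in the stream.

```python
import pickle

class Point:
    # Stand-in for an extension type; the name and fields are illustrative.
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __reduce__(self):
        # The stream stores (callable, args) per instance -- often larger
        # than the packed in-memory layout of an extension object.
        return (Point, (self.x, self.y))

data = pickle.dumps([Point(i, i + 1) for i in range(3)], protocol=2)
points = pickle.loads(data)
assert [(p.x, p.y) for p in points] == [(0, 1), (1, 2), (2, 3)]
```

Each reconstructed instance is also freshly allocated during unpickling, which is where the interleaving with PUT-generated temporaries described later in the thread comes from.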
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
On Fri, Apr 23, 2010 at 1:53 PM, Alexandre Vassalotti wrote:
> On Fri, Apr 23, 2010 at 3:57 PM, Dan Gindikin wrote:
>> This wouldn't help our use case, your code needs the entire pickle
>> stream to be in memory, which in our case would be about 475mb, this
>> is on top of the 300mb+ data structures that generated the pickle
>> stream.
>
> In that case, the best we could do is a two-pass algorithm to remove
> the unused PUTs. That won't be efficient, but it will satisfy the
> memory constraint. Another solution is to not generate the PUTs at all
> by setting the 'fast' attribute on Pickler. But that won't work if you
> have a recursive structure, or have code that requires the identity of
> objects to be preserved.

I don't think it's possible in general to remove any PUTs if the pickle is
being written to a file-like object. It is possible to reuse a single Pickler
to pickle multiple objects: this causes the Pickler's memo dict to be shared
between the objects being pickled. If you pickle foo, bar, and baz, foo may
not have any GETs, but bar and baz may have GETs that reference data added to
the memo by foo's PUT operations. Because you can't know what will be written
to the file-like object later, you can't remove any of the PUT instructions
in this scenario.

This kind of thing is done in real-world code like cvs2svn (which I broke
when I was optimizing cPickle; don't break cvs2svn, it's not fun to fix :). I
added some basic tests for this support in cPython's Lib/test/pickletester.py.

There might be room for app-specific optimizations that do this, but I'm not
sure it would work for a general-usage cPickle that needs to stay compatible
with the current system.

Collin Winter
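The memo-sharing hazard described above can be seen directly by reusing one Pickler for two dumps (protocol 2; the buffer and set names are mine): the second stream GETs a memo id whose PUT lives in the first stream.

```python
import io
import pickle
import pickletools

buf = io.BytesIO()
p = pickle.Pickler(buf, protocol=2)
shared = {'k': 1}

p.dump(shared)               # PUTs `shared` into the pickler's memo
first = buf.getvalue()

p.dump([shared, shared])     # emits GETs against that earlier memo entry
second = buf.getvalue()[len(first):]

names_first = {op.name for op, arg, pos in pickletools.genops(first)}
names_second = {op.name for op, arg, pos in pickletools.genops(second)}

# The PUTs in the first stream look "unused" when it is examined alone...
assert 'BINGET' not in names_first
# ...but the second stream fetches one of them, so stripping the first
# stream's PUTs would corrupt anything dumped after it.
assert 'BINGET' in names_second
```

This is why an optimizer that sees only one pickle at a time cannot safely drop PUTs when the underlying file may receive further dumps from the same Pickler.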
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
Collin Winter <...@google.com> writes:
> I don't think it's possible in general to remove any PUTs if the
> pickle is being written to a file-like object.

Does cPickle bytecode have some kind of NOP instruction? You could keep track
of which PUTs weren't necessary and zero them out at the end. It would be
much cheaper than writing a whole other "optimized" stream.

(Of course, it only works on a seekable stream :-))

Regards

Antoine.
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
Collin Winter <...@google.com> writes:
> I don't think it's possible in general to remove any PUTs if the
> pickle is being written to a file-like object. It is possible to reuse
> a single Pickler to pickle multiple objects: this causes the Pickler's
> memo dict to be shared between the objects being pickled. If you
> pickle foo, bar, and baz, foo may not have any GETs, but bar and baz
> may have GETs that reference data added to the memo by foo's PUT
> operations. Because you can't know what will be written to the
> file-like object later, you can't remove any of the PUT instructions
> in this scenario.

Hmm, that is a good point. A possible solution would be for the two-pass
optimizer to scan through the entire file, going right through '.' (STOP)
opcodes. That would deal with the case you are describing, but not if the
user "maliciously" wrote some other stuff into the file in between pickle
dumps, all the while reusing the same pickler. I think a better solution
would be to make sure that the '.' is the last thing in the file and die
otherwise. This would at least ensure correctness and detection of cases
that this thing could not handle.

> don't break cvs2svn, it's not fun to fix :). I added some basic tests for
> this support in cPython's Lib/test/pickletester.py.

Thanks for the warning :)

> There might be room for app-specific optimizations that do this, but
> I'm not sure it would work for a general-usage cPickle that needs to
> stay compatible with the current system.

That may well be true. Still, when trying to deal with large data you really
need something like this. Our situation was made worse because we had
extension types. As they were allocated, they got interspersed with
temporaries generated by the spurious PUTs, and that is what really
fragmented the memory. However, it's probably not a stretch to assume that if
you are dealing with large stuff through Python, you are going to have
extension types in the mix.
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
Antoine Pitrou <...@pitrou.net> writes:
> Does cPickle bytecode have some kind of NOP instruction?
> You could keep track of which PUTs weren't necessary and zero them out at
> the end. It would be much cheaper than writing a whole other "optimized"
> stream.

For a large file, I'm not sure it is much faster to edit it in place than to
rewrite it, and also since it's a large file, you are probably going to have
it compressed, in which case you are out of luck anyway.

> (of course it only works on a seekable stream :-))

Indeed... since I was dealing with zipped files, I had to pass in a "seek0"
func, like so:

file_in_seek0_func = lambda x: os.popen('zcat ' + file_in)
Re: [Python-Dev] Unpickling memory usage problem, and a proposed solution
Dan Gindikin <...@gmail.com> writes:
> Antoine Pitrou <...@pitrou.net> writes:
>> Does cPickle bytecode have some kind of NOP instruction?
>> You could keep track of which PUTs weren't necessary and zero them out at
>> the end. It would be much cheaper than writing a whole other "optimized"
>> stream.
>
> For a large file, I'm not sure it is much faster to edit it in place
> than to rewrite it, and also since it's a large file, you are probably
> going to have it compressed, in which case you are out of luck anyway.

Depends whether you really care about disk occupation or not. Disk space is
often cheap (much more so than RAM). Also, I'm quite sure overwriting a
couple of blocks in a file is cheaper than rewriting it entirely. Of course,
if you must overwrite every other block, it might not be true anymore :-)

Regards

Antoine.