Hello List,

I am working with very large binary files (created via cPickle), and I
stumbled across an unexpected (to me) performance difference between the
two approaches I use to load those files:

1. Simply use cPickle.load(fid)
2. Read the file as binary using file.read() and then use cPickle.loads on the
   resulting string

In the snippet below, the MakePickle function is a dummy function that
generates a relatively big binary file with cPickle (WARNING: around 3 GB) in
the current directory. I am using NumPy arrays to make the file big but my
original data structure is much more complicated, and things like HDF5 or
databases are currently not an option - I'd like to stay with pickles.

The ReadPickle function simply uses cPickle.load(fid) on the opened binary
file, and on my PC it takes about 2.3 seconds (approach 1).

The ReadPlusLoads function reads the file using file.read() and then uses
cPickle.loads on the resulting string (approach 2). On my PC, the file.read()
call takes 15 seconds (!!!) and cPickle.loads only 1.5 seconds.

What baffles me is the time it takes to read the file using file.read(): is
there any way to slurp it all in one go (somehow) into a string ready for
cPickle.loads without that much overhead?

Note that all of this has been done on Windows 7 64-bit with Python 2.7 64-bit,
on a machine with 16 cores and 100 GB of RAM (so memory should not be a problem).

Thank you in advance for all suggestions :-)

Andrea.

# Begin code
import os, sys
import time
import cPickle
import numpy

class Dummy(object):
    def __init__(self, name):
        self.name = name
        # 200*600*10 float64 values, roughly 9.6 MB per object
        self.data = numpy.random.rand(200, 600, 10)

def MakePickle():
    # Writes dummy.pkl (around 3 GB) in the current directory
    num_objects = 300
    list_of_objects = []
    for index in xrange(num_objects):
        dummy = Dummy('dummy_%d'%index)
        list_of_objects.append(dummy)
    fid = open('dummy.pkl', 'wb')
    start = time.time()
    out = cPickle.dumps(list_of_objects, cPickle.HIGHEST_PROTOCOL)
    end = time.time()
    print 'cPickle.dumps time:', end-start
    start = end
    fid.write(out)
    end = time.time()
    print 'file.write time:', end-start
    fid.close()

def ReadPickle():
    # Approach 1: let cPickle read directly from the file object
    fid = open('dummy.pkl', 'rb')
    start = time.time()
    out = cPickle.load(fid)
    end = time.time()
    print 'cPickle.load time:', end-start
    fid.close()

def ReadPlusLoads():
    # Approach 2: read the whole file into a string, then unpickle the string
    start = time.time()
    fid = open('dummy.pkl', 'rb')
    strs = fid.read()
    fid.close()
    end = time.time()
    print 'file.read time:', end-start
    start = end
    out = cPickle.loads(strs)
    end = time.time()
    print 'cPickle.loads time:', end-start

if __name__ == '__main__':
    # MakePickle() has to be run once beforehand to create dummy.pkl
    ReadPickle()
    ReadPlusLoads()
# End code
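
A variation on ReadPlusLoads that could be timed against the single
file.read() call is to read the file in fixed-size chunks and join them at
the end. This is only a sketch: the function name read_in_chunks and the
64 MB default chunk size are made up for illustration, and nothing here has
been measured, so it may or may not be faster on the setup described above.

```python
def read_in_chunks(path, chunk_size=64 * 1024 * 1024):
    """Return the whole file as one byte string, reading chunk_size bytes
    at a time. Both the name and the default chunk size are hypothetical."""
    chunks = []
    with open(path, 'rb') as fid:
        while True:
            chunk = fid.read(chunk_size)
            if not chunk:
                # An empty result means end of file
                break
            chunks.append(chunk)
    # One join at the end instead of growing a single string piece by piece
    return b''.join(chunks)

# Intended use in place of fid.read() in ReadPlusLoads (not timed here):
#     strs = read_in_chunks('dummy.pkl')
#     out = cPickle.loads(strs)
```

The result is an ordinary byte string, so it can be passed to cPickle.loads
exactly like the output of file.read().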
--
https://mail.python.org/mailman/listinfo/python-list