On 17 Jul., 10:01, Terry Carroll <[EMAIL PROTECTED]> wrote:
> I am trying to do something with a very large tarfile from within
> Python, and am running into memory constraints.  The tarfile in
> question is a 4-gigabyte datafile from freedb.org,
> http://ftp.freedb.org/pub/freedb/, and has about 2.5 million members
> in it.
>
> Here's a simple toy program that just goes through and counts the
> number of members in the tarfile, printing a status message every N
> records (N=10,000 for the smaller file; N=100,000 for the larger).
>
> I'm finding that memory usage goes through the roof, simply iterating
> over the tarfile.  I'm using over 2G when I'm barely halfway through
> the file.  This surprises me; I'd expect the memory associated with
> each iteration to be released at the end of the iteration, but
> something's obviously building up.
>
> On one system, this ends with a MemoryError exception.  On another
> system, it just hangs, bringing the system to its knees, to the point
> that it takes a minute or so to do simple task switching.
>
> Any suggestions for processing this beast?  I suppose I could just
> untar the file and process 2.5 million individual files, but I'd
> rather process it directly if that's possible.
>
> Here's the toy code.  (One explanation about the "import tarfilex as
> tarfile" statement.  I'm running ActiveState Python 2.5.0, and the
> tarfile.py module of that vintage was buggy, to the point that it
> couldn't read these files at all.  I brought down the most recent
> tarfile.py from http://svn.python.org/view/python/trunk/Lib/tarfile.py
> and saved it as tarfilex.py.  It works, at least until I start
> processing some very large files, anyway.)
> import tarfilex as tarfile
> import os, time
>
> SOURCEDIR = "F:/Installs/FreeDB/"
> smallfile = "freedb-update-20080601-20080708.tar"  # 63M file
> smallint = 10000
> bigfile = "freedb-complete-20080708.tar"  # 4,329M file
> bigint = 100000
>
> TARFILENAME, INTERVAL = smallfile, smallint
> # TARFILENAME, INTERVAL = bigfile, bigint
>
> def filetype(filename):
>     return os.path.splitext(filename)[1]
>
> def memusage(units="M"):
>     import win32process
>     current_process = win32process.GetCurrentProcess()
>     memory_info = win32process.GetProcessMemoryInfo(current_process)
>     bytes = 1
>     Kbytes = 1024*bytes
>     Mbytes = 1024*Kbytes
>     Gbytes = 1024*Mbytes
>     unitfactors = {'B':1, 'K':Kbytes, 'M':Mbytes, 'G':Gbytes}
>     return memory_info["WorkingSetSize"]//unitfactors[units]
>
> def opentar(filename):
>     modes = {".tar":"r", ".gz":"r:gz", ".bz2":"r:bz2"}
>     openmode = modes[filetype(filename)]
>     openedfile = tarfile.open(filename, openmode)
>     return openedfile
>
> TFPATH = SOURCEDIR + '/' + TARFILENAME
> assert os.path.exists(TFPATH)
> assert tarfile.is_tarfile(TFPATH)
>
> tf = opentar(TFPATH)
> count = 0
> print "%s memory: %sM count: %s (starting)" % (time.asctime(),
>     memusage(), count)
> for tarinfo in tf:
>     count += 1
>     if count % INTERVAL == 0:
>         print "%s memory: %sM count: %s" % (time.asctime(),
>             memusage(), count)
> print "%s memory: %sM count: %s (completed)" % (time.asctime(),
>     memusage(), count)
>
> Results with the smaller (63M) file:
>
> Thu Jul 17 00:18:21 2008 memory: 4M count: 0 (starting)
> Thu Jul 17 00:18:23 2008 memory: 18M count: 10000
> Thu Jul 17 00:18:26 2008 memory: 32M count: 20000
> Thu Jul 17 00:18:28 2008 memory: 46M count: 30000
> Thu Jul 17 00:18:30 2008 memory: 55M count: 36128 (completed)
>
> Results with the larger (4.3G) file:
>
> Thu Jul 17 00:18:47 2008 memory: 4M count: 0 (starting)
> Thu Jul 17 00:19:40 2008 memory: 146M count: 100000
> Thu Jul 17 00:20:41 2008 memory: 289M count: 200000
> Thu Jul 17 00:21:41 2008 memory: 432M count: 300000
> Thu Jul 17 00:22:42 2008 memory: 574M count: 400000
> Thu Jul 17 00:23:47 2008 memory: 717M count: 500000
> Thu Jul 17 00:24:49 2008 memory: 860M count: 600000
> Thu Jul 17 00:25:51 2008 memory: 1002M count: 700000
> Thu Jul 17 00:26:54 2008 memory: 1145M count: 800000
> Thu Jul 17 00:27:59 2008 memory: 1288M count: 900000
> Thu Jul 17 00:29:03 2008 memory: 1430M count: 1000000
> Thu Jul 17 00:30:07 2008 memory: 1573M count: 1100000
> Thu Jul 17 00:31:11 2008 memory: 1716M count: 1200000
> Thu Jul 17 00:32:15 2008 memory: 1859M count: 1300000
> Thu Jul 17 00:33:23 2008 memory: 2001M count: 1400000
>
> Traceback (most recent call last):
>   File "C:\test\freedb\tardemo.py", line 40, in <module>
>     for tarinfo in tf:
>   File "C:\test\freedb\tarfilex.py", line 2406, in next
>     tarinfo = self.tarfile.next()
>   File "C:\test\freedb\tarfilex.py", line 2311, in next
>     tarinfo = self.tarinfo.fromtarfile(self)
>   File "C:\test\freedb\tarfilex.py", line 1235, in fromtarfile
>     obj = cls.frombuf(buf)
>   File "C:\test\freedb\tarfilex.py", line 1193, in frombuf
>     if chksum not in calc_chksums(buf):
>   File "C:\test\freedb\tarfilex.py", line 261, in calc_chksums
>     unsigned_chksum = 256 + sum(struct.unpack("148B", buf[:148]) +
>         struct.unpack("356B", buf[156:512]))
> MemoryError
I had a look at tarfile.py in my current Python 2.5 installation's lib
path.  The iterator caches every TarInfo object it reads in the list
tf.members, which is why memory grows with every member you step over.
If you only want to iterate, and don't need the functionality that
depends on that cache, you can set "tf.members = []" inside your loop.
This is a dirty hack!

Greetings, Uwe
--
http://mail.python.org/mailman/listinfo/python-list
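For concreteness, the hack Uwe describes would look roughly like this
in the counting loop from the original post (a sketch only; it relies
on tf.members being a plain list, which is an implementation detail of
tarfile.py, and "count_members" is just an illustrative name, not part
of either post):

```python
import tarfile

def count_members(path):
    # Count tar members without letting the TarInfo cache grow.
    tf = tarfile.open(path, "r")
    count = 0
    for tarinfo in tf:
        count += 1
        # Dirty hack: drop the cached TarInfo objects so they can be
        # garbage-collected.  Sequential reading still works because
        # the TarFile tracks its file offset separately.
        tf.members = []
    tf.close()
    return count
```

Note that after clearing the cache like this, methods that depend on
it, such as tf.getmembers() or tf.getnames(), will no longer see the
members you already iterated past.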