I'm a bit surprised you aren't beyond the 2gb limit, just with the
structures you describe for the file.  You do realize that each object
has quite a few bytes of overhead, so it's not surprising to use several
times the size of a file, to store the file in an organized way.
I did some back-of-the-envelope calcs which more or less agreed with heapy. The code stores one string, about 50 chars on average, and one MD5 hex string per line of code. There's about 40 bytes of overhead per string, per sys.getsizeof(). I'm also storing an int (24 bytes) and a <10 char string in an object with __slots__ set. Each object, per heapy (this is one area where I might be underestimating things), takes 64 bytes plus instance variable storage, so per line:

50 + 32 + 10 + 3*40 + 24 + 64 = 300 bytes per line; 300 bytes * 2M lines = ~600MB, plus some memory for the dicts, which is about what heapy is reporting (note that I'm not currently running all 2M lines; I'm just running subsets for my tests).
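
In code, the estimate looks roughly like this (a sketch only - the Record class and the sample values are illustrative stand-ins, not my actual code):

import sys

class Record(object):
    # Illustrative stand-in for my real __slots__ class.
    __slots__ = ('md5', 'count', 'tag')
    def __init__(self, md5, count, tag):
        self.md5 = md5
        self.count = count
        self.tag = tag

line_str = 'x' * 50                 # the ~50 char string stored per line
md5_hex = '0123456789abcdef' * 2    # a 32 char MD5 hex digest
tag = 'sometag'                     # the <10 char string
rec = Record(md5_hex, 1, tag)

print(sys.getsizeof(line_str))      # ~50 chars + ~40 bytes of str overhead
print(sys.getsizeof(md5_hex))       # ~32 + ~40
print(sys.getsizeof(tag))           # ~10 + ~40
print(sys.getsizeof(1))             # 24 bytes for a small int on my build
print(sys.getsizeof(rec))           # the slotted instance itself, excluding
                                    # the strings it points to

Summing those per-line numbers is where the ~300 bytes/line figure above comes from.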

Is there something I'm missing? Here's the heapy output after loading ~300k lines:

Partition of a set of 1199849 objects. Total size = 89965376 bytes.
 Index   Count    %      Size    %  Cumulative    %  Kind
     0  599999   50  38399920   43    38399920   43  str
     1       5    0  25167224   28    63567144   71  dict
     2  299998   25  19199872   21    82767016   92  0xa13330
     3  299836   25   7196064    8    89963080  100  int
     4       4    0      1152    0    89964232  100  collections.defaultdict

Note that 3 of the dicts are empty. I assume that 0xa13330 is the address of the object. I'd actually expect to see 900k strings, but the <10 char string is always the same in this case, so perhaps the runtime is using the same object...? At this point, top reports python as using 1.1 GB of virtual memory and 1.0 GB resident.
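
A quick way to check that guess, and to force the sharing explicitly if it isn't happening (sketch only; 'status_ok' is a made-up stand-in for my repeated <10 char string):

import sys

a = 'status_ok'               # stand-in for the repeated <10 char string
prefix = 'status_'
b = prefix + 'ok'             # equal text, but built at runtime

print(a is b)                 # False: two separate str objects

# intern() maps equal strings onto one shared object, so every record
# holding the interned value points at the same str in memory.
try:
    intern_func = intern      # Python 2: a builtin
except NameError:
    intern_func = sys.intern  # Python 3: moved into sys

a = intern_func(a)
b = intern_func(b)
print(a is b)                 # True: only one copy is kept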

I also
wonder if heapy has been written to take into account the larger size of
pointers in a 64bit build.
That I don't know, but that would only explain, at most, a 2x increase in memory over the heapy report, wouldn't it? Not the ~10x I'm seeing.
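
For what it's worth, this is how I'd confirm the build width; sys.getsizeof already reports sizes for the running interpreter, so the estimate above should already reflect 64-bit pointers (just a sketch):

import struct
import sys

print(struct.calcsize('P') * 8)  # pointer width in bits: 64 on a 64-bit build
print(sys.maxsize > 2**32)       # True on a 64-bit build

# sys.getsizeof reports sizes for the running interpreter, so the numbers
# in my estimate above already include 8-byte pointers; a 32-bit assumption
# inside heapy could at most roughly double its totals, not explain ~10x.
print(sys.getsizeof(()))         # e.g. an empty tuple is larger on 64-bit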

Another thing is to make sure
that the md5 object used in your two maps is the same object, and not
just one with the same value.
That's certainly the way the code is written, and heapy seems to confirm that the strings aren't duplicated in memory.
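
The pattern in the code is roughly this (a simplified sketch - the map names and the function are made up, but the point is that each hexdigest string is created once and that same object goes into both maps):

import hashlib

md5_to_lines = {}   # hypothetical stand-ins for my two maps
file_to_md5s = {}

def add_line(filename, line):
    # One hexdigest string per line; the same str object is used as a key
    # in one map and stored in the other, so it isn't duplicated in memory.
    digest = hashlib.md5(line.encode('utf-8')).hexdigest()
    md5_to_lines.setdefault(digest, []).append(line)
    file_to_md5s.setdefault(filename, []).append(digest)

add_line('a.txt', 'some line of text')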

Thanks for sticking with me on this,

MrsE

On 9/25/2012 4:06 AM, Dave Angel wrote:
On 09/25/2012 12:21 AM, Junkshops wrote:
Just curious;  which is it, two million lines, or half a million bytes?
<snip>
Sorry, that should've been a 500Mb, 2M line file.

which machine is 2gb, the Windows machine, or the VM?
VM. Winders is 4gb.

...but I would point out that just because
you free up the memory from the Python doesn't mean it gets released
back to the system.  The C runtime manages its own heap, and is pretty
persistent about hanging onto memory once obtained.  It's not normally a
problem, since most small blocks are reused.  But it can get
fragmented.  And I have no idea how well VirtualBox maps the Linux
memory map into the Windows one.
Right, I understand that - but what's confusing me is that, given the
memory use is (I assume) monotonically increasing, the code should never
use more than what's reported by heapy once all the data is loaded into
memory, given that memory released by the code to the Python runtime is
reused. To the best of my ability to tell, I'm not storing anything I
shouldn't, so the only thing I can think of is that all the object
creation and destruction is, for some reason, preventing reuse of
memory. I'm at a bit of a loss regarding what to try next.
I'm not familiar with heapy, but perhaps it's missing something there.
I'm a bit surprised you aren't beyond the 2gb limit, just with the
structures you describe for the file.  You do realize that each object
has quite a few bytes of overhead, so it's not surprising to use several
times the size of a file, to store the file in an organized way.  I also
wonder if heapy has been written to take into account the larger size of
pointers in a 64bit build.

Perhaps one way to save space would be to use a long to store those md5
values.  You'd have to measure it, but I suspect it'd help (at the cost
of lots of extra hexlify-type calls).  Another thing is to make sure
that the md5 object used in your two maps is the same object, and not
just one with the same value.
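
Something like the following, roughly - an untested sketch with made-up
names, just to show the shape of it:

import hashlib
import sys

line = 'some line from the file'
hex_key = hashlib.md5(line.encode('utf-8')).hexdigest()  # 32 char hex str
int_key = int(hex_key, 16)        # the same 128-bit value as a long/int

print(sys.getsizeof(hex_key))     # roughly 70+ bytes for the hex string
print(sys.getsizeof(int_key))     # roughly 40-50 bytes for the integer

# Getting a printable digest back costs a format call each time:
print('%032x' % int_key == hex_key)   # True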


