Re: Memory usage per top 10x usage per heapy

2012-09-27 Thread bryanjugglercryptographer
MrsEntity wrote:
 Based on heapy, a db based solution would be serious overkill.

I've embraced overkill and my life is better for it. Don't confuse overkill 
with cost. Overkill is your friend.

The facts of the case: You need to save some derived strings for each of 2M 
input lines. Even half the input runs over the 2GB RAM in your (virtual) 
machine. You're using Ubuntu 12.04 in Virtualbox on Win7/64, Python 2.7/64.

That screams sqlite3. It's overkill, in a good way. It's already there for 
the importing.
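
A minimal sketch of what that could look like for this workload (file name, table
and column names are invented; the per-line fields are the id token, md5 and line
number described elsewhere in the thread), Python 2.7 as in the OP's setup:

import hashlib
import sqlite3

conn = sqlite3.connect('lines.db')   # on disk, so resident memory stays flat
conn.execute('CREATE TABLE IF NOT EXISTS lines '
             '(id TEXT PRIMARY KEY, md5 TEXT, lineno INTEGER, typestr TEXT)')

with open('input.txt') as f:
    for lineno, line in enumerate(f):
        ident = line.split(None, 1)[0]            # first token on the line
        digest = hashlib.md5(line).hexdigest()
        conn.execute('INSERT OR REPLACE INTO lines VALUES (?, ?, ?, ?)',
                     (ident, digest, lineno, 'Measurement'))
conn.commit()

# Either lookup direction the two in-memory dicts provide:
print conn.execute('SELECT md5 FROM lines WHERE id = ?', (ident,)).fetchone()
print conn.execute('SELECT lineno, typestr FROM lines WHERE md5 = ?',
                   (digest,)).fetchone()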

Other approaches? You could try to keep everything in RAM, but use less. Tim 
Chase pointed out the memory-efficiency of named tuples. You could save some 
more by switching to Win7/32, Python 2.7/32; VirtualBox makes trying such 
alternatives quick and easy.

Or you could add memory. Compared to good old 32-bit, 64-bit operation consumes 
significantly more memory and supports vastly more memory. There's a bit of a 
mis-match in a 64-bit system with just 2GB of RAM. I know, sounds weird, just 
two billion bytes of RAM. I'll rephrase: just ten dollars worth of RAM. Less if 
you buy it where I do.

I don't know why the memory profiling tools are misleading you. I can think of 
plausible explanations, but they'd just be guesses. There's nothing all that 
surprising in running out of RAM, given what you've explained. A couple K per 
line is easy to burn. 

-Bryan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Tim Chase
On 09/24/12 23:41, Dennis Lee Bieber wrote:
 On Mon, 24 Sep 2012 14:59:47 -0700 (PDT), MrsEntity
 junksh...@gmail.com declaimed the following in
 gmane.comp.python.general:
 
 Hi all,

 I'm working on some code that parses a 500kb, 2M line file line by line and 
 saves, per line, some derived strings
 
   Pardon? A 2-million-line file will contain, at the minimum, 2 million
 line-end characters. That's four times 500kB just in the line-ends,
 ignoring any data.

As corrected later in the thread, MrsEntity writes


I have, in fact, this very afternoon, invented a means of writing a
carriage return character using only 2 bits of information. I am
prepared to sell licenses to this revolutionary technology for the
low price of $29.95 plus tax.

Sorry, that should've been a 500Mb, 2M line file.


If only other unnamed persons on the list were so gracious rather
than turning the flame-dial to 11.

I hope that when people come to the list, *this* is what they see,
laugh, and want to participate.

Although, MrsEntity could be zombie David A. Huffman, whose encoding
scheme actually *can* store 2M lines in 500kb :-)

-tkc



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Dave Angel
On 09/25/2012 12:21 AM, Junkshops wrote:
 Just curious;  which is it, two million lines, or half a million bytes?
snip
 
 Sorry, that should've been a 500Mb, 2M line file.
 
 which machine is 2gb, the Windows machine, or the VM?
 VM. Winders is 4gb.
 
 ...but I would point out that just because
 you free up the memory from the Python doesn't mean it gets released
 back to the system.  The C runtime manages its own heap, and is pretty
 persistent about hanging onto memory once obtained.  It's not normally a
 problem, since most small blocks are reused.  But it can get
 fragmented.  And i have no idea how well Virtual Box maps the Linux
 memory map into the Windows one.
 Right, I understand that - but what's confusing me is that, given the
 memory use is (I assume) monotonically increasing, the code should never
 use more than what's reported by heapy once all the data is loaded into
 memory, given that memory released by the code to the Python runtime is
 reused. To the best of my ability to tell I'm not storing anything I
 shouldn't, so the only thing I can think of is that all the object
 creation and destruction, for some reason, is preventing reuse of
 memory. I'm at a bit of a loss regarding what to try next.

I'm not familiar with heapy, but perhaps it's missing something there.
I'm a bit surprised you aren't beyond the 2gb limit, just with the
structures you describe for the file.  You do realize that each object
has quite a few bytes of overhead, so it's not surprising to use several
times the size of a file, to store the file in an organized way.  I also
wonder if heapy has been written to take into account the larger size of
pointers in a 64bit build.

Perhaps one way to save space would be to use a long to store those md5
values.  You'd have to measure it, but I suspect it'd help (at the cost
of lots of extra hexlify-type calls).  Another thing is to make sure
that the md5 object used in your two maps is the same object, and not
just one with the same value.
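
A rough sketch of the long-integer idea (purely illustrative; exact sizes differ
between builds, but on a 64-bit CPython 2.7 the int form of a digest is noticeably
smaller than the 32-character hex string):

import hashlib
import sys

hexdigest = hashlib.md5('some input line').hexdigest()   # 32-char str key
intdigest = int(hexdigest, 16)                           # same value as a long

print sys.getsizeof(hexdigest), sys.getsizeof(intdigest) # hex str vs long
print '%032x' % intdigest == hexdigest                   # True: round-trips back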


-- 

DaveA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Mark Lawrence

On 25/09/2012 11:51, Tim Chase wrote:
[snip]


If only other unnamed persons on the list were so gracious rather
than turning the flame-dial to 11.



Oh heck what have I said this time?



-tkc


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Oscar Benjamin
On 25 September 2012 00:58, Junkshops junksh...@gmail.com wrote:

 Hi Tim, thanks for the response.


  - check how you're reading the data:  are you iterating over
the lines a row at a time, or are you using
.read()/.readlines() to pull in the whole file and then
operate on that?

 I'm using enumerate() on an iterable input (which in this case is the
 filehandle).


  - check how you're storing them:  are you holding onto more
than you think you are?

 I've used ipython to look through my data structures (without going into
 ungainly detail, 2 dicts with X numbers of key/value pairs, where X =
 number of lines in the file), and everything seems to be working correctly.
 Like I say, heapy output looks reasonable - I don't see anything surprising
 there. In one dict I'm storing a id string (the first token in each line of
 the file) with values as (again, without going into massive detail) the md5
 of the contents of the line. The second dict has the md5 as the key and an
 object with __slots__ set that stores the line number of the file and the
 type of object that line represents.


Can you give an example of how these data structures look after reading
only the first 5 lines?

Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: gracious responses (was: Memory usage per top 10x usage per heapy)

2012-09-25 Thread Tim Chase
On 09/25/12 06:10, Mark Lawrence wrote:
 On 25/09/2012 11:51, Tim Chase wrote:
 If only other unnamed persons on the list were so gracious rather
 than turning the flame-dial to 11.

 
 Oh heck what have I said this time?

You'd *like* to take credit?  ;-)

Nah, not you or any of the regulars here.  The comment was regarding
the flame-fest that's been running in some parallel threads over the
last ~12hr or so.  Mostly instigated by one person with a
particularly quick trigger, vitriolic tongue, and a disregard for
pythonic code.

-tkc


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: gracious responses (was: Memory usage per top 10x usage per heapy)

2012-09-25 Thread alex23
On Sep 25, 9:39 pm, Tim Chase python.l...@tim.thechases.com wrote:
 Mostly instigated by one person with a
 particularly quick trigger, vitriolic tongue, and a disregard for
 pythonic code.

I'm sorry. I'll get me coat.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Junkshops

I'm a bit surprised you aren't beyond the 2gb limit, just with the
structures you describe for the file.  You do realize that each object
has quite a few bytes of overhead, so it's not surprising to use several
times the size of a file, to store the file in an organized way.
I did some back of the envelope calcs which more or less agreed with 
heapy. The code stores 1 string, which is, on average, about 50 chars or 
so, and one MD5 hex string per line of code. There's about 40 bytes or 
so of overhead per string per sys.getsizeof(). I'm also storing an int 
(24b) and a 10 char string in an object with __slots__ set. Each 
object, per heapy (this is one area where I might be underestimating 
things) takes 64 bytes plus instance variable storage, so per line:


50 + 32 + 10 + 3 * 40 + 24 + 64 = 300 bytes per line * 2M lines = ~600MB 
plus some memory for the dicts, which is about what heapy is reporting 
(note I'm currently not actually running all 2M lines, I'm just running 
subsets for my tests).
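
(For anyone who wants to reproduce that envelope on their own build, something
like the following prints the per-object numbers; the 50-char id, the __slots__
class and the 10-char type string are stand-ins for the real ones, and
sys.getsizeof() only counts the object itself, not anything it references:)

import hashlib
import sys

class FakeContext(object):
    __slots__ = ('lineNumber', 'typeStr')
    def __init__(self, lineno, typestr):
        self.lineNumber = lineno
        self.typeStr = typestr

ident = 'x' * 50                          # stand-in for the ~50-char id string
digest = hashlib.md5(ident).hexdigest()   # 32-char hex string
ctx = FakeContext(12345, 'x' * 10)        # int plus 10-char type string

for obj in (ident, digest, ctx, ctx.lineNumber, ctx.typeStr):
    print type(obj).__name__, sys.getsizeof(obj)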


Is there something I'm missing? Here's the heapy output after loading 
~300k lines:


Partition of a set of 1199849 objects. Total size = 89965376 bytes.
 Index   Count   %      Size   %  Cumulative   %  Kind
     0      59  50  38399920  43    38399920  43  str
     1       5   0  25167224  28    63567144  71  dict
     2      28  25  19199872  21    82767016  92  0xa13330
     3  299836  25   7196064   8    89963080 100  int
     4       4   0      1152   0    89964232 100  collections.defaultdict

Note that 3 of the dicts are empty. I assume that 0xa13330 is the 
address of the object. I'd actually expect to see 900k strings, but the 
10 char string is always the same in this case so perhaps the runtime 
is using the same object...? At this point, top reports python as using 
1.1g of virt and 1.0g of res.



I also
wonder if heapy has been written to take into account the larger size of
pointers in a 64bit build.
That I don't know, but that would only explain, at most, a 2x increase 
in memory over the heapy report, wouldn't it? Not the ~10x I'm seeing.



Another thing is to make sure
that the md5 object used in your two maps is the same object, and not
just one with the same value.
That's certainly the way the code is written, and heapy seems to confirm 
that the strings aren't duplicated in memory.


Thanks for sticking with me on this,

MrsE

On 9/25/2012 4:06 AM, Dave Angel wrote:

On 09/25/2012 12:21 AM, Junkshops wrote:

Just curious;  which is it, two million lines, or half a million bytes?

snip

Sorry, that should've been a 500Mb, 2M line file.


which machine is 2gb, the Windows machine, or the VM?

VM. Winders is 4gb.


...but I would point out that just because
you free up the memory from the Python doesn't mean it gets released
back to the system.  The C runtime manages its own heap, and is pretty
persistent about hanging onto memory once obtained.  It's not normally a
problem, since most small blocks are reused.  But it can get
fragmented.  And i have no idea how well Virtual Box maps the Linux
memory map into the Windows one.

Right, I understand that - but what's confusing me is that, given the
memory use is (I assume) monotonically increasing, the code should never
use more than what's reported by heapy once all the data is loaded into
memory, given that memory released by the code to the Python runtime is
reused. To the best of my ability to tell I'm not storing anything I
shouldn't, so the only thing I can think of is that all the object
creation and destruction, for some reason, is preventing reuse of
memory. I'm at a bit of a loss regarding what to try next.

I'm not familiar with heapy, but perhaps it's missing something there.
I'm a bit surprised you aren't beyond the 2gb limit, just with the
structures you describe for the file.  You do realize that each object
has quite a few bytes of overhead, so it's not surprising to use several
times the size of a file, to store the file in an organized way.  I also
wonder if heapy has been written to take into account the larger size of
pointers in a 64bit build.

Perhaps one way to save space would be to use a long to store those md5
values.  You'd have to measure it, but I suspect it'd help (at the cost
of lots of extra hexlify-type calls).  Another thing is to make sure
that the md5 object used in your two maps is the same object, and not
just one with the same value.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Junkshops


Can you give an example of how these data structures look after 
reading only the first 5 lines?

Sure, here you go:

In [38]: mpef._ustore._store
Out[38]: defaultdict(<type 'dict'>, {'Measurement':
{'8991c2dc67a49b909918477ee4efd767':
<micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0fe90>,
'7b38b429230f00fe4731e60419e92346':
<micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0fad0>,
'b53531471b261c44d52f651add647544':
<micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0f4d0>,
'44ea6d949f7c8c8ac3bb4c0bf4943f82':
<micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0f910>,
'0de96f928dc471b297f8a305e71ae3e1':
<micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0f550>}})


In [39]: 
mpef._ustore._store['Measurement']['b53531471b261c44d52f651add647544'].typeStr

Out[39]: 'Measurement'

In [40]: 
mpef._ustore._store['Measurement']['b53531471b261c44d52f651add647544'].lineNumber

Out[40]: 5

In [41]: mpef._ustore._idstore
Out[41]: defaultdict(<class
'micropheno.exchangeformat.KBaseID.IDStore'>, {'Measurement':
<micropheno.exchangeformat.KBaseID.IDStore object at 0x2f0f950>})


In [43]: mpef._ustore._idstore['Measurement']._SIDstore
Out[43]: defaultdict(<function <lambda> at 0x2ece7d0>, {'emailRemoved':
defaultdict(<function <lambda> at 0x2c4caa0>, {'microPhenoShew2011':
defaultdict(<type 'dict'>, {0: {'MLR_124572462':
'8991c2dc67a49b909918477ee4efd767', 'MLR_124572161':
'7b38b429230f00fe4731e60419e92346', 'SMMLR_12551352':
'b53531471b261c44d52f651add647544', 'SMMLR_12551051':
'0de96f928dc471b297f8a305e71ae3e1', 'SMMLR_12550750':
'44ea6d949f7c8c8ac3bb4c0bf4943f82'}})})})


-MrsE

On 9/25/2012 4:33 AM, Oscar Benjamin wrote:
On 25 September 2012 00:58, Junkshops junksh...@gmail.com 
mailto:junksh...@gmail.com wrote:


Hi Tim, thanks for the response.


- check how you're reading the data:  are you iterating over
   the lines a row at a time, or are you using
   .read()/.readlines() to pull in the whole file and then
   operate on that?

I'm using enumerate() on an iterable input (which in this case is
the filehandle).


- check how you're storing them:  are you holding onto more
   than you think you are?

I've used ipython to look through my data structures (without
going into ungainly detail, 2 dicts with X numbers of key/value
pairs, where X = number of lines in the file), and everything
seems to be working correctly. Like I say, heapy output looks
reasonable - I don't see anything surprising there. In one dict
I'm storing a id string (the first token in each line of the file)
with values as (again, without going into massive detail) the md5
of the contents of the line. The second dict has the md5 as the
key and an object with __slots__ set that stores the line number
of the file and the type of object that line represents.


Can you give an example of how these data structures look after 
reading only the first 5 lines?


Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Oscar Benjamin
On 25 September 2012 19:08, Junkshops junksh...@gmail.com wrote:


  Can you give an example of how these data structures look after reading
 only the first 5 lines?

 Sure, here you go:

 In [38]: mpef._ustore._store
 Out[38]: defaultdict(type 'dict', {'Measurement':
 {'8991c2dc67a49b909918477ee4efd767':
 micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0fe90,
 '7b38b429230f00fe4731e60419e92346':
 micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0fad0,
 'b53531471b261c44d52f651add647544':
 micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0f4d0,
 '44ea6d949f7c8c8ac3bb4c0bf4943f82':
 micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0f910,
 '0de96f928dc471b297f8a305e71ae3e1':
 micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0f550}})


Have these exceptions been raised from somewhere before being stored? I
wonder if you're inadvertently keeping execution frames alive. There are
some problems in CPython with this that are related to storing exceptions.



 In [39]:
 mpef._ustore._store['Measurement']['b53531471b261c44d52f651add647544'].typeStr
 Out[39]: 'Measurement'


Seeing how long these hex strings are, I'm confident that you would save a
significant amount of memory by converting them to int.



 In [40]:
 mpef._ustore._store['Measurement']['b53531471b261c44d52f651add647544'].lineNumber
 Out[40]: 5

 In [41]: mpef._ustore._idstore
 Out[41]: defaultdict(class 'micropheno.exchangeformat.KBaseID.IDStore',
 {'Measurement': micropheno.exchangeformat.KBaseID.IDStore object at
 0x2f0f950})

 In [43]: mpef._ustore._idstore['Measurement']._SIDstore
 Out[43]: defaultdict(function lambda at 0x2ece7d0, {'emailRemoved':
 defaultdict(function lambda at 0x2c4caa0, {'microPhenoShew2011':
 defaultdict(type 'dict', {0: {'MLR_124572462':
 '8991c2dc67a49b909918477ee4efd767', 'MLR_124572161':
 '7b38b429230f00fe4731e60419e92346', 'SMMLR_12551352':
 'b53531471b261c44d52f651add647544', 'SMMLR_12551051':
 '0de96f928dc471b297f8a305e71ae3e1', 'SMMLR_12550750':
 '44ea6d949f7c8c8ac3bb4c0bf4943f82'}})})})


Also I think lambda functions might be able to keep the frame alive. Are
they by any chance being created in a function that is called in a loop?

>>> def f():
...     x = 4
...     return lambda : x
...
>>> g = f()
>>> g()  # Accesses local variable from kept-alive frame
4
>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined

Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Dave Angel
On 09/25/2012 01:39 PM, Junkshops wrote:

Procedural point:  I know you're trying to conform to the standard that
this mailing list uses, but you're off a little, and it's distracting.
It's also probably more work for you, and certainly for us.

You need an attribution in front of the quoted portions.  This next
section is by me, but you don't say so.  That's because you copy/pasted
it from elsewhere in the reply, and didn't copy the "... Dave Angel
wrote" part.

Much easier is to take the reply, and remove the parts you're not going
to respond to, putting your own comments in between the parts that are
left (as you're doing).  And generally, there's no need for anything
after your last remark, so you just delete up to your signature, if any.


 I'm a bit surprised you aren't beyond the 2gb limit, just with the
 structures you describe for the file.  You do realize that each object
 has quite a few bytes of overhead, so it's not surprising to use several
 times the size of a file, to store the file in an organized way.
 I did some back of the envelope calcs which more or less agreed with
 heapy. The code stores 1 string, which is, on average, about 50 chars or
 so, and one MD5 hex string per line of code. There's about 40 bytes or
 so of overhead per string per sys.getsizeof(). I'm also storing an int
 (24b) and a 10 char string in an object with __slots__ set. Each
 object, per heapy (this is one area where I might be underestimating
 things) takes 64 bytes plus instance variable storage, so per line:
 
 50 + 32 + 10 + 3 * 40 + 24 + 64 = 300 bytes per line * 2M lines = ~600MB
 plus some memory for the dicts, which is about what heapy is reporting
 (note I'm currently not actually running all 2M lines, I'm just running
 subsets for my tests).
 
 Is there something I'm missing? Here's the heapy output after loading
 ~300k lines:
 
 Partition of a set of 1199849 objects. Total size = 89965376 bytes.
 Index Count % Size % Cumulative % Kind
 0 59 50 38399920 43 38399920 43 str
 1 5 0 25167224 28 63567144 71 dict
 2 28 25 19199872 21 82767016 92 0xa13330
 3 299836 25 7196064 8 89963080 100 int
 4 4 0 1152 0 89964232 100
 collections.defaultdict
 
 Note that 3 of the dicts are empty. I assume that 0xa13330 is the
 address of the object. I'd actually expect to see 900k strings, but the
 10 char string is always the same in this case so perhaps the runtime
 is using the same object...? 

CPython currently interns short strings that conform to variable name
rules.  You can't count on that behavior (and I probably don't have it
quite right anyway), but it's probably what you're seeing.
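
A quick way to see that effect (CPython 2.7 implementation detail, so "usually"
is doing real work here):

a = 'Measurement'             # identifier-like literal, interned at compile time
b = 'Measurement'
print a is b                  # usually True: both names share one str object

c = ''.join(['Measure', 'ment'])
print c == a, c is a          # True False: built at runtime, not interned

print intern(c) is a          # True: intern() returns the canonical object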


 At this point, top reports python as using
 1.1g of virt and 1.0g of res.
 
 I also
 wonder if heapy has been written to take into account the larger size of
 pointers in a 64bit build.
 That I don't know, but that would only explain, at most, a 2x increase
 in memory over the heapy report, wouldn't it? Not the ~10x I'm seeing.
 
 Another thing is to make sure
 that the md5 object used in your two maps is the same object, and not
 just one with the same value.
 That's certainly the way the code is written, and heapy seems to confirm
 that the strings aren't duplicated in memory.
 
 Thanks for sticking with me on this,

You're certainly welcome.  I suspect that heapy has some limitation in
its reporting, and that's where the discrepancy is.  Oscar points out that
you have a bunch of exception objects, which certainly looks suspicious.
 If you're somehow storing one of these per line, and heapy isn't
reporting them, that could be a large discrepancy.

He also points out that you have a couple of lambda functions stored in
one of your dictionary.  A lambda function can be an expensive
proposition if you are building millions of them.  So can nested
functions with non-local variable references, in case you have any of those.

Oscar also reminds you of what I suggested for the md5 fields.  Stored
as ints instead of hex strings could save a good bit.  Just remember to
use the same one for both dicts, as you've been doing with the strings.


Other than that, I'm stumped.


-- 

DaveA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Junkshops

On 9/25/2012 11:17 AM, Oscar Benjamin wrote:
On 25 September 2012 19:08, Junkshops junksh...@gmail.com 
mailto:junksh...@gmail.com wrote:



In [38]: mpef._ustore._store
Out[38]: defaultdict(type 'dict', {'Measurement':
{'8991c2dc67a49b909918477ee4efd767':
micropheno.exchangeformat.Exceptions.FileContext object at
0x2f0fe90, '7b38b429230f00fe4731e60419e92346':
micropheno.exchangeformat.Exceptions.FileContext object at
0x2f0fad0, 'b53531471b261c44d52f651add647544':
micropheno.exchangeformat.Exceptions.FileContext object at
0x2f0f4d0, '44ea6d949f7c8c8ac3bb4c0bf4943f82':
micropheno.exchangeformat.Exceptions.FileContext object at
0x2f0f910, '0de96f928dc471b297f8a305e71ae3e1':
micropheno.exchangeformat.Exceptions.FileContext object at
0x2f0f550}})


Have these exceptions been raised from somewhere before being stored? 
I wonder if you're inadvertently keeping execution frames alive. There 
are some problems in CPython with this that are related to storing 
exceptions.
FileContext objects aren't exceptions. They store information about 
where the stored object originally came from, so if there's an MD5 or ID 
clash with a later line in the file the code can report both the current 
line and the older clashing line to the user. I have an Exception 
subclass that takes a FileContext as an argument. There are no 
exceptions thrown in the file I processed to get the heapy results 
earlier in the thread.



In [43]: mpef._ustore._idstore['Measurement']._SIDstore
Out[43]: defaultdict(function lambda at 0x2ece7d0, 
{'emailRemoved': defaultdict(function lambda at 0x2c4caa0, 
{'microPhenoShew2011': defaultdict(type 'dict', {0: 
{'MLR_124572462': '8991c2dc67a49b909918477ee4efd767', 
'MLR_124572161': '7b38b429230f00fe4731e60419e92346', 
'SMMLR_12551352': 'b53531471b261c44d52f651add647544', 
'SMMLR_12551051': '0de96f928dc471b297f8a305e71ae3e1', 
'SMMLR_12550750': '44ea6d949f7c8c8ac3bb4c0bf4943f82'}})})})
Also I think lambda functions might be able to keep the frame alive. 
Are they by any chance being created in a function that is called in a 
loop?



Here's the context for the lambdas:

def __init__(self):
    self._SIDstore = defaultdict(lambda: defaultdict(lambda:
                                 defaultdict(dict)))


So the lambda is only being called when a new key is added to the top 3 
levels of the datastructure, which in the test case I've been 
discussing, only happens once each.


Although the suggestion to change the hex strings to ints is a good one 
and I'll do it, what I'm really trying to understand is why there's such 
a large difference between the memory use per top (and the fact that the 
code appears to thrash swap) and per heapy and my calculations of how 
much memory the code should be using.


Cheers, MrsEntity
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Junkshops

On 9/25/2012 11:50 AM, Dave Angel wrote:
I suspect that heapy has some limitation in its reporting, and that's
where the discrepancy is.


That would be my first suspicion as well - except that heapy's results 
agree so well with what I expect, and I can't think of any reason I'd be 
using 10x more memory. If heapy is wrong, then I need to try and figure 
out what's using up all that memory some other way... but I don't know 
what that way might be.


... can be an expensive proposition if you are building millions of 
them. So can nested functions with non-local variable references, in 
case you have any of those. 


Not as far as I know.

Cheers, MrsEntity
--
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Oscar Benjamin
On 25 September 2012 21:26, Junkshops junksh...@gmail.com wrote:

  On 9/25/2012 11:17 AM, Oscar Benjamin wrote:

 On 25 September 2012 19:08, Junkshops junksh...@gmail.com wrote:


 In [38]: mpef._ustore._store
 Out[38]: defaultdict(type 'dict', {'Measurement':
 {'8991c2dc67a49b909918477ee4efd767':
 micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0fe90,
 '7b38b429230f00fe4731e60419e92346':
 micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0fad0,
 'b53531471b261c44d52f651add647544':
 micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0f4d0,
 '44ea6d949f7c8c8ac3bb4c0bf4943f82':
 micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0f910,
 '0de96f928dc471b297f8a305e71ae3e1':
 micropheno.exchangeformat.Exceptions.FileContext object at 0x2f0f550}})


  Have these exceptions been raised from somewhere before being stored? I
 wonder if you're inadvertently keeping execution frames alive. There are
 some problems in CPython with this that are related to storing exceptions.

 FileContext objects aren't exceptions. They store information about where
 the stored object originally came from, so if there's an MD5 or ID clash
 with a later line in the file the code can report both the current line and
 the older clashing line to the user. I have an Exception subclass that
 takes a FileContext as an argument. There are no exceptions thrown in the
 file I processed to get the heapy results earlier in the thread.


I don't know whether it would be better or worse but it might be worth
seeing what happens if you replace the FileContext objects with tuples.




  In [43]: mpef._ustore._idstore['Measurement']._SIDstore
 Out[43]: defaultdict(function lambda at 0x2ece7d0, {'emailRemoved':
 defaultdict(function lambda at 0x2c4caa0, {'microPhenoShew2011':
 defaultdict(type 'dict', {0: {'MLR_124572462':
 '8991c2dc67a49b909918477ee4efd767', 'MLR_124572161':
 '7b38b429230f00fe4731e60419e92346', 'SMMLR_12551352':
 'b53531471b261c44d52f651add647544', 'SMMLR_12551051':
 '0de96f928dc471b297f8a305e71ae3e1', 'SMMLR_12550750':
 '44ea6d949f7c8c8ac3bb4c0bf4943f82'}})})})

 Also I think lambda functions might be able to keep the frame alive. Are
 they by any chance being created in a function that is called in a loop?

   Here's the context for the lambdas:

   def __init__(self):
 self._SIDstore = defaultdict(lambda: defaultdict(lambda:
 defaultdict(dict)))

 So the lambda is only being called when a new key is added to the top 3
 levels of the datastructure, which in the test case I've been discussing,
 only happens once each.


I can't see anything wrong with that but then I'm not sure if the lambda
function always keeps its frame alive. If there's only that one line in the
__init__ function then I'd expect it to be fine.



 Although the suggestion to change the hex strings to ints is a good one
 and I'll do it, what I'm really trying to understand is why there's such a
 large difference between the memory use per top (and the fact that the code
 appears to thrash swap) and per heapy and my calculations of how much
 memory the code should be using.


Perhaps you could see what objgraph comes up with:
http://pypi.python.org/pypi/objgraph

So far as I know objgraph doesn't tell you how big objects are but it does
give a nice graphical representation of which objects are alive and which
other objects they are referenced by. You might find that some other object
is kept alive that you didn't expect.
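
A minimal way to use it on this code might be (the function names are objgraph's;
picking a FileContext instance to inspect is just the obvious starting point here):

import objgraph

objgraph.show_most_common_types(limit=20)   # live object counts by type
objgraph.show_growth()                      # call again later to see deltas

# Render what keeps one suspicious object alive:
ctx = mpef._ustore._store['Measurement'].values()[0]
objgraph.show_backrefs([ctx], max_depth=4, filename='backrefs.png')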

Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Junkshops

On 9/25/2012 2:17 PM, Oscar Benjamin wrote:
I don't know whether it would be better or worse but it might be worth 
seeing what happens if you replace the FileContext objects with tuples.
I originally used a string, and it was slightly better since you don't 
have the object overhead, but I wanted to code to an interface for the 
context information, so I started a Context abstract class that FileContext 
inherits from (both have __slots__ set). Using an object without 
__slots__ set was a disaster. However, the difference between a string 
and an object with __slots__ isn't severe.




I can't see anything wrong with that but then I'm not sure if the 
lambda function always keeps its frame alive. If there's only that one 
line in the __init__ function then I'd expect it to be fine.


That's it, I'm afraid.



Perhaps you could see what objgraph comes up with:
http://pypi.python.org/pypi/objgraph

So far as I know objgraph doesn't tell you how big objects are but it 
does give a nice graphical representation of which objects are alive 
and which other objects they are referenced by. You might find that 
some other object is kept alive that you didn't expect.



I'll give it a shot and see what happens.

Cheers, MrsEntity

--
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Tim Chase
On 09/25/12 16:17, Oscar Benjamin wrote:
 I don't know whether it would be better or worse but it might be
 worth seeing what happens if you replace the FileContext objects
 with tuples.

If tuples provide a savings but you find them opaque, you might also
consider named-tuples for clarity.

-tkc


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Ian Kelly
On Tue, Sep 25, 2012 at 12:17 PM, Oscar Benjamin
oscar.j.benja...@gmail.com wrote:
 Also I think lambda functions might be able to keep the frame alive. Are
 they by any chance being created in a function that is called in a loop?

I'm pretty sure they don't.  Closures don't keep a reference to the
calling frame, only to the appropriate cellvars.

Also note that whether a function is a closure has nothing to do with
whether it was defined by a lambda or a def statement.  In fact,
there's no difference between functions created by one vs. the other,
except that one has an interesting __name__ and the other does not.
:-)
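
A quick way to see what the closure actually holds on to (CPython; only a cell
for x survives, not the frame or its other locals):

def f():
    x = 4
    big = range(10 ** 6)    # large local that the lambda does not reference
    return lambda: x

g = f()
print g()                             # 4
print g.__closure__                   # a single cell...
print g.__closure__[0].cell_contents  # ...holding 4; 'big' dies with the frame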
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Oscar Benjamin
On 25 September 2012 23:09, Ian Kelly ian.g.ke...@gmail.com wrote:

 On Tue, Sep 25, 2012 at 12:17 PM, Oscar Benjamin
 oscar.j.benja...@gmail.com wrote:
  Also I think lambda functions might be able to keep the frame alive. Are
  they by any chance being created in a function that is called in a loop?

 I'm pretty sure they don't.  Closures don't keep a reference to the
 calling frame, only to the appropriate cellvars.


OK, that's good to know.



 Also note that whether a function is a closure has nothing to do with
 whether it was defined by a lambda or a def statement.  In fact,
 there's no difference between functions created by one vs. the other,
 except that one has an interesting __name__ and the other does not.
 :-)


That's true but in my experience most lambda functions are defined inside
another function, whereas most ordinary functions are not. Also when
creating a closure with an ordinary function it's very clear what you are
doing (which is why I don't use lambda functions for this) so I think it's
a little easier to accidentally create a closure with a lambda function.

Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Oscar Benjamin
On 25 September 2012 23:10, Tim Chase python.l...@tim.thechases.com wrote:

 On 09/25/12 16:17, Oscar Benjamin wrote:
  I don't know whether it would be better or worse but it might be
  worth seeing what happens if you replace the FileContext objects
  with tuples.

 If tuples provide a savings but you find them opaque, you might also
 consider named-tuples for clarity.


Do they have the same memory usage?

Since tuples don't have a per-instance __dict__, I'd expect them to be a
lot lighter. I'm not sure if I'm interpreting the results below properly
but they seem to suggest that a namedtuple can have a memory consumption
several times larger than an ordinary tuple.

>>> import sys
>>> import collections
>>> A = collections.namedtuple('A', ['x', 'y'])
>>> a = A(1, 2)
>>> sys.getsizeof(a)
72
>>> sys.getsizeof(A(1, 2))
72
>>> sys.getsizeof((1, 2))
72
>>> sys.getsizeof(A(1, 2).__dict__)
280
>>> A(1, 2).__dict__
OrderedDict([('x', 1), ('y', 2)])
>>> sys.getsizeof((1, 2).__dict__)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute '__dict__'
>>> A(1, 2).__dict__ is A(3, 4).__dict__
False

Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Tim Chase
On 09/25/12 17:55, Oscar Benjamin wrote:
 On 25 September 2012 23:10, Tim Chase python.l...@tim.thechases.com wrote:
 If tuples provide a savings but you find them opaque, you might also
 consider named-tuples for clarity.
 
 Do they have the same memory usage?
 
 Since tuples don't have a per-instance __dict__, I'd expect them to be a
 lot lighter. I'm not sure if I'm interpreting the results below properly
 but they seem to suggest that a namedtuple can have a memory consumption
 several times larger than an ordinary tuple.

I think the "how much memory is $METHOD using" topic of the thread
is the root of the problem.  From my testing of your question:

>>> import collections, sys
>>> A = collections.namedtuple('A', ['x', 'y'])
>>> nt = A(1,3)
>>> t = (1,3)
>>> sys.getsizeof(nt)
72
>>> sys.getsizeof(t)
72
>>> nt_s = set(dir(nt))
>>> t_s = set(dir(t))
>>> t_s ^ nt_s
set(['__module__', '_make', '_asdict', '_replace', '_fields',
'__slots__', 'y', 'x'])
>>> t_s - nt_s
set([])

So a named-tuple has 6+n (where n is the number of fields) extra
attributes, but it seems that namedtuples & tuples occupy
the same amount of space (72).

Additionally, pulling up a second console and issuing

  ps v | grep [p]ython

shows the memory usage of the process as I perform these, and after
them, and they both show the same usage (actual test was

1) pull up a fresh python
2) import sys, collections; A = collections.namedtuple('A',['x','y'])
3) check memory usage in other window
4a) x = (1,2)
4b) x = A(1,2)
5) check memory usage again in other window
6) quit python

performing 4a on one run, and 4b on the second run.

Both showed identical memory usage as well (Debian Linux (Stable),
stock Python 2.6.6) at the system level.

I don't know if that little testing is actually worth anything, but
at least it's another data-point as we muddle towards helping
MrsEntity/junkshops.

-tkc



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-25 Thread Oscar Benjamin
On 26 September 2012 00:35, Tim Chase python.l...@tim.thechases.com wrote:

 On 09/25/12 17:55, Oscar Benjamin wrote:
  On 25 September 2012 23:10, Tim Chase python.l...@tim.thechases.com
 wrote:
  If tuples provide a savings but you find them opaque, you might also
  consider named-tuples for clarity.
 
  Do they have the same memory usage?
 
  Since tuples don't have a per-instance __dict__, I'd expect them to be a
  lot lighter. I'm not sure if I'm interpreting the results below properly
  but they seem to suggest that a namedtuple can have a memory consumption
  several times larger than an ordinary tuple.

 I think the "how much memory is $METHOD using" topic of the thread
 is the root of the problem.  From my testing of your question:

  import collections, sys
  A = collections.namedtuple('A', ['x', 'y'])
  nt = A(1,3)
  t = (1,3)
  sys.getsizeof(nt)
 72
  sys.getsizeof(t)
 72
  nt_s = set(dir(nt))
  t_s = set(dir(t))
  t_s ^ nt_s
 set(['__module__', '_make', '_asdict', '_replace', '_fields',
 '__slots__', 'y', 'x'])
  t_s - nt_s
 set([])


On my system there is an additional __dict__ attribute and it is bigger
than the original tuple:
$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import collections, sys
>>> A = collections.namedtuple('A', ['x', 'y'])
>>> nt = A(1,3)
>>> t = (1,3)
>>> set(dir(nt)) - set(dir(t))
set(['__module__', '_replace', '_make', 'y', '__slots__', '_asdict',
'__dict__', 'x', '_fields'])
>>> sys.getsizeof(nt.__dict__)
280
>>> sys.getsizeof(t.__dict__)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute '__dict__'



 So a named-tuple has 6+n (where n is the number of fields) extra
 attributes, but it seems that namedtuples & tuples occupy
 the same amount of space (72).

 Additionally, pulling up a second console and issuing

   ps v | grep [p]ython

 shows the memory usage of the process as I perform these, and after
 them, and they both show the same usage (actual test was

 1) pull up a fresh python
 2) import sys, collections; A = collections.namedtuple('A',['x','y'])
 3) check memory usage in other window
 4a) x = (1,2)
 4b) x = A(1,2)
 5) check memory usage again in other window
 6) quit python

 performing 4a on one run, and 4b on the second run.

 Both showed identical memory usage as well (Debian Linux (Stable),
 stock Python 2.6.6) at the system level.


Python uses memory pools for small memory allocations. I don't think it's
possible to tell from the outside how much memory is being used at such a
fine level.
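
One crude way to see the gap between object-level accounting and what the OS
reports is to compare a sys.getsizeof() total against the process RSS (a
Linux-only sketch; it reads /proc, and the numbers are only indicative):

import sys

def rss_kb():
    # 'VmRSS:   123456 kB' line from /proc/self/status
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

before = rss_kb()
data = [str(i) * 10 for i in xrange(10 ** 6)]    # a million smallish strings
after = rss_kb()

accounted = sum(sys.getsizeof(s) for s in data) + sys.getsizeof(data)
print 'getsizeof total: %.1f MB' % (accounted / 1e6)
print 'RSS growth:      %.1f MB' % ((after - before) / 1e3)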

Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Memory usage per top 10x usage per heapy

2012-09-24 Thread MrsEntity
Hi all,

I'm working on some code that parses a 500kb, 2M line file line by line and 
saves, per line, some derived strings into various data structures. I thus 
expect that memory use should monotonically increase. Currently, the program is 
taking up so much memory - even on 1/2 sized files - that on a 2GB machine I'm 
thrashing swap. What's strange is that heapy (http://guppy-pe.sourceforge.net/) 
is showing that the code uses about 10x less memory than reported by top, and 
the heapy data seems consistent with what I was expecting based on the objects 
the code stores. I tried using memory_profiler 
(http://pypi.python.org/pypi/memory_profiler) but it didn't really provide any 
illuminating information. The code does create and discard a number of objects 
per line of the file, but they should not be stored anywhere, and heapy seems 
to confirm that. So, my questions are:

1) For those of you kind enough to help me figure out what's going on, what 
additional data would you like? I didn't want to swamp everyone with the code and 
heapy/memory_profiler output but I can do so if it's valuable.
2) How can I diagnose (and hopefully fix) what's causing the massive memory 
usage when it appears, from heapy, that the code is performing reasonably?
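
(For readers who haven't used heapy: the tables quoted elsewhere in the thread
come from something along these lines; hpy() is guppy's entry point, and exactly
where the calls go is of course specific to the real code:)

from guppy import hpy

hp = hpy()
hp.setrelheap()     # only count objects allocated after this point

# ... parse the file and populate the data structures ...

print hp.heap()     # per-type breakdown of live objects, as quoted above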

Specs: Ubuntu 12.04 in Virtualbox on Win7/64, Python 2.7/64

Thanks very much.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-24 Thread Tim Chase
On 09/24/12 16:59, MrsEntity wrote:
 I'm working on some code that parses a 500kb, 2M line file line
 by line and saves, per line, some derived strings into various
 data structures. I thus expect that memory use should
 monotonically increase. Currently, the program is taking up so
 much memory - even on 1/2 sized files - that on 2GB machine I'm
 thrashing swap.

It might help to know what comprises the "into various data
structures".  I do a lot of ETL work on far larger files,
with similar machine specs, and rarely touch swap.

 2) How can I diagnose (and hopefully fix) what's causing the
 massive memory usage when it appears, from heapy, that the code
 is performing reasonably?

I seem to recall that Python holds on to memory that the VM
releases, but that it *should* reuse it later.  So you'd get
the symptom of the memory-usage always increasing, never
decreasing.

Things that occur to me:

- check how you're reading the data:  are you iterating over
  the lines a row at a time, or are you using
  .read()/.readlines() to pull in the whole file and then
  operate on that?

- check how you're storing them:  are you holding onto more
  than you think you are?  Would it hurt to switch from a
  dict to store your data (I'm assuming here) to using the
  anydbm module to temporarily persist the large quantity of
  data out to disk in order to keep memory usage lower?

Without actual code, it's hard to do a more detailed
analysis.
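
A sketch of the anydbm idea mentioned above (keys and values must be byte
strings; the file and field names here are invented):

import anydbm     # the dbm package in Python 3
import hashlib

id2md5 = anydbm.open('id2md5.db', 'n')       # spilled to disk, not RAM
md52info = anydbm.open('md52info.db', 'n')

with open('input.txt') as f:
    for lineno, line in enumerate(f):
        ident = line.split(None, 1)[0]
        digest = hashlib.md5(line).hexdigest()
        id2md5[ident] = digest
        md52info[digest] = '%d\t%s' % (lineno, 'SomeType')

id2md5.close()
md52info.close()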

-tkc
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-24 Thread Junkshops

Hi Tim, thanks for the response.


- check how you're reading the data:  are you iterating over
   the lines a row at a time, or are you using
   .read()/.readlines() to pull in the whole file and then
   operate on that?
I'm using enumerate() on an iterable input (which in this case is the 
filehandle).



- check how you're storing them:  are you holding onto more
   than you think you are?
I've used ipython to look through my data structures (without going into 
ungainly detail, 2 dicts with X numbers of key/value pairs, where X = 
number of lines in the file), and everything seems to be working 
correctly. Like I say, heapy output looks reasonable - I don't see 
anything surprising there. In one dict I'm storing a id string (the 
first token in each line of the file) with values as (again, without 
going into massive detail) the md5 of the contents of the line. The 
second dict has the md5 as the key and an object with __slots__ set that 
stores the line number of the file and the type of object that line 
represents.



Would it hurt to switch from a
   dict to store your data (I'm assuming here) to using the
   anydbm module to temporarily persist the large quantity of
   data out to disk in order to keep memory usage lower?
That's the thing though - according to heapy, the memory usage *is* low 
and is more or less what I expect. What I don't understand is why top is 
reporting such vastly different memory usage. If a memory profiler is 
saying everything's ok, it makes it very difficult to figure out what's 
causing the problem. Based on heapy, a db based solution would be 
serious overkill.


-MrsE

On 9/24/2012 4:22 PM, Tim Chase wrote:

On 09/24/12 16:59, MrsEntity wrote:

I'm working on some code that parses a 500kb, 2M line file line
by line and saves, per line, some derived strings into various
data structures. I thus expect that memory use should
monotonically increase. Currently, the program is taking up so
much memory - even on 1/2 sized files - that on 2GB machine I'm
thrashing swap.

It might help to know what comprises the "into various data
structures".  I do a lot of ETL work on far larger files,
with similar machine specs, and rarely touch swap.


2) How can I diagnose (and hopefully fix) what's causing the
massive memory usage when it appears, from heapy, that the code
is performing reasonably?

I seem to recall that Python holds on to memory that the VM
releases, but that it *should* reuse it later.  So you'd get
the symptom of the memory-usage always increasing, never
decreasing.

Things that occur to me:

- check how you're reading the data:  are you iterating over
   the lines a row at a time, or are you using
   .read()/.readlines() to pull in the whole file and then
   operate on that?

- check how you're storing them:  are you holding onto more
   than you think you are?  Would it hurt to switch from a
   dict to store your data (I'm assuming here) to using the
   anydbm module to temporarily persist the large quantity of
   data out to disk in order to keep memory usage lower?

Without actual code, it's hard to do a more detailed
analysis.

-tkc


--
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-24 Thread Dave Angel
On 09/24/2012 05:59 PM, MrsEntity wrote:
 Hi all,

 I'm working on some code that parses a 500kb, 2M line file 

Just curious;  which is it, two million lines, or half a million bytes?

 line by line and saves, per line, some derived strings into various data 
 structures. I thus expect that memory use should monotonically increase. 
 Currently, the program is taking up so much memory - even on 1/2 sized files 
 - that on 2GB machine 

which machine is 2gb, the Windows machine, or the VM?  You could get
thrashing at either level.

 I'm thrashing swap. What's strange is that heapy 
 (http://guppy-pe.sourceforge.net/) is showing that the code uses about 10x 
 less memory than reported by top, and the heapy data seems consistent with 
 what I was expecting based on the objects the code stores. I tried using 
 memory_profiler (http://pypi.python.org/pypi/memory_profiler) but it didn't 
 really provide any illuminating information. The code does create and discard 
 a number of objects per line of the file, but they should not be stored 
 anywhere, and heapy seems to confirm that. So, my questions are:

 1) For those of you kind enough to help me figure out what's going on, what 
 additional data would you like? I didn't want swamp everyone with the code 
 and heapy/memory_profiler output but I can do so if it's valuable.
 2) How can I diagnose (and hopefully fix) what's causing the massive memory 
 usage when it appears, from heapy, that the code is performing reasonably?

 Specs: Ubuntu 12.04 in Virtualbox on Win7/64, Python 2.7/64

 Thanks very much.

Tim raised most of my concerns, but I would point out that just because
you free up the memory from the Python doesn't mean it gets released
back to the system.  The C runtime manages its own heap, and is pretty
persistent about hanging onto memory once obtained.  It's not normally a
problem, since most small blocks are reused.  But it can get
fragmented.  And I have no idea how well Virtual Box maps the Linux
memory map into the Windows one.



-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Memory usage per top 10x usage per heapy

2012-09-24 Thread Junkshops

Just curious;  which is it, two million lines, or half a million bytes?
I have, in fact, this very afternoon, invented a means of writing a 
carriage return character using only 2 bits of information. I am 
prepared to sell licenses to this revolutionary technology for the low 
price of $29.95 plus tax.


Sorry, that should've been a 500Mb, 2M line file.


which machine is 2gb, the Windows machine, or the VM?

VM. Winders is 4gb.


...but I would point out that just because
you free up the memory from the Python doesn't mean it gets released
back to the system.  The C runtime manages its own heap, and is pretty
persistent about hanging onto memory once obtained.  It's not normally a
problem, since most small blocks are reused.  But it can get
fragmented.  And i have no idea how well Virtual Box maps the Linux
memory map into the Windows one.
Right, I understand that - but what's confusing me is that, given the 
memory use is (I assume) monotonically increasing, the code should never 
use more than what's reported by heapy once all the data is loaded into 
memory, given that memory released by the code to the Python runtime is 
reused. To the best of my ability to tell I'm not storing anything I 
shouldn't, so the only thing I can think of is that all the object 
creation and destruction, for some reason, is preventing reuse of 
memory. I'm at a bit of a loss regarding what to try next.


Cheers, MrsE

On 9/24/2012 6:14 PM, Dave Angel wrote:

On 09/24/2012 05:59 PM, MrsEntity wrote:

Hi all,

I'm working on some code that parses a 500kb, 2M line file

Just curious;  which is it, two million lines, or half a million bytes?


line by line and saves, per line, some derived strings into various data 
structures. I thus expect that memory use should monotonically increase. 
Currently, the program is taking up so much memory - even on 1/2 sized files - 
that on 2GB machine

which machine is 2gb, the Windows machine, or the VM?  You could get
thrashing at either level.


I'm thrashing swap. What's strange is that heapy 
(http://guppy-pe.sourceforge.net/) is showing that the code uses about 10x less 
memory than reported by top, and the heapy data seems consistent with what I 
was expecting based on the objects the code stores. I tried using 
memory_profiler (http://pypi.python.org/pypi/memory_profiler) but it didn't 
really provide any illuminating information. The code does create and discard a 
number of objects per line of the file, but they should not be stored anywhere, 
and heapy seems to confirm that. So, my questions are:

1) For those of you kind enough to help me figure out what's going on, what 
additional data would you like? I didn't want swamp everyone with the code and 
heapy/memory_profiler output but I can do so if it's valuable.
2) How can I diagnose (and hopefully fix) what's causing the massive memory 
usage when it appears, from heapy, that the code is performing reasonably?

Specs: Ubuntu 12.04 in Virtualbox on Win7/64, Python 2.7/64

Thanks very much.

Tim raised most of my concerns, but I would point out that just because
you free up the memory from the Python doesn't mean it gets released
back to the system.  The C runtime manages its own heap, and is pretty
persistent about hanging onto memory once obtained.  It's not normally a
problem, since most small blocks are reused.  But it can get
fragmented.  And i have no idea how well Virtual Box maps the Linux
memory map into the Windows one.




--
http://mail.python.org/mailman/listinfo/python-list