Martin MOKREJŠ wrote:

Hi,
 could someone tell me what all does and what all doesn't copy
references in Python? I have found that my script, after reaching some
state and taking say 600MB, pushes its internal dictionaries
to hard disk. The for loop consumes another 300MB (as gathered
by vmstat) to push the data to dictionaries, then releases
a little bit less than 300MB, and the program starts to fill up
its internal dictionaries again; when "full" it will do the
flush again ...

 The point here is that this code takes a lot of extra memory.
I believe it's the references problem, and I remember complaints
from friends facing the same problem. I'm a newbie, yes, but I don't
have this problem with Perl. OK, I want to improve my Python
knowledge ... :-))

Right ho! In fact I suspect you are still quite new to programming as a whole, for reasons that may become clear as we proceed.



def push_to_disk(self):
    _dict_on_disk_tuple = (None,
        self._dict_on_disk1,  self._dict_on_disk2,  self._dict_on_disk3,
        self._dict_on_disk4,  self._dict_on_disk5,  self._dict_on_disk6,
        self._dict_on_disk7,  self._dict_on_disk8,  self._dict_on_disk9,
        self._dict_on_disk10, self._dict_on_disk11, self._dict_on_disk12,
        self._dict_on_disk13, self._dict_on_disk14, self._dict_on_disk15,
        self._dict_on_disk16, self._dict_on_disk17, self._dict_on_disk18,
        self._dict_on_disk19, self._dict_on_disk20)

It's a bit unfortunate that all those instance variables are global to the method, as it means we can't clearly see what you intend them to do. However ...


Whenever I see such code, it makes me suspect that the approach to the problem could be more subtle. It appears you have decided to partition your data into twenty chunks somehow. The algorithm is clearly not coded in a way that would make it easy to modify the number of chunks.

[Hint: by "easy" I mean modifying a statement that reads

    chunks = 20

to read

    chunks = 40

for example]. To avoid this, we might use (say) a list of temp dicts, whose length could then easily be parameterized as mentioned. So where (my psychic powers tell me) your __init__() method currently contains

    self._dict_on_disk1 = something()
    self._dict_on_disk2 = something()
        ...
    self._dict_on_disk20 = something()

I would have written

    self._disk_dicts = []
    for i in range(20):
        self._disk_dicts.append(something())

Then again, I probably have an advantage over you. I'm such a crappy typist I can guarantee I'd make at least six mistakes doing it your way :-)
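
To make that concrete, here's a minimal sketch of the constructor I have in mind (the class name, the chunks argument and the something() stub are all mine, purely for illustration):

    def something():
        # Stand-in for however you create one on-disk dictionary
        # (a bsddb file, say); purely illustrative.
        return {}

    class Counter(object):              # hypothetical class name
        def __init__(self, chunks=20):
            self._chunks = chunks
            # Parallel lists replace the twenty numbered attributes;
            # change chunks once and everything else follows.
            self._tmpdicts   = [{} for i in range(chunks)]
            self._disk_dicts = [something() for i in range(chunks)]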

    _size = 0

What with all these leading underscores I presume it must be VERY important to keep this object's instance variables private. Do you have a particular reason for that, or just general Perl-induced paranoia? :-)


    #
    # sizes of these tmpdicts range from 10-10000 entries for each!
    for _tmpdict in (self._tmpdict1,  self._tmpdict2,  self._tmpdict3,
                     self._tmpdict4,  self._tmpdict5,  self._tmpdict6,
                     self._tmpdict7,  self._tmpdict8,  self._tmpdict9,
                     self._tmpdict10, self._tmpdict11, self._tmpdict12,
                     self._tmpdict13, self._tmpdict14, self._tmpdict15,
                     self._tmpdict16, self._tmpdict17, self._tmpdict18,
                     self._tmpdict19, self._tmpdict20):
        _size += 1
        if _tmpdict:
            _dict_on_disk = _dict_on_disk_tuple[_size]
            for _word, _value in _tmpdict.iteritems():
                try:
                    _string = _dict_on_disk[_word]
                    # I discard _a and _b, maybe _string.find(' ')
                    # combined with slice would do better?
                    _abs_count, _a, _b, _expected_freq = _string.split()
                    _abs_count = int(_abs_count).__add__(_value)
                    _t = (str(_abs_count), '0', '0', '0')
                except KeyError:
                    _t = (str(_value), '0', '0', '0')

                # this writes a copy to the dict, right?
                _dict_on_disk[_word] = ' '.join(_t)

    #
    # clear the temporary dictionaries in ourself
    # I think this works as expected and really does release memory
    #
    for _tmpdict in (self._tmpdict1,  self._tmpdict2,  self._tmpdict3,
                     self._tmpdict4,  self._tmpdict5,  self._tmpdict6,
                     self._tmpdict7,  self._tmpdict8,  self._tmpdict9,
                     self._tmpdict10, self._tmpdict11, self._tmpdict12,
                     self._tmpdict13, self._tmpdict14, self._tmpdict15,
                     self._tmpdict16, self._tmpdict17, self._tmpdict18,
                     self._tmpdict19, self._tmpdict20):
        _tmpdict.clear()


There you go again with that huge tuple. You just like typing, don't you? You already wrote that one out just above. Couldn't you have assigned it to a local variable?

By the way, remind me again of the reason for the leading None in the _dict_on_disk_tuple, would you?
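
(Presumably it's there because _size is incremented before it's used as an index, so the useful entries sit at positions 1 through 20.) For what it's worth, here is a sketch of how the whole method shrinks once the dictionaries live in the parallel lists suggested above; zip() pairs them up, so both the leading None and the _size counter disappear. This assumes the self._tmpdicts and self._disk_dicts attributes from my earlier sketch:

    def push_to_disk(self):
        # Sketch only: self._tmpdicts and self._disk_dicts are the
        # parallel lists built in __init__() above.
        for _tmpdict, _dict_on_disk in zip(self._tmpdicts, self._disk_dicts):
            for _word, _value in _tmpdict.iteritems():
                try:
                    # first field holds the absolute count; the rest stay zeroed
                    _abs_count = int(_dict_on_disk[_word].split()[0]) + _value
                except KeyError:
                    _abs_count = _value
                _dict_on_disk[_word] = ' '.join((str(_abs_count), '0', '0', '0'))
            _tmpdict.clear()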

The crucial misunderstanding here might be the meaning of "release memory". While clearing the dictionary will indeed remove references to the objects formerly contained therein, and thus (possibly) render those items subject to garbage collection, that *won't* make the working set (i.e. virtual memory pages allocated to your process's data storage) any smaller. The garbage collector doesn't return memory to the operating system; it merely aggregates it for use in storing new Python objects.
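
You can watch this happen with a small experiment (Linux-specific, since it reads /proc; the exact numbers will vary, but the resident set size after the clear() will be barely smaller than before it):

    import os

    def rss():
        # Resident set size as the kernel reports it (Linux-specific).
        for line in open('/proc/%d/status' % os.getpid()):
            if line.startswith('VmRSS'):
                return line.strip()

    d = {}
    for i in xrange(500000):
        d[i] = str(i) * 10           # tens of MB of small objects
    print "after fill: ", rss()
    d.clear()                        # the objects become collectable ...
    print "after clear:", rss()      # ... but the pages stay with the process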



The above routine doesn't release the memory back when it exits.

And your evidence for this assertion is ...?

See, the loop takes 25 minutes already, and it keeps getting longer, while the program is only about 1/3 or 1/4 of the way through the total input. The rest of my code is fast in contrast to this (below 1 minute).

-rw-------  1 mmokrejs users 257376256 Jan 17 11:38 diskdict12.db
-rw-------  1 mmokrejs users 267157504 Jan 17 11:35 diskdict11.db
-rw-------  1 mmokrejs users 266534912 Jan 17 11:28 diskdict10.db
-rw-------  1 mmokrejs users 253149184 Jan 17 11:21 diskdict9.db
-rw-------  1 mmokrejs users 250232832 Jan 17 11:14 diskdict8.db
-rw-------  1 mmokrejs users 246349824 Jan 17 11:07 diskdict7.db
-rw-------  1 mmokrejs users 199999488 Jan 17 11:02 diskdict6.db
-rw-------  1 mmokrejs users  66584576 Jan 17 10:59 diskdict5.db
-rw-------  1 mmokrejs users   5750784 Jan 17 10:57 diskdict4.db
-rw-------  1 mmokrejs users    311296 Jan 17 10:57 diskdict3.db
-rw-------  1 mmokrejs users 295895040 Jan 17 10:56 diskdict20.db
-rw-------  1 mmokrejs users 293634048 Jan 17 10:49 diskdict19.db
-rw-------  1 mmokrejs users 299892736 Jan 17 10:43 diskdict18.db
-rw-------  1 mmokrejs users 272334848 Jan 17 10:36 diskdict17.db
-rw-------  1 mmokrejs users 274825216 Jan 17 10:30 diskdict16.db
-rw-------  1 mmokrejs users 273104896 Jan 17 10:23 diskdict15.db
-rw-------  1 mmokrejs users 272678912 Jan 17 10:18 diskdict14.db
-rw-------  1 mmokrejs users 260407296 Jan 17 10:13 diskdict13.db

   Some spoke about mmapped files. Could I take advantage of that
with the bsddb module?

No.

   Is gdbm better in some ways? Recently you have said dictionary
operations are fast ... Once more: I want to turn off locking support.
I can make the values strings of fixed size, if mmap() were
available. The number of keys doesn't grow much over time; mostly
there are only updates.

Also (possibly because I come late to this thread) I don't really understand your caching strategy. I presume at some stage you look in one of the twenty temp dicts, and if you don't find something you read it back in from disk?
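
If that presumption is right, I'd expect the lookup to go something like this (the hash-based chunk selection is my guess, not anything you've shown):

    def lookup(self, word):
        # Guesswork at the caching strategy: pick a chunk, consult
        # the in-memory dict first, fall back to the on-disk one.
        n = hash(word) % self._chunks
        try:
            return self._tmpdicts[n][word]
        except KeyError:
            return self._disk_dicts[n][word]   # may still raise KeyError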

This whole thing seems a little disorganized. Perhaps if you started with a small dataset your testing and development work would proceed more quickly, and you'd be less intimidated by the clear need to refactor your code.

regards
 Steve
--
Steve Holden               http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119