On 10/16/2012 01:03 PM, Prasad, Ramit wrote:
Abhishek Pratap wrote:
Sent: Tuesday, October 16, 2012 11:57 AM
To: tutor@python.org
Subject: [Tutor] managing memory large dictionaries in python

Hi Guys

For my problem I need to store 400-800 million 20-character keys in a
dictionary and do counting. This data structure takes about 60-100 GB
of RAM.
I am wondering if there are slick ways to map the dictionary to a file
on disk, not store it in memory, but still access it as a dictionary
object. Speed is not the main concern in this problem, and persistence
is not needed, as the counting will only be done once on the data. We
want the script to run on smaller-memory machines if possible.
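
A minimal sketch of that all-in-memory counting, assuming the keys arrive
one per line in a plain text file (the file name here is made up):

    from collections import Counter

    counts = Counter()
    with open("keys.txt") as fh:          # hypothetical input file
        for line in fh:
            counts[line.strip()] += 1     # every distinct key stays in RAM

With 400-800 million distinct keys, that one dictionary is what consumes
the 60-100 GB.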

I did think about databases for this, but intuitively it looks like
overkill, because for each key you have to first check whether it is
already present and increase the count by 1, and if not, insert
the key into the database.

Just want to take your opinion on this.

Thanks!
-Abhi

I do not think that a database would be overkill for this type of task.

Agreed.

Your process may be trivial but the amount of data it has to manage is not 
trivial. You can use a simple database like SQLite. Otherwise, you
could create a file for each key and update the count in there. It will
run on a small amount of memory but will be slower than using a db.
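
A rough sketch of the SQLite route, again assuming the keys come one per
line from a text file (the file, table, and column names are invented for
illustration):

    import sqlite3

    conn = sqlite3.connect("counts.db")   # on-disk database file
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts (key TEXT PRIMARY KEY, n INTEGER)")

    with open("keys.txt") as fh:          # hypothetical input file
        for line in fh:
            key = line.strip()
            # The check-then-increment-or-insert step collapses into two
            # statements; the table lives on disk, not in RAM.
            conn.execute("INSERT OR IGNORE INTO counts VALUES (?, 0)", (key,))
            conn.execute("UPDATE counts SET n = n + 1 WHERE key = ?", (key,))

    conn.commit()
    conn.close()

It will be slow, but the memory footprint stays small because SQLite keeps
the table on disk rather than in RAM.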

Well, maybe -- it depends on how many unique entries exist. Most vanilla systems are going to crash (or give the appearance thereof) if you end up with millions of file entries in a directory. If a filesystem-based answer is sought, I'd consider generating a 16-bit CRC for each key and appending the key to a file named after that CRC, then making a pass over each of those files, sorting and doing the final counting.
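
A sketch of that bucketing idea (using a Counter per bucket in place of an
explicit sort), assuming zlib.crc32 as the hash and a scratch directory of
up to 65,536 bucket files; all file and directory names here are invented:

    import os
    import zlib
    from collections import Counter

    BUCKET_DIR = "buckets"                  # hypothetical scratch directory
    os.makedirs(BUCKET_DIR, exist_ok=True)

    # Pass 1: scatter each key into a file named by the low 16 bits of its CRC.
    with open("keys.txt") as fh:            # hypothetical input file
        for line in fh:
            key = line.strip()
            bucket = zlib.crc32(key.encode()) & 0xFFFF
            # Opening per key is slow; a real run would buffer writes per bucket.
            with open(os.path.join(BUCKET_DIR, "%04x" % bucket), "a") as out:
                out.write(key + "\n")

    # Pass 2: every occurrence of a given key lands in the same bucket, so
    # each bucket can be counted on its own in a small amount of memory.
    with open("key_counts.txt", "w") as out:    # hypothetical output file
        for name in sorted(os.listdir(BUCKET_DIR)):
            bucket_counts = Counter()
            with open(os.path.join(BUCKET_DIR, name)) as fh:
                bucket_counts.update(line.strip() for line in fh)
            for key, n in bucket_counts.items():
                out.write("%s\t%d\n" % (key, n))

Each bucket holds only a small fraction of the keys, and the scratch
directory never holds more than 65,536 entries, which sidesteps the
millions-of-files problem.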

Emile
