I have a very large dictionary object that is built from a text file
of about 800 MB -- it contains several million keys. Ideally I would
like to pickle this object so that I wouldn't have to parse the large
file to recompute the dictionary every time I run my program.
However, the pickled file is currently over 300 MB and takes a very
long time to write to disk -- even longer than recomputing the
dictionary from scratch.
I would like to split the dictionary into smaller ones, each
containing only hundreds of thousands of keys, and then pickle those
separately. Is there an easy way to do this?
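Something roughly like the following is what I have in mind (the
chunk size and the "dict_chunk" filename prefix are just arbitrary
examples):

import pickle
from itertools import islice

def pickle_in_chunks(big_dict, chunk_size=200000, prefix="dict_chunk"):
    # Slice the dictionary's items into batches of at most chunk_size
    # keys and dump each batch to its own file:
    # dict_chunk_0.pkl, dict_chunk_1.pkl, ...
    items = iter(big_dict.items())
    i = 0
    while True:
        chunk = dict(islice(items, chunk_size))
        if not chunk:
            break
        with open("%s_%d.pkl" % (prefix, i), "wb") as out:
            pickle.dump(chunk, out, pickle.HIGHEST_PROTOCOL)
        i += 1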
While others have suggested databases, they may be a bit
overkill, depending on your needs. Python 2.5+ supplies the
sqlite3 module, and even older versions (at least back to 2.0)
offer the anydbm module (renamed "dbm" in 3.0), which lets you
create an on-disk string-to-string dictionary:
import anydbm
db = anydbm.open("data.db", "c")
# populate some data
# using "db" as your dictionary
import csv
f = file("800megs.txt")
data = csv.reader(f, delimiter='\t')
data.next() # discard a header row
for key, value in data:
    db[key] = value
f.close()
print db["some key"]
db.close()
The resulting DB object is a little sparsely documented, but for
the most part it can be treated like a dictionary. The advantage
is that, if the source data doesn't change, you can parse it once
and then just use your "data.db" file from there on out.
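For example, a later run could skip the 800-meg parse entirely and
just reopen the file built above (same "data.db" name and key as in
the example):

import anydbm

# Subsequent runs reuse the already-built file; "r" opens it read-only.
db = anydbm.open("data.db", "r")
print db["some key"]
db.close()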
-tkc