I have a very large dictionary object that is built from a text file
that is about 800 MB -- it contains several million keys. Ideally I
would like to pickle this object so that I wouldn't have to parse the
large file to recompute the dictionary every time I run my program.
Currently, however, the pickled file is over 300 MB and takes a very
long time to write to disk -- even longer than recomputing the
dictionary from scratch.

I would like to split the dictionary into smaller ones, each containing
only hundreds of thousands of keys, and then try to pickle those
instead. Is there an easy way to do this?
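To answer the splitting question directly: a minimal sketch is to shard
the keys by hash into N smaller dicts and pickle each shard to its own
file (the shard count and file names below are just placeholders).
Also note that dumping with a binary protocol (pickle.HIGHEST_PROTOCOL)
instead of the default text protocol is usually far faster and smaller,
which may be worth trying before splitting anything:

  import cPickle as pickle

  NUM_SHARDS = 16   # hypothetical shard count -- tune to taste

  def shard_file(i):
      return "bigdict.%02d.pkl" % i

  def save_sharded(bigdict):
      # distribute the keys across NUM_SHARDS smaller dicts by hashing each key
      shards = [{} for _ in range(NUM_SHARDS)]
      for key, value in bigdict.iteritems():
          shards[hash(key) % NUM_SHARDS][key] = value
      # pickle each shard separately, using the (much faster) binary protocol
      for i, shard in enumerate(shards):
          f = open(shard_file(i), "wb")
          pickle.dump(shard, f, pickle.HIGHEST_PROTOCOL)
          f.close()

  def lookup(key):
      # only the one shard that can contain the key needs to be unpickled
      f = open(shard_file(hash(key) % NUM_SHARDS), "rb")
      shard = pickle.load(f)
      f.close()
      return shard[key]

Of course each lookup still unpickles an entire shard, which is why the
on-disk dictionary approach below tends to scale better.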

While others have suggested databases, they may be overkill, depending on your needs. Python 2.5+ ships with the sqlite3 module, and even older versions (at least back to 2.0) offer the anydbm module (renamed "dbm" in 3.0), which lets you create an on-disk string-to-string dictionary:

  import anydbm
  import csv

  db = anydbm.open("data.db", "c")   # "c": create the file if it doesn't exist

  # populate the data, using "db" as your dictionary
  f = open("800megs.txt")
  data = csv.reader(f, delimiter='\t')
  data.next()  # discard the header row
  for key, value in data:
      db[key] = value
  f.close()

  print db["some key"]

  db.close()

The resulting DB object is only sparsely documented, but for the most part it can be treated like a dictionary. The advantage is that, if the source data doesn't change, you can parse it once and then just use your "data.db" file from there on out.
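Reopening it later is just as cheap; for example (assuming the "data.db"
file built above):

  import anydbm

  db = anydbm.open("data.db", "r")   # "r": open an existing file read-only
  print db["some key"]
  db.close()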

-tkc
