On Jan 28, 6:08 pm, Aaron Brady <castiro...@gmail.com> wrote:
> On Jan 28, 4:43 pm, perfr...@gmail.com wrote:
>
> > On Jan 28, 5:14 pm, John Machin <sjmac...@lexicon.net> wrote:
>
> > > On Jan 29, 3:13 am, perfr...@gmail.com wrote:
>
> > > > hello all,
>
> > > > i have a large dictionary which contains about 10 keys, each key has a
> > > > value which is a list containing about 1 to 5 million (small)
> > > > dictionaries. for example,
>
> > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f':
> > > > 'world'}, ...],
> > > > key2: [...]}
>
> > > > in total there are about 10 to 15 million lists if we concatenate
> > > > together all the values of every key in 'mydict'. mydict is a
> > > > structure that represents data in a very large file (about 800
> > > > megabytes).
>
> > snip
>
> > in reply to the other poster: i thought 'shelve' simply calls pickle.
> > if thats the case, it wouldnt be any faster, right ?
>
> Yes, but not all at once. It's a clear winner if you need to update
> any of them later, but if it's just write-once, read-many, it's about
> the same.
>
> You said you have a million dictionaries. Even if each took only one
> byte, you would still have a million bytes. Do you expect a faster I/O
> time than the time it takes to write a million bytes?
>
> I want to agree with John's worry about RAM, unless you have several+
> GB, as you say. You are not dealing with small numbers.
in my case, i just write the pickle file once and then read it in later. in that case, cPickle and shelve would be identical, if i understand correctly?

the file i'm reading in is an ~800 MB file, and the pickle file is around 300 MB. even if it were 800 MB, it doesn't make sense to me that python's i/o would be that slow... it takes roughly 5 seconds to write one megabyte of a binary file (the pickled object in this case), which just seems wrong. does anyone know anything about this? for example, how can the i/o be sped up?

the dictionary might have a million keys, but each key's value is very small. i tried the same example where the keys are short strings (and there are about 10-15 million of them) and each value is an integer, and it is still very slow. does anyone know how to test whether i/o is the bottleneck, or whether it's something specific about pickle?

thanks.
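
One way to approach the last question is to time the two steps separately: serialize to an in-memory byte string first (no disk involved), then write those bytes to a file (no pickling involved). The sketch below is only an illustration under some assumptions: it uses Python 3's `pickle` module rather than `cPickle`, the helper name `time_pickle_vs_io` is made up for this example, and the sample dictionary is a small stand-in for the real data.

```python
import os
import pickle
import tempfile
import time


def time_pickle_vs_io(obj, protocol=pickle.HIGHEST_PROTOCOL):
    """Return (serialize_seconds, write_seconds, n_bytes) for obj.

    Hypothetical helper: it separates the cost of pickling from the
    cost of writing the resulting bytes to disk.
    """
    # Step 1: serialize in memory only, so no file I/O is measured.
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=protocol)
    serialize_seconds = time.perf_counter() - start

    # Step 2: write the already-serialized bytes, so no pickling is measured.
    fd, path = tempfile.mkstemp(suffix=".pkl")
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to disk
        write_seconds = time.perf_counter() - start
    finally:
        os.remove(path)

    return serialize_seconds, write_seconds, len(payload)


if __name__ == "__main__":
    # Stand-in data (assumption): short string keys mapped to integers.
    sample = {"key%d" % i: i for i in range(200000)}
    ser, wr, size = time_pickle_vs_io(sample)
    print("pickled size: %d bytes" % size)
    print("serialize:    %.3f s" % ser)
    print("write:        %.3f s" % wr)
```

If the serialize time dominates, the disk is probably not the problem. In that case it may be worth checking the pickle protocol: in Python 2, `cPickle.dump` uses the text-based protocol 0 unless a protocol is passed explicitly, and a binary protocol is usually faster and produces smaller output.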