On Jan 30, 2:44 pm, perfr...@gmail.com wrote:
> On Jan 28, 6:08 pm, Aaron Brady <castiro...@gmail.com> wrote:
> > On Jan 28, 4:43 pm, perfr...@gmail.com wrote:
> > > On Jan 28, 5:14 pm, John Machin <sjmac...@lexicon.net> wrote:
> > > > On Jan 29, 3:13 am, perfr...@gmail.com wrote:
> > > > > hello all,
> > > > >
> > > > > i have a large dictionary which contains about 10 keys, each key has a
> > > > > value which is a list containing about 1 to 5 million (small)
> > > > > dictionaries. for example,
> > > > >
> > > > > mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
> > > > > 'world'}, ...],
> > > > >           key2: [...]}
> > > > >
> > > > > in total there are about 10 to 15 million lists if we concatenate
> > > > > together all the values of every key in 'mydict'. mydict is a
> > > > > structure that represents data in a very large file (about 800
> > > > > megabytes).
> > > snip
> > > in reply to the other poster: i thought 'shelve' simply calls pickle.
> > > if that's the case, it wouldn't be any faster, right?
> >
> > Yes, but not all at once. It's a clear winner if you need to update
> > any of them later, but if it's just write-once, read-many, it's about
> > the same.
> >
> > You said you have a million dictionaries. Even if each took only one
> > byte, you would still have a million bytes. Do you expect a faster I/O
> > time than the time it takes to write a million bytes?
> >
> > I want to agree with John's worry about RAM, unless you have several+
> > GB, as you say. You are not dealing with small numbers.
>
> in my case, i just write the pickle file once and then read it in
> later. in that case, cPickle and shelve would be identical, if i
> understand correctly?
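To make the write-once/read-many distinction concrete, here is a small sketch of the two routes on a toy stand-in for 'mydict'. (This uses Python 3's 'pickle' and 'shelve'; in the Python 2 of this thread you would import 'cPickle' instead. The file names are made up for the example.)

```python
import os
import pickle
import shelve
import tempfile

# Toy stand-in for 'mydict' (the real one has millions of small dicts).
mydict = {'key1': [{'a': 1, 'b': 2, 'c': 'hello'},
                   {'d': 3, 'e': 4, 'f': 'world'}],
          'key2': [{'g': 5}]}

tmpdir = tempfile.mkdtemp()

# pickle route: one monolithic dump, one monolithic load.
pkl_path = os.path.join(tmpdir, 'mydict.pkl')
with open(pkl_path, 'wb') as f:
    pickle.dump(mydict, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pkl_path, 'rb') as f:
    restored = pickle.load(f)          # everything comes back at once

# shelve route: each top-level key is pickled separately, so one key
# can be read or rewritten later without touching the others.
shelf = shelve.open(os.path.join(tmpdir, 'mydict.shelf'))
for k, v in mydict.items():
    shelf[k] = v
one_key = shelf['key1']                # only this value is unpickled
shelf.close()
```

For a dump-once, load-once workload the two end up doing comparable total work, which is the point being made above; shelve only pays off when you need per-key access later.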
No, not identical. 'shelve' is not a dictionary, it's a database object that implements the mapping protocol. 'isinstance( shelve, dict )' is False, for example.

> the file i'm reading in is ~800 MB, and the pickle file is around
> 300 MB. even if it were 800 MB, it doesn't make sense to me that
> python's i/o would be that slow... it takes roughly 5 seconds to write
> one megabyte of a binary file (the pickled object in this case), which
> just seems wrong. does anyone know anything about this? about how i/o
> can be sped up, for example?

You can try copying a 1-MB file. Or something like:

f = open( 'temp.temp', 'w' )
for x in range( 100000 ):
    f.write( '0' * 10 )

You know how long it takes OSes to boot, right?

> the dictionary might have a million keys, but each key's value is very
> small. i tried the same example where the keys are short strings (and
> there are about 10-15 million of them) and each value is an integer,
> and it is still very slow. does anyone know how to test whether i/o is
> the bottleneck, or whether it's something specific about pickle?
>
> thanks.

You could fall back to storing a parallel list by hand, if you're just using string and numeric primitives.
--
http://mail.python.org/mailman/listinfo/python-list
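A minimal sketch of what "storing a parallel list by hand" could look like for string keys and integer values, using 'struct' to write a flat length-prefixed record per entry instead of pickling millions of tiny objects. All names here ('parallel.bin', the record layout) are invented for the example, not anything from the thread:

```python
import os
import struct
import tempfile

# Stand-ins for the real data: short string keys, small integer values.
keys = ['spam', 'eggs', 'ham']
values = [1, 2, 3]

path = os.path.join(tempfile.mkdtemp(), 'parallel.bin')

# Write: entry count, then (key length, key bytes, value) per entry.
with open(path, 'wb') as f:
    f.write(struct.pack('<I', len(keys)))
    for k, v in zip(keys, values):
        kb = k.encode('utf-8')
        f.write(struct.pack('<I', len(kb)))
        f.write(kb)
        f.write(struct.pack('<i', v))

# Read it all back into two parallel lists.
out_keys, out_values = [], []
with open(path, 'rb') as f:
    (n,) = struct.unpack('<I', f.read(4))
    for _ in range(n):
        (klen,) = struct.unpack('<I', f.read(4))
        out_keys.append(f.read(klen).decode('utf-8'))
        (v,) = struct.unpack('<i', f.read(4))
        out_values.append(v)
```

This skips pickle's per-object bookkeeping entirely, which is why it can help when the bottleneck turns out to be pickling overhead rather than raw disk I/O.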