On 10/29/2011 03:00 AM, Steven D'Aprano wrote:
> On Fri, 28 Oct 2011 22:47:42 +0200, Gelonida N wrote:
>
>> Hi,
>>
>> I would like to save many dicts with a fixed amount of keys tuples to a
>> file in a memory efficient manner (no random, but only sequential
>> access is required)
>
> What do you mean "keys tuples"?

Corrected phrase: I would like to save many dicts with a fixed (and
known) amount of keys in a memory-efficient manner (no random, but only
sequential access is required) to a file (which can later be sent over a
slow, expensive network to other machines).

Example: every dict will have the keys 'timestamp', 'floatvalue',
'intvalue', 'message1', 'message2'.

'timestamp' is an integer
'floatvalue' is a float
'intvalue' is an int
'message1' is a string with a length of max 2000 characters, but can
often be very short
'message2' is the same as message1

So a typical dict will look like:

{
    'timestamp': 12,
    'floatvalue': 3.14159,
    'intvalue': 42,
    'message1': '',
    'message2': '=' * 1999,
}

> What do you call "many"? Fifty? A thousand? A thousand million? How many
> items in each dict? Ten? A million?

File size can be between 100 KB and over 100 MB per file. Files will be
accumulated over months.

I just want to use the smallest possible space, as the data is collected
over a certain time (days / months) and will be transferred via
UMTS / EDGE / GSM networks, where the transfer already takes several
minutes even for quite small data sets. I want to reduce the transfer
time when requesting files on demand (and the amount of data, in order
not to exceed the monthly quota).

>> As the keys are the same for each entry I considered converting them to
>> tuples.
>
> I don't even understand what that means. You're going to convert the keys
> to tuples? What will that accomplish?

Corrected phrase: As the keys are the same for each entry I considered
converting them (the before-mentioned dicts) to tuples.

So the dict

{
    'timestamp': 12,
    'floatvalue': 3.14159,
    'intvalue': 42,
    'message1': '',
    'message2': '=' * 1999,
}

would become

[12, 3.14159, 42, '', '=' * 1999]

>> The tuples contain only strings, ints (long ints) and floats (double)
>> and the data types for each position within the tuple are fixed.
>>
>> The fastest and simplest way is to pickle the data or to use json.
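The dict-to-tuple conversion described above can be sketched as follows
(the helper names and the key-order constant are my own; any fixed order
works, as long as writer and reader agree on it):

```python
# Fixed key order shared by writer and reader (assumed order).
KEYS = ('timestamp', 'floatvalue', 'intvalue', 'message1', 'message2')

def dict_to_tuple(d):
    """Flatten one record dict into a tuple, dropping the repeated keys."""
    return tuple(d[k] for k in KEYS)

def tuple_to_dict(t):
    """Rebuild the original dict from a flattened tuple."""
    return dict(zip(KEYS, t))

record = {
    'timestamp': 12,
    'floatvalue': 3.14159,
    'intvalue': 42,
    'message1': '',
    'message2': '=' * 1999,
}
flat = dict_to_tuple(record)
assert tuple_to_dict(flat) == record
```

Pickling the flat tuples instead of the dicts already removes the five
key strings from every record.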
>> Both formats however are not that optimal.
>
> How big are your JSON files? 10KB? 10MB? 10GB?
>
> Have you tried using pickle's space-efficient binary format instead of
> text format? Try using protocol=2 when you call pickle.Pickler.

No. This is probably already a big step forward.

As I know the data types of each element in the tuple, I would however
prefer a representation which does not store the data types for each
tuple over and over again (as they are the same for each dict / tuple).

> Or have you considered simply compressing the files?

Compression makes sense, but the initial file format should already be
rather 'compact'.

>> I could store ints and floats with pack. As strings have variable length
>> I'm not sure how to save them efficiently (except adding a length first
>> and then the string).
>
> This isn't 1980 and you're very unlikely to be using 720KB floppies.
> Premature optimization is the root of all evil. Keep in mind that when
> you save a file to disk, even if it contains only a single bit of data,
> the actual space used will be an entire block, which on modern hard
> drives is very likely to be 4KB. Trying to compress files smaller than a
> single block doesn't actually save you any space.
>
>> Is there already some 'standard' way or standard library to store such
>> data efficiently?
>
> Yes. Pickle and JSON plus zip or gzip.

pickle protocol 2 + gzip of the tuples derived from the dicts might be
good enough for a start.

I have to create a little more typical data in order to see what
percentage of my payload would consist of repeating the data types for
each tuple.

-- 
http://mail.python.org/mailman/listinfo/python-list
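The "struct.pack plus length-prefixed strings, then gzip" idea from the
thread could look roughly like this (a sketch, not the poster's actual
code; the format codes '<qdi' for an int64 timestamp, a double and an
int32, and the 4-byte length prefix, are assumptions):

```python
import gzip
import struct

# Fixed-type header packed once per record, with no per-record type info.
HEADER = struct.Struct('<qdi')  # timestamp (int64), floatvalue (double), intvalue (int32)

def write_record(f, timestamp, floatvalue, intvalue, message1, message2):
    """Append one record: fixed-size header, then two length-prefixed strings."""
    f.write(HEADER.pack(timestamp, floatvalue, intvalue))
    for msg in (message1, message2):
        data = msg.encode('utf-8')
        f.write(struct.pack('<I', len(data)))  # 4-byte length prefix
        f.write(data)

def read_record(f):
    """Read the next record sequentially; return None at end of file."""
    header = f.read(HEADER.size)
    if not header:
        return None
    timestamp, floatvalue, intvalue = HEADER.unpack(header)
    messages = []
    for _ in range(2):
        (length,) = struct.unpack('<I', f.read(4))
        messages.append(f.read(length).decode('utf-8'))
    return (timestamp, floatvalue, intvalue, messages[0], messages[1])

# Sequential write and read through gzip, matching the "compact format,
# then compress" plan discussed above.
with gzip.open('records.bin.gz', 'wb') as f:
    write_record(f, 12, 3.14159, 42, '', '=' * 1999)

with gzip.open('records.bin.gz', 'rb') as f:
    rec = read_record(f)
assert rec == (12, 3.14159, 42, '', '=' * 1999)
```

The header costs 20 bytes per record and each string costs its byte
length plus 4, so nothing about the types is repeated; gzip then
squeezes the redundancy inside the messages themselves.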