Well, maybe something like a simple class emulating a dictionary that stores key-value pairs on disk would be more than enough. Then you can use whatever persistence layer you want (even HDF5, but not necessarily).
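A minimal sketch of what such a dictionary-like class could look like, assuming one .npz file per integer key. This is an illustration only, not the code from the author's gist: the file layout, the `ints`/`floats` array names and the `compress` flag are all assumptions made for the example.

```python
import os
import tempfile

import numpy as np


class KeyStore:
    """Dictionary-like store keeping one .npz file per key (illustrative sketch)."""

    def __init__(self, rootdir, compress=False):
        # 'compress' is a hypothetical flag: True selects zlib-compressed .npz files.
        self.rootdir = rootdir
        self.compress = compress
        os.makedirs(rootdir, exist_ok=True)

    def _path(self, key):
        # Assumed layout: <rootdir>/<key>.npz
        return os.path.join(self.rootdir, f"{int(key)}.npz")

    def __setitem__(self, key, value):
        ints, floats = value  # the value is a pair of arrays
        save = np.savez_compressed if self.compress else np.savez
        save(self._path(key), ints=ints, floats=floats)

    def __getitem__(self, key):
        path = self._path(key)
        if not os.path.exists(path):
            raise KeyError(key)
        with np.load(path) as data:
            # Indexing the NpzFile reads each array fully into memory.
            return [data["ints"], data["floats"]]

    def __contains__(self, key):
        return os.path.exists(self._path(key))


# Usage: store two variable-length pairs of arrays and read one back.
with tempfile.TemporaryDirectory() as tmpdir:
    store = KeyStore(tmpdir, compress=True)
    store[0] = [np.array([1, 8, 15]), np.array([0.1, 0.1, 0.1])]
    store[1] = [np.array([5, 6]), np.array([0.5, 0.5])]
    ints, floats = store[1]
    print(ints, floats)
```

Because every key lives in its own file, several processes can read the store concurrently without sharing any in-memory object; the cost is one file open per lookup.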
As a demonstration I did a quick and dirty implementation of such a persistent key-store thing (https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897). In it, the KeyStore class (less than 40 lines long) is responsible for storing the value (2 arrays) under a key (a directory). As I am quite a big fan of compression, I implemented a couple of serialization flavors: one using the .npz format (so no dependencies other than NumPy are needed) and the other using the ctable object from the bcolz package (bcolz.blosc.org). Here are some performance numbers:

$ python key-store.py -f numpy -d __test -l 0
########## Checking method: numpy (via .npz files) ############
Building database. Wait please...
Time ( creation) --> 1.906
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.191
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
75M __test

So, with the NPZ format we can deal with the 75 MB quite easily. But NPZ can compress data as well, so let's see how it goes:

$ python key-store.py -f numpy -d __test -l 9
########## Checking method: numpy (via .npz files) ############
Building database. Wait please...
Time ( creation) --> 6.636
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.384
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
28M __test

Ok, in this case we have got almost a 3x compression ratio, which is not bad. However, the performance has degraded a lot. Let's now use bcolz. First in non-compressed mode:

$ python key-store.py -f bcolz -d __test -l 0
########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 0.479
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.103
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
82M __test

Without compression, bcolz takes a bit more (~10%) space than NPZ. However, bcolz is actually meant to be used with compression on by default:

$ python key-store.py -f bcolz -d __test -l 9
########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 0.487
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.98
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
29M __test

So, the final disk usage is quite similar to NPZ, but bcolz can store and retrieve the data much faster. Also, the data decompression speed is on par with using no compression. This is because bcolz uses Blosc behind the scenes, which is much faster than zlib (used by NPZ), and sometimes faster than a memcpy().

However, even though we are doing I/O against the disk, this dataset is so small that it fits in the OS filesystem cache, so the benchmark is actually checking I/O at memory speeds, not disk speeds. In order to do a more real-life comparison, let's use a dataset that is much larger than the amount of memory in my laptop (8 GB):

$ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 0
########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 133.650
Retrieving 100 keys in arbitrary order...
Time ( query) --> 2.881
Number of elements out of getitem: 91907396
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
39G /media/faltet/docker/__test

and now, with compression on:

$ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 9
########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 145.633
Retrieving 100 keys in arbitrary order...
Time ( query) --> 1.339
Number of elements out of getitem: 91907396
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
12G /media/faltet/docker/__test

So, we are still seeing the 3x compression ratio. But the interesting thing here is that the compressed version answers queries more than 2x faster than the uncompressed one (13 ms/query vs 29 ms/query). In this case I was using an SSD (hence the low query times), so the compression advantage is even more noticeable than when working at memory speeds as above (as expected).

But anyway, this is just a demonstration that you don't need heavy tools to achieve what you want. And as a corollary, (fast) compressors can save you not only storage, but processing time too.

Francesc

2016-01-14 11:19 GMT+01:00 Nathaniel Smith <n...@pobox.com>:

> I'd try storing the data in hdf5 (probably via h5py, which is a more
> basic interface without all the bells-and-whistles that pytables
> adds), though any method you use is going to be limited by the need to
> do a seek before each read. Storing the data on SSD will probably help
> a lot if you can afford it for your data size.
>
> On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <r...@bytemining.com>
> wrote:
> > Hi,
> >
> > I have a very large dictionary that must be shared across processes and
> > does not fit in RAM. I need access to this object to be fast. The key is an
> > integer ID and the value is a list containing two elements, both of them
> > numpy arrays (one has ints, the other has floats). The key is sequential,
> > starts at 0, and there are no gaps, so the “outer” layer of this data
> > structure could really just be a list with the key actually being the
> > index. The lengths of each pair of arrays may differ across keys.
> >
> > For a visual:
> >
> > {
> >   key=0:
> >     [
> >       numpy.array([1,8,15,…, 16000]),
> >       numpy.array([0.1,0.1,0.1,…,0.1])
> >     ],
> >   key=1:
> >     [
> >       numpy.array([5,6]),
> >       numpy.array([0.5,0.5])
> >     ],
> >   …
> > }
> >
> > I’ve tried:
> > - manager proxy objects, but the object was so big that low-level
> >   code threw an exception due to format and monkey-patching wasn’t successful.
> > - Redis, which was far too slow due to setting up connections and
> >   data conversion etc.
> > - Numpy rec arrays + memory mapping, but there is a restriction
> >   that the numpy arrays in each “column” must be of fixed and same size.
> > - I looked at PyTables, which may be a solution, but seems to have
> >   a very steep learning curve.
> > - I haven’t tried SQLite3, but I am worried about the time it
> >   takes to query the DB for a sequential ID, and then translate byte arrays.
> >
> > Any ideas? I greatly appreciate any guidance you can provide.
> >
> > Thanks,
> > Ryan
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion@scipy.org
> > https://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
>
> --
> Nathaniel J. Smith -- http://vorpus.org
>

--
Francesc Alted