On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
> On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> >Another OT thread to pick your brains. :)
> >
> >What's a good, efficient file structure for storing extremely large
> >lookup tables? (Extremely large as in > 10 million entries, with
> >keys and values roughly about 100 bytes each.) The structure must
> >support efficient adding and lookup of entries, as these two
> >operations will be very frequent.
> >
> >I did some online research, and it seems that hashtables perform
> >poorly on disk, because the usual hash functions cause random
> >scattering of related data (which are likely to be accessed with
> >higher temporal locality), which incurs lots of disk seeks.
> >
> >I thought about B-trees, but they have high overhead (and are a pain
> >to implement), and they only exhibit good locality if table entries
> >are accessed sequentially; the problem is I'm working with
> >high-dimensional data and the order of accesses is unlikely to be
> >sequential. However, the entries do exhibit good spatial locality in
> >higher-dimensional space (i.e., if entry X is accessed first, then
> >the next entry Y is quite likely to be close to X in that space).
> >Does anybody know of a good data structure that can take advantage
> >of this fact to minimize disk accesses?
> >
> >
> >T
>
> As a first attempt could you use a key-value database (like REDIS, if
> you have enough memory to fit everything in)? Or is that out of the
> question?
Well, currently I'm just using plain ole in-memory hash tables to store the data. It works quite well for small problems, but I'm finding that memory runs out quickly with problems of moderate size, so I need to move the storage to disk. I'm just looking to minimize I/O latency, which hash tables are notorious for. (A previous incarnation of the program used Berkeley DB hash tables, but it became I/O bound and slow when the problem size was large.)

> Another question is can your queries be batched? If that is the
> case and your data is bigger than your available memory, then try
> Googling "Lars Arge Buffer Tree" which might work well.

Hmm. I'm not sure it's possible to batch the queries unless I multithread the algorithm; the queries are generated on the fly. But they *are* somewhat predictable (spatial locality: see below), so maybe that might help?

[...]
> If I had to design something quick on the spot, my first guess would
> be to use a grid on the first two dimensions and then bin the
> 'points' or keys within each grid square and build a simpler
> structure on those. This won't work so well though for really
> high-dimension data or if the 'points' are randomly distributed.
>
> Also, what exactly do you mean by "in that space" when you say:
>
> "if entry X is accessed first, then the next entry Y is quite likely
> to be close to X in that space".
>
> Do you mean that the value of Y in the next dimension is numerically
> close (or expected to be) to X?
[...]

How many dimensions is considered "really high"? The number of dimensions can be up to about 50 or so. Is that high?

Also, the points are not randomly distributed; they are highly connected (i.e., they tend to be adjacent to each other). Basically, one possible representation of the keys is as an n-dimensional array of integers. (They are not natively in that form, but can be compressed into that form easily.) And the way the algorithm works is that if the key (p1,p2,p3,...) is accessed, then the next key (q1,q2,q3,...) accessed is likely to differ from (p1,p2,p3,...) in only a small number of coordinates, and the difference in each coordinate is likely to be very small (usually in the +1/-1 range).

Actually, now that I think of it, the problem may be far more tractable than I realized. I was previously thinking of some kind of n-dimensional space-partitioning storage scheme, where the state space is partitioned into n-dimensional cube buckets, and points are placed in their corresponding buckets. However, this has the problem that the n-cube contains 2^n vertices, so the bucket sizes would have to be k^n, and since I can't control n, it is difficult to map bucket sizes to page sizes for optimal I/O performance. For large n, the bucket size would be unmanageably big (e.g., in the 50-dimensional case).

But it just dawned on me that partitioning into n-cubes is unnecessary, because most of the time I only care about points that differ in a small number of coordinates from the current point. So a more efficient bucket shape would be something closer to an n-cross (the higher-dimensional equivalent of the octahedron), which has only 2*n vertices, or one of its derivatives. This makes bucket sizes far easier to control, and far more efficient to store. The only tricky part is how to partition n-space into n-cross-shaped pieces. It would have to be some kind of tiling (otherwise there'd be ambiguity over which bucket a given point falls into). So now I have to do some research into n-dimensional tilings... :P


T

-- 
"The number you have dialed is imaginary. Please rotate your phone 90 degrees and try again."
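[Editorial note: a minimal sketch of the bucketing arithmetic discussed in the thread. Python is used purely for illustration; the key layout (n-tuples of non-negative integers, per the compressed representation described above), the block size `k`, and the function name `cube_bucket` are assumptions, not anything stated by the posters.]

```python
# Sketch only: keys are assumed to be n-tuples of non-negative integers,
# matching the "n-dimensional array of integers" representation above.

def cube_bucket(key, k=4):
    """Quantize each coordinate by block size k, so each bucket is an
    axis-aligned n-cube of side k holding up to k**n points -- the
    size-explosion problem noted above when n is large (e.g. n = 50)."""
    return tuple(c // k for c in key)

# Spatial locality: the next key usually differs by +/-1 in a few
# coordinates, so consecutive accesses usually land in the same bucket.
p = (8, 9, 9, 8)
q = (8, 9, 10, 8)                          # +1 in one coordinate
print(cube_bucket(p) == cube_bucket(q))    # True for this pair

# Vertex counts: an n-cube has 2**n vertices, while an n-cross
# (generalized octahedron) has only 2*n -- the reason an n-cross-shaped
# bucket stays manageable as n grows.
for n in (3, 10, 50):
    print(n, 2**n, 2 * n)
```

The cube variant is shown because its bucket function is trivial; assigning points to n-cross-shaped buckets would require the tiling the poster goes on to research.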