On Friday, 24 March 2017 at 06:30:25 UTC, H. S. Teoh wrote:

You have to keep in mind that one downside of hashtables is that they tend to be unfriendly towards caching hierarchies
...

For certain applications, where key lookups are more-or-less random, this is the best you could do, but for a good number of applications, there tends to be some correlation between lookups, and even things like B-trees could potentially do better because of their cache-friendliness (and locality), even if you assume I/O is much faster than your traditional spindle-based hard drive. The O(log n) could have a much smaller constant than the hashtable O(1) if lookups tend to be correlated (i.e., exhibit locality), and lots more cache misses are incurred in the latter.

Having said that, though, large-capacity low-latency SSDs are very interesting to me because I'm interested in certain applications that involve very large lookup tables. Currently I'm just using traditional hashtables, but once you get to a certain size, I/O overhead dominates and progress grinds to a halt. I've been researching ways of taking advantage of locality to ease this, but so far haven't gotten it to actually work yet.


T

Thanks, T, for the great insight. Very helpful! Nice to see that you, Laeeth, the ACM, Intel, and Micron all agree: this new technology could be very disruptive.

Beyond the prevalence of random lookups, certain app conditions and designs might make "gigantic AAs" more useful and more competitive with the alternatives:

1. Use by programmers who need a data structure that's easier to reason about, faster and safer to implement or prototype, and easier for future maintainers to read and modify.

2. Apps with large tables that use (or could use) non-natural keys, e.g., crypto-quality GUIDs. Such keys can't benefit from (and so needn't pay the ongoing computational cost of) the ordering that must be built and maintained over natural key values (e.g., LastName="Smith") for a B-tree search.

3. Tables (and related tables) whose row data (and even aggregate data) could be "clumped" (think "documentized") to achieve a singularity (and not merely some degree of locality), consolidating multiple lookups into a single lookup; see the sketch after this list.

-In other words, a key's value is itself an array of data that can be fetched in a single lookup.

-The key id for each "next" (updated) version of a document could also be stored with the current version.

-The next key-value entry to hold the update could be created immediately but not have its values written unless and until the prior version of the document has changed.

-Better yet, in an append-only design, creation of the next key-value entry could be deferred until a revision actually occurs.

-Such "chained documentization" would automate histories, and could be further abstracted and adapted (e.g., with columnstore concepts) to accommodate apps in which lookups aren't mostly random.

Finally, re: caches: I haven't been able to find out whether it's possible to combine a server's DRAM with its Optane SSDs or DIMMs to form a single pool of RAM. Mixing DIMMs of different speeds is normally a no-no; but if it could be done, the hottest data could be "cached" in DRAM and so spared the roughly 10x latency penalty of the SSD device.
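
Even if the hardware can't be pooled, a small DRAM-resident AA could still sit in front of the gigantic table in software. A rough sketch follows; hotCache, bigTable, and the crude dump-everything eviction are all hypothetical stand-ins (the real big table would live on the Optane device, and a real cache would use LRU/CLOCK or similar):

import std.stdio;

enum hotCapacity = 4;              // tiny for illustration; would be sized to DRAM
string[string] hotCache;           // DRAM-resident tier for the hottest keys
string[string] bigTable;           // stand-in for the SSD-resident gigantic AA

string lookup(string key)
{
    if (auto p = key in hotCache)  // fast path: DRAM hit
        return *p;

    auto value = bigTable[key];    // slow path: device lookup

    // Crude eviction: dump the whole hot tier when it fills up.
    if (hotCache.length >= hotCapacity)
        hotCache.clear();
    hotCache[key] = value;
    return value;
}

void main()
{
    bigTable["guid-0001"] = "document payload";
    writeln(lookup("guid-0001"));  // misses DRAM, fills the hot tier
    writeln(lookup("guid-0001"));  // now served from DRAM
}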
