On Tue, Apr 20, 2021 at 8:56 PM Stefan Keller <sfkel...@gmail.com> wrote:
> Dear Olegs, dear Nikolay, dear all
>
> Allow me to revive this thread:
>
> Are there any advances in a learned index for PostgreSQL?
>
> Background: I'm trying to benchmark those experimental indices. For
> this I did some bibliography work (see below). Fun fact: not only
> Postgres people love high-proof drinks, sorry, high-proof index names:
> see "From WiscKey to Bourbon" :-).

Have you seen the recent paper "Benchmarking Learned Indexes"?
https://vldb.org/pvldb/vol14/p1-marcus.pdf

I haven't looked into the code yet: https://github.com/learnedsystems/sosd

> My main discovery is a discussion report - which included Stonebraker
> - entitled "ML-In-Databases: Assessment and Prognosis" [1]. The first
> two takeaways are: "#1: The initial comparisons of learned indices
> with optimized traditional indices should be further expanded to
> include concurrency control and multi-user settings. (...) #2: A key
> benefit of learned indices may come from the learned structures
> requiring lower space utilization, rather than a reduction in search
> time."
>
> Yours, Stefan
>
> [1] Kraska, T., Minhas, U. F., Neumann, T., Papaemmanouil, O., Patel,
> J. M., Ré, C., & Stonebraker, M. (2021). ML-In-Databases: Assessment
> and Prognosis. Data Engineering, 3. Web access:
> https://www.cs.brandeis.edu/~olga/publications/ieee2021.pdf
>
> Bibliography (without any claim of completeness):
>
> Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N. (2018,
> May). The case for learned index structures. In Proceedings of the
> 2018 International Conference on Management of Data (pp. 489-504).
> Web access: https://dl.acm.org/doi/pdf/10.1145/3183713.3196909
>
> "Recursive model index, a learned index structure" (based on Kraska
> et al. 2018; the arXiv preprint is from December 2017). Source code:
> https://github.com/BenJoyenConseil/rmi (Go),
> https://github.com/learnedsystems/RMI (Rust)
>
> Kaur, T. (2018). Learned Index Structures: Practical Implementations
> and Future Directions. Master Thesis. Web access:
> https://wwwiti.cs.uni-magdeburg.de/iti_db/publikationen/ps/auto/kaur2018learned.pdf
> (claims an open-source C++ implementation, but I found none)
>
> Kipf, A., Marcus, R., van Renen, A., Stoian, M., Kemper, A., Kraska,
> T., & Neumann, T. (2020, June). RadixSpline: a single-pass learned
> index. In Proceedings of the Third International Workshop on
> Exploiting Artificial Intelligence Techniques for Data Management
> (pp. 1-5). Web access: https://dl.acm.org/doi/10.1145/3401071.3401659
> Source code: https://github.com/learnedsystems/RadixSpline (C++)
>
> Dai, Y., Xu, Y., Ganesan, A., Alagappan, R., Kroth, B.,
> Arpaci-Dusseau, A., & Arpaci-Dusseau, R. (2020). From WiscKey to
> Bourbon: A learned index for log-structured merge trees. In 14th
> USENIX Symposium on Operating Systems Design and Implementation
> (OSDI 20) (pp. 155-171). Web access:
> https://www.usenix.org/system/files/osdi20-dai_0.pdf
>
> Ding, J., Minhas, U. F., Yu, J., Wang, C., Do, J., Li, Y., ... &
> Kraska, T. (2020, June). ALEX: an updatable adaptive learned index.
> In Proceedings of the 2020 ACM SIGMOD International Conference on
> Management of Data (pp. 969-984). Web access:
> https://doi.org/10.1145/3318464.3389711 /
> https://dblp.org/rec/conf/sigmod/DingMYWDLZCGKLK20
> Source code: https://github.com/microsoft/ALEX (C++, MIT license)
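Most of these designs share one core trick, so here is a minimal sketch
of it in C (my own illustration, not code from any of the repositories
above; the linear model and all constants are made up): fit a model to
the key -> position mapping, i.e. the empirical CDF of the keys,
remember the worst prediction error from the build, and answer lookups
with a binary search confined to the predicted window. An RMI
generalizes this by stacking models with per-model error bounds;
RadixSpline fits a spline in a single pass instead.

/*
 * Minimal learned-index sketch (illustration only): fit a linear model
 * pos ~ a*key + b to a sorted key array, remember the worst prediction
 * error, and answer lookups by binary search inside the error window.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct
{
    double      a, b;       /* linear model: predicted pos = a*key + b */
    long        max_err;    /* worst |predicted - actual| over all keys */
    const int64_t *keys;    /* the sorted data itself */
    long        n;
} LearnedIndex;

static void
li_build(LearnedIndex *li, const int64_t *keys, long n)
{
    double      sx = 0, sy = 0, sxx = 0, sxy = 0;
    long        i;

    li->keys = keys;
    li->n = n;

    /* least-squares fit of rank against key: approximates the key CDF */
    for (i = 0; i < n; i++)
    {
        sx += (double) keys[i];
        sy += (double) i;
        sxx += (double) keys[i] * (double) keys[i];
        sxy += (double) keys[i] * (double) i;
    }
    li->a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    li->b = (sy - li->a * sx) / n;

    /* record the worst prediction error, so lookups stay exact */
    li->max_err = 0;
    for (i = 0; i < n; i++)
    {
        long        err = labs((long) (li->a * keys[i] + li->b) - i);

        if (err > li->max_err)
            li->max_err = err;
    }
}

static long
li_lookup(const LearnedIndex *li, int64_t key)
{
    long        pred = (long) (li->a * key + li->b);
    long        lo = pred - li->max_err;
    long        hi = pred + li->max_err;

    if (lo < 0)
        lo = 0;
    if (hi > li->n - 1)
        hi = li->n - 1;

    /* ordinary binary search, but only inside the predicted window */
    while (lo <= hi)
    {
        long        mid = lo + (hi - lo) / 2;

        if (li->keys[mid] == key)
            return mid;
        if (li->keys[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;                  /* not found */
}

int
main(void)
{
    static int64_t keys[1000];
    LearnedIndex li;
    long        i;

    for (i = 0; i < 1000; i++)
        keys[i] = 3 * i + 7;    /* a simple, easy-to-learn distribution */

    li_build(&li, keys, 1000);
    printf("max_err = %ld, lookup(307) = %ld\n",
           li.max_err, li_lookup(&li, 307));
    return 0;
}

On nicely behaved key sets the error window stays tiny, so the final
search touches only a few entries; the open question for us, as the
2017 discussion below already noted, is what happens under updates.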
> On Tue, Dec 12, 2017 at 6:02 PM Oleg Ivanov <o.iva...@postgrespro.ru> wrote:
> >
> > On 12/12/2017 04:33 PM, Oleg Bartunov wrote:
> > > On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
> > > <samokhva...@gmail.com> wrote:
> > >> Very interesting read: https://arxiv.org/abs/1712.01208
> > >>
> > >> HN discussion: https://news.ycombinator.com/item?id=15894896
> > >>
> > >> Some of the comments (from Twitter
> > >> https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean
> > >> and co at GOOG just released a paper showing how machine-learned
> > >> indexes can replace B-Trees, Hash Indexes, and Bloom Filters.
> > >> Execute 3x faster than B-Trees, 10-100x less space. Executes on
> > >> GPU, which are getting faster unlike CPU. Amazing."
> > >>
> > >> Can those ideas be applied to Postgres in its current state? Or is
> > >> it not really down-to-earth?
> > > Oleg made some analysis of the paper.
> > If the keys are comparable and the data distribution is simple enough
> > (which seems to happen rather often) and the data does not change, we
> > can learn the distribution, and so perform the search faster and save
> > memory as well. That is what the authors wrote, and that is what
> > should work, in my opinion.
> >
> > The limitations of the approach, which the authors mention in their
> > work:
> > + The data does not change. The proposed method can be extended more
> > or less straightforwardly to nearly append-only workloads and to
> > workloads in which the data distribution does not change or changes
> > very slowly (the data itself or its amount may change, nevertheless).
> > Otherwise it is challenging, because the key positions can no longer
> > be treated as independent values, but it still looks possible. The
> > other option proposed by the authors is to store new objects in a
> > buffer and retrain the model in the background, but that does not
> > look like a nice solution.
> > + GPUs are fast, but the latency of operations on the GPU almost
> > certainly cancels out all the speed benefits for such small models.
> > The only case in which using a GPU is reasonable is training the
> > models (i.e. building such indexes).
> >
> > The following remained unclear to me from the paper:
> > + How effectively the approach can be implemented, taking the
> > limitations above into account.
> > + For hash indexes, the authors propose to replace the hash function
> > with the learned CDF, which can decrease the number of collisions and
> > thereby the memory consumption. That is reasonable, but this hash
> > function implicitly uses the information about the order on the keys.
> > I suppose that by using the ordering of the data, and with access to
> > a histogram of the data, one could achieve a similar decrease in
> > memory consumption.
> > + The proposed hierarchical learning looks natural for the problem,
> > but it is not clear that it is the best option. For example, one
> > might use an end-to-end learning procedure with weight-sparsity
> > regularizers as well. I wonder whether the latter approach can obtain
> > results competitive with the former, and if not, why. Anyway, I have
> > a feeling that the proposed method has room for improvement.
> >
> > In general, the approach still needs some additional research (mostly
> > around changing data), and there are some points to think through
> > regarding a proper implementation (paging, for example), but it looks
> > promising enough nevertheless.
> >
> > --
> > Postgres Professional: http://www.postgrespro.com
> > The Russian Postgres Company
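On Oleg's point about hash indexes: the effect is easy to see in a toy
experiment. The sketch below (again my own illustration, with made-up
sample size, bucket count, and key distribution) compares bucket
occupancy for a plain modulo hash against a "hash" that maps each key
through an empirical CDF learned from a small sorted sample. On skewed
keys the modulo scheme piles hundreds of keys into its fullest bucket,
while the CDF-based placement stays near the ideal average - and, as
Oleg says, it manages that only because it implicitly exploits the key
ordering.

/*
 * Toy comparison of a "learned CDF" hash against plain modulo hashing
 * on a skewed key set (squares). Illustration only; the sampling rate,
 * bucket count and interpolation are all made-up parameters.
 */
#include <stdio.h>
#include <stdint.h>

#define NKEYS    8192
#define NSAMPLE  256
#define NBUCKETS 1024

/* estimate the CDF of key from a sorted sample and scale it to a bucket */
static int
cdf_bucket(const int64_t *sample, int nsample, int64_t key)
{
    int         lo = 0, hi = nsample;
    double      pos;

    while (lo < hi)             /* lower bound of key within the sample */
    {
        int         mid = (lo + hi) / 2;

        if (sample[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        pos = 0.0;
    else if (lo == nsample)
        pos = 1.0;
    else
    {
        /* interpolate between neighboring sample points */
        double      gap = (double) (sample[lo] - sample[lo - 1]);

        pos = (lo - 1 + (gap > 0 ? (key - sample[lo - 1]) / gap : 0.0))
            / nsample;
    }

    int         b = (int) (pos * NBUCKETS);

    return b >= NBUCKETS ? NBUCKETS - 1 : b;
}

int
main(void)
{
    static int64_t keys[NKEYS], sample[NSAMPLE];
    static int  cdf_cnt[NBUCKETS], mod_cnt[NBUCKETS];
    int         i, cdf_max = 0, mod_max = 0;

    for (i = 0; i < NKEYS; i++)
        keys[i] = (int64_t) i * i;      /* heavily skewed distribution */

    /* "learn" the CDF: keep an evenly spaced sorted sample of the keys */
    for (i = 0; i < NSAMPLE; i++)
        sample[i] = keys[(long) i * (NKEYS - 1) / (NSAMPLE - 1)];

    for (i = 0; i < NKEYS; i++)
    {
        cdf_cnt[cdf_bucket(sample, NSAMPLE, keys[i])]++;
        mod_cnt[keys[i] % NBUCKETS]++;
    }
    for (i = 0; i < NBUCKETS; i++)
    {
        if (cdf_cnt[i] > cdf_max)
            cdf_max = cdf_cnt[i];
        if (mod_cnt[i] > mod_max)
            mod_max = mod_cnt[i];
    }
    printf("fullest bucket: learned-CDF %d, modulo %d (ideal %d)\n",
           cdf_max, mod_max, NKEYS / NBUCKETS);
    return 0;
}

Whether this translates into a real win over a well-tuned hash opclass
is exactly the kind of question the benchmarking paper above tries to
answer.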