Hi Jonathan,

On 5/26/23 3:38 PM, Jonathan S. Katz <[email protected]> wrote:
> On 4/26/23 9:31 AM, Giuseppe Broccolo wrote:
> > We finally opted for ElasticSearch as search engine, considering that it
> > was providing what we needed:
> >
> > * support to store dense vectors
> > * support for kNN searches (last version of ElasticSearch allows this)
>
> I do want to note that we can implement indexing techniques with GiST
> that perform K-NN searches with the "distance" support function[1], so
> adding the fundamental functions to help with this around known vector
> search techniques could add this functionality. We already have this
> today with "cube", but as Nathan mentioned, it's limited to 100 dims.

Yes, I was aware of this. It would be enough to define the required support functions for GiST indexing (I was a bit in the loop when PG14 presorting support was being added to GiST indexing in PostGIS[1]). That would be really helpful indeed. I was just mentioning it because I know of other teams using ElasticSearch as storage for dense vectors for this reason alone.

> > An internal benchmark showed us that we were able to achieve the
> > expected performance, although we are still lacking some points:
> >
> > * clustering of vectors (this has to be done outside the search engine,
> > using DBScan for our use case)
>
> From your experience, have you found any particular clustering
> algorithms better at driving a good performance/recall tradeoff?

Nope, it really depends on the use case. The main reason for using DBScan above was that it clusters without knowing a priori how many clusters the algorithm should retrieve, which is a required parameter for k-means. Depending on the use case, DBScan might also perform better on noisy datasets (i.e. those containing entries that do not really belong to any particular cluster). Noise in vectors obtained with embedding models is quite normal, especially when the embedding model is not properly tuned/trained.
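To make the point about the cluster count concrete, here is a minimal pure-Python sketch of DBScan. It is only an illustration: it is neither the code used in the benchmark nor the PostGIS implementation, it uses brute-force neighbor lookups, and the names `dbscan`, `region_query`, `eps` and `min_pts` are chosen for this example. The 2-D points stand in for embedding vectors. Note that the number of clusters is never passed in, and the isolated point ends up labeled as noise (-1).

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # first dense group
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),  # second dense group
    (10.0, 0.0),                                      # isolated entry
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

With k-means, the same data would need k=2 supplied up front, and the isolated entry would be forced into one of the two clusters.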
In our use case, DBScan was more or less the best choice, without biasing the expected clusters. Also PostGIS includes an implementation of DBScan for its geometries[2].

> > * concurrency in updating the ElasticSearch indexes storing the dense
> > vectors
>
> I do think concurrent updates of vector-based indexes is one area
> PostgreSQL can ultimately be pretty good at, whether in core or in an
> extension.

Oh, it would save a lot of overhead in updating indexed vectors! It's something needed when embedding models are re-trained, vectors are re-generated and indexes need to be updated.

Regards,
Giuseppe.

[1] https://github.com/postgis/postgis/blob/a4f354398e52ad7ed3564c47773701e4b6b87ae8/doc/release_notes.xml#L284
[2] https://github.com/postgis/postgis/blob/ce75a0e81aec2e8a9fad2649ff7b230327acb64b/postgis/lwgeom_window.c#L117
