On 4/26/23 9:31 AM, Giuseppe Broccolo wrote:
Hi Nathan,

I find the patches really interesting. Personally, as a Data/MLOps Engineer, I'm involved in a project where we use embedding techniques to generate vectors from documents, and use clustering and kNN searches to find similar documents based on the spatial neighbourhood of the generated vectors.

Thanks! This seems to be a pretty common use-case these days.

We finally opted for ElasticSearch as the search engine, considering that it provided what we needed:

* support for storing dense vectors
* support for kNN searches (the latest version of ElasticSearch allows this)

I do want to note that GiST already lets us implement indexes that perform K-NN searches through the "distance" support function[1], so adding the fundamental functions that known vector search techniques rely on could provide this functionality. We already have this today with "cube", but as Nathan mentioned, it's limited to 100 dimensions.
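
To make concrete what a K-NN search has to compute, here is a minimal brute-force sketch in Python. It is only an illustration, not how GiST or "cube" implement it: the names euclidean_distance, knn_search and the sample documents are hypothetical, and Euclidean distance is an assumption (other metrics would work the same way). An index with a "distance" support function aims to return the same ordering without scanning every vector.

```python
import heapq
import math


def euclidean_distance(a, b):
    """Euclidean (L2) distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_search(vectors, query, k):
    """Return the k (distance, id) pairs closest to query, nearest first.

    This scans every stored vector, which is exactly the work an
    index-assisted ordered scan is meant to avoid.
    """
    scored = (
        (euclidean_distance(vec, query), vec_id)
        for vec_id, vec in vectors.items()
    )
    return heapq.nsmallest(k, scored)


# Hypothetical sample data: three small "document embeddings".
documents = {
    "doc_a": [0.0, 0.0, 0.0],
    "doc_b": [1.0, 0.0, 0.0],
    "doc_c": [0.0, 3.0, 4.0],
}

nearest = knn_search(documents, [0.9, 0.0, 0.0], k=2)
for distance, doc_id in nearest:
    print(doc_id, round(distance, 3))
```

The brute-force scan costs one distance computation per stored vector for every query, which is why index support matters once the table grows.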

An internal benchmark showed us that we were able to achieve the expected performance, although some points are still missing:

* clustering of vectors (this has to be done outside the search engine, using DBScan for our use case)

From your experience, have you found any particular clustering algorithms better at driving a good performance/recall tradeoff?

* concurrency in updating the ElasticSearch indexes storing the dense vectors

I do think concurrent updates of vector-based indexes are one area PostgreSQL can ultimately be pretty good at, whether in core or in an extension.

I found these patches really interesting, considering that they would solve some of the open issues when storing dense vectors. Index support would help a lot with searches though.

Great -- thanks for the feedback,

Jonathan

[1] https://www.postgresql.org/docs/devel/gist-extensibility.html
