Hi Jonathan,

On 5/26/23 3:38 PM, Jonathan S. Katz <jk...@postgresql.org> wrote:

> On 4/26/23 9:31 AM, Giuseppe Broccolo wrote:
> > We finally opted for ElasticSearch as search engine, considering that it
> > was providing what we needed:
> >
> > * support to store dense vectors
> > * support for kNN searches (last version of ElasticSearch allows this)
>
> I do want to note that we can implement indexing techniques with GiST
> that perform K-NN searches with the "distance" support function[1], so
> adding the fundamental functions to help with this around known vector
> search techniques could add this functionality. We already have this
> today with "cube", but as Nathan mentioned, it's limited to 100 dims.
>

Yes, I was aware of this. It would be enough to define the required
support functions for GiST indexing (I was somewhat in the loop when
PG14 presorting support was being added to GiST indexing in PostGIS[1]).
That would be really helpful indeed. I only mentioned it because I know
of other teams that use ElasticSearch as a dense vector store solely
for this purpose.


> > An internal benchmark showed us that we were able to achieve the
> > expected performance, although we are still lacking some points:
> >
> > * clustering of vectors (this has to be done outside the search engine,
> > using DBScan for our use case)
>
> From your experience, have you found any particular clustering
> algorithms better at driving a good performance/recall tradeoff?
>

No, it really depends on the use case. We chose DBScan above mainly
because it clusters without knowing a priori how many clusters the
algorithm should find, which is a required parameter for k-means.
Depending on the use case, DBScan might also perform better on noisy
datasets (i.e. those with entries that do not really belong to any
particular cluster). Noise in vectors obtained from embedding models is
quite normal, especially when the embedding model is not properly
tuned/trained.

In our use case, DBScan was more or less the best choice, as it does
not bias the expected clusters.

Also, PostGIS includes an implementation of DBScan for its geometries[2].
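
To make the point about not fixing the number of clusters up front more
concrete, here is a minimal pure-Python sketch of the DBSCAN idea. It is
neither the PostGIS implementation nor what we ran internally: the names
`dbscan`, `eps` and `min_pts`, and the toy 2-D points, are assumptions
chosen only for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # Brute-force region query; the point itself is included.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels


# Two tight groups plus one outlier; the number of clusters is never given.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (20.0, 20.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # -> [0, 0, 0, 1, 1, 1, -1]
```

The two groups are discovered from the density parameters alone, and the
isolated point is reported as noise instead of being forced into a
cluster, which is what k-means would do with it.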


> > * concurrency in updating the ElasticSearch indexes storing the dense
> > vectors
>
> I do think concurrent updates of vector-based indexes is one area
> PostgreSQL can ultimately be pretty good at, whether in core or in an
> extension.


Oh, it would save a lot of overhead in updating indexed vectors! This is
needed whenever embedding models are re-trained: vectors are
re-generated and the indexes need to be updated.

Regards,
Giuseppe.

[1] https://github.com/postgis/postgis/blob/a4f354398e52ad7ed3564c47773701e4b6b87ae8/doc/release_notes.xml#L284
[2] https://github.com/postgis/postgis/blob/ce75a0e81aec2e8a9fad2649ff7b230327acb64b/postgis/lwgeom_window.c#L117
