Re: Setting BLCKSZ 4kB

2018-01-16 Thread Giuseppe Broccolo
Hi Sanyam,

Interesting topic!

2018-01-16 7:50 GMT+01:00 sanyam jain :

> Hi,
>
> I am trying to solve WAL flooding due to FPWs.
>
>
> What are the cons of setting BLCKSZ as 4kB?
>
>
> Looking at the results published on
> http://blog.coelho.net/database/2014/08/17/postgresql-page-size-for-SSD-2.html
>
> a 4kB page gives better performance than an 8kB one, except when tested
> with a 15kB row size.
>
>
> Would turning off FPWs be safe if BLCKSZ is set to 4kB, given that the
> file system page size is 4kB?
>

There is this interesting article of Tomas Vondra:

https://blog.2ndquadrant.com/on-the-impact-of-full-page-writes/

that explains some of the consequences of turning off full_page_writes. If
I understood correctly, turning off full_page_writes with BLCKSZ set to 4kB
can significantly reduce the amount of WAL produced, but you cannot be
completely sure that a PostgreSQL page will always fit within a single 4kB
file system page, though modern file systems are less vulnerable to partial
writes.

In the article, Tomas focuses on the fact that most full page writes happen
right after a checkpoint: proper checkpoint tuning can help reduce the
amount of writes to storage while safely keeping full_page_writes enabled.
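
For reference, the checkpoint spacing that Tomas discusses is mainly
controlled by a handful of settings; an illustrative (not prescriptive)
postgresql.conf sketch, with values that must be tuned to your workload:

```
# postgresql.conf -- illustrative values only, not recommendations
checkpoint_timeout = 30min          # space checkpoints further apart
max_wal_size = 4GB                  # allow more WAL between checkpoints
checkpoint_completion_target = 0.9  # spread checkpoint I/O over the interval
```

The further apart the checkpoints, the fewer "first change after
checkpoint" events there are, and hence the fewer full page images.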

Giuseppe.


Re: Need help!

2021-04-07 Thread Giuseppe Broccolo
Hi Ayoub,

On Wed, Apr 7, 2021 at 5:29 PM FATIHI Ayoub <
ayoubfatihi1...@gmail.com> wrote:

> Hi postgres community,
> I am willing to participate in GSoC to speed up the build of the gist
> index in postgis, which is based on postgresql.
> And I need to know *everything* about the GiST API.
> To do so I need to acquire the necessary theory and concepts to start this
> journey.
> I do not have a computer science background, I have little knowledge about
> machines and I code in Python (for scientific computation, data science and
> ml).
> So I am asking what I should learn in order to complete the task at hand
> efficiently: speeding up the build of the GiST index, and, if possible,
> how much time is needed to accomplish each step.
>

The main thing you have to know, IMO, is the concept of "extensibility of
indexes" in PostgreSQL. More specifically about GiST, you can have a look
here:

https://www.postgresql.org/docs/devel/gist-extensibility.html

There is also a note there about the newly added sortsupport method of the
API, which is what you need for your task.

Giuseppe.


Re: Postgres Spark connector

2020-12-24 Thread Giuseppe Broccolo
Hi Zhihong,

On Wed, 23 Dec 2020, 17:55 Zhihong Yu,  wrote:

> Hi,
> I searched for Postgres support in Apache Spark.
> I found Spark doc related to JDBC.
>
> I wonder if the community is aware of Spark connector for Postgres
> (hopefully open source) where predicate involving jsonb columns can be
> pushed down.
>

The JDBC driver is indeed the best driver to use if you have to persist
your Spark dataframes in PostgreSQL, IMO.

It's a connector that supports just pure SQL (no mapping between your
Scala/Java classes and the DB schema, unlike ORM frameworks such as
Hibernate), so it works at a lower level, allowing you to directly use the
queries you would normally use to handle jsonb data.

You may need to exchange plain JSON between Spark objects and the DB, but
the communication through the JDBC driver can be completely based on jsonb.
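
To make the idea concrete, a minimal sketch of how a jsonb predicate can be
pushed down to PostgreSQL through Spark's JDBC source, by embedding it in
the subquery passed as the "dbtable" option. The table name, column name,
connection URL and the jdbc_options helper are all hypothetical, just for
illustration:

```python
# Sketch: push a jsonb predicate down to PostgreSQL via Spark's JDBC source
# by wrapping it in a subquery. All names here are hypothetical.
def jdbc_options(table: str, jsonb_predicate: str) -> dict:
    """Build the option map for spark.read.format("jdbc").
    The subquery runs on the PostgreSQL side, so the jsonb predicate is
    evaluated by the database, not by Spark."""
    subquery = f"(SELECT * FROM {table} WHERE {jsonb_predicate}) AS t"
    return {
        "url": "jdbc:postgresql://localhost:5432/mydb",  # hypothetical DSN
        "dbtable": subquery,
        "driver": "org.postgresql.Driver",
    }

opts = jdbc_options("documents", "payload->>'status' = 'published'")
# With a live SparkSession:
# df = spark.read.format("jdbc").options(**opts).load()
```

Since the whole subquery is sent as SQL text, any jsonb operator PostgreSQL
understands can appear in the predicate, regardless of Spark's own pushdown
rules.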

Giuseppe.

>


Re: Questions about support function and abbreviate

2021-06-12 Thread Giuseppe Broccolo
Hi Han,

Darafei already provided a good answer to your question, I will add just a
few things with the hope of making things more clear for your use case.

The SortSupport implementation in PostgreSQL allows comparisons to be made
at the binary level in a dedicated region of memory, where data can be
quickly accessed through references to the actual data in the heap, called
"sort tuples". Each of those references has room for a datum the size of
the system's native pointer, i.e. 8 bytes on 64-bit systems. Although that
is enough space for standard data types like integers or floats, it's not
enough for longer data types, or varlena data like geometries.

In that last case, we need to pass the sort tuples an abbreviated version
of the key, which should include its most representative part. This is the
purpose of the abbreviation attributes, which are needed to create the
abbreviated keys.

To answer your question more specifically, the four abbreviation attributes
represent:

* comparator             --> the access method to be used for comparison of
  abbreviated keys
* abbrev_converter       --> the method that creates the abbreviations
  (NOTE: in src/backend/access/gist/gistproc.c it just considers the first
  32 bits of the hash of a geometry)
* abbrev_abort           --> the method that should check whether the
  abbreviation has to be abandoned, even in cases where the length is
  greater than the size of the native pointer (NOTE: it is not implemented
  in src/backend/access/gist/gistproc.c, which means that abbreviation is
  always considered worthwhile)
* abbrev_full_comparator --> the method to be used for comparisons in case
  of a fall back to non-abbreviated keys (NOTE: this attribute coincides
  with the comparator one when the abbreviate flag is set to false)
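
The mechanics above can be sketched in a few lines. This is a toy model in
Python, not PostgreSQL's actual C implementation: abbreviated keys are
compared first, and the full comparator only resolves ties.

```python
import functools

def abbrev_converter(value: bytes) -> bytes:
    """Abbreviate to the first 8 bytes: the size of a native pointer on a
    64-bit system, which is all the space a sort tuple reference offers."""
    return value[:8]

def abbrev_comparator(a: bytes, b: bytes) -> int:
    return (a > b) - (a < b)

def full_comparator(a: bytes, b: bytes) -> int:
    return (a > b) - (a < b)

def sort_with_abbreviation(values):
    def cmp(x, y):
        r = abbrev_comparator(abbrev_converter(x), abbrev_converter(y))
        # A tie on the abbreviated keys is not authoritative: resolve it
        # with the full comparator, as PostgreSQL's tuplesort does.
        return r if r != 0 else full_comparator(x, y)
    return sorted(values, key=functools.cmp_to_key(cmp))
```

In the real implementation the payoff comes from the abbreviated comparison
being a cheap integer comparison on data already sitting in the sort tuple,
while the full comparison has to chase the pointer into the heap.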

Hope it helps,
Giuseppe.


On Sat, Jun 12, 2021 at 8:43 AM Han Wang  wrote:

> Hi Darafei,
>
> Thanks for your reply.
>
> However, I still don't get the full picture of this.  Let me make my
> question more clear.
>
> First of all, in the *`gistproc.c
> `*
> of Postgres, it shows that the `abbreviate` attributes should be set before
> the `abbrev_converter` defined. So I would like to know where to define a
> `SortSupport` structure with `abbreviate` is `true`.
>
> Secondly, in the support functions of internal data type `Point`, the
> `abbrev_full_copmarator` just z-order hash the point first like the
> `abbrev_converter` doing and then compare the hash value. So I don't know
> the difference between `full_comparator` and `comparator` after
> `abbrev_converter`.
>
> Best regards,
> Han
>
> On Sat, Jun 12, 2021 at 2:55 PM Darafei "Komяpa" Praliaskouski <
> m...@komzpa.net> wrote:
>
>> Hello,
>>
>> the abbrev_converter is applied whenever it is defined. The values are
>> sorted using the abbreviated comparator first using the shortened version,
>> and if there is a tie the system asks the real full comparator to resolve
>> it.
>>
>> This article seems to be rather comprehensive:
>> https://brandur.org/sortsupport
>>
>> On Sat, Jun 12, 2021 at 9:51 AM Han Wang  wrote:
>>
>>> Hi all,
>>>
>>> I am trying to implement a sort support function for geometry data types
>>> in PostGIS with the new feature `SortSupport`. However, I have a question
>>> about this.
>>>
>>> I think it is hardly to apply a sort support function to a complex data
>>> type without the `abbrev_converter` to simply the data structure into a
>>> single `Datum`. However, I do not know how the system determines when to
>>> apply the converter.
>>>
>>> I appreciate any answers or suggestions. I am looking forward to hearing
>>> from you.
>>>
>>> Best regards,
>>> Han
>>>
>>
>>
>> --
>> Darafei "Komяpa" Praliaskouski
>> OSM BY Team - http://openstreetmap.by/
>>
>


Re: vector search support

2023-04-26 Thread Giuseppe Broccolo
Hi Nathan,

I find the patches really interesting. Personally, as a Data/MLOps
Engineer, I'm involved in a project where we use embedding techniques to
generate vectors from documents, and use clustering and kNN searches to
find similar documents based on the spatial neighbourhood of the generated
vectors.

We finally opted for ElasticSearch as the search engine, considering that
it provided what we needed:

* support for storing dense vectors
* support for kNN searches (the last version of ElasticSearch allows this)

An internal benchmark showed us that we were able to achieve the expected
performance, although we are still lacking some points:

* clustering of vectors (this has to be done outside the search engine,
using DBScan for our use case)
* concurrency in updating the ElasticSearch indexes storing the dense
vectors

Overall, I find these patches really interesting, considering that they
would solve some of the open issues around storing dense vectors. Index
support would help a lot with searches, though.

Not sure if it's best to include this in PostgreSQL core, but it would be
fantastic to have it as an extension.

All the best,
Giuseppe.

On Sat, 22 Apr 2023, 01:07 Nathan Bossart,  wrote:

> Attached is a proof-of-concept/work-in-progress patch set that adds
> functions for "vectors" represented with one-dimensional float8 arrays.
> These functions may be used in a variety of applications, but I am
> proposing them with the AI/ML use-cases in mind.  I am posting this early
> in the v17 cycle in hopes of gathering feedback prior to PGCon.
>
> With the accessibility of AI/ML tools such as large language models (LLMs),
> there has been a demand for storing and manipulating high-dimensional
> vectors in PostgreSQL, particularly around nearest-neighbor queries.  Many
> of these vectors have more than 1500 dimensions.  The cube extension [0]
> provides some of the distance functionality (e.g., taxicab, Euclidean, and
> Chebyshev), but it is missing some popular functions (e.g., cosine
> similarity, dot product), and it is limited to 100 dimensions.  We could
> extend cube to support more dimensions, but this would require reworking
> its indexing code and filling in gaps between the cube data type and the
> array types.  For some previous discussion about using the cube extension
> for this kind of data, see [1].
>
> float8[] is well-supported and allows for effectively unlimited dimensions
> of data.  float8 matches the common output format of many AI embeddings,
> and it allows us or extensions to implement indexing methods around these
> functions.  This patch set does not yet contain indexing support, but we
> are exploring using GiST or GIN for the use-cases in question.  It might
> also be desirable to add support for other linear algebra operations (e.g.,
> operations on matrices).  The attached patches likely only scratch the
> surface of the "vector search" use-case.
>
> The patch set is broken up as follows:
>
>  * 0001 does some minor refactoring of dsqrt() in preparation for 0002.
>  * 0002 adds several vector-related functions, including distance functions
>and a kmeans++ implementation.
>  * 0003 adds support for optionally using the OpenBLAS library, which is an
>implementation of the Basic Linear Algebra Subprograms [2]
>specification.  Basic testing with this library showed a small
>performance boost, although perhaps not enough to justify giving this
>patch serious consideration.
>
> Of course, there are many open questions.  For example, should PostgreSQL
> support this stuff out-of-the-box in the first place?  And should we
> introduce a vector data type or SQL domains for treating float8[] as
> vectors?  IMHO these vector search use-cases are an exciting opportunity
> for the PostgreSQL project, so I am eager to hear what folks think.
>
> [0] https://www.postgresql.org/docs/current/cube.html
> [1] https://postgr.es/m/2271927.1593097400%40sss.pgh.pa.us
> [2] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


Re: vector search support

2023-05-29 Thread Giuseppe Broccolo
Hi Jonathan,

On 5/26/23 3:38 PM, Jonathan S. Katz  wrote:

> On 4/26/23 9:31 AM, Giuseppe Broccolo wrote:
> > We finally opted for ElasticSearch as search engine, considering that it
> > was providing what we needed:
> >
> > * support to store dense vectors
> > * support for kNN searches (last version of ElasticSearch allows this)
>
> I do want to note that we can implement indexing techniques with GiST
> that perform K-NN searches with the "distance" support function[1], so
> adding the fundamental functions to help with this around known vector
> search techniques could add this functionality. We already have this
> today with "cube", but as Nathan mentioned, it's limited to 100 dims.
>

Yes, I was aware of this. It would be enough to define the required support
functions for GiST indexing (I was a bit in the loop when presorting
support was being added to GiST indexing in PostGIS for PG14[1]). That
would be really helpful indeed. I was just mentioning it because I know of
other teams using ElasticSearch as a store of dense vectors only for this.


> > An internal benchmark showed us that we were able to achieve the
> > expected performance, although we are still lacking some points:
> >
> > * clustering of vectors (this has to be done outside the search engine,
> > using DBScan for our use case)
>
>  From your experience, have you found any particular clustering
> algorithms better at driving a good performance/recall tradeoff?
>

Nope, it really depends on the use case: the point of using DBScan above
was mainly that it's a way of clustering without knowing a priori the
number of clusters the algorithm should retrieve, which is a parameter
needed for KMeans. Depending on the use case, DBScan may perform better
on noisy datasets (i.e. ones with entries that really do not belong to any
cluster in particular). Noise in vectors obtained with embedding models is
quite normal, especially when the embedding model is not properly
tuned/trained.

In our use case, DBScan was more or less the best choice, without biasing
the expected clusters.

Also PostGIS includes an implementation of DBScan for its geometries[2].
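
To illustrate why DBScan needs no cluster count up front, here is a compact
(unoptimized, O(n^2)) sketch in Python, where only the `eps` radius and the
`min_pts` density threshold are tuned:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch, for illustration only.
    Returns one label per point: a cluster id >= 0, or -1 for noise.
    The number of clusters is discovered, not fixed a priori as in KMeans."""
    n = len(points)
    labels = [None] * n  # None = not visited yet

    def neighbors(i):
        # Region query: all points within eps of point i (includes i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1  # a new cluster grows from this core point
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:
                queue.extend(j_seeds)  # j is also a core point: keep expanding
    return labels
```

Isolated points that never fall within eps of a core point keep the -1
label, which is exactly the "noise" behaviour mentioned above.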


> > * concurrency in updating the ElasticSearch indexes storing the dense
> > vectors
>
> I do think concurrent updates of vector-based indexes is one area
> PostgreSQL can ultimately be pretty good at, whether in core or in an
> extension.


Oh, it would save a lot of overhead in updating indexed vectors! It's
something that is needed when embedding models are re-trained, vectors are
re-generated, and the indexes need to be updated.

Regards,
Giuseppe.

[1]
https://github.com/postgis/postgis/blob/a4f354398e52ad7ed3564c47773701e4b6b87ae8/doc/release_notes.xml#L284
[2]
https://github.com/postgis/postgis/blob/ce75a0e81aec2e8a9fad2649ff7b230327acb64b/postgis/lwgeom_window.c#L117


Re: vector search support

2023-05-29 Thread Giuseppe Broccolo
Hi Nathan,

I noticed you implemented a closest_vector function which returns the
vector closest to a given one using the Euclidean distance: would it make
sense to change the implementation in order to also support distance
definitions other than the Euclidean one (for instance, cosine similarity)?
Depending on the use case, some metrics may make more sense than others.
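
To illustrate the point, a quick Python sketch (the `closest` helper is a
hypothetical stand-in for a metric-parameterised closest_vector, not the
function from the patch) showing that the nearest neighbour can change with
the metric:

```python
import math

def euclidean(a, b):
    return math.dist(a, b)

def cosine_distance(a, b):
    """1 - cosine similarity: depends only on direction, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def closest(query, candidates, metric):
    # Hypothetical metric-parameterised variant of closest_vector.
    return min(candidates, key=lambda c: metric(query, c))

query = (1.0, 1.0)
candidates = [(0.9, 1.2), (10.0, 10.0)]
# Euclidean distance favours the nearby point; cosine distance favours the
# point aligned with the query's direction, regardless of its magnitude.
```

With embeddings, where vector norms often carry little meaning, cosine
similarity is frequently the more natural choice.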

Giuseppe.

On 4/22/23 1:07 AM, Nathan Bossart  wrote:

> Attached is a proof-of-concept/work-in-progress patch set that adds
> functions for "vectors" represented with one-dimensional float8 arrays.
> These functions may be used in a variety of applications, but I am
> proposing them with the AI/ML use-cases in mind.  I am posting this early
> in the v17 cycle in hopes of gathering feedback prior to PGCon.
>
> With the accessibility of AI/ML tools such as large language models (LLMs),
> there has been a demand for storing and manipulating high-dimensional
> vectors in PostgreSQL, particularly around nearest-neighbor queries.  Many
> of these vectors have more than 1500 dimensions.  The cube extension [0]
> provides some of the distance functionality (e.g., taxicab, Euclidean, and
> Chebyshev), but it is missing some popular functions (e.g., cosine
> similarity, dot product), and it is limited to 100 dimensions.  We could
> extend cube to support more dimensions, but this would require reworking
> its indexing code and filling in gaps between the cube data type and the
> array types.  For some previous discussion about using the cube extension
> for this kind of data, see [1].
>
> float8[] is well-supported and allows for effectively unlimited dimensions
> of data.  float8 matches the common output format of many AI embeddings,
> and it allows us or extensions to implement indexing methods around these
> functions.  This patch set does not yet contain indexing support, but we
> are exploring using GiST or GIN for the use-cases in question.  It might
> also be desirable to add support for other linear algebra operations (e.g.,
> operations on matrices).  The attached patches likely only scratch the
> surface of the "vector search" use-case.
>
> The patch set is broken up as follows:
>
>  * 0001 does some minor refactoring of dsqrt() in preparation for 0002.
>  * 0002 adds several vector-related functions, including distance functions
>and a kmeans++ implementation.
>  * 0003 adds support for optionally using the OpenBLAS library, which is an
>implementation of the Basic Linear Algebra Subprograms [2]
>specification.  Basic testing with this library showed a small
>performance boost, although perhaps not enough to justify giving this
>patch serious consideration.
>
> Of course, there are many open questions.  For example, should PostgreSQL
> support this stuff out-of-the-box in the first place?  And should we
> introduce a vector data type or SQL domains for treating float8[] as
> vectors?  IMHO these vector search use-cases are an exciting opportunity
> for the PostgreSQL project, so I am eager to hear what folks think.
>
> [0] https://www.postgresql.org/docs/current/cube.html
> [1] https://postgr.es/m/2271927.1593097400%40sss.pgh.pa.us
> [2] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>