Re: glibc qsort() vulnerability

2024-02-09 Thread Andrey Borodin



> On 9 Feb 2024, at 21:32, Nathan Bossart  wrote:
>  a lot
> of current use-cases require inspecting specific fields of structs
Yes, I'm proposing to pass the sorting routine not a comparator, but a value
extractor, and then rely on the operators <, >, and ==.
In pseudocode: instead of sort(array, (a,b) -> a.field - b.field), use
sort(array, x -> x.field), and rely on "field" being comparable.
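A minimal sketch of the two styles in C (Item and its field are hypothetical
names for illustration, not the actual patch):

#include <stddef.h>

typedef struct Item { int field; } Item;

/* Comparator style: an indirect call per comparison. */
static int
cmp_by_field(const void *pa, const void *pb)
{
    int a = ((const Item *) pa)->field;
    int b = ((const Item *) pb)->field;
    return (a > b) - (a < b);
}
/* usage: qsort(items, n, sizeof(Item), cmp_by_field); */

/* Extractor style: the sort body is specialized on a key extractor and
 * compares keys with < directly (insertion sort for brevity). */
#define ITEM_KEY(it) ((it).field)

static void
sort_items(Item *xs, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        Item    tmp = xs[i];
        size_t  j = i;

        while (j > 0 && ITEM_KEY(tmp) < ITEM_KEY(xs[j - 1]))
        {
            xs[j] = xs[j - 1];
            j--;
        }
        xs[j] = tmp;
    }
}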

> If that can be made simple and elegant and
> demonstrates substantial improvements
I'll try to produce a PoC and measure it with Andres' intarray test.

> On 9 Feb 2024, at 23:40, Andres Freund  wrote:
> 
> We have some infrastructure for that actually, see sort_template.h.  But
> perhaps we should define a static inline of the generic pg_qsort() even. OTOH,
> plenty places that'll just end up to a pointless amount of code emitted to
> sort ~5 elements on average.
I think there might be another benefit: it's easier to reason about value
ordering than about a comparator function that returns -1, 0, +1...

>> I bet “call" is more expensive than “if".
> 
> Not really in this case. The call is perfectly predictable - a single qsort()
> will use the same callback for every comparison, whereas the if is perfectly
> *unpredictable*.  A branch mispredict is far more expensive than a correctly
> predicted function call.

Oh, that makes sense... I did not understand that. But does the CPU predict
which instruction to fetch even after a call instruction? These CPUs are really
neat things... so, probably, yes, it does.


Best regards, Andrey Borodin.



Re: Transaction timeout

2024-01-31 Thread Andrey Borodin



> On 31 Jan 2024, at 14:27, Japin Li  wrote:
> 
> LGTM.
> 
> If there is no other objections, I'll change it to ready for committer
> next Monday.

I think we have a quorum, so I decided to go ahead and flipped the status to
RfC. Thanks!


Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2024-01-26 Thread Andrey Borodin



> On 20 Jan 2024, at 08:31, vignesh C  wrote:
> 
> On Mon, 9 Jan 2023 at 09:49, Andrey Borodin  wrote:
>> 
>> On Tue, Jan 3, 2023 at 5:02 AM vignesh C  wrote:
>>> does not apply on top of HEAD as in [1], please post a rebased patch:
>>> 
>> Thanks! Here's the rebase.
> 
> I'm seeing that there has been no activity in this thread for more
> than 1 year now, I'm planning to close this in the current commitfest
> unless someone is planning to take it forward.

Hi Vignesh,

thanks for the ping! The most important parts of this patch set are discussed
in [0]. If that patchset is committed, I'll withdraw the entry for this thread
from the commitfest.
There's a version of Multixact-specific optimizations [1], but I hope they will
not be necessary with the effective caches developed in [0]. It seems to me
that the most important part of those optimizations is removing sleeps under
the SLRU lock on standby [2] by Kyotaro Horiguchi. But given that the cache
optimizations took 4 years to get close to commit, I'm not sure we will get
this optimization any time soon...

Thanks!

Best regards, Andrey Borodin.


[0] 
https://www.postgresql.org/message-id/flat/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com
[1] 
https://www.postgresql.org/message-id/flat/2ECE132B-C042-4489-930E-DBC5D0DAB84A%40yandex-team.ru#5f7e7022647be9eeecfc2ae75d765500
[2] 
https://www.postgresql.org/message-id/flat/20200515.090333.24867479329066911.horikyota.ntt%40gmail.com#855f8bb7205890579a363d2344b4484d



Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-01-26 Thread Andrey Borodin



> On 26 Jan 2024, at 22:38, Alvaro Herrera  wrote:
> 
> This is OK because in the
> default compilation each file only has 32 segments, so that requires
> only 32 lwlocks held at once while the file is being deleted.

Do we somehow account for different subsystems not accumulating
MAX_SIMUL_LWLOCKS together?
E.g. GiST during a split can combine 75 locks, and somehow commit_ts might be
deactivated by this backend at the same moment and add 32 locks more :)
I understand that this sounds fantastical, and that these subsystems do not
interfere. But it is fantastical only until something like that actually
happens.
If possible, I'd prefer one lock at a time, and maybe sometimes two or three
with some guarantees that this is safe.
So, from my POV, the first solution that you proposed seems much better to me.

Thanks for working on this!


Best regards, Andrey Borodin.



Re: UUID v7

2024-01-24 Thread Andrey Borodin



> On 24 Jan 2024, at 22:00, Marcos Pegoraro  wrote:
> 
> Is enough from 1970 ?
Per the standard, the unix_ts_ms field is the number of milliseconds since the
UNIX epoch, 1970-01-01.

> How about if user wants to have an UUID of his birth date ?

I've claimed my
0078c135-bd00-70b1-865a-63c3741922a5

But again, UUIDs are not designed to store a timestamp. They are unique, and v7
promotes data locality via time-ordering.
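For illustration, the embedded millisecond timestamp can be decoded by hand
from the 48-bit unix_ts_ms prefix (the first 12 hex digits of the UUID above);
a sketch:

-- take the unix_ts_ms prefix of the UUID above and convert it to a timestamp
SELECT to_timestamp(x'0078c135bd00'::bit(48)::bigint / 1000.0);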


Best regards, Andrey Borodin.



Re: UUID v7

2024-01-19 Thread Andrey Borodin


> On 19 Jan 2024, at 13:25, Andrey Borodin  wrote:
> 
> Also, I've added some documentation on all functions.

Here's v12. Changes:
1. Documentation improvements
2. Code comments
3. Better commit message and reviews list

Best regards, Andrey Borodin.



v12-0001-Implement-UUID-v7.patch
Description: Binary data


Re: UUID v7

2024-01-19 Thread Andrey Borodin


> On 19 Jan 2024, at 08:24, David G. Johnston  
> wrote:
> 
> 
> You are mixing POSIX and ISO-8601 conventions and, as noted in our appendix, 
> they disagree on the direction that is positive.

Thanks! Now everything seems to be in its place.

I want to include in the patch following tests:
-- extract UUID v1, v6 and v7 timestamp
SELECT uuid_extract_time('C232AB00-9414-11EC-B3C8-9F6BDECED846') = 'Tuesday, February 22, 2022 2:22:22.00 PM GMT+05:00';
SELECT uuid_extract_time('1EC9414C-232A-6B00-B3C8-9F6BDECED846') = 'Tuesday, February 22, 2022 2:22:22.00 PM GMT+05:00';
SELECT uuid_extract_time('017F22E2-79B0-7CC3-98C4-DC0C0C07398F') = 'Tuesday, February 22, 2022 2:22:22.00 PM GMT+05:00';

What do you think, will it be stable across the buildfarm? Or should we change
anything to avoid false failures caused by differences in timestamp parsing?


> On 19 Jan 2024, at 07:58, Lukas Fittl  wrote:
> 
> Note how calling the uuidv7 function again after having called it with a 
> fixed future timestamp, returns the future timestamp, even though it should 
> return the current time.

Thanks for the review.
Well, that was intentional. But now I see it's kind of confusing behaviour.
I've changed it to the more expected version.

Also, I've added some documentation on all functions.


Best regards, Andrey Borodin.


v11-0001-Implement-UUID-v7-as-per-IETF-draft.patch
Description: Binary data


Re: UUID v7

2024-01-18 Thread Andrey Borodin



> On 18 Jan 2024, at 20:39, Andrey Borodin  wrote:
> 
> But 16455577420ns after 1582-10-15 00:00:00 UTC  was  2022-02-22 19:22:22 
> UTC. And that was 2022-02-23 00:22:22 in UTC-05.


'2022-02-22 19:22:22 UTC' is exactly the moment that was encoded into the
example UUIDs. It's not '2022-02-23 00:22:22 in UTC-05' as I thought.
I got confused by "at time zone", which in fact removes the time zone
information. And that's per the SQL standard...

Now I'm completely lost in time... I've set local time to NY (UTC-5).

postgres=# select TIMESTAMP WITH TIME ZONE '2022-02-22 14:22:22-05' - TIMESTAMP 
WITH TIME ZONE 'Tuesday, February 22, 2022 2:22:22.00 PM GMT-05:00';
 ?column? 
----------
 10:00:00
(1 row)

postgres=# select TIMESTAMP WITH TIME ZONE 'Tuesday, February 22, 2022 
2:22:22.00 PM GMT-05:00';
      timestamptz       
------------------------
 2022-02-22 04:22:22-05
(1 row)

I cannot wrap my mind around it... Any pointers would be appreciated.
I'm certain that the code extracts the UTC time correctly; I just want a
reliable test that verifies the timestamp constant (and to understand what is
going on).
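As it turns out (see David G. Johnston's later reply, above in this digest),
POSIX-style zone names and ISO-8601 offsets disagree on which direction is
positive; a sketch of the effect:

-- POSIX-style 'GMT-05:00' means 5 hours east of Greenwich (UTC+5), while
-- the ISO-8601 offset '-05' means UTC-5, hence the 10:00:00 difference:
SELECT TIMESTAMPTZ '2022-02-22 14:22:22-05'
     - TIMESTAMPTZ '2022-02-22 14:22:22 GMT-05:00';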


Best regards, Andrey Borodin.



Re: UUID v7

2024-01-18 Thread Andrey Borodin



> On 18 Jan 2024, at 19:20, Aleksander Alekseev  
> wrote:
> 
> Timestamp and TimestampTz are absolutely the same thing.
My question is not about Postgres data types. I'm asking about examples in the 
standard.

There's an example 017F22E2-79B0-7CC3-98C4-DC0C0C07398F. It is expected to be
generated on "Tuesday, February 22, 2022 2:22:22.00 PM GMT-05:00".
It's explained to be 16455577420ns after 1582-10-15 00:00:00 UTC.

But 16455577420ns after 1582-10-15 00:00:00 UTC was 2022-02-22 19:22:22 UTC.
And that was 2022-02-23 00:22:22 in UTC-05.


Best regards, Andrey Borodin.



Re: UUID v7

2024-01-18 Thread Andrey Borodin


> On 17 Jan 2024, at 02:19, Jelte Fennema-Nio  wrote:

I want to ask Kyzer or Brad, I hope they will see this message. I'm working on 
the patch for time extraction for v1, v6 and v7.

Do I understand correctly that UUIDs contain local time, not UTC time? In
"A.6. Example of a UUIDv7 Value" in [0], I see that February 22, 2022
2:22:22.00 PM GMT-05:00 results in unix_ts_ms = 0x017F22E279B0, which is not
UTC, but local time.
Is that intentional? Section "5.1. UUID Version 1" states otherwise.

If so, I should swap the signatures of the functions from TimestampTz to
Timestamp. I'm hard-coding examples from the standard into tests, so I want to
be precise...

If I follow the standard I see this in tests:
+-- extract UUID v1, v6 and v7 timestamp
+SELECT uuid_extract_time('C232AB00-9414-11EC-B3C8-9F6BDECED846') at time zone 'GMT-05';
+         timezone         
+--------------------------
+ Wed Feb 23 00:22:22 2022
+(1 row)

The current patch version is attached. I've addressed all the other requests:
function renames, aliases, multiple functions instead of optional parameters,
cleaner catalog definitions, and not throwing an error when the [var,ver,time]
value is unknown.
What is left: deal with time zones, improve documentation.


Best regards, Andrey Borodin.

[0] 
https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-rfc4122bis#name-example-of-a-uuidv1-value


v10-0001-Implement-UUID-v7-as-per-IETF-draft.patch
Description: Binary data


Re: UUID v7

2024-01-16 Thread Andrey Borodin


> On 5 Jan 2024, at 15:57, Sergey Prokhorenko  
> wrote:

Sergey, Przemysław, Jelte, thanks for your feedback.
Here's v9. Changes:
1. Swapped the type of the argument to timestamptz in gen_uuid_v7()
2. Renamed get_uuid_v7_time() to uuid_v7_time()
3. Added uuid_ver() and uuid_var().

What do you think?

Best regards, Andrey Borodin.


v9-0001-Implement-UUID-v7-as-per-IETF-draft.patch
Description: Binary data


Re: Proposal to add page headers to SLRU pages

2023-12-07 Thread Andrey Borodin
07.12.2023, 19:17, "Aleksander Alekseev" :
> Hi,
>
>> +1 to the idea to protect SLRUs from corruption. I'm slightly leaning
>> towards the idea of separating checksums from data pages, but anyway
>> these checksums are better than no checksums.
>>
>>> On 7 Dec 2023, at 10:06, Li, Yong <y...@ebay.com> wrote:
>>> I am still working on patching the pg_upgrade. Just love to hear your
>>> thoughts on the idea and the current patch.
>>
>> FWIW you can take upgrade code from this patch [0] doing all the same
>> stuff :)
>
> Sounds like a half-measure to me. If we really want to go down this
> rabbit hole IMO SLRU should be moved to shared buffers as proposed
> elsewhere [1].

The thread that I cited stopped in 2018 for this exact reason. 5 years ago.
Is this argument still valid?
Meanwhile, checksums of buffer pages also reside on a page :)

Best regards, Andrey Borodin.

Re: UUID v7

2023-10-09 Thread Andrey Borodin
On Mon, Oct 9, 2023 at 11:11 PM Jelte Fennema  wrote:
>
> I think using `now()` is quite prone to sequence rollover. With the
> current patch inserting more than 2^18~=0.26M rows into a table with
> `gen_uuid_v7()` as the default in a single transaction would already
> cause sequence rollover.

Well, the current patch will just use now()+1ms when the 2^18 counter is
exhausted. Even if now() were passed as an argument (though the current patch
does not support an argument).
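A sketch of that overflow handling with hypothetical names (not the patch
itself): when the 18-bit per-millisecond counter is exhausted, borrow the
next millisecond instead of failing.

#include <stdint.h>

static uint64_t last_ms = 0;
static uint32_t seq = 0;        /* the 18-bit per-millisecond counter */

static uint64_t
next_ts_ms(uint64_t now_ms)
{
    if (now_ms > last_ms)
    {
        last_ms = now_ms;       /* fresh millisecond: reset the counter */
        seq = 0;
    }
    else if (++seq >= (1 << 18))
    {
        last_ms += 1;           /* counter exhausted: use now()+1ms */
        seq = 0;
    }
    return last_ms;
}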

Best regards, Andrey Borodin.




Re: psql \watch 2nd argument: iteration count

2023-04-06 Thread Andrey Borodin
On Thu, Apr 6, 2023 at 10:23 PM Tom Lane  wrote:
>
> Kirk Wolak  writes:
> > Marked as Ready for Committer.
>
> Pushed with a pretty fair number of cosmetic changes.

Great, thank you!

> If you write a semicolon first, you get four, but it's the semicolon
> producing the first result not \watch.

I did not know that. Well, I knew it in parts, but did not understand it as a
whole. Thanks!


Best regards, Andrey Borodin.




Re: Add "host" to startup packet

2023-04-02 Thread Andrey Borodin
Hi Lev,

02.04.2023, 14:43, "Lev Kokotov" :
> Patch attached below. TLDR, I'd like to add "host" to the startup packet.
>
> I'm trying to run multiple Postgres servers in a multi-tenant environment
> behind a pooler. Currently, the only way to differentiate Postgres
> databases is with the user/dbname combination which are very often
> included in the startup packet by the client. However, that's not
> sufficient if you have users that all want to have the user "admin" and
> the database "production" :)
>
> HTTP hosting providers solved this using the "Host" header, allowing the
> server to identify which website the client wants. In the case of
> Postgres, this is the DNS or IP address, depending on the client
> configuration.
>
> Upon receiving a startup packet with user, dbname, and host, the pooler
> (acting also as a proxy) can validate that the credentials exist for the
> host and that they are valid, and authorize or decline the connection.

I like the idea of giving the proxy information about the database tenant to
which the client wants to connect. However, the name "host" on the web was
chosen as part of the URL specification; I'm not sure it applies here.
And it is not clear from your description how the server should interpret
this information.

> I have never submitted a patch for Postgres before, so I'm not entirely
> sure how to test this change, although it seems pretty harmless. Any
> feedback and help are appreciated!

Well, at minimum, it should be clear from the patch what purpose the new
feature has, and some documentation and tests must be included. You can judge
from the recently committed libpq load balancing what it takes to add a
connection option [0]. But, obviously, it makes sense to discuss it before
going all the way to an implementation.

Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/commit/7f5b1981

Re: regression coverage gaps for gist and hash indexes

2023-03-31 Thread Andrey Borodin
Hi!

On Fri, Mar 31, 2023 at 8:07 AM Andres Freund  wrote:
>
> I was working on committing patch 0001 from [1] and was a bit confused about
> some of the changes to the WAL format for gist and hash index vacuum. It
> looked to me that the changed code just flat out would not work.
>
> Turns out the problem is that we don't reach deletion for hash and gist
> vacuum:
>
> gist:
>
> > Oh, I see. We apparently don't reach the gist deletion code in the tests:
> > https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#674
> > https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#174
> >
> > And indeed, if I add an abort() into , it's not reached.
> >
> > And it's not because tests use a temp table, the caller is also unreachable:
> > https://coverage.postgresql.org/src/backend/access/gist/gist.c.gcov.html#1643
>

GiST logs deletions in gistXLogUpdate(), which is covered.
gistXLogDelete() is only used for cleanup during page splits. I'd propose
refactoring GiST WAL to remove gistXLogDelete() and use gistXLogUpdate()
instead.
However, I see that gistXLogPageDelete() is not exercised, and that is worth
fixing IMO. Simply adding 10x more data in gist.sql helps, but I think we can
do something more clever...


Best regards, Andrey Borodin.




Re: psql \watch 2nd argument: iteration count

2023-03-24 Thread Andrey Borodin
On Thu, Mar 23, 2023 at 10:15 PM Yugo NAGATA  wrote:
>
> Here is my review on the v9 patch.
>
> +   /* we do not prevent numerous names iterations like i=1 
> i=1 i=1 */
> +   have_sleep = true;
>
> Why this is allowed here? I am not sure there is any reason to allow to 
> specify
> multiple "interval" options. (I would apologize it if I missed past 
> discussion.)
I do not know, it just seems normal to me. I've fixed this.

>  postgres=# select 1 \watch interval=3 4
>  \watch: incorrect interval value '4'
>
> I think it is ok, but this error message seems not user-friendly because,
> in the above example, interval values itself is correct, but it seems just
> a syntax error. I wonder it is better to use "watch interval must be specified
> only once" or such here, as the past patch.
Done.

>
> +
> +If number of iterations is specified - query will be executed only
> +given number of times.
> +
>
> Is it common to use "-" here?  I think using comma like
> "If number of iterations is specified, "
> is natural.
Done.

Thanks for the review!


Best regards, Andrey Borodin.


v10-0001-Iteration-count-argument-to-psql-watch-command.patch
Description: Binary data


Re: Amcheck verification of GiST and GIN

2023-03-17 Thread Andrey Borodin
Hi Peter,

Thanks for the feedback! I'll work on it during the weekend.

On Thu, Mar 16, 2023 at 6:23 PM Peter Geoghegan  wrote:
>
> existence of a "same" routine hints at some confusion about "equality
> versus equivalence" issues.

Hmm... yes, actually, GiST deals with floats routinely. And there might be
some sorts of NaNs and Infs that are equal, but not binary equivalent.
I'll think more about it.

gist_get_adjusted() calls the "same" routine, which for type point will use
FPeq(double A, double B). And this might be a kind of corruption out of the
box, because it's an epsilon comparison, ε = 1.0E-06.
GiST might miss newly inserted data, because the "adjusted" tuple was "same"
if the data is within 0.01 of any previously indexed point, but outside the
known MBRs.
I'll try to reproduce this tomorrow; so far no luck.
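The comparison in question, roughly as defined in PostgreSQL's geo_decls.h
(reproduced here for illustration):

#include <math.h>
#include <stdbool.h>

#define EPSILON 1.0E-06

static bool
FPeq(double A, double B)
{
    return A == B || fabs(A - B) <= EPSILON;
}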


Best regards, Andrey Borodin.




Re: psql \watch 2nd argument: iteration count

2023-03-16 Thread Andrey Borodin
On Wed, Mar 15, 2023 at 5:54 PM Michael Paquier  wrote:
>
> On Wed, Mar 15, 2023 at 04:58:49PM +0900, Michael Paquier wrote:
> > Yep.  You are right.
>
> Fixed that and applied 0001.
Great, thanks!

>
> +valptr++;
> +if (strncmp("i", opt, strlen("i")) == 0 ||
> +strncmp("interval", opt, strlen("interval")) == 0)
> +{
>
> Did you look at process_command_g_options() and if some consolidation
> was possible?  It would be nice to have APIs shaped so as more
> sub-commands could rely on the same facility in the future.
I've tried, but they behave so differently. I could reuse only the
"char *valptr = strchr(opt, '=');" thing from there :)
And process_command_g_options() changes data in-place...
Actually, I'm not sure having "i" == "interval" and "c" == "count" is a good
idea here either. I mean, I like it, but is it coherent?
Also, I do not like repeating these 5 lines 4 times:
+ pg_log_error("\\watch: incorrect interval value '%s'", valptr);
+ free(opt);
+ resetPQExpBuffer(query_buf);
+ psql_scan_reset(scan_state);
+ return PSQL_CMD_ERROR;
But I hesitate to define a new function for this...

>
> -\watch [  class="parameter">seconds ]
> +\watch [  class="parameter">seconds [  class="parameter">iterations ] ]
>
> This set of changes is not reflected in the documentation.
Done.

> With an interval in place, we could now automate some tests with
> \watch where it does not fail.  What do you think about adding a test
> with a simple query, an interval of 0s and one iteration?
Done. I also found a bug: we were actually doing N+1 iterations.

Thank you for working on this!

Best regards, Andrey Borodin.


v9-0001-Iteration-count-argument-to-psql-watch-command.patch
Description: Binary data


Re: psql \watch 2nd argument: iteration count

2023-03-14 Thread Andrey Borodin
On Tue, Mar 14, 2023 at 6:25 PM Michael Paquier  wrote:
>
> On Wed, Mar 15, 2023 at 10:19:28AM +0900, Kyotaro Horiguchi wrote:
> > I hesitated to propose such a level of simplification, but basically I
> > was also thinking the same thing.
+1

> Okay, fine by me to use one single message.  I'd rather still keep the
> three tests, though, as they check the three conditions upon which the
> error would be triggered.

PFA v8. Thanks!

Best regards, Andrey Borodin.


v8-0001-Fix-incorrect-argument-handling-in-psql-watch.patch
Description: Binary data


v8-0002-Add-iteration-count-argument-to-psql-watch-comman.patch
Description: Binary data


Re: psql \watch 2nd argument: iteration count

2023-03-13 Thread Andrey Borodin
On Mon, Mar 13, 2023 at 5:26 PM Michael Paquier  wrote:
>
> I have tweaked things as bit as of the attached, and ran pgindent.
> What do you think?
>

Looks good to me.
Thanks!

Best regards, Andrey Borodin.




Re: psql \watch 2nd argument: iteration count

2023-03-12 Thread Andrey Borodin
Michael, thanks for reviewing  this!

On Sun, Mar 12, 2023 at 6:17 PM Michael Paquier  wrote:
>
> On Sun, Mar 12, 2023 at 01:05:39PM -0700, Andrey Borodin wrote:
> > In the review above Kyotaro-san suggested that message should contain
> > information on what it expects... So, maybe then
> > pg_log_error("\\watch interval must be non-negative number, but
> > argument is '%s'", opt); ?
> > Or perhaps with articles? pg_log_error("\\watch interval must be a
> > non-negative number, but the argument is '%s'", opt);
>
> -   HELP0("  \\watch [SEC]   execute query every SEC seconds\n");
> +   HELP0("  \\watch [SEC [N]]   execute query every SEC seconds N 
> times\n");
>
> Is that really the interface we'd want to work with in the long-term?
> For one, this does not give the option to specify only an interval
> while relying on the default number of seconds.  This may be fine, but
> it does  not strike me as the best choice.  How about doing something
> more extensible, for example:
> \watch [ (option=value [, option=value] .. ) ] [SEC]
>
> I am not sure that this will be the last option we'll ever add to
> \watch, so I'd rather have us choose a design more flexible than
> what's proposed here, in a way similar to \g or \gx.
I've attached an implementation of this proposed interface (no tests and no
help message yet, though, sorry).
I tried it a little bit, and it works for me.
Fire the query 3 times:
SELECT 1;\watch c=3 0
or with a 200ms interval:
SELECT 1;\watch i=.2 c=3
nonsense, but correct:
SELECT 1;\watch i=1e-100 c=1

Actually, Nik was asking for the feature. Nik, what do you think?

On Sun, Mar 12, 2023 at 6:26 PM Michael Paquier  wrote:
>
> While on it, I have some comments about 0001.
>
> -sleep = strtod(opt, NULL);
> -if (sleep <= 0)
> -sleep = 1;
> +char *opt_end;
> +sleep = strtod(opt, &opt_end);
> +if (sleep < 0 || *opt_end)
> +{
> +pg_log_error("\\watch interval must be non-negative number, "
> + "but argument is '%s'", opt);
> +free(opt);
> +resetPQExpBuffer(query_buf);
> +psql_scan_reset(scan_state);
> +return PSQL_CMD_ERROR;
> +}
>
> Okay by me to make this behavior a bit better, though it is not
> something I would backpatch as it can influence existing workflows,
> even if they worked in an inappropriate way.
+1

> Anyway, are you sure that this is actually OK?  It seems to me that
> this needs to check for three things:
> - If sleep is a negative value.
> - errno should be non-zero.
I think we can treat errno and negative values equally.
> - *opt_end == opt.
>
> So this needs three different error messages to show the exact error
> to the user?
I've tried this approach, but could not come up with sufficiently
different error messages...

>  Wouldn't it be better to have a couple of regression
> tests, as well?
Added two tests.

Thanks!


Best regards, Andrey Borodin.


v6-0002-Add-iteration-count-argument-to-psql-watch-comman.patch
Description: Binary data


v6-0001-Fix-incorrect-argument-handling-in-psql-watch.patch
Description: Binary data


Re: psql \watch 2nd argument: iteration count

2023-03-12 Thread Andrey Borodin
On Thu, Mar 9, 2023 at 11:25 AM Nathan Bossart  wrote:
>
> +   pg_log_error("Watch period must be 
> non-negative number, but argument is '%s'", opt);
>
> After looking around at the other error messages in this file, I think we
> should make this more concise.  Maybe something like
>
> pg_log_error("\\watch: invalid delay interval: %s", opt);
In the review above, Kyotaro-san suggested that the message should contain
information on what it expects... So, maybe:
pg_log_error("\\watch interval must be non-negative number, but
argument is '%s'", opt); ?
Or perhaps with articles? pg_log_error("\\watch interval must be a
non-negative number, but the argument is '%s'", opt);

>
> +   free(opt);
> +   resetPQExpBuffer(query_buf);
> +   return PSQL_CMD_ERROR;
>
> Is this missing psql_scan_reset(scan_state)?
Yes, fixed.

Best regards, Andrey Borodin.


v5-0002-Add-iteration-count-argument-to-psql-watch-comman.patch
Description: Binary data


v5-0001-Fix-incorrect-argument-handling-in-psql-watch.patch
Description: Binary data


Re: psql \watch 2nd argument: iteration count

2023-03-08 Thread Andrey Borodin
On Wed, Mar 8, 2023 at 10:49 AM Nathan Bossart  wrote:
>
> Is there any reason to disallow 0 for the sleep argument?  I often use
> commands like "\watch .1" to run statements repeatedly with very little
> time in between, and I'd use "\watch 0" instead if it was available.
>

Yes, that makes sense! Thanks!

Best regards, Andrey Borodin.


v4-0002-Add-iteration-count-argument-to-psql-watch-comman.patch
Description: Binary data


v4-0001-Fix-incorrect-argument-handling-in-psql-watch.patch
Description: Binary data


Re: [EXTERNAL] Re: Support load balancing in libpq

2023-03-02 Thread Andrey Borodin
On Wed, Mar 1, 2023 at 12:03 PM Jelte Fennema  wrote:
>
> done and updated cf entry
>

Hi Jelte!

I've looked into the patch. Although many improvements could be suggested, it
definitely makes sense as-is too.
Such improvements might be, for example, sorting hosts according to ping
latency or something like that. Or, perhaps, some other balancing policies.
Anyway, randomizing is a good start.

I want to note that the Fisher-Yates algorithm is implemented in a
difficult-to-understand manner.
+if (j < i) /* avoid fetching undefined data if j=i */
This check does not make sense when shuffling an array in place. It is
important only when making a new copy of an array, and only in languages that
cannot access uninitialized values. I'd suggest just removing this line (in
both cases).
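For reference, a minimal in-place Fisher-Yates shuffle: the guard is
unnecessary because the swap is a harmless no-op when j == i, and xs[i] is
always initialized.

#include <stdlib.h>

static void
shuffle(int *xs, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = n - 1; i > 0; i--)
    {
        size_t j = (size_t) rand() % (i + 1);   /* 0 <= j <= i */
        int    tmp = xs[i];

        xs[i] = xs[j];
        xs[j] = tmp;
    }
}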


Best regards, Andrey Borodin.




Re: psql \watch 2nd argument: iteration count

2023-02-26 Thread Andrey Borodin
Thanks for looking into this!

On Mon, Feb 20, 2023 at 6:15 PM Kyotaro Horiguchi
 wrote:
>
>  FWIW the patch still accepts an incorrect parameter '1abc'
> by ignoring any trailing garbage.
Indeed, fixed.
>
> In any case, I reckon the error message should be more specific. In
> other words, it would be better if it suggests the expected input
> format and range.
+1.
Not a range, actually, because an upper limit makes no sense for a user.

>
> Regarding the second patch, if we want \watch to throw an error
> message for the garbage trailing to sleep times, I think we should do
> the same for iteration counts.
+1, done.

> Additionally, we need to update the
> documentation.
Done.

Thanks for the review!

Best regards, Andrey Borodin.


v3-0002-Add-iteration-count-argument-to-psql-watch-comman.patch
Description: Binary data


v3-0001-Fix-incorrect-argument-handling-in-psql-watch.patch
Description: Binary data


Re: psql \watch 2nd argument: iteration count

2023-02-20 Thread Andrey Borodin
> > > Also, if we do not provide a timespan, 2 seconds are selected. But if
> > > we provide an incorrect argument - 1 second is selected.
> > > PFA the patch that adds iteration count argument and makes timespan
> > > argument more consistent.
> >
> > That should probably be fixed.
>
> And should probably be independent patches.
>

PFA 2 independent patches.

Also, I've fixed the place where we break after an iteration. Now if we have
e.g. 2 iterations, there will be only 1 sleep.

Thanks!


Best regards, Andrey Borodin.


v2-0001-Fix-incorrect-argument-handling-in-psql-watch.patch
Description: Binary data


v2-0001-Add-iteration-count-argument-to-psql-watch-comman.patch
Description: Binary data


Re: [PATCH] Add pretty-printed XML output option

2023-02-16 Thread Andrey Borodin
On Thu, Feb 16, 2023 at 2:12 PM Jim Jones  wrote:
>
> I'm squashing v12-0001 and v12-0002 (v13 attached).
I've looked into the patch. The code looks like it conforms to the usual expectations.
One nit: this comment should have just one asterisk.
+ /**
And I have a dumb question: is this function protected from using external
XML namespaces? What if the user passes some xmlns that would force it to
read namespace data from the server filesystem? Or is that not possible? I
see there are a lot of calls to xml_parse() anyway, but still...

Best regards, Andrey Borodin.




psql \watch 2nd argument: iteration count

2023-02-16 Thread Andrey Borodin
Hi hackers!

From time to time I want to collect some stats from locks, activity and other
stat views into one table from different points in time. In this case the
\watch psql command is very handy. However, it's not currently possible to
specify the number of times a query is performed.
Also, if we do not provide a timespan, 2 seconds are selected. But if we
provide an incorrect argument, 1 second is selected.
PFA the patch that adds an iteration count argument and makes the timespan
argument more consistent.
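With the patch, the proposed syntax is \watch [SEC [N]], so for example
select * from pg_stat_activity \watch 2 10
would run the query every 2 seconds, 10 times.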
What do you think?


Best regards, Andrey Borodin.


0001-Add-iteration-count-argument-to-psql-watch-command.patch
Description: Binary data


Re: Add connection active, idle time to pg_stat_activity

2023-02-16 Thread Andrey Borodin
On Wed, Feb 1, 2023 at 12:46 PM Sergey Dudoladov
 wrote:
>
> I've sketched the first version of a patch to add pg_stat_session.
> Please review this early version.

Hi Sergey!

I've taken a look at the patch and have some notes.
1. It is hard to understand what the fastpath backend state is. What do
fastpath metrics mean for a user?
2. Anyway, the path "if (PGSTAT_IS_FASTPATH(beentry))" seems unreachable to
me. I'm a bit surprised that compilers do not produce warnings about it.
Maybe I'm just wrong.
3. The tests do not check any incrementation logic. I think we could have a
test that verifies the delta for select some_counter from pg_stat_session
where pid = pg_backend_pid();
4. Macros like PGSTAT_IS_RUNNING do not look like a net win in code
readability, and the PGSTAT prefix carries no semantic load.


That's all I've found so far. Thank you!

Best regards, Andrey Borodin.

PS. We did an on-air review session [0]; I hope Nik will chime in with the
usability part of the review.

[0] https://youtu.be/vTV8XhWf3mo?t=2404




Re: Silent overflow of interval type

2023-02-15 Thread Andrey Borodin
On Wed, Feb 15, 2023 at 7:08 AM Nikolai  wrote:
>
> The patch attached simply throws an error when an overflow is
> detected. However I'm not sure this is a reasonable approach for a
> code path that could be very hot in some workloads.

Given the extraordinary number of overflow checks in the nearby code of
timestamp.c, I'd say that this case should not be an exception.
By chance, did you look at all the other nearby cases; is this the only place
with an overflow? (I took a look too, but haven't found anything suspicious.)
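For reference, the usual idiom in that code, in backend context
(pg_add_s64_overflow() is from src/include/common/int.h; the surrounding
variable names here are illustrative):

int64 result;

if (unlikely(pg_add_s64_overflow(ts, span_us, &result)))
    ereport(ERROR,
            (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
             errmsg("timestamp out of range")));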

Best regards, Andrey Borodin.




Re: Buffer usage detailed by RelKind in EXPLAIN ANALYZE BUFFERS

2023-02-13 Thread Andrey Borodin
On Mon, Feb 13, 2023 at 4:39 PM Andres Freund  wrote:
>
> The problem I'm talking about is the increased overhead in InstrStopNode(),
> due to BufferUsageAccumDiff() getting more expensive.
>

Thanks, now I understand the problem better. According to godbolt.com, my
patch doubles the number of instructions in this function, unless we count
only tables/indexes/matviews.
Anyway, leaving aside the functionality of this particular patch,
BufferUsageAccumDiff() does not seem slow to me. It's just a branchless bunch
of SIMD instructions. The performance of this function might matter only when
it is called a gazillion times per second.


Best regards, Andrey Borodin.


for godbolt.cpp
Description: Binary data


Re: Buffer usage detailed by RelKind in EXPLAIN ANALYZE BUFFERS

2023-02-13 Thread Andrey Borodin
On Mon, Feb 13, 2023 at 4:29 PM Andres Freund  wrote:
> > 1. Some more increments on hot paths. We have to add this tiny toll to
> > every single buffer hit, but it will be seldom of any use.
>
> Additionally, I bet it slows down EXPLAIN (ANALYZE, BUFFERS) noticeably. It's
> already quite expensive...
>

I think the collection of instrumentation is done unconditionally.
We always do
pgBufferUsage.shared_blks_hit++;
when the buffer is in shared_buffers.


Best regards, Andrey Borodin.




Buffer usage detailed by RelKind in EXPLAIN ANALYZE BUFFERS

2023-02-13 Thread Andrey Borodin
Hi hackers!

I was asked to prototype a feature that helps to distinguish shared
buffer usage between index reads and heap reads. Practically it looks
like this:

# explain (analyze,verbose,buffers) select nextval('s');
                               QUERY PLAN
------------------------------------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=8) (actual time=1.159..1.160
rows=1 loops=1)
   Output: nextval('s'::regclass)
   Buffers: shared hit=7(relation 3, index 4) read=6(relation 1, index
4, sequence 1) dirtied=1
 Planning Time: 0.214 ms
 Execution Time: 1.238 ms
(5 rows)


The change is in these parts: "(relation 3, index 4)" and "(relation 1,
index 4, sequence 1)". It should help DBAs better understand complex queries.
I think cluttering the output with more text is OK as long as "verbose" is
requested.

But there are some caveats:
1. Some more increments on hot paths. We have to add this tiny toll to every
single buffer hit, but it will seldom be of any use.
2. It's difficult to measure the writes caused by a query, and even the
dirtied pages. The patch adds "evictions" caused by the query, but they have
little practical sense too.

All in all, I do not have an opinion on whether this feature is a good
tradeoff.
What do you think? Does the feature look useful? Do we want a more polished
implementation?

Thanks!


Best regards, Andrey Borodin.


v1-0001-Split-EXPLAIN-ANALYZE-BUFFERS-by-RelKind.patch
Description: Binary data


Re: pglz compression performance, take two

2023-02-13 Thread Andrey Borodin
On Mon, Feb 13, 2023 at 4:03 PM Andres Freund  wrote:
>
> Due to the sanitizer changes, and this feedback, I'm marking the entry as
> waiting on author.
>
Thanks, Andres! Yes, I plan to make another attempt to refactor this patch on
the weekend. If that attempt fails, I think we should just reject it and I'll
get back to it during the summer.

Best regards, Andrey Borodin.




Re: UUID v7

2023-02-10 Thread Andrey Borodin
On Fri, Feb 10, 2023 at 5:14 PM Andres Freund  wrote:
>
> Perhaps we should name the function something like
> gen_time_ordered_random_uuid() instead?  That gives us a bit more flexibility
> about what uuid version we generate. And it might be easier for users, anyway.
I think users would be happy with any name.

> Still not sure what version we'd best use for now. Perhaps v8?
V8 is just a "custom data" format, like "place whatever you want". Though I
agree that its sample implementation looks better.



On Fri, Feb 10, 2023 at 5:18 PM Tom Lane  wrote:
>
> > Hm. It seems somewhat worrisome to claim something is a v7 UUID when it 
> > might
> > turn out to not be one.
>
> I think there is no need to rush this into v16.  Let's wait for the
> standardization process to play out.
>

Standardization per se does not bring value to users. However, I agree that
eager users can just have it today as an extension and be happy with it [0].
Maybe it's fine to wait a year for the others...


Best regards, Andrey Borodin.


[0] https://github.com/x4m/pg_uuid_next




UUID v7

2023-02-10 Thread Andrey Borodin
Hello pgsql-hackers!

As you may know, there's a new version of UUID being standardized [0].
These new algorithms of UUID generation are very promising for database
performance: they keep data locality for time-ordered values.
From my POV, v7 is especially needed by users. The current status of the
standard is "draft", and I'm not sure it will be accepted before our feature
freeze for PG16. Maybe we could provide a draft implementation in 16 and
adjust it to the accepted version if the standard changes? PFA a patch with
the implementation.
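For reference, a rough sketch of the v7 layout per the draft (this is not the
attached patch; the names are illustrative): a 48-bit big-endian unix_ts_ms,
followed by version and variant bits set over random data.

#include <stdint.h>
#include <string.h>

static void
uuidv7_from_parts(uint8_t out[16], uint64_t unix_ts_ms,
                  const uint8_t random10[10])
{
    for (int i = 0; i < 6; i++)
        out[i] = (uint8_t) (unix_ts_ms >> (40 - 8 * i));
    memcpy(&out[6], random10, 10);
    out[6] = (out[6] & 0x0F) | 0x70;    /* version = 7 */
    out[8] = (out[8] & 0x3F) | 0x80;    /* variant = 0b10 */
}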

What do you think?

cc Brad Peabody and Kyzer R. Davis, authors of the standard
cc Kirk Wolak and Nik Samokhvalov who requested the feature

Thanks!

Best regards, Andrey Borodin.

[0] 
https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format-04


v1-0001-Implement-UUID-v7-as-per-IETF-draft.patch
Description: Binary data


Re: Too coarse predicate locks granularity for B+ tree indexes

2023-02-07 Thread Andrey Borodin
On Tue, Feb 7, 2023 at 4:01 AM Rinat Shigapov  wrote:
>
> Thomas, thank you for the details!
>
> Have you kept the branch that you used to generate the patch? Which commit 
> should the patch apply to?
>

You can try something like
git checkout 'master@{2018-05-13 13:37:00}'
to get a commit by date from rev-parse.

Best regards, Andrey Borodin.




Re: pglz compression performance, take two

2023-02-06 Thread Andrey Borodin
On Mon, Feb 6, 2023 at 11:57 AM Tomas Vondra
 wrote:
>
> I wonder what that means for the patch. I haven't investigated this at
> all, but it seems as if the optimization means we fail to find a match,
> producing a tad larger output. That may be still be a good tradeoff, as
> long as the output is correct (assuming it doesn't break some promise
> regarding expected output).
>

Yes, the patch produces correct results, and faster. And it keeps the
compression ratio the same, except for one odd case.
The only problem is that I do not understand _why_ it happens in that odd
case. And so far I have failed to extract the inputs/outputs of that odd
case, because it is buried under so many layers of abstraction and affects
only late stats.
Maybe the problem is not in the compression at all...

Best regards, Andrey Borodin.




Re: pglz compression performance, take two

2023-02-05 Thread Andrey Borodin
On Sun, Feb 5, 2023 at 5:51 PM Tomas Vondra
 wrote:
>
> On 2/5/23 19:36, Andrey Borodin wrote:
> > On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin  
> > wrote:
> >>
> >> Hello! Please find attached v8.
> >
> > I got some interesting feedback from some patch users.
> > There was an oversight that frequently yielded results that are 1,2 or
> > 3 bytes longer than expected.
> > Looking closer I found that the correctness of the last 3-byte tail is
> > checked in two places. PFA fix for this. Previously compressed data
> > was correct, however in some cases few bytes longer than the result of
> > current pglz implementation.
> >
>
> Thanks. What were the consequences of the issue? Lower compression
> ratio, or did we then fail to decompress the data (or would current pglz
> implementation fail to decompress it)?
>
The data was decompressed fine. But extension tests (Citus's columnar engine)
hard-code a lot of compression ratio stuff.
And there is still one more test where the optimized version produces a
1-byte-longer output. I'm trying to find it, but with no success yet.

There are known and documented cases when the optimized pglz version would do
so: good_match without the division by 10, and memcmp by 4 bytes. But even
with these disabled, the 1-byte-longer compression results persist... The
length changes after deleting some data, so the compression of that
particular sequence seems to happen somewhere far away.
It was fun at the beginning, hunting for 1 byte. But the weekend is ending,
and it seems that byte has slipped away from me again...

Best regards, Andrey Borodin.




Re: pglz compression performance, take two

2023-02-05 Thread Andrey Borodin
On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin  wrote:
>
> Hello! Please find attached v8.

I got some interesting feedback from some patch users.
There was an oversight that frequently yielded results that are 1, 2 or 3
bytes longer than expected.
Looking closer, I found that the correctness of the last 3-byte tail is
checked in two places. PFA a fix for this. The previously compressed data was
correct, however in some cases a few bytes longer than the result of the
current pglz implementation.

Thanks!

Best regards, Andrey Borodin.


v9-0001-Reorganize-pglz-compression-code.patch
Description: Binary data


v9-0002-Fix-oversight-for-compression-of-last-4-bytes.patch
Description: Binary data


Re: Amcheck verification of GiST and GIN

2023-02-04 Thread Andrey Borodin
Thank for working on this, Peter!

On Fri, Feb 3, 2023 at 6:50 PM Peter Geoghegan  wrote:
>
> I think that we should focus on getting the GiST patch into shape for
> commit first, since that seems easier.
>

Here's the next version. I've focused on the GiST part in this revision.
Changes:
1. Refactored index_checkable so that it is shared between all AMs.
2. Renamed gist_index_parent_check -> gist_index_check
3. Gathered reviewers (in no particular order). I hope I didn't forget
anyone. The GIN patch is based on work by Grigory Kryachko, but was
essentially rewritten by Heikki, and somewhat cosmetically whacked by me.
4. Extended the comments for GistScanItem,
gist_check_parent_keys_consistency() and gist_refind_parent().

I tried adding support for GiST in pg_amcheck, but it largely assumes the
relation is either a heap or a B-tree. I hope to do that part tomorrow or in
the near future.

Here's the current version. Thank you!


Best regards, Andrey Borodin.


v23-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


v23-0002-Add-gist_index_check-function-to-verify-GiST-ind.patch
Description: Binary data


v23-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


Re: How to implement read operations for my own access method?

2023-02-04 Thread Andrey Borodin
On Sat, Feb 4, 2023 at 12:59 AM jack...@gmail.com  wrote:
>
>
> Hi, I'm trying to implement my own access method. But I find the functions
> about reading difficult.
> Can you give me an existing easy extension that implements the
> tableamroutine to reference? I hope
> it's not complicated like heap_am, and it supports insert SQLs and select
> SQLs. Thanks
>

Hi Jack,

I'd recommend starting with the official documentation on index access
methods [0].
Please also check the contrib/bloom module [1]. It is designed to showcase
the index-as-an-extension technology.
Also, you can see my free lectures about the implementation details of the
built-in access methods [2]. And I can offer you my PGCon talk "Index DIY"
about forking GiST into an extension [3].

Thank you!

Best regards, Andrey Borodin.
[0] https://www.postgresql.org/docs/current/xindex.html
[1] https://github.com/postgres/postgres/tree/REL_15_STABLE/contrib/bloom
[2] 
https://www.youtube.com/watch?v=UgSeSo973lA&list=PLzrhBdcKTjLTyCIdDO1ig8qCZFZYYi3VH
[3] https://github.com/x4m/index_diy




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-02-04 Thread Andrey Borodin
On Sat, Feb 4, 2023 at 2:57 AM Andres Freund  wrote:
>
> Is there a good way to make breakage in the page recycling mechanism
> visible with gist? I guess to see corruption, I'd have to halt a scan
> before a page is visited with gdb, then cause the page to be recycled
> prematurely in another session, then unblock the first? Which'd then
> visit that page, thinking it to be in a different part of the tree than
> it actually is?
>

In most cases, landing on one extra page will not affect the scan.
The worst case that I can imagine: the scan lands on a page that is the new
parent of the deleted page. Even then we cannot end up with an infinite index
scan; we will just make one extra loop. Although the IndexScan will yield
duplicate TIDs.

In case of interference with a concurrent insertion, we will get a tree
structure that departs from optimal, but that is not a problem.

Best regards, Andrey Borodin.




Re: Something is wrong with wal_compression

2023-01-27 Thread Andrey Borodin
On Fri, Jan 27, 2023 at 7:40 PM Tom Lane  wrote:
>
> What are you using it for, that you don't care whether the answer
> is trustworthy?
>

It's not trustworthy anyway. An xid wraparound might happen during the
reconnect. I suspect we could design a test showing that it does not always
give correct results during the xid->2pc conversion (there is a point in time
when the xid is neither among the regular xids nor in 2pc, and I'm not sure
ProcArrayLock is held). Maybe there are other edge cases.

Anyway, if a user wants to know the status of an xid in case of a
disconnection, they have prepared xacts.

Best regards, Andrey Borodin.




Re: Something is wrong with wal_compression

2023-01-26 Thread Andrey Borodin
On Thu, Jan 26, 2023 at 3:04 PM Tom Lane  wrote:
>
> Indeed, it seems like this behavior makes pg_xact_status() basically
> useless as things stand.
>

If we agree that xid allocation is not something persistent, let's fix the
test. We can replace the check with select * from pg_class or, maybe, add an
amcheck run.
As far as I recollect, this test was introduced to test this new function in
857ee8e391f.


Best regards, Andrey Borodin.




Re: Something is wrong with wal_compression

2023-01-26 Thread Andrey Borodin
On Thu, Jan 26, 2023 at 12:12 PM Tom Lane  wrote:
>
> That test case is demonstrating fundamental
> database corruption after a crash.
>

Not exactly corruption. The XID was not persisted and the buffer data did not
hit disk. The database is in a correct state.

It was discussed long before WAL compression, here [0]. The thing is, it is
easier to reproduce with compression, but compression has nothing to do with
it, as far as I understand.

The proposed fix is here [1], but I think it's better to fix the test. It
should not verify the Xid, but rather the side effects of "CREATE TABLE
mine(x integer);".


Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/565FB155-C6B0-41E2-8C44-7B514DC25132%2540yandex-team.ru
[1] 
https://www.postgresql.org/message-id/flat/20210313012820.GJ29463%40telsasoft.com#0f18d3a4d593ea656fdc761e026fee81




Re: Experiments with Postgres and SSL

2023-01-18 Thread Andrey Borodin
On Wed, Jan 18, 2023 at 7:16 PM Greg Stark  wrote:
>
> So I took a look into what it would take to do and I think it would
> actually be quite feasible. The first byte of a standard TLS
> connection can't look anything like the first byte of any flavour of
> Postgres startup packet because it would be the high order bits of the
> length so unless we start having multi-megabyte startup packets
>

This is a fascinating idea! I like it a lot.
But... do we have to treat any unknown starting sequence of bytes as a TLS
connection? Or is there some definite subset of possible first bytes that
clearly indicates whether this is a TLS connection?
Best regards, Andrey Borodin.




Re: Amcheck verification of GiST and GIN

2023-01-13 Thread Andrey Borodin
On Fri, Jan 13, 2023 at 7:35 PM Jose Arthur Benetasso Villanova
 wrote:
>
> Hello again. I see the change. Thanks
>

Thanks! I also found out that there was a CI complaint about amcheck.h
not including some necessary stuff. Here's a version with a fix for
that.

Best regards, Andrey Borodin.


v21-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


v21-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


v21-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


Re: Amcheck verification of GiST and GIN

2023-01-13 Thread Andrey Borodin
On Fri, Jan 13, 2023 at 3:46 AM Jose Arthur Benetasso Villanova
 wrote:
>
> The only thing that I found is the gin_index_parent_check function in docs
> still references the "gin_index_parent_check(index regclass,
> heapallindexed boolean) returns void"
>

Correct! Please find the attached fixed version.

Thank you!

Best regards, Andrey Borodin.


v20-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


v20-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


v20-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


Re: Transaction timeout

2023-01-13 Thread Andrey Borodin
Thanks for the review Nikolay!

On Fri, Jan 13, 2023 at 8:03 AM Nikolay Samokhvalov
 wrote:
>
> 1) The current test set has only 2 simple cases – I'd suggest adding one more 
> (that one that didn't work in v1):
>
> gitpod=# set transaction_timeout to '20ms';
> SET
> gitpod=# begin; select pg_sleep(.01); select pg_sleep(.01); select 
> pg_sleep(.01); commit;
I tried exactly these tests; they were unstable on Windows. Maybe that OS has
a coarser timer resolution.
It's a tradeoff between the time spent on tests, the strength of the test,
and the probability of false failures. I chose small timings that give no
false alarms.

> – it seems we could (should) have one more successful "1s wait, 3s sleep" 
> iteration here, ~727ms somehow wasted in a loop, quite a lot.

I think a big chunk of these 727ms was spent between "BEGIN" and
"select now(), clock_timestamp(), pg_sleep(3) \watch 1". I doubt the patch
really contains arithmetic errors.

Many thanks for looking into this!


Best regards, Andrey Borodin.




Re: Transaction timeout

2023-01-12 Thread Andrey Borodin
On Thu, Jan 12, 2023 at 11:24 AM Nathan Bossart
 wrote:
>
> On Sun, Dec 18, 2022 at 12:53:31PM -0800, Andrey Borodin wrote:
> > I've rewritten this part to correctly report all timeouts that did
> > happen. However there's now a tricky comma-formatting code which was
> > tested only manually.
>
> I suspect this will make translation difficult.
I use special functions for this, like _():

char* lock_reason = lock_timeout_occurred ? _("lock timeout") : "";

and then
ereport(ERROR, (errcode(err_code),
 errmsg("canceling statement due to %s%s%s%s%s", lock_reason, comma1,
 stmt_reason, comma2, tx_reason)));

I hope it will be translatable...

> >> > > > + ahprintf(AH, "SET transaction_timeout = 0;\n");
> >> > >
> >> > > Hm - why is that the right thing to do?
> >> > Because transaction_timeout has effects of statement_timeout.
> >>
> >> I guess it's just following precedent - but it seems a bit presumptuous
> >> to just disable safety settings a DBA might have set up. That makes some
> >> sense for e.g. idle_in_transaction_session_timeout, because I think
> >> e.g. parallel backup can lead to a connection being idle for a bit.
> >
> > I do not know. My reasoning - everywhere we turn off
> > statement_timeout, we should turn off transaction_timeout too.
> > But I have no strong opinion here. I left this code as is in the patch
> > so far. For the same reason I did not change anything in
> > pg_backup_archiver.c.
>
> From 8383486's commit message:
>
> We disable statement_timeout and lock_timeout during dump and restore,
> to prevent any global settings that might exist from breaking routine
> backups.
>
> I imagine changing this could disrupt existing servers that depend on these
> overrides during backups, although I think Andres has a good point about
> disabling safety settings.  This might be a good topic for another thread.
>
+1.

Thanks for the review!

Best regards, Andrey Borodin.




Re: MultiXact\SLRU buffers configuration

2023-01-11 Thread Andrey Borodin
Hi Dilip! Thank you for the review!

On Tue, Jan 10, 2023 at 9:58 PM Dilip Kumar  wrote:
>
> On Mon, Jan 9, 2023 at 9:49 AM Andrey Borodin  wrote:
> >
> > On Tue, Jan 3, 2023 at 5:02 AM vignesh C  wrote:
> > > does not apply on top of HEAD as in [1], please post a rebased patch:
> > >
> > Thanks! Here's the rebase.
>
> I was looking into this patch, it seems like three different
> optimizations are squeezed in a single patch
> 1) dividing buffer space in banks to reduce the seq search cost 2) guc
> parameter for buffer size scale 3) some of the buffer size values are
> modified compared to what it is on the head.  I think these are 3
> patches which should be independently committable.
There's no point in dividing the SLRU buffers into parts unless the buffer
size is configurable.
And it's only possible to enlarge the default buffer size if the SLRU buffers
are divided into banks.
So the features can be viewed as independent commits, but they make no sense
separately.

Still, it's probably a good idea to split the patch back anyway, for easier
review.

>
> While looking into the first idea of dividing the buffer space in
> banks, I see that it will speed up finding the buffers but OTOH while
> searching the victim buffer it will actually can hurt the performance
> the slru pages which are frequently accessed are not evenly
> distributed across the banks.  So imagine the cases where we have some
> banks with a lot of empty slots and other banks from which we
> frequently have to evict out the pages in order to get the new pages
> in.
>

Yes. Despite the extremely low probability of such a case, this pattern,
where a user accesses pages assigned to only one bank, may happen.
That case is equivalent to having just one bank, which means small buffers,
just as we have now.
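A sketch of the bank mapping with hypothetical names: each page can reside
only in its own bank, so a linear search touches bank_size slots instead of
the whole buffer array.

static int
slru_bank_start(int pageno, int num_banks, int bank_size)
{
    /* pages are striped across banks by page number */
    return (pageno % num_banks) * bank_size;
}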

Thank you!

Best regards, Andrey Borodin.




Re: MultiXact\SLRU buffers configuration

2023-01-08 Thread Andrey Borodin
On Tue, Jan 3, 2023 at 5:02 AM vignesh C  wrote:
> does not apply on top of HEAD as in [1], please post a rebased patch:
>
Thanks! Here's the rebase.

Best regards, Andrey Borodin.


v24-0001-Divide-SLRU-buffers-into-8-associative-banks.patch
Description: Binary data


Re: Amcheck verification of GiST and GIN

2023-01-08 Thread Andrey Borodin
On Sun, Jan 8, 2023 at 8:05 PM Andrey Borodin  wrote:
>
> Please find the attached new version. In this patchset heapallindexed
> flag is removed from GIN checks.
>
Uh... sorry, I git-formatted the wrong branch.
Here's the correct version, double-checked.

Best regards, Andrey Borodin.


v19-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


v19-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


v19-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


Re: Amcheck verification of GiST and GIN

2023-01-08 Thread Andrey Borodin
Hi Jose, thank you for the review and sorry for the long delay in answering.

On Wed, Dec 14, 2022 at 4:19 AM Jose Arthur Benetasso Villanova
 wrote:
>
>
> On Sun, 27 Nov 2022, Andrey Borodin wrote:
>
> > On Sun, Nov 27, 2022 at 1:29 PM Andrey Borodin  
> > wrote:
> >>
> > I was wrong. GIN check does similar gin_refind_parent() to lock pages
> > in bottom-up manner and truly verify downlink-child_page invariant.
>
> Does this mean that we need the adjustment in docs?
It seems to me that gin_index_parent_check() docs are correct.

>
> > Here's v17. The only difference is that I added progress reporting to
> > GiST verification.
> > I still did not implement heapallindexed for GIN. Existence of pending
> > lists makes this just too difficult for a weekend coding project :(
> >
> > Thank you!
> >
> > Best regards, Andrey Borodin.
> >
>
> I'm a bit lost here. I tried your patch again and indeed the
> heapallindexed inside gin_check_parent_keys_consistency has a TODO
> comment, but it's unclear to me if you are going to implement it or if the
> patch "needs review". Right now it's "Waiting on Author".
>

Please find the attached new version. In this patchset the heapallindexed
flag is removed from the GIN checks.

Thank you!

Best regards, Andrey Borodin.


v18-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


v18-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


v18-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


Re: pglz compression performance, take two

2023-01-06 Thread Andrey Borodin
On Sun, Nov 27, 2022 at 10:43 AM Andrey Borodin  wrote:
>
> PFA review fixes (step 1 is unchanged).

Hello! Please find attached v8.
Changes are mostly cosmetic:
1. The 2 steps from the previous message were squashed together.
2. I tried to write a better commit message.

Thanks!

Best regards, Andrey Borodin.


v8-0001-Reorganize-pglz-compression-code.patch
Description: Binary data


Re: GROUP BY ALL

2023-01-06 Thread Andrey Borodin
On Fri, Jan 6, 2023 at 1:56 PM Bruce Momjian  wrote:
> Because Postgres requires GROUP BY
> of all non-aggregate columns of a target list, Postgres could certainly
> automatically generate the GROUP BY.  However, readers of the query
> might not easily distinguish function calls from aggregates, so in a way
> the GROUP BY is for the reader, not for the database server.
>

How about "SELECT a,b, count(*) FROM t GROUP AUTOMATICALLY;" ? And
then a shorthand for "SELECT a,b, count(*) FROM t GROUP;".

Anyway, the problem is not in clever syntax, but in the fact that it's
an SQL extension, not a standard...

Best regards, Andrey Borodin.




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-01-05 Thread Andrey Borodin
On Mon, Dec 19, 2022 at 6:41 AM Maxim Orlov  wrote:
>
> As always, reviews and opinions are very welcome.


Hi! I think that 64-bit xids are a very important feature and I want
to help advance it. That's why I want to try to understand the patch
better.
Do I get it right that the proposed v51 patchset only changes the SLRU
filenames and the type of the pageno representation? Is the SLRU wraparound
still exactly there after the 0x byte?

The thing is, we had some nasty bugs because SLRU wraparound is tricky.
And I think it would be beneficial if we could get to a continuous SLRU
space. But the patch seems to avoid addressing this problem.

Also, I do not understand the reason for splitting the 1st and 2nd
steps. Certainly, there must be some logic behind it, but I just can't
grasp it...
And the purpose of the 3rd step with the pg_upgrade changes is a complete
mystery to me. Please excuse my incompetence in the topic, but maybe
a commit message or some comments would help. What kind of xact segment
conversion do we do? Why is it only necessary for xacts, but not for other
SLRUs?

Thank you for working on this important project!


Best regards, Andrey Borodin.




Re: MultiXact\SLRU buffers configuration

2022-12-20 Thread Andrey Borodin
On Sat, Jul 23, 2022 at 1:48 AM Thomas Munro  wrote:
>
> On Sat, Jul 23, 2022 at 8:41 PM Andrey Borodin  wrote:
> > Thomas, do you still have any doubts? Or is it certain that SLRU will be 
> > replaced by any better subsystem in 16?
>
> Hi Andrey,
>
> Sorry for my lack of replies on this and the other SLRU thread -- I'm
> thinking and experimenting.  More soon.
>

Hi Thomas,

PostgreSQL 16 feature freeze is approaching again. Let's choose
something from possible solutions, even if the chosen one is
temporary.
Thank you!

Best regards, Andrey Borodin.




Re: GROUP BY ALL

2022-12-18 Thread Andrey Borodin
On Sun, Dec 18, 2022 at 8:30 PM Tom Lane  wrote:
>
> I'm not especially on board with "ALL" meaning "ALL (oh, but not
> aggregates)".

Yes, that's the weak part of the proposal. I even thought about
renaming it to "GROUP BY SOMEHOW" or even "GROUP BY SURPRISE ME".
I mean, I see some cases where it's useful and far fewer cases where it's
dangerously ambiguous. E.g. grouping by the result of a subquery looks way
too complex and unpredictable. But with simple Vars... what could go
wrong?

Best regards, Andrey Borodin.




GROUP BY ALL

2022-12-18 Thread Andrey Borodin
Hi hackers!

I saw a thread on a social network [0] about GROUP BY ALL. The idea seems useful.
I was always writing something like
select datname, usename, count(*) from pg_stat_activity group by 1,2;
and then rewriting it to
select datname, usename, query, count(*) from pg_stat_activity group by 1,2;
and then "aaa, add a number at the end".

With the proposed feature I can write just
select datname, usename, count(*) from pg_stat_activity group by all;

PFA a very dummy implementation, just for discussion. I think we can
add all non-aggregated targets.

What do you think?


Best regards, Andrey Borodin.

[0] 
https://www.linkedin.com/posts/mosha_duckdb-firebolt-snowflake-activity-7009615821006131200-VQ0o/


v1-0001-Implement-GROUP-BY-ALL.patch
Description: Binary data


Re: Transaction timeout

2022-12-18 Thread Andrey Borodin
On Wed, Dec 7, 2022 at 1:30 PM Andrey Borodin  wrote:
> I hope to address other feedback on the weekend.

Andres, here's my progress on working with your review notes.

> > @@ -3277,6 +3282,7 @@ ProcessInterrupts(void)
> >*/
> >   lock_timeout_occurred = get_timeout_indicator(LOCK_TIMEOUT, 
> > true);
> >   stmt_timeout_occurred = 
> > get_timeout_indicator(STATEMENT_TIMEOUT, true);
> > + tx_timeout_occurred = 
> > get_timeout_indicator(TRANSACTION_TIMEOUT, true);
> >
> >   /*
> >* If both were set, we want to report whichever timeout 
> > completed
>
> This doesn't update the preceding comment, btw, which now reads oddly:

I've rewritten this part to correctly report all timeouts that did
happen. However, there's now some tricky comma-formatting code that was
tested only manually.

> > > > @@ -1360,6 +1363,16 @@ IdleInTransactionSessionTimeoutHandler(void)
> > > >   SetLatch(MyLatch);
> > > >  }
> > > >
> > > > +static void
> > > > +TransactionTimeoutHandler(void)
> > > > +{
> > > > +#ifdef HAVE_SETSID
> > > > + /* try to signal whole process group */
> > > > + kill(-MyProcPid, SIGINT);
> > > > +#endif
> > > > + kill(MyProcPid, SIGINT);
> > > > +}
> > > > +
> > >
> > > Why does this use signals instead of just setting the latch like
> > > IdleInTransactionSessionTimeoutHandler() etc?
> >
> > I just copied statement_timeout behaviour. As I understand this
> > implementation is prefered if the timeout can catch the backend
> > running at full steam.
>
> Hm. I'm not particularly convinced by that code. Be that as it may, I
> don't think it's a good idea to have one more copy of this code. At
> least the patch should wrap the signalling code in a helper.

Done, now there is a single handler, CancelOnTimeoutHandler().

> > > > diff --git a/src/bin/pg_dump/pg_backup_archiver.c 
> > > > b/src/bin/pg_dump/pg_backup_archiver.c
> > > > index 0081873a72..5229fe3555 100644
> > > > --- a/src/bin/pg_dump/pg_backup_archiver.c
> > > > +++ b/src/bin/pg_dump/pg_backup_archiver.c
> > > > @@ -3089,6 +3089,7 @@ _doSetFixedOutputState(ArchiveHandle *AH)
> > > >   ahprintf(AH, "SET statement_timeout = 0;\n");
> > > >   ahprintf(AH, "SET lock_timeout = 0;\n");
> > > >   ahprintf(AH, "SET idle_in_transaction_session_timeout = 0;\n");
> > > > + ahprintf(AH, "SET transaction_timeout = 0;\n");
> > >
> > > Hm - why is that the right thing to do?
> > Because transaction_timeout has effects of statement_timeout.
>
> I guess it's just following precedent - but it seems a bit presumptuous
> to just disable safety settings a DBA might have set up. That makes some
> sense for e.g. idle_in_transaction_session_timeout, because I think
> e.g. parallel backup can lead to a connection being idle for a bit.

I do not know. My reasoning is: everywhere we turn off
statement_timeout, we should turn off transaction_timeout too.
But I have no strong opinion here. I left this code as is in the patch
so far. For the same reason I did not change anything in
pg_backup_archiver.c.

> > Either way we can do batch function enable_timeouts() instead
> > enable_timeout_after().
>
> > Does anything of it make sense?
>
> I'm at least as worried about the various calls *after* the execution of
> a statement.

I think this code boils down to a one-bit check,
if (get_timeout_active(TRANSACTION_TIMEOUT)),
inside get_timeout_active(). With all 14 timeouts we have, I don't
see a good way to optimize this further so far.

> > + if (tx_timeout_occurred)
> > + {
> > + LockErrorCleanup();
> > + ereport(ERROR,
> > + (errcode(ERRCODE_TRANSACTION_TIMEOUT),
> > +  errmsg("canceling transaction due to 
> > transaction timeout")));
> > + }
>
> The number of calls to LockErrorCleanup() here feels wrong - there's
> already 8 calls in ProcessInterrupts(). Besides the code duplication I
> also think it's not a sane idea to rely on having LockErrorCleanup()
> before all the relevant ereport(ERROR)s.

I've refactored that code down to 7 calls of LockErrorCleanup() :)
The logic behind the various branches is not clear to me, e.g. why we do not
LockErrorCleanup() when reading commands f

Re: Transaction timeout

2022-12-07 Thread Andrey Borodin
On Wed, Dec 7, 2022 at 10:23 AM Andres Freund  wrote:
>
> On 2022-12-03 09:41:04 -0800, Andrey Borodin wrote:
> > Fixed. Added test for this.
>
> The tests don't pass: https://cirrus-ci.com/build/4811553145356288
>
Oops, sorry. Here's the fix. I hope to address the other feedback over the
weekend. Thank you!

Best regards, Andrey Borodin.


v3-0001-Intorduce-transaction_timeout.patch
Description: Binary data


Re: Transaction timeout

2022-12-05 Thread Andrey Borodin
Thanks for looking into this Andres!

On Mon, Dec 5, 2022 at 3:07 PM Andres Freund  wrote:
>
> I'm a bit worried about adding evermore branches and function calls for
> the processing of single statements. We already spend a noticable
> percentage of the cycles for a single statement in PostgresMain(), this
> adds additional overhead.
>
> I'm somewhat inclined to think that we need some redesign here before we
> add more overhead.
>
We can cap statement_timeout/idle_session_timeout by the remaining
budget of transaction_timeout.
Either way, we can use the batch function enable_timeouts() instead of
enable_timeout_after().
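To illustrate the capping (just a sketch; tx_deadline is a hypothetical
TimestampTz remembered at transaction start, the rest is the existing
timeout.c API):

    /* cap statement_timeout by the remaining transaction_timeout budget */
    int budget_ms = (int) ((tx_deadline - GetCurrentTimestamp()) / 1000);

    if (budget_ms > 0)
        enable_timeout_after(STATEMENT_TIMEOUT,
                             (StatementTimeout > 0 && StatementTimeout < budget_ms)
                             ? StatementTimeout : budget_ms);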

Does anything of it make sense?

>
> > @@ -1360,6 +1363,16 @@ IdleInTransactionSessionTimeoutHandler(void)
> >   SetLatch(MyLatch);
> >  }
> >
> > +static void
> > +TransactionTimeoutHandler(void)
> > +{
> > +#ifdef HAVE_SETSID
> > + /* try to signal whole process group */
> > + kill(-MyProcPid, SIGINT);
> > +#endif
> > + kill(MyProcPid, SIGINT);
> > +}
> > +
>
> Why does this use signals instead of just setting the latch like
> IdleInTransactionSessionTimeoutHandler() etc?

I just copied the statement_timeout behaviour. As I understand it, this
implementation is preferred because the timeout can catch a backend
running at full steam.

> > diff --git a/src/bin/pg_dump/pg_backup_archiver.c 
> > b/src/bin/pg_dump/pg_backup_archiver.c
> > index 0081873a72..5229fe3555 100644
> > --- a/src/bin/pg_dump/pg_backup_archiver.c
> > +++ b/src/bin/pg_dump/pg_backup_archiver.c
> > @@ -3089,6 +3089,7 @@ _doSetFixedOutputState(ArchiveHandle *AH)
> >   ahprintf(AH, "SET statement_timeout = 0;\n");
> >   ahprintf(AH, "SET lock_timeout = 0;\n");
> >   ahprintf(AH, "SET idle_in_transaction_session_timeout = 0;\n");
> > + ahprintf(AH, "SET transaction_timeout = 0;\n");
>
> Hm - why is that the right thing to do?
Because transaction_timeout includes the effects of statement_timeout.

Thank you!

Best regards, Andrey Borodin.




Re: Transaction timeout

2022-12-03 Thread Andrey Borodin
On Fri, Dec 2, 2022 at 10:59 PM Nikolay Samokhvalov
 wrote:
>
> But it fails in the "worst" case I've described above – a series of small 
> statements:

Fixed. Added a test for this.

Open questions:
1. Docs
2. The order of reporting when lock_timeout, statement_timeout, and
transaction_timeout fire simultaneously. Currently there's a lot of code
around this...

Thanks!

Best regards, Andrey Borodin.


v2-0001-Intorduce-transaction_timeout.patch
Description: Binary data


Transaction timeout

2022-12-02 Thread Andrey Borodin
Hello,

We have statement_timeout, idle_in_transaction_session_timeout,
idle_session_timeout and many more! But we have no
transaction_timeout. I've skimmed the threads [0,1] about existing timeouts
and found no contraindications to implementing transaction_timeout.

Nikolay asked me if I could prototype the feature for him to test,
and it seems straightforward. Please find it attached. If it's not known
to be a bad idea - we'll work on it.

Thanks!

Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/763A0689-F189-459E-946F-F0EC4458980B%40hotmail.com


v1-0001-Intorduce-transaction_timeout.patch
Description: Binary data


Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-11-29 Thread Andrey Borodin
On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian  wrote:
>
> On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
> > 2. Process proc die immediately when a backend is waiting for sync
> > replication acknowledgement, as it does today, however, upon restart,
> > don't open up for business (don't accept ready-only connections)
> > unless the sync standbys have caught up.
> >
> >
> > Are you planning to block connections or queries to the database? It would 
> > be
> > good to allow connections and let them query the monitoring views but block 
> > the
> > queries until sync standby have caught up. Otherwise, this leaves a 
> > monitoring
> > hole. In cloud, I presume superusers are allowed to connect and monitor (end
> > customers are not the role members and can't query the data). The same 
> > can't be
> > true for all the installations. Could you please add more details on your
> > approach?
>
> I think ALTER SYSTEM should be allowed, particularly so you can modify
> synchronous_standby_names, no?

We don't allow SQL access during crash recovery until it has caught up
to the consistency point. And that's for a reason - the cluster may have
an invalid system catalog.
So no, after a crash without a quorum of standbys you can only change
auto.conf and send SIGHUP. Accessing the system catalog during crash
recovery is another, unrelated problem.

But I'd propose to treat these two points differently; they possess
drastically different scales of danger. Query cancels are issued here
and there during failovers/switchovers. A crash amidst a network
partition is not that common.

Best regards, Andrey Borodin.




Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-11-28 Thread Andrey Borodin
On Mon, Nov 28, 2022 at 12:59 PM Bruce Momjian  wrote:
>
> You can prepare a patch, but it unlikely to get much interest until you
> get agreement on what the behavior should be.

We discussed the approach at the 2020 Unconference [0], and there kind
of was an agreement.
Then I made a presentation at FOSDEM with all the details [1].
The patch has been in the commitfest since 2019 [2]. There were reviewers
on the CF entry, and we kind of had an agreement.
Jeff Davis proposed a similar patch [3], and we certainly agreed about cancels.
And now Bharath is proposing the same.

We have the interest and agreement.


Best regards, Andrey Borodin.

[0] 
https://wiki.postgresql.org/wiki/PgCon_2020_Developer_Unconference/Edge_cases_of_synchronous_replication_in_HA_solutions
[1] 
https://archive.fosdem.org/2021/schedule/event/postgresql_caveats_of_replication/attachments/slides/4365/export/events/attachments/postgresql_caveats_of_replication/slides/4365/sides.pdf
[2] https://commitfest.postgresql.org/26/2402/
[3] 
https://www.postgresql.org/message-id/flat/6a052e81060824a8286148b1165bafedbd7c86cd.camel%40j-davis.com#415dc2f7d41b8a251b419256407bb64d




Re: Amcheck verification of GiST and GIN

2022-11-27 Thread Andrey Borodin
On Sun, Nov 27, 2022 at 1:29 PM Andrey Borodin  wrote:
>
> GiST verification checks only one invariant that can be verified if
> page locks acquired the same way as page split does.
> GIN does not require ShareLock because it does not check cross-level 
> invariants.
>

I was wrong. The GIN check does a similar gin_refind_parent() to lock pages
in a bottom-up manner and truly verify the downlink-child_page invariant.

Here's v17. The only difference is that I added progress reporting to
the GiST verification.
I still did not implement heapallindexed for GIN. The existence of pending
lists makes this just too difficult for a weekend coding project :(

Thank you!

Best regards, Andrey Borodin.


v17-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


v17-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


v17-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


Re: Amcheck verification of GiST and GIN

2022-11-27 Thread Andrey Borodin
Hello!

Thank you for the review!

On Thu, Nov 24, 2022 at 6:04 PM Jose Arthur Benetasso Villanova
 wrote:
>
> It compiled with those 2 warnings:
>
> verify_gin.c: In function 'gin_check_parent_keys_consistency':
> verify_gin.c:481:38: warning: declaration of 'maxoff' shadows a previous
> local [-Wshadow=compatible-local]
>481 | OffsetNumber maxoff =
> PageGetMaxOffsetNumber(page);
>|  ^~
> verify_gin.c:453:41: note: shadowed declaration is here
>453 | maxoff;
>| ^~
> verify_gin.c:423:25: warning: unused variable 'heapallindexed'
> [-Wunused-variable]

Fixed.

>423 | boolheapallindexed = *((bool*)callback_state);
>| ^~
>

This one is still in progress; the heapallindexed check is not implemented yet...


>
> Also, I'm not sure about postgres' headers conventions, inside amcheck.h,
> there is "miscadmin.h" included, and inside verify_gin.c, verify_gist.h
> and verify_nbtree.c both amcheck.h and miscadmin.h are included.
Fixed.

>
> About the documentation, the bt_index_parent_check has comments about the
> ShareLock and "SET client_min_messages = DEBUG1;", and both
> gist_index_parent_check and gin_index_parent_check lack it. verify_gin
> uses DEBUG3, I'm not sure if it is on purpose, but it would be nice to
> document it or put DEBUG1 to be consistent.
GiST and GIN verifications do not take a ShareLock for parent checks.
The B-tree check cannot verify cross-level invariants when
the index is changing.

The GiST verification checks only one invariant, which can be verified if
page locks are acquired the same way a page split acquires them.
GIN does not require a ShareLock because it does not check cross-level invariants.

Reporting progress with DEBUG1 is a good idea; I did not know that
this feature exists. I'll implement something similar in the following
versions.

> I did the following test:

Cool! Thank you!

>
> There are more code paths to follow to check the entire code, and I had a
> hard time to corrupt the indices. Is there any automated code to corrupt
> index to test such code?
>

The heapam tests do this in an automated way; look into the file
t/001_verify_heapam.pl.
Surely we can write such tests, at least automating what you have just
done in the review. However, committing similar checks is very
tedious work: something will inevitably turn the buildfarm red as a
watermelon.

I hope to post a version with DEBUG1 reporting and heapallindexed soon.
PFA the current state.
Thank you for looking into this!

Best regards, Andrey Borodin.


v16-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


v16-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


v16-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-11-27 Thread Andrey Borodin
On Tue, Nov 8, 2022 at 9:06 PM Andrey Borodin  wrote:
>
> On Thu, Sep 29, 2022 at 3:53 PM Bruce Momjian  wrote:
> >
> > So, what happens when an insufficient number of synchronous replicas
> > reply?
>
> It's a failover.
>
> > Sessions hang because the synchronous behavior cannot be
> > guaranteed.  We then _allow_ query cancel so the user or administrator
> > can get out of the hung sessions and perhaps modify
> > synchronous_standby_names.
>
> Administrators should not modify synchronous_standby_names.
> The administrator must shoot this node in the head.
>

Some funny stuff: if a user tries to cancel a non-replicated transaction,
Azure Postgres will answer: "user requested cancel while waiting for
synchronous replication ack. The COMMIT record has already flushed to
WAL locally and might not have been replicatead to the standby. We
must wait here."
AWS RDS will answer: "ignoring request to cancel wait for synchronous
replication"
Yandex Managed Postgres will answer: "canceling wait for synchronous
replication due requested, but cancelation is not allowed. The
transaction has already committed locally and might not have been
replicated to the standby. We must wait here."

So, for many services providing Postgres as a service it's only a
matter of wording.

Best regards, Andrey Borodin.




Re: Use fadvise in wal replay

2022-11-27 Thread Andrey Borodin
On Fri, Nov 25, 2022 at 1:12 PM Pavel Borisov  wrote:
>
> As I've written up in the thread we can not gain much from this
> optimization. The results of Jakub shows around 2% difference:
>
> >baseline, master, default Linux readahead (128kb):
> >33.979, 0.478
> >35.137, 0.504
> >34.649, 0.518>
>
> >master+patched, readahead disabled:
> >34.338, 0.528
> >34.568, 0.575
> >34.007, 1.136
>
> >master+patched, readahead enabled (as default):
> >33.935, 0.523
> >34.109, 0.501
> >33.408, 0.557
>

The performance benefit shows up only when readahead is disabled. And
on many workloads readahead brings unneeded data into the page cache, so
disabling it is the preferred configuration.
In this particular case, the time to apply WAL decreases from 53s to 33s.

Thanks!

Best Regards, Andrey Borodin.




Re: pglz compression performance, take two

2022-11-27 Thread Andrey Borodin
Hi Tomas,

On Sun, Nov 27, 2022 at 8:02 AM Tomas Vondra
 wrote:
>
> 1) For PGLZ_HISTORY_SIZE it uses literal 0x0fff, with the explanation:
>
> /* to avoid compare in iteration */
>
> which I think means intent to use this value as a bit mask, but then the
> only place using PGLZ_HISTORY_SIZE does
>
> if (hist_next == PGLZ_HISTORY_SIZE) ...
>
> i.e. a comparison. Maybe I misunderstand the comment, though.
>

As far as I recollect, it's a leftover from an attempt to optimize the
code into a branchless version, i.e. instead of

    if (hist_next >= PGLZ_HISTORY_SIZE)
        hist_next = 1;

use something like

    hist_next = hist_next & PGLZ_HISTORY_SIZE;

But the optimization did not show any measurable impact and was
improperly rolled back.
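For illustration, the generic branchless trick is (a sketch, assuming the
buffer size is a power of two and slot 0 is usable):

    unsigned int mask = 4096 - 1;   /* e.g. 0x0fff */

    idx = (idx + 1) & mask;         /* instead of: if (++idx >= size) idx = 0; */

In pglz it is complicated by the fact that entry 0 is reserved as the
invalid entry (IIRC), so the wraparound must go to 1, not 0.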

>
> 2) PGLZ_HistEntry was modified and replaces links (pointers) with
> indexes, but the comments still talk about "links", so maybe that needs
> to be changed.

The offsets still form a "linked list"... however I removed some
mentions of pointers, since they are not pointers anymore.

> Also, I wonder why next_id is int16 while hist_idx is
> uint16 (and also why id vs. idx)?

+1. I'd call them next and hash.

int16 next; /* instead of next_id */
uint16 hash; /* instead of hist_idx */

What do you think?
hist_idx comes from the function name... I'm not sure how far renaming
should go here.


>
> 3) minor formatting of comments
>
> 4) the comment in pglz_find_match about traversing the history seems too
> early - it's before handling invalid entries and cleanup, but it does
> not talk about that at all, and the actual while loop is after that.

Yes, this seems right to me.

PFA review fixes (step 1 is unchanged).
I did not include the next_id -> next and hist_idx -> hash renames.

Thank you!


Best regards, Andrey Borodin.


v7-0002-Review-fixes.patch
Description: Binary data


v7-0001-Reorganize-pglz-compression-code.patch
Description: Binary data


Re: libpq compression (part 2)

2022-11-17 Thread Andrey Borodin
On Thu, Nov 17, 2022 at 7:09 AM Peter Eisentraut
 wrote:
>
> Note that the above code was just changed in dce92e59b1.
Thanks!

> I don't know
> how that affects this patch set.
With dce92e59b1 it would be much easier to find a bug in the compression patch.

Some more notes about the patch. (sorry for posting review notes in so
many different messages)

1. zs_is_valid_impl_id(unsigned int id)
   {
       return id >= 0 && id < lengthof(zs_algorithms);
   }

id is unsigned, no need to check it's non-negative.
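The check can simply become:

    return id < lengthof(zs_algorithms);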

2. This literal
{no_compression_name}
should be replaced by the explicit form
{no_compression_name, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL}

3. Comments like "Return true if should, false if should not." are useless.

Thanks!

Best regards, Andrey Borodin.




Re: libpq compression (part 2)

2022-11-16 Thread Andrey Borodin
On Tue, Nov 15, 2022 at 7:17 PM Justin Pryzby  wrote:
>

Also, I've found one more TODO item for the patch. Currently in
fe-connect.c the patch is doing a buffer reset:

    conn->inStart = conn->inCursor = conn->inEnd = 0;

This effectively consumes bytes up to the current cursor. However,
packet inspection continues. The patch works because in most cases the
following code will issue a re-read of the message. Coincidentally.

    /* Get the type of request. */
    if (pqGetInt((int *) &areq, 4, conn))
    {
        /* We'll come back when there are more data */
        return PGRES_POLLING_READING;
    }

But I think we need a proper "goto keep_going;" to start over from the
beginning of the message.
to start from the beginning of the message.


Thank you!

Best regards, Andrey Borodin.




Re: libpq compression (part 2)

2022-11-14 Thread Andrey Borodin
On Sat, Nov 12, 2022 at 8:04 PM Andrey Borodin  wrote:
>
> While testing patch some more I observe unpleasant segfaults:
>
> #27 0x0b08fda2 in lz4_decompress (d_stream=0x18cf82a0,
> src=0x7feae4fa505d, src_size=92,
> (gdb) select-frame 27
> (gdb) info locals
> ds = 0x18cf82a0
> decPtr = 0x18cf8aec ""
> decBytes = -87
>

I've debugged this, and the problem turned out to be a wrong message
limit for LZ4 compression:
+#define MESSAGE_MAX_BYTES 819200
instead of
+#define MESSAGE_MAX_BYTES 8192

Other codecs can continue the decompression stream using
ZS_DATA_PENDING, while LZ4 cannot do this. I was going to produce a
quickfix for all my LZ4 findings, but it occurred to me that the
patchset needs a heavy rebase. If no one shows up to fix it, I'll do
that on one of the coming weekends. Either way, here's a reproducer for
the coredump:
psql 'compression=lz4' -f boom.sql

Thanks!

Best regards, Andrey Borodin.


boom.sql
Description: Binary data


Re: libpq compression (part 2)

2022-11-12 Thread Andrey Borodin
On Sat, Nov 12, 2022 at 1:47 PM Andrey Borodin  wrote:
>
> I've tried the patch, it works as advertised.

While testing patch some more I observe unpleasant segfaults:

#26 0x7fecafa1e058 in __memcpy_ssse3_back () from target:/lib64/libc.so.6
#27 0x0b08fda2 in lz4_decompress (d_stream=0x18cf82a0,
src=0x7feae4fa505d, src_size=92,
src_processed=0x79f4fdf8, dst=0x18b01f80, dst_size=8192,
dst_processed=0x79f4fe60)
#28 0x0b090624 in zs_read (zs=0x18cdfbf0, src=0x7feae4fa505d,
src_size=92, src_processed=0x79f4fdf8,
dst=0x18b01f80, dst_size=8192, dst_processed=0x79f4fe60)
#29 0x0b08eb8f in zpq_read_compressed_message
(zpq=0x7feae4fa5010, dst=0x18b01f80 "Q", dst_len=8192,
dst_processed=0x79f4fe60)
#30 0x0b08f1a9 in zpq_read (zpq=0x7feae4fa5010,
dst=0x18b01f80, dst_size=8192, noblock=false)

(gdb) select-frame 27
(gdb) info locals
ds = 0x18cf82a0
decPtr = 0x18cf8aec ""
decBytes = -87

This is a buffer overrun caused by decompression. I think the receive
buffer must be twice as big as the send buffer to accommodate such
messages.
Also, this portion of lz4_decompress():
Assert(decBytes > 0);
must actually be a real check raising elog(ERROR), because clients can
intentionally compose CompressedData messages to blow up the server.
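Something along these lines on the server side (a sketch; the error code
choice is mine):

    if (decBytes < 0)
        ereport(ERROR,
                (errcode(ERRCODE_PROTOCOL_VIOLATION),
                 errmsg("could not decompress data: corrupt CompressedData message")));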

Best regards, Andrey Borodin.




Re: Use fadvise in wal replay

2022-11-12 Thread Andrey Borodin
On Sun, Aug 7, 2022 at 9:41 AM Andrey Borodin  wrote:
>

Hi everyone. The patch is 16 lines, looks harmless, and has proven
benefits. I'm moving this to RfC.

Thanks!

Best regards, Andrey Borodin.




Re: libpq compression (part 2)

2022-11-12 Thread Andrey Borodin
On Tue, Aug 2, 2022 at 1:53 PM Jacob Champion  wrote:
>
> and changing the status to "Needs Review"

I've tried the patch; it works as advertised. The code is generally
OK; maybe some functions require comments (because at least their
neighbours have some).
Some linkers complain about zs_is_valid_impl_id() being inline while
being used in other modules.

Some architectural notes:
1. Currently, when the user requests compression from libpq but the
server does not support any of the codecs the client has, the connection
will be rejected by the client. I think this should be configurable akin
to SSL: on, try, off.
2. On the zpq_stream level of abstraction we parse a stream of bytes
to look for individual message headers. I think this is OK, just a
little bit awkward.
3. The CompressionAck message can be completely replaced by a
ParameterStatus message.
4. Instead of sending a separate SetCompressionMethod, CompressedData
could carry its own header with the index of the method used and a flag
to restart the compression context (see the sketch after this list).
5. Portions of pg_stat_network_traffic can be extracted into a separate
patch step to ease the review. And, actually, the scope of this view
is slightly beyond compression anyway.
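For point 4, I imagine something like this (a completely hypothetical
layout, the field names are mine):

    typedef struct
    {
        uint8   method_idx;     /* index into the negotiated codec list */
        uint8   flags;          /* e.g. a "restart compression context" bit */
        /* compressed payload follows */
    } CompressedDataHeader;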

What do you think?

Also, compression is a very cool and awaited feature, hope to see it
committed one day, thank you for working on this!


Best regards, Andrey Borodin.




Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-11-08 Thread Andrey Borodin
On Thu, Sep 29, 2022 at 3:53 PM Bruce Momjian  wrote:
>
> So, what happens when an insufficient number of synchronous replicas
> reply?

It's a failover.

> Sessions hang because the synchronous behavior cannot be
> guaranteed.  We then _allow_ query cancel so the user or administrator
> can get out of the hung sessions and perhaps modify
> synchronous_standby_names.

Administrators should not modify synchronous_standby_names.
The administrator must shoot this node in the head.

> I have always felt this has to be done at the server level, meaning when
> a synchronous_standby_names replica is not responding after a certain
> timeout, the administrator must be notified by calling a shell command
> defined in a GUC and all sessions will ignore the replica.

Standbys are expelled from the wait list according to quorum rules. I'd
propose not to invent more quorum rules involving shell scripts.
The administrator expressed how many standbys can be offline by
setting synchronous_standby_names. They actively asked for hanging
queries in the case of insufficient standbys.

We have reserved administrator connections for the case when all
connection slots are used up by hanging queries.


Best regards, Andrey Borodin.




Re: pg_stat_statements locking

2022-09-12 Thread Andrey Borodin



> On 12 Sep 2022, at 23:01, Tom Lane  wrote:
> 
> Andrey Borodin  writes:
>>> On 12 Sep 2022, at 18:18, Julien Rouhaud  wrote:
>>> That being
>>> said I don't know if adding a timeout would be too expensive for the lwlock
>>> infrastructure.
> 
> I want to object fiercely to loading down LWLock with anything like
> timeouts.  It's supposed to be "lightweight".  If we get away from
> that we're just going to find ourselves needing another lighter-weight
> lock mechanism.

Thanks for clearing this up, Tom. I agree that spreading timeout-based
algorithms around is not a good thing. And when you have a hammer, everything
seems like a nail, so it would be tempting to use timeouts here and there.


> On 12 Sep 2022, at 23:08, Julien Rouhaud  wrote:
> 
> That's what I was thinking, so it looks like a show-stopper for the proposed
> patch.

So, the only option to make things configurable is a switch between
waiting and waitless locks.

The other way is refactoring towards a partitioned hashtable, namely dshash.
But I don't see how partitioned locking can save us from a locking disaster.
The problem is caused by a read of the whole pgss view colliding with a
reset() or GC. Both of these operations touch every partition, so they will
conflict anyway, with the same result: a time-consuming read of each
partition will hold off the exclusive lock wanted by reset(), and the queued
exclusive lock will prevent any further reads from the hashtable.


Best regards, Andrey Borodin.



Re: pg_stat_statements locking

2022-09-12 Thread Andrey Borodin



> On 12 Sep 2022, at 18:18, Julien Rouhaud  wrote:
> 
> That being
> said I don't know if adding a timeout would be too expensive for the lwlock
> infrastructure.

The implementation itself is straightforward, but we need to add three
implementations of waiting for a semaphore with a timeout.
POSIX has sem_timedwait(), Windows' WaitForMultipleObjectsEx() has a timeout
argument, and SysV has semtimedop().
That's what we need in order to add something like LWLockAcquireWithTimeout().
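The POSIX flavor would look roughly like this (a sketch; the LWLock
bookkeeping and EINTR handling are omitted):

    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long) (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L)
    {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    if (sem_timedwait(&sem, &deadline) != 0 && errno == ETIMEDOUT)
        return false;       /* caller would skip the pgss update */
    return true;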

Does adding all this stuff sound like a good tradeoff for a lock-safe
pg_stat_statements? If so, I'll start implementing it.

Best regards, Andrey Borodin.



Re: pg_stat_statements locking

2022-09-12 Thread Andrey Borodin



> On 12 Sep 2022, at 13:40, Julien Rouhaud  wrote:
> 
> I'm not a fan of that patch as it now silently ignores entries if the lwlock
> can't be acquired *immediately*, without any way to avoid that if your
> configuration and/or workload doesn't lead to this problem, or any way to know
> that entries were ignored.

Practically, a workload with this configuration is uncommon. At least I could
not find any reports of such locking.
But theoretically, all the prerequisites of a disaster are very common (a
variety of queries + some QPS of pg_stat_statements view reads + a small
work_mem + an occasional reset() or GC).

Maybe it's only a problem of programs that use pgss. pgwatch is calling pgss
on every DB in the cluster; that's how a check once a minute became several
queries per second.

Personally, I'd prefer to be able to configure a timeout for acquiring the
lock. That timeout would denote the maximum delay that pgss can incur on a
query. But we would need to extend the LWLock API for this.



> On 12 Sep 2022, at 13:40, Julien Rouhaud  wrote:
> 
> I think that the better long term approach is to move pg_stat_statements and
> the query texts to dynamic shared memory. 

BTW, we don't even need dynamic memory. We just need shared memory, probably
pre-allocated.
I agree that pgss must reside in main memory only, never on disk.

But we will still have the possibility of long lock conflicts preventing
queries from completing. And the ability to configure a timeout for the pgss
hooks would be useful anyway.


Best regards, Andrey Borodin.



pg_stat_statements locking

2022-09-11 Thread Andrey Borodin
Hi hackers!

Recently I observed a very peculiar incident.

=== Incident description ===

An ETL database had been operating fine for many months, regularly updated, etc.
The workload was not changing much, but since it was an ETL database, most
queries were different all the time.
On the night of September 7th the database got stuck: no monitoring query
could be executed. The DBAs started to deal with the incident, but there is
not much you can do with a database service when you cannot execute a single
query. According to VM metrics, the VM was writing a lot to disk.
The on-call engineer was summoned. He observed a lot of backends stuck with a
similar backtrace:
#5  LWLockAcquire()
#6  pgss_store()
#7  pgss_post_parse_analyze()
#8  parse_analyze()
#9  pg_analyze_and_rewrite()

After a restart, the problem reproduced within 50 minutes. This time
monitoring queries were operating, which showed that all backends were stuck
on the LWLock pg_stat_statements. It was impossible to disable pgss with SQL,
so the engineer altered auto.conf and restarted the database. This resolved
the incident.

Later I worked on analyzing the incident. Enabling pgss again showed traces
of the problem:
Fri 09 Sep 2022 08:52:31 AM MSK (every 2s)

     usename     |        state        |     wait_event     | cnt
-----------------+---------------------+--------------------+-----
 content         | active              | DataFileRead       |   1
 content         | active              | pg_stat_statements |  42
 content         | idle in transaction | ClientRead         |   2
 pgwatch_monitor | active              | BufFileRead        |   1
 pgwatch_monitor | active              | [null]             |   5
 pgwatch_monitor | active              | pg_stat_statements |  85
 postgres        | active              | [null]             |   1
 repl            | active              | WalSenderMain      |   2
 [null]          | active              | [null]             |   2
 [null]          | active              | VacuumDelay        |   7
(10 rows)

pgwatch was querying 60 databases every minute, and each call to
pg_stat_statements() took approximately 3-4 seconds.
The backend holding the grand lock looked like this in pg_stat_activity:

datid| 127782
pid  | 4077900
usename  | pgwatch_monitor
application_name | pgwatch2 - 10.96.17.68
wait_event_type  | IO
wait_event   | BufFileWrite
state| active
backend_xid  | [null]
backend_xmin | 67048029

The contents of the pg_stat_statements view overran work_mem and were
materialized in a tuplestore on disk. This is what caused a lot of disk
writes on a database that was not accepting any user queries.



TLDR: LWLock "pg_stat_statements" disabled all SQL queries.


=== How the problem develops ===
Prerequisite 1. pgwatch is querying pgss often.
Prerequisite 2. pgss becomes so big that the tuplestore is written to disk
while the shared lock is held.
Prerequisite 3. Someone calls reset(), or pgss_store() needs an exclusive
lock.

Consequence. The exclusive lock queues behind the long-held shared lock and
prevents any new shared locks from being taken.
Result. Any query calling the pgss hooks hangs.


=== Reproduction for development purposes ===
0. Set up a database with pg_stat_statements.track = all.
1. Modify pg_stat_statements_internal() to wait for a long time under 
LWLockAcquire(pgss->lock, LW_SHARED).
2. select * from pg_stat_statements()
3. select pg_stat_statements_reset()

Now the database is bonked. Any query will hang until pg_stat_statements() 
finishes.


=== How to fix ===
pgss uses an LWLock to protect its hashtable.
When the hashtable is reset or a new user query is inserted in pgss_store(),
an exclusive lock is used.
When stats are updated for a known query or pg_stat_statements is read, a
shared lock is used.

I propose to change the shared lock for stats updates to a conditional shared
lock. It cannot be taken during reset() - and that's totally fine. It cannot
be taken while a new query is being inserted - and I think it's OK to drop
some stats in this case. PFA the patch.
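Roughly, the stats-update path in pgss_store() becomes (a sketch of the
idea, not the exact patch text; key is the pgssHashKey filled earlier):

    /* do not wait: if the lock is busy, skip this stats update */
    if (!LWLockConditionalAcquire(pgss->lock, LW_SHARED))
        return;

    entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);
    if (entry)
    {
        /* ... update the entry's counters under its spinlock ... */
    }

    LWLockRelease(pgss->lock);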

This patch prevents the disaster incurred by the locking described here.


=== Other possible mitigation ===
The incident would not have occurred if the tuplestore had not spilled to an
on-disk representation. But I don't see how to reliably protect against this.

What do you think?


Thanks!


Best regards, Andrey Borodin.


v1-0001-Demonstrate-and-fix-lock-of-all-SQL-queries-by-pg.patch
Description: Binary data


Re: Support for Rust

2022-09-10 Thread Andrey Borodin
Hi!

> On 10 Sep 2022, at 07:38, Lev Kokotov  wrote:
> 
> Are there any plans or thoughts about adding support for other languages than 
> C into Postgres, namely Rust? I would love to hack on some features but I 
> worry somewhat that the C compiler won't give me enough hints that I'm doing 
> something wrong, and the Rust compiler has been excellent at preventing bugs.

You can write Postgres extensions in Rust. And Postgres extensions are really
powerful. What kind of features are you interested in?

Undoubtedly, attracting Rust folks to contribute to Postgres could be a good
thing.
Yet some very simple questions arise.
1. Is Rust compatible with the Memory Contexts and shared memory constructs of
Postgres? With elog error reporting, PG_TRY() and its friends?
2. Does Rust support the same set of platforms as Postgres? A quick glance at
the Build Farm gives an impression of what Postgres supports [0].
3. Do we gain anything besides compiler hints? Postgres development is hard
due to the interference of complex subsystems. It will be even harder if those
subsystems are implemented in different languages.

Probably the answers to all these questions are obvious to Rust pros. I just
think this can be of interest to someone new to Rust (like me).


Best regards, Andrey Borodin.

[0] https://buildfarm.postgresql.org/cgi-bin/show_members.pl



Re: Extensible storage manager API - smgr hooks

2022-08-25 Thread Andrey Borodin



> On 16 Jun 2022, at 13:41, Kirill Reshke  wrote:
> 
> Hello Yura and Anastasia.

FWIW this technology is now a part of Greenplum [0]. We are building a GP
extension that automatically offloads cold data to S3 - a very simplified
version of Neon for analytical workloads.
When a segment of a table is not used for a long period of time, the
extension syncs its files with backup storage in the cloud.
When the user touches the data, the extension's smgr brings the table
segments back from the backup or the latest synced version.

Our #1 goal is to provide a tool useful for the community. We can easily
provide the same extension for Postgres if this technology (extensible smgr)
is in core. Does such an extension seem useful for Postgres? Or does this
data access pattern seem unusual for Postgres? By pattern I mean vast amounts
of cold data, only ever appended and never touched.


Best regards, Andrey Borodin.

[0] https://github.com/greenplum-db/gpdb/pull/13601



Re: pg_stat_wal: tracking the compression effect

2022-08-25 Thread Andrey Borodin



> On 25 Aug 2022, at 12:04, Ken Kato  wrote:
> 
> What do you think?

I think users will need to choose between LZ4 and Zstd. So they need to know
the tradeoff - compression ratio vs. CPU time spent per page (or any other
segment).

I know that Zstd is supposed to be somewhat "better", but I doubt it has
enough runway on one block to show off. If only we could persist the
compression context across many pages...
The compression ratio may differ across workloads, so a system view or
something similar could be of use.

Thanks!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2022-08-17 Thread Andrey Borodin



> On 17 Aug 2022, at 00:36, i.laza...@postgrespro.ru wrote:
> 
>> Andrey Borodin wrote 2022-07-23 11:39:
>> Yura, thank you for your benchmarks!
>> We already knew that patch can save the day on pathological workloads,
>> now we have a proof of this.
>> Also there's the evidence that user can blindly increase size of SLRU
>> if they want (with the second patch). So there's no need for hard
>> explanations on how to tune the buffers size.
> 
> Hi @Andrey.Borodin, With some considerations and performance checks from 
> @Yura.Sokolov we simplified your approach by the following:
> 
> 1. Preamble. We feel free to increase any SLRU's, since there's no 
> performance degradation on large Buffers count using your SLRU buckets 
> solution.
> 2. `slru_buffers_size_scale` is only one config param introduced for all 
> SLRUs. It scales SLRUs upper cap by power 2.
> 3. All SLRU buffers count are capped by both `MBuffers (shared_buffers)` and 
> `slru_buffers_size_scale`. see
> 4. Magic initial constants `NUM_*_BUFFERS << slru_buffers_size_scale` are 
> applied for every SLRU.
> 5. All SLRU buffers are always sized as power of 2, their hash bucket size is 
> always 8.
> 
> There's attached patch for your consideration. It does gather and simplify 
> both `v21-0001-Make-all-SLRU-buffer-sizes-configurable.patch` and 
> `v21-0002-Divide-SLRU-buffers-into-8-associative-banks.patch` to much simpler 
> approach.

I like the idea of one knob instead of one per SLRU. Maybe we could even
deduce a sane value from NBuffers? That would effectively lead to 0 knobs :)

Your patch has the prefix "v22-0006"; does that mean there are 5 previous
steps in the patchset?

Thank you!


Best regards, Andrey Borodin.



Re: Amcheck verification of GiST and GIN

2022-08-17 Thread Andrey Borodin


> On 23 Jul 2022, at 14:40, Andrey Borodin  wrote:
> 
> Done. PFA attached patchset.
> 
> Best regards, Andrey Borodin.
> 

Here's v13. Changes:
1. Fixed passing through a downlink in the GIN index
2. Fixed GIN tests (one test case was not working)

Thanks to Vitaliy Kukharik for trying these patches.

Best regards, Andrey Borodin.



v13-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


v13-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


v13-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


Re: Logical WAL sender unresponsive during decoding commit

2022-08-16 Thread Andrey Borodin



> On 16 Aug 2022, at 10:25, Masahiko Sawada  wrote:
> 
> The same issue is recently reported[1] on -bugs
Oh, I missed that thread.

> and I proposed the
> patch that adds CHECK_FOR_INTERRUPTS() to the loop in
> ReorderBufferProcessTXN().
I agree that it's a good place for the check.

> I think it should be backpatched.
Yes, I think so too.

> [1] 
> https://www.postgresql.org/message-id/CAD21AoD%2BaNfLje%2B9JOqWbTiq1GL4BOp9_f7FxLADm8rS8cDhCQ%40mail.gmail.com

The patch in this thread looks good to me.


Thank you!

Best regards, Andrey Borodin.



Logical WAL sender unresponsive during decoding commit

2022-08-15 Thread Andrey Borodin
Hi hackers!

Some time ago I saw a hanging logical replication walsender that was trying
to send a transaction commit after the table had been processed by pg_repack.
I understand that those things do not mix well. Yet the walsender was
ignoring pg_terminate_backend(), and I think this is worth fixing.
Can we add CHECK_FOR_INTERRUPTS() somewhere in this backtrace? The full
session is attached as a file.

#0  pfree (pointer=0x561850bbee40) at 
./build/../src/backend/utils/mmgr/mcxt.c:1032
#1  0x5617712530d6 in ReorderBufferReturnTupleBuf (tuple=, 
rb=) at 
./build/../src/backend/replication/logical/reorderbuffer.c:469
#2  ReorderBufferReturnChange (rb=, change=0x561772456048) at 
./build/../src/backend/replication/logical/reorderbuffer.c:398
#3  0x561771253da1 in ReorderBufferRestoreChanges 
(rb=rb@entry=0x561771c14e10, txn=0x561771c0b078, 
file=file@entry=0x561771c15168, segno=segno@entry=0x561771c15178) at 
./build/../src/backend/replication/logical/reorderbuffer.c:2570
#4  0x5617712553ba in ReorderBufferIterTXNNext (state=0x561771c15130, 
rb=0x561771c14e10) at 
./build/../src/backend/replication/logical/reorderbuffer.c:1146
#5  ReorderBufferCommit (rb=0x561771c14e10, xid=xid@entry=2976347782, 
commit_lsn=79160378448744, end_lsn=, 
commit_time=commit_time@entry=686095734290578, origin_id=origin_id@entry=0, 
origin_lsn=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1523
#6  0x56177124a30a in DecodeCommit (xid=2976347782, parsed=0x7ffc3cb4c240, 
buf=0x7ffc3cb4c400, ctx=0x561771b10850) at 
./build/../src/backend/replication/logical/decode.c:640
#7  DecodeXactOp (ctx=0x561771b10850, buf=buf@entry=0x7ffc3cb4c400) at 
./build/../src/backend/replication/logical/decode.c:248
#8  0x56177124a6a9 in LogicalDecodingProcessRecord (ctx=0x561771b10850, 
record=0x561771b10ae8) at 
./build/../src/backend/replication/logical/decode.c:117
#9  0x56177125d1e5 in XLogSendLogical () at 
./build/../src/backend/replication/walsender.c:2893
#10 0x56177125f5f2 in WalSndLoop (send_data=send_data@entry=0x56177125d180 
) at ./build/../src/backend/replication/walsender.c:2242
#11 0x561771260125 in StartLogicalReplication (cmd=) at 
./build/../src/backend/replication/walsender.c:1179
#12 exec_replication_command (cmd_string=cmd_string@entry=0x561771abe590 
"START_REPLICATION SLOT dttsjtaa66crdhbm015h LOGICAL 0/0 ( 
\"include-timestamp\" '1', \"include-types\" '1', \"include-xids\" '1', 
\"write-in-chunks\" '1', \"add-tables\" '/* sanitized 
*/.claim_audit,public.__consu"...) at 
./build/../src/backend/replication/walsender.c:1612
#13 0x5617712b2334 in PostgresMain (argc=, 
argv=argv@entry=0x561771b2a438, dbname=, username=) at ./build/../src/backend/tcop/postgres.c:4267
#14 0x56177123857c in BackendRun (port=0x561771b0d7a0, port=0x561771b0d7a0) 
at ./build/../src/backend/postmaster/postmaster.c:4484
#15 BackendStartup (port=0x561771b0d7a0) at 
./build/../src/backend/postmaster/postmaster.c:4167
#16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#17 0x56177123954b in PostmasterMain (argc=9, argv=0x561771ab70e0) at 
./build/../src/backend/postmaster/postmaster.c:1398
#18 0x561770fae8b6 in main (argc=9, argv=0x561771ab70e0) at 
./build/../src/backend/main/main.c:228
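
The change I have in mind is as simple as this, inside the
change-processing loop (a sketch; see the attached diff for the actual
proposal):

    while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
    {
        CHECK_FOR_INTERRUPTS();     /* lets pg_terminate_backend() work */
        /* ... process the change as before ... */
    }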

What do you think?

Thank you!


Best regards, Andrey Borodin.


check_for_interrupts.diff
Description: Binary data


Time: 0.731 ms
sas-/* sanitized *//postgres M # select * from pg_replication_slots ;
 slot_name                         | plugin   | slot_type | datoid | database        | temporary | active | active_pid | xmin       | catalog_xmin | restart_lsn   | confirmed_flush_lsn
-----------------------------------+----------+-----------+--------+-----------------+-----------+--------+------------+------------+--------------+---------------+---------------------
 /* sanitized */                   | [null]   | physical  | [null] | [null]          | f         | t      |    1719461 | 2980736032 | [null]       | 4818/908C9B98 | [null]
 vla_/* sanitized */_db_yandex_net | [null]   | physical  | [null] | [null]          | f         | t      |    1719460 | 2980736032 | [null]       | 4818/908DE000 | [null]
 dttsjtaa66crdhbm015h              | wal2json | logical   |  16441 | /* sanitized */ | f         | t      |    3697390 | [null]     |   2976347004 | 47F6/3FFE5DA0 | 47FE/F588E800
(3 rows)

Time: 0.454 ms
sas-/* sanitized *//postgres M # select pg_terminate_backend(3697390);
 pg_terminate_backend 
--
 t
(1 row)

Time: 0.189 ms
sas-/* sanitized *//postgres M # select * from pg_replication_slots ;

Re: something has gone wrong, but what is it?

2022-08-10 Thread Andrey Borodin



> On 10 Aug 2022, at 19:49, Robert Haas  wrote:
> 
> After a bit of further looking around I noticed that there's another
> check for an invalid auxtype in this function which uses a slightly
> different message text and also PANIC rather than ERROR.

Is there a reason to do
MyBackendType = B_INVALID;
after PANIC or ERROR?

Best regards, Andrey Borodin.




Re: Use fadvise in wal replay

2022-08-07 Thread Andrey Borodin



> On 7 Aug 2022, at 06:39, Bharath Rupireddy 
>  wrote:
> 
> Agree. Why can't we just prefetch the entire WAL file once whenever it
> is opened for the first time? Does the OS have any limitations on max
> size to prefetch at once? It may sound aggressive, but it avoids
> fadvise() system calls, this will be especially useful if there are
> many WAL files to recover (crash, PITR or standby recovery),
> eventually we would want the total WAL file to be prefetched.
> 
> If prefetching the entire WAL file is okay, we could further do this:
> 1) prefetch in XLogFileOpen() and all of segment_open callbacks, 2)
> release in XLogFileClose (it's being dong right now) and all of
> segment_close callbacks - do this perhaps optionally.
> 
> Also, can't we use an existing function FilePrefetch()? That way,
> there is no need for a new wait event type.
> 
> Thoughts?

Thomas expressed this idea upthread. Benchmarks done by Jakub showed that
this approach brought no significant improvement over the existing master
code.
The same benchmarks showed an almost 1.5x improvement from reading ahead in
8KB or 128KB chunks.

Best regards, Andrey Borodin.



Re: Use fadvise in wal replay

2022-08-05 Thread Andrey Borodin
Hi Bharath,

thank you for the suggestion.

> On 5 Aug 2022, at 16:02, Bharath Rupireddy 
>  wrote:
> 
> On Thu, Aug 4, 2022 at 9:48 PM Andrey Borodin  wrote:
>> 
>>> On 18 Jul 2022, at 22:55, Robert Haas  wrote:
>>> 
>>> On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak  
>>> wrote:
> 
> I have a fundamental question on the overall idea - How beneficial it
> will be if the process that's reading the current WAL page only does
> (at least attempts) the prefetching of future WAL pages? Won't the
> benefit be higher if "some" other background process does prefetching?

IMO prefetching from another process would have a negative effect.
The fadvise() call is non-blocking; the startup process won't do any IO. It
just asks the kernel to schedule an asynchronous page read.
On the other hand, synchronization with another process might cost more than
the fadvise() itself.

Anyway, the cost of calling fadvise() once per 16 page reads is negligible.
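The pattern is just this (a sketch; readFile/readOff are the startup
process's current WAL segment fd and offset in XLogPageRead()):

    /* hint the kernel to read the following WAL chunk in the background */
    (void) posix_fadvise(readFile, readOff + XLOG_BLCKSZ,
                         128 * 1024, POSIX_FADV_WILLNEED);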

Thank you!

Best regards, Andrey Borodin.



Re: Use fadvise in wal replay

2022-08-04 Thread Andrey Borodin



> On 18 Jul 2022, at 22:55, Robert Haas  wrote:
> 
> On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak  wrote:
>> Cool. As for GUC I'm afraid there's going to be resistance of adding yet 
>> another GUC (to avoid many knobs). Ideally it would be nice if we had some 
>> advanced/deep/hidden parameters , but there isn't such thing.
>> Maybe another option would be to use (N * maintenance_io_concurrency * 
>> XLOG_BLCKSZ), so N=1 that's 80kB and N=2 160kB (pretty close to default 
>> value, and still can be tweaked by enduser). Let's wait what others say?
> 
> I don't think adding more parameters is a problem intrinsically. A
> good question to ask, though, is how the user is supposed to know what
> value they should configure. If we don't have any idea what value is
> likely to be optimal, odds are users won't either.
We know that 128KB is optimal on a representative configuration and that
changing the value won't really affect performance much. 128KB is marginally
better than 8KB and removes some theoretical concerns about extra syscalls.

> It's not very clear to me that we have any kind of agreement on what
> the basic approach should be here, though.

Actually, the only open question is the offset from the current read
position: should it be one block or a full readahead chunk? Again, this does
not change much; it's just a matter of choice.


Thanks!

Best regards, Andrey Borodin.



Re: Slow standby snapshot

2022-08-02 Thread Andrey Borodin



> On 2 Aug 2022, at 20:18, Simon Riggs  wrote:
> 
> On Tue, 2 Aug 2022 at 12:32, Andrey Borodin  wrote:
> 
>> KnownAssignedXidsRemoveTree() only compress with probability 1/8, but it is 
>> still O(N*N).
> 
> Currently it is O(NlogS), not quite as bad as O(N^2).
Consider a workload where we have a big number of simultaneously active xids.
The number of calls to KnownAssignedXidsRemoveTree() is proportional to the
number of these xids.
And the complexity of KnownAssignedXidsRemoveTree() is also proportional to
the number of these xids, because each call to KnownAssignedXidsRemoveTree()
may run compression anyway (which will not compress much).

Compression is not an answer to the performance problem - it may be a burden
itself. Instead, we can make compression unnecessary for building a
snapshot's xids-in-progress list.


> Since each xid in the tree is always stored to the right, it should be
> possible to make that significantly better by starting each binary
> search from the next element, rather than the start of the array.
> Something like the attached might help, but we can probably make that
> cache conscious to improve things even more.

As Kyotaro-san correctly mentioned, the performance degradation happened in
KnownAssignedXidsGetAndSetXmin(), which does not do a binary search.



> On 3 Aug 2022, at 06:04, Kyotaro Horiguchi  wrote:
> 
> The original complaint is KnownAssignedXidsGetAndSetXmin can get very
> slow for large max_connections. I'm not sure what was happening on the
> KAXidsArray at the time precisely, but if the array starts with a
> large number of invalid entries (I guess it is likely), and the
> variable "start" were available to the function (that is, it were
> placed in procArray), that strategy seems to work for this case. With
> this strategy we can avoid compression if only the relatively narrow
> range in the array is occupied.

This applies to only one workload - where all transactions are very short. If
we have even a tiny fraction of medium or long transactions, this heuristic
does not help anymore.


Thank you!

Best regards, Andrey Borodin. 







Re: Slow standby snapshot

2022-08-02 Thread Andrey Borodin



> On 29 Jul 2022, at 20:08, Simon Riggs  wrote:
> 
> A simple patch like this seems to hit the main concern, aiming to keep
> the array from spreading out and impacting snapshot performance for
> SELECTs, yet not doing it so often that the startup process has a
> higher burden of work.

The idea to compress more often seems viable. But this might make some other
workloads pathological.
Some KnownAssignedXids routines can now become quadratic in the case of lots
of subtransactions.

KnownAssignedXidsRemoveTree() only compresses with probability 1/8, but it is
still O(N*N).

IMO the original patch (with a next pointer) is much safer in terms of
unexpected performance degradation.
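To recap, the "next pointer" idea is roughly this (a sketch;
KnownAssignedXidsNext is the patch's auxiliary array, details may differ):

    int i = tail;

    while (i < head)
    {
        if (KnownAssignedXidsValid[i])
            xarray[count++] = KnownAssignedXids[i];
        i += KnownAssignedXidsNext[i];  /* >= 1; jumps over invalid runs */
    }

So scans skip holes without ever paying for compression.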

Thanks!

Best regards, Andrey Borodin.



Re: [Commitfest 2022-07] Patch Triage: Waiting on Author

2022-07-28 Thread Andrey Borodin



> On 27 Jul 2022, at 00:26, Jacob Champion  wrote:
> 
> - libpq compression
>  https://commitfest.postgresql.org/38/3499/
> 
>  Needs a rebase and response to feedback; mostly quiet since January.

Daniil is working on this, but he's currently on vacation.
I think we should not mark the patch as RwF, but move it to the next CF instead.

Thank you!

Best regards, Andrey Borodin.



Re: Slow standby snapshot

2022-07-26 Thread Andrey Borodin



> On 20 Jul 2022, at 02:12, Michail Nikolaev  wrote:
> 
>> I've looked into v5.
> Thanks!
> 
> Patch is updated accordingly your remarks.

The patch seems Ready for Committer from my POV.

Thanks!

Best regards, Andrey Borodin.


