Re: Use of additional index columns in rows filtering

2023-08-02 Thread Peter Geoghegan
On Wed, Aug 2, 2023 at 6:32 PM Peter Geoghegan  wrote:
> I don't dispute the fact that this can only happen when the planner
> believes (with good reason) that the expected cost will be lower. But
> I maintain that there is a novel risk to be concerned about, which is
> meaningfully distinct from the general risk of regressions that comes
> from making just about any change to the planner. The important
> principle here is that we should have a strong bias in the direction
> of making quals into true "Access Predicates" whenever practical.
>
> Yeah, technically the patch doesn't directly disrupt how existing
> index paths get generated. But it does have the potential to disrupt
> it indirectly, by providing an alternative very-similar index path
> that's likely to outcompete the existing one in these cases. I think
> that we should have just one index path that does everything well
> instead.

You can see this for yourself, quite easily. Start by running the
relevant query from the regression tests, which is:

SELECT * FROM tenk1
WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);

EXPLAIN (ANALYZE, BUFFERS) confirms that the patch makes the query
slightly faster, as expected. I see 7 buffer hits for the bitmap index
scan plan on master, versus only 4 buffer hits for the patch's index
scan. Obviously, this is because we go from multiple index scans
(multiple bitmap index scans) to only one.

But, if I run this insert statement and try the same thing again,
things look very different:

insert into tenk1 (thousand, tenthous) select 42, i from
generate_series(43, 1000) i;

(Bear in mind that we've inserted rows that don't actually need to be
selected by the query in question.)

Now the master branch's plan works in just the same way as before --
it has exactly the same overhead (7 buffer hits). Whereas the patch
still gets the same risky plan -- which now blows up. The plan now
loses by far more than it could ever really hope to win by: 336 buffer
hits. (It could be a lot higher than this, even, but you get the
picture.)
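The asymmetry can be illustrated with a toy model (my own sketch in Python, not PostgreSQL internals): a sorted list stands in for the thousand = 42 portion of the index after the bulk insert, and we count how many index entries must be read when the tenthous quals are true access predicates (one bounded descent per value) versus a single filter predicate applied to everything matching thousand = 42.

```python
import bisect

# Index entries (tenthous values) under thousand = 42 after the bulk
# INSERT: the three originally wanted values plus 43..1000.  The exact
# contents are illustrative.
entries = sorted([1, 3, 42] + list(range(43, 1001)))

def access_predicate_reads(wanted):
    """Each value gets its own bounded descent: only matching entries
    are read, as with the SAOP / bitmap-index-scan approaches."""
    reads = 0
    for v in wanted:
        i = bisect.bisect_left(entries, v)
        while i < len(entries) and entries[i] == v:
            reads += 1
            i += 1
    return reads

def filter_predicate_reads(wanted):
    """thousand = 42 is the only access predicate; every entry under it
    is read, then the tenthous qual discards the non-matches."""
    return len(entries)

print(access_predicate_reads([1, 3, 42]))  # 3
print(filter_predicate_reads([1, 3, 42]))  # 961
```

The access-predicate scan reads the same handful of entries regardless of how many non-matching rows share thousand = 42, while the filter-predicate scan degrades linearly with them, which is the shape of the blowup described above.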

Sure, it's difficult to imagine a very general model that captures
this sort of risk, in the general case. But you don't need a degree in
actuarial science to understand that it's inherently a bad idea to
juggle chainsaws -- no matter what your general risk tolerance happens
to be. Some things you just don't do.

-- 
Peter Geoghegan




Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2023-08-02 Thread Amul Sul
On Wed, Aug 2, 2023 at 9:16 PM jian he  wrote:

> On Wed, Aug 2, 2023 at 6:36 PM Amul Sul  wrote:
> >
> > Hi,
> >
> > Currently, we have an option to drop the expression of stored generated
> > columns as:
> >
> > ALTER [ COLUMN ] column_name DROP EXPRESSION [ IF EXISTS ]
> >
> > But we don't have support for updating that expression. The attached
> > patch provides that as:
> >
> > ALTER [ COLUMN ] column_name SET EXPRESSION expression
> >
> > Note that this form of ALTER is meant to work only for a column that is
> > already generated; otherwise, an error will be reported. It changes the
> > generation expression in the catalog and rewrites the table, using the
> > existing table rewrite facilities for ALTER TABLE.
> >
> > To keep the code flow simple, I have renamed the existing function that
> > was in use for DROP EXPRESSION so that it can be used for SET EXPRESSION
> > as well, which is a similar design to SET/DROP DEFAULT. I kept these
> > renaming changes in a separate patch to minimize the diff in the main
> > patch.
> >
> > Demo:
> > -- Create table
> > CREATE TABLE t1 (x int, y int GENERATED ALWAYS AS (x * 2) STORED);
> > INSERT INTO t1 VALUES(generate_series(1,3));
> >
> > -- Check the generated data
> > SELECT * FROM t1;
> >  x | y
> > ---+---
> >  1 | 2
> >  2 | 4
> >  3 | 6
> > (3 rows)
> >
> > -- Alter the expression
> > ALTER TABLE t1 ALTER COLUMN y SET EXPRESSION (x * 4);
> >
> > -- Check the new data
> > SELECT * FROM t1;
> >  x | y
> > ---+----
> >  1 |  4
> >  2 |  8
> >  3 | 12
> > (3 rows)
> >
> > Thank you.
> > --
> > Regards,
> > Amul Sul
> > EDB: http://www.enterprisedb.com
> Setup:
>
> BEGIN;
> set search_path = test;
> DROP TABLE if exists gtest_parent, gtest_child;
>
> CREATE TABLE gtest_parent (f1 date NOT NULL, f2 bigint, f3 bigint
> GENERATED ALWAYS AS (f2 * 2) STORED) PARTITION BY RANGE (f1);
>
> CREATE TABLE gtest_child PARTITION OF gtest_parent
>   FOR VALUES FROM ('2016-07-01') TO ('2016-08-01');  -- inherits gen expr
>
> CREATE TABLE gtest_child2 PARTITION OF gtest_parent (
>   f3 WITH OPTIONS GENERATED ALWAYS AS (f2 * 22) STORED  -- overrides gen expr
> ) FOR VALUES FROM ('2016-08-01') TO ('2016-09-01');
>
> CREATE TABLE gtest_child3 (f1 date NOT NULL, f2 bigint, f3 bigint
> GENERATED ALWAYS AS (f2 * 33) STORED);
> ALTER TABLE gtest_parent ATTACH PARTITION gtest_child3 FOR VALUES FROM
> ('2016-09-01') TO ('2016-10-01');
>
> INSERT INTO gtest_parent (f1, f2) VALUES ('2016-07-15', 1);
> INSERT INTO gtest_parent (f1, f2) VALUES ('2016-07-15', 2);
> INSERT INTO gtest_parent (f1, f2) VALUES ('2016-08-15', 3);
> UPDATE gtest_parent SET f1 = f1 + 60 WHERE f2 = 1;
>
> ALTER TABLE ONLY gtest_parent ALTER COLUMN f3 SET EXPRESSION (f2 * 4);
> ALTER TABLE gtest_child ALTER COLUMN f3 SET EXPRESSION (f2 * 10);
> COMMIT;
>
> set search_path = test;
> SELECT table_name, column_name, is_generated, generation_expression
> FROM information_schema.columns
> WHERE table_name  in ('gtest_child','gtest_child1',
> 'gtest_child2','gtest_child3')
> order by 1,2;
> result:
>   table_name  | column_name | is_generated | generation_expression
> --+-+--+---
>  gtest_child  | f1  | NEVER|
>  gtest_child  | f1  | NEVER|
>  gtest_child  | f2  | NEVER|
>  gtest_child  | f2  | NEVER|
>  gtest_child  | f3  | ALWAYS   | (f2 * 2)
>  gtest_child  | f3  | ALWAYS   | (f2 * 10)
>  gtest_child2 | f1  | NEVER|
>  gtest_child2 | f1  | NEVER|
>  gtest_child2 | f2  | NEVER|
>  gtest_child2 | f2  | NEVER|
>  gtest_child2 | f3  | ALWAYS   | (f2 * 22)
>  gtest_child2 | f3  | ALWAYS   | (f2 * 2)
>  gtest_child3 | f1  | NEVER|
>  gtest_child3 | f1  | NEVER|
>  gtest_child3 | f2  | NEVER|
>  gtest_child3 | f2  | NEVER|
>  gtest_child3 | f3  | ALWAYS   | (f2 * 2)
>  gtest_child3 | f3  | ALWAYS   | (f2 * 33)
> (18 rows)
>
> one partition, one column, two generated expressions. Is this the
> expected behavior?


That is neither expected nor acceptable. But, somehow, I am not able to
reproduce this behavior. Could you please retry the experiment with
"table_schema" added to your output query?

Thank you.

Regards,
Amul


Re: Extract numeric filed in JSONB more effectively

2023-08-02 Thread Pavel Stehule
Hi

On Thu, Aug 3, 2023 at 2:51 AM Andy Fan  wrote:

> Hi Jian:
>
>
>> return PointerGetDatum(v->val.numeric);
>> should be something like
>> PG_RETURN_NUMERIC(v->val.numeric);
>> ?
>>
>
> Thanks for this reminder; a new patch is attached, and a commitfest
> entry has been added as well [1]. For the record, I compared the new
> operator with all the existing operators.
>
> select 1 from tb where (a->'a')::numeric = 2;   30.56ms
> select 1 from tb where (a->>'a')::numeric = 2; 29.43ms
> select 1 from tb where (a@->'a') = 2;  14.80ms
>
> [1] https://commitfest.postgresql.org/44/4476/
>
>
I don't like this solution, because it bloats the set of operators and is
not very readable. For completeness you would also have to implement casts
for date, int, boolean, and so on. And the same problem exists for the XML
and hstore types (probably for any type that is a container).

It is strange that the cast alone is 2x slower. I don't like the idea that
a special operator is 2x faster than the common casting syntax; it is a
signal that there is room for optimization. Black magic with special
operators is not user friendly for relatively common problems.

Maybe we can introduce some *internal operator* "extract to type", and at
the rewrite stage transform the pattern (x->'field')::type to
OP(x, 'field', typid).
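Pavel's rewrite-stage idea can be sketched with a toy expression tree (all node and operator names here are invented for illustration; they are not PostgreSQL parse-tree types):

```python
from dataclasses import dataclass

@dataclass
class Var:
    name: str

@dataclass
class ArrowOp:          # x -> 'field'
    arg: object
    field: str

@dataclass
class Cast:             # (expr)::typename
    arg: object
    typename: str

@dataclass
class ExtractOp:        # hypothetical fused "extract to type" operator
    arg: object
    field: str
    typename: str

def rewrite(node):
    """Collapse Cast(ArrowOp(x, f), t) into ExtractOp(x, f, t), so the
    field value never has to be materialized as text or as a full jsonb
    datum before being cast."""
    if isinstance(node, Cast) and isinstance(node.arg, ArrowOp):
        return ExtractOp(node.arg.arg, node.arg.field, node.typename)
    return node

expr = Cast(ArrowOp(Var("a"), "a"), "numeric")   # (a->'a')::numeric
print(rewrite(expr))
```

The benefit is that users keep writing the common casting syntax, and only the rewrite stage knows about the fused operation.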

Regards

Pavel


-- 
> Best Regards
> Andy Fan
>


Re: Fix incorrect start up costs for WindowAgg paths (bug #17862)

2023-08-02 Thread David Rowley
On Wed, 31 May 2023 at 12:59, David Rowley  wrote:
>
> On Wed, 12 Apr 2023 at 21:03, David Rowley  wrote:
> > I'll add this to the "Older bugs affecting stable branches" section of
> > the PG 16 open items list
>
> When I wrote the above, it was very soon after the feature freeze for
> PG16. Since we tend not to do cost changes as part of bug fixes, so as
> not to destabilise plans between minor versions, I wondered if we could
> instead just fix it in PG16, given that the freeze had *just* started.
> That's no longer the case, so I'm just going to move this out of the
> PG16 Open Items "Live issues" section and add a July CF entry for it
> instead.

I'm keen to move this patch along.  It's not a particularly interesting
patch and I don't expect much interest in it, but I feel it's pretty
important that the planner not accidentally choose a cheap-startup plan
when a WindowAgg is going to fetch the entire subplan's tuples.

I've made another pass over the patch and made a bunch of cosmetic
changes.  As far as mechanical changes, I only changed the EXCLUDE
TIES and EXCLUDE GROUP behaviour when there is no ORDER BY clause in
the WindowClause. If there's no ORDER BY then subtracting 1.0 rows
seems like the right thing to do rather than what the previous patch
did.

I (temporarily) left the DEBUG1 elog in there if anyone wants to test
for themselves (saves debugger use). In the absence of that, I'm
planning on just pushing it to master only tomorrow.  It seems fairly
low risk and unlikely to attract too much interest since it only
affects startup costs of WindowAgg nodes. I'm currently thinking it's
a bad idea to backpatch this but I'd consider it more if someone else
thought it was a good idea or if more people came along complaining
about poor plan choice in plans containing WindowAggs. Currently, it
seems better not to destabilise plans in the back branches. (CC'd Tim,
who reported #17862, as he may have an opinion on this)

The only thought I had while looking at this again, aside from what I
changed, was whether get_windowclause_startup_tuples() should go in
selfuncs.c.  I wondered if it would be neater to use
convert_numeric_to_scalar() instead of the code I had to add to convert
the (SMALL|BIG)INT Consts in  FOLLOWING to double.
Aside from that reason, it seems we don't have many usages of
DEFAULT_INEQ_SEL outside of selfuncs.c. I didn't feel strongly enough
about this to actually move the function.

The updated patch is attached.

Here are the results of my testing (note the DEBUG message matches the
COUNT(*) result in all cases apart from one case where COUNT(*)
returns 0 and the estimated tuples is 1.0).
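The estimates in the transcript below can be approximated with a deliberately simplified model (my own sketch for ROWS-style frames over these 100-row partitions, not the patch's actual code):

```python
PARTITION_ROWS = 100

def startup_tuples(end_offset=None, unbounded_end=False,
                   exclude_current_row=False, exclude_group=False,
                   has_order_by=True):
    """Rough estimate of tuples a WindowAgg must read before emitting
    its first row, for one frame clause over a 100-row partition."""
    if unbounded_end:
        tuples = PARTITION_ROWS              # must read the whole partition
    elif end_offset is not None:
        tuples = end_offset + 1              # current row plus N following
    else:
        tuples = 1                           # frame ends at the current row
    if exclude_current_row:
        tuples -= 1
    if exclude_group:
        # with ORDER BY the first row's peer group is a single row here;
        # without ORDER BY the whole partition is one peer group
        tuples -= 1 if has_order_by else PARTITION_ROWS
    return max(tuples, 1.0)                  # estimate is clamped at 1.0

print(startup_tuples(end_offset=10))         # 11
print(startup_tuples(unbounded_end=True))    # 100
```

The clamp on the last line is why the one mismatching case below estimates 1.0 while COUNT(*) returns 0.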

create table ab (a int, b int);
insert into ab select a,b from generate_series(1,100) a,
generate_series(1,100) b;
analyze ab;
set client_min_messages=debug1;

# select count(*) over () from ab limit 1;
DEBUG:  startup_tuples = 1
 count
---
 1
(1 row)


# select count(*) over (partition by a) from ab limit 1;
DEBUG:  startup_tuples = 100
 count
---
   100
(1 row)


# select count(*) over (partition by a order by b) from ab limit 1;
DEBUG:  startup_tuples = 1
 count
---
 1
(1 row)


# select count(*) over (partition by a order by b rows between current
row and unbounded following) from ab limit 1;
DEBUG:  startup_tuples = 100
 count
---
   100
(1 row)


# select count(*) over (partition by a order by b rows between current
row and 10 following) from ab limit 1;
DEBUG:  startup_tuples = 11
 count
---
11
(1 row)


# select count(*) over (partition by a order by b rows between current
row and 10 following exclude current row) from ab limit 1;
DEBUG:  startup_tuples = 10
 count
---
10
(1 row)


# select count(*) over (partition by a order by b rows between current
row and 10 following exclude ties) from ab limit 1;
DEBUG:  startup_tuples = 11
 count
---
11
(1 row)


# select count(*) over (partition by a order by b range between
current row and 10 following exclude ties) from ab limit 1;
DEBUG:  startup_tuples = 11
 count
---
11
(1 row)


# select count(*) over (partition by a order by b range between
current row and unbounded following exclude ties) from ab limit 1;
DEBUG:  startup_tuples = 100
 count
---
   100
(1 row)


# select count(*) over (partition by a order by b range between
current row and unbounded following exclude group) from ab limit 1;
DEBUG:  startup_tuples = 99
 count
---
99
(1 row)


# select count(*) over (partition by a order by b groups between
current row and unbounded following exclude group) from ab limit 1;
DEBUG:  startup_tuples = 99
 count
---
99
(1 row)


# select count(*) over (partition by a rows between current row and
unbounded following exclude group) from ab limit 1;
DEBUG:  startup_tuples = 1
 count
---
 0
(1 row)


# select count(*) over (partition by a rows between current row and
unbounded following exclude ties) from ab limit 1;
DEBUG:  startup_tuples = 

Re: Oversight in reparameterize_path_by_child leading to executor crash

2023-08-02 Thread Ashutosh Bapat
On Tue, Aug 1, 2023 at 6:50 PM Tom Lane  wrote:

>
> Alternatively, could we postpone the reparameterization until
> createplan.c?  Having to build reparameterized expressions we might
> not use seems expensive, and it would also contribute to the memory
> bloat being complained of in nearby threads.
>

As long as the translation happens only once, it should be fine. It's
the repeated translations which should be avoided.

-- 
Best Wishes,
Ashutosh Bapat




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-02 Thread Peter Smith
On Wed, Aug 2, 2023 at 11:19 PM Amit Kapila  wrote:
>
> On Wed, Aug 2, 2023 at 4:09 PM Melih Mutlu  wrote:
> >
> > PFA an updated version with some of the earlier reviews addressed.
> > Forgot to include them in the previous email.
> >
>
> It is always better to explicitly tell which reviews are addressed but
> anyway, I have done some minor cleanup in the 0001 patch including
> removing includes which didn't seem necessary, modified a few
> comments, and ran pgindent. I also thought of modifying some variable
> names based on suggestions by Peter Smith in an email [1] but didn't
> find many of them any better than the current ones so modified just a
> few of those. If you guys are okay with this then let's commit it and
> then we can focus more on the remaining patches.
>

I checked the latest patch v25-0001.

LGTM.

~~

BTW, I have re-tested many cases of HEAD versus HEAD+v25-0001 (using
current test scripts previously mentioned in this thread). Because
v25-0001 is only a refactoring patch we expect that the results should
be the same as for HEAD, and that is what I observed.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Fix bogus Asserts in calc_non_nestloop_required_outer

2023-08-02 Thread Richard Guo
As stated in [1], all paths arriving here are parameterized by top
parents, so we should check against top_parent_relids if it exists in
the two Asserts.

Attached is a patch fixing that.

[1]
https://www.postgresql.org/message-id/CAMbWs4_UoVcCwkVMfi9TjSC%3Do5U6BRHUNZiVhrvSbDfU2HaeDA%40mail.gmail.com

Thanks
Richard


v1-0001-Fix-bogus-Asserts-in-calc_non_nestloop_required_outer.patch
Description: Binary data


Re: Use of additional index columns in rows filtering

2023-08-02 Thread Peter Geoghegan
On Wed, Aug 2, 2023 at 6:48 AM Tomas Vondra
 wrote:
> How come we don't know that until the execution time? Surely when
> building the paths/plans, we match the clauses to the index keys, no? Or
> are you saying that just having a scan key is not enough for it to be
> "access predicate"?

In principle we can and probably should recognize the difference
between "Access Predicates" and "Index Filter Predicates" much earlier
on, during planning. Probably when index paths are first built.

It seems quite likely that there will turn out to be at least 2 or 3
reasons to do this. The EXPLAIN usability issue might be enough of a
reason on its own, though.

> Anyway, this patch is mostly about "Index Cond" mixing two types of
> predicates. But the patch is really about "Filter" predicates - moving
> some of them from table to index. So quite similar to the "index filter
> predicates" except that those are handled by the index AM.

I understand that that's how the patch is structured. It is
nevertheless possible (as things stand) that the patch will make the
planner shift from a plan that uses "Access Predicates" to the maximum
extent possible when scanning a composite index, to a similar plan
that has a similar index scan, for the same index, but with fewer
"Access Predicates" in total. In effect, the patched planner will swap
one type of predicate for another type because doing so enables the
executor to scan an index once instead of scanning it several times.

I don't dispute the fact that this can only happen when the planner
believes (with good reason) that the expected cost will be lower. But
I maintain that there is a novel risk to be concerned about, which is
meaningfully distinct from the general risk of regressions that comes
from making just about any change to the planner. The important
principle here is that we should have a strong bias in the direction
of making quals into true "Access Predicates" whenever practical.

Yeah, technically the patch doesn't directly disrupt how existing
index paths get generated. But it does have the potential to disrupt
it indirectly, by providing an alternative very-similar index path
that's likely to outcompete the existing one in these cases. I think
that we should have just one index path that does everything well
instead.

> But differentiating between access and filter predicates (at the index
> AM level) seems rather independent of what this patch aims to do.

My concern is directly related to the question of "access predicates
versus filter predicates", and the planner's current ignorance on
which is which. The difference may not matter too much right now, but
ISTM that your patch makes it matter a lot more. And so in that
indirect sense it does seem relevant.

The planner has always had a strong bias in the direction of making
clauses that can be index quals into index quals, rather than filter
predicates. It makes sense to do that even when they aren't very
selective. This is a similar though distinct principle.

It's awkward to discuss the issue, since we don't really have official
names for these things just yet (I'm just going with what Markus calls
them for now). And because there is more than one type of "index
filter predicate" in play with the patch (namely those in the index
AM, and those in the index scan executor node). But my concern boils
down to access predicates being strictly better than equivalent index
filter predicates.

> I'm not following. Why couldn't there be some second-order effects?
> Maybe it's obvious / implied by something said earlier, but I think
> every time we decide between multiple choices, there's a risk of picking
> wrong.

Naturally, I agree that some amount of risk is inherent. I believe
that the risk I have identified is qualitatively different to the
standard, inherent kind of risk.

> Anyway, is this still about this patch or rather about your SAOP patch?

It's mostly not about my SAOP patch. It's just that my SAOP patch will
do exactly the right thing in this particular case (at least once
combined with Alena Rybakina's OR-to-SAOP transformation patch). It is
therefore quite clear that a better plan is possible in principle.
Clearly we *can* have the benefit of only one single index scan (i.e.
no BitmapOr node combining multiple similar index scans), without
accepting any novel new risk to get that benefit. We should just have
one index path that does it all, while avoiding duplicative index
paths that embody a distinction that really shouldn't exist -- it
should be dealt with at runtime instead.

> True. IMHO the danger of "index filter" predicates is that people assume
> index conditions eliminate large parts of the index -- which is not
> necessarily the case here. If we can improve this, cool.

I think that we'd find a way to use the same information in the
planner, too. It's not just users that should care about the
difference. And I don't think that it's particularly likely to be
limited to SAOP/MDAM stuff. As 

Re: Extract numeric filed in JSONB more effectively

2023-08-02 Thread Andy Fan
Hi Jian:


> return PointerGetDatum(v->val.numeric);
> should be something like
> PG_RETURN_NUMERIC(v->val.numeric);
> ?
>

Thanks for this reminder; a new patch is attached, and a commitfest
entry has been added as well [1]. For the record, I compared the new
operator with all the existing operators.

select 1 from tb where (a->'a')::numeric = 2;   30.56ms
select 1 from tb where (a->>'a')::numeric = 2; 29.43ms
select 1 from tb where (a@->'a') = 2;  14.80ms

[1] https://commitfest.postgresql.org/44/4476/

-- 
Best Regards
Andy Fan


v2-0001-Add-jsonb-operator-to-return-a-numeric-directly.patch
Description: Binary data


Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-02 Thread Matt Smiley
I thought it might be helpful to share some more details from one of the
case studies behind Nik's suggestion.

Bursty contention on lock_manager lwlocks recently became a recurring cause
of query throughput drops for GitLab.com, and we got to study the behavior
via USDT and uprobe instrumentation along with more conventional
observations (see
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301).  This
turned up some interesting finds, and I thought sharing some of that
research might be helpful.

Results so far suggest that increasing FP_LOCK_SLOTS_PER_BACKEND would have
a much larger positive impact than any other mitigation strategy we have
evaluated.  Rather than reducing hold duration or collision rate, adding
fastpath slots reduces the frequency of even having to acquire those
lock_manager lwlocks.  I suspect this would be helpful for many other
workloads, particularly those with high-frequency queries whose tables
collectively have more than about 16 indexes.

Lowering the lock_manager lwlock acquisition rate means lowering its
contention rate (and probably also its contention duration, since exclusive
mode forces concurrent lockers to queue).

I'm confident this would help our workload, and I strongly suspect it would
be generally helpful by letting queries use fastpath locking more often.

> However, the lmgr/README says this is meant to alleviate contention on
> the lmgr partition locks. Wouldn't it be better to increase the number
> of those locks, without touching the PGPROC stuff?

That was my first thought too, but growing the lock_manager lwlock tranche
isn't nearly as helpful.

On the slowpath, each relation's lock tag deterministically hashes onto a
specific lock_manager lwlock, so growing the number of lock_manager lwlocks
just makes it less likely for two or more frequently locked relations to
hash onto the same lock_manager.

In contrast, growing the number of fastpath slots completely avoids calls
to the slowpath (i.e. no need to acquire a lock_manager lwlock).

The saturation condition we'd like to solve is heavy contention on one or
more of the lock_manager lwlocks.  Since that is driven by the slowpath
acquisition rate of heavyweight locks, avoiding the slowpath is better than
just moderately reducing the contention on the slowpath.

To be fair, increasing the number of lock_manager locks definitely can help
to a certain extent, but it doesn't cover an important general case.  As a
thought experiment, suppose we increase the lock_manager tranche to some
arbitrarily large size, larger than the number of relations in the db.
This unrealistically large size means we have the best case for avoiding
collisions -- each relation maps uniquely onto its own lock_manager
lwlock.  That helps a lot in the case where the workload is spread among
many non-overlapping sets of relations.  But it doesn't help a workload
where any one table is accessed frequently via slowpath locking.

Continuing the thought experiment, if that frequently queried table has 16
or more indexes, or if it is joined to other tables that collectively add
up to over 16 relations, then each of those queries is guaranteed to have
to use the slowpath and acquire the deterministically associated
lock_manager lwlocks.

So growing the tranche of lock_manager lwlocks would help for some
workloads, while other workloads would not be helped much at all.  (As a
concrete example, a workload at GitLab has several frequently queried
tables with over 16 indexes that consequently always use at least some
slowpath locks.)
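That contrast can be sketched with a toy model (purely illustrative; real fastpath eligibility also depends on lock strength and other conditions, and the slot and tranche sizes here are just the familiar defaults):

```python
FASTPATH_SLOTS = 16      # FP_LOCK_SLOTS_PER_BACKEND
LWLOCK_TRANCHE = 16      # number of lock_manager lwlocks

def slowpath_lwlock(relation_oid):
    # deterministic hash: the same relation always lands on the same
    # lwlock, so growing the tranche can't help a single hot relation
    return relation_oid % LWLOCK_TRANCHE

def acquire_locks(relation_oids):
    """Return (fastpath_count, slowpath_lwlock_hits) for one query."""
    fastpath_used = 0
    lwlock_hits = []
    for oid in relation_oids:
        if fastpath_used < FASTPATH_SLOTS:
            fastpath_used += 1            # no lwlock touched at all
        else:
            lwlock_hits.append(slowpath_lwlock(oid))
    return fastpath_used, lwlock_hits

# A query touching 20 relations (one table plus its indexes and joined
# relations) overflows the 16 fastpath slots:
fast, slow = acquire_locks(list(range(1000, 1020)))
print(fast, len(slow))   # 16 4
```

Growing LWLOCK_TRANCHE changes nothing for the hot relation (its hash is fixed), while growing FASTPATH_SLOTS removes the lwlock acquisitions entirely.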

For additional context:

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#what-influences-lock_manager-lwlock-acquisition-rate
Summarizes the pathology and its current mitigations.

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678
Documents the supporting research methodology.

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365370510
What code paths require an exclusive mode lwlock for lock_manager?

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142
Comparison of fastpath vs. slowpath locking, including quantifying the rate
difference.

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726
Confirms the acquisition rate of lock_manager locks is not uniform.  The
sampled workload has a 3x difference in the most vs. least frequently
acquired lock_manager lock, corresponding to the workload's most frequently
accessed relations.

> Well, that has a cost too, as it makes PGPROC larger, right? At the
> moment that struct is already ~880B / 14 cachelines, adding 48 XIDs
> would make it +192B / +3 cachelines. I doubt that won't impact other
> common workloads ...

That's true; growing the data structure may affect L2/L3 cache hit rates
when touching PGPROC.  Is that cost worth the benefit of using fastpath for
a higher percentage of table locks?  The answer may be workload- and

First draft of back-branch release notes is done

2023-08-02 Thread Tom Lane
... at

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c6344d7686f3e3c8243c2c6771996cfc63e71eae

This is a bit earlier than usual because I will be tied up with
$LIFE for the next couple of days.  I will catch up with any
subsequent back-branch commits over the weekend.

As usual, please send comments/corrections by Sunday.

regards, tom lane




Re: [PATCH] Allow Postgres to pick an unused port to listen

2023-08-02 Thread Daniel Gustafsson
> On 10 Jul 2023, at 14:27, Daniel Gustafsson  wrote:

> This patch is Waiting on Author in the current commitfest with no new
> patch presented following the discussion here.  Is there an update ready,
> or should we close it in this CF in favour of a future one?

Since the thread stalled here with the patch waiting on author since May, I
will go ahead and mark it Returned with Feedback in this CF.  Please feel
free to re-open a new entry in a future CF.

--
Daniel Gustafsson





Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2023-08-02 Thread Daniel Gustafsson
> On 30 Jan 2023, at 06:55, Bharath Rupireddy 
>  wrote:

> I started to spend time on this feature again. Thanks all for your
> comments so far.

Since there hasn't been any updates for the past six months, and the patch
hasn't applied for a few months, I am marking this returned with feedback for
now.  Please feel free to open a new entry in a future CF for this patch when
there is a new version.

--
Daniel Gustafsson





Re: POC, WIP: OR-clause support for indexes

2023-08-02 Thread Peter Geoghegan
On Wed, Aug 2, 2023 at 8:58 AM Alena Rybakina  wrote:
> No, I haven't thought about it yet. I studied the example and it would
> really be nice to add the optimization here. I didn't notice any problems
> with its implementation. I also have an obvious example with the "or"
> operator, for example:
>
> select * from multi_test where (a, b) = (1, 1) or (a, b) = (2, 1) ...;
>
> Although I think such a case will be used less often.

Right. As I said, I don't particularly care about the row constructor
syntax -- it's not essential.

In my experience patches like this one that ultimately don't succeed
usually *don't* have specific problems that cannot be fixed. The real
problem tends to be ambiguity about the general high level design. So
more than anything else, ambiguity is the thing that you need to
minimize to be successful here. This is the #1 practical problem, by
far. This may be the only thing about your patch that I feel 100% sure
of.

In my experience it can actually be easier to expand the scope of a
project, and to come up with a more general solution:

https://en.wikipedia.org/wiki/Inventor%27s_paradox

I'm not trying to make your work more difficult by expanding its
scope. I'm actually trying to make your work *easier* by expanding its
scope. I don't claim to know what the specific scope of your patch
should be at all. Just that it might be possible to get a much clearer
picture of what the ideal scope really is by *trying* to generalize it
further -- that understanding is what we lack right now. Even if this
exercise fails in some way, it won't really have been a failure. The
reasons why it fails will still be interesting and practically
relevant.

> It seems to me the most difficult thing is to notice problematic cases
> where the transformations are incorrect, but I think it can be implemented.

Right. To be clear, I am sure that it won't be practical to come up
with a 100% theoretically pure approach. If for no other reason than
this: normalizing to CNF in all cases will run into problems with very
complex predicates. It might even be computationally intractable
(could just be very slow). So there is a clear need to keep
theoretical purity in balance with practical considerations. There is
a need for some kind of negotiation between those two things. Probably
some set of heuristics will ultimately be required to keep costs and
benefits in balance.
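One way to picture the kind of heuristic cap being discussed (a sketch, not a proposal for the actual planner logic): distribute an OR-of-ANDs into CNF, but bail out once the clause count would exceed a budget, since fully distributing n disjuncts of m conjuncts each yields m**n clauses.

```python
from itertools import product

def to_cnf(or_of_ands, max_clauses=64):
    """or_of_ands: list of disjuncts, each a list of atomic conjuncts,
    representing (A1 AND A2 ...) OR (B1 AND B2 ...) OR ...
    Returns the CNF as a list of OR-clauses (tuples of atoms), or None
    if distribution would exceed max_clauses."""
    n_clauses = 1
    for disjunct in or_of_ands:
        n_clauses *= len(disjunct)
        if n_clauses > max_clauses:
            return None                  # too expensive: keep original form
    # (A1 AND B1) OR (A2 AND B2)  ==  (A1 OR A2) AND (A1 OR B2) AND
    #                                 (B1 OR A2) AND (B1 OR B2)
    return list(product(*or_of_ands))

# (a=1 AND b=1) OR (a=2 AND b=1) distributes to 4 OR-clauses:
print(to_cnf([["a=1", "b=1"], ["a=2", "b=1"]]))
```

A real heuristic would also have to weigh whether the resulting clauses are even useful as index quals, which is where the ambiguity discussed above comes in.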

> I agree with your position, but I still don't understand how to consider
> transformations to generalized cases without relying on special cases.

Me neither. I wish I could say something a bit less vague here.

I don't expect you to determine what set of heuristics will ultimately
be required to determine when and how to perform CNF conversions, in
the general case. But having at least some vague idea seems like it
might increase confidence in your design.

> As I understand it, you assume that it is possible to apply
> transformations at the index creation stage, but there I came across the
> selectivity overestimation problem.
>
> I still haven't found a solution for this problem.

Do you think that this problem is just an accidental side-effect? It
isn't necessarily the responsibility of your patch to fix such things.
If it's even possible for selectivity estimates to change, then it's
already certain that sometimes they'll be worse than before -- if only
because of chance interactions. The optimizer is often right for the
wrong reasons, and wrong for the right reasons -- we cannot really
expect any patch to completely avoid problems like that.

> To be honest, I think that in your examples I understand better what you
> mean by normalization to the conjunctive norm, because I only had a
> theoretical idea from the logic course.
>
> Hence, yes, normalization/security checks - now I understand why they
> are necessary.

As I explained to Jim, I am trying to put things in this area on a
more rigorous footing. For example, I have said that the way that the
nbtree code executes SAOP quals is equivalent to DNF. That is
basically true, but it's also my own slightly optimistic
interpretation of history and of the design. That's a good start, but
it's not enough on its own.

My interpretation might still be wrong in some subtle way, that I have
yet to discover. That's really what I'm concerned about with your
patch, too. I'm currently trying to solve a problem that I don't yet
fully understand, so for me "getting a properly working flow of
information" seems like a good practical exercise. I'm trying to
generalize the design of my own patch as far as I can, to see what
breaks, and why it breaks. My intuition is that this will help me with
my own patch by forcing me to gain a truly rigorous understanding of
the problem.

My suggestion about generalizing your approach to cover RowCompareExpr
cases is what I would do, if I were you, and this was my patch. That's
almost exactly what I'm doing with my own patch already, in fact.

--
Peter Geoghegan




Re: Synchronizing slots from primary to standby

2023-08-02 Thread Bharath Rupireddy
On Tue, Aug 1, 2023 at 5:01 PM shveta malik  wrote:
>
> > The work division amongst the sync workers can
> > be simple, the logical replication launcher builds a shared memory
> > structure based on number of slots to sync and starts the sync workers
> > dynamically, and each sync worker picks {dboid, slot name, conninfo}
> > from the shared memory, syncs it and proceeds with other slots.
>
> Do you mean the logical replication launcher builds a shared memory
> structure based on the number of 'dbs' to sync, as I understood from your
> initial comment?

Yes. I haven't looked at the 0003 patch posted upthread. However, the
standby must do the following at a minimum:

- Make GUCs synchronize_slot_names and max_slot_sync_workers of
PGC_POSTMASTER type needing postmaster restart when changed as they
affect the number of slot sync workers.
- LR (logical replication) launcher connects to primary to fetch the
logical slots specified in synchronize_slot_names. This is a one-time
task.
- LR launcher prepares a dynamic shared memory (created via
dsm_create) with some state like locks for IPC and an array of
{slot_name, dboid_associated_with_slot, is_sync_in_progress} - maximum
number of elements in the array is the number of slots specified in
synchronize_slot_names. This is a one-time task.
- LR launcher decides the *best* number of slot sync workers - (based
on some perf numbers) it can just launch, say, one worker per 2 or 4
or 8 etc. slots.
- Each slot sync worker then picks up a slot from the DSM, connects to
primary using primary conn info, syncs it, and moves to another slot.

Not having the capability of on-demand stop/launch of slot sync
workers makes the above design simple IMO.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: add timing information to pg_upgrade

2023-08-02 Thread Nathan Bossart
On Wed, Aug 02, 2023 at 09:09:14AM -0700, Nathan Bossart wrote:
> On Wed, Aug 02, 2023 at 01:02:53PM +0530, Bharath Rupireddy wrote:
>> On Wed, Aug 2, 2023 at 12:45 PM Peter Eisentraut  
>> wrote:
>>> I think we should change the output format to be more like initdb, like
>>>
>>>  Doing something ... ok
>>>
>>> without horizontally aligning all the "ok"s.
>> 
>> While this looks simple, we might end up with a lot of diff and
>> changes after removing MESSAGE_WIDTH. There's a significant part of
>> pg_upgrade code that deals with MESSAGE_WIDTH. I don't think it's
>> worth the effort. Therefore, I'd prefer the simplest possible fix -
>> change the message to '"Checking for  \"aclitem\" data type in user
>> tables". It may be an overkill, but we can consider adding
>> Assert(sizeof(message) < MESSAGE_WIDTH) in progress report functions
>> to not encourage new messages to end up in the same formatting issue.
> 
> I don't think it's that difficult.  IMO the bigger question is whether we
> want to back-patch such a change to v16 at this point.

Here is a work-in-progress patch that seems to get things pretty close to
what Peter is suggesting.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 1a8aab9c9c86c8fa140ccdb53d3479a77de0e762 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 2 Aug 2023 10:20:11 -0700
Subject: [PATCH 1/1] work-in-progress: fix pg_upgrade output

---
 src/bin/pg_upgrade/util.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_upgrade/util.c b/src/bin/pg_upgrade/util.c
index 21ba4c8f12..bea6d7289d 100644
--- a/src/bin/pg_upgrade/util.c
+++ b/src/bin/pg_upgrade/util.c
@@ -50,10 +50,10 @@ end_progress_output(void)
 	if (log_opts.isatty)
 	{
 		printf("\r");
-		pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, "");
+		pg_log(PG_REPORT_NONL, "%-*s\r  ", MESSAGE_WIDTH, "");
 	}
 	else if (log_opts.verbose)
-		pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, "");
+		pg_log(PG_REPORT_NONL, "  ");
 }
 
 /*
@@ -114,9 +114,6 @@ cleanup_output_dirs(void)
  * prep_status
  *
  *	Displays a message that describes an operation we are about to begin.
- *	We pad the message out to MESSAGE_WIDTH characters so that all of the
- *	"ok" and "failed" indicators line up nicely.  (Overlength messages
- *	will be truncated, so don't get too verbose.)
  *
  *	A typical sequence would look like this:
  *		prep_status("about to flarb the next %d files", fileCount);
@@ -135,8 +132,7 @@ prep_status(const char *fmt,...)
 	vsnprintf(message, sizeof(message), fmt, args);
 	va_end(args);
 
-	/* trim strings */
-	pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message);
+	pg_log(PG_REPORT_NONL, "%s ... ", message);
 }
 
 /*
@@ -167,9 +163,9 @@ prep_status_progress(const char *fmt,...)
 	 * put the individual progress items onto the next line.
 	 */
 	if (log_opts.isatty || log_opts.verbose)
-		pg_log(PG_REPORT, "%-*s", MESSAGE_WIDTH, message);
+		pg_log(PG_REPORT, "%s ...", message);
 	else
-		pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message);
+		pg_log(PG_REPORT_NONL, "%s ... ", message);
 }
 
 static void
-- 
2.25.1



Re: SIGQUIT handling, redux

2023-08-02 Thread Andres Freund
Hi,

On 2023-08-02 12:35:19 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-09-11 11:52:55 -0400, Tom Lane wrote:
> >> It's simple enough that maybe we could back-patch it, once it's
> >> aged awhile in HEAD.  OTOH, given the lack of field reports of
> >> trouble here, I'm not sure back-patching is worth the risk.
> 
> > FWIW, looking at collected stack traces in azure, there's a slow but steady
> > stream of crashes below StartupPacketTimeoutHandler. ...
> > Unsurprisingly just in versions before 14, where this change went in.
> > I think that might be enough evidence for backpatching the commit? I've not
> > heard of issues due to the checks in check_on_shmem_exit_lists_are_empty().
> 
> I'd be willing to take a look at this in a few weeks when $real_life
> is a bit less demanding.

Cool.


> Right before minor releases would likely be a bad idea anyway.

Agreed. I had just waded through the stacks, so I thought it'd be worth
bringing up, didn't intend to imply we should backpatch immediately.


Aside: Analyzing this kind of thing at scale is made considerably more painful
by "expected" ereport(PANIC)s (say, out of disk space during WAL writes)
calling abort() and dumping core... while there are other PANICs you really
do want cores of.

Greetings,

Andres Freund




Re: SIGQUIT handling, redux

2023-08-02 Thread Tom Lane
Andres Freund  writes:
> On 2020-09-11 11:52:55 -0400, Tom Lane wrote:
>> It's simple enough that maybe we could back-patch it, once it's
>> aged awhile in HEAD.  OTOH, given the lack of field reports of
>> trouble here, I'm not sure back-patching is worth the risk.

> FWIW, looking at collected stack traces in azure, there's a slow but steady
> stream of crashes below StartupPacketTimeoutHandler. ...
> Unsurprisingly just in versions before 14, where this change went in.
> I think that might be enough evidence for backpatching the commit? I've not
> heard of issues due to the checks in check_on_shmem_exit_lists_are_empty().

I'd be willing to take a look at this in a few weeks when $real_life
is a bit less demanding.  Right before minor releases would likely be
a bad idea anyway.

regards, tom lane




Re: SIGQUIT handling, redux

2023-08-02 Thread Andres Freund
Hi,

On 2020-09-11 11:52:55 -0400, Tom Lane wrote:
> It's simple enough that maybe we could back-patch it, once it's
> aged awhile in HEAD.  OTOH, given the lack of field reports of
> trouble here, I'm not sure back-patching is worth the risk.

FWIW, looking at collected stack traces in azure, there's a slow but steady
stream of crashes below StartupPacketTimeoutHandler. Most seem to be things
like
libcrypto->malloc->StartupPacketTimeoutHandler->proc_exit->socket_close->free->crash
there's a few other variants, some where the stack apparently was not
decipherable for the relevant tooling.

Note that this wouldn't even include cases where this caused hangs - which is
quite common IME.


Unsurprisingly just in versions before 14, where this change went in.

I think that might be enough evidence for backpatching the commit? I've not
heard of issues due to the checks in check_on_shmem_exit_lists_are_empty().

Greetings,

Andres Freund




Re: pg_upgrade fails with in-place tablespace

2023-08-02 Thread Rui Zhao
> I don't think that your solution is the correct move. Having
> pg_tablespace_location() return the physical location of the
> tablespace is very useful because that's the location where the
> physical files are, and client applications don't need to know the
> logic behind the way a path is built.
I understand your concern. I suggest defining the standard behavior of the 
in-place tablespace to match that of the pg_default and pg_global tablespaces. 
The location of the pg_default and pg_global tablespaces is not a concern 
because they have default paths. Similarly, the location of the in-place 
tablespace is not a concern because they are all pg_tblspc/.
Another reason is that, up until now, we have been creating in-place 
tablespaces using the following command:
CREATE TABLESPACE space_test LOCATION ''
However, when using pg_dump to back up this in-place tablespace, it is dumped 
with a different path:
CREATE TABLESPACE space_test LOCATION 'pg_tblspc/'
This can be confusing because it does not match the initial path that we 
created. Additionally, pg_restore cannot restore this in-place tablespace 
because the CREATE TABLESPACE command only supports an empty path for creating 
in-place tablespaces.
> That may not work on Windows where the drive letter is appended at
> the beginning of the path, no? There is is_absolute_path() to do this
> job in a more portable way.
You are correct. I have made the necessary modifications to ensure 
compatibility with in-place tablespaces. Please find the updated code attached.
--
Best regards,
Rui Zhao
--
From:Michael Paquier 
Sent At:2023 Aug. 1 (Tue.) 08:57
To:Mark 
Cc:pgsql-hackers 
Subject:Re: pg_upgrade fails with in-place tablespace
On Sat, Jul 29, 2023 at 11:10:22PM +0800, Rui Zhao wrote:
> 2) Only check the tablespace with an absolute path in pg_upgrade.
> There are also other solutions, such as supporting the creation of
> relative-path tablespace in function CreateTableSpace. But do we
> really need relative-path tablespace? I think in-place tablespace
> is enough by now. My solution may be more lightweight and
> harmless.
+ /* The path of the in-place tablespace is always pg_tblspc/. */
 if (!S_ISLNK(st.st_mode))
- PG_RETURN_TEXT_P(cstring_to_text(sourcepath));
+ PG_RETURN_TEXT_P(cstring_to_text(""));
I don't think that your solution is the correct move. Having
pg_tablespace_location() return the physical location of the
tablespace is very useful because that's the location where the
physical files are, and client applications don't need to know the
logic behind the way a path is built.
- " spcname != 'pg_global'");
+ " spcname != 'pg_global' AND "
+ " pg_catalog.pg_tablespace_location(oid) ~ '^/'");
That may not work on Windows where the drive letter is appended at
the beginning of the path, no? There is is_absolute_path() to do this
job in a more portable way.
--
Michael


0001-Fix-pg_upgrade-with-in-place-tablespaces.patch
Description: Binary data


Re: add timing information to pg_upgrade

2023-08-02 Thread Nathan Bossart
On Wed, Aug 02, 2023 at 01:02:53PM +0530, Bharath Rupireddy wrote:
> On Wed, Aug 2, 2023 at 12:45 PM Peter Eisentraut  wrote:
>> I think we should change the output format to be more like initdb, like
>>
>>  Doing something ... ok
>>
>> without horizontally aligning all the "ok"s.
> 
> While this looks simple, we might end up with a lot of diff and
> changes after removing MESSAGE_WIDTH. There's a significant part of
> pg_upgrade code that deals with MESSAGE_WIDTH. I don't think it's
> worth the effort. Therefore, I'd prefer the simplest possible fix -
> change the message to '"Checking for  \"aclitem\" data type in user
> tables". It may be an overkill, but we can consider adding
> Assert(sizeof(message) < MESSAGE_WIDTH) in progress report functions
> to not encourage new messages to end up in the same formatting issue.

I don't think it's that difficult.  IMO the bigger question is whether we
want to back-patch such a change to v16 at this point.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: PATCH: Using BRIN indexes for sorted output

2023-08-02 Thread Tomas Vondra



On 8/2/23 17:25, Sergey Dudoladov wrote:
> Hello,
> 
>> Parallel version is not supported, but I think it should be possible.
> 
> @Tomas are you working on this ? If not, I would like to give it a try.
> 

Feel free to try. Just keep it in a separate part/patch, to make it
easier to combine the work later.

>> static void
>> AssertCheckRanges(BrinSortState *node)
>> {
>> #ifdef USE_ASSERT_CHECKING
>>
>> #endif
>> }
> 
> I guess it should not be empty at the ongoing development stage.
> 
> Attached a small modification of the patch with a draft of the docs.
> 

Thanks. FWIW it's generally better to always post the whole patch
series, otherwise the cfbot gets confused as it's unable to combine
stuff from different messages.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: add timing information to pg_upgrade

2023-08-02 Thread Nathan Bossart
On Wed, Aug 02, 2023 at 09:14:06AM +0200, Peter Eisentraut wrote:
> On 01.08.23 18:00, Nathan Bossart wrote:
>> One of the main purposes of this thread is to gauge interest.  I'm hoping
>> there are other developers who are interested in reducing
>> pg_upgrade-related downtime, and it seemed like it'd be nice to have
>> built-in functionality for measuring the step times instead of requiring
>> folks to apply this patch every time.  And I think users might also be
>> interested in this information, if for no other reason than to help us
>> pinpoint which steps are taking longer for various workloads.
> 
> If it's just for developers and expert users, perhaps existing shell-level
> functionality like "pg_upgrade | ts" would suffice?

Sure, we could just leave it at that unless anyone sees a reason to bake it
in.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2023-08-02 Thread jian he
On Wed, Aug 2, 2023 at 6:36 PM Amul Sul  wrote:
>
> Hi,
>
> Currently, we have an option to drop the expression of stored generated 
> columns
> as:
>
> ALTER [ COLUMN ] column_name DROP EXPRESSION [ IF EXISTS ]
>
> But don't have support to update that expression. The attached patch provides
> that as:
>
> ALTER [ COLUMN ] column_name SET EXPRESSION expression
>
> Note that this form of ALTER is meant to work for the column which is already
> generated. It then changes the generation expression in the catalog and 
> rewrite
> the table, using the existing table rewrite facilities for ALTER TABLE.
> Otherwise, an error will be reported.
>
> To keep the code flow simple, I have renamed the existing function that was in
> use for DROP EXPRESSION so that it can be used for SET EXPRESSION as well,
> which is a similar design as SET/DROP DEFAULT. I kept this renaming code
> changes in a separate patch to minimize the diff in the main patch.
>
> Demo:
> -- Create table
> CREATE TABLE t1 (x int, y int GENERATED ALWAYS AS (x * 2) STORED);
> INSERT INTO t1 VALUES(generate_series(1,3));
>
> -- Check the generated data
> SELECT * FROM t1;
>  x | y
> ---+---
>  1 | 2
>  2 | 4
>  3 | 6
> (3 rows)
>
> -- Alter the expression
> ALTER TABLE t1 ALTER COLUMN y SET EXPRESSION (x * 4);
>
> -- Check the new data
> SELECT * FROM t1;
>  x | y
> ---+
>  1 |  4
>  2 |  8
>  3 | 12
> (3 rows)
>
> Thank you.
> --
> Regards,
> Amul Sul
> EDB: http://www.enterprisedb.com
-
setup.

BEGIN;
set search_path = test;
DROP TABLE if exists gtest_parent, gtest_child;

CREATE TABLE gtest_parent (f1 date NOT NULL, f2 bigint, f3 bigint
GENERATED ALWAYS AS (f2 * 2) STORED) PARTITION BY RANGE (f1);

CREATE TABLE gtest_child PARTITION OF gtest_parent
  FOR VALUES FROM ('2016-07-01') TO ('2016-08-01');  -- inherits gen expr

CREATE TABLE gtest_child2 PARTITION OF gtest_parent (
f3 WITH OPTIONS GENERATED ALWAYS AS (f2 * 22) STORED  -- overrides gen expr
) FOR VALUES FROM ('2016-08-01') TO ('2016-09-01');

CREATE TABLE gtest_child3 (f1 date NOT NULL, f2 bigint, f3 bigint
GENERATED ALWAYS AS (f2 * 33) STORED);
ALTER TABLE gtest_parent ATTACH PARTITION gtest_child3 FOR VALUES FROM
('2016-09-01') TO ('2016-10-01');

INSERT INTO gtest_parent (f1, f2) VALUES ('2016-07-15', 1);
INSERT INTO gtest_parent (f1, f2) VALUES ('2016-07-15', 2);
INSERT INTO gtest_parent (f1, f2) VALUES ('2016-08-15', 3);
UPDATE gtest_parent SET f1 = f1 + 60 WHERE f2 = 1;

ALTER TABLE ONLY gtest_parent ALTER COLUMN f3 SET EXPRESSION (f2 * 4);
ALTER TABLE gtest_child ALTER COLUMN f3 SET EXPRESSION (f2 * 10);
COMMIT;

set search_path = test;
SELECT table_name, column_name, is_generated, generation_expression
FROM information_schema.columns
WHERE table_name  in ('gtest_child','gtest_child1',
'gtest_child2','gtest_child3')
order by 1,2;
result:
  table_name  | column_name | is_generated | generation_expression
--+-+--+---
 gtest_child  | f1  | NEVER|
 gtest_child  | f1  | NEVER|
 gtest_child  | f2  | NEVER|
 gtest_child  | f2  | NEVER|
 gtest_child  | f3  | ALWAYS   | (f2 * 2)
 gtest_child  | f3  | ALWAYS   | (f2 * 10)
 gtest_child2 | f1  | NEVER|
 gtest_child2 | f1  | NEVER|
 gtest_child2 | f2  | NEVER|
 gtest_child2 | f2  | NEVER|
 gtest_child2 | f3  | ALWAYS   | (f2 * 22)
 gtest_child2 | f3  | ALWAYS   | (f2 * 2)
 gtest_child3 | f1  | NEVER|
 gtest_child3 | f1  | NEVER|
 gtest_child3 | f2  | NEVER|
 gtest_child3 | f2  | NEVER|
 gtest_child3 | f3  | ALWAYS   | (f2 * 2)
 gtest_child3 | f3  | ALWAYS   | (f2 * 33)
(18 rows)

One partition ends up with two generated expressions for the same column.
Is this the expected behavior?

In the regression tests, you could replace \d table_name with a SQL query
(similar to the one above) to get the generated-expression metadata,
since here you want the metadata to be correct. Then one SELECT query
can validate that the generated expression behaves sanely.




Re: PATCH: Using BRIN indexes for sorted output

2023-08-02 Thread Sergey Dudoladov
Hello,

> Parallel version is not supported, but I think it should be possible.

@Tomas are you working on this ? If not, I would like to give it a try.

> static void
> AssertCheckRanges(BrinSortState *node)
> {
> #ifdef USE_ASSERT_CHECKING
>
> #endif
> }

I guess it should not be empty at the ongoing development stage.

Attached a small modification of the patch with a draft of the docs.

Regards,
Sergey Dudoladov
From d4050be4bfd0a518eba0ff0a7b561f0420be9861 Mon Sep 17 00:00:00 2001
From: Sergey Dudoladov 
Date: Wed, 2 Aug 2023 16:47:35 +0200
Subject: [PATCH] BRIN sort: add docs / fix typos

---
 doc/src/sgml/brin.sgml  | 48 +
 src/backend/executor/nodeBrinSort.c |  9 +++---
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml
index 9c5ffcddf8..a76f26e032 100644
--- a/doc/src/sgml/brin.sgml
+++ b/doc/src/sgml/brin.sgml
@@ -810,6 +810,54 @@ LOG:  request for BRIN range summarization for index "brin_wi_idx" page 128 was
 
 
 
+
+ BRIN Sort
+
+ 
+  This section provides an overview of BRIN sort that may be
+  useful to advanced users. See src/backend/executor/nodeBrinSort.c
+  in the source distribution for the implementation.
+ 
+
+ 
+  Overview
+  
+   The information about ranges in a BRIN index can be used
+   to split a table into parts and sort each part separately. A BRIN
+   sort is an index scan that uses the idea to incrementally sort the data. On
+   large tables BRIN index scan bridges the performance gap 
+   between having a large B-tree index that supports ordered retrieval, and having
+   no index at all.
+  
+ 
+
+ 
+  Implementation
+  
+
+  BRIN sort fetches a list of page ranges from the BRIN
+  index and sorts them by their summary info depending on the operator class and
+  sort direction. It then determines the watermark: the next summary value from
+  the list. Tuples with summary values less than the watermark are sorted and
+  outputted directly; tuples with values greater than the watermark are spilled
+  into the next iteration. The process continues until either all ranges are
+  processed or the desired number of tuples has been retrieved.
+  
+ 
+
+ 
+  Limitations
+  
+   The performance of the BRIN sort depends on the degree of
+   correlation between the values of the indexed column and their physical location
+   on disk: in the case of high correlation, there will be few overlapping ranges,
+   and performance will be best. Another limitation is that BRIN
+   sort only works for indexes built on a single column.
+  
+ 
+
+
+
 
  Extensibility
 
diff --git a/src/backend/executor/nodeBrinSort.c b/src/backend/executor/nodeBrinSort.c
index d1b6b4e1ed..c640aea2d5 100644
--- a/src/backend/executor/nodeBrinSort.c
+++ b/src/backend/executor/nodeBrinSort.c
@@ -731,8 +731,9 @@ brinsort_load_spill_tuples(BrinSortState *node, bool check_watermark)
 static bool
 brinsort_next_range(BrinSortState *node, bool asc)
 {
-	/* FIXME free the current bs_range, if any */
-	node->bs_range = NULL;
+	/* free the current bs_range, if any */
+	if (node->bs_range != NULL)
+		pfree(node->bs_range);
 
 	/*
 	 * Mark the position, so that we can restore it in case we reach the
@@ -1227,7 +1228,7 @@ IndexNext(BrinSortState *node)
 
 			case BRINSORT_PROCESS_RANGE:
 
-elog(DEBUG1, "phase BRINSORT_PROCESS_RANGE");
+elog(DEBUG1, "phase = BRINSORT_PROCESS_RANGE");
 
 slot = node->ss.ps.ps_ResultTupleSlot;
 
@@ -1263,7 +1264,7 @@ IndexNext(BrinSortState *node)
 }
 else
 {
-	/* updte the watermark and try reading more ranges */
+	/* update the watermark and try reading more ranges */
 	node->bs_phase = BRINSORT_LOAD_RANGE;
 	brinsort_update_watermark(node, false, asc);
 }
-- 
2.34.1



Re: Improve const use in zlib-using code

2023-08-02 Thread Tristan Partin

Peter,

I like the idea. Though the way you have it implemented at the moment 
seems like a trap in that any time zlib.h is included someone also has 
to remember to add this define. I would recommend adding the define to 
the build systems instead.


Since you put in the work to find the version of zlib that added this, 
You might also want to add `required: '>= 1.2.5.2'` to the 
`dependency('zlib')` call in the main meson.build. I am wondering if we 
could remove the `z_streamp` check too. The version in which it was added
isn't obvious.


--
Tristan Partin
Neon (https://neon.tech)




Re: Use of additional index columns in rows filtering

2023-08-02 Thread Tomas Vondra
On 8/2/23 02:50, Peter Geoghegan wrote:
> On Mon, Jul 24, 2023 at 11:59 AM Peter Geoghegan  wrote:
>>> That might be true but I'm not sure what to do about that unless we
>>> incorporate some "robustness" measure into the costing. If every
>>> measure we have says one plan is better, don't we have to choose it?
>>
>> I'm mostly concerned about the possibility itself -- it's not a matter
>> of tuning the cost. I agree that that approach would probably be
>> hopeless.
> 
> This seems related to the fact that EXPLAIN doesn't expose the
> difference between what Markus Winand calls "Access Predicates" and
> "Index Filter Predicates", as explained here:
> 
> https://use-the-index-luke.com/sql/explain-plan/postgresql/filter-predicates
> 
> That is, both "Access Predicates" and "Index Filter Predicates" are
> shown after an "Index Cond: " entry in Postgres EXPLAIN output, in
> general. Even though these are two very different things. I believe
> that the underlying problem for the implementation (the reason why we
> can't easily break this out further in EXPLAIN output) is that we
> don't actually know what kind of predicate it is ourselves -- at least
> not until execution time. We wait until then to do nbtree
> preprocessing/scan setup. Though perhaps we should do more of this
> during planning instead [1], for several reasons (fixing this is just
> one of those reasons).
> 

How come we don't know that until the execution time? Surely when
building the paths/plans, we match the clauses to the index keys, no? Or
are you saying that just having a scan key is not enough for it to be an
"access predicate"?

Anyway, this patch is mostly about "Index Cond" mixing two types of
predicates. But the patch is really about "Filter" predicates - moving
some of them from table to index. So quite similar to the "index filter
predicates" except that those are handled by the index AM.

> The risk to "robustness" for cases like the one I drew attention to on
> this thread would probably have been obvious all along if EXPLAIN
> output were more like what Markus would have us do -- he certainly has
> a good point here, in general.
> 
> Breaking things out in EXPLAIN output along these lines might also
> give us a better general sense of when a similar plan shift like this
> was actually okay -- even according to something like my
> non-traditional "query robustness" criteria. It's much harder for me
> to argue that a shift in plans from what Markus calls an "Index Filter
> Predicate" to what the patch will show under "Index Filter:" is a
> novel new risk. That would be a much less consequential difference,
> because those two things are fairly similar anyway.
> 

But differentiating between access and filter predicates (at the index
AM level) seems rather independent of what this patch aims to do.

FWIW I agree we should make the differences visible in the explain. That
seems fairly useful for non-trivial index access paths, and it does not
change the execution at all. I think it'd be fine to do that only for
VERBOSE mode, and only for EXPLAIN ANALYZE (if we only do this at
execution time for now).

> Besides, such a shift in plan would have to "legitimately win" for
> things to work out like this. If we're essentially picking between two
> different subtypes of "Index Filter Predicate", then there can't be
> the same weird second order effects that we see when an "Access
> Predicate" is out-competed by an "Index Filter Predicate". It's
> possible that expression evaluation of a small-ish conjunctive
> predicate like "Index Filter: ((tenthous = 1) OR (tenthous = 3) OR
> (tenthous = 42))" will be faster than a natively executed SAOP. You
> can't do that kind of expression evaluation in the index AM itself
> (assuming that there is an opclass for nbtree to use in the first
> place, which there might not be in the case of any non-key INCLUDE
> columns). With the patch, you can do all this. And I think that you
> can derisk it without resorting to the overly conservative approach of
> limiting ourselves to non-key columns from INCLUDE indexes.
> 

I'm not following. Why couldn't there be some second-order effects?
Maybe it's obvious / implied by something said earlier, but I think
every time we decide between multiple choices, there's a risk of picking
wrong.

Anyway, is this still about this patch or rather about your SAOP patch?

> To summarize: As Markus says on the same page. "Index filter
> predicates give a false sense of safety; even though an index is used,
> the performance degrades rapidly on a growing data volume or system
> load". That's essentially what I want to avoid here. I'm much less
> concerned about competition between what are really "Index Filter
> Predicate" subtypes. Allowing that competition to take place is not
> entirely risk-free, of course, but it seems about as risky as anything
> else in this area.
> 

True. IMHO the danger of "index filter" predicates is that people assume
index conditions eliminate 

Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-02 Thread Amit Kapila
On Wed, Aug 2, 2023 at 4:09 PM Melih Mutlu  wrote:
>
> PFA an updated version with some of the earlier reviews addressed.
> Forgot to include them in the previous email.
>

It is always better to explicitly tell which reviews are addressed but
anyway, I have done some minor cleanup in the 0001 patch including
removing includes which didn't seem necessary, modified a few
comments, and ran pgindent. I also thought of modifying some variable
names based on suggestions by Peter Smith in an email [1] but didn't
find many of them any better than the current ones so modified just a
few of those. If you guys are okay with this then let's commit it and
then we can focus more on the remaining patches.

[1] - 
https://www.postgresql.org/message-id/CAHut%2BPs3Du9JFmhecWY8%2BVFD11VLOkSmB36t_xWHHQJNMpdA-A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.


v25-0001-Refactor-to-split-Apply-and-Tablesync-Workers.patch
Description: Binary data


Re: Adding a LogicalRepWorker type field

2023-08-02 Thread Melih Mutlu
Hi,

Peter Smith , 2 Ağu 2023 Çar, 09:27 tarihinde şunu
yazdı:

> On Wed, Aug 2, 2023 at 3:35 PM Masahiko Sawada 
> wrote:
> >
> > I can see the problem you stated and it's true that the worker's type
> > never changes during its lifetime. But I'm not sure we really need to
> > add a new variable to LogicalRepWorker since we already have enough
> > information there to distinguish the worker's type.
> >
> > Currently, we deduce the worker's type by checking some fields that
> > never change such as relid and leader_piid. It's fine to me as long as
> > we encapcel the deduction of worker's type by macros (or inline
> > functions). Even with the proposed patch (0001 patch), we still need a
> > similar logic to determine the worker's type.
>
> Thanks for the feedback.
>
> I agree that the calling code will not look very different
> with/without this patch 0001, because everything is hidden by the
> macros/functions. But those HEAD macros/functions are also hiding the
> inefficiencies of the type deduction -- e.g. IMO it is quite strange
> that we can only recognize the worker is an "apply worker" by
> eliminating the other 2 possibilities.
>

Isn't the apply worker type still decided by eliminating the other choices
even with the patch?

  /* Prepare the worker slot. */
+ if (OidIsValid(relid))
+ worker->type = TYPE_TABLESYNC_WORKER;
+ else if (is_parallel_apply_worker)
+ worker->type = TYPE_PARALLEL_APPLY_WORKER;
+ else
+ worker->type = TYPE_APPLY_WORKER;


> 3.
> +#define IsLeaderApplyWorker() is_worker_type(MyLogicalRepWorker,
> TYPE_APPLY_WORKER)
> +#define IsParallelApplyWorker() is_worker_type(MyLogicalRepWorker,
> TYPE_PARALLEL_APPLY_WORKER)
> +#define IsTablesyncWorker() is_worker_type(MyLogicalRepWorker,
> TYPE_TABLESYNC_WORKER)
>
> My thoughts were around removing am_XXX_worker() and
> IsXXXWorker() functions and just have IsXXXWorker()
> alone with using IsXXXWorker(MyLogicalRepWorker) in places where
> am_XXX_worker() is used. If implementing this leads to something like
> the above with is_worker_type, -1 from me.
>

I agree that having both is_worker_type and IsXXXWorker makes the code more
confusing, especially when both ways of checking the type are used in the
same file/function, as follows:

> logicalrep_worker_detach(void)
>  {
>   /* Stop the parallel apply workers. */
> - if (am_leader_apply_worker())
> + if (IsLeaderApplyWorker())
>   {
>   List   *workers;
>   ListCell   *lc;
> @@ -749,7 +756,7 @@ logicalrep_worker_detach(void)
>   {
>   LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
>
> - if (isParallelApplyWorker(w))
> + if (is_worker_type(w, TYPE_PARALLEL_APPLY_WORKER))
>   logicalrep_worker_stop_internal(w, SIGTERM);
>}



Regards,
-- 
Melih Mutlu
Microsoft


Re: Potential memory leak in contrib/intarray's g_intbig_compress

2023-08-02 Thread Matthias van de Meent
On Fri, 14 Jul 2023 at 07:57, Michael Paquier  wrote:
>
> On Thu, Jul 13, 2023 at 06:28:39PM +0200, Matthias van de Meent wrote:
> > There are similar pfree calls in the _int_gist.c file's g_int_compress
> > function, which made me think we do need to clean up after use, but
> > indeed these pfrees are useless (or even harmful if bug #17888 can be
> > trusted)
>
> Indeed, all these are in a GiST temporary context.  So you'd mean
> something like the attached perhaps, for both the decompress and
> compress paths?

Yes, looks good to me. Thanks!

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Oversight in reparameterize_path_by_child leading to executor crash

2023-08-02 Thread Richard Guo
On Tue, Aug 1, 2023 at 9:20 PM Tom Lane  wrote:

> Richard Guo  writes:
> > So what I'm thinking is that maybe we can add a new type of path, named
> > SampleScanPath, to have the TableSampleClause per path.  Then we can
> > safely reparameterize the TableSampleClause as needed for each
> > SampleScanPath.  That's what the attached patch does.
>
> Alternatively, could we postpone the reparameterization until
> createplan.c?  Having to build reparameterized expressions we might
> not use seems expensive, and it would also contribute to the memory
> bloat being complained of in nearby threads.


I did think about this option but figured out that it seems beyond the
scope of just fixing SampleScan.  But if we want to optimize the
reparameterization mechanism besides fixing this crash, I think this
option is much better.  I drafted a patch as attached.

Thanks
Richard


v2-0001-Postpone-reparameterization-of-paths-until-when-creating-plans.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-02 Thread Melih Mutlu
Hi,

>
PFA an updated version with some of the earlier reviews addressed.
Forgot to include them in the previous email.

Thanks,
-- 
Melih Mutlu
Microsoft


v24-0003-Reuse-connection-when-tablesync-workers-change-t.patch
Description: Binary data


v24-0002-Reuse-Tablesync-Workers.patch
Description: Binary data


v24-0001-Refactor-to-split-Apply-and-Tablesync-Workers.patch
Description: Binary data


ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2023-08-02 Thread Amul Sul
Hi,

Currently, we have an option to drop the expression of a stored generated
column:

ALTER [ COLUMN ] column_name DROP EXPRESSION [ IF EXISTS ]

But we don't have support for updating that expression. The attached patch
provides that as:

ALTER [ COLUMN ] column_name SET EXPRESSION expression

Note that this form of ALTER is meant to work on a column that is already
generated; otherwise, an error is reported. It changes the generation
expression in the catalog and rewrites the table, using the existing table
rewrite facilities for ALTER TABLE.

To keep the code flow simple, I have renamed the existing function that was
in use for DROP EXPRESSION so that it can be used for SET EXPRESSION as
well, which is a design similar to SET/DROP DEFAULT. I kept the renaming
changes in a separate patch to minimize the diff in the main patch.

Demo:
-- Create table
CREATE TABLE t1 (x int, y int GENERATED ALWAYS AS (x * 2) STORED);
INSERT INTO t1 VALUES(generate_series(1,3));

-- Check the generated data
SELECT * FROM t1;
 x | y
---+---
 1 | 2
 2 | 4
 3 | 6
(3 rows)

-- Alter the expression
ALTER TABLE t1 ALTER COLUMN y SET EXPRESSION (x * 4);

-- Check the new data
SELECT * FROM t1;
 x | y
---+
 1 |  4
 2 |  8
 3 | 12
(3 rows)

Thank you.
-- 
Regards,
Amul Sul
EDB: http://www.enterprisedb.com
From ef1448f7852000d5b701f9e3c7fe88737670740a Mon Sep 17 00:00:00 2001
From: Amul Sul 
Date: Mon, 31 Jul 2023 15:43:51 +0530
Subject: [PATCH v1 2/2] Allow to change generated column expression

---
 doc/src/sgml/ref/alter_table.sgml   |  14 +-
 src/backend/commands/tablecmds.c|  88 +
 src/backend/parser/gram.y   |  10 ++
 src/bin/psql/tab-complete.c |   2 +-
 src/test/regress/expected/generated.out | 167 
 src/test/regress/sql/generated.sql  |  36 -
 6 files changed, 262 insertions(+), 55 deletions(-)

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index d4d93eeb7c6..1b68dea8d9b 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -46,6 +46,7 @@ ALTER TABLE [ IF EXISTS ] name
 ALTER [ COLUMN ] column_name SET DEFAULT expression
 ALTER [ COLUMN ] column_name DROP DEFAULT
 ALTER [ COLUMN ] column_name { SET | DROP } NOT NULL
+ALTER [ COLUMN ] column_name SET EXPRESSION expression
 ALTER [ COLUMN ] column_name DROP EXPRESSION [ IF EXISTS ]
 ALTER [ COLUMN ] column_name ADD GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ]
 ALTER [ COLUMN ] column_name { SET GENERATED { ALWAYS | BY DEFAULT } | SET sequence_option | RESTART [ [ WITH ] restart ] } [...]
@@ -255,13 +256,18 @@ WITH ( MODULUS numeric_literal, REM
 

 
-   
+   
 DROP EXPRESSION [ IF EXISTS ]
 
  
-  This form turns a stored generated column into a normal base column.
-  Existing data in the columns is retained, but future changes will no
-  longer apply the generation expression.
+  The SET form replaces the generation expression of a
+  column.  Existing data in the column is rewritten and all future
+  changes will apply the new generation expression.
+ 
+ 
+  The DROP form turns a stored generated column into a
+  normal base column.  Existing data in the columns is retained, but future
+  changes will no longer apply the generation expression.
  
 
  
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3b499fc0d8e..df26afe16cb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -456,7 +456,8 @@ static ObjectAddress ATExecSetIdentity(Relation rel, const char *colName,
 static ObjectAddress ATExecDropIdentity(Relation rel, const char *colName, bool missing_ok, LOCKMODE lockmode);
 static void ATPrepColumnExpression(Relation rel, AlterTableCmd *cmd,
    bool recurse, bool recursing, LOCKMODE lockmode);
-static ObjectAddress ATExecColumnExpression(Relation rel, const char *colName,
+static ObjectAddress ATExecColumnExpression(AlteredTableInfo *tab, Relation rel,
+			const char *colName, Node *newDefault,
 			bool missing_ok, LOCKMODE lockmode);
 static ObjectAddress ATExecSetStatistics(Relation rel, const char *colName, int16 colNum,
 		 Node *newValue, LOCKMODE lockmode);
@@ -5056,7 +5057,8 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 			ATExecCheckNotNull(tab, rel, cmd->name, lockmode);
 			break;
 		case AT_ColumnExpression:
-			address = ATExecColumnExpression(rel, cmd->name, cmd->missing_ok, lockmode);
+			address = ATExecColumnExpression(tab, rel, cmd->name, cmd->def,
+			 cmd->missing_ok, lockmode);
 			break;
 		case AT_SetStatistics:	/* ALTER COLUMN SET STATISTICS */
 			address = ATExecSetStatistics(rel, cmd->name, cmd->num, cmd->def, lockmode);
@@ -8015,16 +8017,21 @@ static void
 ATPrepColumnExpression(Relation rel, AlterTableCmd *cmd, bool recurse, 

Re: pgbnech: allow to cancel queries during benchmark

2023-08-02 Thread Yugo NAGATA
On Wed, 2 Aug 2023 16:37:53 +0900
Yugo NAGATA  wrote:

> Hello Fabien,
> 
> On Fri, 14 Jul 2023 20:32:01 +0900
> Yugo NAGATA  wrote:
> 
> I attached the updated patch.

I'm sorry. I forgot to attach the patch.

Regards,
Yugo Nagata

> 
> > Hello Fabien,
> > 
> > Thank you for your review!
> > 
> > On Mon, 3 Jul 2023 20:39:23 +0200 (CEST)
> > Fabien COELHO  wrote:
> > 
> > > 
> > > Yugo-san,
> > > 
> > > Some feedback about v1 of this patch.
> > > 
> > > Patch applies cleanly, compiles.
> > > 
> > > There are no tests, could there be one? ISTM that one could be done with 
> > > a 
> > > "SELECT pg_sleep(...)" script??
> > 
> > Agreed. I will add the test.
> 
> I added a TAP test.
> 
> > 
> > > The global name "all_state" is quite ambiguous, what about 
> > > "client_states" 
> > > instead? Or maybe it could be avoided, see below.
> > > 
> > > Instead of renaming "state" to "all_state" (or client_states as suggested 
> > > above), I'd suggest to minimize the patch by letting "state" inside the 
> > > main and adding a "client_states = state;" just after the allocation, or 
> > > another approach, see below.
> > 
> > Ok, I'll fix to add a global variable "client_states" and make this point to
> > "state" instead of changing "state" to global.
> 
> Done.
> 
> >  
> > > Should PQfreeCancel be called on deconnections, in finishCon? I think 
> > > that 
> > > there may be a memory leak with the current implementation??
> > 
> > Agreed. I'll fix.
> 
> Done.
> 
> Regards,
> Yugo Nagata
> 
> >  
> > > Maybe it should check that cancel is not NULL before calling PQcancel?
> > 
> > I think this is already checked as below, but am I missing something?
> > 
> > +   if (all_state[i].cancel != NULL)
> > +   (void) PQcancel(all_state[i].cancel, errbuf, sizeof(errbuf));
> > 
> > > After looking at the code, I'm very unclear whether they may be some 
> > > underlying race conditions, or not, depending on when the cancel is 
> > > triggered. I think that some race conditions are still likely given the 
> > > current thread 0 implementation, and dealing with them with a barrier or 
> > > whatever is not desirable at all.
> > > 
> > > In order to work around this issue, ISTM that we should go away from the 
> > > simple and straightforward thread 0 approach, and the only clean way is 
> > > that the cancelation should be managed by each thread for its own client.
> > > 
> > > I'd suggest to have the advanceState to call PQcancel when 
> > > CancelRequested 
> > > is set and switch to CSTATE_ABORTED to end properly. This means that 
> > > there 
> > > would be no need for the global client states pointer, so the patch 
> > > should 
> > > be smaller and simpler. Possibly there would be some shortcuts added here 
> > > and there to avoid lingering after the control-C, in threadRun.
> > 
> > I am not sure this approach is simpler than mine. 
> > 
> > In multi-threads, only one thread can catch the signal and other threads
> > continue to run. Therefore, if Ctrl+C is pressed while threads are waiting
> > for responses from the backend in wait_on_socket_set, only one thread can
> > be interrupted and return, but other threads will continue to wait and
> > cannot check CancelRequested. So, for implementing your suggestion, we need
> > some hack to make all threads return from wait_on_socket_set when the event
> > occurs, but I don't have an idea to do this in a simpler way.
> > 
> > In my patch, all threads can return from wait_on_socket_set at Ctrl+C
> > because when thread #0 cancels all connections, the following error is
> > sent to all sessions:
> > 
> >  ERROR:  canceling statement due to user request
> > 
> > and all threads will receive the response from the backend.
> > 
> > Regards,
> > Yugo Nagata
> > 
> > -- 
> > Yugo NAGATA 
> > 
> > 
> 
> 
> -- 
> Yugo NAGATA 
> 
> 


-- 
Yugo NAGATA 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 539c2795e2..68278d2b18 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -596,6 +596,7 @@ typedef enum
 typedef struct
 {
 	PGconn	   *con;			/* connection handle to DB */
+	PGcancel   *cancel;			/* query cancel */
 	int			id;/* client No. */
 	ConnectionStateEnum state;	/* state machine's current state. */
 	ConditionalStack cstack;	/* enclosing conditionals state */
@@ -638,6 +639,8 @@ typedef struct
  * here */
 } CState;
 
+CState	*client_states;		/* status of all clients */
+
 /*
  * Thread state
  */
@@ -3631,6 +3634,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 		st->state = CSTATE_ABORTED;
 		break;
 	}
+	st->cancel = PQgetCancel(st->con);
 
 	/* reset now after connection */
 	now = pg_time_now();
@@ -4662,6 +4666,18 @@ disconnect_all(CState *state, int length)
 		finishCon([i]);
 }
 
+/* send cancel requests to all connections */
+static void
+cancel_all()
+{
+	for (int i = 0; i < nclients; i++)
+	{
+		char		errbuf[256];
+		if (client_states[i].cancel != NULL)

Improve const use in zlib-using code

2023-08-02 Thread Peter Eisentraut
Now that we have effectively de-supported CentOS 6, by removing support 
for its OpenSSL version, I think we could also modernize the use of some 
other libraries, such as zlib.


If we define ZLIB_CONST before including zlib.h, zlib augments some
interfaces with const decorations.  By doing that we can keep our own
interfaces cleaner and can remove some unconstify calls.

ZLIB_CONST was introduced in zlib 1.2.5.2 (17 Dec 2011); CentOS 6 has 
zlib-1.2.3-29.el6.x86_64.


Note that if you use this patch and compile on CentOS 6, it still works, 
you just get a few compiler warnings about discarding qualifiers.  Old 
environments tend to produce more compiler warnings anyway, so this 
doesn't seem so bad.

From 324e37adff4c51f4f10807386c0c098cb8d21608 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 2 Aug 2023 11:01:27 +0200
Subject: [PATCH] Improve const use in zlib-using code

If we define ZLIB_CONST before including zlib.h, zlib augments some
interfaces with const decorations.  By doing that we can keep our own
interfaces cleaner and can remove some unconstify calls.

ZLIB_CONST was introduced in zlib 1.2.5.2 (17 Dec 2011).

XXX CentOS 6 has zlib-1.2.3-29.el6.x86_64
---
 contrib/pgcrypto/pgp-compress.c | 3 ++-
 src/backend/backup/basebackup_gzip.c| 1 +
 src/bin/pg_basebackup/bbstreamer_gzip.c | 3 ++-
 src/bin/pg_basebackup/pg_basebackup.c   | 1 +
 src/bin/pg_basebackup/pg_receivewal.c   | 1 +
 src/bin/pg_basebackup/walmethods.c  | 6 +++---
 src/bin/pg_dump/compress_gzip.c | 1 +
 src/common/compression.c| 1 +
 8 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/contrib/pgcrypto/pgp-compress.c b/contrib/pgcrypto/pgp-compress.c
index 086bec31ae..5615f1acce 100644
--- a/contrib/pgcrypto/pgp-compress.c
+++ b/contrib/pgcrypto/pgp-compress.c
@@ -40,6 +40,7 @@
 
 #ifdef HAVE_LIBZ
 
+#define ZLIB_CONST
 #include 
 
 #define ZIP_OUT_BUF 8192
@@ -113,7 +114,7 @@ compress_process(PushFilter *next, void *priv, const uint8 *data, int len)
/*
 * process data
 */
-   st->stream.next_in = unconstify(uint8 *, data);
+   st->stream.next_in = data;
st->stream.avail_in = len;
while (st->stream.avail_in > 0)
{
diff --git a/src/backend/backup/basebackup_gzip.c b/src/backend/backup/basebackup_gzip.c
index b2d5e19ad9..45210a2ff0 100644
--- a/src/backend/backup/basebackup_gzip.c
+++ b/src/backend/backup/basebackup_gzip.c
@@ -13,6 +13,7 @@
 #include "postgres.h"
 
 #ifdef HAVE_LIBZ
+#define ZLIB_CONST
 #include 
 #endif
 
diff --git a/src/bin/pg_basebackup/bbstreamer_gzip.c b/src/bin/pg_basebackup/bbstreamer_gzip.c
index 3bdbfa0bc4..fa52392a48 100644
--- a/src/bin/pg_basebackup/bbstreamer_gzip.c
+++ b/src/bin/pg_basebackup/bbstreamer_gzip.c
@@ -14,6 +14,7 @@
 #include 
 
 #ifdef HAVE_LIBZ
+#define ZLIB_CONST
 #include 
 #endif
 
@@ -269,7 +270,7 @@ bbstreamer_gzip_decompressor_content(bbstreamer *streamer,
mystreamer = (bbstreamer_gzip_decompressor *) streamer;
 
 	zs = &mystreamer->zstream;
-   zs->next_in = (uint8 *) data;
+   zs->next_in = (const uint8 *) data;
zs->avail_in = len;
 
/* Process the current chunk */
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1dc8efe0cb..90fd044ec3 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #ifdef HAVE_LIBZ
+#define ZLIB_CONST
 #include 
 #endif
 
diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c
index d0a4079d50..c0de77b5aa 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -24,6 +24,7 @@
 #include 
 #endif
 #ifdef HAVE_LIBZ
+#define ZLIB_CONST
 #include 
 #endif
 
diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index 376ddf72b7..b663a3d978 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -19,6 +19,7 @@
 #include 
 #endif
 #ifdef HAVE_LIBZ
+#define ZLIB_CONST
 #include 
 #endif
 
@@ -705,7 +706,7 @@ typedef struct TarMethodData
 
 #ifdef HAVE_LIBZ
 static bool
-tar_write_compressed_data(TarMethodData *tar_data, void *buf, size_t count,
+tar_write_compressed_data(TarMethodData *tar_data, const void *buf, size_t count,
  bool flush)
 {
tar_data->zp->next_in = buf;
@@ -782,8 +783,7 @@ tar_write(Walfile *f, const void *buf, size_t count)
 #ifdef HAVE_LIBZ
else if (f->wwmethod->compression_algorithm == PG_COMPRESSION_GZIP)
{
-		if (!tar_write_compressed_data(tar_data, unconstify(void *, buf),
-									   count, false))
+   if (!tar_write_compressed_data(tar_data, buf, count, false))
return -1;
f->currpos += count;
return count;
diff 

Re: [PoC] Reducing planning time when tables have many partitions

2023-08-02 Thread Andrey Lepikhov

On 2/8/2023 13:40, Yuya Watari wrote:

> As seen from the above, verifying iteration results was the cause of
> the performance degradation. I agree that we should avoid such
> degradation because it negatively affects the development of
> PostgreSQL. Removing the verification when committing this patch is
> one possible option.

You introduced list_ptr_cmp as an extern function of List, but use it
only under the USE_ASSERT_CHECKING ifdef.

Maybe hide it under USE_ASSERT_CHECKING as well, or remove the whole
thing?

--
regards,
Andrey Lepikhov
Postgres Professional





Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-02 Thread Melih Mutlu
Hi,

Amit Kapila wrote on Wed, 2 Aug 2023 at 12:01:

> I think we are getting the error (ERROR:  could not find logical
> decoding starting point) because we wouldn't have waited for WAL to
> become available before reading it. It could happen due to the
> following code:
> WalSndWaitForWal()
> {
> ...
> if (streamingDoneReceiving && streamingDoneSending &&
> !pq_is_send_pending())
> break;
> ..
> }
>
> Now, it seems that in 0003 patch, instead of resetting flags
> streamingDoneSending, and streamingDoneReceiving before start
> replication, we should reset before create logical slots because we
> need to read the WAL during that time as well to find the consistent
> point.
>

Thanks for the suggestion Amit. I've been looking into this recently and
couldn't figure out the cause until now.
I quickly made the fix in 0003. Seems like it resolved the "could not find
logical decoding starting point" errors.

vignesh C wrote on Tue, 1 Aug 2023 at 09:32:

> I agree that the "no copy in progress" issue has nothing to do with the
> 0001 patch. This issue is present with the 0002 patch.
> In the case when the tablesync worker has to apply the transactions
> after the table is synced, the tablesync worker sends the feedback of
> writepos, applypos and flushpos which results in "No copy in progress"
> error as the stream has ended already. Fixed it by exiting the
> streaming loop if the tablesync worker is done with the
> synchronization. The attached 0004 patch has the changes for the same.
> The rest of v22 patches are the same patch that were posted by Melih
> in the earlier mail.


Thanks for the fix. I placed it into 0002 with a slight change as follows:

- send_feedback(last_received, false, false);
> + if (!MyLogicalRepWorker->relsync_completed)
> + send_feedback(last_received, false, false);


IMHO relsync_completed means essentially the same as streaming_done;
that's why I wanted to check that flag instead of adding a goto statement.
Does that make sense to you as well?

Thanks,
-- 
Melih Mutlu
Microsoft


v23-0002-Reuse-Tablesync-Workers.patch
Description: Binary data


v23-0001-Refactor-to-split-Apply-and-Tablesync-Workers.patch
Description: Binary data


v23-0003-Reuse-connection-when-tablesync-workers-change-t.patch
Description: Binary data


Re: Support to define custom wait events for extensions

2023-08-02 Thread Masahiro Ikeda

On 2023-08-01 12:23, Andres Freund wrote:
> Hi,
>
> On 2023-08-01 12:14:49 +0900, Michael Paquier wrote:
>> On Tue, Aug 01, 2023 at 11:51:35AM +0900, Masahiro Ikeda wrote:
>>> Thanks for committing the main patch.
>>>
>>> In my understanding, the rest works are
>>> * to support WaitEventExtensionMultiple()
>>> * to replace WAIT_EVENT_EXTENSION to custom wait events
>>>
>>> Do someone already works for them? If not, I'll consider
>>> how to realize them.
>>
>> Note that postgres_fdw and dblink use WAIT_EVENT_EXTENSION, but have
>> no dependency to shared_preload_libraries.  Perhaps these could just
>> use a dynamic handling but that deserves a separate discussion because
>> of the fact that they'd need shared memory without being able to
>> request it.  autoprewarm.c is much simpler.
>
> This is why the scheme as implemented doesn't really make sense to me.
> It'd be much easier to use if we had a shared hashtable with the
> identifiers than what's been merged now.
>
> In plenty of cases it's not realistic for an extension library to run
> in each backend, while still needing to wait for things.

OK, I'll try to make a PoC patch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: multiple membership grants and information_schema.applicable_roles

2023-08-02 Thread Peter Eisentraut

On 24.07.23 08:42, Pavel Luzanov wrote:
>> I do see what seems like a different issue: the standard appears to
>> expect that indirect role grants should also be shown (via the
>> recursive CTE), and we are not doing that.
>
> I noticed this, but the view has stayed unchanged for such a long time.
> I thought it was done intentionally.

The implementation of the information_schema.applicable_roles view
predates both indirect role grants and recursive query support.  So some
updates might be in order.






Re: Adding a LogicalRepWorker type field

2023-08-02 Thread Amit Kapila
On Wed, Aug 2, 2023 at 2:46 PM Bharath Rupireddy
 wrote:
>
> On Wed, Aug 2, 2023 at 12:14 PM Peter Smith  wrote:
> >
> > We can't use the same names for both with/without-parameter functions
> > because there is no function overloading in C. In patch v3-0001 I've
> > replaced the "dual set of macros", with a single inline function of a
> > different name, and one set of space-saving macros.
> >
> > PSA v3
>
> Quick thoughts:
>
> 1.
> +typedef enum LogicalRepWorkerType
> +{
> +TYPE_UNKNOWN = 0,
> +TYPE_TABLESYNC_WORKER,
> +TYPE_APPLY_WORKER,
> +TYPE_PARALLEL_APPLY_WORKER
> +} LogicalRepWorkerType;
>
> -1 for these names starting with prefix TYPE_, in fact LR_ looked better.
>
> 2.
> +is_worker_type(LogicalRepWorker *worker, LogicalRepWorkerType type)
>
> This function name looks too generic; having an element of logical
> replication in the name would be better.
>
> 3.
> +#define IsLeaderApplyWorker() is_worker_type(MyLogicalRepWorker,
> TYPE_APPLY_WORKER)
> +#define IsParallelApplyWorker() is_worker_type(MyLogicalRepWorker,
> TYPE_PARALLEL_APPLY_WORKER)
> +#define IsTablesyncWorker() is_worker_type(MyLogicalRepWorker,
> TYPE_TABLESYNC_WORKER)
>
> My thoughts were around removing am_XXX_worker() and
> IsXXXWorker() functions and just have IsXXXWorker()
> alone with using IsXXXWorker(MyLogicalRepWorker) in places where
> am_XXX_worker() is used. If implementing this leads to something like
> the above with is_worker_type, -1 from me.
>

What I was suggesting to Peter to have something like:
static inline bool
am_tablesync_worker(void)
{
return (MyLogicalRepWorker->type == TYPE_TABLESYNC_WORKER);
}

-- 
With Regards,
Amit Kapila.




Re: add timing information to pg_upgrade

2023-08-02 Thread Peter Eisentraut

On 02.08.23 10:30, Bharath Rupireddy wrote:

Moreover, the ts command gives me the timestamps for each
of the messages printed, so an extra step is required to calculate the
time taken for an operation.


There is "ts -i" for that.





Re: Adding a LogicalRepWorker type field

2023-08-02 Thread Bharath Rupireddy
On Wed, Aug 2, 2023 at 12:14 PM Peter Smith  wrote:
>
> We can't use the same names for both with/without-parameter functions
> because there is no function overloading in C. In patch v3-0001 I've
> replaced the "dual set of macros", with a single inline function of a
> different name, and one set of space-saving macros.
>
> PSA v3

Quick thoughts:

1.
+typedef enum LogicalRepWorkerType
+{
+TYPE_UNKNOWN = 0,
+TYPE_TABLESYNC_WORKER,
+TYPE_APPLY_WORKER,
+TYPE_PARALLEL_APPLY_WORKER
+} LogicalRepWorkerType;

-1 for these names starting with prefix TYPE_, in fact LR_ looked better.

2.
+is_worker_type(LogicalRepWorker *worker, LogicalRepWorkerType type)

This function name looks too generic, an element of  logical
replication is better in the name.

3.
+#define IsLeaderApplyWorker() is_worker_type(MyLogicalRepWorker,
TYPE_APPLY_WORKER)
+#define IsParallelApplyWorker() is_worker_type(MyLogicalRepWorker,
TYPE_PARALLEL_APPLY_WORKER)
+#define IsTablesyncWorker() is_worker_type(MyLogicalRepWorker,
TYPE_TABLESYNC_WORKER)

My thoughts were around removing am_XXX_worker() and
IsXXXWorker() functions and just have IsXXXWorker()
alone with using IsXXXWorker(MyLogicalRepWorker) in places where
am_XXX_worker() is used. If implementing this leads to something like
the above with is_worker_type, -1 from me.

> Done. I converged everything to the macro-naming style as suggested,
> but I chose not to pass MyLogicalRepWorker as a parameter for the
> normal (am_xxx case) otherwise I felt it would make the calling code
> unnecessarily verbose.

With is_worker_type, the calling code now looks even more verbose. I
don't think IsLeaderApplyWorker(MyLogicalRepWorker) is a bad idea.
This unifies the multiple worker type functions into one and makes the
code simpler.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: PG_CATCH used without PG_RETHROW

2023-08-02 Thread 正华吕
Hi, I think this is a bug because this function's CATCH clause does not
restore the memory context. Thus, when leaving the function,
CurrentMemoryContext will be ErrorContext.

I see this part of the code has been refactored in the master branch, but
in REL_16_STABLE the code is still there.

The following SQL might lead to a PANIC (a reference to memory that was
just reset):

zlyu=# select version();
   version
-
 PostgreSQL 15.3 on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu
11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
(1 row)

zlyu=# select xml_is_well_formed('') or
   array_length(array_append(array[random(), random()], case when
xml_is_well_formed('') then random() else random() end), 1) > 1;
ERROR:  cache lookup failed for type 2139062143
zlyu=#



A simple fix is to backport the change from the master branch, or to
restore the context.
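The "restore the context" option would follow the usual pattern for a PG_CATCH block that does not rethrow (PostgreSQL-internal pseudocode, roughly the idiom the master-branch refactoring relies on; not runnable standalone):

```c
MemoryContext oldcxt = CurrentMemoryContext;

PG_TRY();
{
	/* call into libxml2 ... */
}
PG_CATCH();
{
	/*
	 * Without this, we leave the function with
	 * CurrentMemoryContext == ErrorContext.
	 */
	MemoryContextSwitchTo(oldcxt);
	FlushErrorState();
	/* ... treat the input as not well-formed ... */
}
PG_END_TRY();
```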

Tom Lane wrote on Wed, 2 Aug 2023 at 16:32:

> Greg Stark  writes:
> > My understanding is that PG_TRY/PG_CATCH doesn't save enough state to
> > avoid rethrowing errors and if you want to actually continue the
> > transaction you must use a subtransaction. As a result I was under the
> > impression it was mandatory to PG_RETHROW as a result.
>
> > If that's the case then I think I just came across a bug in
> > utils/adt/xml.c where there's no PG_RETHROW:
>
> The reason we think that's OK is that we assume libxml2 does not call back
> into the general backend code, so there is no PG state we'd have to undo.
>
> regards, tom lane
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hack...@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>
>


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-02 Thread Amit Kapila
On Tue, Aug 1, 2023 at 9:44 AM Peter Smith  wrote:
>
>
> FYI, here is some more information about ERRORs seen.
>
> The patches were re-tested -- applied in stages (and also against the
> different scripts) to identify where the problem was introduced. Below
> are the observations:
>
> ~~~
>
> Using original test scripts
>
> 1. Using only patch v21-0001
> - no errors
>
> 2. Using only patch v21-0001+0002
> - no errors
>
> 3. Using patch v21-0001+0002+0003
> - no errors
>
> ~~~
>
> Using the "busy loop" test scripts for long transactions
>
> 1. Using only patch v21-0001
> - no errors
>
> 2. Using only patch v21-0001+0002
> - gives errors for "no copy in progress issue"
> e.g. ERROR:  could not send data to WAL stream: no COPY in progress
>
> 3. Using patch v21-0001+0002+0003
> - gives the same "no copy in progress issue" errors as above
> e.g. ERROR:  could not send data to WAL stream: no COPY in progress
> - and also gives slot consistency point errors
> e.g. ERROR:  could not create replication slot
> "pg_16700_sync_16514_7261998170966054867": ERROR:  could not find
> logical decoding starting point
> e.g. LOG:  could not drop replication slot
> "pg_16700_sync_16454_7261998170966054867" on publisher: ERROR:
> replication slot "pg_16700_sync_16454_7261998170966054867" does not
> exist
>

I think we are getting the error (ERROR:  could not find logical
decoding starting point) because we wouldn't have waited for WAL to
become available before reading it. It could happen due to the
following code:
WalSndWaitForWal()
{
...
if (streamingDoneReceiving && streamingDoneSending &&
!pq_is_send_pending())
break;
..
}

Now, it seems that in 0003 patch, instead of resetting flags
streamingDoneSending, and streamingDoneReceiving before start
replication, we should reset before create logical slots because we
need to read the WAL during that time as well to find the consistent
point.

-- 
With Regards,
Amit Kapila.




Re: add timing information to pg_upgrade

2023-08-02 Thread Bharath Rupireddy
On Wed, Aug 2, 2023 at 12:44 PM Peter Eisentraut  wrote:
>
> On 01.08.23 18:00, Nathan Bossart wrote:
> > One of the main purposes of this thread is to gauge interest.  I'm hoping
> > there are other developers who are interested in reducing
> > pg_upgrade-related downtime, and it seemed like it'd be nice to have
> > built-in functionality for measuring the step times instead of requiring
> > folks to apply this patch every time.  And I think users might also be
> > interested in this information, if for no other reason than to help us
> > pinpoint which steps are taking longer for various workloads.
>
> If it's just for developers and expert users, perhaps existing
> shell-level functionality like "pg_upgrade | ts" would suffice?

Interesting. I had to install the moreutils package to get ts. And,
something like the ts command may or may not be available on all
of the messages printed, so an extra step is required to calculate the
time taken for an operation. I think it'd be better if pg_upgrade can
calculate the time taken for each operation, and I'm okay if it is an
opt-in feature with --verbose option.

[1]
Aug 02 07:44:17 Sync data directory to disk ok
Aug 02 07:44:17 Creating script to delete old cluster   ok
Aug 02 07:44:17 Checking for extension updates  ok

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: cataloguing NOT NULL constraints

2023-08-02 Thread Peter Eisentraut

On 24.07.23 12:32, Alvaro Herrera wrote:

However, 11.16 ( as part of 11.12 ), says that DROP NOT NULL causes the indication of
the column as NOT NULL to be removed.  This, to me, says that if you do
have multiple such constraints, you'd better remove them all with that
command.  All in all, I lean towards allowing just one as best as we
can.


Another clue is in 11.15, which says

1) Let C be the column identified by the  CN in the
containing . If the column descriptor of C
does not contain an indication that C is defined as NOT NULL, then:

[do things]

Otherwise it does nothing.  So there can only be one such constraint per 
table.






RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-02 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thank you for giving comments! PSA new version patchset.

> 1. Do we really need 0001 patch after the latest change proposed by
> Vignesh in the 0004 patch?

I removed the 0001 patch and revived an old patch which serializes slots at
shutdown. This is because the problem that slots are not serialized to disk
still remains [1], and confirmed_flush can fall behind even if we implement
the approach.

> 2.
> + if (dopt.logical_slots_only)
> + {
> + if (!dopt.binary_upgrade)
> + pg_fatal("options --logical-replication-slots-only requires option
> --binary-upgrade");
> +
> + if (dopt.dataOnly)
> + pg_fatal("options --logical-replication-slots-only and
> -a/--data-only cannot be used together");
> +
> + if (dopt.schemaOnly)
> + pg_fatal("options --logical-replication-slots-only and
> -s/--schema-only cannot be used together");
> 
> Can you please explain why the patch imposes these restrictions? I
> guess the binary_upgrade is because you want this option to be used
> for the upgrade. Do we want to avoid giving any other option with
> logical_slots, if so, are the above checks sufficient and why?

Regarding --binary-upgrade, the motivation is the same as you expected. I hid
the --logical-replication-slots-only option from users, so it should not be
used for anything other than an upgrade. Additionally, the option is not shown
in the help output or the documentation.

As for the -{data|schema}-only options, I removed the restrictions.
I initially excluded them because the combination could be confusing - as
discussed at [2], slots must be dumped after pg_resetwal is done, and at that
point all the definitions are already dumped. To avoid duplicated definitions,
we must ensure that only slots are written to the output file. I thought this
requirement contradicted the descriptions of these options ("dump only A,
not B").
But after considering it more, I concluded the restriction is not needed
because the option is not exposed to users - no one would be confused by
combining them.
(The restriction for -c is also removed for the same reason.)

> 3.
> + /*
> + * Get replication slots.
> + *
> + * XXX: Which information must be extracted from old node? Currently three
> + * attributes are extracted because they are used by
> + * pg_create_logical_replication_slot().
> + */
> + appendPQExpBufferStr(query,
> + "SELECT slot_name, plugin, two_phase "
> + "FROM pg_catalog.pg_replication_slots "
> + "WHERE database = current_database() AND temporary = false "
> + "AND wal_status IN ('reserved', 'extended');");
> 
> Why are we ignoring the slots that have wal status as WALAVAIL_REMOVED
> or WALAVAIL_UNRESERVED? I think the slots where wal status is
> WALAVAIL_REMOVED, the corresponding slots are invalidated at some
> point. I think such slots can't be used for decoding but these will be
> dropped along with the subscription or when a user does it manually.
> So, if we don't copy such slots after the upgrade then there could be
> a problem in dropping the corresponding subscription. If we don't want
> to copy over such slots then we need to provide instructions on what
> users should do in such cases. OTOH, if we want to copy over such
> slots then we need to find a way to invalidate such slots after copy.
> Either way, this needs more analysis.

I considered this again. At least WALAVAIL_UNRESERVED should be supported
because such a slot is still usable; it can return to 'reserved' or 'extended'.

As for WALAVAIL_REMOVED, I don't think such slots should be copied, so I added
a description of that to the document.

This feature re-creates slots which have the same name/plugin as the old ones;
it does not replicate their state. So if we copy removed slots as-is, they
become usable again, and if subscribers refer to such a slot and then connect
again, the changes made while it was 'WALAVAIL_REMOVED' may be lost.

Based on the above, such slots would have to be copied in the WALAVAIL_REMOVED
state, but as you said, we do not have a way to control that: the status is
calculated from restart_lsn, and there is no function to modify it directly.

One approach is to add an SQL function which sets restart_lsn to an arbitrary
value (or 0/0, invalid), but that seems dangerous.

> 4.
> + /*
> + * Check that all logical replication slots have reached the current WAL
> + * position.
> + */
> + res = executeQueryOrDie(conn,
> + "SELECT slot_name FROM pg_catalog.pg_replication_slots "
> + "WHERE (SELECT count(record_type) "
> + " FROM pg_catalog.pg_get_wal_records_content(confirmed_flush_lsn,
> pg_catalog.pg_current_wal_insert_lsn()) "
> + " WHERE record_type != 'CHECKPOINT_SHUTDOWN') <> 0 "
> + "AND temporary = false AND wal_status IN ('reserved', 'extended');");
> 
> I think this can unnecessarily lead to reading a lot of WAL data if
> the confirmed_flush_lsn for a slot is too much behind. Can we think of
> improving this by passing the number of records to read which in this
> case should be 1?

I checked, and pg_wal_record_info() seemed suitable for the purpose. I tried
to move the functionality to core.

But this function raises an ERROR when there is no valid record after the
specified

RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-02 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thank you for making the PoC!

> Here is a patch which checks that there are no WAL records other than
> CHECKPOINT_SHUTDOWN WAL record to be consumed based on the discussion
> from [1].

Basically, I agree with your approach. Thanks!

> Patch 0001 and 0002 is same as the patch posted by Kuroda-san, Patch
> 0003 exposes pg_get_wal_records_content to get the WAL records along
> with the WAL record type between start and end lsn. pg_walinspect
> contrib module already exposes a function for this requirement, I have
> moved this functionality to be exposed from the backend. Patch 0004
> has slight change in check function to check that there are no other
> records other than CHECKPOINT_SHUTDOWN to be consumed. The attached
> patch has the changes for the same.
> Thoughts?
> 
> [1] -
> https://www.postgresql.org/message-id/CAA4eK1Kem-J5NM7GJCgyKP84pEN6
> RsG6JWo%3D6pSn1E%2BiexL1Fw%40mail.gmail.com

Few comments:

* Per the comment from Amit [1], I used pg_get_wal_record_info() instead of
pg_get_wal_records_info(). This function extracts the next available WAL
record, which avoids a huge scan if confirmed_flush is far behind.
* According to cfbot and my analysis, 0001 cannot pass the test on macOS,
so I revived Julien's patch [2] as 0002 for now. AFAICS, 0001 is not so
welcome.

Next patch will be available soon.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1LWKkoyy-p-SAT0JTWa%3D6kXiMd%3Da6ZcArY9eU4a3g4TZg%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/20230414061248.vdsxz2febjo3re6h%40jrouhaud

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: [PATCH] [zh_CN.po] fix a typo in simplified Chinese translation file

2023-08-02 Thread Peter Eisentraut

On 01.08.23 10:09, Junwang Zhao wrote:

add the missing leading `l` for log_statement_sample_rate


I have committed this fix to the translations repository.






Re: pgbnech: allow to cancel queries during benchmark

2023-08-02 Thread Yugo NAGATA
Hello Fabien,

On Fri, 14 Jul 2023 20:32:01 +0900
Yugo NAGATA  wrote:

I attached the updated patch.

> Hello Fabien,
> 
> Thank you for your review!
> 
> On Mon, 3 Jul 2023 20:39:23 +0200 (CEST)
> Fabien COELHO  wrote:
> 
> > 
> > Yugo-san,
> > 
> > Some feedback about v1 of this patch.
> > 
> > Patch applies cleanly, compiles.
> > 
> > There are no tests, could there be one? ISTM that one could be done with a 
> > "SELECT pg_sleep(...)" script??
> 
> Agreed. I will add the test.

I added a TAP test.

> 
> > The global name "all_state" is quite ambiguous, what about "client_states" 
> > instead? Or maybe it could be avoided, see below.
> > 
> > Instead of renaming "state" to "all_state" (or client_states as suggested 
> > above), I'd suggest to minimize the patch by letting "state" inside the 
> > main and adding a "client_states = state;" just after the allocation, or 
> > another approach, see below.
> 
> Ok, I'll fix to add a global variable "client_states" and make this point to
> "state" instead of changing "state" to global.

Done.

>  
> > Should PQfreeCancel be called on deconnections, in finishCon? I think that 
> > there may be a memory leak with the current implementation??
> 
> Agreed. I'll fix.

Done.

Regards,
Yugo Nagata

>  
> > Maybe it should check that cancel is not NULL before calling PQcancel?
> 
> I think this is already checked as below, but am I missing something?
> 
> +   if (all_state[i].cancel != NULL)
> +   (void) PQcancel(all_state[i].cancel, errbuf, sizeof(errbuf));
> 
> > After looking at the code, I'm very unclear whether they may be some 
> > underlying race conditions, or not, depending on when the cancel is 
> > triggered. I think that some race conditions are still likely given the 
> > current thread 0 implementation, and dealing with them with a barrier or 
> > whatever is not desirable at all.
> > 
> > In order to work around this issue, ISTM that we should go away from the 
> > simple and straightforward thread 0 approach, and the only clean way is 
> > that the cancelation should be managed by each thread for its own client.
> > 
> > I'd suggest to have the advanceState to call PQcancel when CancelRequested 
> > is set and switch to CSTATE_ABORTED to end properly. This means that there 
> > would be no need for the global client states pointer, so the patch should 
> > be smaller and simpler. Possibly there would be some shortcuts added here 
> > and there to avoid lingering after the control-C, in threadRun.
> 
> I am not sure this approach is simpler than mine. 
> 
> In multi-threads, only one thread can catches the signal and other threads
> continue to run. Therefore, if Ctrl+C is pressed while threads are waiting
> responses from the backend in wait_on_socket_set, only one thread can be
> interrupted and return, but other threads will continue to wait and cannot
> check CancelRequested. So, for implementing your suggestion, we need any hack
> to make all threads return from wait_on_socket_set when the event occurs, but
> I don't have idea to do this in simpler way. 
> 
> In my patch, all threads can return from wait_on_socket_set at Ctrl+C
> because when thread #0 cancels all connections, the following error is
> sent to all sessions:
> 
>  ERROR:  canceling statement due to user request
> 
> and all threads will receive the response from the backend.
> 
> Regards,
> Yugo Nagata
> 
> -- 
> Yugo NAGATA 
> 
> 


-- 
Yugo NAGATA 




Re: add timing information to pg_upgrade

2023-08-02 Thread Bharath Rupireddy
On Wed, Aug 2, 2023 at 12:45 PM Peter Eisentraut  wrote:
>
> On 01.08.23 17:45, Nathan Bossart wrote:
> > The message is too long, so there's no space between it and the "ok"
> > message:
> >
> >   Checking for incompatible "aclitem" data type in user tablesok
> >
> > Instead of altering the messages, we could bump MESSAGE_WIDTH from 60 to
> > 62 or 64.  Do you prefer that approach?
>
> I think we should change the output format to be more like initdb, like
>
>  Doing something ... ok
>
> without horizontally aligning all the "ok"s.

While this looks simple, we might end up with a large diff after removing
MESSAGE_WIDTH; a significant part of the pg_upgrade code deals with it, and
I don't think it's worth the effort. Therefore, I'd prefer the simplest
possible fix - change the message to '"Checking for  \"aclitem\" data type
in user tables"'. It may be overkill, but we could also consider adding
Assert(sizeof(message) < MESSAGE_WIDTH) to the progress report functions so
that new messages don't run into the same formatting issue.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: add timing information to pg_upgrade

2023-08-02 Thread Peter Eisentraut

On 01.08.23 17:45, Nathan Bossart wrote:

The message is too long, so there's no space between it and the "ok"
message:

Checking for incompatible "aclitem" data type in user tablesok

Instead of altering the messages, we could bump MESSAGE_WIDTH from 60 to
62 or 64.  Do you prefer that approach?


I think we should change the output format to be more like initdb, like

Doing something ... ok

without horizontally aligning all the "ok"s.





Re: add timing information to pg_upgrade

2023-08-02 Thread Peter Eisentraut

On 01.08.23 18:00, Nathan Bossart wrote:

One of the main purposes of this thread is to gauge interest.  I'm hoping
there are other developers who are interested in reducing
pg_upgrade-related downtime, and it seemed like it'd be nice to have
built-in functionality for measuring the step times instead of requiring
folks to apply this patch every time.  And I think users might also be
interested in this information, if for no other reason than to help us
pinpoint which steps are taking longer for various workloads.


If it's just for developers and expert users, perhaps existing 
shell-level functionality like "pg_upgrade | ts" would suffice?





Re: Adding a LogicalRepWorker type field

2023-08-02 Thread Peter Smith
On Wed, Aug 2, 2023 at 1:00 PM Amit Kapila  wrote:
>
> On Wed, Aug 2, 2023 at 8:10 AM Peter Smith  wrote:
> >
> >
> > The am_xxx functions are removed now in the v2-0001 patch. See [1].
> >
> > The replacement set of macros (the ones with no arg) are not strictly
> > necessary, except I felt it would make the code unnecessarily verbose
> > if we insist to pass MyLogicalRepWorker everywhere from the callers in
> > worker.c / tablesync.c / applyparallelworker.c.
> >
>
> Agreed but having a dual set of macros is also not clean. Can we
> provide only a unique set of inline functions instead?
>

We can't use the same names for both with/without-parameter functions
because there is no function overloading in C. In patch v3-0001 I've
replaced the "dual set of macros", with a single inline function of a
different name, and one set of space-saving macros.

PSA v3

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v3-0001-Add-LogicalRepWorkerType-enum.patch
Description: Binary data


v3-0002-Switch-on-worker-type.patch
Description: Binary data


Re: [PoC] Reducing planning time when tables have many partitions

2023-08-02 Thread Yuya Watari
Hello,

I really appreciate sharing very useful scripts and benchmarking results.

On Fri, Jul 28, 2023 at 6:51 PM Ashutosh Bapat
 wrote:
> Given that most of the developers run assert enabled builds it would
> be good to bring down the degradation there while keeping the
> excellent speedup in non-assert builds.

From my observation, this degradation in assert-enabled builds is
caused by verifying the iteration results of EquivalenceMembers. My
patch uses Bitmapset-based indexes to speed up the iteration. When
assertions are enabled, we verify that the result of the iteration is
the same with and without the indexes. This verification results in
executing a similar loop three times, which causes the degradation. I
measured planning time by using your script without this verification.
The results are as follows:

Master: 144.55 ms
Patched (v19): 529.85 ms
Patched (v19) without verification: 78.84 ms
(*) All runs are with assertions.

As seen from the above, verifying iteration results was the cause of
the performance degradation. I agree that we should avoid such
degradation because it negatively affects the development of
PostgreSQL. Removing the verification when committing this patch is
one possible option.

> Queries on partitioned tables eat a lot of memory anyways, increasing
> that further should be avoided.
>
> I have not studied the patches. But I think the memory increase has to
> do with our Bitmapset structure. It's space inefficient when there are
> thousands of partitions involved. See my comment at [2]

Thank you for pointing this out. I have never considered the memory
usage impact of this patch. As you say, the Bitmapset structure caused
this increase. I will try to look into this further.

-- 
Best regards,
Yuya Watari




Re: Adding a LogicalRepWorker type field

2023-08-02 Thread Peter Smith
On Wed, Aug 2, 2023 at 3:35 PM Masahiko Sawada  wrote:
>
> On Mon, Jul 31, 2023 at 10:47 AM Peter Smith  wrote:
> >
> > Hi hackers,
> >
> > BACKGROUND:
> >
> > The logical replication has different worker "types" (e.g. apply
> > leader, apply parallel, tablesync).
> >
> > They all use a common structure called LogicalRepWorker, but at times
> > it is necessary to know what "type" of worker a given LogicalRepWorker
> > represents.
> >
> > Once, there were just apply workers and tablesync workers - these were
> > easily distinguished because only tablesync workers had a valid
> > 'relid' field. Next, parallel-apply workers were introduced - these
> > are distinguishable from the apply leaders by the value of
> > 'leader_pid' field.
> >
> >
> > PROBLEM:
> >
> > IMO, deducing the worker's type by examining multiple different field
> > values seems a dubious way to do it. This maybe was reasonable enough
> > when there were only 2 types, but as more get added it becomes
> > increasingly complicated.
> >
> > SOLUTION:
> >
> > AFAIK none of the complications is necessary anyway; the worker type
> > is already known at launch time, and it never changes during the life
> > of the process.
> >
> > The attached Patch 0001 introduces a new enum 'type' field, which is
> > assigned during the launch.
> >
> > ~
> >
> > This change not only simplifies the code, but it also permits other
> > code optimizations, because now we can switch on the worker enum type,
> > instead of using cascading if/else.  (see Patch 0002).
> >
> > Thoughts?
>
> I can see the problem you stated and it's true that the worker's type
> never changes during its lifetime. But I'm not sure we really need to
> add a new variable to LogicalRepWorker since we already have enough
> information there to distinguish the worker's type.
>
> Currently, we deduce the worker's type by checking some fields that
> never change such as relid and leader_piid. It's fine to me as long as
> we encapcel the deduction of worker's type by macros (or inline
> functions). Even with the proposed patch (0001 patch), we still need a
> similar logic to determine the worker's type.

Thanks for the feedback.

I agree that the calling code will not look very different
with/without this patch 0001, because everything is hidden by the
macros/functions. But those HEAD macros/functions are also hiding the
inefficiencies of the type deduction -- e.g. IMO it is quite strange
that we can only recognize the worker is an "apply worker" by
eliminating the other 2 possibilities.

As further background, the idea for this patch came from considering
another thread to "re-use" workers [1]. For example, when thinking
about re-using tablesync workers the HEAD am_tablesync_worker()
function begins to fall apart -- ie. it does not let you sensibly
disassociate a tablesync worker from its relid (which you might want
to do if tablesync workers were "pooled") because in the HEAD code any
tablesync worker without a relid is (by definition) NOT recognized as
a tablesync worker anymore!

Those above issues just disappear when a type field is added.

>
> If we want to change both process_syncing_tables() and
> should_apply_changes_for_rel() to use the switch statement instead of
> using cascading if/else, I think we can have a function, say
> LogicalRepWorkerType(), that returns LogicalRepWorkerType of the given
> worker.
>

The point of that "switch" in patch 0002 was to demonstrate that with
a worker-type enum we can write more *efficient* code in some places
than the current cascading if/else. I think making a "switch" just for
the sake of it, by adding more functions that hide yet more work, is
going in the opposite direction.

--
[1] reuse workers -
https://www.postgresql.org/message-id/flat/CAGPVpCTq%3DrUDd4JUdaRc1XUWf4BrH2gdSNf3rtOMUGj9rPpfzQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Faster "SET search_path"

2023-08-02 Thread Jeff Davis
On Wed, 2023-08-02 at 01:14 -0400, Isaac Morland wrote:
> I don’t think the fact that an optimization might suddenly not work
> in a certain situation is a reason not to optimize. What would our
> query planner look like if we took that approach?

...

> Instead, we should try to find ways of making the performance more
> transparent. We already have some features for this, but maybe more
> can be done.

Exactly; for the planner we have EXPLAIN. We could try to expose GUC
switching costs that way.

Or maybe users can start thinking of search_path as a performance-
related setting to be careful with.

Or we could try harder to optimize the switching costs so that it's not
so bad. I just started and I already saw some nice speedups; with a few
of the easy fixes done, the remaining opportunities should be more
visible in the profiler.

Regards,
Jeff Davis




Re: Inaccurate comments in ReorderBufferCheckMemoryLimit()

2023-08-02 Thread Masahiko Sawada
On Wed, Aug 2, 2023 at 11:21 AM Amit Kapila  wrote:
>
> On Tue, Aug 1, 2023 at 2:06 PM Masahiko Sawada  wrote:
> >
> > On Tue, Aug 1, 2023 at 11:33 AM Amit Kapila  wrote:
> > >
> > > On Mon, Jul 31, 2023 at 8:46 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > While reading the code, I realized that the following code comments
> > > > might not be accurate:
> > > >
> > > > /*
> > > >  * Pick the largest transaction (or subtransaction) and evict 
> > > > it from
> > > >  * memory by streaming, if possible.  Otherwise, spill to disk.
> > > >  */
> > > > if (ReorderBufferCanStartStreaming(rb) &&
> > > > (txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
> > > > {
> > > > /* we know there has to be one, because the size is not 
> > > > zero */
> > > > Assert(txn && rbtxn_is_toptxn(txn));
> > > > Assert(txn->total_size > 0);
> > > > Assert(rb->size >= txn->total_size);
> > > >
> > > > ReorderBufferStreamTXN(rb, txn);
> > > > }
> > > >
> > > > AFAICS since ReorderBufferLargestStreamableTopTXN() returns only
> > > > top-level transactions, the comment above the if statement is not
> > > > right. It would not pick a subtransaction.
> > > >
> > >
> > > I think the subtransaction case is for the spill-to-disk case as both
> > > cases are explained in the same comment.
> > >
> > > > Also, I'm not sure that the second comment "we know there has to be
> > > > one, because the size is not zero" is right since there might not be
> > > > top-transactions that are streamable.
> > > >
> > >
> > > I think this comment is probably referring to asserts related to the
> > > size similar to spill to disk case.
> > >
> > > How about if we just remove (or subtransaction) from the following
> > > comment: "Pick the largest transaction (or subtransaction) and evict
> > > it from memory by streaming, if possible.  Otherwise, spill to disk."?
> > > Then by referring to streaming/spill-to-disk cases, one can understand
> > > in which cases only top-level xacts are involved and in which cases
> > > both are involved.
> >
> > Sounds good. I've updated the patch accordingly.
> >
>
> LGTM.

Thank you for reviewing the patch! Pushed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Extract numeric filed in JSONB more effectively

2023-08-02 Thread jian he
On Tue, Aug 1, 2023 at 12:39 PM Andy Fan  wrote:
>
> Hi:
>
> Currently if we want to extract a numeric field in jsonb, we need to use
> the following expression:  cast (a->>'a' as numeric). It will turn a numeric
> to text first and then turn the text to numeric again. See
> jsonb_object_field_text and JsonbValueAsText.  However the binary format
> of numeric in JSONB is compatible with the numeric in SQL, so I think we
> can have an operator to extract the numeric directly. If the value of a given
> field is not a numeric data type, an error will be raised, this can be
> documented.
>
> In this patch, I added a new operator for this purpose, here is the
> performance gain because of this.
>
> create table tb (a jsonb);
> insert into tb select '{"a": 1}'::jsonb from generate_series(1, 10)i;
>
> current method:
> select count(*) from tb where cast (a->>'a' as numeric) = 2;
> 167ms.
>
> new method:
> select count(*) from tb where a@->'a' = 2;
> 65ms.
>
> Is this the right way to go? Testcase, document and catalog version are
> updated.
>
>
> --
> Best Regards
> Andy Fan


return PointerGetDatum(v->val.numeric);
should be something like
PG_RETURN_NUMERIC(v->val.numeric);
?