Re: Avoid streaming the transaction which are skipped (in corner cases)

2022-11-25 Thread Amit Kapila
On Fri, Nov 25, 2022 at 5:38 PM Amit Kapila  wrote:
>
> On Fri, Nov 25, 2022 at 1:35 PM Dilip Kumar  wrote:
> >
> > During DecodeCommit() for skipping a transaction we use ReadRecPtr to
> > check whether to skip this transaction or not.  Whereas in
> > ReorderBufferCanStartStreaming() we use EndRecPtr to check whether to
> > stream or not. Generally it will not create a problem but if the
> > commit record itself is adding some changes to the transaction (e.g.
> > snapshot) and if the "start_decoding_at" is in between ReadRecPtr and
> > EndRecPtr then streaming will decide to stream the transaction, whereas
> > DecodeCommit will decide to skip it.  And for handling this case in
> > ReorderBufferForget() we call stream_abort().
> >
>
> The other cases are probably where we don't have FilterByOrigin or
> dbid check, for example, XLOG_HEAP2_NEW_CID/XLOG_XACT_INVALIDATIONS.
> Anyway, we don't actually send anything for such cases except empty
> start/stop messages. Can we add some flag to txn which says that there
> is at least one change like DML that we want to stream?
>

We can probably think of using txn_flags for this purpose.

> Then we can
> use that flag to decide whether to stream or not.
>

The other possibility here is to use ReorderBufferTXN's base_snapshot.
Normally, we don't stream unless the base_snapshot is set. However,
there are a few challenges (a) the base_snapshot is set in
SnapBuildProcessChange which is called before DecodeInsert, and
similar APIs. So, now even if we filter out the insert due to origin or
db_id, the base_snapshot will be set. (b) we currently set it to
execute invalidations even when there are no changes to send to
downstream.

For (a), I guess we can split SnapBuildProcessChange, such that the
part where we set the base snapshot will be done after DecodeInsert
and similar APIs decide to queue the change. I am less sure about
point (b); ideally, if we don't need a snapshot to execute
invalidations, I think we can avoid setting base_snapshot even in that
case. Then we can use it as a check for whether we want to stream
anything.

I feel this approach requires a lot more changes and is a bit riskier
compared to having a flag in txn_flags.

-- 
With Regards,
Amit Kapila.




Re: Avoid streaming the transaction which are skipped (in corner cases)

2022-11-25 Thread Dilip Kumar
On Fri, Nov 25, 2022 at 4:04 PM Ashutosh Bapat
 wrote:
>
> Excellent catch. We were looking at this code last week and wondered
> the purpose of this abort. Probably we should have some macro or
> function to decide whether to skip a transaction based on the log record.
> That will avoid using different values in different places.

We do have a common function, i.e. SnapBuildXactNeedsSkip(), but there
are two problems: 1) it has a dependency on the input parameter, so the
result may vary based on the input; 2) it checks only based on the LSN,
but there are other factors, dbid and originid, based on which a
transaction could also be skipped during DecodeCommit.  So I think one
possible solution could be to remember the dbid and originid in
ReorderBufferTXN as soon as we get the first change which has valid
values for these parameters.  And then, as you suggested, have a common
function that will be used by streaming as well as by DecodeCommit to
decide whether to skip or not.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Collation version tracking for macOS

2022-11-25 Thread Thomas Munro
On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro  wrote:
> On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis  wrote:
> > I'd vote for 1 on the grounds that it's easier to document and
> > understand a single collation version, which comes straight from
> > ucol_getVersion(). This approach makes it a separate problem to find
> > the collation version among whatever libraries the admin can provide;
> > but adding some observability into the search should mitigate any
> > confusion.
>
> OK, it sounds like I should code that up next.

Here's the first iteration.  The version rosetta stone functions look like this:

postgres=# select * from pg_icu_library_versions();
 icu_version | unicode_version | cldr_version
-+-+--
 67.1| 13.0| 37.0
 63.1| 11.0| 34.0
 57.1| 8.0 | 29.0
(3 rows)

postgres=# select * from pg_icu_collation_versions('zh');
 icu_version | uca_version | collator_version
-+-+--
 67.1| 13.0| 153.14.37
 63.1| 11.0| 153.88.34
 57.1| 8.0 | 153.64.29
(3 rows)
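
As an illustration of how these could be combined (a hypothetical query,
not part of the patch; it assumes the new function can be joined laterally
against pg_collation.colliculocale and that the versions compare as text),
one might cross-check the recorded collation versions against what the
available libraries provide:

-- Hypothetical cross-check: which available ICU library, if any, still
-- provides the collator version recorded for each ICU collation?
SELECT c.collname, c.collversion, v.icu_version
FROM pg_collation AS c
LEFT JOIN LATERAL pg_icu_collation_versions(c.colliculocale) AS v
       ON v.collator_version = c.collversion
WHERE c.collprovider = 'i';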

It's no longer necessary to put anything in PG_TEST_EXTRA to run
"meson test icu/020_multiversion" usefully.  It will find extra ICU
versions all by itself in your system library search path and SKIP if
it doesn't find a second major version.  I have tried to cover the
main scenarios that I expect users to encounter in the updated TAP
tests, with commentary that I hope will be helpful to assess the
usability of this thing.

Other changes:

* now using RTLD_LOCAL instead of RTLD_GLOBAL (I guess the latter
might cause trouble for someone using --disable-renaming, but I
haven't tested that and am not an expert on linker/loader arcana)
* fixed library names on Windows (based on reading the manual, but I
haven't tested that)
* fixed failure on non-ICU builds (the reason CI was failing in v7,
some misplaced #ifdefs)
* various cleanup
* I've attached a throwaway patch to install a second ICU version on
Debian/amd64 on CI, since otherwise the new test would SKIP on all
systems

This is just a first cut, but enough to try out and see if we like it,
what needs to be improved, what edge cases we haven't thought about
etc.  Let me know what you think.
From 0d96bfbec02245ddce6c985250ff0f8d38e41df9 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v8 1/2] WIP: Multi-version ICU.

Add a layer of indirection when accessing ICU, so that multiple major
versions of the library can be used at once.  Versions other than the
one that PostgreSQL was linked against are opened with dlopen(), but we
refuse to open versions higher than the one we were compiled against.
The ABI might change in future releases, so that wouldn't be safe.

Whenever creating a DATABASE or COLLATION object that uses ICU, we'll
use the "default" ICU library and record its ucol_getVersion() in the
catalog.  That's usually the one we're linked against but another can be
selected with the setting default_icu_version_library.

Whenever opening an existing DATABASE or COLLATION object that uses ICU,
we'll see the recorded [dat]collversion and try to find the ICU library
that provides that version.  If we can't, we'll fall back to using the
default ICU library with a warning that the user should either install
another ICU library version, or rebuild affected database objects and
REFRESH.

New GUCs:

icu_library_path

  A place to find ICU libraries, if not the default system library
  search path.

icu_library_versions

  A comma-separated list of ICU major or major.minor versions to make
  available to PostgreSQL, or * for every major version that can be
  found (the default).

default_icu_library_version

  The major or major.minor version to use for new objects and as a
  fallback (with warnings) if the right version can't be found.

Reviewed-by: Peter Eisentraut 
Reviewed-by: Jeff Davis 
Discussion: https://postgr.es/m/CA%2BhUKGL4VZRpP3CkjYQkv4RQ6pRYkPkSNgKSxFBwciECQ0mEuQ%40mail.gmail.com
---
 src/backend/access/hash/hashfunc.c|  16 +-
 src/backend/commands/collationcmds.c  |  20 +
 src/backend/utils/adt/formatting.c|  53 +-
 src/backend/utils/adt/pg_locale.c | 748 +-
 src/backend/utils/adt/varchar.c   |  16 +-
 src/backend/utils/adt/varlena.c   |  56 +-
 src/backend/utils/init/postinit.c |  34 +-
 src/backend/utils/misc/guc_tables.c   |  40 +-
 src/backend/utils/misc/postgresql.conf.sample |  10 +
 src/include/catalog/pg_proc.dat   |  23 +
 src/include/utils/pg_locale.h |  85 +-
 src/test/icu/meson.build  |   1 +
 src/test/icu/t/020_multiversion.pl| 274 +++
 src/tools/pgindent/typedefs.list  |   4 +
 14 files changed, 1275 insertions(+), 105 deletions(-)
 create mode 100644 

Re: [DOCS] Stats views and functions not in order?

2022-11-25 Thread David G. Johnston
On Wed, Nov 23, 2022 at 1:36 AM Peter Smith  wrote:

> On Thu, Nov 17, 2022 at 8:46 AM David G. Johnston
>  wrote:
>
> > Also, make it so each view ends up being its own separate page.
> >
>
> I did not do this. AFAIK those views of chapter 54 get rendered to
> separate pages only because they are top-level . So I do not
> know how to put all these stats views onto different pages without
> radically changing the document structure. Anyway – doing this would
> be incompatible with my  changes of patch 0006 (see above).
>
>
I did some experimentation and reading on this today.  Short answer - turn
each view into a refentry under a dedicated sect2 where the table resides.

David J.


[...]
 
[...]

Statistics Views
 Table of Statistics Views...

 
 
pg_stat_activityPurpose
 
  pg_stat_activity

  
   pg_stat_activity
  

 

 




I was doing quite a bit of experimentation and basically gutted the actual
page to make that easier.  The end result looked basically like below.

Chapter 28. Monitoring Database Activity

Table of Contents

28.1. Standard Unix Tools
28.2. The Cumulative Statistics System

28.2.1. Statistics Collection Configuration
28.2.2. Viewing Statistics
28.2.3. Statistics Views

A database administrator frequently wonders, “What is the system doing
right now?” This chapter discusses how to find that out.

Several tools are available for monitoring database activity and analyzing
performance. Most of this chapter is devoted to describing PostgreSQL's
cumulative statistics system, but one should not neglect regular Unix
monitoring programs such as ps, top, iostat, and vmstat. Also, once one has
identified a poorly-performing query, further investigation might be needed
using PostgreSQL's EXPLAIN command. Section 14.1 discusses EXPLAIN and
other methods for understanding the behavior of an individual query.

== Page for 28.2 (sect1) ==
28.2. The Cumulative Statistics System

28.2.1. Statistics Collection Configuration
28.2.2. Viewing Statistics
28.2.3. Statistics Views

PostgreSQL's cumulative statistics system supports collection and reporting
of information about server activity. Presently, accesses to tables and
indexes in both disk-block and individual-row terms are counted. The total
number of rows in each table, and information about vacuum and analyze
actions for each table are also counted. If enabled, calls to user-defined
functions and the total time spent in each one are counted as well.

PostgreSQL also supports reporting dynamic information about exactly what
is going on in the system right now, such as the exact command currently
being executed by other server processes, and which other connections exist
in the system. This facility is independent of the cumulative statistics
system.
28.2.1. Statistics Collection Configuration

Since collection of statistics adds some overhead to query execution, the
system can be configured to collect or not collect information. This is
controlled by configuration parameters that are normally set in
postgresql.conf. (See Chapter 20 for details about setting configuration
parameters.)

The parameter track_activities enables monitoring of the current command
being executed by any server process.

The parameter track_counts controls whether cumulative statistics are
collected about table and index accesses.

The parameter track_functions enables tracking of usage of user-defined
functions.

The parameter track_io_timing enables monitoring of block read and write
times.

The parameter track_wal_io_timing enables monitoring of WAL write times.

Normally these parameters are set in postgresql.conf so that they apply to
all server processes, but it is possible to turn them on or off in
individual sessions using the SET command. (To prevent ordinary users from
hiding their activity from the administrator, only superusers are allowed
to change these parameters with SET.)

Cumulative statistics are collected in shared memory. Every PostgreSQL
process collects statistics locally, then updates the shared data at
appropriate intervals. When a server, including a physical replica, shuts
down cleanly, a permanent copy of the statistics data is stored in the
pg_stat subdirectory, so that statistics can be retained across server
restarts. In contrast, when starting from an unclean shutdown (e.g., after
an immediate shutdown, a server crash, starting from a base backup, and
point-in-time recovery), all statistics counters are reset.
28.2.2. Viewing Statistics

test
28.2.3. Statistics Views

Table of Statistics Views...

===
file:///usr/local/pgsql/share/doc/html/monitoring-pg-stat-activity-view.html
=
(no ToC entry but the Next link in our footer does point to here)

pg_stat_activity

pg_stat_activity — Purpose
pg_stat_activity


The pg_stat_activity view will have one row per server process, showing
information related to the current activity of that process.

Here is an example of how wait events can be 

Re: Small miscellaneous fixes

2022-11-25 Thread Michael Paquier
On Fri, Nov 25, 2022 at 01:15:40PM +0100, Peter Eisentraut wrote:
> Is this something you want to follow up on, since you were involved in that
> patch?  Is the redundant assignment simply to be deleted, or do you want to
> check the original patch again for context?

Most of the changes of this thread have been applied as of c42cd05c.
Remains the SIGALRM business with sig_atomic_t, and I wanted to check
that by myself first.
--
Michael


signature.asc
Description: PGP signature


Re: Patch: Global Unique Index

2022-11-25 Thread David Zhang

Hi Bruce,

Thank you for helping review the patches in such detail.

On 2022-11-25 9:48 a.m., Bruce Momjian wrote:

> Looking at the patch, I am unclear how the patch prevents concurrent
> duplicate value insertion during the partitioned index checking.  I am
> actually not sure how that can be done without locking all indexes or
> inserting placeholder entries in all indexes.  (Yeah, that sounds bad,
> unless I am missing something.)


For the uniqueness check across all partitions, we tried to follow the
implementation of the uniqueness check on a single partition, and added a
loop to check uniqueness on the other partitions after the index tuple has
been inserted into the current index partition but before this index tuple
has been made visible. The uniqueness check will wait in `XactLockTableWait`
if there is a valid transaction in process, and performs the uniqueness
check again after the in-process transaction has finished.


We tried to simulate this duplicate value case with the steps below:

1) prepare the partitioned table,
CREATE TABLE gidx_part (a int, b int, c text) PARTITION BY RANGE (a);
CREATE TABLE gidx_part1 partition of gidx_part FOR VALUES FROM (0) TO (10);
CREATE TABLE gidx_part2 partition of gidx_part FOR VALUES FROM (10) TO (20);

2) hook up two psql consoles with gdb and set breakpoints after
_bt_doinsert


result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);

inside btinsert function in nbtree.c file.

3) first, execute `INSERT INTO gidx_part values(1, 1, 'test');` on 
console-1, and then execute `INSERT INTO gidx_part values(11, 1, 
'test');` on console-2 (expecting the duplicated value '1' in the 2nd
column to be detected),


The test result is that the console-2 query has to wait until console-1
either commits or aborts. If console-1 commits, then console-2 reports
that the duplicated value already exists; if console-1 aborts, then
console-2 reports a successful insert. If there is a deadlock, then the
session that detects the deadlock will error out to allow the other one
to continue.


I am not quite sure if this is a proper way to deal with a deadlock in
this case. I would be grateful if someone could help provide some
cases/methods to verify this cross-partition uniqueness.
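
For reference, a similar window can be opened without gdb by holding the
first transaction open (just a sketch of the scenario above; it assumes a
cross-partition unique index on column b has already been created using
the patch's syntax):

-- Session 1
BEGIN;
INSERT INTO gidx_part VALUES (1, 1, 'test');   -- not yet committed

-- Session 2: blocks in the cross-partition uniqueness check, waiting
-- on session 1's transaction (XactLockTableWait)
INSERT INTO gidx_part VALUES (11, 1, 'test');

-- Session 1
COMMIT;      -- session 2 now fails with a duplicate value error
-- ROLLBACK; -- alternatively, session 2's insert succeeds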


Best regards,

David


HighGo Software Canada
www.highgo.ca 





Re: MSVC vs Perl

2022-11-25 Thread Andres Freund
Hi,

On 2022-11-25 18:48:26 -0500, Andrew Dunstan wrote:
> I could download the installer from ActiveState, but they want me to
> sign up for an account before I do that, which I'm not happy about. And
> while I think our use probably comes within their license terms, IANAL
> and I'm not dead sure, whereas Strawberry is licensed under the usual
> perl license terms.

FWIW, my impression is that strawberry perl is of uh, dubious quality. It's
definitely rarely updated.  I think going with msys' ucrt perl might be the
best choice.

msys ucrt perl IIRC has the same issues mentioned in [1], but otherwise
worked.

I wonder if we ought to add a script that installs most of the windows build
dependencies from msys. I think there's a few where we can't use msys built
versions (due to too much gcc specific stuff ending up in headers), but IIRC
most things were usable. It's just too much work to have everyone do this
stuff manually.

Regards,

Andres

[1] 
https://www.postgresql.org/message-id/20220130221659.tlyr2lbw3wk22owg%40alap3.anarazel.de

Greetings,

Andres Freund




Re: MSVC vs Perl

2022-11-25 Thread Andrew Dunstan


On 2022-11-25 Fr 18:48, Andrew Dunstan wrote:
> For various reasons (see below) it's preferable to build on Windows with
> Strawberry Perl. This works OK if you're building with Msys2, I upgraded
> the instance on the machine that runs fairywren and drongo today, and
> fairywren seems fine. Not so drongo, however. First it encountered a
> build error, which I attempted to cure using something I found in the
> vim sources [1]. That worked, but then there's a failure at runtime like
> this:
>
> 2022-11-25 21:33:01.073 UTC [3764:3] pg_regress LOG:  statement: CREATE 
> EXTENSION IF NOT EXISTS "plperl"
> src/pl/plperl/Util.c: loadable library and perl binaries are mismatched (got 
> handshake key 12800080, needed 12900080)
> 2022-11-25 21:33:01.100 UTC [5048:5] LOG:  server process (PID 3764) exited 
> with exit code 1
>
> I don't know how to debug this. I'm not sure if there's some flag we
> could set which would cure it. It looks like a 1 bit difference, but I
> haven't found what that bit corresponds to.



I just saw Andres's post on -committers about this. Will see if that
helps me.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





MSVC vs Perl

2022-11-25 Thread Andrew Dunstan
For various reasons (see below) it's preferable to build on Windows with
Strawberry Perl. This works OK if you're building with Msys2, I upgraded
the instance on the machine that runs fairywren and drongo today, and
fairywren seems fine. Not so drongo, however. First it encountered a
build error, which I attempted to cure using something I found in the
vim sources [1]. That worked, but then there's a failure at runtime like
this:

2022-11-25 21:33:01.073 UTC [3764:3] pg_regress LOG:  statement: CREATE 
EXTENSION IF NOT EXISTS "plperl"
src/pl/plperl/Util.c: loadable library and perl binaries are mismatched (got 
handshake key 12800080, needed 12900080)
2022-11-25 21:33:01.100 UTC [5048:5] LOG:  server process (PID 3764) exited 
with exit code 1

I don't know how to debug this. I'm not sure if there's some flag we
could set which would cure it. It looks like a 1 bit difference, but I
haven't found what that bit corresponds to.

That leaves ActiveState Perl as the alternative. The chocolatey package
is no longer maintained, and the last maintained package is no longer
installable. The maintainer says [2]:

To everybody having problems with the ActivePerl installer. I'm
sorry, but I'm not maintaining the package anymore and the reason
for this is as follows. The Chocolatey moderation team requires,
that the download URL and checksum for the actual binary package are
static (for security reasons). However, nowadays ActiveState
provides the community edition download as a weekly build only. This
means that the checksum of the URL changes every week. (The
Chocolatey moderation team proposed that we would setup automation
that would update the version of the Chocolatey package weekly too.
While this would kind-of work, it would still mean that only the
latest version ever would work and every time when the version would
update there would be a short but annoying time window when even the
latest package would be broken. I think this is not the way to go.)
Thus I contacted ActiveState and asked for their support. They were
very friendly and promised to take over the maintenance of the
Chocolatey installer themselves and fix the problem. In my opinion
this is really the best solution and I hope that it will get fixed
soon by the new maintainers. So, for any further questions, it is
probably best to contact the ActiveState support directly.

However, nothing has actually been done along these lines AFAICT.

I could download the installer from ActiveState, but they want me to
sign up for an account before I do that, which I'm not happy about. And
while I think our use probably comes within their license terms, IANAL
and I'm not dead sure, whereas Strawberry is licensed under the usual
perl license terms.

The upshot of this is that I have disabled building drongo with perl for
now. There will possibly be some fallout with cross version upgrades,
which I will try to prevent.


cheers


andrew


[1]
https://git.postgresql.org/pg/commitdiff/171c7fffaa4a3f2b000f980ecb33c2f7441a9a03

[2] https://community.chocolatey.org/packages/ActivePerl#comment-5484577151

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Making autovacuum logs indicate if insert-based threshold was the triggering condition

2022-11-25 Thread Peter Geoghegan
On Sat, Aug 6, 2022 at 1:03 PM Peter Geoghegan  wrote:
> Attached patch teaches autovacuum.c to pass the information down to
> lazyvacuum.c, which includes the information in the autovacuum log.
> The approach I've taken is very similar to the existing approach with
> anti-wraparound autovacuum. It's pretty straightforward.

I have formally withdrawn this patch. I still think it's a good idea,
and I'm not abandoning the idea. The patch has just been superseded by
another patch of mine:

https://postgr.es/m/CAH2-Wz=hj-RCr6fOj_L3_0J1Ws8fOoxTQLmtM57gPc19beZz=q...@mail.gmail.com

This other patch has a much broader goal: it decouples
"antiwraparound-ness vs regular-ness" from the criteria that triggered
autovacuum (which includes a "table age" trigger criteria). Since the
other patch has to invent the whole concept of an autovacuum trigger
criteria (which it reports on via a line in autovacuum server log
reports), it seemed best to do everything together.

Thanks
-- 
Peter Geoghegan




Re: Introduce a new view for checkpointer related stats

2022-11-25 Thread Andres Freund
Hi,

On 2022-11-23 11:39:43 +0530, Bharath Rupireddy wrote:
> On Wed, Nov 23, 2022 at 2:23 AM Andres Freund  wrote:
> >
> > On 2022-11-22 18:08:28 +0530, Bharath Rupireddy wrote:
> > >
> > >  CREATE VIEW pg_stat_bgwriter AS
> > >  SELECT
> > > -pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
> > > -pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
> > > -pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
> > > -pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
> > > -pg_stat_get_bgwriter_buf_written_checkpoints() AS 
> > > buffers_checkpoint,
> > >  pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
> > >  pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> > > -pg_stat_get_buf_written_backend() AS buffers_backend,
> > > -pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> > >  pg_stat_get_buf_alloc() AS buffers_alloc,
> > >  pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
> >
> >
> > I think we should consider deprecating the pg_stat_bgwriter columns but
> > leaving them in place for a few years. New stuff should only be added to
> > pg_stat_checkpointer, but we don't need to break old monitoring queries.
> 
> May I know what it means to deprecate pg_stat_bgwriter columns?

Add a note to the docs saying that the columns will be removed.


> Are
> you suggesting to add deprecation warnings to corresponding functions
> pg_stat_get_bgwriter_buf_written_clean(),
> pg_stat_get_bgwriter_maxwritten_clean(), pg_stat_get_buf_alloc() and
> pg_stat_get_bgwriter_stat_reset_time() and in the docs?

I'm thinking of the checkpoint related columns in pg_stat_bgwriter.

If we move, rather than duplicate, the pg_stat_bgwriter columns to
pg_stat_checkpointer, everyone will have to update their monitoring scripts
when upgrading and will need to add version dependency if they monitor
multiple versions. If we instead keep the duplicated columns in
pg_stat_bgwriter for 5 years, users can rely on pg_stat_checkpointer in all
supported versions.

To be clear, it isn't a very heavy burden for users to make these
adjustments. But if it only costs us a few lines to keep the old columns for a
bit, that seems worth it.
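
For illustration only (not the actual definition from the patch, which may
well use new checkpointer-specific accessor functions), keeping the old
columns while adding the new view could look roughly like this, reusing
the functions shown in the quoted diff:

CREATE VIEW pg_stat_checkpointer AS
    SELECT
        pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
        pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
        pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
        pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
        pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint;

pg_stat_bgwriter would then keep its existing (now-deprecated) checkpoint
columns unchanged for a few releases, and a real pg_stat_checkpointer
would presumably also track its own stats_reset time.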


> And eventually do away with the bgwriter stats and the file
> pgstat_bgwriter.c? Aren't the bgwriter stats buf_written_clean,
> maxwritten_clean and buf_alloc useful?

Correct, I don't think we should remove those.

Greetings,

Andres Freund




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2022-11-25 Thread Peter Geoghegan
On Mon, Oct 24, 2022 at 9:00 AM Peter Geoghegan  wrote:
> Maybe it could be broken out into a separate "autovacuum triggered by:
> " line, seen only in the autovacuum log instrumentation (and not in
> the similar report output by a manual VACUUM VERBOSE). When we still
> end up "escalating" to an antiwraparound autovacuum, the "triggered
> by:" line would match whatever we'd show in the benign,
> non-cancellable-but-must-advance-relfrozenxid autovacuum case. The
> difference would be that we'd now be reporting on a different
> operation entirely (not just a regular autovacuum, a scary
> antiwraparound autovacuum).

Attached WIP patch invents the idea of a regular autovacuum that is
tasked with advancing relfrozenxid -- which is really just another
trigger criteria, reported on in the server log in its autovacuum
reports. Of course we retain the idea of antiwraparound autovacuums.
The only difference is that they are triggered when table age has
advanced by twice the usual amount, which is presumably only possible
because a regular autovacuum couldn't start or couldn't complete in
time (most likely due to continually being auto-cancelled).

As I said before, I think that the most important thing is to give
regular autovacuuming a chance to succeed. The exact approach taken
has a relatively large amount of slack, but that probably isn't
needed. So the new antiwraparound cutoffs were chosen because they're
easy to understand and remember, which is fairly arbitrary.

Adding this to the upcoming CF.

-- 
Peter Geoghegan


v1-0001-Add-table-age-trigger-concept-to-autovacuum.patch
Description: Binary data


Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-11-25 Thread Andres Freund
Hi,

On 2022-11-22 23:43:29 -0600, Justin Pryzby wrote:
> I think there may be a problem/deficiency with hint bits:
> 
> |postgres=# DROP TABLE u2; CREATE TABLE u2 AS SELECT 
> generate_series(1,99)a; SELECT pg_stat_reset_shared('io'); explain 
> (analyze,buffers) SELECT * FROM u2;
> |...
> | Seq Scan on u2  (cost=0.00..15708.75 rows=1128375 width=4) (actual 
> time=0.111..458.239 rows=99 loops=1)
> |   Buffers: shared hit=2048 read=2377 dirtied=2377 written=2345
> 
> |postgres=# SELECT COUNT(1), relname, COUNT(1) FILTER(WHERE isdirty) FROM 
> pg_buffercache b LEFT JOIN pg_class c ON 
> pg_relation_filenode(c.oid)=b.relfilenode GROUP BY 2 ORDER BY 1 DESC LIMIT 11;
> | count | relname      | count
> |-------+--------------+-------
> | 13619 |              |     0
> |  2080 | u2           |  2080
> |   104 | pg_attribute |     4
> |    71 | pg_statistic |     1
> |    51 | pg_class     |     1
> 
> It says that SELECT caused 2377 buffers to be dirtied, of which 2080 are
> associated with the new table in pg_buffercache.

Note that there's 2048 dirty buffers for u2 in shared_buffers before the
SELECT, despite the relation being 4425 blocks long, due to the CTAS using
BAS_BULKWRITE.


> |postgres=# SELECT * FROM pg_stat_io WHERE 
> backend_type!~'autovac|archiver|logger|standalone|startup|^wal|background 
> worker' or true ORDER BY 2;
> |backend_type | io_context  |   io_object   | read | written | 
> extended | op_bytes | evicted | reused | files_synced |  stats_reset
> |...
> | client backend  | bulkread| relation  | 2377 |2345 |
>   | 8192 |   0 |   2345 |  | 2022-11-22 22:32:33.044552-06
> 
> I think it's a known behavior that hint bits do not use the strategy
> ring buffer.  For BAS_BULKREAD, ring_size = 256kB (32, 8kB pages), but
> there's 2080 dirty pages in the buffercache (~16MB).

I don't think there's any "circumvention" of the ringbuffer here. There's 2048
buffers for u2 in s_b before, all dirty, there's 2080 after, also all
dirty. So the ringbuffer restricted the increase in shared buffers used for u2
to 2080-2048=32 additional buffers.

The reason hint bits don't prevent pages from being written out here is that a
BAS_BULKREAD strategy doesn't cause all buffer writes to be rejected, it just
causes buffer writes to be rejected when the page LSN would require a WAL
flush. And that's not typically the case when you just set a hint bit, unless
you use wal_log_hint_bits = true.

If I turn on wal_log_hints=true and add a CHECKPOINT after the CTAS I see 0
reuses (and 4425 dirty buffers), which is what I'd expect.
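
A rough way to reproduce that comparison, based on the statements quoted
above (the row count and the selected columns are assumptions, and
wal_log_hints requires a server restart to take effect):

-- with wal_log_hints = on
DROP TABLE IF EXISTS u2;
CREATE TABLE u2 AS SELECT generate_series(1, 999999) AS a;
CHECKPOINT;   -- so later hint-bit-only changes would need a WAL flush
SELECT pg_stat_reset_shared('io');
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM u2;
SELECT written, reused
FROM pg_stat_io
WHERE backend_type = 'client backend' AND io_context = 'bulkread';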


> But the IO view says that 2345 of the pages were "reused", which seems
> misleading to me.  Maybe that just follows from the behavior and the view is
> fine.  If the view is fine, maybe this case should still be specifically
> mentioned in the docs.

I think that's just confusing due to the reset. 2048 + 2345 = 4393, but we
only have 2080 buffers for u2 in s_b.

Greetings,

Andres Freund




Re: Support logical replication of DDLs

2022-11-25 Thread Zheng Li
Hello,

Thanks for the feedback.

> I have been following this patch for a long time.
> Recently, I started to try to test it. I found several bugs
> here and want to give you feedback.
>
> 1. CREATE TABLE LIKE
>   I found that this case may be replicated incorrectly.
>You can run the following SQL statement:
>```
>CREATE TABLE ctlt1 (a text CHECK (length(a) > 2) PRIMARY KEY, b text);
> ALTER TABLE ctlt1 ALTER COLUMN a SET STORAGE MAIN;
> ALTER TABLE ctlt1 ALTER COLUMN b SET STORAGE EXTERNAL;
> CREATE TABLE ctlt1_like (LIKE ctlt1 INCLUDING ALL);
>```
>The ctlt1_like table will not be replicated correctly.
>   I think this is because the CREATE TABLE LIKE statement is captured by
>   the event trigger as a CREATE TABLE statement and multiple ALTER TABLE
> statements.
>   There are some overlaps between them, and an error is reported when
> they are replicated downstream.

I looked into this case. The root cause is the statement

CREATE TABLE ctlt1_like (LIKE ctlt1 INCLUDING ALL);

is executed internally using 3 DDLs:
1. CREATE TABLE ctlt1_like (LIKE ctlt1 INCLUDING ALL); --The top level command
2. ALTER TABLE ctlt1_like ADD CONSTRAINT ctlt1_a_check CHECK
(length(a) > 2); --The first subcommand
3. CREATE UNIQUE INDEX ctlt1_like_pkey on ctlt1_like (a); --The second
subcommand that creates the primary key index

All three commands are captured by the event trigger. The first and
second commands end up getting deparsed, WAL-logged and
replayed on the subscriber. The replay of the ALTER TABLE command
causes a duplicate constraint error. The problem is that
while subcommands are captured by event triggers by default, they
don't need to be deparsed and WAL-logged for DDL replication.
To do that we can pass the isCompleteQuery variable in
ProcessUtilitySlow to EventTriggerCollectSimpleCommand() and
EventTriggerAlterTableEnd() and make this information available in
CollectedCommand so that any subcommands can be skipped.
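
Incidentally, the subcommand breakdown is easy to observe with a plain
ddl_command_end event trigger (an illustrative sketch using existing core
facilities, independent of the patch; the function and trigger names are
made up):

CREATE FUNCTION show_ddl_commands() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        RAISE NOTICE 'tag: %, object: %', r.command_tag, r.object_identity;
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER show_ddl_commands_trg
    ON ddl_command_end EXECUTE FUNCTION show_ddl_commands();

-- Re-running CREATE TABLE ctlt1_like (LIKE ctlt1 INCLUDING ALL) should
-- then emit one NOTICE per collected command, including the internally
-- generated ALTER TABLE / CREATE INDEX subcommands.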

Thoughts?
Zheng




Re: [PATCH] Simple code cleanup in tuplesort.c.

2022-11-25 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Removing "state->status = TSS_SORTEDINMEM" should be fine as it is already set
in sort_bounded_heap(state) a few lines earlier.

Cary Huang

HighGo Software Canada
www.highgo.ca

Re: Use fadvise in wal replay

2022-11-25 Thread Pavel Borisov
On Sat, 26 Nov 2022 at 01:10, Pavel Borisov  wrote:
>
> Hi, hackers!
>
> On Sun, 13 Nov 2022 at 02:02, Andrey Borodin  wrote:
> >
> > On Sun, Aug 7, 2022 at 9:41 AM Andrey Borodin  wrote:
> > >
> >
> > Hi everyone. The patch is 16 lines, looks harmless and with proven
> > benefits. I'm moving this into RfC.
>
> As I've written earlier in the thread, we cannot gain much from this
> optimization. Jakub's results show around a 2% difference:
>
> >baseline, master, default Linux readahead (128kb):
> >33.979, 0.478
> >35.137, 0.504
> >34.649, 0.518
>
> >master+patched, readahead disabled:
> >34.338, 0.528
> >34.568, 0.575
> >34.007, 1.136
>
> >master+patched, readahead enabled (as default):
> >33.935, 0.523
> >34.109, 0.501
> >33.408, 0.557
>
> On the other hand, the patch indeed is tiny and I don't think the
> proposed advise can ever make things bad. So, I've looked through the
> patch again and I agree it can be committed in the current state.

My mailer corrected my previous message. The right one is:
"On the other hand, the patch indeed is tiny and I don't think the
proposed _fadvise_ can ever make things bad".

Pavel.




Small miscellaneus fixes (Part II)

2022-11-25 Thread Ranier Vilela
 Hi.

Here are other assorted fixes for the head branch.

1. Avoid useless pointer increment
(src/backend/utils/activity/pgstat_shmem.c)
The pointer *p is used when creating the dsa memory,
not p + MAXALIGN(pgstat_dsa_init_size()).

2. Discard result unused (src/backend/access/transam/xlogrecovery.c)
Some compilers raise warnings about discarding return from strtoul.

3. Fix dead code (src/bin/pg_dump/pg_dump.c)
tbinfo->relkind == RELKIND_MATVIEW is always true, so "INDEX"
is never hit.
Per Coverity.

4. Fix dead code (src/backend/utils/adt/formatting.c)
Np->sign == '+' is different from "!= '-'" and different from "!= '+'",
so the else branch is never hit.
Per Coverity.

5. Use boolean operator with boolean operands
(src/backend/commands/tablecmds.c)

regards,
Ranier Vilela


avoid_useless_pointer_increment.patch
Description: Binary data


discard_strtoul_unused_result.patch
Description: Binary data


fix_dead_code_pg_dump.patch
Description: Binary data


fix_dead_code_formatting.patch
Description: Binary data


use_boolean_operator_with_boolean_operands.patch
Description: Binary data


Re: Use fadvise in wal replay

2022-11-25 Thread Pavel Borisov
Hi, hackers!

On Sun, 13 Nov 2022 at 02:02, Andrey Borodin  wrote:
>
> On Sun, Aug 7, 2022 at 9:41 AM Andrey Borodin  wrote:
> >
>
> Hi everyone. The patch is 16 lines, looks harmless and with proven
> benefits. I'm moving this into RfC.

As I've written earlier in the thread, we cannot gain much from this
optimization. Jakub's results show around a 2% difference:

>baseline, master, default Linux readahead (128kb):
>33.979, 0.478
>35.137, 0.504
>34.649, 0.518

>master+patched, readahead disabled:
>34.338, 0.528
>34.568, 0.575
>34.007, 1.136

>master+patched, readahead enabled (as default):
>33.935, 0.523
>34.109, 0.501
>33.408, 0.557

On the other hand, the patch indeed is tiny and I don't think the
proposed advise can ever make things bad. So, I've looked through the
patch again and I agree it can be committed in the current state.

Kind regards,
Pavel Borisov,
Supabase




Re: Report roles in pg_upgrade pg_ prefix check

2022-11-25 Thread Bruce Momjian
On Thu, Nov 24, 2022 at 12:31:09PM +0100, Daniel Gustafsson wrote:
> Looking at a recent pg_upgrade thread I happened to notice that the check for
> roles with a pg_ prefix only reports the error, not the roles it found.  Other
> similar checks where the user is expected to alter the old cluster typically
> reports the found objects in a textfile.  The attached adds reporting to make
> that class of checks consistent (the check for prepared transactions which 
> also
> isn't reporting is different IMO as it doesn't expect ALTER commands).
> 
> As this check is only executed against the old cluster the patch removes the
> check when printing the error.

+1

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-25 Thread Alvaro Herrera
On 2022-Nov-25, Dimos Stamatakis wrote:

> So does this mean there is no race condition in this case and that
> this error is redundant?

No, it means I believe a bug exists but that I haven't spent enough time
on it to understand what it is.

Please do not top-post.
https://wiki.postgresql.org/wiki/Mailing_Lists#Email_etiquette_mechanics

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)




Rethinking the implementation of ts_headline()

2022-11-25 Thread Tom Lane
After further contemplation of bug #17691 [1], I've concluded that
what I did in commit c9b0c678d was largely misguided.  For one
thing, the new hlCover() algorithm no longer finds shortest-possible
cover strings: if your query is "x & y" and the text is like
"... x ... x ... y ...", then the selected cover string will run
from the first occurrence of x to the y, whereas the old algorithm
would have correctly selected "x ... y".  For another thing, the
maximum-cover-length hack that I added in 78e73e875 to band-aid
over the performance issues of the original c9b0c678d patch means
that various scenarios no longer work as well as they used to,
which is the proximate cause of the complaints in bug #17691.
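
A small reproducer sketch of that cover-selection difference (not taken
from the bug report; whether the difference shows up in the output also
depends on the MinWords/MaxWords options used):

SELECT ts_headline('english',
    'cat by the barn door, and later a cat sitting beside the dog',
    to_tsquery('english', 'cat & dog'),
    'MinWords=2, MaxWords=5');
-- The old (and restored) hlCover() prefers the shorter cover starting at
-- the second "cat", while the current code's cover starts at the first.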

What I'm now thinking is that the original hlCover() algorithm was
fine (if underdocumented) *as long as it's only asked to deal with
AND-like semantics*.  Given that restriction, its approach of
finding the latest first occurrence of any query term, and then
backing up to the earliest last occurrence, is visibly correct;
and we weren't hearing of any performance issues with it either.
The problem is that this approach fails miserably for tsqueries
that aren't pure ANDs, because it will insist on including query
terms that aren't required for a match and indeed could prevent a
match.

So what we need is to find a way to fold the other sorts of queries
into an AND context.  After a couple of false starts, I came up
with the attached patch.  It builds on the existing TS_phrase_execute
infrastructure, which already produces an ExecPhraseData structure
containing an exact list of the match locations for a phrase subquery.
It's easy to get an ExecPhraseData structure for a single-lexeme
match, too.  We can handle plain ANDs by forming lists of
ExecPhraseData structs (with implicit AND semantics across the list).
We can handle ORs by union'ing the component ExecPhraseData structs.
And we can handle NOTs by just dropping them, because we don't want
ts_headline to prioritize matches to such words.  There are some fine
points but it all seems to work pretty well.

Hence I propose the attached patchset.  0001 adds some test cases that
I felt necessary after examining code coverage reports; I split this
out mainly so that 0002 clearly shows the behavioral changes from
current code.  0002 adds TS_execute_locations() which does what I just
described, and rewrites hlCover() into something that's a spiritual
descendant of the original algorithm.  It can't be quite the same
as before because the location data that TS_execute_locations() returns
is measured in lexeme indexes not word indexes, so some translation is
needed.

Notable here is that a couple of regression test results revert
to what they were before c9b0c678d.  They're both instances of
preferring "x ... x ... y" to "x ... y", which I argued was okay
at the time, but I now see the error in that.

Although we back-patched c9b0c678d, I'm inclined to propose this
for HEAD only.  The misbehaviors it's fixing are less bad than what
we were dealing with in that patch.

BTW, while experimenting with this I realized that tsvector's limitation
of lexeme indexes to be at most 16383 is really quite disastrous for
phrase searches.  That limitation was arguably okay before we had phrase
searching, but now it seems untenable.  For example, tsvector entries like
"foo:16383 bar:16383" will not match "foo <-> bar" because the phrase
match code wants their lexeme positions to differ by one.  This basically
means that phrase searches do not work beyond ~20K words into a document.
I'm not sure what to do about that exactly, but I think we need to do
something.
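
A minimal illustration of the clamp's effect on phrase matching, using
positions written out literally as in the example above:

SELECT 'foo:16383 bar:16383'::tsvector @@ 'foo <-> bar'::tsquery;  -- false
SELECT 'foo:100 bar:101'::tsvector @@ 'foo <-> bar'::tsquery;      -- true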

Whatever we might do about that would be largely orthogonal to this
patch, in any case.  I'll stick this in the January CF.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/17691-93cef39a14663963%40postgresql.org

diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index dc03f15499..e6a01cf985 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1804,6 +1804,123 @@ S. T. Coleridge (1772-1834)
  Water, water, every where
 (1 row)
 
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'day & drink'));
+ts_headline
+---
+ Day after day, day after day,+
+   We stuck, nor breath nor motion,   +
+ As idle as a painted Ship+
+   Upon a painted Ocean.  +
+ Water, water, every where+
+   And all the boards did shrink;   

Re: Lockless queue of waiters in LWLock

2022-11-25 Thread Pavel Borisov
Hi, hackers!
In the measurements earlier in the thread, I've been using a LIFO wake
queue in the primary lockless patch (it was attached as the previous
versions of the patch) and an "inverted wake queue" (in fact FIFO) as
the alternative benchmarking option. I think using the latter is fairer
and more natural, and provided they show no difference in speed, I'd
make the main patch use it (attached as v6). No other changes from v5,
though.

Regards,
Pavel.


v6-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data


Re: Patch: Global Unique Index

2022-11-25 Thread Bruce Momjian
On Mon, Nov 21, 2022 at 12:33:30PM +, Simon Riggs wrote:
> On Thu, 17 Nov 2022 at 22:01, Cary Huang  wrote:
> >
> > Patch: Global Unique Index
> 
> Let me start by expressing severe doubt on the usefulness of such a
> feature, but also salute your efforts to contribute.
> 
> > In other words, a global unique index and a regular partitioned index are 
> > essentially the same in terms of their storage structure except that one 
> > can do cross-partition uniqueness check, the other cannot.
> 
> This is the only workable architecture, since it allows DETACH to be
> feasible, which is essential.

I had trouble understanding this feature so I spent some time thinking
about it.  I don't think this is really a global unique index, meaning
it is not one index with all the value in the index.  Rather it is the
enforcement of uniqueness across all of a partitioned table's indexes. 
I think global indexes have a limited enough use-case that this patch's
approach is as close as we are going to get to it in the foreseeable
future.

Second, I outlined the three values of global indexes in this blog
entry, based on a 2019 email thread:

https://momjian.us/main/blogs/pgblog/2020.html#July_1_2020

https://www.postgresql.org/message-id/ca+tgmob_j2m2+qkwrhg2njqekmewzntfd7a6ubg34fjuzpk...@mail.gmail.com

The three values are:

1.  The ability to reference partitioned tables as foreign keys
without requiring the partition key to be part of the foreign
key reference; Postgres 12 allows such foreign keys if they match
partition keys.

2. The ability to add a uniqueness constraint to a partitioned
table where the unique columns are not part of the partition key.

3.  The ability to index values that only appear in a few
partitions, and are not part of the partition key.

This patch should help with #1 and #2, but not #3.  The uniqueness
guarantee allows, on average, half of the partitioned table's indexes to
be checked if there is a match, and all partitioned table's indexes if
not.  This is because once you find a match, you don't need to keep
checking because the value is unique.
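
As a concrete illustration of #2, this is what current releases reject
today (independent of the patch):

CREATE TABLE orders (
    order_id   bigint,
    order_date date NOT NULL
) PARTITION BY RANGE (order_date);

-- Fails, because the unique constraint does not include the partition
-- key column (order_date).
ALTER TABLE orders ADD CONSTRAINT orders_order_id_uniq UNIQUE (order_id);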

Looking at the patch, I am unclear how the patch prevents concurrent
duplicate value insertion during the partitioned index checking.  I am
actually not sure how that can be done without locking all indexes or
inserting placeholder entries in all indexes.  (Yeah, that sounds bad,
unless I am missing something.)

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-11-25 Thread Tom Lane
Israel Barth Rubio  writes:
> It would be great if we can back-patch this to all supported versions,
> as the issue itself is currently affecting them all.

In my mind, this is waiting for Peter to opine on whether it satisfies
his concern.

I'm also looking for input on whether to reject if

if ((MyXactFlags & XACT_FLAGS_PIPELINING) &&
    GetTopTransactionIdIfAny() != InvalidTransactionId)

rather than just the bare

if (MyXactFlags & XACT_FLAGS_PIPELINING)

tests in the patch-as-posted.

regards, tom lane




Re: postgres_fdw: batch inserts vs. before row triggers

2022-11-25 Thread Tom Lane
Etsuro Fujita  writes:
> I have committed the patch.

Apologies for not having paid attention to this thread, but ...

I don't think the committed patch is acceptable at all, at least
not in the back branches, because it creates a severe ABI break.
Specifically, by adding a field to ResultRelInfo you have changed
the array stride of es_result_relations, and that will break any
previously-compiled extension code that accesses that array.

I'm not terribly pleased with it having added a field to EState
either.  That seems much more global than what we need here.
Couldn't we add the field to ModifyTableState, instead?

regards, tom lane




Re: checking rd_rules in RelationBuildDesc

2022-11-25 Thread Ted Yu
On Fri, Nov 25, 2022 at 8:17 AM Tom Lane  wrote:

> Ted Yu  writes:
> > I wonder if we should check relation->rd_rules after the call
> > to RelationBuildRuleLock().
>
> That patch is both pointless and wrong.  There is some
> value in updating relhasrules in the catalog, so that future
> relcache loads don't uselessly call RelationBuildRuleLock;
> but we certainly can't try to do so right there.  That being
> the case, making the relcache be out of sync with what's on
> disk cannot have any good consequences.  The most likely
> effect is that it would block later logic from fixing things
> correctly.  There is logic in VACUUM to clean out obsolete
> relhasrules flags (see vac_update_relstats), but I suspect
> that would no longer work properly if we did this.
>
> regards, tom lane
>
Hi,
Thanks for evaluating the patch.

The change originates from what we have in
RelationCacheInitializePhase3():

        if (relation->rd_rel->relhasrules && relation->rd_rules == NULL)
        {
            RelationBuildRuleLock(relation);
            if (relation->rd_rules == NULL)
                relation->rd_rel->relhasrules = false;

FYI


Re: Bug in row_number() optimization

2022-11-25 Thread Tom Lane
Sergey Shinderuk  writes:
> What about user-defined operators? I created my own <= operator for int8 
> which returns true on null input, and put it in a btree operator class. 
> Admittedly, it's weird that (null <= 1) evaluates to true. But does it 
> violate  the contract of the btree operator class or something? Didn't 
> find a clear answer in the docs.

It's pretty unlikely that this would work during an actual index scan.
I'm fairly sure that btree (and other index AMs) have hard-wired
assumptions that index operators are strict.

regards, tom lane




Re: checking rd_rules in RelationBuildDesc

2022-11-25 Thread Tom Lane
Ted Yu  writes:
> I wonder if we should check relation->rd_rules after the call
> to RelationBuildRuleLock().

That patch is both pointless and wrong.  There is some
value in updating relhasrules in the catalog, so that future
relcache loads don't uselessly call RelationBuildRuleLock;
but we certainly can't try to do so right there.  That being
the case, making the relcache be out of sync with what's on
disk cannot have any good consequences.  The most likely
effect is that it would block later logic from fixing things
correctly.  There is logic in VACUUM to clean out obsolete
relhasrules flags (see vac_update_relstats), but I suspect
that would no longer work properly if we did this.

regards, tom lane




Re: Non-decimal integer literals

2022-11-25 Thread Peter Eisentraut

On 24.11.22 10:13, David Rowley wrote:
> On Thu, 24 Nov 2022 at 21:35, Peter Eisentraut
>  wrote:
>> My code follows the style used for parsing the decimal integers.
>> Keeping that consistent is valuable I think.  I think the proposed
>> change makes the code significantly harder to understand.  Also, what
>> you are suggesting here would amount to an attempt to make parsing
>> hexadecimal integers even faster than parsing decimal integers.  Is that
>> useful?
>
> Isn't being faster one of the major use cases for this feature?

Never thought about it that way.

> I remember many years ago and several jobs ago when working with SQL
> Server being able to speed up importing data using hexadecimal
> DATETIMEs. I can't think why else you might want to represent a
> DATETIME as a hexstring, so I assumed this was a large part of the use
> case for INTs in PostgreSQL. Are you telling me that better
> performance is not something anyone will want out of this feature?

This isn't about datetimes but about integers.





Re: Bug in row_number() optimization

2022-11-25 Thread Sergey Shinderuk

On 25.11.2022 15:46, Richard Guo wrote:

> Considering the 'Filter' is a strict function, marking it as
> NULL would do.  I think this is why this patch works.


What about user-defined operators? I created my own <= operator for int8 
which returns true on null input, and put it in a btree operator class. 
With my operator I get:


  depname  | empno | salary | enroll_date | c1 | rn | c2 | c3
-----------+-------+--------+-------------+----+----+----+----
 personnel |     5 |   3500 | 2007-12-10  |  2 |  1 |  2 |  2
 sales     |     3 |   4800 | 2007-08-01  |  3 |  1 |  3 |  3
 sales     |     4 |   4800 | 2007-08-08  |  3 |    |    |  3
(3 rows)

Admittedly, it's weird that (null <= 1) evaluates to true. But does it 
violate  the contract of the btree operator class or something? Didn't 
find a clear answer in the docs.


--
Sergey Shinderuk        https://postgrespro.com/





checking rd_rules in RelationBuildDesc

2022-11-25 Thread Ted Yu
Hi,
(In light of commit 7b2ccc5e03bf16d1e1bbabca25298108c839ec52)

In RelationBuildDesc(), we have:

if (relation->rd_rel->relhasrules)
RelationBuildRuleLock(relation);

I wonder if we should check relation->rd_rules after the call
to RelationBuildRuleLock().

Your comment is appreciated.


build-desc-check-rules.patch
Description: Binary data


Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-25 Thread Dimos Stamatakis
So does this mean there is no race condition in this case and that this error 
is redundant?

Thanks,
Dimos


From: Alvaro Herrera 
Date: Thursday, 24. November 2022 at 19:24
To: Dimos Stamatakis 
Cc: Peter Geoghegan , simon.ri...@enterprisedb.com 
, pgsql-hackers@lists.postgresql.org 

Subject: Re: Fix for visibility check on 14.5 fails on tpcc with high 
concurrency
[External Email]

On 2022-Nov-24, Alvaro Herrera wrote:

> On 2022-Nov-24, Dimos Stamatakis wrote:
>
> > rmgr: MultiXact len (rec/tot): 54/ 54, tx: 248477, lsn: 0/66DB82A8, prev 
> > 0/66DB8260, desc: CREATE_ID 133 offset 265 nmembers 2: 248477 (nokeyupd) 
> > 248500 (keysh)
> > rmgr: Heap len (rec/tot): 70/ 70, tx: 248477, lsn: 0/66DB82E0, prev 
> > 0/66DB82A8, desc: HOT_UPDATE off 20 xmax 133 flags 0x20 IS_MULTI EXCL_LOCK 
> > ; new off 59 xmax 132, blkref #0: rel 1663/16384/16385 blk 422
> > rmgr: Transaction len (rec/tot): 34/ 34, tx: 248477, lsn: 0/66DBA710, prev 
> > 0/66DBA6D0, desc: ABORT 2022-11-23 20:33:03.712298 UTC
> > …
> > rmgr: Transaction len (rec/tot): 34/ 34, tx: 248645, lsn: 0/66DBB060, prev 
> > 0/66DBB020, desc: ABORT 2022-11-23 20:33:03.712388 UTC
>
> Ah, it seems clear enough: the transaction that aborted after having
> updated the tuple, is still considered live when doing the second
> update. That sounds wrong.

Hmm, if a transaction is aborted but its Xid is after latestCompletedXid,
it would be reported as still running. AFAICS that is only modified
with ProcArrayLock held in exclusive mode, and only read with it held in
share mode, so this should be safe.

I can see nothing else ATM.

--
Álvaro Herrera Breisgau, Deutschland — 
https://www.EnterpriseDB.com/
"I must say, I am absolutely impressed with what pgsql's implementation of
VALUES allows me to do. It's kind of ridiculous how much "work" goes away in
my code. Too bad I can't do this at work (Oracle 8/9)." (Tom Allison)
http://archives.postgresql.org/pgsql-general/2007-06/msg00016.php


Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-11-25 Thread Israel Barth Rubio
> Attached is a draft patch along the lines I speculated about above.
> It breaks backwards compatibility in that PreventInTransactionBlock
> commands will now be rejected if they're a non-first command in a
> pipeline.  I think that's okay, and arguably desirable, for HEAD
> but I'm pretty uncomfortable about back-patching it.

I attempted to run these using HEAD, and it fails:

parse: create temporary table t1 (a int) on commit drop
bind
execute
parse: analyze t1
bind
execute
parse: select * from t1
bind
execute
sync

It then works fine after applying your patch!

Just for some context, this was brought up by Peter E. based on an issue
reported by a customer. They are using PostgreSQL 11, and the issue
was observed after upgrading to PostgreSQL 11.17, which includes the
commit 9e3e1ac458abcda5aa03fa2a136e6fa492d58bd6. As a workaround
they downgraded the binaries to 11.16.

It would be great if we can back-patch this to all supported versions,
as the issue itself is currently affecting them all.

Regards,
Israel.


Re: [DOCS] Stats views and functions not in order?

2022-11-25 Thread David G. Johnston
On Fri, Nov 25, 2022 at 5:09 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 23.11.22 09:36, Peter Smith wrote:
>

> v6-0005-Cleanup-view-name-hyperlinks-for-Tables-28.1-and-.patch
> v6-0006-Remove-all-stats-views-from-the-ToC-of-28.2.patch
>
> I wasn't sure whether these had been reviewed yet, since they were
> late additions to the patch series.
>
>
They have not been reviewed.

If it's a matter of either-or I'd really prefer one page per grouping over
getting rid of the table-of-contents.  But I suspect there has to be some
way to add an sgml element to the markup to force a new page and would
prefer to confirm or refute that prior to committing 0006.

0005 seems a win either way though I haven't reviewed it yet.

David J.


Re: Bug in MERGE's test for tables with rules

2022-11-25 Thread Dean Rasheed
On Wed, 23 Nov 2022 at 15:32, Tom Lane  wrote:
>
> Not quite sure the added test case is worth the cycles.
>

No, probably not, for such a trivial change.

Pushed to HEAD and 15, without the test. Thanks for looking!

Regards,
Dean




Re: Bug in row_number() optimization

2022-11-25 Thread Richard Guo
On Fri, Nov 25, 2022 at 11:26 AM David Rowley  wrote:

> There are two different pass-through modes that the WindowAgg can move
> into when it detects that the run condition is no longer true:
>
> 1) WINDOWAGG_PASSTHROUGH and
> 2) WINDOWAGG_PASSTHROUGH_STRICT
>
> #2 is used when the WindowAgg is the top-level one in this query
> level. Remember we'll need multiple WindowAgg nodes when there are
> multiple different windows to evaluate.  The reason that we need #1 is
> that if there are multiple WindowAggs, then the top-level one (or just
> any WindowAgg above it) might need all the rows from a lower-level
> WindowAgg.


Thanks for the explanation!  I think now I understand pass-through modes
much better.


> What I mean by "WindowAggs above us cannot reference the result of
> another WindowAgg" is that the evaluation of sum(qty) over (order by
> date) can't see the "rn" column. SQL does not allow it.


I think I get your point.  Yeah, the 'rn' column is not needed for the
evaluation of WindowAggs above.  But ISTM it might be needed to evaluate
the quals of WindowAggs above.  Such as in the plan below

explain (costs off) SELECT * FROM
   (SELECT
   count(salary) OVER (PARTITION BY depname || '') c1, -- w1
   row_number() OVER (PARTITION BY depname) rn -- w2
FROM empsalary
) e WHERE rn <= 1;
                            QUERY PLAN
------------------------------------------------------------------
 Subquery Scan on e
   ->  WindowAgg
         Filter: ((row_number() OVER (?)) <= 1)
         ->  Sort
               Sort Key: (((empsalary.depname)::text || ''::text))
               ->  WindowAgg
                     Run Condition: (row_number() OVER (?) <= 1)
                     ->  Sort
                           Sort Key: empsalary.depname
                           ->  Seq Scan on empsalary
(10 rows)

The 'rn' column is calculated in the lower-level WindowAgg, and it is
needed to evaluate the 'Filter' of the upper-level WindowAgg.  In
pass-through mode, the lower-level WindowAgg no longer evaluates its
window functions, so we need to set the 'rn' column to something that
makes the 'Filter' evaluate to false.  Since the 'Filter' is a strict
function, marking the column as NULL does the job.  I think this is why
this patch works.
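
To make that concrete, the core of the fix (as I understand it; this is
only a sketch of the idea using nodeWindowAgg.c's names, not the actual
patch) is roughly:

/*
 * On switching to pass-through, null out the window function result
 * columns so that strict quals above us (like the "rn <= 1" Filter)
 * reject the pass-through rows.
 */
ExprContext *econtext = winstate->ss.ps.ps_ExprContext;

winstate->status = WINDOWAGG_PASSTHROUGH;
for (int i = 0; i < winstate->numfuncs; i++)
{
	econtext->ecxt_aggvalues[i] = (Datum) 0;
	econtext->ecxt_aggnulls[i] = true;
}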


> Just thinking of the patch a bit more, what I wrote ends up
> continually zeroing the values and marking the columns as NULL. Likely
> we can just do this once when we do: winstate->status =
> WINDOWAGG_PASSTHROUGH;


Yes, I also think we can do this only once when we go into pass-through
mode.

Thanks
Richard


Re: xml2: add test for coverage

2022-11-25 Thread Peter Eisentraut

On 23.08.22 03:38, Dong Wook Lee wrote:

I made a small patch for xml2 to improve test coverage.
However, there was a problem using the functions below.

- xpath_number
- xpath_bool
- xpath_nodeset
- xpath_list

Do you have any advice on how to use this function correctly?
It would also be good to add an example of using the function to the document.


I can confirm that these functions could use more tests and more 
documentation and examples.  But given that you registered a patch in 
the commit fest, it should be you who provides a patch to solve those 
issues.  Are you still working on this, or were you just looking for 
help on how to solve this?






Re: [PATCH] Add initial xid/mxid/mxoff to initdb

2022-11-25 Thread Peter Eisentraut

On 05.05.22 17:47, Maxim Orlov wrote:
During work on 64-bit XID patch [1] we found handy to have initdb 
options to set initial xid/mxid/mxoff values to arbitrary non default 
values. It helps test different scenarios: related to wraparound, 
pg_upgrade from 32-bit XID to 64-bit XID, etc.


We realize, this patch can be singled out as an independent patch from 
the whole patchset in [1] and be useful irrespective of 64-bit XID in 
cases of testing of wraparound and so on.


In particular, we employed this patch to test recent changes in logical 
replication of subxacts [2] and found no problems in it near the point 
of publisher wraparound.


Just for completeness, over in the other thread the feedback was that 
this functionality is better put into pg_resetwal.






Re: Small miscellaneous fixes

2022-11-25 Thread Peter Eisentraut

On 04.10.22 06:18, Michael Paquier wrote:

On Mon, Oct 03, 2022 at 08:05:57AM -0300, Ranier Vilela wrote:

Em seg., 3 de out. de 2022 às 05:01, Masahiko Sawada 
escreveu:

On Fri, Sep 30, 2022 at 9:08 AM Ranier Vilela  wrote:

1. Avoid useless reassigning var _logsegno

(src/backend/access/transam/xlog.c)

Commit 7d70809 left a little oversight.
XLByteToPrevSeg and XLByteToSeg are macros, and both assign _logsegno.
So, the first assignment is lost and is useless.


Right, I have missed this one.  We do that now in
build_backup_content() when building the contents of the backup
history file.


Is this something you want to follow up on, since you were involved in 
that patch?  Is the redundant assignment simply to be deleted, or do you 
want to check the original patch again for context?






Re: Does pg_rman support PG15?

2022-11-25 Thread Daniel Gustafsson
> On 24 Nov 2022, at 17:31, T Adachi  wrote:

> To the developers working on pg_rman, are there any plans to support PG15
> and when might pg_rman source be released?
> The latest version of pg_rman, V1.3.14, appears to be incompatible with PG15.
> (In PG15, pg_start_backup()/pg_stop_backup() have been renamed.)

Discussions on individual extensions are best kept within the forum that the
extension uses for discussions.  For pg_rman I recommend opening an issue on
their Github project page.

https://github.com/ossc-db/pg_rman

--
Daniel Gustafsson   https://vmware.com/





Re: [DOCS] Stats views and functions not in order?

2022-11-25 Thread Peter Eisentraut

On 23.11.22 09:36, Peter Smith wrote:

PSA new patches. Now there are 6 of them. If some of the earlier
patches are agreeable can those ones please be committed? (because I
think this patch may be susceptible to needing a big rebase if
anything in those tables changes).


I have committed

v6-0001-Re-order-sections-of-28.4.-Progress-Reporting.patch
v6-0003-Re-order-Table-28.12-Wait-Events-of-type-LWLock.patch
v6-0004-Re-order-Table-28.35-Per-Backend-Statistics-Funct.patch

which seemed to have clear consensus.

v6-0002-Re-order-Table-28.2-Collected-Statistics-Views.patch

This one also seems ok, need a bit more time to look it over.

v6-0005-Cleanup-view-name-hyperlinks-for-Tables-28.1-and-.patch
v6-0006-Remove-all-stats-views-from-the-ToC-of-28.2.patch

I wasn't sure whether these had been reviewed yet, since they were
late additions to the patch series.






Re: Avoid streaming the transaction which are skipped (in corner cases)

2022-11-25 Thread Amit Kapila
On Fri, Nov 25, 2022 at 1:35 PM Dilip Kumar  wrote:
>
> During DecodeCommit() for skipping a transaction we use ReadRecPtr to
> check whether to skip this transaction or not.  Whereas in
> ReorderBufferCanStartStreaming() we use EndRecPtr to check whether to
> stream or not. Generally it will not create a problem but if the
> commit record itself is adding some changes to the transaction(e.g.
> snapshot) and if the "start_decoding_at" is in between ReadRecPtr and
> EndRecPtr then streaming will decide to stream the transaction where
> as DecodeCommit will decide to skip it.  And for handling this case in
> ReorderBufferForget() we call stream_abort().
>

The other cases are probably where we don't have FilterByOrigin or
dbid check, for example, XLOG_HEAP2_NEW_CID/XLOG_XACT_INVALIDATIONS.
We anyway actually don't send anything for such cases except empty
start/stop messages. Can we add some flag to txn which says that there
is at least one change like DML that we want to stream? Then we can
use that flag to decide whether to stream or not.
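
Roughly (a sketch only; the flag name, bit value, and exact placement
below are hypothetical):

/* hypothetical bit in ReorderBufferTXN->txn_flags */
#define RBTXN_HAS_STREAMABLE_CHANGE	0x0800

/* when queuing a change that would actually be sent downstream, e.g. DML: */
if (change->action == REORDER_BUFFER_CHANGE_INSERT ||
	change->action == REORDER_BUFFER_CHANGE_UPDATE ||
	change->action == REORDER_BUFFER_CHANGE_DELETE)
	toptxn->txn_flags |= RBTXN_HAS_STREAMABLE_CHANGE;

/* and when picking a transaction to stream (e.g. on memory pressure): */
if ((toptxn->txn_flags & RBTXN_HAS_STREAMABLE_CHANGE) == 0)
	continue;			/* nothing the downstream would see; don't stream it */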

-- 
With Regards,
Amit Kapila.




Re: Avoid distributing new catalog snapshot again for the transaction being decoded.

2022-11-25 Thread Ashutosh Bapat
Hi Hou,
Thanks for the patch. With a simple condition, we can eliminate the
need to queue a snapshot change in the current transaction and then
apply it. That saves some memory and computation. This looks useful.

When the queue snapshot change is processed in
ReorderBufferProcessTXN(), we call SetupHistoricSnapshot(). But I
didn't find a path through which SetupHistoricSnapshot() is called
from SnapBuildCommitTxn(). Either I missed some code path or it's not
needed. Can you please enlighten me?

-- 
Best Wishes,
Ashutosh Bapat

On Fri, Nov 25, 2022 at 2:28 PM houzj.f...@fujitsu.com
 wrote:
>
> Hi,
>
> When doing some other related work, I noticed that when decoding the COMMIT
> record via SnapBuildCommitTxn()-> SnapBuildDistributeNewCatalogSnapshot() we
> will add a new snapshot to all transactions including the one being 
> decoded(just committed one).
>
> But since we've already done required modifications in the snapshot for the
> current transaction being decoded(in SnapBuildCommitTxn()), so I think we can
> avoid adding a snapshot for it again.
>
> Here is the patch to improve this.
> Thoughts ?
>
> Best regards,
> Hou zhijie
>




Re: Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish()

2022-11-25 Thread Bharath Rupireddy
On Fri, Nov 25, 2022 at 12:16 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-11-24 18:13:10 +0530, Bharath Rupireddy wrote:
> > With that said, here's a small improvement I can think of, that is, to
> > avoid calling LWLockWaitForVar() for the WAL insertion lock the caller
> > of WaitXLogInsertionsToFinish() currently holds. Since
> > LWLockWaitForVar() does a bunch of things - holds interrupts, does
> > atomic reads, acquires and releases wait list lock and so on, avoiding
> > it may be a good idea IMO.
>
> That doesn't seem like a big win. We're still going to call LWLockWaitForVar()
> for all the other locks.
>
> I think we could improve this code more significantly by avoiding the call to
> LWLockWaitForVar() for all locks that aren't acquired or don't have a
> conflicting insertingAt, that'd require just a bit more work to handle systems
> without tear-free 64bit writes/reads.
>
> The easiest way would probably be to just make insertingAt a 64bit atomic,
> that transparently does the required work to make even non-atomic read/writes
> tear free. Then we could trivially avoid the spinlock in
> LWLockConflictsWithVar(), LWLockReleaseClearVar() and with just a bit more
> work add a fastpath to LWLockUpdateVar(). We don't need to acquire the wait
> list lock if there aren't any waiters.
>
> I'd bet that start to have visible effects in a workload with many small
> records.

Thanks Andres! I quickly came up with the attached patch. I also ran
an insert test [1]; the results are below, and I've attached a graph of
them as well. The cirrus-ci is happy with the patch -
https://github.com/BRupireddy/postgres/tree/wal_insertion_lock_improvements_v1_2.
Please let me know if the direction of the patch seems right.
clients    HEAD    PATCHED
1          1354       1499
2          1451       1464
4          3069       3073
8          5712       5797
16        11331      11157
32        22020      22074
64        41742      42213
128       71300      76638
256      103652     118944
512      111250     161582
768       99544     161987
1024      96743     164161
2048      72711     156686
4096      54158     135713

[1]
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j
8 install > install.log 2>&1 &
cd inst/bin
./pg_ctl -D data -l logfile stop
rm -rf data logfile
free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m
./initdb -D data
./pg_ctl -D data -l logfile start
./psql -d postgres -c 'ALTER SYSTEM SET shared_buffers = "8GB";'
./psql -d postgres -c 'ALTER SYSTEM SET max_wal_size = "16GB";'
./psql -d postgres -c 'ALTER SYSTEM SET max_connections = "4096";'
./pg_ctl -D data -l logfile restart
./pgbench -i -s 1 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT
pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 10 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
ulimit -S -n 5000
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
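
For what it's worth, the direction suggested above essentially means
switching insertingAt to the 64-bit atomics API so that readers get a
tear-free value without taking the spinlock/wait-list lock; a
stripped-down sketch (simplified names, not the actual lwlock.c or
attached-patch code):

#include "access/xlogdefs.h"
#include "port/atomics.h"
#include "storage/lwlock.h"

typedef struct WALInsertLockSketch
{
	LWLock		lock;
	pg_atomic_uint64 insertingAt;
} WALInsertLockSketch;

/* writer side: advertise how far this backend has copied WAL */
static inline void
advertise_inserting_at(WALInsertLockSketch *l, XLogRecPtr insertingAt)
{
	pg_atomic_write_u64(&l->insertingAt, (uint64) insertingAt);
}

/* reader side: fetch the advertised position without any locking */
static inline XLogRecPtr
read_inserting_at(WALInsertLockSketch *l)
{
	return (XLogRecPtr) pg_atomic_read_u64(&l->insertingAt);
}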

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-WAL-Insertion-Lock-Improvements.patch
Description: Binary data


Re: Support logical replication of DDLs

2022-11-25 Thread vignesh C
On Fri, 25 Nov 2022 at 11:47, vignesh C  wrote:
>
> On Sun, 20 Nov 2022 at 09:29, vignesh C  wrote:
> >
> > On Fri, 11 Nov 2022 at 11:03, Peter Smith  wrote:
> > >
> > > On Fri, Nov 11, 2022 at 4:17 PM Peter Smith  wrote:
> > > >
> > > > On Fri, Nov 11, 2022 at 4:09 PM Peter Smith  
> > > > wrote:
> > > > >
> > > > > On Fri, Nov 11, 2022 at 3:47 PM Peter Smith  
> > > > > wrote:
> > > > > >
> > > > > > Here are more review comments for the v32-0001 file ddl_deparse.c
> > > > > >
> > > > > > *** NOTE - my review post became too big, so I split it into 
> > > > > > smaller parts.
> > > > >
> > > >
> > >
> > > THIS IS PART 4 OF 4.
> > >
> > > ===
> > >
> > > src/backend/commands/ddl_deparse.c
> >
> > Thanks for the comments, the attached v39 patch has the changes for the 
> > same.
>
> One comment:
> While fixing review comments, I found that default syntax is not
> handled for create domain:
> +   /*
> +* Verbose syntax
> +*
> +* CREATE DOMAIN %{identity}D AS %{type}T %{not_null}s %{constraints}s
> +* %{collation}s
> +*/
> +   createDomain = new_objtree("CREATE");
> +
> +   append_object_object(createDomain,
> +"DOMAIN %{identity}D AS",
> +
> new_objtree_for_qualname_id(TypeRelationId,
> +
>   objectId));
> +   append_object_object(createDomain,
> +"%{type}T",
> +
> new_objtree_for_type(typForm->typbasetype, typForm->typtypmod));

I have fixed this issue in the v40 patch attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm1WEnw2Oykb90PO1c4oDAVrAR%2B16W8Cm_F-KzgNvqmmKg%40mail.gmail.com

Regards,
Vignesh




Re: Logical Replication Custom Column Expression

2022-11-25 Thread Stavros Koureas
Yes, if the property is on the subscription side then it should be applied
to all the tables that the connected publication is exposing.
So if the property is enabled, you have to make sure that this origin column
exists in all of the tables that the publication is exposing...

Sure, that is the complete idea: the subscriber should match on the PK
including the origin, as the subscriber table will contain the same key
values from different origins, for example:

For publisher1 database table
id (pk) integer | value character varying
1   | testA1
2   | testA2

For publisher2 database table
id (pk) integer | value character varying
1   | testB1
2   | testB2

For subscriber database table
origin (pk) character varying | id (pk) integer | value character varying
publisher1   | 1   | testA1
publisher1   | 2   | testA2
publisher2   | 1   | testB1
publisher2   | 2   | testB2

All statements INSERT, UPDATE, DELETE should always include the predicate
of the origin.

On Fri, Nov 25, 2022 at 12:21 PM Ashutosh Bapat <
ashutosh.bapat@gmail.com> wrote:

> On Wed, Nov 23, 2022 at 4:54 AM Peter Smith  wrote:
> >
> > On Wed, Nov 23, 2022 at 7:38 AM Stavros Koureas
> >  wrote:
> > >
> > > Reading more carefully what you described, I think you are interested
> in getting something you call origin from publishers, probably some
> metadata from the publications.
> > >
> > > This identifier in those metadata maybe does not have business value
> on the reporting side. The idea is to use a value which has specific
> meaning to the user at the end.
> > >
> > > For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at
> the end based on a dimension table which holds this mapping the user would
> be able to filter the data. So programmatically the user can set the id
> value of the column plus creating the mapping table from an application
> let’s say and be able to distinguish the data.
> > >
> > > In addition this column should have the ability to be part of the
> primary key on the subscription table in order to not conflict with lines
> from other tenants having the same keys.
> > >
> > >
> >
> > I was wondering if a simpler syntax solution might also work here.
> >
> > Imagine another SUBSCRIPTION parameter that indicates to write the
> > *name* of the subscription to some pre-defined table column:
> > e.g. CREATE SUBSCRIPTION subname FOR PUBLICATION pub_tenant_1
> > CONNECTION '...' WITH (subscription_column);
> >
> > Logical Replication already allows the subscriber table to have extra
> > columns, so you just need to manually create the extra 'subscription'
> > column up-front.
> >
> > Then...
> >
> > ~~
> >
> > On Publisher:
> >
> > test_pub=# CREATE TABLE tab(id int primary key, description varchar);
> > CREATE TABLE
> >
> > test_pub=# INSERT INTO tab VALUES (1,'one'),(2,'two'),(3,'three');
> > INSERT 0 3
> >
> > test_pub=# CREATE PUBLICATION tenant1 FOR ALL TABLES;
> > CREATE PUBLICATION
> >
> > ~~
> >
> > On Subscriber:
> >
> > test_sub=# CREATE TABLE tab(id int, description varchar, subscription
> varchar);
> > CREATE TABLE
> >
> > test_sub=# CREATE SUBSCRIPTION sub_tenant1 CONNECTION 'host=localhost
> > dbname=test_pub' PUBLICATION tenant1 WITH (subscription_column);
> > CREATE SUBSCRIPTION
> >
> > test_sub=# SELECT * FROM tab;
> >  id | description | subscription
> > +-+--
> >   1 | one | sub_tenant1
> >   2 | two | sub_tenant1
> >   3 | three   | sub_tenant1
> > (3 rows)
> >
> > ~~
> >
> Thanks for the example. This is more concrete than just verbal description.
>
> In this example, do all the tables that a subscription subscribes to
> need that additional column or somehow the pglogical receiver will
> figure out which tables have that column and populate rows
> accordingly?
>
> My further fear is that the subscriber will also need to match the
> subscription column along with the rest of PK so as not to update rows
> from other subscriptions.
> --
> Best Wishes,
> Ashutosh Bapat
>


Re: Avoid streaming the transaction which are skipped (in corner cases)

2022-11-25 Thread Ashutosh Bapat
Excellent catch. We were looking at this code last week and wondered
the purpose of this abort. Probably we should have some macro or
function to decided whether to skip a transaction based on log record.
That will avoid using different values in different places.

On Fri, Nov 25, 2022 at 1:35 PM Dilip Kumar  wrote:
>
> During DecodeCommit() for skipping a transaction we use ReadRecPtr to
> check whether to skip this transaction or not.  Whereas in
> ReorderBufferCanStartStreaming() we use EndRecPtr to check whether to
> stream or not. Generally it will not create a problem but if the
> commit record itself is adding some changes to the transaction(e.g.
> snapshot) and if the "start_decoding_at" is in between ReadRecPtr and
> EndRecPtr then streaming will decide to stream the transaction where
> as DecodeCommit will decide to skip it.  And for handling this case in
> ReorderBufferForget() we call stream_abort().
>
> So ideally if we are planning to skip the transaction we should never
> stream it hence there is no need to stream abort such transaction in
> case of skip.
>
> In this patch I have fixed the skip condition in the streaming case
> and also added an assert inside ReorderBufferForget() to ensure that
> the transaction should have never been streamed.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Best Wishes,
Ashutosh Bapat




Re: Minimal logical decoding on standbys

2022-11-25 Thread Drouvot, Bertrand

Hi,

On 9/30/22 2:11 PM, Drouvot, Bertrand wrote:

Hi,

On 7/6/22 3:30 PM, Drouvot, Bertrand wrote:

Hi,

On 10/28/21 11:07 PM, Andres Freund wrote:

Hi,

On 2021-10-28 16:24:22 -0400, Robert Haas wrote:

On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand  wrote:

So you have in mind to check for XLogLogicalInfoActive() first, and if true, 
then open the relation and call
RelationIsAccessibleInLogicalDecoding()?

I think 0001 is utterly unacceptable. We cannot add calls to
table_open() in low-level functions like this. Suppose for example
that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
a buffer lock on the page. The idea that it's safe to call
table_open() while holding a buffer lock cannot be taken seriously.

Yes - that's pretty clearly a deadlock hazard. It shouldn't too hard to fix, I
think. Possibly a bit more verbose than nice, but...

Alternatively we could propagate the information whether a relcache entry is
for a catalog from the table to the index. Then we'd not need to change the
btree code to pass the table down.


Looking closer at RelationIsAccessibleInLogicalDecoding(), it seems to me that the
missing piece needed to tell whether or not an index is for a catalog is the
rd_options->user_catalog_table value of its related heap relation.

Then, a way to achieve that could be to:

- Add to Relation a new "heap_rd_options" representing the rd_options of the 
related heap relation when appropriate

- Trigger the related indexes relcache invalidations when an 
ATExecSetRelOptions() is triggered on a heap relation

- Write an equivalent of RelationIsUsedAsCatalogTable() for indexes that would 
make use of the heap_rd_options instead

Does that sound like a valid option to you or do you have another idea in mind 
to propagate the information whether a relcache entry is for a catalog from the 
table to the index?



I ended up with the attached proposal to propagate the catalog information to 
the indexes.

The attached adds a new field "isusercatalog" in pg_index to indicate whether 
or not the index is linked to a table that has the storage parameter user_catalog_table 
set to true.

Then it defines new macros, including "IndexIsAccessibleInLogicalDecoding" 
making use of this new field.

This new macro replaces get_rel_logical_catalog() that was part of the previous 
patch version.

What do you think about this approach and the attached?

If that sounds reasonable, then I'll add tap tests for it and try to improve the way 
isusercatalog is propagated to the index(es) in case a reset is done on 
user_catalog_table on the table (currently in this POC patch, it's hardcoded to 
"false" which is the default value for user_catalog_table in boolRelOpts[]) (A 
better approach would be probably to retrieve the value from the table once the reset is 
done and then propagate it to the index(es).)


Please find attached a rebase to propagate the catalog information to the 
indexes.
It also takes care of the RESET on user_catalog_table (adding a new Macro 
"HEAP_DEFAULT_USER_CATALOG_TABLE") and adds a few tests in 
contrib/test_decoding/sql/ddl.sql.
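
For readers following along, the new macro is essentially the index-side
twin of RelationIsAccessibleInLogicalDecoding(); very roughly (the exact
field name and definition are whatever is in the attached patch, this is
only a sketch):

/* "isusercatalog" is the new pg_index field described above */
#define IndexIsUserCatalog(relation) \
	((relation)->rd_index->indisusercatalog)

#define IndexIsAccessibleInLogicalDecoding(relation) \
	(XLogLogicalInfoActive() && \
	 RelationNeedsWAL(relation) && \
	 (IsCatalogRelation(relation) || IndexIsUserCatalog(relation)))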

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 2de2b9917d43d598da56c996361d934c45e52df2 Mon Sep 17 00:00:00 2001
From: bdrouvotAWS 
Date: Fri, 25 Nov 2022 09:42:40 +
Subject: [PATCH v27] Add info in WAL records in preparation for logical slot
 conflict handling.

When a WAL replay on standby indicates that a catalog table tuple is
to be deleted by an xid that is greater than a logical slot's
catalog_xmin, then that means the slot's catalog_xmin conflicts with
the xid, and we need to handle the conflict.  While subsequent commits
will do the actual conflict handling, this commit adds a new field
onCatalogTable in such WAL records, that is true for catalog tables,
so as to arrange for conflict handling.

Author: Andres Freund (in an older version), Amit Khandekar, Bertrand
Drouvot
Reviewed-By: Bertrand Drouvot, Andres Freund, Robert Haas, Fabrizio de
Royes Mello
---
 contrib/test_decoding/expected/ddl.out  | 29 +
 contrib/test_decoding/sql/ddl.sql   |  7 
 doc/src/sgml/catalogs.sgml  | 11 +
 src/backend/access/common/reloptions.c  |  2 +-
 src/backend/access/gist/gistxlog.c  |  1 +
 src/backend/access/hash/hashinsert.c|  1 +
 src/backend/access/heap/heapam.c|  4 +-
 src/backend/access/heap/pruneheap.c |  1 +
 src/backend/access/heap/visibilitymap.c |  3 +-
 src/backend/access/nbtree/nbtpage.c |  2 +
 src/backend/access/spgist/spgvacuum.c   |  1 +
 src/backend/catalog/index.c | 14 +--
 src/backend/commands/tablecmds.c| 55 -
 src/include/access/gistxlog.h   |  2 +
 src/include/access/hash_xlog.h  |  1 +
 src/include/access/heapam_xlog.h|  5 ++-
 

Re: PGDOCS - Logical replication GUCs - added some xrefs

2022-11-25 Thread Peter Eisentraut
Your patch moves the description of the subscriber-related configuration 
parameters from config.sgml to logical-replication.sgml.  But 
config.sgml is supposed to contain *all* configuration parameters.  If 
we're going to start splitting this up and moving things around then 
we'd need a more comprehensive plan than this individual patch.  (I'm 
not suggesting that we actually do this.)






Re: Logical Replication Custom Column Expression

2022-11-25 Thread Ashutosh Bapat
On Wed, Nov 23, 2022 at 4:54 AM Peter Smith  wrote:
>
> On Wed, Nov 23, 2022 at 7:38 AM Stavros Koureas
>  wrote:
> >
> > Reading more carefully what you described, I think you are interested in 
> > getting something you call origin from publishers, probably some metadata 
> > from the publications.
> >
> > This identifier in those metadata maybe does not have business value on the 
> > reporting side. The idea is to use a value which has specific meaning to 
> > the user at the end.
> >
> > For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at the end 
> > based on a dimension table which holds this mapping the user would be able 
> > to filter the data. So programmatically the user can set the id value of 
> > the column plus creating the mapping table from an application let’s say 
> > and be able to distinguish the data.
> >
> > In addition this column should have the ability to be part of the primary 
> > key on the subscription table in order to not conflict with lines from 
> > other tenants having the same keys.
> >
> >
>
> I was wondering if a simpler syntax solution might also work here.
>
> Imagine another SUBSCRIPTION parameter that indicates to write the
> *name* of the subscription to some pre-defined table column:
> e.g. CREATE SUBSCRIPTION subname FOR PUBLICATION pub_tenant_1
> CONNECTION '...' WITH (subscription_column);
>
> Logical Replication already allows the subscriber table to have extra
> columns, so you just need to manually create the extra 'subscription'
> column up-front.
>
> Then...
>
> ~~
>
> On Publisher:
>
> test_pub=# CREATE TABLE tab(id int primary key, description varchar);
> CREATE TABLE
>
> test_pub=# INSERT INTO tab VALUES (1,'one'),(2,'two'),(3,'three');
> INSERT 0 3
>
> test_pub=# CREATE PUBLICATION tenant1 FOR ALL TABLES;
> CREATE PUBLICATION
>
> ~~
>
> On Subscriber:
>
> test_sub=# CREATE TABLE tab(id int, description varchar, subscription 
> varchar);
> CREATE TABLE
>
> test_sub=# CREATE SUBSCRIPTION sub_tenant1 CONNECTION 'host=localhost
> dbname=test_pub' PUBLICATION tenant1 WITH (subscription_column);
> CREATE SUBSCRIPTION
>
> test_sub=# SELECT * FROM tab;
>  id | description | subscription
> +-+--
>   1 | one | sub_tenant1
>   2 | two | sub_tenant1
>   3 | three   | sub_tenant1
> (3 rows)
>
> ~~
>
Thanks for the example. This is more concrete than just a verbal description.

In this example, do all the tables that a subscription subscribes to
need that additional column, or will the pglogical receiver somehow
figure out which tables have that column and populate rows
accordingly?

My further fear is that the subscriber will also need to match the
subscription column along with the rest of the PK so as not to update
rows from other subscriptions.
-- 
Best Wishes,
Ashutosh Bapat




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-25 Thread John Naylor
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada 
wrote:
>
> [v11]

There is one more thing that just now occurred to me: expanding the use
of size classes makes rebasing and reworking the shared memory piece
more work than it should be. That's important because there are still some
open questions about the design around shared memory. To keep unnecessary
churn to a minimum, perhaps we should limit size class expansion to just
one (i.e. 5 total size classes) for the near future?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Support logical replication of DDLs

2022-11-25 Thread li jie
Hi Developer,

I have been following this patch for a long time.
Recently, I started to try to test it. I found several bugs
here and want to give you feedback.

1. CREATE TABLE LIKE
  I found that this case may be replicated incorrectly.
   You can run the following SQL statements:
   ```
   CREATE TABLE ctlt1 (a text CHECK (length(a) > 2) PRIMARY KEY, b text);
ALTER TABLE ctlt1 ALTER COLUMN a SET STORAGE MAIN;
ALTER TABLE ctlt1 ALTER COLUMN b SET STORAGE EXTERNAL;
CREATE TABLE ctlt1_like (LIKE ctlt1 INCLUDING ALL);
   ```
   The ctlt1_like table will not be replicated correctly.
  I think this is because the CREATE TABLE ... LIKE statement is captured by
  the event trigger as a CREATE TABLE statement plus multiple ALTER TABLE
statements.
  There is some overlap between them, and an error is reported when
they are replayed downstream.

2. ALTER TABLE (inherits)
case:
```
CREATE TABLE gtest30 (
a int,
b int GENERATED ALWAYS AS (a * 2) STORED
);
CREATE TABLE gtest30_1 () INHERITS (gtest30);
ALTER TABLE gtest30 ALTER COLUMN b DROP EXPRESSION;
```
After this case is executed on the publisher, the following error occurs
on the subscriber:

ERROR:  column "b" of relation "gtest30" is not a stored generated column
STATEMENT:  ALTER TABLE public.gtest30 ALTER COLUMN b DROP EXPRESSION, ALTER
COLUMN b DROP EXPRESSION

Obviously, the column modifications of the inherited table were also
captured,

and the wrong statement was then deparsed.

I believe that such errors may also occur in other alter table subcmd
scenarios where tables are inherited.



3. ALTER TABLE SET STATISTICS


case:

```
CREATE TABLE test_stat (a int);
ALTER TABLE test_stat ALTER a SET STATISTICS -1;

```

After this case is executed on the publisher, the following error occurs
on the subscriber:


syntax error at or near "4294967295" at character 60
STATEMENT: ALTER TABLE public.test_stat ALTER COLUMN a SET STATISTICS
4294967295


I guess this should be an overflow in the integer conversion process.



4.  json null string coredump


case:

```
CREATE OR REPLACE FUNCTION test_ddl_deparse_full()
RETURNS event_trigger LANGUAGE plpgsql AS
$$
DECLARE
r record;
deparsed_json text;
BEGIN
FOR r IN SELECT * FROM pg_event_trigger_ddl_commands()
LOOP
deparsed_json = ddl_deparse_to_json(r.command);
RAISE NOTICE 'deparsed json: %', deparsed_json;
RAISE NOTICE 're-formed command: %',
ddl_deparse_expand_command(deparsed_json);
END LOOP;
END;
$$;

CREATE EVENT TRIGGER test_ddl_deparse_full
ON ddl_command_end EXECUTE PROCEDURE test_ddl_deparse_full();

CREATE SCHEMA AUTHORIZATION postgres;


```


If the preceding case is executed, a core dump occurs,

which is related to a NULL string and is reproducible.



I hope this feedback is helpful to you.

We sincerely hope you can complete the DDL logical replication feature.


 Regards,  Adger



vignesh C  于2022年11月25日周五 14:18写道:

> On Sun, 20 Nov 2022 at 09:29, vignesh C  wrote:
> >
> > On Fri, 11 Nov 2022 at 11:03, Peter Smith  wrote:
> > >
> > > On Fri, Nov 11, 2022 at 4:17 PM Peter Smith 
> wrote:
> > > >
> > > > On Fri, Nov 11, 2022 at 4:09 PM Peter Smith 
> wrote:
> > > > >
> > > > > On Fri, Nov 11, 2022 at 3:47 PM Peter Smith 
> wrote:
> > > > > >
> > > > > > Here are more review comments for the v32-0001 file ddl_deparse.c
> > > > > >
> > > > > > *** NOTE - my review post became too big, so I split it into
> smaller parts.
> > > > >
> > > >
> > >
> > > THIS IS PART 4 OF 4.
> > >
> > > ===
> > >
> > > src/backend/commands/ddl_deparse.c
> >
> > Thanks for the comments, the attached v39 patch has the changes for the
> same.
>
> One comment:
> While fixing review comments, I found that default syntax is not
> handled for create domain:
> +   /*
> +* Verbose syntax
> +*
> +* CREATE DOMAIN %{identity}D AS %{type}T %{not_null}s
> %{constraints}s
> +* %{collation}s
> +*/
> +   createDomain = new_objtree("CREATE");
> +
> +   append_object_object(createDomain,
> +"DOMAIN %{identity}D AS",
> +
> new_objtree_for_qualname_id(TypeRelationId,
> +
>   objectId));
> +   append_object_object(createDomain,
> +"%{type}T",
> +
> new_objtree_for_type(typForm->typbasetype, typForm->typtypmod));
>
> Regards,
> Vignesh
>
>
>


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-25 Thread Maxim Orlov
On Fri, 25 Nov 2022 at 09:40, Amit Kapila  wrote:

> On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila 
> wrote:
> >
> > On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada 
> wrote:
> > >
> > > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila 
> wrote:
> > >
> > > Agreed not to have a test case for this.
> > >
> > > I've attached an updated patch. I've confirmed this patch works for
> > > all supported branches.
> > >
> >
> > I have slightly changed the checks used in the patch, otherwise looks
> > good to me. I am planning to push (v11-v15) the attached tomorrow
> > unless there are more comments.
> >
>
> Pushed.
>
A big thanks to you! Could you also close the commitfest entry
https://commitfest.postgresql.org/41/4024/, please?

-- 
Best regards,
Maxim Orlov.


Re: postgres_fdw: batch inserts vs. before row triggers

2022-11-25 Thread Etsuro Fujita
On Thu, Nov 24, 2022 at 8:19 PM Etsuro Fujita  wrote:
> Here is an updated patch.  In the attached, I added an assertion to
> ExecInsert().  Also, I tweaked comments and test cases a little bit,
> for consistency.  Also, I noticed a copy-and-pasteo in a comment in
> ExecBatchInsert(), so I fixed it as well.
>
> Barring objections, I will commit the patch.

I have committed the patch.

Best regards,
Etsuro Fujita




Avoid distributing new catalog snapshot again for the transaction being decoded.

2022-11-25 Thread houzj.f...@fujitsu.com
Hi,

When doing some other related work, I noticed that when decoding the COMMIT
record via SnapBuildCommitTxn() -> SnapBuildDistributeNewCatalogSnapshot() we
will add a new snapshot to all transactions, including the one being
decoded (the just-committed one).

But since we've already made the required modifications to the snapshot for the
current transaction being decoded (in SnapBuildCommitTxn()), I think we can
avoid adding a snapshot for it again.
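
In SnapBuildDistributeNewCatalogSnapshot() that would mean something
along these lines (only a sketch; the attached patch is authoritative,
and this assumes the just-committed xid is available at that point):

dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
{
	ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node, txn_i.cur);

	/*
	 * The transaction being decoded already had its snapshot set up in
	 * SnapBuildCommitTxn(), so don't queue another snapshot change for it.
	 */
	if (TransactionIdEquals(txn->xid, xid))
		continue;

	/* ... existing logic: ReorderBufferAddSnapshot() for the others ... */
}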

Here is the patch to improve this.
Thoughts ?

Best regards,
Hou zhijie



0001-Avoid-distributing-new-catalogsnapshot-for-the-trans.patch
Description:  0001-Avoid-distributing-new-catalogsnapshot-for-the-trans.patch


Re: CREATE UNLOGGED TABLE seq faults when debug_discard_caches=1

2022-11-25 Thread Jakub Wartak
Hi hackers,

Mr Lane, thank you for backporting this also to version 13. It seems
to be occurring in the wild (without debug_discard_caches) for real
users too when doing a lot of "CREATE INDEX i ON
unlogged_table_truncated_after_crash (x,y)", which sometimes (rarely)
results in SIGSEGV (signal 11). I've also reproduced it on 13.9 recently
thanks to "break *btbuildempty / call InvalidateSystemCaches()".

I'm leaving partial stack trace so that other might find it (note the:
smgrwrite reln=0x0):

#0  smgrwrite (reln=0x0, forknum=INIT_FORKNUM, blocknum=0,
buffer=0xeef828 "", skipFsync=true) at smgr.c:516
#1  0x004e5492 in btbuildempty (index=0x7f201fc3c7e0) at nbtree.c:178
#2  0x005417f4 in index_build
(heapRelation=heapRelation@entry=0x7f201fc49dd0,
indexRelation=indexRelation@entry=0x7f201fc3c7e0,
indexInfo=indexInfo@entry=0x1159dd8,
#3  0x00542838 in index_create
(heapRelation=heapRelation@entry=0x7f201fc49dd0,
indexRelationName=indexRelationName@entry=0x1159f38 "",
indexRelationId=y..)
#4  0x005db9c8 in DefineIndex
(relationId=relationId@entry=1804880199, stmt=stmt@entry=0xf2fab8,
indexRelationId=indexRelationId@entry=0,
parentIndexId=parentIndexId@entry=0

-Jakub Wartak.

On Fri, Nov 25, 2022 at 9:48 AM Tomas Vondra
 wrote:
>
>
>
> On 11/18/22 15:43, Tom Lane wrote:
> > David Geier  writes:
> >> On a different note: are we frequently running our tests suites with
> >> debug_discard_caches=1 enabled?
> >> It doesn't seem like.
> >
> > Hmm.   Buildfarm members avocet and trilobite are supposed to be
> > doing that, but their runtimes of late put the lie to it.
> > Configuration option got lost somewhere?
> >
>
> Yup, my bad - I forgot to tweak CPPFLAGS when upgrading the buildfarm
> client to v12. Fixed, next run should be with
>
> CPPFLAGS => '-DCLOBBER_CACHE_ALWAYS',
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>
>




Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-25 Thread Aleksander Alekseev
Hi hackers,

> I'm wondering whether the safest way to handle this is by creating a
> new TAM called "heap64", so that all storage changes happens there.

> Many current users see stability as one of the greatest strengths of
> Postgres, so while I very much support this move, I wonder if this
> gives us a way to have both stability and innovation at the same time?

That would be nice.

However, from what I see, TransactionId is a type used globally in
PostgreSQL. It is part of the structures used by the TAM interface, used in
WAL records, etc. So we will have to teach these components to work
with 64-bit XIDs anyway, and then start thinking about cases like: when
a user runs a transaction affecting two tables, a heap32 one and a
heap64 one, we will have to figure out which tuples are visible and
which are not. This is perhaps doable, but the maintenance burden for
the project would be too high IMO.

It seems to me that the best option we can offer for the users looking
for stability is to use the latest PostgreSQL version with 32-bit
XIDs. Assuming these users care that much about this particular design
choice of course.

> The whole project seems to just ignore basic, pertinent questions.
> Questions like: why are we falling behind like this in the first
> place? And: If we don't catch up soon, why should we be able to catch
> up later on? Falling behind on freezing is still a huge problem with
> 64-bit XIDs.

Is the example I provided above wrong?

"""
Consider the case when you run a slow OLAP query that takes 12h to
complete and 100K TPS of fast OLTP-type queries on the same system.
The fast queries will consume all 32-bit XIDs in less than 12 hours,
while the OLAP query started 12 hours ago didn't finish yet and thus
its tuples can't be frozen.
"""

If it is, please let me know. I would very much like to know if my
understanding here is flawed.

-- 
Best regards,
Aleksander Alekseev




Avoid streaming the transaction which are skipped (in corner cases)

2022-11-25 Thread Dilip Kumar
During DecodeCommit() for skipping a transaction we use ReadRecPtr to
check whether to skip this transaction or not.  Whereas in
ReorderBufferCanStartStreaming() we use EndRecPtr to check whether to
stream or not. Generally it will not create a problem but if the
commit record itself is adding some changes to the transaction (e.g.
snapshot) and if the "start_decoding_at" is in between ReadRecPtr and
EndRecPtr then streaming will decide to stream the transaction whereas
DecodeCommit will decide to skip it.  And for handling this case in
ReorderBufferForget() we call stream_abort().

So ideally if we are planning to skip the transaction we should never
stream it hence there is no need to stream abort such transaction in
case of skip.

In this patch I have fixed the skip condition in the streaming case
and also added an assert inside ReorderBufferForget() to ensure that
the transaction should have never been streamed.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 20cc1084c4943bdaf23753f2a7d9add22097ed95 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 25 Nov 2022 13:11:44 +0530
Subject: [PATCH v1] Fix thinko in when to stream a transaction

Actually, during DecodeCommit() for skipping a transaction we use
ReadRecPtr to check whether to skip this transaction or not.  Whereas
in ReorderBufferCanStartStreaming() we use EndRecPtr to check whether
to stream or not. Generally it will not create a problem but if the
commit record itself is adding some changes to the transaction (e.g. snapshot)
and if the start_decoding_at is in between ReadRecPtr and EndRecPtr then
streaming will decide to stream the transaction where as DecodeCommit will
decide to skip it.  And for handling this case in ReorderBufferForget() we
call stream_abort() in order to abort any streamed changes.  So ideally if
we are planning to skip the transaction we should never stream it hence there
is no need to stream abort such transaction in case of skip.

In this patch I have fixed the skip condition in streaming case and also
added an assert inside ReorderBufferForget() to ensure that the transaction
should have never been streamed.
---
 src/backend/replication/logical/reorderbuffer.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 31f7381..ddd5db0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2942,9 +2942,8 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	if (txn == NULL)
 		return;
 
-	/* For streamed transactions notify the remote node about the abort. */
-	if (rbtxn_is_streamed(txn))
-		rb->stream_abort(rb, txn, lsn);
+	/* the transaction which is being skipped shouldn't have been streamed */
+	Assert(!rbtxn_is_streamed(txn));
 
 	/* cosmetic... */
 	txn->final_lsn = lsn;
@@ -3919,7 +3918,7 @@ ReorderBufferCanStartStreaming(ReorderBuffer *rb)
 	 * restarting.
 	 */
 	if (ReorderBufferCanStream(rb) &&
-		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
+		!SnapBuildXactNeedsSkip(builder, ctx->reader->ReadRecPtr))
 		return true;
 
 	return false;
-- 
1.8.3.1



Re: Remove a unused argument from qual_is_pushdown_safe

2022-11-25 Thread Richard Guo
On Fri, Nov 25, 2022 at 2:27 PM Yugo NAGATA  wrote:

> I found that qual_is_pushdown_safe() has an argument "subquery"
> that is not used in the function.  This argument has not been
> referred to since the commit 964c0d0f80e485dd3a4073e073ddfd9bfdda90b2.
>
> I think we can remove this if there is no special reason.


+1.  In 964c0d0f the checks in qual_is_pushdown_safe() that need to
reference 'subquery' were moved to subquery_is_pushdown_safe(), so param
'subquery' is not used any more.  I think we can just remove it.

I wonder if we need to revise the comment atop qual_is_pushdown_safe()
too which says

 * rinfo is a restriction clause applying to the given subquery (whose RTE
 * has index rti in the parent query).

since there is no 'given subquery' after we remove it from the params.

Thanks
Richard


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-25 Thread John Naylor
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada 
wrote:
>
> So it seems that there are two candidates of rt_node structure: (1)
> all nodes except for node256 are variable-size nodes and use pointer
> tagging, and (2) node32 and node128 are variable-sized nodes and do
> not use pointer tagging (fanout is in part of only these two nodes).
> rt_node can be 5 bytes in both cases. But before going to this step, I
> started to verify the idea of variable-size nodes by using 6-bytes
> rt_node. We can adjust the node kinds and node classes later.

First, I'm glad you picked up the size class concept and expanded it. (I
have some comments about some internal APIs below.)

Let's leave the pointer tagging piece out until the main functionality is
committed. We have all the prerequisites in place, except for a benchmark
random enough to demonstrate benefit. I'm still not quite satisfied with
how the shared memory coding looked, and that is the only sticky problem we
still have, IMO. The rest is "just work".

That said, (1) and (2) above are still relevant -- variable sizing any
given node is optional, and we can refine as needed.

> Overall, the idea of variable-sized nodes is good, smaller size
> without losing search performance.

Good.

> I'm going to check the load
> performance as well.

Part of that is this, which gets called a lot more now, when node1 expands:

+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);

Since memset for expanding size class is now handled separately, these can
use the non-zeroing versions. When compiling MemoryContextAllocZero, the
compiler has no idea how big the size is, so it assumes the worst and
optimizes for large sizes. On x86-64, that means using "rep stos",
which calls microcode found in the CPU's ROM. This is slow for small sizes.
The "init" function should be always inline with const parameters where
possible. That way, memset can compile to a single instruction for the
smallest node kind. (More on alloc/init below)
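
To illustrate (a sketch only, with the patch's names used loosely):

/*
 * Always-inline so that "size" is a compile-time constant at each call
 * site, letting the compiler turn the memset into a few stores instead
 * of "rep stos".
 */
static inline void
rt_init_node(rt_node *node, uint8 kind, size_t size)
{
	memset(node, 0, size);
	node->kind = kind;
}

/* e.g. at a call site where the concrete node struct is known: */
rt_init_node(newnode, kind, sizeof(rt_node_inner_32));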

Note, there is a wrinkle: As currently written inner_node128 searches the
child pointers for NULL when inserting, so when expanding from partial to
full size class, the new node must be zeroed (Worth fixing in the short
term. I thought of this while writing the proof-of-concept for size
classes, but didn't mention it.) Medium term, rather than special-casing
this, I actually want to rewrite the inner-node128 to be more similar to
the leaf, with an "isset" array, but accessed and tested differently. I
guarantee it's *really* slow now to load (maybe somewhat true even for
leaves), but I'll leave the details for later. Regarding node128 leaf, note
that it's slightly larger than a DSA size class, and we can trim it to fit:

node61:  6 + 256+(2) +16 +  61*8 =  768
node125: 6 + 256+(2) +16 + 125*8 = 1280

> I've attached the patches I used for the verification. I don't include
> patches for pointer tagging, DSA support, and vacuum integration since
> I'm investigating the issue on cfbot that Andres reported. Also, I've
> modified tests to improve the test coverage.

Sounds good. For v12, I think size classes have proven themselves, so v11's
0002/4/5 can be squashed. Plus, some additional comments:

+/* Return a new and initialized node */
+static rt_node *
+rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk,
bool inner)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, kind, inner);
+ rt_init_node(newnode, kind, shift, chunk, inner);
+
+ return newnode;
+}

I don't see the point of a function that just calls two functions.

+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node *
+rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node*newnode;
+
+ newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}

This, in turn, just calls a function that does _almost_ everything, and
additionally must set one member. This function should really be alloc-node
+ init-node + copy-common, where copy-common is like in the prototype:
+ newnode->node_shift = oldnode->node_shift;
+ newnode->node_chunk = oldnode->node_chunk;
+ newnode->count = oldnode->count;

And init-node should really be just memset + set kind + set initial fanout.
It has no business touching "shift" and "chunk". The callers rt_new_root,
rt_set_extend, and rt_extend set some values of their own anyway, so let
them set those, too -- it might even improve readability.
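
i.e. something like (a sketch; "rt_copy_node_common" is a name I'm
making up here for the copy-common step):

static rt_node *
rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
{
	rt_node    *newnode = rt_alloc_node(tree, new_kind, node->shift > 0);

	rt_init_node(newnode, new_kind);	/* memset + kind + initial fanout */
	rt_copy_node_common(newnode, node); /* shift, chunk, count */

	return newnode;
}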

-   if (n32->base.n.fanout ==
rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+   if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))

This macro doesn't really improve readability -- it obscures what is being
tested, and the name implies the "else" branch