from:"Alex"

Re: Doc fix of aggressive vacuum threshold for multixact members storage

2025-03-04 Thread Alex Friedman

Good points, thank you. I'm good with going ahead as you've suggested.


Best regards,

Alex Friedman

Re: Doc fix of aggressive vacuum threshold for multixact members storage

2025-03-04 Thread Alex Friedman

>I decided to leave this out, since I just remembered that the most
> likely change is actually to move to 64-bit offsets, as was proposed
> here and has some enthusiastic support:
>
> https://www.postgresql.org/message-id/CACG=ezawg7_nt-8ey4akv2w9lculthhknwcawmbgeetnjrj...@mail.gmail.com

Thanks for the review and the draft, looks good to me, and I'm okay
with doing this without the code comments. However, it seems like that
thread is just the beginning of wider changes (if they indeed happen),
which may impact these calculations as well, and then maybe a doc
update reminder may come in useful?

Best regards,

Alex Friedman

Re: Doc: clarify possibility of ephemeral discrepancies between state and wait_event in pg_stat_activity

2025-03-02 Thread Alex Friedman


discrepancy will look like. What about we do something much more
simplified, such
as the below:

"""
To keep the reporting overhead low, the system does not attempt to synchronize
activity data for a backend. As a result, ephemeral discrepancies may
exist between
the view’s columns.
"""


Yes, I believe it makes sense to make it more generic. Attached v3 with a slight 
tweak:


+in the system. To keep the reporting overhead low, the system does not 
attempt to
+synchronize different aspects of activity data for a backend. As a result, 
ephemeral

+discrepancies may exist between the view's columns.


Best regards,

Alex FriedmanFrom 58de88469f6201ae698ee34debcdec028526a72a Mon Sep 17 00:00:00 2001
From: Alex Friedman 
Date: Wed, 26 Feb 2025 19:59:59 +0200
Subject: [PATCH v3] Clarify possibility of ephemeral discrepancies between
 state and wait_event in pg_stat_activity.

---
 doc/src/sgml/monitoring.sgml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9178f1d34ef..0e34b3509b8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1016,7 +1016,9 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
 it may or may not be waiting on some event.  If the 
state
 is active and wait_event is 
non-null, it
 means that a query is being executed, but is being blocked somewhere
-in the system.
+in the system. To keep the reporting overhead low, the system does not 
attempt to
+synchronize different aspects of activity data for a backend. As a result, 
ephemeral
+discrepancies may exist between the view's columns.

   
 
-- 
2.41.0

Re: Doc: clarify possibility of ephemeral discrepancies between state and wait_event in pg_stat_activity

2025-02-26 Thread Alex Friedman


On 26/02/2025 22:00, Sami Imseih wrote:

If we do need to document anything, which I am not convinced we should,
it should be more generic.


Thanks for the feedback, I've attached a v2 patch which has wording that's a bit 
more generic.


It's also worth noting that pg_locks already has a full paragraph explaining 
inconsistencies, so in my opinion it's worth it at least mentioning something 
similar here for pg_stat_activity.



Best regards,

Alex FriedmanFrom fbbfc623e16ed97176c0ccf0ebc534d118e9f252 Mon Sep 17 00:00:00 2001
From: Alex Friedman 
Date: Wed, 26 Feb 2025 19:59:59 +0200
Subject: [PATCH v2] Clarify possibility of ephemeral discrepancies between
 state and wait_event in pg_stat_activity.

---
 doc/src/sgml/monitoring.sgml | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9178f1d34ef..de49769d407 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1016,7 +1016,11 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
 it may or may not be waiting on some event.  If the 
state
 is active and wait_event is 
non-null, it
 means that a query is being executed, but is being blocked somewhere
-in the system.
+in the system. To keep the reporting low-overhead, the system uses very 
lightweight
+synchronization. As a result, ephemeral discrepancies between the view's 
columns,
+for example between wait_event and
+state, or between 
state and
+query_id, are possible by nature.

   
 
-- 
2.41.0

Doc: clarify possibility of ephemeral discrepancies between state and wait_event in pg_stat_activity

2025-02-26 Thread Alex Friedman


Hi,

This small doc change patch is following up on a past discussion about 
discrepancies between state and wait_event in pg_stat_activity:


https://www.postgresql.org/message-id/flat/ab1c0a7d-e789-5ef5-1180-42708ac6fe2d%40postgrespro.ru


As this kind of question is raised by PG users from time to time, the goal is to 
clarify that such discrepancies are to be expected. The attached patch reuses 
Robert Haas's eloquent wording from his response in the above thread. I've tried 
to keep it short and to the point, but it can be made more verbose if needed.



Best regards,

Alex FriedmanFrom 3cab620d67d200ff4ccb1870f63cbf75a50d0df6 Mon Sep 17 00:00:00 2001
From: Alex Friedman 
Date: Wed, 26 Feb 2025 19:59:59 +0200
Subject: [PATCH v1] Clarify possibility of ephemeral discrepancies between
 state and wait_event in pg_stat_activity.

---
 doc/src/sgml/monitoring.sgml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9178f1d34ef..57fcd8ab52b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1016,7 +1016,9 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
 it may or may not be waiting on some event.  If the 
state
 is active and wait_event is 
non-null, it
 means that a query is being executed, but is being blocked somewhere
-in the system.
+in the system. To keep the reporting low-overhead, the system uses very 
lightweight
+synchronization. As a result, ephemeral discrepancies between 
wait_event
+and state are possible by nature.

   
 
-- 
2.41.0

Re: Doc fix of aggressive vacuum threshold for multixact members storage

2025-02-26 Thread Alex Friedman


Hi John,

Thanks for reviewing.


It seems at a minimum this one-line patch is sufficient for the correction:

- storage occupied by multixacts members exceeds 2GB, aggressive vacuum
+ storage occupied by multixacts members exceeds about 10GB,
aggressive vacuum

Commit c552e171d16e removed the percentage as part of a judgment call
on clarity, and I'm not sure that was wrong.

We could add the proposed language on "can grow up to about 20GB" at
the end of this paragraph, which seems more natural -- first mention
the amount that triggers aggressive vacuum, then the maximum size.


Yes, I believe this can work.



I'm on the fence about putting a hint in the C file, but the
computation has changed in the past, see commit b4d4ce1d50bbdf , so
it's a reasonable idea.


That's a good find about the change. Taken together with Bertrand's comments, 
I've added two reminders to multixact.c to update the docs, one for the 
threshold and another for the multixact storage scheme. Please see if it makes 
sense.

v4 patch attached.


Best regards,

Alex Friedman
From 0965413dbb0b85e4dd78f87a6ca3847dccdc78c7 Mon Sep 17 00:00:00 2001
From: Alex Friedman 
Date: Thu, 30 Jan 2025 17:19:07 +0200
Subject: [PATCH v4] Doc fix of aggressive vacuum threshold for multixact
 members storage.

---
 doc/src/sgml/maintenance.sgml  | 5 +++--
 src/backend/access/transam/multixact.c | 7 +--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 0be90bdc7ef..89040942be2 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -793,10 +793,11 @@ HINT:  Execute a database-wide VACUUM in that database.
  As a safety device, an aggressive vacuum scan will
  occur for any table whose multixact-age is greater than .  Also, if the
- storage occupied by multixacts members exceeds 2GB, aggressive vacuum
+ storage occupied by multixacts members exceeds about 10GB, aggressive 
vacuum
  scans will occur more often for all tables, starting with those that
  have the oldest multixact-age.  Both of these kinds of aggressive
- scans will occur even if autovacuum is nominally disabled.
+ scans will occur even if autovacuum is nominally disabled. The members 
storage
+ area can grow up to about 20GB before reaching wraparound.
 
 
 
diff --git a/src/backend/access/transam/multixact.c 
b/src/backend/access/transam/multixact.c
index 27ccdf9500f..66adb9995d0 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -134,7 +134,8 @@ MultiXactIdToOffsetSegment(MultiXactId multi)
  * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
  * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
  * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and
- * performance) trumps space efficiency here.
+ * performance) trumps space efficiency here. If this computation changes, make
+ * sure to update the documentation.
  *
  * Note that the "offset" macros work with byte offset, not array indexes, so
  * arithmetic must be done using "char *" pointers.
@@ -212,7 +213,9 @@ MXOffsetToMemberOffset(MultiXactOffset offset)
member_in_group * sizeof(TransactionId);
 }
 
-/* Multixact members wraparound thresholds. */
+/* Multixact members wraparound thresholds.
+ * When changing the thresholds, make sure to update the documentation.
+ */
 #define MULTIXACT_MEMBER_SAFE_THRESHOLD(MaxMultiXactOffset / 2)
 #define MULTIXACT_MEMBER_DANGER_THRESHOLD  \
(MaxMultiXactOffset - MaxMultiXactOffset / 4)
-- 
2.41.0

A small correction to doc and comment of FSM for indexes

2025-02-25 Thread Alex Friedman


Hi,

This patch fixes a couple of small inaccuracies in the doc and the comment for 
FSM about index handling.

1. In the doc for pg_freespacemap, it currently says:


For indexes, what is tracked is entirely-unused pages, rather than free space 
within pages. Therefore, the values are not meaningful, just whether a page is 
full or empty.


However, as what is tracked is entirely-unused pages, the values mean whether a page is 
"in-use or empty", rather than "full or empty".


2. In indexfsm.c the header comment says:


 * This is similar to the FSM used for heap, in freespace.c, but instead
 * of tracking the amount of free space on pages, we only track whether
 * pages are completely free or in-use. We use the same FSM implementation
 * as for heaps, using BLCKSZ - 1 to denote used pages, and 0 for unused.


However, in the code we see that used pages are marked with 0:


/*
 * RecordUsedIndexPage - mark a page as used in the FSM
 */
void
RecordUsedIndexPage(Relation rel, BlockNumber usedBlock)
{
RecordPageWithFreeSpace(rel, usedBlock, 0);
}


And free pages are marked with BLCKSZ - 1:


/*
 * RecordFreeIndexPage - mark a page as free in the FSM
 */
void
RecordFreeIndexPage(Relation rel, BlockNumber freeBlock)
{
RecordPageWithFreeSpace(rel, freeBlock, BLCKSZ - 1);
}


And so, this patch also fixes the comment's "using BLCKSZ - 1 to denote used pages, and 0 for 
unused" to be "using 0 to denote used pages, and BLCKSZ - 1 for unused".

While these changes are minor, I've seen how this can cause a bit of confusion, 
and it would be good to clarify it.


Best regards,

Alex FriedmanFrom a1b78438343fca053aa0014687eaba34d5e160e0 Mon Sep 17 00:00:00 2001
From: Alex Friedman 
Date: Tue, 25 Feb 2025 19:12:53 +0200
Subject: [PATCH v1] A small correction to doc and comment of FSM for indexes.

---
 doc/src/sgml/pgfreespacemap.sgml | 2 +-
 src/backend/storage/freespace/indexfsm.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/pgfreespacemap.sgml b/doc/src/sgml/pgfreespacemap.sgml
index 829ad60f32f..3774a9f8c6b 100644
--- a/doc/src/sgml/pgfreespacemap.sgml
+++ b/doc/src/sgml/pgfreespacemap.sgml
@@ -67,7 +67,7 @@
   
For indexes, what is tracked is entirely-unused pages, rather than free
space within pages.  Therefore, the values are not meaningful, just
-   whether a page is full or empty.
+   whether a page is in-use or empty.
   
  
 
diff --git a/src/backend/storage/freespace/indexfsm.c 
b/src/backend/storage/freespace/indexfsm.c
index 1fc263892a7..3cd2437599d 100644
--- a/src/backend/storage/freespace/indexfsm.c
+++ b/src/backend/storage/freespace/indexfsm.c
@@ -16,7 +16,7 @@
  * This is similar to the FSM used for heap, in freespace.c, but instead
  * of tracking the amount of free space on pages, we only track whether
  * pages are completely free or in-use. We use the same FSM implementation
- * as for heaps, using BLCKSZ - 1 to denote used pages, and 0 for unused.
+ * as for heaps, using 0 to denote used pages, and BLCKSZ - 1 for unused.
  *
  *-
  */
-- 
2.41.0

Re: Doc fix of aggressive vacuum threshold for multixact members storage

2025-02-04 Thread Alex Friedman


A few paragraphs up the docs, there is this mention:

". There is a separate storage area which holds the list of members
in each multixact, which also uses a 32-bit counter and which must
also be managed."

Maybe we can add more to this paragraph, such as:

"also be managed. This member can grow to 20GB"

And then in the proposed correction:

"
Also, if the storage occupied by multixacts members area exceeds 10GB
(50% of the maximum the members area can grow), aggressive vacuum
scans will occur more often for all tables
"

What do you think?



Looks good to me, attached a v2 patch with small adjustments.


Thanks,

AlexFrom 3deda711bb4219089b32204c567e735b3d7a152b Mon Sep 17 00:00:00 2001
From: Alex Friedman 
Date: Thu, 30 Jan 2025 17:19:07 +0200
Subject: [PATCH v2] Doc fix of aggressive vacuum threshold for multixact
 members storage.

---
 doc/src/sgml/maintenance.sgml | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 0be90bdc7ef..f4f560bccc1 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -761,7 +761,8 @@ HINT:  Execute a database-wide VACUUM in that database.
  careful aging management, storage cleanup, and wraparound handling.
  There is a separate storage area which holds the list of members in
  each multixact, which also uses a 32-bit counter and which must also
- be managed.
+ be managed. This members storage area can grow up to about 20GB before 
+ reaching wraparound.
 
 
 
@@ -792,9 +793,9 @@ HINT:  Execute a database-wide VACUUM in that database.
 
  As a safety device, an aggressive vacuum scan will
  occur for any table whose multixact-age is greater than .  Also, if the
- storage occupied by multixacts members exceeds 2GB, aggressive vacuum
- scans will occur more often for all tables, starting with those that
+ linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the 
storage occupied
+ by multixacts members exceeds about 10GB (50% of the maximum the members 
area can grow),
+ aggressive vacuum scans will occur more often for all tables, starting 
with those that
  have the oldest multixact-age.  Both of these kinds of aggressive
  scans will occur even if autovacuum is nominally disabled.
 
-- 
2.41.0

Re: Doc fix of aggressive vacuum threshold for multixact members storage

2025-02-03 Thread Alex Friedman

Hi Sami,

Thanks for the feedback.

> 1/ Remove this as
> "(50% of the maximum, which is about 20GB),"
>
> [1] tried to avoid explaining this level of detail, and I
> agree with that.

I feel it is critical for users to know what is the hard limit of
multixact members. As PG doesn't (yet) expose how many multixact
members are in use, the only way for users to know the distance to
members wraparound is by monitoring the members directory space usage.
So it seems to me that the 20 GB number is very important to have in
the docs.

> 2/ c/"about 10GB"/"10GB" the "about" does not seem necessary here.

The threshold is actually ~10.015 GiB (due to the 12 bytes wasted per
8KB page), or ~10.75 GB, so to avoid confusion by users when
aggressive autovacuum doesn't trigger exactly at 10GB, I believe we
should either be exact, or say that we are not being exact. Being
exact is difficult as it depends on the block size. And as I looked
through the doc page in question, I noticed there are already several
cases using the "about" wording, e.g. "about 50MB of pg_xact storage"
and "about 2GB of pg_commit_ts storage", so here I went for
consistency with the rest of the doc.

Thanks,

Alex

Doc fix of aggressive vacuum threshold for multixact members storage

2025-02-02 Thread Alex Friedman

Hi,

This patch suggests a correction to the doc page dealing with multixact
vacuuming, which, starting with PG 14, says that the multixact members
storage threshold for aggressive vacuum is 2 GB. However, I believe the
threshold is actually about 10 GB.

MultiXactMemberFreezeThreshold() defines the threshold as 2^32 (0x)
/ 2 or 2^31 multixact members. However, as discussed in multixact.c,
multixact members are stored in groups of 4, each group taking up 20 bytes,
meaning 5 bytes per member. (This is not quite exact as 12 bytes per 8 KB
page are wasted, but I believe it is close enough for the docs.)

This makes the threshold in bytes be 2^31 multixact members * 5 bytes per
member = 10 GiB. It was also confirmed by observing a live system (with an
admittedly unfortunate workload pattern). Also, the maximum storage size
for multixact members is 20 GiB (2^32 * 5), and it should be useful to call
this out in the doc as well.

For reference, the original commit which introduced the current wording is
c552e17, and the discussion was here:
https://www.postgresql.org/message-id/flat/162395467510.686.11947486273299446208%40wrigleys.postgresql.org

The attached patch is against master, but it should probably be backpatched
all the way through 14.

Best regards,

Alex Friedman


v1-0001-Doc-fix-of-aggressive-vacuum-threshold-for-multix.patch
Description: Binary data

Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-21 Thread alex work

First of all thank you for looking into this.

At the moment we workaround the problem by altering `acc` ROLE into a SUPERUSER
in PostgreSQL 16 instances. It sidestep the problem and having the lowest cost
to implement for us. While at first we think this feels like opening a security
hole, it does not introduce side effects for **our use case** by the way our
application make use of this `acc` ROLE.

Of course we cannot recommend the workaround we took to others having similar
situation.

On Fri, Mar 22, 2024 at 7:59 AM Tom Lane  wrote:
>
> Nathan Bossart  writes:
> > On Thu, Mar 21, 2024 at 03:40:12PM -0500, Nathan Bossart wrote:
> >> On Thu, Mar 21, 2024 at 04:31:45PM -0400, Tom Lane wrote:
> >>> I don't think we have any really cheap way to de-duplicate the role
> >>> OIDs, especially seeing that it has to be done on-the-fly within the
> >>> collection loop, and the order of roles_list is at least potentially
> >>> interesting.  Not sure how to make further progress without a lot of
> >>> work.
>
> >> Assuming these are larger lists, this might benefit from optimizations
> >> involving SIMD intrinsics.
>
> > Never mind.  With the reproduction script, I'm only seeing a ~2%
> > improvement with my patches.
>
> Yeah, you cannot beat an O(N^2) problem by throwing SIMD at it.
>
> However ... I just remembered that we have a Bloom filter implementation
> in core now (src/backend/lib/bloomfilter.c).  How about using that
> to quickly reject (hopefully) most role OIDs, and only do the
> list_member_oid check if the filter passes?
>
> regards, tom lane

Re: [PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld.

2023-01-05 Thread Alex Fan

There is discussion in
https://github.com/riscv-non-isa/riscv-toolchain-conventions/issues/13 to
change the abi default, but not much attention for some time. The consensus
seems to be set the abi and extension explicitly.

> I recommend proposing a patch for adding such an API to LLVM.

I would like to try some time later. Jitlink allows lots of flexibility to
inspect each linking process, I feel myself don't know enough use cases to
propose a good enough c-abi for it.

The thing I am thinking is these patch to llvm will take some time to land
especially for abi and extension default. But jitlink and orc for riscv is
very mature since llvm-15, and even llvm-14 with two minor patches. It
would be good to have these bits, though ugly, so that postgresql jit can
work with llvm-15 as most distros are still moving to it.

cheers,
Alex Fan

On Sun, Dec 25, 2022 at 11:02 PM Andres Freund  wrote:

> Hi,
>
> On 2022-11-23 21:13:04 +1100, Alex Fan wrote:
> > > @@ -241,6 +246,40 @@ llvm_mutable_module(LLVMJitContext *context)
> > > context->module = LLVMModuleCreateWithName("pg");
> > > LLVMSetTarget(context->module, llvm_triple);
> > > LLVMSetDataLayout(context->module, llvm_layout);
> > > +#ifdef __riscv
> > > +#if __riscv_xlen == 64
> > > +#ifdef __riscv_float_abi_double
> > > +   abiname = "lp64d";
> > > +#elif defined(__riscv_float_abi_single)
> > > +   abiname = "lp64f";
> > > +#else
> > > +   abiname = "lp64";
> > > +#endif
> > > +#elif __riscv_xlen == 32
> > > +#ifdef __riscv_float_abi_double
> > > +   abiname = "ilp32d";
> > > +#elif defined(__riscv_float_abi_single)
> > > +   abiname = "ilp32f";
> > > +#else
> > > +   abiname = "ilp32";
> > > +#endif
> > > +#else
> > > +   elog(ERROR, "unsupported riscv xlen %d", __riscv_xlen);
> > > +#endif
> > > +   /*
> > > +* set this manually to avoid llvm defaulting to soft
> > > float and
> > > +* resulting in linker error: `can't link double-float
> > > modules
> > > +* with soft-float modules`
> > > +* we could set this for TargetMachine via MCOptions,
> but
> > > there
> > > +* is no C API for it
> > > +* ref:
>
> I think this is something that should go into the llvm code, rather than
> postgres.
>
>
> > > @@ -820,16 +861,21 @@ llvm_session_initialize(void)
> > > elog(DEBUG2, "LLVMJIT detected CPU \"%s\", with features
> \"%s\"",
> > >  cpu, features);
> > >
> > > +#ifdef __riscv
> > > +   reloc=LLVMRelocPIC;
> > > +   codemodel=LLVMCodeModelMedium;
> > > +#endif
>
> Same.
>
>
>
>
> > > +#ifdef USE_JITLINK
> > > +/*
> > > + * There is no public C API to create ObjectLinkingLayer for JITLINK,
> > > create our own
> > > + */
> > > +DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ExecutionSession,
> > > LLVMOrcExecutionSessionRef)
> > > +DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ObjectLayer,
> > > LLVMOrcObjectLayerRef)
>
> I recommend proposing a patch for adding such an API to LLVM.
>
>
>
> Greetings,
>
> Andres Freund
>

Re: [PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld.

2023-01-05 Thread Alex Fan

cpp#L235>,
so there is no abi issue. Big end or little end is encoded in target triple
like ppc64 (big endian), ppc64le (little endian), and a recent riscv64be
patch <https://reviews.llvm.org/D128612>. I guess that is why there are no
endian issues.
--
*From:* Thomas Munro 
*Sent:* Thursday, December 15, 2022 9:59:39 AM
*To:* David Rowley 
*Cc:* Alex Fan ; pgsql-hack...@postgresql.org <
pgsql-hack...@postgresql.org>; and...@anarazel.de ;
geidav...@gmail.com ; l...@swarm64.com 
*Subject:* Re: [PATCH] Enable using llvm jitlink as an alternative llvm jit
linker of old Rtdyld.

On Thu, Nov 24, 2022 at 12:08 AM David Rowley  wrote:
> On Wed, 23 Nov 2022 at 23:13, Alex Fan  wrote:
> > I am new to the postgres community and apologise for resending this as
the previous one didn't include patch properly and didn't cc reviewers
(maybe the reason it has been buried in mailing list for months)
>
> Welcome to the community!

+1

I don't know enough about LLVM or RISCV to have any strong opinions
here, but I have a couple of questions...  It looks like we have two
different things in this patch:

1.  Optionally use JITLink instead of RuntimeDyld for relocation.
>From what I can tell from some quick googling, that is necessary for
RISCV because they haven't got around to doing this yet:

https://reviews.llvm.org/D127842

Independently of that, it seems that
https://llvm.org/docs/JITLink.html is the future and RuntimeDyld will
eventually be obsolete, so one question I have is: why should we do
this only for riscv?

You mentioned that this change might be necessary to support COFF and
thus Windows.  I'm not a Windows user and I think it would be beyond
my pain threshold to try to get this working there by using CI alone,
but I'm just curious... wouldn't
https://github.com/llvm/llvm-project/blob/main/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldCOFF.cpp
work for that already?  (I haven't heard about anyone successfully
using PostgreSQL/LLVM on Windows; it would certainly be cool to hear
some more about what would be needed for that.)

2.  Manually adjust the CPU features and ABI/subtarget.

+#if defined(__riscv)
+/* getHostCPUName returns "generic-rv[32|64]", which lacks all
features */
+Features.AddFeature("m", true);
+Features.AddFeature("a", true);
+Features.AddFeature("c", true);
+# if defined(__riscv_float_abi_single)
+Features.AddFeature("f", true);
+# endif
+# if defined(__riscv_float_abi_double)
+Features.AddFeature("d", true);
+# endif
+#endif

I'm trying to understand this, and the ABI name selection logic.
Maybe there are two categories of features here?

The ABI bits, "f" and "d" are not just "which instructions can I
used", but they also affect the ABI (I guess something like: where
floats go in the calling convention), and they have to match the ABI
of the main executable to allow linking to succeed, right?  Probably a
stupid question: wouldn't the subtarget/ABI be the same as the one
that the LLVM library itself was compiled for (which must also match
the postgres executable), and doesn't it know that somewhere?  I guess
I'm confused about why we don't need to deal with this kind of manual
subtarget selection on any other architecture: for PPC it
automatically knows whether to be big endian/little endian, 32 or 64
bit, etc.

Then for "m", "a", "c", I guess these are code generation options -- I
think "c" is compressed instructions for example?  Can we get a
comment to say what they are?  Why do you think that all RISCV chips
have these features?  Perhaps these are features that are part of some
kind of server chip profile (ie features not present in a tiny
microcontroller chip found in a toaster, but expected in any system
that would actually run PostgreSQL) -- in which case can we get a
reference to explain that?

I remembered the specific reason why we have that
LLVMGethostCPUFeatures() call: it's because the list of default
features that would apply otherwise based on CPU "name" alone turned
out to assume that all x86 chips had AVX, but some low end parts
don't, so we have to check for AVX etc presence that way.  But your
patch seems to imply that LLVM is not able to get features reliably
for RISCV -- why not, immaturity or technical reason why it can't?

+assert(ES && "ES must not be null");

We use our own Assert() macro (capital A).
From 4e7a65d868de016cde7a8719c64eedcf66cc3405 Mon Sep 17 00:00:00 2001
From: Alex Fan 
Date: Mon, 29 Aug 2022 15:24:16 +0800
Subject: [PATCH] Enable using llvm jitlink as an alternative llvm jit linker
 of old Rtdyld.

This brings the bonus of support jitting on riscv64 (included in this patch)
Assume gc(imafdac) extension since generally it is

Re: [PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld.

2022-11-23 Thread Alex Fan

Hi,

I am new to the postgres community and apologise for resending this as the
previous one didn't include patch properly and didn't cc reviewers (maybe
the reason it has been buried in mailing list for months)

Adding to previous email, this patch exposes its own C API for creating
ObjectLinkingLayer in a similar fashion as
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager since orc
doesn't expose it yet.

Thanks and really appreciate if someone can offer a review to this and help
get it merged.

Cheers,
Alex


On Mon, Aug 29, 2022 at 5:46 PM Alex Fan  wrote:

> This brings the bonus of support jitting on riscv64 (included in this
> patch)
> and other platforms Rtdyld doesn't support, e.g. windows COFF.
>
> Currently, llvm doesn't expose jitlink (ObjectLinkingLayer) via C API, so
> a wrapper is added. This also adds minor llvm 15 compat fix that is needed
> ---
>  config/llvm.m4|  1 +
>  src/backend/jit/llvm/llvmjit.c| 67 +--
>  src/backend/jit/llvm/llvmjit_wrap.cpp | 35 ++
>  src/include/jit/llvmjit.h |  9 
>  4 files changed, 108 insertions(+), 4 deletions(-)
>
> diff --git a/config/llvm.m4 b/config/llvm.m4
> index 3a75cd8b4d..a31b8b304a 100644
> --- a/config/llvm.m4
> +++ b/config/llvm.m4
> @@ -75,6 +75,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
>engine) pgac_components="$pgac_components $pgac_component";;
>debuginfodwarf) pgac_components="$pgac_components $pgac_component";;
>orcjit) pgac_components="$pgac_components $pgac_component";;
> +  jitlink) pgac_components="$pgac_components $pgac_component";;
>passes) pgac_components="$pgac_components $pgac_component";;
>native) pgac_components="$pgac_components $pgac_component";;
>perfjitevents) pgac_components="$pgac_components $pgac_component";;
> diff --git a/src/backend/jit/llvm/llvmjit.c
> b/src/backend/jit/llvm/llvmjit.c
> index 6c72d43beb..d8b840da8c 100644
> --- a/src/backend/jit/llvm/llvmjit.c
> +++ b/src/backend/jit/llvm/llvmjit.c
> @@ -229,6 +229,11 @@ llvm_release_context(JitContext *context)
>  LLVMModuleRef
>  llvm_mutable_module(LLVMJitContext *context)
>  {
> +#ifdef __riscv
> +   const char* abiname;
> +   const char* target_abi = "target-abi";
> +   LLVMMetadataRef abi_metadata;
> +#endif
> llvm_assert_in_fatal_section();
>
> /*
> @@ -241,6 +246,40 @@ llvm_mutable_module(LLVMJitContext *context)
> context->module = LLVMModuleCreateWithName("pg");
> LLVMSetTarget(context->module, llvm_triple);
> LLVMSetDataLayout(context->module, llvm_layout);
> +#ifdef __riscv
> +#if __riscv_xlen == 64
> +#ifdef __riscv_float_abi_double
> +   abiname = "lp64d";
> +#elif defined(__riscv_float_abi_single)
> +   abiname = "lp64f";
> +#else
> +   abiname = "lp64";
> +#endif
> +#elif __riscv_xlen == 32
> +#ifdef __riscv_float_abi_double
> +   abiname = "ilp32d";
> +#elif defined(__riscv_float_abi_single)
> +   abiname = "ilp32f";
> +#else
> +   abiname = "ilp32";
> +#endif
> +#else
> +   elog(ERROR, "unsupported riscv xlen %d", __riscv_xlen);
> +#endif
> +   /*
> +* set this manually to avoid llvm defaulting to soft
> float and
> +* resulting in linker error: `can't link double-float
> modules
> +* with soft-float modules`
> +* we could set this for TargetMachine via MCOptions, but
> there
> +* is no C API for it
> +* ref:
> https://github.com/llvm/llvm-project/blob/afa520ab34803c82587ea6759bfd352579f741b4/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp#L90
> +*/
> +   abi_metadata = LLVMMDStringInContext2(
> +   LLVMGetModuleContext(context->module),
> +   abiname, strlen(abiname));
> +   LLVMAddModuleFlag(context->module,
> LLVMModuleFlagBehaviorOverride,
> +   target_abi, strlen(target_abi), abi_metadata);
> +#endif
> }
>
> return context->module;
> @@ -786,6 +825,8 @@ llvm_session_initialize(void)
> char   *error = NULL;
> char   *cpu = NULL;
> char   *features = NULL;
> +   LLVMRelocMode reloc=LLVMRelocDefault;
> +   LLVMCodeModel codemodel=LLVMCodeModelJITDefault;
> LLVMTargetMachineRef opt0_tm;
>

[PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld.

2022-08-29 Thread Alex Fan

This brings the bonus of support jitting on riscv64 (included in this patch)
and other platforms Rtdyld doesn't support, e.g. windows COFF.

Currently, llvm doesn't expose jitlink (ObjectLinkingLayer) via C API, so
a wrapper is added. This also adds minor llvm 15 compat fix that is needed
---
 config/llvm.m4|  1 +
 src/backend/jit/llvm/llvmjit.c| 67 +--
 src/backend/jit/llvm/llvmjit_wrap.cpp | 35 ++
 src/include/jit/llvmjit.h |  9 
 4 files changed, 108 insertions(+), 4 deletions(-)

diff --git a/config/llvm.m4 b/config/llvm.m4
index 3a75cd8b4d..a31b8b304a 100644
--- a/config/llvm.m4
+++ b/config/llvm.m4
@@ -75,6 +75,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
   engine) pgac_components="$pgac_components $pgac_component";;
   debuginfodwarf) pgac_components="$pgac_components $pgac_component";;
   orcjit) pgac_components="$pgac_components $pgac_component";;
+  jitlink) pgac_components="$pgac_components $pgac_component";;
   passes) pgac_components="$pgac_components $pgac_component";;
   native) pgac_components="$pgac_components $pgac_component";;
   perfjitevents) pgac_components="$pgac_components $pgac_component";;
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 6c72d43beb..d8b840da8c 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -229,6 +229,11 @@ llvm_release_context(JitContext *context)
 LLVMModuleRef
 llvm_mutable_module(LLVMJitContext *context)
 {
+#ifdef __riscv
+   const char* abiname;
+   const char* target_abi = "target-abi";
+   LLVMMetadataRef abi_metadata;
+#endif
llvm_assert_in_fatal_section();
 
/*
@@ -241,6 +246,40 @@ llvm_mutable_module(LLVMJitContext *context)
context->module = LLVMModuleCreateWithName("pg");
LLVMSetTarget(context->module, llvm_triple);
LLVMSetDataLayout(context->module, llvm_layout);
+#ifdef __riscv
+#if __riscv_xlen == 64
+#ifdef __riscv_float_abi_double
+   abiname = "lp64d";
+#elif defined(__riscv_float_abi_single)
+   abiname = "lp64f";
+#else
+   abiname = "lp64";
+#endif
+#elif __riscv_xlen == 32
+#ifdef __riscv_float_abi_double
+   abiname = "ilp32d";
+#elif defined(__riscv_float_abi_single)
+   abiname = "ilp32f";
+#else
+   abiname = "ilp32";
+#endif
+#else
+   elog(ERROR, "unsupported riscv xlen %d", __riscv_xlen);
+#endif
+   /*
+* set this manually to avoid llvm defaulting to soft float and
+* resulting in linker error: `can't link double-float modules
+* with soft-float modules`
+* we could set this for TargetMachine via MCOptions, but there
+* is no C API for it
+* ref: 
https://github.com/llvm/llvm-project/blob/afa520ab34803c82587ea6759bfd352579f741b4/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp#L90
+*/
+   abi_metadata = LLVMMDStringInContext2(
+   LLVMGetModuleContext(context->module),
+   abiname, strlen(abiname));
+   LLVMAddModuleFlag(context->module, 
LLVMModuleFlagBehaviorOverride,
+   target_abi, strlen(target_abi), abi_metadata);
+#endif
}
 
return context->module;
@@ -786,6 +825,8 @@ llvm_session_initialize(void)
char   *error = NULL;
char   *cpu = NULL;
char   *features = NULL;
+   LLVMRelocMode reloc=LLVMRelocDefault;
+   LLVMCodeModel codemodel=LLVMCodeModelJITDefault;
LLVMTargetMachineRef opt0_tm;
LLVMTargetMachineRef opt3_tm;
 
@@ -820,16 +861,21 @@ llvm_session_initialize(void)
elog(DEBUG2, "LLVMJIT detected CPU \"%s\", with features \"%s\"",
 cpu, features);
 
+#ifdef __riscv
+   reloc=LLVMRelocPIC;
+   codemodel=LLVMCodeModelMedium;
+#endif
+
opt0_tm =
LLVMCreateTargetMachine(llvm_targetref, llvm_triple, cpu, 
features,

LLVMCodeGenLevelNone,
-   
LLVMRelocDefault,
-   
LLVMCodeModelJITDefault);
+   reloc,
+   codemodel);
opt3_tm =
LLVMCreateTargetMachine(llvm_targetref, llvm_triple, cpu, 
features,

LLVMCodeGenLevelAggressive,
-   
LLVMRelocDefault,
-   
LLVMCodeModelJITDefault);
+   reloc,
+

libpq async command processing methods are difficult to use with edge-triggered epoll

2020-12-02 Thread Alex Robinson

Hi,

I've been using libpq to access postgres from within a system that uses an
edge-triggered epoll in its event-loop. The instructions on
https://www.postgresql.org/docs/current/libpq-async.html are pretty good,
but I've run into a bit of an edge case that would be much easier to handle
if the interfaces exposed by libpq were a little more ergonomic. If it
makes a difference, I'm running with single-row mode enabled on the client.

Specifically, PQconsumeInput returns no information that allows the caller
to distinguish whether there might be more data available to be read on the
network socket if PQconsumeInput were to be called again (it just returns 0
on error and 1 otherwise), and PQisBusy doesn't return false until a full
row's worth of data has been read by PQconsumeInput. This is a bad
combination if a result set contains rows larger than PGconn's default
buffer size, since calling PQconsumeInput followed by PQisBusy can suggest
that we need to wait on the socket's readability even if there's already
more data available to be read on the socket.

If more detail helps, here's a slightly more detailed summary based on my
trawling through the code:

* The recommended pattern for processing responses to async commands is to
wait until the socket is readable, then call PQconsumeInput, then check
PQisBusy. If PQisBusy returns true, the docs suggest waiting on the socket
again.
* When PQconsumeInput is called, it doubles the PGconn's buffer size if
less than 8k of space is unused in it.
* PQconsumeInput will then read either until its remaining buffer space
fills up or until the socket has no more data in it ready to be read.
* If the buffer fills up, PQconsumeInput will return to the caller even if
there's still more data to be read
* PQconsumeInput's return value only indicates whether or not there was an
error, not whether any data was read.
* PQisBusy will return true unless the buffer contains an entire row; it
does not actually check the status of the socket.
* If the PGconn's buffer wasn't large enough to fit an entire row in it
when you called PQconsumeInput, PQisBusy will return true, suggesting that
you ought to wait on socket readability, when really the right thing to do
would be to call PQconsumeInput again (potentially multiple times) until
the buffer finally grows to be large enough to contain the whole row before
PQisBusy can return false.

This can be worked around by making a poll() syscall on the socket without
timeout 0 before handing the socket off to epoll, but libpq could make this
case easier to deal with with a slightly different interface.

The function that PQconsumeInput uses internally, pqReadData, has a much
more helpful interface -- it returns 1 if it successfully read at least 1
byte, 0 if no data was available, or -1 if there was an error. If
PQconsumeInput had a similar range of return values, this ability to
distinguish between whether we read data or not would be enough information
to know whether we ought to call PQconsumeInput again when PQisBusy returns
true.

As for what we can realistically do about it, I imagine it may be
disruptive to add an additional possible return value to PQconsumeInput
(e.g. return 1 if at least one byte was read or 2 if no data was
available). And I'm not sure edge-triggered polling is a use case the
community cares enough about to justify adding a separate method that does
basically the same thing as PQconsumeInput but with more information
conveyed by its return value.

So maybe there isn't any change worth making here, but I wanted to at least
mention the problem, since it was pretty disappointing to me that calling
PQconsumeInput followed by PQisBusy and then waiting on readability
occasionally caused a hang on large rows. I'd expect other developers using
edge-triggered epoll to run into problems like this too. If there's a
backwards-compatible way of making it less error prone, it'd be nice to do
so.

Alex

Re: bad wal on replica / incorrect resource manager data checksum in record / zfs

2020-04-06 Thread Alex Malek

On Thu, Apr 2, 2020 at 2:10 PM Andres Freund  wrote:

> Hi,
>
> On 2020-02-19 16:35:53 -0500, Alex Malek wrote:
> > We are having a reoccurring issue on 2 of our replicas where replication
> > stops due to this message:
> > "incorrect resource manager data checksum in record at ..."
>
> Could you show the *exact* log output please? Because this could
> temporarily occur without signalling anything bad, if e.g. the
> replication connection goes down.
>

Feb 23 00:02:02 wrds-pgdata10-2-w postgres[68329]: [12491-1] 5e4aac44.10ae9
(@) LOG:  incorrect resource manager data checksum in record at
39002/57AC0338

When it occurred replication stopped.  The only way to resume replication
was to stop server and remove bad WAL file.

>
>
> > Right before the issue started we did some upgrades and altered some
> > postgres configs and ZFS settings.
> > We have been slowly rolling back changes but so far the the issue
> continues.
> >
> > Some interesting data points while debugging:
> > We had lowered the ZFS recordsize from 128K to 32K and for that week the
> > issue started happening every other day.
> > Using xxd and diff we compared "good" and "bad" wal files and the
> > differences were not random bad bytes.
> >
> > The bad file either had a block of zeros that were not in the good file
> at
> > that position or other data.  Occasionally the bad data has contained
> > legible strings not in the good file at that position. At least one of
> > those exact strings has existed elsewhere in the files.
> > However I am not sure if that is the case for all of them.
> >
> > This made me think that maybe there was an issue w/ wal file recycling
> and
> > ZFS under heavy load, so we tried lowering
> > min_wal_size in order to "discourage" wal file recycling but my
> > understanding is a low value discourages recycling but it will still
> > happen (unless setting wal_recycle in psql 12).
>
> This sounds a lot more like a broken filesystem than anythingon the PG
> level.
>

Probably. In my recent updated comment turning off ZFS compression on
master seems to have fixed the issue.
However I will note that the WAL file stored on the master was always fine
upon inspection.

>
>
> > When using replication slots, what circumstances would cause the master
> to
> > not save the WAL file?
>
> What do you mean by "save the WAL file"?
>

Typically, when using replication slots, when replication stops the master
will save the next needed WAL file.
However once or twice when this error occurred the master recycled/removed
the WAL file needed.
I suspect perhaps b/c the replica had started to read the WAL file it sent
some signal to the master that the WAL
file was already consumed.  I am guessing, not knowing exactly what is
happening and w/ the caveat that this
situation was rare and not the norm.  It is also possible caused by a
different error.

Thanks.
Alex

Re: bad wal on replica / incorrect resource manager data checksum in record / zfs

2020-04-02 Thread Alex Malek

On Wed, Feb 19, 2020 at 4:35 PM Alex Malek  wrote:

>
> Hello Postgres Hackers -
>
> We are having a reoccurring issue on 2 of our replicas where replication
> stops due to this message:
> "incorrect resource manager data checksum in record at ..."
> This has been occurring on average once every 1 to 2 weeks during large
> data imports (100s of GBs being written)
> on one of two replicas.
> Fixing the issue has been relatively straight forward: shutdown replica,
> remove the bad wal file, restart replica and
> the good wal file is retrieved from the master.
> We are doing streaming replication using replication slots.
> However twice now, the master had already removed the WAL file so the file
> had to retrieved from the wal archive.
>
> The WAL log directories on the master and the replicas are on ZFS file
> systems.
> All servers are running RHEL 7.7 (Maipo)
> PostgreSQL 10.11
> ZFS v0.7.13-1
>
> The issue seems similar to
> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
>  and to https://github.com/timescale/timescaledb/issues/1443
>
> One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS
> sees our array as a single device.
> 
> 
>


An update in case someone else encounters the same issue.

About 5 weeks ago, on the master database server, we turned off ZFS
compression for the volume where the WAL log resides.
The error has not occurred on any replica since.

Best,
Alex

Re: bad wal on replica / incorrect resource manager data checksum in record / zfs

2020-02-26 Thread Alex Malek

On Thu, Feb 20, 2020 at 12:01 PM Alex Malek  wrote:

> On Thu, Feb 20, 2020, 6:16 AM Amit Kapila  wrote:
>
>> On Thu, Feb 20, 2020 at 3:06 AM Alex Malek  wrote:
>> >
>> >
>> > Hello Postgres Hackers -
>> >
>> > We are having a reoccurring issue on 2 of our replicas where
>> replication stops due to this message:
>> > "incorrect resource manager data checksum in record at ..."
>> > This has been occurring on average once every 1 to 2 weeks during large
>> data imports (100s of GBs being written)
>> > on one of two replicas.
>> > Fixing the issue has been relatively straight forward: shutdown
>> replica, remove the bad wal file, restart replica and
>> > the good wal file is retrieved from the master.
>> > We are doing streaming replication using replication slots.
>> > However twice now, the master had already removed the WAL file so the
>> file had to retrieved from the wal archive.
>> >
>> > The WAL log directories on the master and the replicas are on ZFS file
>> systems.
>> > All servers are running RHEL 7.7 (Maipo)
>> > PostgreSQL 10.11
>> > ZFS v0.7.13-1
>> >
>> > The issue seems similar to
>> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
>> and to https://github.com/timescale/timescaledb/issues/1443
>> >
>> > One quirk in our ZFS setup is ZFS is not handling our RAID array, so
>> ZFS sees our array as a single device.
>> >
>> > Right before the issue started we did some upgrades and altered some
>> postgres configs and ZFS settings.
>> > We have been slowly rolling back changes but so far the the issue
>> continues.
>> >
>> > Some interesting data points while debugging:
>> > We had lowered the ZFS recordsize from 128K to 32K and for that week
>> the issue started happening every other day.
>> > Using xxd and diff we compared "good" and "bad" wal files and the
>> differences were not random bad bytes.
>> >
>> > The bad file either had a block of zeros that were not in the good file
>> at that position or other data.  Occasionally the bad data has contained
>> legible strings not in the good file at that position. At least one of
>> those exact strings has existed elsewhere in the files.
>> > However I am not sure if that is the case for all of them.
>> >
>> > This made me think that maybe there was an issue w/ wal file recycling
>> and ZFS under heavy load, so we tried lowering
>> > min_wal_size in order to "discourage" wal file recycling but my
>> understanding is a low value discourages recycling but it will still
>> > happen (unless setting wal_recycle in psql 12).
>> >
>>
>> We do print a message "recycled write-ahead log file .." in DEBUG2
>> mode.  You either want to run the server with DEBUG2 or maybe change
>> the code to make it LOG and see if that is printed.  If you do that,
>> you can verify if the corrupted WAL is the same as a recycled one.
>>
>
> Are you suggesting having the master, the replicas or all in debug mode?
> How much extra logging would this generate?
> A replica typically consumes over 1 TB of WAL files before a bad wal file
> is encountered.
>
>
>
>> > There is a third replica where this bug has not (yet?) surfaced.
>> > This leads me to guess the bad data does not originate on the master.
>> > This replica is older than the other replicas, slower CPUs, less RAM,
>> and the WAL disk array is spinning disks.
>> > The OS, version of Postgres, and version of ZFS are the same as the
>> other replicas.
>> > This replica is not using a replication slot.
>> > This replica does not serve users so load/contention is much lower than
>> the others.
>> > The other replicas often have 100% utilization of the disk array that
>> houses the (non-wal) data.
>> >
>> > Any insight into the source of this bug or how to address it?
>> >
>> > Since the master has a good copy of the WAL file, can the replica
>> re-request  the file from the master? Or from the archive?
>> >
>>
>> I think we do check in the archive if we get the error during
>> streaming, but archive might also have the same data due to which this
>> problem happens.  Have you checked that the archive WAL file, is it
>> different from the bad WAL?  See the
>
>
> Typically the master, the archive and the other replicas all have a good
> copy of

Fwd: bad wal on replica / incorrect resource manager data checksum in record / zfs

2020-02-20 Thread Alex Malek

On Thu, Feb 20, 2020, 6:16 AM Amit Kapila  wrote:

> On Thu, Feb 20, 2020 at 3:06 AM Alex Malek  wrote:
> >
> >
> > Hello Postgres Hackers -
> >
> > We are having a reoccurring issue on 2 of our replicas where replication
> stops due to this message:
> > "incorrect resource manager data checksum in record at ..."
> > This has been occurring on average once every 1 to 2 weeks during large
> data imports (100s of GBs being written)
> > on one of two replicas.
> > Fixing the issue has been relatively straight forward: shutdown replica,
> remove the bad wal file, restart replica and
> > the good wal file is retrieved from the master.
> > We are doing streaming replication using replication slots.
> > However twice now, the master had already removed the WAL file so the
> file had to retrieved from the wal archive.
> >
> > The WAL log directories on the master and the replicas are on ZFS file
> systems.
> > All servers are running RHEL 7.7 (Maipo)
> > PostgreSQL 10.11
> > ZFS v0.7.13-1
> >
> > The issue seems similar to
> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
> and to https://github.com/timescale/timescaledb/issues/1443
> >
> > One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS
> sees our array as a single device.
> >
> > Right before the issue started we did some upgrades and altered some
> postgres configs and ZFS settings.
> > We have been slowly rolling back changes but so far the the issue
> continues.
> >
> > Some interesting data points while debugging:
> > We had lowered the ZFS recordsize from 128K to 32K and for that week the
> issue started happening every other day.
> > Using xxd and diff we compared "good" and "bad" wal files and the
> differences were not random bad bytes.
> >
> > The bad file either had a block of zeros that were not in the good file
> at that position or other data.  Occasionally the bad data has contained
> legible strings not in the good file at that position. At least one of
> those exact strings has existed elsewhere in the files.
> > However I am not sure if that is the case for all of them.
> >
> > This made me think that maybe there was an issue w/ wal file recycling
> and ZFS under heavy load, so we tried lowering
> > min_wal_size in order to "discourage" wal file recycling but my
> understanding is a low value discourages recycling but it will still
> > happen (unless setting wal_recycle in psql 12).
> >
>
> We do print a message "recycled write-ahead log file .." in DEBUG2
> mode.  You either want to run the server with DEBUG2 or maybe change
> the code to make it LOG and see if that is printed.  If you do that,
> you can verify if the corrupted WAL is the same as a recycled one.
>

Are you suggesting having the master, the replicas or all in debug mode?
How much extra logging would this generate?
A replica typically consumes over 1 TB of WAL files before a bad wal file
is encountered.



> > There is a third replica where this bug has not (yet?) surfaced.
> > This leads me to guess the bad data does not originate on the master.
> > This replica is older than the other replicas, slower CPUs, less RAM,
> and the WAL disk array is spinning disks.
> > The OS, version of Postgres, and version of ZFS are the same as the
> other replicas.
> > This replica is not using a replication slot.
> > This replica does not serve users so load/contention is much lower than
> the others.
> > The other replicas often have 100% utilization of the disk array that
> houses the (non-wal) data.
> >
> > Any insight into the source of this bug or how to address it?
> >
> > Since the master has a good copy of the WAL file, can the replica
> re-request  the file from the master? Or from the archive?
> >
>
> I think we do check in the archive if we get the error during
> streaming, but archive might also have the same data due to which this
> problem happens.  Have you checked that the archive WAL file, is it
> different from the bad WAL?  See the


Typically the master, the archive and the other replicas all have a good
copy of the WAL file.

relevant bits of code in
> WaitForWALToBecomeAvailable especially the code near below comment:
>
> "Failure while streaming. Most likely, we got here because streaming
> replication was terminated, or promotion was triggered. But we also
> get here if we find an invalid record in the WAL streamed from master,
> in which case something is seriously wrong. There's little chance that
> the problem will jus

bad wal on replica / incorrect resource manager data checksum in record / zfs

2020-02-19 Thread Alex Malek

Hello Postgres Hackers -

We are having a reoccurring issue on 2 of our replicas where replication
stops due to this message:
"incorrect resource manager data checksum in record at ..."
This has been occurring on average once every 1 to 2 weeks during large
data imports (100s of GBs being written)
on one of two replicas.
Fixing the issue has been relatively straight forward: shutdown replica,
remove the bad wal file, restart replica and
the good wal file is retrieved from the master.
We are doing streaming replication using replication slots.
However twice now, the master had already removed the WAL file so the file
had to retrieved from the wal archive.

The WAL log directories on the master and the replicas are on ZFS file
systems.
All servers are running RHEL 7.7 (Maipo)
PostgreSQL 10.11
ZFS v0.7.13-1

The issue seems similar to
https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
and to https://github.com/timescale/timescaledb/issues/1443

One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS
sees our array as a single device.

Right before the issue started we did some upgrades and altered some
postgres configs and ZFS settings.
We have been slowly rolling back changes but so far the the issue continues.

Some interesting data points while debugging:
We had lowered the ZFS recordsize from 128K to 32K and for that week the
issue started happening every other day.
Using xxd and diff we compared "good" and "bad" wal files and the
differences were not random bad bytes.

The bad file either had a block of zeros that were not in the good file at
that position or other data. Occasionally the bad data has contained
legible strings not in the good file at that position. At least one of
those exact strings has existed elsewhere in the files.
However I am not sure if that is the case for all of them.

This made me think that maybe there was an issue w/ wal file recycling and
ZFS under heavy load, so we tried lowering
min_wal_size in order to "discourage" wal file recycling but my
understanding is a low value discourages recycling but it will still
happen (unless setting wal_recycle in psql 12).

There is a third replica where this bug has not (yet?) surfaced.
This leads me to guess the bad data does not originate on the master.
This replica is older than the other replicas, slower CPUs, less RAM, and
the WAL disk array is spinning disks.
The OS, version of Postgres, and version of ZFS are the same as the other
replicas.
This replica is not using a replication slot.
This replica does not serve users so load/contention is much lower than the
others.
The other replicas often have 100% utilization of the disk array that
houses the (non-wal) data.

Any insight into the source of this bug or how to address it?

Since the master has a good copy of the WAL file, can the replica
re-request the file from the master? Or from the archive?

When using replication slots, what circumstances would cause the master to
not save the WAL file?
(I can't remember if it always had the next wal file or the one after that)

Thanks in advance,
Alex Malek

Re: Corruption with duplicate primary key

2020-01-15 Thread Alex Adriaanse

On Thu., December 12, 2019 at 5:25 PM, Tomas Vondra wrote:
>On Wed, Dec 11, 2019 at 11:46:40PM +0000, Alex Adriaanse wrote:
>>On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
>>> At first I thought maybe this might be due to collations changing and
>>> breaking the index silently. What collation are you using?
>>
>>We're using en_US.utf8. We did not make any collation changes to my
>>knowledge.
>
>Well, the idea was more that glibc got updated and the collations
>changed because of that (without PostgreSQL having a chance to even
>notice that).

Closing the loop on this, I've investigated this some more and it turns out 
this is exactly what happened. As you suspected, the issue had nothing to do 
with pg_upgrade or PG12, but rather the glibc upgrade that was seen in Debian 
Buster. The postgres:10 and postgres:11 images are based on Debian Stretch, 
whereas postgres:12 is based on Buster.

When I kept the database on an older version of Postgres (10 or 11) but 
switched from the older Docker image to the postgres:12 or debian:buster(-slim) 
image, manually installing older Postgres packages inside those images, I saw 
index corruption there too.

Thanks for the input!

Alex

Re: Corruption with duplicate primary key

2019-12-11 Thread Alex Adriaanse

On Mon, December 9, 2019 at 11:05 AM Finnerty, Jim wrote:
> If you have BEFORE triggers, and a BEFORE trigger signaled failure with 
> RETURN NULL, then this is one known (and documented) issue that I think could 
> cause the behavior you're reporting:
>     
> https://www.postgresql-archive.org/BEFORE-triggers-that-return-NULL-can-circumvent-referential-integrity-tt6056390.html#none
> 
> It's hard to say if this is the cause or not, but if you have any BEFORE 
> triggers that RETURN NULL, you might want to review the documentation very 
> carefully.

We do have a BEFORE INSERT trigger, but it should never return NULL. This 
trigger INSERTs into a different table using an ON CONFLICT DO NOTHING clause 
and then does a RETURN NEW.

Alex

Re: Corruption with duplicate primary key

2019-12-11 Thread Alex Adriaanse

On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
> At first I thought maybe this might be due to collations
> changing and breaking the index silently. What collation are you using?

We're using en_US.utf8. We did not make any collation changes to my knowledge.

> 1) When you do the queries, do they use index scan or sequential scan?
> Perhaps it does sequential scan, and if you force index scan (e.g. by
> rewriting the query) it'll only find one of those rows.

By default it used an index scan. When I re-ran the query today (and confirmed 
that the query used an index only scan) I did not see any duplicates. If I 
force a sequential scan using "SET enable_index[only]scan = false" the 
duplicates reappear.

However, using a backup from a week ago I see duplicates in both the query that 
uses an index only scan as well as the query that uses the sequential scan. So 
somehow over the past week the index got changed to eliminate duplicates.

> 2) Can you check in backups if this data corruption was present in the
> PG10 cluster, before running pg_upgrade? 

Sure. I just checked and did not see any corruption in the PG10 pre-upgrade 
backup. I also re-upgraded that PG10 backup to PG12, and right after the 
upgrade I did not see any corruption either. I checked using both index scans 
and sequential scans.

Alex

Re: Corruption with duplicate primary key

2019-12-11 Thread Alex Adriaanse

On Thu, December 5, 2019 at 5:34 PM Peter Geoghegan wrote:
> > We have a Postgres 10 database that we recently upgraded to Postgres 12 
> > using pg_upgrade. We recently discovered that there are rows in one of the 
> > tables that have duplicate primary keys:
> 
> What's the timeline here? In other words, does it look like these rows
> were updated and/or deleted before, around the same time as, or after
> the upgrade?

The Postgres 12 upgrade was performed on 2019-11-22, so the affected rows were 
modified after this upgrade (although some of the rows were originally inserted 
before then, before they were modified/duplicated).

> > This database runs inside Docker, with the data directory bind-mounted to a 
> > reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1 
> > kernel inside an AWS EC2 instance. We have Debezium stream data from this 
> > database via pgoutput.
> 
> That seems suspicious, since reflink support for XFS is rather immature.

Good point. Looking at kernel commits since 4.19.16 it appears that there have 
been a few bug fixes in later kernel versions that address a few XFS corruption 
issues. Regardless of whether FS bugs are responsible of this corruption I'll 
plan on upgrading to a newer kernel.

> How did you invoke pg_upgrade? Did you use the --link (hard link) option?

Yes, we first created a backup using "cp -a --reflink=always", ran initdb on 
the new directory, and then upgraded using "pg_upgrade -b ... -B ... -d ... -D 
-k".

Alex

Corruption with duplicate primary key

2019-12-05 Thread Alex Adriaanse

We have a Postgres 10 database that we recently upgraded to Postgres 12 using 
pg_upgrade. We recently discovered that there are rows in one of the tables 
that have duplicate primary keys:

record_loader=# \d loader.sync
  Table "loader.sync"
  Column   |   Type   | Collation | Nullable | Default
---+--+---+--+-
 source| text |   | not null |
 natural_key   | text |   | not null |
 payload   | jsonb|   |  |
 dispatched| timestamp with time zone |   | not null | now()
 initial_load_id   | text |   |  |
 deleted_load_id   | text |   |  |
 created_timestamp | timestamp with time zone |   |  | now()
 updated_timestamp | timestamp with time zone |   |  | now()
 deleted_timestamp | timestamp with time zone |   |  |
Indexes:
"sync_pkey" PRIMARY KEY, btree (source, natural_key)
Publications:
"debezium"

This table is modified via triggers that fire off when a COPY command inserts 
many rows into another table.

Here are two example duplicate rows:

# SELECT xmin, xmax, cmin, cmax, source, md5(natural_key) AS natural_key_hash, 
dispatched, created_timestamp, updated_timestamp, deleted_timestamp FROM 
loader.sync WHERE (source, natural_key) = ('ok_lease', '...') ORDER BY 
xmin::text::int, cmin::text::int;
-[ RECORD 1 ]-+-
xmin  | 116649
xmax  | 0
cmin  | 5304404
cmax  | 5304404
source| ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched| 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |
-[ RECORD 2 ]-+-
xmin  | 116649
xmax  | 118583
cmin  | 5312208
cmax  | 5312208
source| ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched| 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-17 21:24:19.558219+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00

It appears that the second row was in place originally, then got updated by a 
trigger (and even deleted later on, although it doesn't appear that the delete 
transaction got committed), and then the first row was inserted within the same 
transaction that updated the second row.

Another example:
-[ RECORD 1 ]-+-
xmin  | 116649
xmax  | 0
cmin  | 5304403
cmax  | 5304403
source| ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched| 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-31 06:00:33.765547+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00
-[ RECORD 2 ]-+-
xmin  | 116649
xmax  | 0
cmin  | 5304404
cmax  | 5304404
source| ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched| 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |

Both examples have in common that the two duplicate rows were touched within 
the same transaction.

This database runs inside Docker, with the data directory bind-mounted to a 
reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1 
kernel inside an AWS EC2 instance. We have Debezium stream data from this 
database via pgoutput.

Recreating the primary key confirms that the constraint doesn't (or at least 
shouldn't) permit these duplicate rows:

record_loader=# BEGIN;
BEGIN
record_loader=# ALTER TABLE loader.sync DROP CONSTRAINT sync_pkey;
ALTER TABLE
record_loader=# ALTER TABLE loader.sync ADD CONSTRAINT sync_pkey PRIMARY KEY 
(source, natural_key);
ERROR:  could not create unique index "sync_pkey"
DETAIL:  Key (source, natural_key)=(ok_lease, ...) is duplicated.
CONTEXT:  parallel worker

Any ideas on what might cause this behavior?

Thanks,

Alex

TOAST corruption in standby database

2019-10-24 Thread Alex Adriaanse

 00 00 00 00 00  ||
*
0490  a6 07 7e 02 00 00 00 00  00 00 00 00 1b 00 61 5c  |..~...a\|
04a0  02 00 03 00 02 09 18 00  ae 9d d4 03 01 00 00 00  ||
04b0  d0 0a 00 00 23 25 10 07  88 02 13 0f 2c 04 78 01  |#%..,.x.|

Based on the above observations it seems to me that occasionally some of the 
changes aren't replicating to or persisting by the standby database. In the 
past I've seen some TCP packets get mangled or dropped between our EC2 
instances, leading to sudden disconnects. The standby connects to the primary 
using SSL (sslmode=require sslcompression=1) so I would think if there's any 
network-level corruption SSL would catch it, causing the connection to fail and 
reconnect. Outside of any SSL disconnects (which don't happen often), this 
database is stopped and restarted twice a week so we can clone it (using cp -a 
--reflink=always).

Any ideas on what might be causing this?

Thanks,

Alex

Re: understand the pg locks in in an simple case

2019-08-26 Thread Alex

On Tue, Aug 20, 2019 at 10:52 PM Alex  wrote:

>
>
> On Tue, Aug 20, 2019 at 4:59 PM Heikki Linnakangas 
> wrote:
>
>> On 20/08/2019 10:23, Alex wrote:
>> > I have troubles to understand the pg lock in the following simple
>> > situation.
>> >
>> >
>> > Session 1:
>> >
>> >
>> > begin;   update  tset  a=  1  where  a=  10;
>> >
>> >
>> > Session 2:
>> >
>> >
>> > begin;  update  tset  a=  2  where  a=  10;
>> >
>> >
>> > They update the same row and session 2 is blocked by session 1 without
>> > surprise.
>> >
>> >
>> > The pretty straight implementation is:
>> >
>> > Session 1 lock the the *tuple (ExclusiveLock)* mode.
>> >
>> > when session 2 lock it in exclusive mode,  it is blocked.
>> >
>> >
>> > But when I check the pg_locks: session 1.  I can see *no tuple
>> > lock*there,  when I check the session 2,   I can see a
>> > *tuple(ExclusiveLock) is granted*,  but it is waiting for a
>> transactionid.
>> >
>> >
>> > since every tuple has txn information,  so it is not hard to implement
>> > it this way.  but is there any benefits over the the straight way?
>> >   with the current implementation, what is the point
>> > of tuple(ExclusiveLock) for session 2?
>>
>> The reason that tuple locking works with XIDs, rather than directly
>> acquiring a lock on the tuple, is that the memory allocated for the lock
>> manager is limited. One transaction can lock millions of tuples, and if
>> it had to hold a normal lock on every tuple, you would run out of memory
>> very quickly.
>>
>
> Thank you!
>
> so can I understand that we don't need a lock on every tuple we updated
> since
> 1).  the number of lock may be  huge,  if we do so,  it will consume a lot
> of memory
> 2).  the tuple header which includes xid info are unavoidable due to MVCC
> requirement, and it can be used here, so we saved the individual lock
>
> and in my above example,  when session 2 waiting for a xid lock,  it is
> *granted* with a tuple lock with ExclusiveLock mode,  what is the purpose
> of this lock?
>

I will try to answer this question myself.  the purpose of the tuple lock
(with ExclusiveLock mode) is to protect there is no more than 1 client to
add the transaction lock on the same tuple at the same time.  once the txn
lock is added,  the tuple lock can be released.


So it may seem that we don't need heavy-weight locks on individual
>> tuples at all. But we still them to establish the order that backends
>> are waiting. The locking protocol is:
>>
>> 1. Check if a tuple's xmax is set.
>> 2. If it's set, obtain a lock on the tuple's TID.
>> 3. Wait on the transaction to finish, by trying to acquire lock on the
>> XID.
>> 4. Update the tuple, release the lock on the XID, and on the TID.
>>
>> It gets more complicated if there are multixids, or you update a row you
>> have earlier locked in a weaker mode, but that's the gist of it.
>>
>> We could skip the lock on the tuple's TID, but then if you have multiple
>> backends trying to update or lock a row, it would be not be
>> deterministic, who gets the lock first. For example:
>>
>> Session A: BEGIN; UPDATE foo SET col='a' WHERE id = 123;
>> Session B: UPDATE foo SET col='b' WHERE id = 123; 
>> Session C: UPDATE foo SET col='c' WHERE id = 123; 
>> Session A: ROLLBACK;
>>
>> Without the lock on the TID, it would be indeterministic, whether
>> session B or C gets to update the tuple, when A rolls back. With the
>> above locking protocol, B will go first. B will acquire the lock on the
>> TID, and block on the XID lock, while C will block on the TID lock held
>> by B. If there were more backends trying to do the same, they would
>> queue for the TID lock, too.
>>
>> - Heikki
>>
>

when the IndexScan reset to the next ScanKey for in operator

2019-08-21 Thread Alex

given the following example:
postgres=# create table t2 as select generate_series(1, 10) as a,
 generate_series(1, 10) as b;
SELECT 10
postgres=# create index t2_idx on t2(a);
CREATE INDEX
postgres=# set enable_seqscan = 0;
SET
postgres=# select * from t2 where a in (1, 10);
   a|   b
+
  1 |  1
 10 | 10
(2 rows)


I can see the plan stores the "1 and 10" information in
IndexScan->indexqual, which is an SCALARARRAYOPEXPR expression.

suppose the executor  should scan 1 first,  If all the tuples for 1 has
been scanned,  then **it should be reset to 10**  and scan again.
 however I can't find out the code for that.  looks index_rescan is not for
this.   am I miss something?

thanks

Re: Serialization questions

2019-08-20 Thread Alex

On Tue, Aug 20, 2019 at 4:47 PM Alex  wrote:

> Before understanding how postgres implements the serializable isolation
> level (I have see many paper related to it), I have question about how it
> should be.
>
>
> I mainly read the ideas from
> https://www.postgresql.org/docs/11/transaction-iso.html.
>
>
> In fact, this isolation level works exactly the same as Repeatable Read
> except that it monitors for conditions which could make execution of a
> concurrent set of serializable transactions behave in a manner inconsistent
> with all possible serial (one at a time) executions of those transactions.
>
>
>
> in repeatable read,  every statement will use the transaction start
> timestamp,  so is it in serializable isolation level?
>
>
> When relying on Serializable transactions to prevent anomalies, it is
> important that any data read from a permanent user table not be considered
> valid until the transaction which read it has successfully committed. This
> is true even for read-only transactions ...
>
>
> What does the "not be considered valid" mean?  and if it is a read-only
> transaction (assume T1),  I think it is ok to let other transaction do
> anything with the read set of T1, since it is invisible to T1(use the
> transaction start time as statement timestamp).
>

first issue "set default_transaction_isolation to 'serializable';" on the
both sessions,  then run:

Session 1:   begin;  select * from t;  (2 rows selected);
Session 2:   delete from t;   (committed automatically)
Session 1:  commit;  (commit successfully).

looks the reads in session 1 has no impact on the session 2 at all which is
conflicted with the document


> Thanks
>

Re: understand the pg locks in in an simple case

2019-08-20 Thread Alex

On Tue, Aug 20, 2019 at 4:59 PM Heikki Linnakangas  wrote:

> On 20/08/2019 10:23, Alex wrote:
> > I have troubles to understand the pg lock in the following simple
> > situation.
> >
> >
> > Session 1:
> >
> >
> > begin;   update  tset  a=  1  where  a=  10;
> >
> >
> > Session 2:
> >
> >
> > begin;  update  tset  a=  2  where  a=  10;
> >
> >
> > They update the same row and session 2 is blocked by session 1 without
> > surprise.
> >
> >
> > The pretty straight implementation is:
> >
> > Session 1 lock the the *tuple (ExclusiveLock)* mode.
> >
> > when session 2 lock it in exclusive mode,  it is blocked.
> >
> >
> > But when I check the pg_locks: session 1.  I can see *no tuple
> > lock*there,  when I check the session 2,   I can see a
> > *tuple(ExclusiveLock) is granted*,  but it is waiting for a
> transactionid.
> >
> >
> > since every tuple has txn information,  so it is not hard to implement
> > it this way.  but is there any benefits over the the straight way?
> >   with the current implementation, what is the point
> > of tuple(ExclusiveLock) for session 2?
>
> The reason that tuple locking works with XIDs, rather than directly
> acquiring a lock on the tuple, is that the memory allocated for the lock
> manager is limited. One transaction can lock millions of tuples, and if
> it had to hold a normal lock on every tuple, you would run out of memory
> very quickly.
>

Thank you!

so can I understand that we don't need a lock on every tuple we updated
since
1).  the number of lock may be  huge,  if we do so,  it will consume a lot
of memory
2).  the tuple header which includes xid info are unavoidable due to MVCC
requirement, and it can be used here, so we saved the individual lock

and in my above example,  when session 2 waiting for a xid lock,  it is
*granted* with a tuple lock with ExclusiveLock mode,  what is the purpose
of this lock?


> So it may seem that we don't need heavy-weight locks on individual
> tuples at all. But we still them to establish the order that backends
> are waiting. The locking protocol is:
>
> 1. Check if a tuple's xmax is set.
> 2. If it's set, obtain a lock on the tuple's TID.
> 3. Wait on the transaction to finish, by trying to acquire lock on the XID.
> 4. Update the tuple, release the lock on the XID, and on the TID.
>
> It gets more complicated if there are multixids, or you update a row you
> have earlier locked in a weaker mode, but that's the gist of it.
>
> We could skip the lock on the tuple's TID, but then if you have multiple
> backends trying to update or lock a row, it would be not be
> deterministic, who gets the lock first. For example:
>
> Session A: BEGIN; UPDATE foo SET col='a' WHERE id = 123;
> Session B: UPDATE foo SET col='b' WHERE id = 123; 
> Session C: UPDATE foo SET col='c' WHERE id = 123; 
> Session A: ROLLBACK;
>
> Without the lock on the TID, it would be indeterministic, whether
> session B or C gets to update the tuple, when A rolls back. With the
> above locking protocol, B will go first. B will acquire the lock on the
> TID, and block on the XID lock, while C will block on the TID lock held
> by B. If there were more backends trying to do the same, they would
> queue for the TID lock, too.
>
> - Heikki
>

Serialization questions

2019-08-20 Thread Alex

Before understanding how postgres implements the serializable isolation
level (I have see many paper related to it), I have question about how it
should be.


I mainly read the ideas from
https://www.postgresql.org/docs/11/transaction-iso.html.


In fact, this isolation level works exactly the same as Repeatable Read
except that it monitors for conditions which could make execution of a
concurrent set of serializable transactions behave in a manner inconsistent
with all possible serial (one at a time) executions of those transactions.


in repeatable read,  every statement will use the transaction start
timestamp,  so is it in serializable isolation level?


When relying on Serializable transactions to prevent anomalies, it is
important that any data read from a permanent user table not be considered
valid until the transaction which read it has successfully committed. This
is true even for read-only transactions ...


What does the "not be considered valid" mean?  and if it is a read-only
transaction (assume T1),  I think it is ok to let other transaction do
anything with the read set of T1, since it is invisible to T1(use the
transaction start time as statement timestamp).


Thanks

understand the pg locks in in an simple case

2019-08-20 Thread Alex

I have troubles to understand the pg lock in the following simple
situation.


Session 1:


begin;  update t set a = 1 where a = 10;


Session 2:


begin; update t set a = 2 where a = 10;


They update the same row and session 2 is blocked by session 1 without
surprise.


The pretty straight implementation is:

Session 1 lock the the *tuple (ExclusiveLock)* mode.

when session 2 lock it in exclusive mode,  it is blocked.


But when I check the pg_locks: session 1.  I can see *no tuple lock*
there,  when I check the session 2,   I can see a *tuple(ExclusiveLock) is
granted*,  but it is waiting for a transactionid.


since every tuple has txn information,  so it is not hard to implement it
this way.  but is there any benefits over the the straight way?   with the
current implementation, what is the point of tuple(ExclusiveLock) for
session 2?

pg can create duplicated index without any errors even warnning

2019-08-05 Thread Alex

postgres=# create table t (a int, b int);
CREATE TABLE
postgres=# create index m on t(a);
CREATE INDEX
postgres=# create index m2 on t(a);
CREATE INDEX
postgres=# \d t
 Table "demo.t"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
 b  | integer |   |  |
Indexes:
"m" btree (a)
"m2" btree (a)


is this by design?

run os command in pg_regress?

2019-07-04 Thread Alex

In my case, I want to sleep 3 seconds in xxx.sql for pg_regress program.
but I don't want to run 'select pg_sleep(3)' .   so it is possible for
pg_regress?

in psql, I can run \! sleep(3); exit;

but looks pg_regress doesn't support it.

Re: set parameter for all existing session

2019-06-12 Thread alex lock

On Wed, Jun 12, 2019 at 4:25 PM Pavel Stehule 
wrote:

> Hi
>
> st 12. 6. 2019 v 9:58 odesílatel alex lock  napsal:
>
>> I check the “alter database, alter role " and "set " command, but none of
>> them can set the parameters to all the existing sessions.   do we have a
>> way to do that?  looks the "assign_hook" can be used to customize this,  is
>> it a right way to do that?
>>
>>
> Maybe you miss to call pg_reload_conf();
>
> example:
>
> alter system set work_mem to '10MB';
> select pg_reload_conf();
>

Thanks,  it works!

>
> in other session you can:
>
> show work_mem;
>
> Regards
>
> Pavel
>

set parameter for all existing session

2019-06-12 Thread alex lock

I check the “alter database, alter role " and "set " command, but none of
them can set the parameters to all the existing sessions.   do we have a
way to do that?  looks the "assign_hook" can be used to customize this,  is
it a right way to do that?

Re: Why to index a "Recently DEAD" tuple when creating index

2019-06-10 Thread Alex

Thanks! Appreciate it for your time!

On Mon, Jun 10, 2019 at 5:34 PM Kuntal Ghosh 
wrote:

> On Mon, Jun 10, 2019 at 2:12 PM Alex  wrote:
> > On Mon, Jun 10, 2019 at 4:10 PM Kuntal Ghosh 
> wrote:
> >> I think what I'm trying to say is different.
> >>
> >> For my case, the sequence is as following:
> >> 1. Transaction A has deleted a tuple, say t1 and got committed.
> >> 2. Index A has been created successfully.
> >> 3. Now, transaction B starts and use the index A to fetch the tuple
> >> t1. While doing visibility check, transaction B gets to know that t1
> >> has been deleted by a committed transaction A, so it can't see the
> >> tuple. But, it creates a dependency edge that transaction A precedes
> >> transaction B. This edge is required to detect a serializable conflict
> >> failure.
> >>
> >> If you don't create the index entry, it'll not be able to create that
> edge.
> >
> >
> > Thanks,  I got the difference now, but still not get the necessity of it.
> > 1.   Assume we don't index it,  in which situation we can get a wrong
> result?
>
> Consider the following sequence of three different transactions X,A and B:
>
> 1. Transaction X reads a tuple t2.
> 2. Transaction A updates the tuple t2, deletes a tuple t1 and gets
> committed. So, there transaction X precedes transaction A, i.e., X <-
> A.
> 3. Index A is created successfully.
> 4. Transaction B starts and use the index A to fetch tuple t1. But,
> it's already deleted by the committed transaction A. So, transaction A
> precedes transaction B, i.e., A<-B.
> 5. At this point you've a dangerous structure X<-A<-B (definition of
> dangerous structure src/backend/storage/lmgr/README-SSI) in the graph
> which can produce an anomaly. For example now, if X tries to update
> another tuple previously read by B, you'll have a dependency B<-X.
> But, you already have X<-B which leads to serializable conflict.
> Postgres tries to resolve this anomaly by rolling back one of the
> transaction.
>
> In your case, it'll be difficult to detect.
>
> > 2.   If we only support "Read Committed" isolation level,  is there a
> safe way to not index such data?
> >
> I can't think of a case where the RECENTLY_DELETED tuple needs to be
> indexed in "Read Committed" case. So, your suggestion likely to work
> logically in "Read committed" isolation level. But, I'm not sure
> whether you'll encounter any assertion failures in vacuum path or
> concurrent index paths.
>
>
> --
> Thanks & Regards,
> Kuntal Ghosh
> EnterpriseDB: http://www.enterprisedb.com
>

Re: Why to index a "Recently DEAD" tuple when creating index

2019-06-10 Thread Alex

On Mon, Jun 10, 2019 at 4:10 PM Kuntal Ghosh 
wrote:

> On Mon, Jun 10, 2019 at 1:30 PM Alex  wrote:
> >
> >
> >
> > On Mon, Jun 10, 2019 at 3:28 PM Kuntal Ghosh 
> wrote:
> >>
> >> On Mon, Jun 10, 2019 at 12:15 PM Alex  wrote:
> >>>
> >>>  HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */
> >>>
> >>>  It is a tuple which has been deleted AND committed but before the
> delete there is a transaction started but not committed. Let call this
> transaction as Transaction A.
> >>>
> >>> if we create index on this time, Let's call this index as Index A， it
> still index this record.  my question is why need this.
> >>>
> >> In this case, the changes of the tuple is not visible yet. Now suppose,
> your transaction A is serializable and you've another serializable
> transaction B which can see the index A. It generates a plan that requires
> to fetch the deleted tuple through an index scan. If the tuple is not
> present in the index, how are you going to create a conflict edge between
> transaction A and transaction B?
> >>
> >> Basically, you need to identify the following clause to detect
> serializable conflicts:
> >> Transaction A precedes transaction B. (Because, transaction A has
> deleted a tuple and it's not visible to transaction B)
> >>
> >
> > thanks Ghosh.  Looks your answer is similar with my previous point
> (transaction is  serializable).   actually if the transaction B can't see
> the “deleted" which has been committed,  should it see the index A which is
> created after the "delete" transaction?
> >
> I think what I'm trying to say is different.
>
> For my case, the sequence is as following:
> 1. Transaction A has deleted a tuple, say t1 and got committed.
> 2. Index A has been created successfully.
> 3. Now, transaction B starts and use the index A to fetch the tuple
> t1. While doing visibility check, transaction B gets to know that t1
> has been deleted by a committed transaction A, so it can't see the
> tuple. But, it creates a dependency edge that transaction A precedes
> transaction B. This edge is required to detect a serializable conflict
> failure.
>
> If you don't create the index entry, it'll not be able to create that edge.
>

Thanks,  I got the difference now, but still not get the necessity of it.
1.   Assume we don't index it,  in which situation we can get a wrong
result?
2.   If we only support "Read Committed" isolation level,  is there a safe
way to not index such data?

-- 
> Thanks & Regards,
> Kuntal Ghosh
> EnterpriseDB: http://www.enterprisedb.com
>

Re: Why to index a "Recently DEAD" tuple when creating index

2019-06-10 Thread Alex

On Mon, Jun 10, 2019 at 3:28 PM Kuntal Ghosh 
wrote:

> On Mon, Jun 10, 2019 at 12:15 PM Alex  wrote:
>
>>  HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */
>>
>>  It is a tuple which has been deleted AND committed but before the delete
>> there is a transaction started but not committed. Let call this transaction
>> as Transaction A.
>>
>> if we create index on this time, Let's call this index as Index A， it
>> still index this record.  my question is why need this.
>>
>> In this case, the changes of the tuple is not visible yet. Now suppose,
> your transaction A is serializable and you've another serializable
> transaction B which can see the index A. It generates a plan that requires
> to fetch the deleted tuple through an index scan. If the tuple is not
> present in the index, how are you going to create a conflict edge between
> transaction A and transaction B?
>
> Basically, you need to identify the following clause to detect
> serializable conflicts:
> Transaction A precedes transaction B. (Because, transaction A has deleted
> a tuple and it's not visible to transaction B)
>
>
thanks Ghosh.  Looks your answer is similar with my previous point
(transaction is  serializable).   actually if the transaction B can't see
the “deleted" which has been committed,  should it see the index A which is
created after the "delete" transaction?


-- 
> Thanks & Regards,
> Kuntal Ghosh
> EnterpriseDB: http://www.enterprisedb.com
>

Why to index a "Recently DEAD" tuple when creating index

2019-06-09 Thread Alex

 HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */

 It is a tuple which has been deleted AND committed but before the delete
there is a transaction started but not committed. Let call this transaction
as Transaction A.

if we create index on this time, Let's call this index as Index A， it still
index this record.  my question is why need this.

The only reason I can think out (maybe also not reasonable enough) is:
If we index like this and the  isolate level of transaction A is
serializable， it is possible that the query in transaction A can use Index
A since it contains the snapshot data when the transaction A was began.
 this reason may be not reasonable enough is because the transaction A may
be should not see the index A at all.

some questions about fast-path-lock

2019-05-26 Thread Alex

I got some idea from the README under storage/lmgr and read some code of
LockAcquireExtended ,   but I still have some questions now.

LWLockAcquire(&MyProc->backendLock, LW_EXCLUSIVE);
if (FastPathStrongRelationLocks->count[fasthashcode] != 0)
acquired = false;
else
 acquired = FastPathGrantRelationLock(locktag->locktag_field2,
lockmode);

1.  In the README,   it says:  "A key point of this algorithm is that it
must be possible to verify the
absence of possibly conflicting locks without fighting over a shared LWLock
or
spinlock.  Otherwise, this effort would simply move the contention
bottleneck
from one place to another."

but in the code, there is LWLockAcquire in the above code.  Actually I
can't think out how can we proceed without a lock.

2.   Why does the MyProc->backendLock work?   it is MyProc not a global
lock.

3. for the line,acquired =
FastPathGrantRelationLock(locktag->locktag_field2,
lockmode);I think it should  be able to replaced with  "acquired =
true" (but obviously I'm wrong)  .   I read "FastPathGrantRelationLock" but
can't understand it.


Any hint will be helpful.   thanks!

Re: any suggestions to detect memory corruption

2019-05-09 Thread Alex

On Thu, May 9, 2019 at 9:30 PM Tom Lane  wrote:

> Alex  writes:
> > Someone add some code during backend init which used palloc. but at that
> > time,  the CurrentMemoryContext is PostmasterContext.   at the end of
> > backend initialization, the PostmasterContext is deleted, then the error
> > happens.  the reason why it happens randomly is before the palloc, there
> > are some other if clause which may skip the palloc.
>
> > I still can't explain why PostmasterContext may have impact "index info"
> > MemoryContext sometime,  but now I just can't reproduce it (before the
> > fix,  it may happen in 30% cases).
>
> Well, once the context is deleted, that memory is available for reuse.
> Everything will seem fine until it *is* reused, and then boom!
>
> The error would have been a lot more obvious if you'd enabled
> MEMORY_CONTEXT_CHECKING, which would overwrite freed data with garbage.
>

Thanks!  I didn't know this before and   " once the context is deleted,
that memory is available for reuse.
Everything will seem fine until it *is* reused". I have enabled
 enable-cassert  now.

That is normally turned on in --enable-cassert builds.  Anybody who's been
> hacking Postgres for more than a week does backend code development in
> --enable-cassert mode as a matter of course; it turns on a *lot* of
> helpful cross-checks.
>
>


> regards, tom lane
>

Re: any suggestions to detect memory corruption

2019-05-08 Thread Alex

Thanks you Tom and Robert!   I tried valgrind,  and looks it help me fix
the issue.

Someone add some code during backend init which used palloc. but at that
time,  the CurrentMemoryContext is PostmasterContext.   at the end of
backend initialization, the PostmasterContext is deleted, then the error
happens.  the reason why it happens randomly is before the palloc, there
are some other if clause which may skip the palloc.

I still can't explain why PostmasterContext may have impact "index info"
MemoryContext sometime,  but now I just can't reproduce it (before the
fix,  it may happen in 30% cases).

On Thu, May 9, 2019 at 1:21 AM Robert Haas  wrote:

> On Wed, May 8, 2019 at 10:34 AM Tom Lane  wrote:
> > Alex  writes:
> > > I can get the following log randomly and I am not which commit caused
> it.
> >
> > > 2019-05-08 21:37:46.692 CST [60110] WARNING:  problem in alloc set
> index
> > > info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18
> >
> > I've had success in finding memory stomp causes fairly quickly by setting
> > a hardware watchpoint in gdb on the affected location.  Then you just let
> > it run to see when the value changes, and check whether that's a "legit"
> > or "not legit" modification point.
> >
> > The hard part of that, of course, is to know in advance where the
> affected
> > location is.  You may be able to make things sufficiently repeatable by
> > doing the problem query in a fresh session each time.
>
> valgrind might also be a possibility, although that has a lot of overhead.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

any suggestions to detect memory corruption

2019-05-08 Thread Alex

I can get the following log randomly and I am not which commit caused it.
I spend one day but failed at last.


2019-05-08 21:37:46.692 CST [60110] WARNING:  problem in alloc set index
info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18
2019-05-08 21:37:46.692 CST [60110] WARNING:  idx: 2 problem in alloc set
index info: bad single-chunk 0x2a33a78 in block 0x2a33a18, chsize: 1408,
chunkLimit: 1024, chunkHeaderSize: 24, block_used: 768 request size: 2481
2019-05-08 21:37:46.692 CST [60110] WARNING:  problem in alloc set index
info: found inconsistent memory block 0x2a33a18

 it looks like the memory which is managed by "index info" memory context
is written by some other wrong codes.

I didn't change any AllocSetXXX related code and I think I just use it
wrong in some way.

Thanks

Re: Help to review the with X cursor option.

2019-04-24 Thread alex lock

On Thu, Apr 25, 2019 at 9:53 AM alex lock  wrote:

>
>
> that's something I want to change,  as I said at the beginning.  include
> avoid some memory release (like the EState and so on),  snapshot release.
>
>

I check my original statement, I found "snapshot release" was missed,  that
obviously is a key point..

Re: Help to review the with X cursor option.

2019-04-24 Thread alex lock

On Wed, Apr 24, 2019 at 11:30 PM Tom Lane  wrote:

> alex lock  writes:
> > The cursor means  something like  declare c cursor for select * from t;
> > The holdable cursor means declare c cursor WITH HOLD for select * from t;
>
> > Holdable cursor is good at transaction,  user can still access it after
> the
> > transaction is commit.  But it is bad at it have to save all the record
> to
> > tuple store before we fetch 1 row.
>
> > what I want is:
> > 1.   The cursor is still be able to fetch after the transaction is
> > committed.
> > 2.   the cursor will not fetch the data when fetch statement is issue
> (just
> > like non-holdable cursor).
>
> > I called this as with X cursor..
>
> > I check the current implementation and think it would be possible with
> the
> > following methods:
> > 1.   allocate the memory  in a  {LongerMemoryContext}, like EState  to
> > prevent they are
> > 2.   allocate a more bigger resource owner to prevent the LockReleaseAll
> > during CommitTransaction.
> > 3.   add the "with X" option to cursor so that Precommit_portals will not
> > drop it during CommitTransaction.
>
> > Before I implement it,  could you give some suggestions?
>
> You don't actually understand the problem.

>
Thanks tones.  I know that and that's just something I want to change.

> The reason a holdable cursor forcibly reads all the data before commit is
> that the data might not be there to read any later than that.

I think this can be done with snapshot read, like we want the data  at time
1, even the data is not there at time 2,  we provide the snapshot,  we can
read the data.  Oracle has a similar function called flashback query
https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm#1008580
 .

> Once we end
> the transaction and release its snapshot (specifically, advance the
> backend's advertised global xmin), it's possible and indeed desirable for
> obsoleted row versions to be vacuumed.

that's something I want to change,  as I said at the beginning.  include
avoid some memory release (like the EState and so on),  snapshot release.

> The only way to avoid that would
> be to not advance xmin, which is pretty much just as bad as not committing
> the transaction.

there is something different between "not advance xmin" or "not committing
the transaction" for me.   "not commit the transaction" will take up the
connection,  but "not advance xmin" one not.   without this reason,
non-holdable cursor is good for me.

> Not releasing the transaction's locks is also bad.

Assume that if the table was dropped among the fetches, we can just raise
error,  we can releasing the lock?  I am still not sure about this part,
but keep the lock is still acceptable for me since it will not take up the
connection already(my purpose).   but releasing the lock can be better.

> So it doesn't seem like there's anything to be gained here that you don't
> have today by just not committing yet.
>

it is connection:)  I want to run dml or other stuff on the current
connection.

>
> If you're concerned about not losing work due to possible errors later in
> the transaction, you could prevent those from causing problems through
> subtransactions (savepoints).
>
> Thanks for your tip,  I have thought the possibility but I can think
more.  the business model is a bit of complex and I don't want to talk more
here.

> regards, tom lane
>

Help to review the with X cursor option.

2019-04-24 Thread alex lock

The cursor means  something like  declare c cursor for select * from t;
The holdable cursor means declare c cursor WITH HOLD for select * from t;

Holdable cursor is good at transaction,  user can still access it after the
transaction is commit.  But it is bad at it have to save all the record to
tuple store before we fetch 1 row.

what I want is:
1.   The cursor is still be able to fetch after the transaction is
committed.
2.   the cursor will not fetch the data when fetch statement is issue (just
like non-holdable cursor).

I called this as with X cursor..

I check the current implementation and think it would be possible with the
following methods:
1.   allocate the memory  in a  {LongerMemoryContext}, like EState  to
prevent they are
2.   allocate a more bigger resource owner to prevent the LockReleaseAll
during CommitTransaction.
3.   add the "with X" option to cursor so that Precommit_portals will not
drop it during CommitTransaction.

Before I implement it,  could you give some suggestions?

Thanks!

Re: pg_basebackup, walreceiver and wal_sender_timeout

2019-01-28 Thread Alex Kliukin

On Mon, Jan 28, 2019, at 10:25, Michael Paquier wrote:
> On Mon, Jan 28, 2019 at 09:05:26AM +0100, Magnus Hagander wrote:

> > And for plain format, we'd do the same -- sync after each file segment, and
> > then a final one of the directory when done, right?
> 
> Well, the code is doing a double amount of work in its current shape
> as we call fsync_pgdata() for the plain format, which cascades to
> pg_wal and all its files, so it seems to me that there is little point
> in issuing a sync when each segment is finished streaming if that's
> what you mean.

Agreed.

While reading the doc page for the pg_basebackup, I've been confused by the 
fact that it says WAL files will be written to .tarballs (either base.tar or 
pg_wal.tar) when pg_basebackup is instructed to stream WALs alongside the 
backup itself. I think it makes sense to elaborate that it only happens when 
tar format is specified (doc patch is attached).

Cheers,
Oleksii


diff --git a/doc/src/sgml/ref/pg_basebackup.sgml 
b/doc/src/sgml/ref/pg_basebackup.sgml
index 57dc83b620..c4f3950e5b 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -306,8 +306,8 @@ PostgreSQL documentation
  backup will fail and be unusable.


-The write-ahead log files will be written to
-the base.tar file.
+When tar format mode is used, the write-ahead log files will be
+written to the base.tar file.

   
  
@@ -326,9 +326,10 @@ PostgreSQL documentation
  requires no extra write-ahead logs to be saved on the master.


-The write-ahead log files are written to a separate file
-named pg_wal.tar (if the server is a version
-earlier than 10, the file will be named 
pg_xlog.tar).
+When tar format mode is used, the write-ahead log files will be
+written to a separate file named pg_wal.tar 
+(if the server is a version earlier than 10, the file will be named
+pg_xlog.tar).


 This value is the default.

RE: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2018-05-21 Thread Alex Ignatov



--
Alex Ignatov 
Postgres Professional: http://www.postgrespro.com 
The Russian Postgres Company

-Original Message-
From: Alex Ignatov  
Sent: Monday, May 21, 2018 6:00 PM
To: 'Robert Haas' ; 'Andres Freund' 
Cc: 'Masahiko Sawada' ; 'Michael Paquier' 
; 'Mithun Cy' ; 'Tom Lane' 
; 'Thomas Munro' ; 'Amit 
Kapila' ; 'PostgreSQL-development' 

Subject: RE: [HACKERS] Moving relation extension locks out of heavyweight lock 
manager




-Original Message-
From: Robert Haas 
Sent: Thursday, April 26, 2018 10:25 PM
To: Andres Freund 
Cc: Masahiko Sawada ; Michael Paquier 
; Mithun Cy ; Tom Lane 
; Thomas Munro ; Amit Kapila 
; PostgreSQL-development 
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock 
manager

On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund  wrote:
>> I think the real question is whether the scenario is common enough to 
>> worry about.  In practice, you'd have to be extremely unlucky to be 
>> doing many bulk loads at the same time that all happened to hash to 
>> the same bucket.
>
> With a bunch of parallel bulkloads into partitioned tables that really 
> doesn't seem that unlikely?

It increases the likelihood of collisions, but probably decreases the number of 
cases where the contention gets really bad.

For example, suppose each table has 100 partitions and you are bulk-loading 10 
of them at a time.  It's virtually certain that you will have some collisions, 
but the amount of contention within each bucket will remain fairly low because 
each backend spends only 1% of its time in the bucket corresponding to any 
given partition.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

Hello!
I want to try to test this patch on 302(704 ht) core machine.

Patching on master (commit 81256cd05f0745353c6572362155b57250a0d2a0) is ok but 
got some error while compiling :

gistvacuum.c: In function ‘gistvacuumcleanup’:
gistvacuum.c:92:3: error: too many arguments to function 
‘LockRelationForExtension’
   LockRelationForExtension(rel, ExclusiveLock);
   ^
In file included from gistvacuum.c:21:0:
../../../../src/include/storage/extension_lock.h:30:13: note: declared here  
extern void LockRelationForExtension(Relation relation);
 ^
gistvacuum.c:95:3: error: too many arguments to function 
‘UnlockRelationForExtension’
   UnlockRelationForExtension(rel, ExclusiveLock);
   ^
In file included from gistvacuum.c:21:0:
../../../../src/include/storage/extension_lock.h:31:13: note: declared here  
extern void UnlockRelationForExtension(Relation relation);



Sorry, forgot to mention that patch version is extension-lock-v12.patch

--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

RE: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2018-05-21 Thread Alex Ignatov




-Original Message-
From: Robert Haas  
Sent: Thursday, April 26, 2018 10:25 PM
To: Andres Freund 
Cc: Masahiko Sawada ; Michael Paquier 
; Mithun Cy ; Tom Lane 
; Thomas Munro ; Amit Kapila 
; PostgreSQL-development 
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock 
manager

On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund  wrote:
>> I think the real question is whether the scenario is common enough to 
>> worry about.  In practice, you'd have to be extremely unlucky to be 
>> doing many bulk loads at the same time that all happened to hash to 
>> the same bucket.
>
> With a bunch of parallel bulkloads into partitioned tables that really 
> doesn't seem that unlikely?

It increases the likelihood of collisions, but probably decreases the number of 
cases where the contention gets really bad.

For example, suppose each table has 100 partitions and you are bulk-loading 10 
of them at a time.  It's virtually certain that you will have some collisions, 
but the amount of contention within each bucket will remain fairly low because 
each backend spends only 1% of its time in the bucket corresponding to any 
given partition.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

Hello!
I want to try to test this patch on 302(704 ht) core machine.

Patching on master (commit 81256cd05f0745353c6572362155b57250a0d2a0) is ok but
got some error while compiling :

gistvacuum.c: In function ‘gistvacuumcleanup’:
gistvacuum.c:92:3: error: too many arguments to function 
‘LockRelationForExtension’
   LockRelationForExtension(rel, ExclusiveLock);
   ^
In file included from gistvacuum.c:21:0:
../../../../src/include/storage/extension_lock.h:30:13: note: declared here
 extern void LockRelationForExtension(Relation relation);
 ^
gistvacuum.c:95:3: error: too many arguments to function 
‘UnlockRelationForExtension’
   UnlockRelationForExtension(rel, ExclusiveLock);
   ^
In file included from gistvacuum.c:21:0:
../../../../src/include/storage/extension_lock.h:31:13: note: declared here
 extern void UnlockRelationForExtension(Relation relation);


--
Alex Ignatov 
Postgres Professional: http://www.postgrespro.com 
The Russian Postgres Company

'tuple concurrently updated' error w/o visible catalog updates

2018-05-17 Thread Alex Kliukin

Hello,

Earlier this week we have split our Postgres 9.6.8 shards, each having two
databases, into one database per shard setup. This was done by promoting
replicas and subsequently removing unused databases.

Immediately afterwards we have discovered repeated 'tuple concurrently updated'
errors on most of those new shards.

The error is always shown at the same statement:
ERROR,XX000,"tuple concurrently updated","SQL statement ""UPDATE
config_content SET cc_content = l_config_content WHERE cc_config_content_id =
l_ccm_content_id""

By searching the archives (i.e.
https://www.postgresql.org/messageid/flat/CAB7nPqSZCkVfibTvx9TYmHYhVtV_vOMNwOpLHnRU85qeiimUaQ%40mail.gmail.com#cab7npqszckvfibtvx9tymhyhvtv_vomnwoplhnru85qeiim...@mail.gmail.com)
I’ve got an impression that this error manifests itself when system catalog
tuples are updated concurrently, however I see none of that in the query that
leads to an ERROR. There are no triggers on 'config_content' table, neither
there are any views referring to it.

The errors stopped when we disabled a call to the 'upsert_foo_content' function
(here and below I obfuscated real names). This is a fairly simple pl/pgsql 
function
that does a few selects and an upsert. The block inside that function that 
contains
the statement at fault is:

--
SELECT ccm_content_id, ccm_simple_update_received_at INTO l_ccm_content_id, 
l_ccm_simple_update_received_at FROM config_content_metadata
WHERE  ccm_config_id = l_c_id AND ccm_sales_channel_id = l_sales_channel_id;

IF (l_ccm_content_id IS NULL) THEN
 -- insert config content --
 INSERT INTO config_content_metadata(ccm_config_id, ccm_sales_channel_id, 
ccm_update_caused_by, ccm_simple_update_eid, ccm_simple_update_received_at)
 VALUES(l_c_id, l_sales_channel_id, l_rp_id, l_content_update_eid, 
l_content_update_received_at) RETURNING ccm_content_id INTO l_ccm_content_id;

 INSERT INTO config_content(cc_config_content_id, cc_content) VALUES 
(l_ccm_content_id, l_config_content);

ELSIF (l_ccm_simple_update_received_at < l_content_update_received_at) THEN

 UPDATE  config_content_metadata
 SET ccm_update_caused_by = l_rp_id, ccm_simple_update_eid = 
l_content_update_eid, ccm_simple_update_received_at = 
l_content_update_received_at, ccm_updated_at = now()
 WHERE ccm_content_id = l_ccm_content_id;

 -- XXX problematic statement XXX
 UPDATE config_content SET cc_content = l_config_content WHERE 
cc_config_content_id = l_ccm_content_id;

END IF;
--

Note that config_content references config_metdata with a foreign key, however,
the referenced column is not updated.

That 'upsert_foo_content' is called by another one,  upsert_foo_content_batch,
in a loop over the elements of a JSON array, something like:

--
CREATE OR REPLACE FUNCTION upsert_foo_content_batch(p_batch jsonb)
RETURN void LANGUAGE plpgpsql
AS $function$
DECLARE
...
BEGIN
FOR item IN SELECT * FROM jsonb_array_elements(p_batch)
LOOP
  -- some unpacking of fields from json into the local variables
  PERFORM upsert_foo_content(..) -- called with the unpacked variables
END LOOP;
END;
$function$
--

'upsert_foo_content_batch' is called, in order, at the end of a long pl/pgsql
function 'upsert_foo_event_batch', which consists of a very long CTE that
extracts individual fields from a JSON argument, and then performs a number of
inserts into some tables, doing on conflict do nothing, afterwards performing
more inserts into the tables that reference the previous ones, doing on
conflict do update. However, it modifies neither 'config_content' or
'config_content_metadata' tables.

So the chain of calls is
'upsert_foo_event_batch' -> 
'upsert_foo_content_batch' -> 
   'upsert_foo_content'.
(the last one contains the statement that leads to the "tuple concurrently 
updated" error).


It is possible that 'upsert_foo_content' function is called with the same data
multiple times in different processes, however, I’d expect it to either
complete successfully, or throw an error because the PK already exists (this is
running in a read committed mode, so ISTM not immune to the case where the row
in the metadata table is inserted after another session does the check, but
before the insert), but not an error mentioned at the beginning of this
message.

Are there any signs in this description that the queries might be doing
something unexpected to PostgreSQL, or that something went wrong during the
split?  I am running out of options of what could cause the issue, so any
pointers or help in debugging it is appreciated (note that this is a production
database, I cannot just stop it at will).

Cheers,
Oleksii

Re: All Taxi Services need Index Clustered Heap Append

2018-03-05 Thread Alex Kane

https://aws.amazon.com/dms/

DMS might be helpful if you need to move off of RDS


Alex Kane

On Mon, Mar 5, 2018 at 11:48 AM, Ants Aasma  wrote:

> On Mon, Mar 5, 2018 at 2:11 PM, Darafei "Komяpa" Praliaskouski
>  wrote:
> >> This approach mixes well with hash
> >> partitioning. It would be neat indeed if PostgreSQL do something
> >> equivalent on its own, and pluggable storage work being done could
> >> enable index organized tables that would help. But you probably need
> >> something right now.
> >
> >
> > Fixing glaring issues (no vacuum and thus no Index-Only Scan on
> append-only
> > tables, vacuum processing all of the eternity of btree) by 11 will get
> most
> > of spike-nails out of the microservice code, and we can probably live
> with
> > them until 11 gets to RDS.
> >
> > I also don't see why a pluggable storage is a must for the clustered
> write.
> > Postgres does have a mechanism for selecting the next page to write tuple
> > to, right now it's just looking at FSM - but what if it just peeked at
> > existing index that already has enough the data to route tuple to correct
> > page on write?
>
> The mechanism you outlined would likely work for your use case, but it
> has many issues that prevent it from being universally useful. From
> the top of my head:
>
> * One extra index descent per insertion (I/O for this is necessary
> anyway, but CPU work is duplicated).
> * We don't currently track the amount of bloat. A mechanism that does
> this needs to be added.
> * If table hits the bloat limit there will be a sudden change in
> behavior. This is pretty nasty from an operations point of view.
> * With your (id,ts) clustering and data coming in mostly ordered by
> timestamp, after initial warmup, each page will contain rows from a
> single id, but different ids are arbitrarily interleaved. This is
> better than current state, but people might want to have an
> interleaving step bigger than 8kB to better utilize storage hardware.
> * It seems that with a common (ts) clustering and age of timestamp
> coming from an exponential distribution, this will quickly bloat to
> threshold and then insert data in a rather arbitrary order. This is
> much worse than the default behavior.
>
> At least in my opinion these problems make it a special case
> optimization that is hard to justify in core. A decent alternative
> would be a plugin mechanism for locating free space for a tuple where
> you can write your extension to find a suitable location for the row.
>
> >> I guess I don't have to tell you that it looks like your needs have
> >> outgrown what RDS works well with and you are in for a painful move
> >> sooner or later.
> >
> >
> > Painful move where to? If we just run a Postgres instance without RDS
> we'll
> > get the pain of setting up Postgres and replication and backups and
> > autofailover, with no visible gain except if we get some private /
> > unaccepted patches applied to it. If we can get these things right
> upstream
> > why would we want to switch?
>
> EC2 for example. Mainly because I3 instances and ephemeral provide an
> order of magnitude or two of performance improvement while costing
> less. Being able to run custom extensions and patches if necessary is
> a nice bonus. Yes, setting up replication, autofailover and backups is
> extra work that you have to weigh against the benefits. But don't
> overestimate the effort - there are some pretty nice tools available
> that make a proper cluster relatively simple to set up.
>
> > Per my colleagues, MySQL offers clustered index, also MySQL is available
> on
> > RDS without the need of "painful move", which is doable by writing to two
> > locations for a day and then pointing readers to new DB. But if we can
> > instead do no move and be sure the issues are gone upstream before we hit
> > the limit of spike-nails we're running on currently, wouldn't that be
> > better? :)
>
> The move off of RDS is painful because getting data out of RDS
> involves either downtime or building an ad-hoc logical replication
> solution. You need to solve that regardless of where you move to.
>
> Providing an out-of-the-box solution in core PostgreSQL would of
> course be best, but realistically you will be waiting at least 2 years
> to get it on RDS. In the meanwhile either the buffer partition
> approach I described, or a buffering microservice in front of
> PostgreSQL like Aleksander recommended should fix data locality for
> you. If you weren't running on RDS I would even propose using Redis as
> the buffer with one key per driver and redis_fdw to make the data
> accessible from within PostgreSQL.
>
> Regards,
> Ants  Aasma
> --
> +43-670-6056265
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26, A-2700 Wiener Neustadt
> Web: https://www.cybertec-postgresql.com
>
>

RE: Direct converting numeric types to bool

2018-02-28 Thread Alex Ignatov


-Original Message-
From: n.zhuch...@postgrespro.ru [mailto:n.zhuch...@postgrespro.ru] 
Sent: Wednesday, February 28, 2018 6:04 PM
To: pgsql-hackers 
Subject: Direct converting numeric types to bool

Attached patch allow direct convertion of numeric types to bool like
integer::bool.
Supported types:
  - smallint;
  - bigint;
  - real;
  - double precision;
  - decimal(numeric).

This functionality is helped with migration from Oracle.

--
Nikita Zhuchkov
Postgres Professional: http://www.postgrespro.com The Russian Postgres
Company

Hello!

What prevent us from:

postgres=# select 1::bigint::int::boolean;
 bool
--
 t
(1 row)

It is just one additional casting and required no additional patching
--
Alex Ignatov 
Postgres Professional: http://www.postgrespro.com 
The Russian Postgres Company

Re: Estimate maintenance_work_mem for CREATE INDEX

2017-12-19 Thread Alex Shulgin

On Tue, Dec 19, 2017 at 3:15 PM Greg Stark  wrote:

> On 19 December 2017 at 10:00, Oleksandr Shulgin
>  wrote:
>
> > If there would be an option in the database itself to provide those
> > estimation, we wouldn't even need to figure out estimation queries.
> > "EXPLAIN CREATE INDEX" anyone?
>
> You're not the first to propose something like that. I think an
> EXPLAIN ALTER TABLE would also be very handy -- it's currently
> impossible to tell without carefully reading the source code whether a
> given DDL change will require a full table scan, a full table rewrite,
> or just a quick meta data update (and even in that case what strength
> lock will be required). I think there are other utility statements
> that make interesting heuristic decisions that would be nice to be
> able to have some visibility into -- CLUSTER comes to mind.
>

Yes, that would be pretty handy.

I'm not clear how you would determine how much memory is needed to
> sort a table without actually doing the sort though. So that would be
> more of an EXPLAIN ANALYZE wouldn't it?
>

My idea would be to use statistic.  So that EXPLAIN CREATE INDEX (or
whatever the actual interface could be like) would benefit from up-to-date
statistic produced by ANALYZE.

Based on the estimated number of rows in the table, average width of
column(s) to index and taking into account the bookkeeping structures one
should be able to arrive at a good guess for the amount of memory the
backend would end up allocating (assuming it is available).

Having done that, as the first step, and using statistic again we could
also infer (though, probably with less accuracy) memory requirements for
building partial indexes.  Functional indexes would be harder to tackle, I
would think this is only possible if the return type(s) of the function(s)
has all fixed width.

I didn't look in the code, but I imagine the procedure to read -> sort
-> spill to tapes, if needed -> merge sort the tapes is generic to all
index types, so this shouldn't be a breaking change for any user-defined
indexes (is this already a thing?).  OK, maybe it's only generic for B-Tree
and BRIN, but not for GIN and GiST, to name a few.  Damn, I gotta look in
the code at some point. ;-)

To let me fantasize a little more, what I would also love to see is the
estimated on-disk size for the resulting index, before starting to create
it.  This is obviously dependent on the actual index type and options, such
as fill-factor, etc.

Cheers,
--
Alex

Re: GSoC 2018

2017-12-15 Thread Alex Kliukin



On Fri, Dec 15, 2017, at 14:52, Aleksander Alekseev wrote:

> Completely agree, this project can be an improvement for Stolon (or
> Patroni, but I personally never tested or used it, also I got a feeling
> that Google guys will prefer a project that is written in Go). This
> would make much more sense.

I don't believe Google will reject a project based on the fact that it
is written in Python (in fact, Python Software Foundation has
successfully participated in GSoC for many years). Based on the github
statistics, Patroni has started earlier and has more contributors than
Stolon (including those contributed more than one patch/pull-request.)
-- 
Sincerely,
Alex

Re: GSoC 2018

2017-12-15 Thread Alex Kliukin

Hello,

On Fri, Dec 15, 2017, at 14:30, Stephen Frost wrote:
> Aleksander,
> 
> * Aleksander Alekseev (a.aleks...@postgrespro.ru) wrote:
> 
> > Regarding the difficulty of the project - in fact it's not that
> > difficult. Particularly this project can rely on external tools, e.g.
> > use Consul for service discovery and leader election based on
> > leader-lease approach (implementation [1]). Having this the only thing
> > is missing is automatic replica promoting and (optionally)
> > re-configuring of HAProxy / pgbouncer / whatever. Yes, and lots of
> > Jepsen-like test of course. I believe it's not such a complicated
> > project.

Does it make sense to address the  limitations of the logical
replication first, i.e. inability to replicate DDL, sequences and so on?

> 
> What you're talking about is rebuilding Patroni, but adding more into it
> than even Patroni tries to do, building it on Logical Replication
> instead of physical replication, and calling it simple and something
> that could get into core PostgreSQL over a 12 week GSoC project.  I've
> certainly got doubts about that, even if we decide that it'd be an
> external-to-PG project (like Patroni).
> 
> What might be interesting is seeing if Logical Replication could be
> added to Patroni as an option and then building on that..  Having
> someone involved in the Patroni project would be the right way to go
> about proposing that though to see what they think of it.  That would
> also be much more sensible as a GSoC project, since it'd be an addition
> to an existing project and not more-or-less starting a whole new
> project.

Right now logical replication and  physical replication-based HA tools
don't work together nicely, since logical replication position is not
propagated to the promoted replica (I think Craig Ringer has been
tackling this issue for a few releases already, the latest set of
patches I could find is https://commitfest.postgresql.org/15/788/).
Perhaps there is opportunity for a GSoC student to help fixing it. Until
then  we cannot use logical replication for HA, and even doing something
simpler like automating creation of logical replicas in Patroni makes
little sense, as they are doomed to be reinitialized on every failover.

-- 
Sincerely,
Alex

57 matches

Mail list logo