subject:"Re\: Show various offset arrays for heap WAL records"

Re: Show various offset arrays for heap WAL records

2023-10-02 Thread Heikki Linnakangas


On 04/09/2023 23:02, Melanie Plageman wrote:

I might phrase the last bit as "neither the description functions nor
the output format should be considered part of a stable API"

+Guidelines for rmgrdesc output format
+=

I noticed you used === for both headings and wondered if it was
intentional. Other READMEs I looked at in src/backend/access tend to
have a single heading underlined with  and then subsequent
headings are underlined with -. I could see an argument either way
here, but I just thought I would bring it up in case it was not a
conscious choice.

Otherwise, LGTM.


Made these changes and committed. Thank you!

--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Show various offset arrays for heap WAL records

2023-09-04 Thread Melanie Plageman

On Mon, Jul 10, 2023 at 3:44 AM Heikki Linnakangas  wrote:
> I'm late to the party, but regarding commit c03c2eae0a, which added the
> guidelines for writing formatting desc functions:
>
> You moved the comment from rmgrdesc_utils.c into rmgrdesc_utils.h, but I
> don't think that was a good idea. Our usual convention is to have the
> function comment in the .c file, not at the declaration in the header
> file. When I want to know what a function does, I jump to the .c file,
> and might miss the comment in the header entirely.
>
> Let's add a src/backend/access/rmgrdesc/README file. We don't currently
> have any explanation anywhere why the rmgr desc functions are in a
> separate directory. The README would be a good place to explain that,
> and to have the formatting guidelines. See attached.

diff --git a/src/backend/access/rmgrdesc/README
b/src/backend/access/rmgrdesc/README
new file mode 100644
index 00..abe84b9f11
--- /dev/null
+++ b/src/backend/access/rmgrdesc/README
@@ -0,0 +1,60 @@
+src/backend/access/rmgrdesc/README
+
+WAL resource manager description functions
+==
+
+For debugging purposes, there is a "description function", or rmgrdesc
+function, for each WAL resource manager. The rmgrdesc function parses the WAL
+record and prints the contents of the WAL record in a somewhat human-readable
+format.
+
+The rmgrdesc functions for all resource managers are gathered in this
+directory, because they are also used in the stand-alone pg_waldump program.

"standalone" seems the more common spelling of this adjective in the
codebase today.

+They could potentially be used by out-of-tree debugging tools too, although
+the the functions or the output format should not be considered a stable API.

You have an extra "the".

I might phrase the last bit as "neither the description functions nor
the output format should be considered part of a stable API"

+Guidelines for rmgrdesc output format
+=

I noticed you used === for both headings and wondered if it was
intentional. Other READMEs I looked at in src/backend/access tend to
have a single heading underlined with  and then subsequent
headings are underlined with -. I could see an argument either way
here, but I just thought I would bring it up in case it was not a
conscious choice.

Otherwise, LGTM.

- Melanie

Re: Show various offset arrays for heap WAL records

2023-07-25 Thread Peter Geoghegan

On Mon, Jul 10, 2023 at 10:29 PM Peter Geoghegan  wrote:
> > Let's add a src/backend/access/rmgrdesc/README file. We don't currently
> > have any explanation anywhere why the rmgr desc functions are in a
> > separate directory. The README would be a good place to explain that,
> > and to have the formatting guidelines. See attached.
>
> I agree that it's better this way, though.

Did you forget to follow up here?

-- 
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-07-10 Thread Peter Geoghegan

On Mon, Jul 10, 2023 at 12:44 AM Heikki Linnakangas  wrote:
> This is still listed in the July commitfest; is there some work remaining?

I don't think so; not in the scope of the original patch series from
Melanie, at least.

> You moved the comment from rmgrdesc_utils.c into rmgrdesc_utils.h, but I
> don't think that was a good idea. Our usual convention is to have the
> function comment in the .c file, not at the declaration in the header
> file. When I want to know what a function does, I jump to the .c file,
> and might miss the comment in the header entirely.

I think that this was a gray area. It wasn't particularly obvious
where this would go. At least not to me.

> Let's add a src/backend/access/rmgrdesc/README file. We don't currently
> have any explanation anywhere why the rmgr desc functions are in a
> separate directory. The README would be a good place to explain that,
> and to have the formatting guidelines. See attached.

I agree that it's better this way, though.

-- 
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-07-10 Thread Heikki Linnakangas


On 12/04/2023 01:29, Peter Geoghegan wrote:

Thanks for your help with the follow-up work. Seems like we're done
with this now.


This is still listed in the July commitfest; is there some work remaining?

I'm late to the party, but regarding commit c03c2eae0a, which added the 
guidelines for writing formatting desc functions:


You moved the comment from rmgrdesc_utils.c into rmgrdesc_utils.h, but I 
don't think that was a good idea. Our usual convention is to have the 
function comment in the .c file, not at the declaration in the header 
file. When I want to know what a function does, I jump to the .c file, 
and might miss the comment in the header entirely.


Let's add a src/backend/access/rmgrdesc/README file. We don't currently 
have any explanation anywhere why the rmgr desc functions are in a 
separate directory. The README would be a good place to explain that, 
and to have the formatting guidelines. See attached.


--
Heikki Linnakangas
Neon (https://neon.tech)
From 5261e3b4c373aef47458bed14d339d4ea2e6df5a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 10 Jul 2023 10:41:35 +0300
Subject: [PATCH v1 1/1] Add rmgrdesc README.

In the README, briefly explain what rmgrdesc functions are, and why they
are in a separate directory. Commit c03c2eae0a added some guidelines on
the preferred output format; move that to the README too.

Discussion: TODO
---
 src/backend/access/rmgrdesc/README   | 60 
 src/backend/access/rmgrdesc/rmgrdesc_utils.c |  4 ++
 src/include/access/rmgrdesc_utils.h  | 44 --
 3 files changed, 64 insertions(+), 44 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/README

diff --git a/src/backend/access/rmgrdesc/README b/src/backend/access/rmgrdesc/README
new file mode 100644
index 00..abe84b9f11
--- /dev/null
+++ b/src/backend/access/rmgrdesc/README
@@ -0,0 +1,60 @@
+src/backend/access/rmgrdesc/README
+
+WAL resource manager description functions
+==
+
+For debugging purposes, there is a "description function", or rmgrdesc
+function, for each WAL resource manager. The rmgrdesc function parses the WAL
+record and prints the contents of the WAL record in a somewhat human-readable
+format.
+
+The rmgrdesc functions for all resource managers are gathered in this
+directory, because they are also used in the stand-alone pg_waldump program.
+They could potentially be used by out-of-tree debugging tools too, although
+the the functions or the output format should not be considered a stable API.
+
+Guidelines for rmgrdesc output format
+=
+
+The goal of these guidelines is to avoid gratuitous inconsistencies across
+each rmgr, and to allow users to parse desc output strings without too much
+difficulty.  This is not an API specification or an interchange format.
+(Only heapam and nbtree desc routines follow these guidelines at present, in
+any case.)
+
+Record descriptions are similar to JSON style key/value objects.  However,
+there is no explicit "string" type/string escaping.  Top-level { } brackets
+should be omitted.  For example:
+
+snapshotConflictHorizon: 0, flags: 0x03
+
+Record descriptions may contain variable-length arrays.  For example:
+
+nunused: 5, unused: [1, 2, 3, 4, 5]
+
+Nested objects are supported via { } brackets.  They generally appear inside
+variable-length arrays.  For example:
+
+ndeleted: 0, nupdated: 1, deleted: [], updated: [{ off: 45, nptids: 1, ptids: [0] }]
+
+Try to output things in an order that faithfully represents the order of
+fields from the underlying physical WAL record struct.  Key names should be
+unique (at the same nesting level) to make parsing easy.  It's a good idea if
+the number of items in the array appears before the array.
+
+It's okay for individual WAL record types to invent their own conventions.
+For example, Heap2's PRUNE record descriptions use a custom array format for
+the record's "redirected" field:
+
+... redirected: [1->4, 5->9], dead: [10, 11], unused: [3, 7, 8]
+
+Arguably the desc routine should be using object notation for this instead.
+However, there is value in using a custom format when it conveys useful
+information about the underlying physical data structures.
+
+This ad-hoc format has the advantage of being close to the format used for
+the "dead" and "unused" arrays (which follow the standard desc convention for
+page offset number arrays).  It suggests that the "redirected" elements shown
+are just pairs of page offset numbers (which is how it really works).
+
+rmgrdesc_utils.c contains some helper functions to print data in this format.
diff --git a/src/backend/access/rmgrdesc/rmgrdesc_utils.c b/src/backend/access/rmgrdesc/rmgrdesc_utils.c
index 90051f0df9..808770524d 100644
--- a/src/backend/access/rmgrdesc/rmgrdesc_utils.c
+++ b/src/backend/access/rmgrdesc/rmgrdesc_utils.c
@@ -16,6 +16,10 @@
 #include "access/rmgrdesc_utils.h"
 #include "storage/off.h"

Re: Show various offset arrays for heap WAL records

2023-04-11 Thread Peter Geoghegan

On Tue, Apr 11, 2023 at 2:29 PM Melanie Plageman
 wrote:
> > That doesn't seem great to me either. I don't like this ambiguity,
> > because it seems like it makes the description hard to parse in a way
> > that flies in the face of what we're trying to do here, in general.
> > So it seems like it might be worth fixing now, in the scope of this
> > patch.
>
> Agreed.

Great -- pushed a fix for this just now, which included that change.

> I agree it would be nice for xl_heap_lock->locking_xid to be renamed
> xmax for clarity. I would suggest that if you don't intend to put it
> in a separate commit, you mention it explicitly in the final commit
> message. Its motivation isn't immediately obvious to the reader.

What I ended up doing is making that part of a bug fix for a minor
buglet I noticed in passing -- it became part of the "Fix xl_heap_lock
WAL record field's data type" commit from a bit earlier on.

Thanks for your help with the follow-up work. Seems like we're done
with this now.

-- 
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-04-11 Thread Melanie Plageman

On Tue, Apr 11, 2023 at 1:35 PM Peter Geoghegan  wrote:
>
> On Tue, Apr 11, 2023 at 7:40 AM Melanie Plageman  
> wrote:
> > Not the fault of this patch, but I also noticed that heap UPDATE and
> > HOT_UPDATE records have xmax twice and don't differentiate between new
> > and old. I think that was probably a mistake.
> >
> > description  | off: 119, xmax: 1105, flags: 0x00, old_infobits:
> > [], new off: 100, xmax 0
>
> That doesn't seem great to me either. I don't like this ambiguity,
> because it seems like it makes the description hard to parse in a way
> that flies in the face of what we're trying to do here, in general.
> So it seems like it might be worth fixing now, in the scope of this
> patch.

Agreed.

On Tue, Apr 11, 2023 at 3:22 PM Peter Geoghegan  wrote:
>
> On Tue, Apr 11, 2023 at 11:48 AM Peter Geoghegan  wrote:
> > Attached revision deals with this by spelling out the names in full
> > (e.g., "old_xmax" and "new_xmax"). It also reorders the output fields
> > to match the order from the physical UPDATE, HOT_UPDATE, and LOCK WAL
> > record types, on the theory that those should match the physical
> > record (unless there is a good reason not to, which doesn't apply
> > here).
>
> I just noticed that we don't even show xmax in the case of DELETE
> records. Perhaps the original assumption is that it must match the
> record's own XID, but that's not true after the MultiXact enhancements
> for foreign key locking added to 9.3 (and in any case there is no
> reason at all to show xmax in UPDATE but not in DELETE).
>
> Attached revision v4 fixes this, making DELETE, UPDATE, HOT_UPDATE,
> LOCK, and LOCK_UPDATED record types consistent with each other in
> terms of the key names output by the heap desc routine. The field
> order also needed a couple of tweaks for struct consistency (and
> cross-record consistency) for v4.

Code in v4 all seems fine to me.
I like the update guidelines comment.

I agree it would be nice for xl_heap_lock->locking_xid to be renamed
xmax for clarity. I would suggest that if you don't intend to put it
in a separate commit, you mention it explicitly in the final commit
message. Its motivation isn't immediately obvious to the reader.

- Melanie

Re: Show various offset arrays for heap WAL records

2023-04-11 Thread Peter Geoghegan

On Tue, Apr 11, 2023 at 11:48 AM Peter Geoghegan  wrote:
> Attached revision deals with this by spelling out the names in full
> (e.g., "old_xmax" and "new_xmax"). It also reorders the output fields
> to match the order from the physical UPDATE, HOT_UPDATE, and LOCK WAL
> record types, on the theory that those should match the physical
> record (unless there is a good reason not to, which doesn't apply
> here).

I just noticed that we don't even show xmax in the case of DELETE
records. Perhaps the original assumption is that it must match the
record's own XID, but that's not true after the MultiXact enhancements
for foreign key locking added to 9.3 (and in any case there is no
reason at all to show xmax in UPDATE but not in DELETE).

Attached revision v4 fixes this, making DELETE, UPDATE, HOT_UPDATE,
LOCK, and LOCK_UPDATED record types consistent with each other in
terms of the key names output by the heap desc routine. The field
order also needed a couple of tweaks for struct consistency (and
cross-record consistency) for v4.

-- 
Peter Geoghegan

v4-0001-Fix-heapdesc-infomask-array-output.patch
Description: Binary data

Re: Show various offset arrays for heap WAL records

2023-04-11 Thread Peter Geoghegan

On Tue, Apr 11, 2023 at 10:34 AM Peter Geoghegan  wrote:
> > description  | off: 119, xmax: 1105, flags: 0x00, old_infobits:
> > [], new off: 100, xmax 0
>
> That doesn't seem great to me either. I don't like this ambiguity,
> because it seems like it makes the description hard to parse in a way
> that flies in the face of what we're trying to do here, in general.
> So it seems like it might be worth fixing now, in the scope of this
> patch.

Attached revision deals with this by spelling out the names in full
(e.g., "old_xmax" and "new_xmax"). It also reorders the output fields
to match the order from the physical UPDATE, HOT_UPDATE, and LOCK WAL
record types, on the theory that those should match the physical
record (unless there is a good reason not to, which doesn't apply
here). I also removed some inconsistencies between
xl_heap_lock_updated and xl_heap_lock, since they're very similar
record types.

The revision also adds an extra sentence to the guidelines, since this
seems like something that we're entitled to take a relatively firm
position on. Finally, it also adds a comment about the rules for
infobits_desc callers in header comments for the function, per your
concern about that.

-- 
Peter Geoghegan

v3-0001-Fix-heapdesc-infomask-array-output.patch
Description: Binary data

Re: Show various offset arrays for heap WAL records

2023-04-11 Thread Peter Geoghegan

On Tue, Apr 11, 2023 at 7:40 AM Melanie Plageman
 wrote:
> static void
> infobits_desc(StringInfo buf, uint8 infobits, const char *keyname)
> {
> appendStringInfo(buf, "%s: [", keyname);
>
> Why can we assume that there will be no space at the end here?

I don't think that anybody is going to try that, but if they do then
the assertion will fail reliably.

> I know we need to be able to avoid doing the comma overwriting if no
> flags were set. In general, we expect record description elements to
> prepend themselves with commas and spaces, but these infobits, for
> example, use a trailing comma and space. If someone adds a description
> element with a trailing space, they will trip this assert. We should at
> least add a comment explaining this assertion so someone knows what to
> do if they trip it.

The invariant here is roughly: caller's keyname argument cannot have
trailing spaces or punctuation characters. It looks like it would be
inconvenient to write a precise assertion for that, but it doesn't
feel particularly necessary, given that this is just a static helper
function.

> Otherwise, we can return early if no flags are set. That will probably
> make for slightly messier code since we would still have to construct
> the empty list.

I prefer to keep this as simple as possible for now.

> Also you didn't add the same assert to truncate_flags_desc().

That's because truncate_flags_desc doesn't have a "keyname" argument.
Though it does have an assertion at the end that is almost equivalent:
the "Assert(buf->data[buf->len - 2] == ',') assertion (a matching
assertion appears at the end of infobits_desc).

> Not the fault of this patch, but I also noticed that heap UPDATE and
> HOT_UPDATE records have xmax twice and don't differentiate between new
> and old. I think that was probably a mistake.
>
> description  | off: 119, xmax: 1105, flags: 0x00, old_infobits:
> [], new off: 100, xmax 0

That doesn't seem great to me either. I don't like this ambiguity,
because it seems like it makes the description hard to parse in a way
that flies in the face of what we're trying to do here, in general.
So it seems like it might be worth fixing now, in the scope of this
patch.

> Also not the fault of this patch, but looking at the output while using
> this, I realized truncate record type has a stringified version of its
> flags while other record types, like update, don't. Do you think this
> makes sense? Perhaps not something we can change now, though...

You don't have to look at the truncate record type (which is a
relatively obscure and unimportant record type) to see these kinds of
inconsistencies. You can see the same thing with HEAP_UPDATE and
HEAP_HOT_UPDATE, which have stringified constants for infomask bits,
but not for the xl_heap_update.flags status bits.

I don't see any principled reason why such an inconsistency should
exist -- and we're talking about a pretty glaring inconsistency here.
On the other hand I don't think that we're obligated to do anything
about it for 16.

> Also not the fault of this patch, but I noticed that leaftopparent is
> often InvalidBlockNumber--which shows up as 4294967295. I wonder if
> anyone would be confused by this. Maybe devs know that this value is
> InvalidBlockNumber. In the future, perhaps we should interpolate the
> string "InvalidBlockNumber"?
>
> description | left: 436, right: 389, level: 0, safexid: 0:1091,
> leafleft: 436, leafright: 389, leaftopparent: 4294967295

In my personal opinion (this is a totally subjective question), the
current approach here is okay because (on its own) "leaftopparent:
4294967295" isn't any more or less meaningful than "leaftopparent:
InvalidBlockNumber". It's not as if the REDO routine actually relies
on the value ever being InvalidBlockNumber at all (except in an
assertion).

Plus it's easier to parse as-is. That's what swings it for me, in fact
(as with the "two xmax fields in update records" question).

This is the kind of question that tends to lead to bikeshedding. The
guidelines should avoid taking a firm position on these more
subjective questions, where we must make a subjective trade-off.
Especially a trade-off around how faithfully we represent the physical
WAL record versus readability (whatever "readability" means). I
pondered a similar trade-off in comments added to delvacuum_desc. That
contributed to my feeling that this is best left up to individual rmgr
desc routines.

--
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-04-11 Thread Melanie Plageman

Hi,

static void
infobits_desc(StringInfo buf, uint8 infobits, const char *keyname)
{
appendStringInfo(buf, "%s: [", keyname);

Why can we assume that there will be no space at the end here?

I know we need to be able to avoid doing the comma overwriting if no
flags were set. In general, we expect record description elements to
prepend themselves with commas and spaces, but these infobits, for
example, use a trailing comma and space. If someone adds a description
element with a trailing space, they will trip this assert. We should at
least add a comment explaining this assertion so someone knows what to
do if they trip it.

Otherwise, we can return early if no flags are set. That will probably
make for slightly messier code since we would still have to construct
the empty list.

Assert(buf->data[buf->len - 1] != ' ');

if (infobits & XLHL_XMAX_IS_MULTI)
appendStringInfoString(buf, "IS_MULTI, ");
if (infobits & XLHL_XMAX_LOCK_ONLY)
appendStringInfoString(buf, "LOCK_ONLY, ");
if (infobits & XLHL_XMAX_EXCL_LOCK)
appendStringInfoString(buf, "EXCL_LOCK, ");
if (infobits & XLHL_XMAX_KEYSHR_LOCK)
appendStringInfoString(buf, "KEYSHR_LOCK, ");
if (infobits & XLHL_KEYS_UPDATED)
appendStringInfoString(buf, "KEYS_UPDATED, ");

if (buf->data[buf->len - 1] == ' ')
{
/* Truncate-away final unneeded ", "  */
Assert(buf->data[buf->len - 2] == ',');
buf->len -= 2;
buf->data[buf->len] = '\0';
}

appendStringInfoString(buf, "]");
}

Also you didn't add the same assert to truncate_flags_desc().

static void
truncate_flags_desc(StringInfo buf, uint8 flags)
{
appendStringInfoString(buf, "flags: [");

if (flags & XLH_TRUNCATE_CASCADE)
appendStringInfoString(buf, "CASCADE, ");
if (flags & XLH_TRUNCATE_RESTART_SEQS)
appendStringInfoString(buf, "RESTART_SEQS, ");

if (buf->data[buf->len - 1] == ' ')
{
/* Truncate-away final unneeded ", "  */
Assert(buf->data[buf->len - 2] == ',');
buf->len -= 2;
buf->data[buf->len] = '\0';
}

appendStringInfoString(buf, "]");
}

Not the fault of this patch, but I also noticed that heap UPDATE and
HOT_UPDATE records have xmax twice and don't differentiate between new
and old. I think that was probably a mistake.

description  | off: 119, xmax: 1105, flags: 0x00, old_infobits:
[], new off: 100, xmax 0

else if (info == XLOG_HEAP_UPDATE)
{
xl_heap_update *xlrec = (xl_heap_update *) rec;

appendStringInfo(buf, "off: %u, xmax: %u, flags: 0x%02X, ",
xlrec->old_offnum,
xlrec->old_xmax,
xlrec->flags);
infobits_desc(buf, xlrec->old_infobits_set, "old_infobits");
appendStringInfo(buf, ", new off: %u, xmax %u",
xlrec->new_offnum,
xlrec->new_xmax);
}
else if (info == XLOG_HEAP_HOT_UPDATE)
{
xl_heap_update *xlrec = (xl_heap_update *) rec;

appendStringInfo(buf, "off: %u, xmax: %u, flags: 0x%02X, ",
xlrec->old_offnum,
xlrec->old_xmax,
xlrec->flags);
infobits_desc(buf, xlrec->old_infobits_set, "old_infobits");
appendStringInfo(buf, ", new off: %u, xmax: %u",
xlrec->new_offnum,
xlrec->new_xmax);
}

Also not the fault of this patch, but looking at the output while using
this, I realized truncate record type has a stringified version of its
flags while other record types, like update, don't. Do you think this
makes sense? Perhaps not something we can change now, though...

description  | off: 1, xmax: 1183, flags: 0x00, old_infobits: [],
new off: 119, xmax 0

Also not the fault of this patch, but I noticed that leaftopparent is
often InvalidBlockNumber--which shows up as 4294967295. I wonder if
anyone would be confused by this. Maybe devs know that this value is
InvalidBlockNumber. In the future, perhaps we should interpolate the
string "InvalidBlockNumber"?

description | left: 436, right: 389, level: 0, safexid: 0:1091,
leafleft: 436, leafright: 389, leaftopparent: 4294967295

- Melanie

Re: Show various offset arrays for heap WAL records

2023-04-10 Thread Peter Geoghegan

On Mon, Apr 10, 2023 at 5:23 PM Melanie Plageman
 wrote:
> If you keep the name, I'd explain it briefly in a comment above the code
> then -- for those of us who spend less time with btrees. It is a tool
> that will be often used by developers, so it is not unreasonable to
> assume they may read the code if they are confused.

Okay, I'll do something about that shortly.

Attached v2 deals with the "trailing comma and space in flags array"
heap desc issue using an approach that's along the same lines as your
suggested approach. What do you think?

-- 
Peter Geoghegan

v2-0001-Fix-heapdesc-infomask-array-output.patch
Description: Binary data

Re: Show various offset arrays for heap WAL records

2023-04-10 Thread Melanie Plageman

On Mon, Apr 10, 2023 at 04:31:44PM -0700, Peter Geoghegan wrote:
> On Mon, Apr 10, 2023 at 3:04 PM Melanie Plageman  
> wrote:
> > 
> > I will say that the prefix of p in "ptid" makes it sound like pointer to
> > a tid, which I don't believe is what you meant.
> 
> I was thinking of the symbol name "ptid" from
> _bt_delitems_delete_check() (it even appears in code comments). I
> intended "posting list TID". But "pointer to a TID" actually kinda
> works too, since these are offsets into a posting list (a simple
> ItemPointerData array) for those TIDs that we're in the process of
> removing/deleted from the tuple.

If you keep the name, I'd explain it briefly in a comment above the code
then -- for those of us who spend less time with btrees. It is a tool
that will be often used by developers, so it is not unreasonable to
assume they may read the code if they are confused.

- Melanie

Re: Show various offset arrays for heap WAL records

2023-04-10 Thread Peter Geoghegan

On Mon, Apr 10, 2023 at 3:04 PM Melanie Plageman
 wrote:
> I took a look at the first patch even though you've pushed the bugfix
> part. Any reason you didn't use array_desc() for the inner array (of
> "ptids")? I find that following the pattern of using array_desc (when it
> is correct, of course!) helps me to quickly identify: "okay, this is an
> array of x" without having to stare at the loop too much.

It was fairly arbitrary. I was thinking "we can't use array_desc for
this", which wasn't 100% true, but seemed close enough. It helped that
this allowed me to remove uint16_elem_desc(), which likely wouldn't
have been reused later on.

> I will say that the prefix of p in "ptid" makes it sound like pointer to
> a tid, which I don't believe is what you meant.

I was thinking of the symbol name "ptid" from
_bt_delitems_delete_check() (it even appears in code comments). I
intended "posting list TID". But "pointer to a TID" actually kinda
works too, since these are offsets into a posting list (a simple
ItemPointerData array) for those TIDs that we're in the process of
removing/deleted from the tuple.

> I like the new guidelines you proposed (in the patch).
> They are well-written and clear.

Thanks. The guidelines might well become stricter in the future. Right
now I'd be happy if everybody could at least be in rough agreement
about the most basic things.

> I recognized that the output doesn't look nice, but I hadn't exactly
> thought of it as malformed. Perhaps you are right.

It does seem like an annoying thing to have to handle if you actually
want to parse the array. It requires a different approach to every
other array, which seems bad.

> I will say and I am still not a fan of the "if (first) else" logic in
> your attached patch.

I agree that my approach there was pretty ugly.

> How about we have the flags use a trailing comma and space and then
> overwrite the last one with something this:
>
> if (infobits & XLHL_KEYS_UPDATED)
> appendStringInfoString(buf, "KEYS_UPDATED, ");
> buf->data[buf->len -= strlen(", ")] = '\0';

I'll try something like that instead.

> -offsets = (OffsetNumber *) [xlrec->nplans];
> +offsets = (OffsetNumber *) ((char *) plans +
> +(xlrec->nplans *
> +sizeof(xl_heap_freeze_plan)));
>  appendStringInfoString(buf, ", plans:");
>  array_desc(buf, plans, sizeof(xl_heap_freeze_plan), 
> xlrec->nplans,
>_elem_desc, );

I thought that it made sense to match the FREEZE_PAGE REDO routine.

Another fairly arbitrary change, to be honest.

> > Note that the patch makes many individual (say) HOT_UPDATE records
> > have descriptions that look like this:
> >
> > ... old_infobits: [], ...
> >
> > This differs from HEAD, where the output is totally suppressed because
> > there are no flag bits to show. I think that this behavior is more
> > logical and consistent overall.
>
> Yea, I think it is better to include things and show that they are empty
> then omit them. I find it more clear.

Right. It makes sense for something like this, because generally
speaking the structures aren't nested in any real sense. They're also
very static -- WAL records have a fixed structure. So it's unlikely
that anybody is going to try to parse the description before knowing
which particular WAL record type (or perhaps types, plural) are
involved.

-- 
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-04-10 Thread Melanie Plageman

On Sun, Apr 9, 2023 at 8:12 PM Peter Geoghegan  wrote:
>
> On Fri, Apr 7, 2023 at 4:46 PM Peter Geoghegan  wrote:
> > Pushed that one too.
>
> I noticed that the nbtree VACUUM and DELETE record types have their
> update/xl_btree_update arrays output incorrectly. We cannot use the
> generic array_desc() approach with xl_btree_update elements, because
> they're variable-width elements. The problem is that array_desc() only deals
> with fixed-width elements.

You are right. I'm sorry for the rather egregious oversight.

I took a look at the first patch even though you've pushed the bugfix
part. Any reason you didn't use array_desc() for the inner array (of
"ptids")? I find that following the pattern of using array_desc (when it
is correct, of course!) helps me to quickly identify: "okay, this is an
array of x" without having to stare at the loop too much.

I will say that the prefix of p in "ptid" makes it sound like pointer to
a tid, which I don't believe is what you meant.

> I also changed some of the details around whitespace in arrays in the
> fixup patch (though I didn't do the same with objects). It doesn't
> seem useful to use so much whitespace for long arrays of integers
> (really page offset numbers). And I brought a few nbtree desc routines
> that still used ";" characters as punctuation in line with the new
> convention.

Cool.

> Finally, the patch revises the guidelines written for rmgr desc
> routine authors. I don't think that we need to describe how to handle
> outputting whitespace in detail. It'll be quite natural for other
> rmgrs to use existing facilities such as array_desc() themselves,
> which makes whitespace type inconsistencies unlikely. I've tried to
> make the limits of the guidelines clear. The main goal is to avoid
> gratuitous inconsistencies, and to provide a standard way of doing
> things that many different rmgrs are likely to want to do, again and
> again. But individual rmgrs still have a certain amount of discretion,
> which seems like a good thing to me (the alternative requires that we
> fix at least a couple of things in nbtdesc.c and in heapdesc.c, which
> doesn't seem useful to me).

I like the new guidelines you proposed (in the patch).
They are well-written and clear.


On Mon, Apr 10, 2023 at 3:18 PM Peter Geoghegan  wrote:
>
> On Sun, Apr 9, 2023 at 5:12 PM Peter Geoghegan  wrote:
> > I noticed that the nbtree VACUUM and DELETE record types have their
> > update/xl_btree_update arrays output incorrectly. We cannot use the
> > generic array_desc() approach with xl_btree_update elements, because
> > they're variable-width elements. The problem is that array_desc() only deals
> > with fixed-width elements.
>
> I pushed this fix just now, though without the updates to the
> guidelines (or only minimal updates).
>
> A remaining problem with arrays appears in "infobits" fields for
> record types such as LOCK. Here's an example of the problem:
>
> off: 34, xid: 3, flags: 0x00, infobits: [, LOCK_ONLY, EXCL_LOCK ]
>
> Clearly the punctuation from the array is malformed.

So, I did do this on purpose -- because I didn't want to have to do the
gymnastics to determine which flag was hit first (though it looks like I
mistakenly omitted the comma prepending IS_MULTI -- that was not
intentional).
I recognized that the output doesn't look nice, but I hadn't exactly
thought of it as malformed. Perhaps you are right.

I will say and I am still not a fan of the "if (first) else" logic in
your attached patch.

I've put my suggestion for how to do it instead inline with the code
diff below for clarity.

diff --git a/src/backend/access/rmgrdesc/heapdesc.c
b/src/backend/access/rmgrdesc/heapdesc.c
index 3bd083875..a64d14c2c 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -18,29 +18,75 @@
 #include "access/rmgrdesc_utils.h"

 static void
-out_infobits(StringInfo buf, uint8 infobits)
+infobits_desc(StringInfo buf, uint8 infobits, const char *keyname)
...
 if (infobits & XLHL_KEYS_UPDATED)
-appendStringInfoString(buf, ", KEYS_UPDATED");
+{
+if (first)
+appendStringInfoString(buf, "KEYS_UPDATED");
+else
+appendStringInfoString(buf, ", KEYS_UPDATED");
+first = false;
+}

How about we have the flags use a trailing comma and space and then
overwrite the last one with something this:

if (infobits & XLHL_KEYS_UPDATED)
appendStringInfoString(buf, "KEYS_UPDATED, ");
buf->data[buf->len -= strlen(", ")] = '\0';


@@ -230,7 +271,9 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 OffsetNumber *offsets;

I don't prefer this to what I had, which is also correct, right?

 plans = (xl_heap_freeze_plan *)
XLogRecGetBlockData(record, 0, NULL);
-offsets = (OffsetNumber *) [xlrec->nplans];
+offsets = (OffsetNumber *) ((char *) plans +
+(xlrec->nplans *
+

Re: Show various offset arrays for heap WAL records

2023-04-10 Thread Peter Geoghegan

On Sun, Apr 9, 2023 at 5:12 PM Peter Geoghegan  wrote:
> I noticed that the nbtree VACUUM and DELETE record types have their
> update/xl_btree_update arrays output incorrectly. We cannot use the
> generic array_desc() approach with xl_btree_update elements, because
> they're variable-width elements. The problem is that array_desc() only deals
> with fixed-width elements.

I pushed this fix just now, though without the updates to the
guidelines (or only minimal updates).

A remaining problem with arrays appears in "infobits" fields for
record types such as LOCK. Here's an example of the problem:

off: 34, xid: 3, flags: 0x00, infobits: [, LOCK_ONLY, EXCL_LOCK ]

Clearly the punctuation from the array is malformed.

A second issue (related to the first) is the name of the key itself,
"infobits". While "infobits" actually seems fine in this particular
example, I don't think that we want to do the same for record types
such as HEAP_UPDATE, since such records require that the description
show information about flags whose underlying field in the WAL record
struct is actually called "old_infobits_set". I think that we should
be outputting "old_infobits: [ ... ] " in the description of
HEAP_UPDATE records, which isn't the case right now.

A third issue is present in the nearby handling of xl_heap_truncate
status flags. It's the same basic array punctuation issue again, so
arguably this is the same issue as the first one.

Attached patch fixes all of these issues, and overhauls the guidelines
in the way originally proposed by the nbtree fix patch (since I didn't
keep that part of the nbtree patch when I pushed it today).

Note that the patch makes many individual (say) HOT_UPDATE records
have descriptions that look like this:

... old_infobits: [], ...

This differs from HEAD, where the output is totally suppressed because
there are no flag bits to show. I think that this behavior is more
logical and consistent overall.

-- 
Peter Geoghegan

v1-0001-Fix-heapdesc-infomask-array-output.patch
Description: Binary data

Re: Show various offset arrays for heap WAL records

2023-04-09 Thread Peter Geoghegan

On Fri, Apr 7, 2023 at 4:46 PM Peter Geoghegan  wrote:
> Pushed that one too.

I noticed that the nbtree VACUUM and DELETE record types have their
update/xl_btree_update arrays output incorrectly. We cannot use the
generic array_desc() approach with xl_btree_update elements, because
they're variable-width elements. The problem is that array_desc() only deals
with fixed-width elements.

I also changed some of the details around whitespace in arrays in the
fixup patch (though I didn't do the same with objects). It doesn't
seem useful to use so much whitespace for long arrays of integers
(really page offset numbers). And I brought a few nbtree desc routines
that still used ";" characters as punctuation in line with the new
convention.

Finally, the patch revises the guidelines written for rmgr desc
routine authors. I don't think that we need to describe how to handle
outputting whitespace in detail. It'll be quite natural for other
rmgrs to use existing facilities such as array_desc() themselves,
which makes whitespace type inconsistencies unlikely. I've tried to
make the limits of the guidelines clear. The main goal is to avoid
gratuitous inconsistencies, and to provide a standard way of doing
things that many different rmgrs are likely to want to do, again and
again. But individual rmgrs still have a certain amount of discretion,
which seems like a good thing to me (the alternative requires that we
fix at least a couple of things in nbtdesc.c and in heapdesc.c, which
doesn't seem useful to me).

--
Peter Geoghegan
From 44bfa575f5f28d1a8093e02b4a7993c1368cbeb5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sat, 8 Apr 2023 18:59:09 -0700
Subject: [PATCH v1] Fix WAL description of posting list updates.

---
 src/include/access/rmgrdesc_utils.h  |  41 ++-
 src/backend/access/rmgrdesc/heapdesc.c   |   2 +-
 src/backend/access/rmgrdesc/nbtdesc.c| 121 ++-
 src/backend/access/rmgrdesc/rmgrdesc_utils.c |  43 +--
 doc/src/sgml/pgwalinspect.sgml   |  34 +++---
 5 files changed, 123 insertions(+), 118 deletions(-)

diff --git a/src/include/access/rmgrdesc_utils.h b/src/include/access/rmgrdesc_utils.h
index 13552d6f3..16916e3e8 100644
--- a/src/include/access/rmgrdesc_utils.h
+++ b/src/include/access/rmgrdesc_utils.h
@@ -12,12 +12,49 @@
 #ifndef RMGRDESC_UTILS_H_
 #define RMGRDESC_UTILS_H_
 
+/*
+ * Guidelines for rmgr descriptor routine authors:
+ *
+ * The goal of these guidelines is to avoid gratuitous inconsistencies across
+ * each rmgr, and to allow users to parse desc output strings without too much
+ * difficulty.  This is not an API specification or an interchange format.
+ * (Only heapam and nbtree desc routines follow these guidelines at present,
+ * in any case.)
+ *
+ * Record descriptions are similar to JSON style key/value objects.  However,
+ * there is no explicit "string" type/string escaping.  Top-level { } brackets
+ * should be omitted.  For example:
+ *
+ * snapshotConflictHorizon: 0, flags: 0x03
+ *
+ * Record descriptions may contain variable-length arrays.  For example:
+ *
+ * nunused: 5, unused: [1, 2, 3, 4, 5]
+ *
+ * Nested objects are supported via { } brackets.  They generally appear
+ * inside variable-length arrays.  For example:
+ *
+ * ndeleted: 0, nupdated: 1, deleted: [], updated: [{ off: 45, nptids: 1, ptids: [0] }]
+ *
+ * Try to output things in an order that faithfully represents the order of
+ * things in the physical WAL record struct.  It's a good idea if the number
+ * of items in the array appears before the array.
+ *
+ * It's okay for individual WAL record types to invent their own conventions.
+ * For example, heapam's PRUNE records output the follow representation of
+ * redirects:
+ *
+ * ... redirected: [39->46], ...
+ *
+ * Arguably the PRUNE desc routine should be using object notation instead.
+ * This ad-hoc representation of redirects has the advantage of being terse in
+ * a context where that might matter a lot.
+ */
 extern void array_desc(StringInfo buf, void *array, size_t elem_size, int count,
 	   void (*elem_desc) (StringInfo buf, void *elem, void *data),
 	   void *data);
 extern void offset_elem_desc(StringInfo buf, void *offset, void *data);
 extern void redirect_elem_desc(StringInfo buf, void *offset, void *data);
-extern void relid_desc(StringInfo buf, void *relid, void *data);
-extern void uint16_elem_desc(StringInfo buf, void *value, void *data);
+extern void oid_elem_desc(StringInfo buf, void *relid, void *data);
 
 #endif			/* RMGRDESC_UTILS_H */
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 6dfd7d7d1..3bd083875 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -127,7 +127,7 @@ heap_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, ", nrelids: %u", xlrec->nrelids);
 		appendStringInfoString(buf, ", relids:");
 		array_desc(buf, xlrec->relids,

Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Peter Geoghegan

On Fri, Apr 7, 2023 at 4:21 PM Melanie Plageman
 wrote:
> It's come to my attention that I forgot to include the btree patch earlier.

Pushed that one too.

Also removed the use of the "restrict" keyword here.

Thanks
--
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Melanie Plageman

On Fri, Apr 7, 2023 at 7:09 PM Peter Geoghegan  wrote:
>
> On Fri, Apr 7, 2023 at 4:01 PM Melanie Plageman
>  wrote:
> > LGTM
>
> Pushed, thanks.

It's come to my attention that I forgot to include the btree patch earlier.

PFA
From 4f502b2513ba79d738e7ed87aaf7d18ed2a2e30f Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 13 Mar 2023 18:15:17 -0400
Subject: [PATCH v4 2/2] Add detail to some btree xlog record descs

Suggested by Peter Geoghegan

Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de
---
 src/backend/access/rmgrdesc/nbtdesc.c| 84 +---
 src/backend/access/rmgrdesc/rmgrdesc_utils.c |  6 ++
 src/include/access/rmgrdesc_utils.h  |  2 +
 3 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index c5dc543a0f..9ffece109d 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -15,6 +15,58 @@
 #include "postgres.h"
 
 #include "access/nbtxlog.h"
+#include "access/rmgrdesc_utils.h"
+
+static void btree_del_desc(StringInfo buf, char *block_data, uint16 ndeleted,
+		   uint16 nupdated);
+static void btree_update_elem_desc(StringInfo buf, void *restrict update, void
+   *restrict data);
+
+static void
+btree_del_desc(StringInfo buf, char *block_data, uint16 ndeleted, uint16 nupdated)
+{
+	OffsetNumber *updatedoffsets;
+	xl_btree_update *updates;
+	OffsetNumber *data = (OffsetNumber *) block_data;
+
+	appendStringInfoString(buf, ", deleted:");
+	array_desc(buf, data, sizeof(OffsetNumber), ndeleted, _elem_desc, NULL);
+
+	appendStringInfoString(buf, ", updated:");
+	array_desc(buf, data, sizeof(OffsetNumber), nupdated, _elem_desc, NULL);
+
+	if (nupdated <= 0)
+		return;
+
+	updatedoffsets = (OffsetNumber *)
+		((char *) data + ndeleted * sizeof(OffsetNumber));
+	updates = (xl_btree_update *) ((char *) updatedoffsets +
+   nupdated *
+   sizeof(OffsetNumber));
+
+	appendStringInfoString(buf, ", updates:");
+	array_desc(buf, updates, sizeof(xl_btree_update),
+			   nupdated, _update_elem_desc,
+			   );
+}
+
+static void
+btree_update_elem_desc(StringInfo buf, void *restrict update, void *restrict data)
+{
+	xl_btree_update *new_update = (xl_btree_update *) update;
+	OffsetNumber *updated_offset = *((OffsetNumber **) data);
+
+	appendStringInfo(buf, "{ updated offset: %u, ndeleted tids: %u", *updated_offset, new_update->ndeletedtids);
+
+	appendStringInfoString(buf, ", deleted tids:");
+
+	array_desc(buf, (char *) new_update + SizeOfBtreeUpdate,
+			   sizeof(uint16), new_update->ndeletedtids, _elem_desc, NULL);
+
+	updated_offset++;
+
+	appendStringInfo(buf, " }");
+}
 
 void
 btree_desc(StringInfo buf, XLogReaderState *record)
@@ -31,7 +83,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-appendStringInfo(buf, "off %u", xlrec->offnum);
+appendStringInfo(buf, "off: %u", xlrec->offnum);
 break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -39,7 +91,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-appendStringInfo(buf, "level %u, firstrightoff %d, newitemoff %d, postingoff %d",
+appendStringInfo(buf, "level: %u, firstrightoff: %d, newitemoff: %d, postingoff: %d",
  xlrec->level, xlrec->firstrightoff,
  xlrec->newitemoff, xlrec->postingoff);
 break;
@@ -48,31 +100,41 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
 
-appendStringInfo(buf, "nintervals %u", xlrec->nintervals);
+appendStringInfo(buf, "nintervals: %u", xlrec->nintervals);
 break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-appendStringInfo(buf, "ndeleted %u; nupdated %u",
+appendStringInfo(buf, "ndeleted: %u, nupdated: %u",
  xlrec->ndeleted, xlrec->nupdated);
+
+if (!XLogRecHasBlockImage(record, 0))
+	btree_del_desc(buf, XLogRecGetBlockData(record, 0, NULL),
+   xlrec->ndeleted, xlrec->nupdated);
+
 break;
 			}
 		case XLOG_BTREE_DELETE:
 			{
 xl_btree_delete *xlrec = (xl_btree_delete *) rec;
 
-appendStringInfo(buf, "snapshotConflictHorizon %u; ndeleted %u; nupdated %u",
+appendStringInfo(buf, "snapshotConflictHorizon: %u, ndeleted: %u, nupdated: %u",
  xlrec->snapshotConflictHorizon,
  xlrec->ndeleted, xlrec->nupdated);
+
+if (!XLogRecHasBlockImage(record, 0))
+	btree_del_desc(buf, XLogRecGetBlockData(record, 0, NULL),
+   xlrec->ndeleted, xlrec->nupdated);
+
 break;
 			}
 		case XLOG_BTREE_MARK_PAGE_HALFDEAD:
 			{
 xl_btree_mark_page_halfdead *xlrec = (xl_btree_mark_page_halfdead *) rec;
 
-appendStringInfo(buf, "topparent %u; leaf %u; left %u; right %u",
+appendStringInfo(buf, "topparent: %u; leaf: %u; left: %u;

Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Peter Geoghegan

On Fri, Apr 7, 2023 at 4:01 PM Melanie Plageman
 wrote:
> LGTM

Pushed, thanks.

-- 
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Melanie Plageman

On Fri, Apr 7, 2023 at 5:43 PM Peter Geoghegan  wrote:
>
> On Fri, Apr 7, 2023 at 1:33 PM Melanie Plageman
>  wrote:
> > Attached v3 is cleaned up and includes a pg_walinspect docs update as
> > well as some edited comments in rmgr_utils.c
>
> Attached v4 has some small tweaks on your v3. Mostly just whitespace
> tweaks. Two slightly notable tweaks:
>
> * I changed the approach to globbing in the Makefile, rather than use
> your original overwide formulation for the new rmgrdesc_utils.c file.
>
> What do you think of this approach?

Seems fine.

> * Removed use of the restrict keyword.
>
> While "restrict" is C99, I'm not completely sure that it's totally
> supported by Postgres. I'm a bit surprised that you opted to use it in
> this particular patch.
>
> I meant to ask you about this earlier...why use restrict in this patch?


So, I think the signature I meant to have was:

void
array_desc(StringInfo buf, void *array, size_t elem_size, int count,
  void (*elem_desc) (StringInfo buf, const void *elem, void *data),
  void *data)

Basically I wanted to indicate that elem was not and should not be
modified and data can be modified but that they should not be the same
element or overlap at all.

> > I've added such an example to pg_walinspect docs.
>
> There already was a PRUNE example, though -- for the
> pg_get_wal_record_info function (singular, not to be confused with
> pg_get_wal_records_info).
>
> v4 makes the example a VACUUM record, which replaces the previous
> pg_get_wal_record_info PRUNE example -- that needed to be updated
> anyway. This approach has the advantage of not being too verbose,
> which still showing some of this kind of detail.
>
> This has the advantage of allowing pg_get_wal_records_info's example
> to continue to be an example that lacks a block reference (and so has
> a NULL block_ref). This is a useful contrast against the new
> pg_get_wal_block_info function.

LGTM

- Melanie

Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Peter Geoghegan

On Fri, Apr 7, 2023 at 1:33 PM Melanie Plageman
 wrote:
> Attached v3 is cleaned up and includes a pg_walinspect docs update as
> well as some edited comments in rmgr_utils.c

Attached v4 has some small tweaks on your v3. Mostly just whitespace
tweaks. Two slightly notable tweaks:

* I changed the approach to globbing in the Makefile, rather than use
your original overwide formulation for the new rmgrdesc_utils.c file.

What do you think of this approach?

* Removed use of the restrict keyword.

While "restrict" is C99, I'm not completely sure that it's totally
supported by Postgres. I'm a bit surprised that you opted to use it in
this particular patch.

I meant to ask you about this earlier...why use restrict in this patch?

> I've added such an example to pg_walinspect docs.

There already was a PRUNE example, though -- for the
pg_get_wal_record_info function (singular, not to be confused with
pg_get_wal_records_info).

v4 makes the example a VACUUM record, which replaces the previous
pg_get_wal_record_info PRUNE example -- that needed to be updated
anyway. This approach has the advantage of not being too verbose,
which still showing some of this kind of detail.

This has the advantage of allowing pg_get_wal_records_info's example
to continue to be an example that lacks a block reference (and so has
a NULL block_ref). This is a useful contrast against the new
pg_get_wal_block_info function.

> I really like this idea and would find it useful. I reviewed the patch
> and tried it out and it worked for me and code looked fine as well.
>
> I didn't include it in the attached patchset because I don't feel
> confident enough in my own understanding of any potential implications
> of splitting up these record types to definitively endorse it. But, if
> someone else felt comfortable with it, I would like to see it in the
> tree.

I'm not going to move on it now for 16, given the lack of feedback about it.

-- 
Peter Geoghegan

v4-0001-Add-rmgr_desc-utilities.patch
Description: Binary data

Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Melanie Plageman

Attached v3 is cleaned up and includes a pg_walinspect docs update as
well as some edited comments in rmgr_utils.c

On Mon, Mar 27, 2023 at 6:27 PM Peter Geoghegan  wrote:
>
> On Mon, Mar 27, 2023 at 2:29 PM Melanie Plageman
>  wrote:
> > I went to add dedup records and noticed that since the actual
> > BTDedupInterval struct is what is put in the xlog, I would need access
> > to that type from nbtdesc.c, however, including nbtree.h doesn't seem to
> > work because it includes files that cannot be included in frontend code.
>
> I suppose that the BTDedupInterval struct could have just as easily
> gone in nbtxlog.h, next to xl_btree_dedup. There might have been a
> moment where I thought about doing it that way, but I guess I found it
> slightly preferable to use that symbol name (BTDedupInterval) rather
> than (say) xl_btree_dedup_interval in places like the nearby
> BTDedupStateData struct.
>
> Actually, I suppose that it's hard to make that alternative work, at
> least without
> including nbtxlog.h in nbtree.h. Which sounds wrong.
>
> > I, of course, could make some local struct in nbtdesc.c which has an
> > OffsetNumber and a uint16, since the BTDedupInterval is pretty
> > straightforward, but that seems a bit annoying.
> > I'm probably missing something obvious, but is there a better way to do
> > this?
>
> It was probably just one of those cases where I settled on the
> arrangement that looked least odd overall. Not a particularly
> principled approach. But the approach that I'm going to take once more
> here.  ;-)
>
> All of the available alternatives are annoying in roughly the same
> way, though perhaps to varying degrees. All except one: I'm okay with
> just not adding coverage for deduplication records, for the time being
> -- just seeing the number of intervals alone is relatively informative
> with deduplication records, unlike (say) nbtree delete records. I'm
> also okay with having coverage for dedup records if you feel it's
> worth having. Your call.
>
> If we're going to have coverage for deduplication records then it
> seems to me that we have to have a struct in nbtxlog.h for your code
> to work off of. It also seems likely that we'll want to use that same
> struct within nbtxlog.c. What's less clear is what that means for the
> BTDedupInterval struct. I don't think that we should include nbtxlog.h
> in nbtree.h, nor should we do the converse.
>
> I guess maybe two identical structs would be okay. BTDedupInterval,
> and xl_btree_dedup_interval, with the former still used in nbtdedup.c,
> and the latter used through a pointer at the point that nbtxlog.c
> reads a dedup record. Then maybe at a sizeof() static assert beside
> the existing btree_xlog_dedup() assertions that check that the dedup
> state interval array matches the array taken from the WAL record.
> That's still a bit weird, but I find it preferable to any alternative
> that I can think of.

I've omitted enhancements for the dedup record type for now.

> > On another note, I've thought about how to include some example output
> > in docs, and, for example we could modify the example output in the
> > pgwalinspect docs which includes a PRUNE record already for
> > pg_get_wal_record_info() docs. We'd probably just want to keep it short.
>
> Yeah. Perhaps a PRUNE record for one of the system catalogs whose
> relfilenode is relatively recognizable. Say pg_class. It probably
> doesn't matter that much, but there is perhaps some small value in
> picking an example that is relatively easy to recreate later on (or to
> approximately recreate). I'm certainly not insisting on that, though.

I've added such an example to pg_walinspect docs.

On Tue, Mar 21, 2023 at 6:37 PM Peter Geoghegan  wrote:
>
> On Mon, Mar 13, 2023 at 6:41 PM Peter Geoghegan  wrote:
> > There are several different things that seem important to me
> > personally. These are in tension with each other, to a degree. These
> > are:
> >
> > 1. Like Andres, I'd really like to have some way of inspecting things
> > like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant
> > detail. These record types happen to be very important in general, and
> > the ability to see detailed information about the WAL record would
> > definitely help with some debugging scenarios. I've really missed
> > stuff like this while debugging serious issues under time pressure.
>
> One problem that I often run into when performing analysis of VACUUM
> using pg_walinspect is the issue of *who* pruned which heap page, for
> any given PRUNE record. Was it VACUUM/autovacuum, or was it
> opportunistic pruning? There is no way of knowing for sure right now.
> You *cannot* rely on an xid of 0 as an indicator of a given PRUNE
> record coming from VACUUM; it could just have been an opportunistic
> prune operation that happened to take place when a SELECT query ran,
> before any XID was ever allocated.
>
> I think that we should do something like the attached, to completely
> avoid this ambiguity.

Re: Show various offset arrays for heap WAL records

2023-03-27 Thread Peter Geoghegan

On Mon, Mar 27, 2023 at 2:29 PM Melanie Plageman
 wrote:
> I went to add dedup records and noticed that since the actual
> BTDedupInterval struct is what is put in the xlog, I would need access
> to that type from nbtdesc.c, however, including nbtree.h doesn't seem to
> work because it includes files that cannot be included in frontend code.

I suppose that the BTDedupInterval struct could have just as easily
gone in nbtxlog.h, next to xl_btree_dedup. There might have been a
moment where I thought about doing it that way, but I guess I found it
slightly preferable to use that symbol name (BTDedupInterval) rather
than (say) xl_btree_dedup_interval in places like the nearby
BTDedupStateData struct.

Actually, I suppose that it's hard to make that alternative work, at
least without
including nbtxlog.h in nbtree.h. Which sounds wrong.

> I, of course, could make some local struct in nbtdesc.c which has an
> OffsetNumber and a uint16, since the BTDedupInterval is pretty
> straightforward, but that seems a bit annoying.
> I'm probably missing something obvious, but is there a better way to do
> this?

It was probably just one of those cases where I settled on the
arrangement that looked least odd overall. Not a particularly
principled approach. But the approach that I'm going to take once more
here.  ;-)

All of the available alternatives are annoying in roughly the same
way, though perhaps to varying degrees. All except one: I'm okay with
just not adding coverage for deduplication records, for the time being
-- just seeing the number of intervals alone is relatively informative
with deduplication records, unlike (say) nbtree delete records. I'm
also okay with having coverage for dedup records if you feel it's
worth having. Your call.

If we're going to have coverage for deduplication records then it
seems to me that we have to have a struct in nbtxlog.h for your code
to work off of. It also seems likely that we'll want to use that same
struct within nbtxlog.c. What's less clear is what that means for the
BTDedupInterval struct. I don't think that we should include nbtxlog.h
in nbtree.h, nor should we do the converse.

I guess maybe two identical structs would be okay. BTDedupInterval,
and xl_btree_dedup_interval, with the former still used in nbtdedup.c,
and the latter used through a pointer at the point that nbtxlog.c
reads a dedup record. Then maybe at a sizeof() static assert beside
the existing btree_xlog_dedup() assertions that check that the dedup
state interval array matches the array taken from the WAL record.
That's still a bit weird, but I find it preferable to any alternative
that I can think of.

> On another note, I've thought about how to include some example output
> in docs, and, for example we could modify the example output in the
> pgwalinspect docs which includes a PRUNE record already for
> pg_get_wal_record_info() docs. We'd probably just want to keep it short.

Yeah. Perhaps a PRUNE record for one of the system catalogs whose
relfilenode is relatively recognizable. Say pg_class. It probably
doesn't matter that much, but there is perhaps some small value in
picking an example that is relatively easy to recreate later on (or to
approximately recreate). I'm certainly not insisting on that, though.

--
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-03-27 Thread Melanie Plageman

On Mon, Mar 13, 2023 at 9:41 PM Peter Geoghegan  wrote:
> On Mon, Mar 13, 2023 at 4:01 PM Melanie Plageman  
> wrote:
>
> > I have added detail to xl_btree_delete and xl_btree_vacuum. I have added
> > the updated/deleted target offset numbers and the updated tuples
> > metadata.
> >
> > I wondered if there was any reason to do xl_btree_dedup deduplication
> > intervals.
>
> No reason. It wouldn't be hard to cover xl_btree_dedup deduplication
> intervals -- each element is a page offset number, and a corresponding
> count of index tuples to merge together in the REDO routine. That's
> slightly different to anything else, but not in a way that seems like
> it requires very much additional effort.

I went to add dedup records and noticed that since the actual
BTDedupInterval struct is what is put in the xlog, I would need access
to that type from nbtdesc.c, however, including nbtree.h doesn't seem to
work because it includes files that cannot be included in frontend code.

I, of course, could make some local struct in nbtdesc.c which has an
OffsetNumber and a uint16, since the BTDedupInterval is pretty
straightforward, but that seems a bit annoying.
I'm probably missing something obvious, but is there a better way to do
this?

On another note, I've thought about how to include some example output
in docs, and, for example we could modify the example output in the
pgwalinspect docs which includes a PRUNE record already for
pg_get_wal_record_info() docs. We'd probably just want to keep it short.

- Melanie

Re: Show various offset arrays for heap WAL records

2023-03-21 Thread Peter Geoghegan

On Tue, Mar 21, 2023 at 3:37 PM Peter Geoghegan  wrote:
> One problem that I often run into when performing analysis of VACUUM
> using pg_walinspect is the issue of *who* pruned which heap page, for
> any given PRUNE record. Was it VACUUM/autovacuum, or was it
> opportunistic pruning? There is no way of knowing for sure right now.
> You *cannot* rely on an xid of 0 as an indicator of a given PRUNE
> record coming from VACUUM; it could just have been an opportunistic
> prune operation that happened to take place when a SELECT query ran,
> before any XID was ever allocated.

In case it's unclear how much of a problem this can be, here's an example:

The misc.sql regression test does a bulk update of the table "onek". A
little later, one of the queries that appears under the section "copy"
from the same file SELECTs from "onek". This produces a succession of
opportunistic prune records that look exactly like what you'd expect from
a VACUUM when viewed through pg_walinspect (without this patch). Each
PRUNE record has XID 0. The records appear in ascending heap block
number order, since there is a sequential scan involved (we go through
heapgetpage() to get to heap_page_prune_opt(), where the query prunes
opportunistically).

Another slightly surprising fact revealed by the patch is the ratio of
opportunistic prunes ("Heap2/PRUNE") to prunes run during VACUUM
("Heap2/PRUNE+BYVACUUM") with the regression tests:

│ resource_manager/record_type │ Heap2/PRUNE │
│ count│ 4,521   │
│ count_perc   │ 0.220   │
│ rec_size │ 412,442 │
│ avg_rec_size │ 91  │
│ rec_size_perc│ 0.194   │
│ fpi_size │ 632,828 │
│ fpi_size_perc│ 1.379   │
│ combined_size│ 1,045,270   │
│ combined_size_perc   │ 0.404   │
├─[ RECORD 61 ]┼─┤
│ resource_manager/record_type │ Heap2/PRUNE+BYVACUUM│
│ count│ 2,784   │
│ count_perc   │ 0.135   │
│ rec_size │ 467,057 │
│ avg_rec_size │ 167 │
│ rec_size_perc│ 0.219   │
│ fpi_size │ 546,344 │
│ fpi_size_perc│ 1.190   │
│ combined_size│ 1,013,401   │
│ combined_size_perc   │ 0.391   │
├─[ RECORD 62 ]┼─┤
│ resource_manager/record_type │ Heap2/VACUUM│
│ count│ 3,463   │
│ count_perc   │ 0.168   │
│ rec_size │ 610,038 │
│ avg_rec_size │ 176 │
│ rec_size_perc│ 0.286   │
│ fpi_size │ 893,964 │
│ fpi_size_perc│ 1.948   │
│ combined_size│ 1,504,002   │
│ combined_size_perc   │ 0.581   │
├─[ RECORD 63 ]┼─┤
│ resource_manager/record_type │ Heap2/VISIBLE   │
│ count│ 7,293   │
│ count_perc   │ 0.354   │
│ rec_size │ 431,382 │
│ avg_rec_size │ 59  │
│ rec_size_perc│ 0.202   │
│ fpi_size │ 1,794,048   │
│ fpi_size_perc│ 3.909   │
│ combined_size│ 2,225,430   │
│ combined_size_perc   │ 0.859   │


--
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-03-21 Thread Peter Geoghegan

On Mon, Mar 13, 2023 at 6:41 PM Peter Geoghegan  wrote:
> There are several different things that seem important to me
> personally. These are in tension with each other, to a degree. These
> are:
>
> 1. Like Andres, I'd really like to have some way of inspecting things
> like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant
> detail. These record types happen to be very important in general, and
> the ability to see detailed information about the WAL record would
> definitely help with some debugging scenarios. I've really missed
> stuff like this while debugging serious issues under time pressure.

One problem that I often run into when performing analysis of VACUUM
using pg_walinspect is the issue of *who* pruned which heap page, for
any given PRUNE record. Was it VACUUM/autovacuum, or was it
opportunistic pruning? There is no way of knowing for sure right now.
You *cannot* rely on an xid of 0 as an indicator of a given PRUNE
record coming from VACUUM; it could just have been an opportunistic
prune operation that happened to take place when a SELECT query ran,
before any XID was ever allocated.

I think that we should do something like the attached, to completely
avoid this ambiguity. This patch adds a new XLOG_HEAP2 bit that's
similar to XLOG_HEAP_INIT_PAGE -- XLOG_HEAP2_BYVACUUM. This allows all
XLOG_HEAP2 record types to indicate that they took place during
VACUUM, by XOR'ing the flag with the record type/info when
XLogInsert() is called. For now this is only used by PRUNE records.
Tools like pg_walinspect will report a separate "Heap2/PRUNE+BYVACUUM"
record_type, as well as the unadorned Heap2/PRUNE record_type, which
we'll now know must have been opportunistic pruning.

The approach of using a bit in the style of the heapam init bit makes
sense to me, because the bit is available, and works in a way that is
minimally invasive. Also, one can imagine needing to resolve a similar
ambiguity in the future, when (say) opportunistic freezing is added.

I think that it makes sense to treat this within the scope of
Melanie's ongoing work to improve the instrumentation of these records
-- meaning that it's in scope for Postgres 16. Admittedly this is a
slightly creative interpretation, so if others disagree then I won't
argue. This is quite a small patch, though, which makes debugging
significantly easier. I think that there could be a great deal of
utility in being able to easily "pair up" corresponding
"Heap2/PRUNE+BYVACUUM" and "Heap2/VACUUM" records in debugging
scenarios. I can imagine linking these to "Heap2/FREEZE_PAGE" and
"Heap2/VISIBLE" records, too, since they're all closely related record
types.

--
Peter Geoghegan

v1-0001-Record-which-PRUNE-records-are-from-VACUUM.patch
Description: Binary data

Re: Show various offset arrays for heap WAL records

2023-03-13 Thread Peter Geoghegan

On Mon, Mar 13, 2023 at 4:01 PM Melanie Plageman
 wrote:
> On Fri, Jan 27, 2023 at 3:02 PM Robert Haas  wrote:
> > I'm not sure what's best in terms of formatting details but I
> > definitely like the idea of making pg_waldump show more details.

> If I'm not mistaken, this would be quite difficult without changing
> rm_desc to return some kind of self-describing data type.

I'd say that it would depend on how far you went with it. Basic
information about the tuple wouldn't require any of that. I suggest
leaving this part out for now, though.

> So, we can scrap any README or big comment, but are there other changes
> to the code or structure you think would avoid it being seen as an
> API?

I think that it would be good to try to build something that looks
like an API, while making zero promises about its stability -- at
least until further notice. Kind of like how there are no guarantees
about the stability of internal interfaces within the Linux kernel.

There is no reason to not take a firm position on some things now.
Things like punctuation, and symbol names for generic cross-record
symbols like snapshotConflictHorizon. Many of the differences that
exist now are wholly gratuitous -- just accidents. It would make sense
to standardize-away these clearly unnecessary variations. And to
document the new standard. I'd be surprised if anybody disagreed with
me on this point.

> I have added detail to xl_btree_delete and xl_btree_vacuum. I have added
> the updated/deleted target offset numbers and the updated tuples
> metadata.
>
> I wondered if there was any reason to do xl_btree_dedup deduplication
> intervals.

No reason. It wouldn't be hard to cover xl_btree_dedup deduplication
intervals -- each element is a page offset number, and a corresponding
count of index tuples to merge together in the REDO routine. That's
slightly different to anything else, but not in a way that seems like
it requires very much additional effort.

> I wanted to include at least a minimal example for those following along
> with this thread that would cause creation of one of the record types
> which I have enhanced, but I had a little trouble making a reliable
> example.
>
> Below is my strategy for getting a Heap PRUNE record with redirects, but
> it occasionally doesn't end up working and I wasn't sure why (I can do
> more investigation if we think that having some kind of test for this is
> useful).

I'm not sure, but offhand I think that there could be a number of
annoying little implementation details that make it hard to come up
with a perfectly reliable test case. Perhaps try it while using VACUUM
VERBOSE, with the proviso that we should only expect the revised
example workflow to show a redirect record as intended when the
VERBOSE output confirms that VACUUM actually ran as expected, in
whatever way. For example, VACUUM can't have failed to acquire a
cleanup lock on a heap page due to the current phase of the moon.
VACUUM shouldn't have its "removable cutoff" held back by
who-knows-what when the test case is run, either.

Some of the tests for VACUUM use a temp table, since they conveniently
cannot have their "removable cutoff" held back -- not since commit
a7212be8. Of course, that strategy won't help you here. Getting VACUUM
to behave very predictably for testing purposes has proven tricky at
times.

> > I agree, in general, though long term the best approach is one that
> > has a configurable level of verbosity, with some kind of roughly
> > uniform definition of verbosity (kinda like DEBUG1 - DEBUG5, though
> > probably with only 2 or 3 distinct levels).
>
> Given this comment and Robert's concern quoted below, I am wondering if
> the consensus is that a lack of verbosity control is a dealbreaker for
> adding offsets or not.

There are several different things that seem important to me
personally. These are in tension with each other, to a degree. These
are:

1. Like Andres, I'd really like to have some way of inspecting things
like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant
detail. These record types happen to be very important in general, and
the ability to see detailed information about the WAL record would
definitely help with some debugging scenarios. I've really missed
stuff like this while debugging serious issues under time pressure.

2. To a lesser extent I would like to see similar detailed information
for nbtree's DELETE, VACUUM, and possibly DEDUPLICATE record types.
They might also come in handy for debugging, in about the same way.

3. More manageable verbosity.

I think that it would be okay to put off coming up with a solution to
3, for now. 1 and 2 seem more important than 3.

> I think if there was a more structured output of rmgrdesc, then this
> would also solve the verbosity level problem. Consumers could decide on
> their verbosity level -- in various pg_walinspect function outputs, that
> would probably just be column selection. For pg_waldump, I imagine that
> some kind of

Re: Show various offset arrays for heap WAL records

2023-03-13 Thread Melanie Plageman

Thanks for the various perspectives and feedback.

Attached v2 has additional info for xl_btree_vacuum and xl_btree_delete.

I've quoted various emails by various senders below and replied.

On Fri, Jan 27, 2023 at 3:02 PM Robert Haas  wrote:
>
> On Fri, Jan 27, 2023 at 12:24 PM Melanie Plageman
>  wrote:
> > I believe I have addressed this in the attached patch.
>
> I'm not sure what's best in terms of formatting details but I
> definitely like the idea of making pg_waldump show more details. I'd
> even like to have a way to extract the tuple data, when it's
> operations on tuples and we have those tuples in the payload. That'd
> be a lot more verbose than what you are doing here, though, and I'm
> not saying you should go do it right now or anything like that.

If I'm not mistaken, this would be quite difficult without changing
rm_desc to return some kind of self-describing data type.

On Tue, Jan 31, 2023 at 4:52 PM Peter Geoghegan  wrote:
> On Fri, Jan 27, 2023 at 9:24 AM Melanie Plageman
>  wrote:
> > I started documenting it in the rmgr_utils.h header file in a comment,
> > however it may be worth a README?
> >
> > I haven't polished this description of the format (or added examples,
> > etc) or used it in the btree-related functions because I assume the
> > format and helper function API will need more discussion.
>
> I think that standardization is good, but ISTM that we need clarity on
> what the scope is -- what is *not* being standardized? It may or may
> not be useful to call the end result an API. Or it may not make sense
> to do so in the first committed version, even though we may ultimately
> end up as something that deserves to be called an API. The obligation
> to not break tools that are scraping the output in whatever way seems
> kind of onerous right now -- just not having any gratuitous
> inconsistencies (e.g., fixing totally inconsistent punctuation, making
> the names for fields across WAL records consistent when they serve
> exactly the same purpose) would be a big improvement.

So, we can scrap any README or big comment, but are there other changes
to the code or structure you think would avoid it being seen as an
API?

> As I mentioned in passing already, I actually don't think that the
> B-Tree WAL records are all that special, as far as this stuff goes.
> For example, the DELETE Btree record type is very similar to heapam's
> PRUNE record type, and practically identical to Btree's VACUUM record
> type. All of these record types use the same basic conventions, like a
> snapshotConflictHorizon field for recovery conflicts (which is
> generated in a very similar way during original execution, and
> processed in precisely the same way during REDO), and arrays of page
> offset numbers sorted in ascending order.
>
> There are some remaining details where things from an index AM WAL
> record aren't directly analogous (or pretty much identical) to some
> other heapam WAL records, such as the way that the DELETE Btree record
> type deals with deleting a subset of TIDs from a posting list index
> tuple (generated by B-Tree deduplication). But even these exceptions
> don't require all that much discussion. You could either choose to
> only display the array of deleted index tuple page offset numbers, as
> well as the similar array of "updated" index tuple page offset numbers
> from xl_btree_delete, in which case you just display two arrays of
> page offset numbers, in the same standard way. You may or may not want
> to also show each individual xl_btree_update entry -- doing so would
> be kinda like showing the details of individual freeze plans, except
> that you'd probably display something very similar to the page offset
> number display here too (even though these aren't page offset numbers,
> they're 0-based offsets into the posting list's item pointer data
> array).

I have added detail to xl_btree_delete and xl_btree_vacuum. I have added
the updated/deleted target offset numbers and the updated tuples
metadata.

I wondered if there was any reason to do xl_btree_dedup deduplication
intervals.

> BTW, there is also a tendency for non-btree index AM WAL records to be
> fairly similar or even near-identical to the B-Tree WAL records. While
> Hash indexes are very different to B-Tree indexes at a high level, it
> is nevertheless the case that xl_hash_vacuum_one_page is directly
> based on xl_btree_delete/xl_btree_vacuum, and that xl_hash_insert is
> directly based on xl_btree_insert. There are some other WAL record
> types that are completely different across hash and B-Tree, which is a
> reflection of the fact that the index grows using a totally different
> approach in each AM -- but that doesn't seem like something that
> throws up any roadblocks for you (these can all be displayed as simple
> structs anyway).

I chose not to take on any other index types until I saw if this was viable.

> > Perhaps there should also be example output of the offset arrays in
> > pgwalinspect docs?
>

Re: Show various offset arrays for heap WAL records

2023-03-02 Thread Peter Eisentraut


On 01.03.23 17:11, Melanie Plageman wrote:

diff --git a/contrib/pg_walinspect/pg_walinspect--1.0.sql 
b/contrib/pg_walinspect/pg_walinspect--1.0.sql
index 08b3dd5556..eb8ff82dd8 100644
--- a/contrib/pg_walinspect/pg_walinspect--1.0.sql
+++ b/contrib/pg_walinspect/pg_walinspect--1.0.sql
@@ -17,7 +17,7 @@ CREATE FUNCTION pg_get_wal_record_info(IN in_lsn pg_lsn,
  OUT main_data_length int4,
  OUT fpi_length int4,
  OUT description text,
-OUT block_ref text
+OUT block_ref int4[][]
  )
  AS 'MODULE_PATHNAME', 'pg_get_wal_record_info'
  LANGUAGE C STRICT PARALLEL SAFE;


A change like this would require a new extension version and an upgrade 
script.


I suppose it's ok to postpone that work while the actual meat of the 
patch is still being worked out, but I figured I'd mention it in case it 
wasn't considered yet.

Re: Show various offset arrays for heap WAL records

2023-03-01 Thread Melanie Plageman

On Tue, Jan 31, 2023 at 5:48 PM Peter Geoghegan  wrote:
>
> On Tue, Jan 31, 2023 at 1:52 PM Peter Geoghegan  wrote:
> > > I would also like to see functions like XLogRecGetBlockRefInfo() pass
> > > something more useful than a stringinfo buffer so that we could easily
> > > extract out the relfilenode in pgwalinspect.
> >
> > That does seem particularly important. It's a pain to do this from
> > SQL. In general I'm okay with focussing on pg_walinspect over
> > pg_waldump, since it'll become more important over time. Obviously
> > pg_waldump needs to still work, but I think it's okay to care less
> > about pg_waldump usability.
>
> I just realized why you mentioned XLogRecGetBlockRefInfo() -- it
> probably shouldn't even be used by pg_walinspect at all (just by
> pg_waldump). Using something like XLogRecGetBlockRefInfo() within
> pg_walinspect misses out on the opportunity to output information in a
> more descriptive tuple format, with real data types. It's not just the
> relfilenode, either -- it's the block numbers themselves. And the fork
> number.
>
> In other words, I suspect that this is out of scope for this patch,
> strictly speaking. We simply shouldn't be using
> XLogRecGetBlockRefInfo() in pg_walinspect in the first place. Rather,
> pg_walinspect should be calling some other function that ultimately
> allows the user to work with (say) an array of int8 from SQL for the
> block numbers. There is no great reason not to, AFAICT, since this
> information is completely generic -- it's not like the rmgr-specific
> output from GetRmgr(), where fine grained type information is just a
> nice-to-have, with usability issues of its own (on account of the
> details being record type specific).

Something like the attached?

start_lsn| 0/19823390
end_lsn  | 0/19824360
prev_lsn | 0/19821358
xid  | 1355
resource_manager | Heap
record_type  | UPDATE
record_length| 4021
main_data_length | 14
fpi_length   | 3948
description  | off 11 xmax 1355 flags 0x00 ; new off 109 xmax 0
block_ref|
[0:1][0:8]={{0,1663,5,17033,0,442,460,4244,0},{1,1663,5,17033,0,0,0,0,0}}

It is a bit annoying not to have information about what each block_ref
item in the array represents (previously in the string), so maybe the
format in the attached shouldn't be a replacement for what is already
displayed by pg_get_wal_records_info() and friends.

It could instead be a new function which returns information in this
format -- perhaps tuples with separate columns for each labeled block
ref field denormalized to repeat the wal record info for every block?

The one piece of information I didn't include in the new block_ref
columns is the compression type (since it is a string). Since I used the
forknum value instead of the forknum name, maybe it is defensible to
also provide a documented int value for the compression type and make
that an int too?

- Melanie
From 8e2308b6b9047c42d3906943dec3fd11e975ef72 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 1 Mar 2023 10:53:41 -0500
Subject: [PATCH v1] Return block_ref details as an array

In pg_walinspect, the block_ref column was a string, making it difficult
to access individual members without regular expressions. Now it is an
int array of int arrays for each block with the relevant information.

Note that in the case of a compressed block image, the compression type
is not listed as it is a string.
---
 contrib/pg_walinspect/pg_walinspect--1.0.sql |   6 +-
 contrib/pg_walinspect/pg_walinspect.c| 135 +++
 2 files changed, 114 insertions(+), 27 deletions(-)

diff --git a/contrib/pg_walinspect/pg_walinspect--1.0.sql b/contrib/pg_walinspect/pg_walinspect--1.0.sql
index 08b3dd5556..eb8ff82dd8 100644
--- a/contrib/pg_walinspect/pg_walinspect--1.0.sql
+++ b/contrib/pg_walinspect/pg_walinspect--1.0.sql
@@ -17,7 +17,7 @@ CREATE FUNCTION pg_get_wal_record_info(IN in_lsn pg_lsn,
 OUT main_data_length int4,
 OUT fpi_length int4,
 OUT description text,
-OUT block_ref text
+OUT block_ref int4[][]
 )
 AS 'MODULE_PATHNAME', 'pg_get_wal_record_info'
 LANGUAGE C STRICT PARALLEL SAFE;
@@ -40,7 +40,7 @@ CREATE FUNCTION pg_get_wal_records_info(IN start_lsn pg_lsn,
 OUT main_data_length int4,
 OUT fpi_length int4,
 OUT description text,
-OUT block_ref text
+OUT block_ref int4[][]
 )
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'pg_get_wal_records_info'
@@ -63,7 +63,7 @@ CREATE FUNCTION pg_get_wal_records_info_till_end_of_wal(IN start_lsn pg_lsn,
 OUT main_data_length int4,
 OUT fpi_length int4,
 OUT description text,
-OUT block_ref text
+OUT block_ref int4[][]
 )
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'pg_get_wal_records_info_till_end_of_wal'
diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index b7b0a805ee..446449bbe1 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@

Re: Show various offset arrays for heap WAL records

2023-02-01 Thread Robert Haas

On Wed, Feb 1, 2023 at 12:47 PM Peter Geoghegan  wrote:
> On Wed, Feb 1, 2023 at 5:20 AM Robert Haas  wrote:
> > If we're dumping a lot of details out of each WAL record, we might
> > want to switch to a multi-line format of some kind. No one enjoys a
> > 460-character wide line, let alone 46000.
>
> I generally prefer it when I can use psql without using expanded table
> format mode, and without having to use a pager. Of course that isn't
> always possible, but it often is. I just don't think that that's going
> to become feasible with pg_walinspect queries any time soon, since it
> really requires a comprehensive strategy to deal with the issue of
> verbosity.

Well, if we're thinking of making the output a lot more verbose, it
seems like we should at least do a bit of brainstorming about what
that strategy could be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Show various offset arrays for heap WAL records

2023-02-01 Thread Peter Geoghegan

On Wed, Feb 1, 2023 at 5:20 AM Robert Haas  wrote:
> If we're dumping a lot of details out of each WAL record, we might
> want to switch to a multi-line format of some kind. No one enjoys a
> 460-character wide line, let alone 46000.

I generally prefer it when I can use psql without using expanded table
format mode, and without having to use a pager. Of course that isn't
always possible, but it often is. I just don't think that that's going
to become feasible with pg_walinspect queries any time soon, since it
really requires a comprehensive strategy to deal with the issue of
verbosity.

It seems practically mandatory to use a pager when running
pg_walinspect queries in psql right now -- pspg is good for this. I
really can't use expanded table mode here, since it obscures the
relationship between adjoining records. I'm usually looking through
rows/records in LSN order, and want to be able to easily compare the
LSNs (or other details) of groups of adjoining records.

--
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-02-01 Thread Robert Haas

On Tue, Jan 31, 2023 at 6:20 PM Peter Geoghegan  wrote:
> Actually the really wide output comes from COMMIT records. After I run
> the regression tests, and execute some of my own custom pg_walinspect
> queries, I see that some individual COMMIT records have a
> length(description) of over 10,000 bytes/characters. There is even one
> particular COMMIT record whose length(description) is about 46,000
> bytes/characters. So *ludicrously* verbose GetRmgr() strings are not
> uncommon today. The worst case (or even particularly bad cases) won't
> be made any worse by this patch, because there are obviously limits on
> the width of the arrays that it outputs details descriptions of, that
> don't apply to these COMMIT records.

If we're dumping a lot of details out of each WAL record, we might
want to switch to a multi-line format of some kind. No one enjoys a
460-character wide line, let alone 46000.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Show various offset arrays for heap WAL records

2023-01-31 Thread Peter Geoghegan

On Tue, Jan 31, 2023 at 1:52 PM Peter Geoghegan  wrote:
> Obviously what you're doing here will lead to a significant increase
> in the verbosity of the output for affected WAL records. I don't feel
> too bad about that, though. It's really an existing problem, and one
> that should be fixed either way. You kind of have to deal with this
> already, by having a good psql pager, since record types such as
> COMMIT_PREPARED, INVALIDATIONS, and RUNNING_XACTS are already very
> verbose in roughly the same way. You only need to have one of these
> record types output by a function like pg_get_wal_records_info() to
> get absurdly wide output -- it hardly matters that most individual WAL
> record types have terse output at that point.

Actually the really wide output comes from COMMIT records. After I run
the regression tests, and execute some of my own custom pg_walinspect
queries, I see that some individual COMMIT records have a
length(description) of over 10,000 bytes/characters. There is even one
particular COMMIT record whose length(description) is about 46,000
bytes/characters. So *ludicrously* verbose GetRmgr() strings are not
uncommon today. The worst case (or even particularly bad cases) won't
be made any worse by this patch, because there are obviously limits on
the width of the arrays that it outputs details descriptions of, that
don't apply to these COMMIT records.

-- 
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-01-31 Thread Peter Geoghegan

On Tue, Jan 31, 2023 at 1:52 PM Peter Geoghegan  wrote:
> > I would also like to see functions like XLogRecGetBlockRefInfo() pass
> > something more useful than a stringinfo buffer so that we could easily
> > extract out the relfilenode in pgwalinspect.
>
> That does seem particularly important. It's a pain to do this from
> SQL. In general I'm okay with focussing on pg_walinspect over
> pg_waldump, since it'll become more important over time. Obviously
> pg_waldump needs to still work, but I think it's okay to care less
> about pg_waldump usability.

I just realized why you mentioned XLogRecGetBlockRefInfo() -- it
probably shouldn't even be used by pg_walinspect at all (just by
pg_waldump). Using something like XLogRecGetBlockRefInfo() within
pg_walinspect misses out on the opportunity to output information in a
more descriptive tuple format, with real data types. It's not just the
relfilenode, either -- it's the block numbers themselves. And the fork
number.

In other words, I suspect that this is out of scope for this patch,
strictly speaking. We simply shouldn't be using
XLogRecGetBlockRefInfo() in pg_walinspect in the first place. Rather,
pg_walinspect should be calling some other function that ultimately
allows the user to work with (say) an array of int8 from SQL for the
block numbers. There is no great reason not to, AFAICT, since this
information is completely generic -- it's not like the rmgr-specific
output from GetRmgr(), where fine grained type information is just a
nice-to-have, with usability issues of its own (on account of the
details being record type specific).

I've been managing this problem within my own custom pg_walinspect
queries by using my own custom ICU collation. I use ICU's natural sort
order to order based on block_ref, or based on a substring()
expression that extracts something interesting from block_ref, such as
relfilenode. You can create a custom collation for this like so, per
the docs:

CREATE COLLATION IF NOT EXISTS numeric (provider = icu, locale =
'en-u-kn-true');

Obviously this hack of mine works, but hardly anybody else would be
willing to take the time to figure something like this out. Plus it's
error prone when it doesn't really have to be. And it suggests that
the block_ref field isn't record type generic -- that's sort of
misleading IMV.

-- 
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-01-31 Thread Peter Geoghegan

On Fri, Jan 27, 2023 at 9:24 AM Melanie Plageman
 wrote:
> I have taken a stab at doing some of the tasks listed in this email.

Cool.

> I have made the new files rmgr_utils.c/h.
>
> I have come up with a standard format that I like for the output and
> used it in all the heap record types.
>
> Examples below:

That seems like a reasonable approach.

> I started documenting it in the rmgr_utils.h header file in a comment,
> however it may be worth a README?
>
> I haven't polished this description of the format (or added examples,
> etc) or used it in the btree-related functions because I assume the
> format and helper function API will need more discussion.

I think that standardization is good, but ISTM that we need clarity on
what the scope is -- what is *not* being standardized? It may or may
not be useful to call the end result an API. Or it may not make sense
to do so in the first committed version, even though we may ultimately
end up as something that deserves to be called an API. The obligation
to not break tools that are scraping the output in whatever way seems
kind of onerous right now -- just not having any gratuitous
inconsistencies (e.g., fixing totally inconsistent punctuation, making
the names for fields across WAL records consistent when they serve
exactly the same purpose) would be a big improvement.

As I mentioned in passing already, I actually don't think that the
B-Tree WAL records are all that special, as far as this stuff goes.
For example, the DELETE Btree record type is very similar to heapam's
PRUNE record type, and practically identical to Btree's VACUUM record
type. All of these record types use the same basic conventions, like a
snapshotConflictHorizon field for recovery conflicts (which is
generated in a very similar way during original execution, and
processed in precisely the same way during REDO), and arrays of page
offset numbers sorted in ascending order.

There are some remaining details where things from an index AM WAL
record aren't directly analogous (or pretty much identical) to some
other heapam WAL records, such as the way that the DELETE Btree record
type deals with deleting a subset of TIDs from a posting list index
tuple (generated by B-Tree deduplication). But even these exceptions
don't require all that much discussion. You could either choose to
only display the array of deleted index tuple page offset numbers, as
well as the similar array of "updated" index tuple page offset numbers
from xl_btree_delete, in which case you just display two arrays of
page offset numbers, in the same standard way. You may or may not want
to also show each individual xl_btree_update entry -- doing so would
be kinda like showing the details of individual freeze plans, except
that you'd probably display something very similar to the page offset
number display here too (even though these aren't page offset numbers,
they're 0-based offsets into the posting list's item pointer data
array).

BTW, there is also a tendency for non-btree index AM WAL records to be
fairly similar or even near-identical to the B-Tree WAL records. While
Hash indexes are very different to B-Tree indexes at a high level, it
is nevertheless the case that xl_hash_vacuum_one_page is directly
based on xl_btree_delete/xl_btree_vacuum, and that xl_hash_insert is
directly based on xl_btree_insert. There are some other WAL record
types that are completely different across hash and B-Tree, which is a
reflection of the fact that the index grows using a totally different
approach in each AM -- but that doesn't seem like something that
throws up any roadblocks for you (these can all be displayed as simple
structs anyway).

Speaking with my B-Tree hat on, I'd just be happy to be able to see
both of the page offset number arrays (the deleted + updated offset
number arrays from xl_btree_delete/xl_btree_vacuum), without also
being able to\ see output for each individual xl_btree_update
item-pointer-array-offset arrays -- just seeing that much is already a
huge improvement. That's why I'm a bit hesitant to use the term API
just yet, because an obligation to be consistent in whatever way seems
like it might block incremental progress.

> Perhaps there should also be example output of the offset arrays in
> pgwalinspect docs?

That would definitely make sense.

> I've changed the array format helper functions that Andres added to be a
> single function with an additional layer of indirection so that any
> record with an array can use it regardless of type and format of the
> individual elements. The signature is based somewhat off of qsort_r()
> and allows the user to pass a function with the the desired format of
> the elements.

That's handy.

> Personally, I like having the infomasks for the freeze plans. If we
> someday have a more structured input to rmgr_desc, we could then easily
> have them in their own column and use functions like
> heap_tuple_infomask_flags() on them.

I agree, in general, though long term the

Re: Show various offset arrays for heap WAL records

2023-01-27 Thread Robert Haas

On Fri, Jan 27, 2023 at 12:24 PM Melanie Plageman
 wrote:
> I believe I have addressed this in the attached patch.

I'm not sure what's best in terms of formatting details but I
definitely like the idea of making pg_waldump show more details. I'd
even like to have a way to extract the tuple data, when it's
operations on tuples and we have those tuples in the payload. That'd
be a lot more verbose than what you are doing here, though, and I'm
not saying you should go do it right now or anything like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Show various offset arrays for heap WAL records

2023-01-27 Thread Melanie Plageman

Hi,

I have taken a stab at doing some of the tasks listed in this email.

I have made the new files rmgr_utils.c/h.

I have come up with a standard format that I like for the output and
used it in all the heap record types.

Examples below:

snapshotConflictHorizon: 2184, nplans: 2, plans [ { xmax: 0, infomask:
2816, infomask2: 2, ntuples: 5, offsets: [ 10, 11, 12, 18, 71 ] }, {
xmax: 0, infomask: 11008, infomask2: 2, ntuples: 2, offsets: [ 72, 73
] } ]

snapshotConflictHorizon: 2199, nredirected: 4, ndead: 0, nunused: 4,
redirected: [ 1->38, 2->39, 3->40, 4->41 ], dead: [], unused: [ 24,
25, 26, 27, 37 ]

I started documenting it in the rmgr_utils.h header file in a comment,
however it may be worth a README?

I haven't polished this description of the format (or added examples,
etc) or used it in the btree-related functions because I assume the
format and helper function API will need more discussion.

This is still a rough draft, as I anticipate changes will be requested.
I would split it into multiple patches, etc. But I am looking for
feedback on the suggested format and the array formatting helper
function API.

Perhaps there should also be example output of the offset arrays in
pgwalinspect docs?

I've changed the array format helper functions that Andres added to be a
single function with an additional layer of indirection so that any
record with an array can use it regardless of type and format of the
individual elements. The signature is based somewhat off of qsort_r()
and allows the user to pass a function with the the desired format of
the elements.

On a semi-unrelated note, I think it might be nice to have a comment in
heapam_xlog.h about what the infobits fields actually are and why they
exist -- e.g. we only need a subset of infomask[2] bits in these
records.
I put a random comment in the code where I think it should go.
I will delete it later, of course.

On Mon, Jan 9, 2023 at 11:00 PM Peter Geoghegan  wrote:
>
> On Mon, Jan 9, 2023 at 1:58 PM Andres Freund  wrote:
> > The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
> > XLOG_HEAP2_FREEZE_PAGE.
>
> I'm bound to end up doing the same in index access methods. Might make
> sense for the utility routines to live somewhere more centralized, at
> least when code reuse is likely. Practically every index AM has WAL
> records that include a sorted page offset number array, just like
> these ones. It's a very standard thing, obviously.

I plan to add these if the format and API I suggested seems like the
right direction.

On Tue, Jan 10, 2023 at 2:35 PM Andres Freund  wrote:
>
> > > I chose to include infomask[2] for the different freeze plans mainly 
> > > because
> > > it looks odd to see different plans without a visible reason. But I'm not 
> > > sure
> > > that's the right choice.
> >
> > I don't think that it is particularly necessary to do so in order for
> > the output to make sense -- pg_waldump is inherently a tool for
> > experts. What it comes down to for me is whether or not this
> > information is sufficiently useful to display, and/or can be (or needs
> > to be) controlled via some kind of verbosity knob.
>
> It seemed useful enough to me, but I likely also stare more at this stuff than
> most. Compared to the list of offsets it's not that much content.
>

Personally, I like having the infomasks for the freeze plans. If we
someday have a more structured input to rmgr_desc, we could then easily
have them in their own column and use functions like
heap_tuple_infomask_flags() on them.

> > How hard would it be to invent a general mechanism to control the verbosity
> > of what we'll show for each WAL record?
>
> Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
> void(*rm_desc) (StringInfo buf, XLogReaderState *record);
>
> so we'd need to patch all of them. That might be worth doing at some point,
> but I don't want to tackle it right now.

In terms of a more structured format, it seems like it would make the
most sense to pass a JSON or composite datatype structure to rm_desc
instead of that StringInfo.

I would also like to see functions like XLogRecGetBlockRefInfo() pass
something more useful than a stringinfo buffer so that we could easily
extract out the relfilenode in pgwalinspect.

On Mon, Jan 16, 2023 at 10:09 PM Peter Geoghegan  wrote:
>
> On Wed, Jan 11, 2023 at 3:11 PM Peter Geoghegan  wrote:
> > On Wed, Jan 11, 2023 at 3:00 PM Andres Freund  wrote:
> > > What are your thoughts about the place for the helper functions? You're ok
> > > with rmgrdesc_utils.[ch]?
> >
> > Yeah, that seems okay.
>
> BTW, while playing around with this patch today, I noticed that it
> won't display the number of elements in each offset array directly.
> Perhaps it's worth including that, too?

I believe I have addressed this in the attached patch.

- Melanie

v1-0001-Add-rmgr_desc-utilities.patch
Description: Binary data

Re: Show various offset arrays for heap WAL records

2023-01-16 Thread Peter Geoghegan

On Wed, Jan 11, 2023 at 3:11 PM Peter Geoghegan  wrote:
> On Wed, Jan 11, 2023 at 3:00 PM Andres Freund  wrote:
> > What are your thoughts about the place for the helper functions? You're ok
> > with rmgrdesc_utils.[ch]?
>
> Yeah, that seems okay.

BTW, while playing around with this patch today, I noticed that it
won't display the number of elements in each offset array directly.
Perhaps it's worth including that, too?

-- 
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-01-11 Thread Peter Geoghegan

On Wed, Jan 11, 2023 at 3:00 PM Andres Freund  wrote:
> What are your thoughts about the place for the helper functions? You're ok
> with rmgrdesc_utils.[ch]?

Yeah, that seems okay.

We may well need to put more stuff in that file. We're overdue a big
overhaul of the rmgr output, so that everybody uses the same format
for everything. We made some progress on that for 16 already, by
standardizing on the name snapshotConflictHorizon, but a lot of
annoying inconsistencies still remain. Like the punctuation issue you
mentioned.

Ideally we'd be able to make the output more easy to manipulate via
the SQL interface from pg_walinspect, or perhaps via scripting. That
would require some rules that are imposed top-down, so that consumers
of the data can make certain general assumptions. But that's fairly
natural. It's not like there is just inherently a great deal of
diversity that we need to be considered. For example, the WAL records
used by each individual index access method are all very similar. In
fact the most important index AM WAL records used by each index AM
(e.g. insert, delete, vacuum) have virtually the same format as each
other already.

-- 
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-01-11 Thread Andres Freund

Hi,

On 2023-01-11 14:53:54 -0800, Peter Geoghegan wrote:
> On Tue, Jan 10, 2023 at 11:35 AM Andres Freund  wrote:
> > Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
> > void(*rm_desc) (StringInfo buf, XLogReaderState 
> > *record);
> >
> > so we'd need to patch all of them. That might be worth doing at some point,
> > but I don't want to tackle it right now.
> 
> Okay. Let's just get the basics in soon, then.

> I would like to have a similar capability for index access methods,
> but mostly just for investigating performance. Whenever we've really
> needed something like this for debugging it seems to have been a
> heapam thing, just because there's a lot more that can go wrong with
> pruning, which is spread across many different places.

What are your thoughts about the place for the helper functions? You're ok
with rmgrdesc_utils.[ch]?

Greetings,

Andres Freund

Re: Show various offset arrays for heap WAL records

2023-01-11 Thread Peter Geoghegan

On Tue, Jan 10, 2023 at 11:35 AM Andres Freund  wrote:
> Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
> void(*rm_desc) (StringInfo buf, XLogReaderState *record);
>
> so we'd need to patch all of them. That might be worth doing at some point,
> but I don't want to tackle it right now.

Okay. Let's just get the basics in soon, then.

I would like to have a similar capability for index access methods,
but mostly just for investigating performance. Whenever we've really
needed something like this for debugging it seems to have been a
heapam thing, just because there's a lot more that can go wrong with
pruning, which is spread across many different places.

-- 
Peter Geoghegan

Re: Show various offset arrays for heap WAL records

2023-01-10 Thread Andres Freund

Hi,

On 2023-01-09 19:59:42 -0800, Peter Geoghegan wrote:
> On Mon, Jan 9, 2023 at 1:58 PM Andres Freund  wrote:
> > A couple times when investigating data corruption issues, the last time just
> > yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM
> > records. As that's probably not just me, I think we should make that change
> > in-tree.
> 
> I remember how useful this was when we were investigating that early
> bug in 14, that turned out to be in parallel VACUUM. So I'm all in
> favor of it.

Cool.


> > The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
> > XLOG_HEAP2_FREEZE_PAGE.
> 
> I'm bound to end up doing the same in index access methods. Might make
> sense for the utility routines to live somewhere more centralized, at
> least when code reuse is likely. Practically every index AM has WAL
> records that include a sorted page offset number array, just like
> these ones. It's a very standard thing, obviously.

Hm, there doesn't seem to be a great location for them today. I guess we could
add something like src/include/access/rmgrdesc_utils.h? And put the
implementation in src/backend/access/rmgrdesc/rmgrdesc_utils.c? I first was
thinking of just rmgrdesc.[ch], but custom rmgrs added
src/bin/pg_waldump/rmgrdesc.[ch] ...


> > I chose to include infomask[2] for the different freeze plans mainly because
> > it looks odd to see different plans without a visible reason. But I'm not 
> > sure
> > that's the right choice.
> 
> I don't think that it is particularly necessary to do so in order for
> the output to make sense -- pg_waldump is inherently a tool for
> experts. What it comes down to for me is whether or not this
> information is sufficiently useful to display, and/or can be (or needs
> to be) controlled via some kind of verbosity knob.

It seemed useful enough to me, but I likely also stare more at this stuff than
most. Compared to the list of offsets it's not that much content.


> How hard would it be to invent a general mechanism to control the verbosity
> of what we'll show for each WAL record?

Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
void(*rm_desc) (StringInfo buf, XLogReaderState *record);

so we'd need to patch all of them. That might be worth doing at some point,
but I don't want to tackle it right now.

Greetings,

Andres Freund

Re: Show various offset arrays for heap WAL records

2023-01-09 Thread Peter Geoghegan

On Mon, Jan 9, 2023 at 1:58 PM Andres Freund  wrote:
> A couple times when investigating data corruption issues, the last time just
> yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM
> records. As that's probably not just me, I think we should make that change
> in-tree.

I remember how useful this was when we were investigating that early
bug in 14, that turned out to be in parallel VACUUM. So I'm all in
favor of it.

> The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
> XLOG_HEAP2_FREEZE_PAGE.

I'm bound to end up doing the same in index access methods. Might make
sense for the utility routines to live somewhere more centralized, at
least when code reuse is likely. Practically every index AM has WAL
records that include a sorted page offset number array, just like
these ones. It's a very standard thing, obviously.

> I chose to include infomask[2] for the different freeze plans mainly because
> it looks odd to see different plans without a visible reason. But I'm not sure
> that's the right choice.

I don't think that it is particularly necessary to do so in order for
the output to make sense -- pg_waldump is inherently a tool for
experts. What it comes down to for me is whether or not this
information is sufficiently useful to display, and/or can be (or needs
to be) controlled via some kind of verbosity knob.

I think that it easily could be useful, and I also think that it
easily could be a bit annoying. How hard would it be to invent a
general mechanism to control the verbosity of what we'll show for each
WAL record?

-- 
Peter Geoghegan

45 matches

Mail list logo