Re: logical decoding and replication of sequences, take 2

2023-07-28 Thread Amit Kapila
On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra
 wrote:
>
> On 7/28/23 11:42, Amit Kapila wrote:
> > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> >  wrote:
> >>
> >> On 7/26/23 09:27, Amit Kapila wrote:
> >>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila  
> >>> wrote:
> >>
> >> Anyway, I was thinking about this a bit more, and it seems it's not as
> >> difficult to use the page LSN to ensure sequences don't go backwards.
> >>
> >
> > While studying the changes for this proposal and related areas, I have
> > a few comments:
> > 1. I think you need to advance the origin if it is changed due to
> > copy_sequence(), otherwise, if the sync worker restarts after
> > SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
> > value.
> >
>
> True, we want to restart at the new origin_startpos.
>
> > 2. Between the time of SYNCDONE and READY state, the patch can skip
> > applying non-transactional sequence changes even if it should apply
> > it. The reason is that during that state change
> > should_apply_changes_for_rel() decides whether to apply change based
> > on the value of remote_final_lsn which won't be set for
> > non-transactional change. I think we need to send the start LSN of a
> > non-transactional record and then use that as remote_final_lsn for
> > such a change.
>
> Good catch. remote_final_lsn is set in apply_handle_begin, but that
> won't happen for sequences. We're already sending the LSN, but
> logicalrep_read_sequence ignores it - it should be enough to add it to
> LogicalRepSequence and then set it in apply_handle_sequence().
>

As per my understanding, the LSN sent is the record's EndRecPtr, which
points to the beginning of the next record (i.e. current_record_end + 1).
For comparing against the current record, we use the record's start
position, as we do when remote_final_lsn is set via apply_handle_begin().

> >
> > 3. For non-transactional sequence change apply, we don't set
> > replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
> > we are doing in apply_handle_commit_internal() before calling
> > CommitTransactionCommand(). So, that can lead to the origin moving
> > backwards after restart which will lead to requesting and applying the
> > same changes again and for that period of time sequence can go
> > backwards. This needs some more thought as to what is the correct
> > behaviour/solution for this.
> >
>
> I think saying "origin moves backwards" is a bit misleading. AFAICS the
> origin position is not actually moving backwards, it's more that we
> don't (and can't) move it forwards for each non-transactional change. So
> yeah, we may re-apply those, and IMHO that's expected - the sequence is
> allowed to be "ahead" on the subscriber.
>

But if this happens, then for a period of time the sequence will go
backwards relative to what one would have observed before the restart.

-- 
With Regards,
Amit Kapila.




Re: Row pattern recognition

2023-07-28 Thread Tatsuo Ishii
>>> - PATTERN variables do not have to exist in the DEFINE clause.  They are
>>> - considered TRUE if not present.
>> Do you think we really need this? I found a criticism regarding this.
>> https://link.springer.com/article/10.1007/s13222-022-00404-3
>> "3.2 Explicit Definition of All Row Pattern Variables"
>> What do you think?
> 
> I think that a large part of obeying the standard is to allow queries
> from other engines to run the same on ours.  The standard does not
> require the pattern variables to be defined and so there are certainly
> queries out there without them, and that hurts migrating to
> PostgreSQL.

Yeah, migration is a good point. I agree we should have the feature.

>>> When we get to adding count in the MEASURES clause, there will be a
>>> difference between no match and empty match, but that does not apply
>>> here.
>> Can you elaborate more? I understand that "no match" and "empty match"
>> are different things. But I do not understand how the difference
>> affects the result of count.
> 
> This query:
> 
> SELECT v.a, wcnt OVER w, count(*) OVER w
> FROM (VALUES ('A')) AS v (a)
> WINDOW w AS (
>   ORDER BY v.a
>   MEASURES count(*) AS wcnt
>   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>   PATTERN (B)
>   DEFINE B AS B.a = 'B'
> )
> 
> produces this result:
> 
>  a | wcnt | count
> ---+--+---
>  A |  | 0
> (1 row)
> 
> Inside the window specification, *no match* was found and so all of
> the MEASURES are null.  The count(*) in the target list however, still
> exists and operates over zero rows.
> 
> This very similar query:
> 
> SELECT v.a, wcnt OVER w, count(*) OVER w
> FROM (VALUES ('A')) AS v (a)
> WINDOW w AS (
>   ORDER BY v.a
>   MEASURES count(*) AS wcnt
>   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>   PATTERN (B?)
>   DEFINE B AS B.a = 'B'
> )
> 
> produces this result:
> 
>  a | wcnt | count
> ---+--+---
>  A |0 | 0
> (1 row)
> 
> In this case, the pattern is B? instead of just B, which produces an
> *empty match* for the MEASURES to be applied over.

Thank you for the detailed explanation. I think I understand now.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Avoid possible memory leak (src/common/rmtree.c)

2023-07-28 Thread Michael Paquier
On Tue, Jul 25, 2023 at 04:45:22PM +0200, Daniel Gustafsson wrote:
> Skimming the tree, there don't seem to be any callers which aren't exiting or
> ereporting on failure, so the real-world impact seems low.  That being said,
> silencing static analyzers could be reason enough to delay allocation.

A different reason would be out-of-core code that uses rmtree() in a
memory context where the leak would be an issue when facing failures
repeatedly?  Delaying the allocation until after the OPENDIR() seems like
good practice anyway.
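
For illustration, a minimal sketch of that ordering (generic C, not the
actual rmtree() code): open the directory first, and allocate scratch
memory only once that has succeeded, so the early error path has nothing
to leak.

#include <dirent.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static bool
list_dir_entries(const char *path)
{
    DIR        *dir;
    struct dirent *de;
    char      **names;
    int         n = 0;

    dir = opendir(path);
    if (dir == NULL)
        return false;           /* nothing allocated yet, nothing leaked */

    /* only now is it safe to allocate */
    names = malloc(64 * sizeof(char *));
    if (names == NULL)
    {
        closedir(dir);
        return false;
    }

    while ((de = readdir(dir)) != NULL && n < 64)
        names[n++] = strdup(de->d_name);

    /* ... process the collected names ... */

    while (n > 0)
        free(names[--n]);
    free(names);
    closedir(dir);
    return true;
}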
--
Michael


signature.asc
Description: PGP signature


Re: Support worker_spi to execute the function dynamically.

2023-07-28 Thread Michael Paquier
On Fri, Jul 28, 2023 at 12:06:33PM +0200, Alvaro Herrera wrote:
> Hmm, I think having all the workers doing their in the same table is
> better -- if nothing else, because it gives us the opportunity to show
> how to use some other coding technique (but also because we are forced
> to write the SQL code in a way that's correct for potentially multiple
> concurrent workers, which sounds useful to demonstrate).  Can't we
> instead solve the race condition by having some shared resource that
> blocks the other workers from proceeding until the schema has been
> created?  Perhaps an LWLock, or a condition variable, or an advisory
> lock.

That's an interesting idea that you have here.  So basically, you
would have all the workers use the same schema to do their counting work
for the same base table?  Or should each worker use the same schema,
perhaps defined by a GUC, but different tables?  One thing that has
been itching me a bit with this module was to be able to pass down to
the main worker routine more arguments than just an int ID, but I
could not bring myself to do that just for the wait event patch, like:
- The database to connect to.
- The table to create.
- The schema to use.
If any of these are NULL, just use the current defaults, with
perhaps the bgworker PID as the ID instead of a user-specified one.

Having a shared memory state is a second thing I was planning to add,
and that can be useful as a point of reference in a template.  The other
patch about custom wait events introduces that, FWIW, to track the
custom wait events added.
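
For illustration, a rough sketch of what Alvaro's advisory-lock idea could
look like in the worker initialization (the schema/table names, the lock
key, and the helper name are invented here, not taken from any patch, and
error checking is omitted):

#include "postgres.h"

#include "access/xact.h"
#include "executor/spi.h"
#include "utils/snapmgr.h"

/*
 * Hypothetical helper: serialize object creation across concurrently
 * started workers, so the first one creates the schema/table and the
 * others simply wait on the advisory lock.  The lock key is arbitrary.
 */
static void
worker_spi_create_schema_serialized(void)
{
    StartTransactionCommand();
    SPI_connect();
    PushActiveSnapshot(GetTransactionSnapshot());

    /* every worker blocks here until the objects definitely exist */
    SPI_execute("SELECT pg_advisory_xact_lock(20230728)", false, 0);
    SPI_execute("CREATE SCHEMA IF NOT EXISTS worker_spi", false, 0);
    SPI_execute("CREATE TABLE IF NOT EXISTS worker_spi.counted "
                "(type text, value int)", false, 0);

    SPI_finish();
    PopActiveSnapshot();
    CommitTransactionCommand();     /* releases the xact-level advisory lock */
}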
--
Michael


signature.asc
Description: PGP signature


Re: Support worker_spi to execute the function dynamically.

2023-07-28 Thread Michael Paquier
On Fri, Jul 28, 2023 at 01:34:15PM -0700, Andres Freund wrote:
> On 2023-07-28 13:45:29 +0900, Michael Paquier wrote:
>> Having each bgworker on its own schema would be enough to prevent
>> conflicts, but I'd like to add a second thing: a check on
>> pg_stat_activity.wait_event after starting the workers.  I have added
>> something like that in the patch I have posted today for the custom
>> wait events at [1] and it enforces the startup sequences of the
>> workers in a stricter way.
> 
> Is that very meaningful? ISTM the interesting thing to check for would be that
> the state is idle?

That's interesting for the sake of the other patch, to check that the
custom events are reported.  Anyway, I am a bit short on time, so I
have applied the simplest fix, where the dynamic workers just use a
different base ID to get out of your way.
--
Michael


signature.asc
Description: PGP signature


Re: Eager page freeze criteria clarification

2023-07-28 Thread Peter Geoghegan
On Fri, Jul 28, 2023 at 5:03 PM Melanie Plageman
 wrote:
> When you were working on this, what were the downsides of only having the
> criteria that 1) page would be all_visible/all_frozen and 2) we did prune
> something (i.e. no consideration of FPIs)?

Conditioning freezing on "all_visible && all_frozen" seems likely to
always be a good idea when it comes to any sort of speculative trigger
criteria. Most of the wins in this area will come from avoiding future
FPIs, and you can't hope to do that unless you freeze everything on
the page. The cost of freezing when "all_visible && all_frozen" does
not hold may be quite low, but the benefits will also be very low.
Also, the way that recovery conflicts are generated for freeze records
somewhat depends on this right now -- though you could probably fix
that if you had to.

Note that we *don't* really limit ourselves to cases where an FPI was
generated from pruning, exactly -- even on 16. With page-level
checksums we can also generate an FPI just to set a hint bit. That
triggers freezing too, even on insert-only tables (assuming checksums
are enabled). I expect that that will be an important condition for
triggering freezing in practice.

Your question about 2 seems equivalent to "why not just always
freeze?". I don't think that that's a bad question -- quite the
opposite. Even trying to give an answer to this question would amount
to getting involved in new work on VACUUM, though. So I don't think I
can help you here.

-- 
Peter Geoghegan




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2023-07-28 Thread Cary Huang
 Hello

I had a look at the patch and tested it on the CI bot; it compiles and tests fine
with both autoconf and meson. I noticed that the v2 patch contains the v1 patch file
as well. Not sure if that was intended, or if it was put there by accident.

> I don't think "invalidating" is the right terminology. Note that we already
 > have InvalidateBuffer() - but it's something we can't allow users to do, as 
 > it
 > throws away dirty buffer contents (it's used for things like dropping a
 > table).
 >
 > How about using "discarding" for this functionality?

I think "invalidating" is the right terminology here, it is exactly what the 
feature is doing, it tries to invalidate a buffer ID by calling 
InvalidateBuffer() routine inside buffer manager and calls FlushBuffer() before 
invalidating if marked dirty. 

The problem here is that InvalidateBuffer() could be dangerous, because it
allows a user to invalidate a buffer that may hold data from tables not owned
by the current user.

I think it all comes down to the purpose of this feature. Based on the
description in this email thread, I feel like this feature should be
categorized as a developer-only feature, to be used by PG developers to
experiment and observe some development work by invalidating one or more
specific buffers. If this is the case, it may be helpful to add a GUC under
"DEVELOPER_OPTIONS" that allows TryInvalidateBuffer() to run, or makes it
return an error if the user does not have this developer option enabled.
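
To make that concrete, a rough sketch of what such a gate could look like
(the GUC name and TryInvalidateBuffer() are assumptions taken from this
discussion, not an existing API; the GUC would be registered with
DefineCustomBoolVariable() in _PG_init(), not shown):

#include "postgres.h"

#include "fmgr.h"
#include "storage/bufmgr.h"

/* hypothetical developer-only GUC */
static bool pg_buffercache_allow_invalidate = false;

PG_FUNCTION_INFO_V1(pg_buffercache_invalidate);

Datum
pg_buffercache_invalidate(PG_FUNCTION_ARGS)
{
    Buffer      bufnum = PG_GETARG_INT32(0);

    if (!pg_buffercache_allow_invalidate)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("buffer invalidation is disabled"),
                 errhint("Enable the developer option "
                         "pg_buffercache.allow_invalidate to use this function.")));

    /* TryInvalidateBuffer() is the routine proposed in this thread */
    PG_RETURN_BOOL(TryInvalidateBuffer(bufnum));
}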

If the purpose of this feature is for general users, then it would make sense 
to have something like pg_unwarm (exactly opposite of pg_prewarm) that takes 
table name (instead of buffer ID) and drop all buffers associated with that 
table name. There will be permission checks as well so a user cannot pg_unwarm 
a table owned by someone else. User in this case won't be able to invalidate a 
particular buffer, but he/she should not have to as a regular user anyway.

thanks!

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca





Re: Eager page freeze criteria clarification

2023-07-28 Thread Melanie Plageman
On Fri, Jul 28, 2023 at 4:49 PM Peter Geoghegan  wrote:
>
> On Fri, Jul 28, 2023 at 4:30 PM Andres Freund  wrote:
> > > Put differently, I can't see any reason to care whether pruning
> > > emitted an FPI or not. Either way, it's very unlikely that freezing
> > > needs to do so.
> >
> > +many
> >
> > Using FPIs having been generated during pruning as part of the logic to 
> > decide
> > whether to freeze makes no sense to me. We likely need some heuristic here,
> > but I suspect it has to look quite different than the FPI condition.
>
> I think that it depends on how you frame it. It makes zero sense that
> that's the only thing that can do something like this at all. It makes
> some sense as an incremental improvement, since there is no way that
> we can lose by too much, even if you make arbitrarily pessimistic
> assumptions. In other words, you won't be able to come up with an
> adversarial benchmark where this loses by too much.

When you were working on this, what were the downsides of only having the
criteria that 1) page would be all_visible/all_frozen and 2) we did prune
something (i.e. no consideration of FPIs)?

> As I said, I'm no longer interested in working on VACUUM (though I do
> hope that Melanie or somebody else picks up where I left off). I have
> nothing to say about any new work in this area. If you want me to do
> something in the scope of the work on 16, as a release management
> task, please be clear about what that is.

I'm only talking about this in the context of future work for 17.

- Melanie




Re: Eager page freeze criteria clarification

2023-07-28 Thread Robert Haas
On Fri, Jul 28, 2023 at 4:30 PM Andres Freund  wrote:
> Using FPIs having been generated during pruning as part of the logic to decide
> whether to freeze makes no sense to me. We likely need some heuristic here,
> but I suspect it has to look quite different than the FPI condition.

Why do we need a heuristic at all?

What we're talking about here, AIUI, is under what circumstance we
should eagerly freeze a page that can be fully frozen, leaving no
tuples that vacuum ever needs to worry about again except if the page
is further modified. Freezing in such circumstances seems highly
likely to work out in our favor. The only argument against it is that
the page might soon be modified again, in which case the effort of
freezing would be mostly wasted. But I'm not sure whether that's
really a valid rationale, because the win from not having to vacuum
the page again if it isn't modified seems pretty big, and emitting a
freeze record for a page we've just pruned doesn't seem that
expensive. If it is a valid rationale, my first thought would be to
make the heuristic based on vacuum_freeze_min_age. XID age isn't a
particularly great proxy for whether the page is still actively being
modified, but at least it has tradition going for it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Eager page freeze criteria clarification

2023-07-28 Thread Peter Geoghegan
On Fri, Jul 28, 2023 at 4:30 PM Andres Freund  wrote:
> > Put differently, I can't see any reason to care whether pruning
> > emitted an FPI or not. Either way, it's very unlikely that freezing
> > needs to do so.
>
> +many
>
> Using FPIs having been generated during pruning as part of the logic to decide
> whether to freeze makes no sense to me. We likely need some heuristic here,
> but I suspect it has to look quite different than the FPI condition.

I think that it depends on how you frame it. It makes zero sense that
that's the only thing that can do something like this at all. It makes
some sense as an incremental improvement, since there is no way that
we can lose by too much, even if you make arbitrarily pessimistic
assumptions. In other words, you won't be able to come up with an
adversarial benchmark where this loses by too much.

As I said, I'm no longer interested in working on VACUUM (though I do
hope that Melanie or somebody else picks up where I left off). I have
nothing to say about any new work in this area. If you want me to do
something in the scope of the work on 16, as a release management
task, please be clear about what that is.

--
Peter Geoghegan




Re: Support worker_spi to execute the function dynamically.

2023-07-28 Thread Andres Freund
Hi,

On 2023-07-28 13:45:29 +0900, Michael Paquier wrote:
> Having each bgworker on its own schema would be enough to prevent
> conflicts, but I'd like to add a second thing: a check on
> pg_stat_activity.wait_event after starting the workers.  I have added
> something like that in the patch I have posted today for the custom
> wait events at [1] and it enforces the startup sequences of the
> workers in a stricter way.

Is that very meaningful? ISTM the interesting thing to check for would be that
the state is idle?

Greetings,

Andres Freund




Re: Eager page freeze criteria clarification

2023-07-28 Thread Andres Freund
Hi,

On 2023-07-28 16:19:47 -0400, Robert Haas wrote:
> On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan  wrote:
> > > Or are we trying to determine how likely
> > > the freeze record is to emit an FPI and avoid eager freezing when it
> > > isn't worth the cost?
> >
> > No, that's not something that we're doing right now (we just talked
> > about doing something like that). In 16 VACUUM just "makes the best
> > out of a bad situation" when an FPI was already required during
> > pruning. We have already "paid for the privilege" of writing some WAL
> > for the page at that point, so it's reasonable to not squander a
> > window of opportunity to avoid future FPIs in future VACUUM
> > operations, by freezing early.
> 
> This doesn't make sense to me. It's true that if the pruning record
> emitted an FPI, then the freezing record probably won't need to do so
> either, unless by considerable ill-fortune the redo pointer was moved
> in the tiny window between pruning and freezing. But isn't that also
> true if the pruning record *didn't* emit an FPI? In that case,
> the pruning record wasn't the first WAL-logged modification to the
> page during the then-current checkpoint cycle, and some earlier
> modification already did it. But in that case also the freezing won't
> need to emit a new FPI, except for the same identical case where the
> redo pointer moves at just the wrong time.
> 
> Put differently, I can't see any reason to care whether pruning
> emitted an FPI or not. Either way, it's very unlikely that freezing
> needs to do so.

+many

Using FPIs having been generated during pruning as part of the logic to decide
whether to freeze makes no sense to me. We likely need some heuristic here,
but I suspect it has to look quite different than the FPI condition.

Greetings,

Andres




Re: Eager page freeze criteria clarification

2023-07-28 Thread Robert Haas
On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan  wrote:
> > Or are we trying to determine how likely
> > the freeze record is to emit an FPI and avoid eager freezing when it
> > isn't worth the cost?
>
> No, that's not something that we're doing right now (we just talked
> about doing something like that). In 16 VACUUM just "makes the best
> out of a bad situation" when an FPI was already required during
> pruning. We have already "paid for the privilege" of writing some WAL
> for the page at that point, so it's reasonable to not squander a
> window of opportunity to avoid future FPIs in future VACUUM
> operations, by freezing early.

This doesn't make sense to me. It's true that if the pruning record
emitted an FPI, then the freezing record probably won't need to do so
either, unless by considerable ill-fortune the redo pointer was moved
in the tiny window between pruning and freezing. But isn't that also
true if the pruning record *didn't* emit an FPI? In that case,
the pruning record wasn't the first WAL-logged modification to the
page during the then-current checkpoint cycle, and some earlier
modification already did it. But in that case also the freezing won't
need to emit a new FPI, except for the same identical case where the
redo pointer moves at just the wrong time.

Put differently, I can't see any reason to care whether pruning
emitted an FPI or not. Either way, it's very unlikely that freezing
needs to do so.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Eager page freeze criteria clarification

2023-07-28 Thread Peter Geoghegan
On Fri, Jul 28, 2023 at 3:27 PM Melanie Plageman
 wrote:
> I see. I don't have an opinion on the "best of a bad situation"
> argument. Though, I think it is worth amending the comment in the code
> to include this explanation.

I think that the user-facing docs should be completely overhauled to
make that clear, and reasonably accessible. It's hard to talk about in
code comments because it's something that can only be effective over
time, across multiple VACUUM operations.

> But, ISTM that there should also be some independent heuristic to
> determine whether or not it makes sense to freeze the page. That could
> be related to whether or not it will be cheap to do so (in which case
> we can check if we will have to emit an FPI as part of the freeze
> record) or it could be related to whether or not the freezing is
> likely to be pointless (we are likely to update the page again soon).

It sounds like you're interested in adding additional criteria to
trigger page-level freezing. That's something that the current
structure anticipates.

To give a slightly contrived example: it would be very easy to add
another condition, where (say) we call random() to determine whether
or not to freeze the page. You'd very likely want to gate this new
trigger criteria in the same way as the FPI criteria (i.e. only do it
when "prunestate->all_visible && prunestate->all_frozen" hold). We've
decoupled the decision to freeze from what it means to execute
freezing itself.
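
Rendered against the lazy_scan_prune() snippet quoted elsewhere in this
thread, that contrived example would amount to something like this
(illustration only, obviously not a proposal):

    if (pagefrz.freeze_required || tuples_frozen == 0 ||
        (prunestate->all_visible && prunestate->all_frozen &&
         (fpi_before != pgWalUsage.wal_fpi ||
          random() % 100 == 0)))    /* new, contrived trigger criterion */
    {
        /* execute freezing, exactly as today */
    }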

Having additional trigger criteria makes a lot of sense. I'm sure that
it makes sense to add at least one more, and it seems possible that
adding several more makes sense. Obviously, that will have to be new
work targeting 17, though. I made a decision to stop working on
VACUUM, though, so I'm afraid I won't be able to offer much help with
any of this. (Happy to give more background information, though.)

-- 
Peter Geoghegan




Re: Eager page freeze criteria clarification

2023-07-28 Thread Melanie Plageman
On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan  wrote:
>
> On Fri, Jul 28, 2023 at 11:13 AM Melanie Plageman
>  wrote:
> > if (pagefrz.freeze_required || tuples_frozen == 0 ||
> > (prunestate->all_visible && prunestate->all_frozen &&
> >  fpi_before != pgWalUsage.wal_fpi))
> > {
> >
> > I'm trying to understand the condition fpi_before !=
> > pgWalUsage.wal_fpi -- don't eager freeze if pruning emitted an FPI.
>
> You mean "prunestate->all_visible && prunestate->all_frozen", which is
> a condition of applying FPI-based eager freezing, but not traditional
> lazy freezing?

Ah no, a very key word in my sentence was off:
I meant "don't eager freeze *unless* pruning emitted an FPI"
I was asking about the FPI-trigger optimization specifically.
I'm clear on the all_frozen/all_visible criteria.

> > Is this test meant to guard against unnecessary freezing or to avoid
> > freezing when the cost is too high? That is, are we trying to
> > determine how likely it is that the page has been recently modified
> > and avoid eager freezing when it would be pointless (because the page
> > will soon be modified again)?
>
> Sort of. The cost of freezing over time is weirdly nonlinear, so it's
> hard to give a simple answer.
>
> The justification for the FPI trigger optimization is that FPIs are
> overwhelmingly the cost that really matters when it comes to freezing
> (and vacuuming in general) -- so we might as well make the best out of
> a bad situation when pruning happens to get an FPI. There can easily
> be a 10x or more cost difference (measured in total WAL volume)
> between freezing without an FPI and freezing with an FPI.
...
> In 16 VACUUM just "makes the best
> out of a bad situation" when an FPI was already required during
> pruning. We have already "paid for the privilege" of writing some WAL
> for the page at that point, so it's reasonable to not squander a
> window of opportunity to avoid future FPIs in future VACUUM
> operations, by freezing early.
>
> We're "taking a chance" on being able to get freezing out of the way
> early when an FPI triggers freezing. It's not guaranteed to work out
> in each individual case, of course, but even if we assume it's fairly
> unlikely to work out (which is very pessimistic) it's still very
> likely a good deal.
>
> This strategy (the 16 strategy of freezing eagerly because we already
> got an FPI) seems safer than a strategy involving freezing eagerly
> because we won't get an FPI as a result. If for no other reason than
> this: with the approach in 16 we already know for sure that we'll have
> written an FPI anyway. It's hard to imagine somebody being okay with
> the FPIs, but not being okay with the other extra WAL.

I see. I don't have an opinion on the "best of a bad situation"
argument. Though, I think it is worth amending the comment in the code
to include this explanation.

But, ISTM that there should also be some independent heuristic to
determine whether or not it makes sense to freeze the page. That could
be related to whether or not it will be cheap to do so (in which case
we can check if we will have to emit an FPI as part of the freeze
record) or it could be related to whether or not the freezing is
likely to be pointless (we are likely to update the page again soon).

It sounds like it was discussed before, but I'd be interested in
revisiting it and happy to test out various ideas.

- Melanie




Re: Eager page freeze criteria clarification

2023-07-28 Thread Peter Geoghegan
On Fri, Jul 28, 2023 at 11:13 AM Melanie Plageman
 wrote:
> if (pagefrz.freeze_required || tuples_frozen == 0 ||
> (prunestate->all_visible && prunestate->all_frozen &&
>  fpi_before != pgWalUsage.wal_fpi))
> {
>
> I'm trying to understand the condition fpi_before !=
> pgWalUsage.wal_fpi -- don't eager freeze if pruning emitted an FPI.

You mean "prunestate->all_visible && prunestate->all_frozen", which is
a condition of applying FPI-based eager freezing, but not traditional
lazy freezing?

Obviously, the immediate impact of that is that the FPI trigger
condition is not met unless we know for sure that the page will be
marked all-visible and all-frozen in the visibility map afterwards. A
page that's eligible to become all-visible will also be seen as
eligible to become all-frozen in the vast majority of cases, but there
are some rare and obscure cases involving MultiXacts that must be
considered here. There is little point in freezing early unless we
have at least some chance of not having to freeze the same page again
in the future (ideally forever).

There is generally no point in freezing most of the tuples on a page
when there'll still be one or more not-yet-eligible unfrozen tuples
that get left behind -- we might as well not bother with freezing at
all when we see a page where that'll be the outcome from freezing.
However, there is (once again) a need to account for rare and extreme
cases -- it still needs to be possible to do that. Specifically, we
may be forced to freeze a page that's left with some remaining
unfrozen tuples when VACUUM is fundamentally incapable of freezing
them due to its OldestXmin/removable cutoff being too old. That can
happen when VACUUM needs to freeze according to the traditional
age-based settings, and yet the OldestXmin/removable cutoff gets held
back (by a leaked replication slot or whatever).

(Actually, VACUUM FREEZE will freeze only a subset of the
tuples from some heap pages far more often. VACUUM FREEZE seems like a
bad design to me, though -- it uses the most aggressive possible XID
cutoff for freezing when it should probably hold off on freezing those
individual pages where we determine that it makes little sense. We
need to focus more on physical pages and their costs, and less on XID
cutoffs.)

> Is this test meant to guard against unnecessary freezing or to avoid
> freezing when the cost is too high? That is, are we trying to
> determine how likely it is that the page has been recently modified
> and avoid eager freezing when it would be pointless (because the page
> will soon be modified again)?

Sort of. The cost of freezing over time is weirdly nonlinear, so it's
hard to give a simple answer.

The justification for the FPI trigger optimization is that FPIs are
overwhelmingly the cost that really matters when it comes to freezing
(and vacuuming in general) -- so we might as well make the best out of
a bad situation when pruning happens to get an FPI. There can easily
be a 10x or more cost difference (measured in total WAL volume)
between freezing without an FPI and freezing with an FPI.

> Or are we trying to determine how likely
> the freeze record is to emit an FPI and avoid eager freezing when it
> isn't worth the cost?

No, that's not something that we're doing right now (we just talked
about doing something like that). In 16 VACUUM just "makes the best
out of a bad situation" when an FPI was already required during
pruning. We have already "paid for the privilege" of writing some WAL
for the page at that point, so it's reasonable to not squander a
window of opportunity to avoid future FPIs in future VACUUM
operations, by freezing early.

We're "taking a chance" on being able to get freezing out of the way
early when an FPI triggers freezing. It's not guaranteed to work out
in each individual case, of course, but even if we assume it's fairly
unlikely to work out (which is very pessimistic) it's still very
likely a good deal.

This strategy (the 16 strategy of freezing eagerly because we already
got an FPI) seems safer than a strategy involving freezing eagerly
because we won't get an FPI as a result. If for no other reason than
this: with the approach in 16 we already know for sure that we'll have
written an FPI anyway. It's hard to imagine somebody being okay with
the FPIs, but not being okay with the other extra WAL.

> But, the final rationale is still not clear to me. Could we add a
> comment above the if condition specifying both:
> a) what the test is a proxy for
> b) the intended outcome (when do we expect to eager freeze)
> And perhaps we could even describe a scenario where this heuristic is 
> effective?

There are lots of scenarios where it'll be effective. I agree that
there is a need to document this stuff a lot better. I have a pending
doc patch that overhauls the user-facing docs in this area.

My latest draft is here:

https://postgr.es/m/CAH2-Wz=UUJz+MMb1AxFzz-HDA=1t1fx_kmrdovopzxkpa-t...@mail.gmail.com

Re: add timing information to pg_upgrade

2023-07-28 Thread Nathan Bossart
On Fri, Jul 28, 2023 at 10:38:14AM -0700, Nathan Bossart wrote:
> I'm okay with simply adding the time.  However, I wonder if we want to
> switch to seconds, minutes, hours, etc. if the step takes longer.  This
> would add a bit of complexity, but it would improve human readability.
> Alternatively, we could keep the code simple and machine readable by always
> using milliseconds.

... or maybe we show both like psql does.  Attached is a new version of the
patch in which I've made use of the INSTR_TIME_* macros and enhanced the
output formatting (largely borrowed from PrintTiming() in
src/bin/psql/common.c).
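
For reference, the output would then look roughly like this, first for a
sub-second step and then for a step over a minute (illustrative values,
not actual output):

    Creating dump of database schemas                           ok    32.178 ms
    Restoring database schemas in the new cluster               ok    75123.456 ms (01:15.123)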

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 6a4b836b33c3e18a57438044a10da95ec89dcb65 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 27 Jul 2023 16:16:45 -0700
Subject: [PATCH v2 1/1] add timing information to pg_upgrade

---
 src/bin/pg_upgrade/util.c | 55 ++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_upgrade/util.c b/src/bin/pg_upgrade/util.c
index 21ba4c8f12..25e43a593a 100644
--- a/src/bin/pg_upgrade/util.c
+++ b/src/bin/pg_upgrade/util.c
@@ -9,14 +9,19 @@
 
 #include "postgres_fe.h"
 
+#include 
 #include 
 
 #include "common/username.h"
 #include "pg_upgrade.h"
+#include "portability/instr_time.h"
 
 LogOpts		log_opts;
 
+static instr_time step_start;
+
 static void pg_log_v(eLogType type, const char *fmt, va_list ap) pg_attribute_printf(2, 0);
+static char *get_time_str(double time_ms);
 
 
 /*
@@ -137,6 +142,8 @@ prep_status(const char *fmt,...)
 
 	/* trim strings */
 	pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message);
+
+	INSTR_TIME_SET_CURRENT(step_start);
 }
 
 /*
@@ -170,6 +177,8 @@ prep_status_progress(const char *fmt,...)
 		pg_log(PG_REPORT, "%-*s", MESSAGE_WIDTH, message);
 	else
 		pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message);
+
+	INSTR_TIME_SET_CURRENT(step_start);
 }
 
 static void
@@ -280,11 +289,55 @@ pg_fatal(const char *fmt,...)
 }
 
 
+static char *
+get_time_str(double time_ms)
+{
+	double		seconds;
+	double		minutes;
+	double		hours;
+	double		days;
+
+	if (time_ms < 1000.0)
+		return psprintf("%.3f ms", time_ms);
+
+	seconds = time_ms / 1000.0;
+	minutes = floor(seconds / 60.0);
+	seconds -= 60.0 * minutes;
+	if (minutes < 60.0)
+		return psprintf("%.3f ms (%02d:%06.3f)",
+		time_ms, (int) minutes, seconds);
+
+	hours = floor(minutes / 60.0);
+	minutes -= 60.0 * hours;
+	if (hours < 24.0)
+		return psprintf("%.3f ms (%02d:%02d:%06.3f)",
+		time_ms, (int) hours, (int) minutes, seconds);
+
+	days = floor(hours / 24.0);
+	hours -= 24.0 * days;
+	return psprintf("%.3f ms (%.0f d %02d:%02d:%06.3f)",
+	time_ms, days, (int) hours, (int) minutes, seconds);
+}
+
+
 void
 check_ok(void)
 {
+	instr_time	step_end;
+	char	   *step_time;
+
+	Assert(!INSTR_TIME_IS_ZERO(step_start));
+
+	INSTR_TIME_SET_CURRENT(step_end);
+	INSTR_TIME_SUBTRACT(step_end, step_start);
+	step_time = get_time_str(INSTR_TIME_GET_MILLISEC(step_end));
+
+	/* reset start time */
+	INSTR_TIME_SET_ZERO(step_start);
+
 	/* all seems well */
-	report_status(PG_REPORT, "ok");
+	report_status(PG_REPORT, "ok\t%s", step_time);
+	pfree(step_time);
 }
 
 
-- 
2.25.1



Re: In Postgres 16 BETA, should the ParseNamespaceItem have the same index as it's RangeTableEntry?

2023-07-28 Thread Farias de Oliveira
Hello,

Thank you Amit, changing the 3rd argument to 0 fixes some of the errors
(6 out of 24 tests are still failing), but it throws a new one:
"ERROR: bad buffer ID: 0". We will need to take a more in-depth look here
at why this is occurring, but thank you so much for the help.

Thank you,
Matheus Farias

Em qua., 26 de jul. de 2023 às 08:30, Amit Langote 
escreveu:

> Hello,
>
> On Fri, Jul 21, 2023 at 5:05 AM Farias de Oliveira
>  wrote:
> >
> > Hello,
> >
> > Thank you for the help guys and I'm so sorry for my late response.
> Indeed, the error relies on the ResultRelInfo. In
> GetResultRTEPermissionInfo() function, it does a checking on the
> relinfo->ri_RootResultRelInfo variable. I believe that it should go inside
> this scope:
> >
> >
> > if (relinfo->ri_RootResultRelInfo)
> > {
> > /*
> > * For inheritance child result relations (a partition routing target
> > * of an INSERT or a child UPDATE target), this returns the root
> > * parent's RTE to fetch the RTEPermissionInfo because that's the only
> > * one that has one assigned.
> > */
> > rti = relinfo->ri_RootResultRelInfo->ri_RangeTableIndex;
> > }
> >
> > The relinfo contains:
> >
> > {type = T_ResultRelInfo, ri_RangeTableIndex = 5, ri_RelationDesc =
> 0x7f44e3308cc8, ri_NumIndices = 0, ri_IndexRelationDescs = 0x0,
> ri_IndexRelationInfo = 0x0, ri_RowIdAttNo = 0,
> >   ri_extraUpdatedCols = 0x0, ri_projectNew = 0x0, ri_newTupleSlot = 0x0,
> ri_oldTupleSlot = 0x0, ri_projectNewInfoValid = false, ri_TrigDesc = 0x0,
> ri_TrigFunctions = 0x0,
> >   ri_TrigWhenExprs = 0x0, ri_TrigInstrument = 0x0, ri_ReturningSlot =
> 0x0, ri_TrigOldSlot = 0x0, ri_TrigNewSlot = 0x0, ri_FdwRoutine = 0x0,
> ri_FdwState = 0x0,
> >   ri_usesFdwDirectModify = false, ri_NumSlots = 0,
> ri_NumSlotsInitialized = 0, ri_BatchSize = 0, ri_Slots = 0x0, ri_PlanSlots
> = 0x0, ri_WithCheckOptions = 0x0,
> >   ri_WithCheckOptionExprs = 0x0, ri_ConstraintExprs = 0x0,
> ri_GeneratedExprsI = 0x0, ri_GeneratedExprsU = 0x0, ri_NumGeneratedNeededI
> = 0, ri_NumGeneratedNeededU = 0,
> >   ri_returningList = 0x0, ri_projectReturning = 0x0,
> ri_onConflictArbiterIndexes = 0x0, ri_onConflict = 0x0,
> ri_matchedMergeAction = 0x0, ri_notMatchedMergeAction = 0x0,
> >   ri_PartitionCheckExpr = 0x0, ri_ChildToRootMap = 0x0,
> ri_ChildToRootMapValid = false, ri_RootToChildMap = 0x0,
> ri_RootToChildMapValid = false, ri_RootResultRelInfo = 0x0,
> >   ri_PartitionTupleSlot = 0x0, ri_CopyMultiInsertBuffer = 0x0,
> ri_ancestorResultRels = 0x0}
> >
> > Since relinfo->ri_RootResultRelInfo = 0x0, the rti will have no value
> and Postgres will interpret that the ResultRelInfo must've been created
> only for filtering triggers and the relation is not being inserted into.
> > The relinfo variable is created with the create_entity_result_rel_info()
> function:
> >
> > ResultRelInfo *create_entity_result_rel_info(EState *estate, char
> *graph_name,
> >  char *label_name)
> > {
> > RangeVar *rv;
> > Relation label_relation;
> > ResultRelInfo *resultRelInfo;
> >
> > ParseState *pstate = make_parsestate(NULL);
> >
> > resultRelInfo = palloc(sizeof(ResultRelInfo));
> >
> > if (strlen(label_name) == 0)
> > {
> > rv = makeRangeVar(graph_name, AG_DEFAULT_LABEL_VERTEX, -1);
> > }
> > else
> > {
> > rv = makeRangeVar(graph_name, label_name, -1);
> > }
> >
> > label_relation = parserOpenTable(pstate, rv, RowExclusiveLock);
> >
> > // initialize the resultRelInfo
> > InitResultRelInfo(resultRelInfo, label_relation,
> >   list_length(estate->es_range_table), NULL,
> >   estate->es_instrument);
> >
> > // open the parse state
> > ExecOpenIndices(resultRelInfo, false);
> >
> > free_parsestate(pstate);
> >
> > return resultRelInfo;
> > }
> >
> > In this case, how can we get the relinfo->ri_RootResultRelInfo to store
> the appropriate data?
>
> Your function doesn't seem to have access to the ModifyTableState
> node, so setting ri_RootResultRelInfo to the correct ResultRelInfo
> node does not seem doable.
>
> As I suggested in my previous reply, please check if passing 0 (not
> list_length(estate->es_range_table)) for the 3rd argument
> InitResultRelInfo() fixes the problem and gives the correct result.
>
> --
> Thanks, Amit Langote
> EDB: http://www.enterprisedb.com
>


Re: Let's make PostgreSQL multi-threaded

2023-07-28 Thread Matthias van de Meent
On Thu, 15 Jun 2023 at 11:07, Hannu Krosing  wrote:
>
> One more unexpected benefit of having shared caches would be easing
> access to other databases.
>
> If the system caches are there for all databases anyway, then it
> becomes much easier to make queries using objects from multiple
> databases.

We have several optimizations in our visibility code that allow us to
remove dead tuples from this database when another database still has
a connection that has an old snapshot in which the deleting
transaction of this database has not yet committed. This is allowed
because we can say with confidence that other database's connections
will never be able to see this database's tables. If we were to allow
cross-database data access, that would require cross-database snapshot
visibility checks, and that would severely hinder these optimizations.
As an example, it would increase the work we need to do for snapshots:
For the snapshot data of tables that aren't shared catalogs, we only
need to consider our own database's backends for visibility. With
cross-database visibility, we would need to consider all active
backends for all snapshots, and this can be significantly more work.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: New PostgreSQL Contributors

2023-07-28 Thread Nathan Bossart
On Fri, Jul 28, 2023 at 07:19:21PM +0200, Vik Fearing wrote:
> On 7/28/23 17:29, Christoph Berg wrote:
> 
>> New PostgreSQL Contributors:
>> 
>>Ajin Cherian
>>Alexander Kukushkin
>>Alexander Lakhin
>>Dmitry Dolgov
>>Hou Zhijie
>>Ilya Kosmodemiansky
>>Melanie Plageman
>>Michael Banck
>>Michael Brewer
>>Paul Jungwirth
>>Peter Smith
>>Vigneshwaran C
>> 
>> New PostgreSQL Major Contributors:
>> 
>>Stacey Haysler
>>Steve Singer
> 
> Congratulations to all, and well deserved!

+1, congrats!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add timing information to pg_upgrade

2023-07-28 Thread Nathan Bossart
On Fri, Jul 28, 2023 at 01:10:14PM +0530, Bharath Rupireddy wrote:
> How about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT and
> INSTR_TIME_GET_MILLISEC macros instead of gettimeofday and explicit
> calculations?

That seems like a good idea.

> +report_status(PG_REPORT, "ok (took %ld ms)", elapsed_ms);
> 
> I think it's okay to just leave it with "ok  \t %ld ms", elapsed_ms);
> without "took". FWIW, pg_regress does that way, see below:

I'm okay with simply adding the time.  However, I wonder if we want to
switch to seconds, minutes, hours, etc. if the step takes longer.  This
would add a bit of complexity, but it would improve human readability.
Alternatively, we could keep the code simple and machine readable by always
using milliseconds.

> Just curious, out of all the reported pg_upgrade prep_status()-es,
> which ones are taking more time?

I haven't done any testing on meaningful workloads yet, but the following
show up on an empty cluster: "creating dump of database schemas",
"analyzing all rows in the new cluster", "setting next transaction ID and
epoch for new cluster", "restoring datbase schemas in the new cluster", and
"sync data directory to disk".  I imagine the dump, restore, and
file-copying steps will stand out once you start adding data.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: New PostgreSQL Contributors

2023-07-28 Thread Vik Fearing

On 7/28/23 17:29, Christoph Berg wrote:


New PostgreSQL Contributors:

   Ajin Cherian
   Alexander Kukushkin
   Alexander Lakhin
   Dmitry Dolgov
   Hou Zhijie
   Ilya Kosmodemiansky
   Melanie Plageman
   Michael Banck
   Michael Brewer
   Paul Jungwirth
   Peter Smith
   Vigneshwaran C

New PostgreSQL Major Contributors:

   Stacey Haysler
   Steve Singer


Congratulations to all, and well deserved!
--
Vik Fearing





Re: harmonize password reuse in vacuumdb, clusterdb, and reindexdb

2023-07-28 Thread Nathan Bossart
On Wed, Jul 19, 2023 at 10:43:11AM -0700, Nathan Bossart wrote:
> On Wed, Jun 28, 2023 at 10:24:09PM -0700, Nathan Bossart wrote:
>> The same commit that added this comment (ff402ae) also set the
>> allow_password_reuse parameter to true in vacuumdb's connectDatabase()
>> calls.  I found a message from the corresponding thread that provides some
>> additional detail [0].  I wonder if this comment should instead recommend
>> against using the allow_password_reuse flag unless reconnecting to the same
>> host/port/user target.  Connecting to different databases with the same
>> host/port/user information seems okay.  Maybe I am missing something... 
> 
> I added Tom here since it looks like he was the original author of this
> comment.  Tom, do you have any concerns with updating the comment for
> connectDatabase() in src/fe_utils/connect_utils.c like this?

I went ahead and committed this.  I'm happy to revisit if there are
concerns.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




New PostgreSQL Contributors

2023-07-28 Thread Christoph Berg
The PostgreSQL contributors team has been looking over the community
activity and, over the first half of this year, has been recognizing
new contributors to be listed on

https://www.postgresql.org/community/contributors/

New PostgreSQL Contributors:

  Ajin Cherian
  Alexander Kukushkin
  Alexander Lakhin
  Dmitry Dolgov
  Hou Zhijie
  Ilya Kosmodemiansky
  Melanie Plageman
  Michael Banck
  Michael Brewer
  Paul Jungwirth
  Peter Smith
  Vigneshwaran C

New PostgreSQL Major Contributors:

  Julien Rouhaud
  Stacey Haysler
  Steve Singer

Thank you all for contributing to PostgreSQL to make it such a great project!

For the contributors team,
Christoph




Re: Synchronizing slots from primary to standby

2023-07-28 Thread Bharath Rupireddy
On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila  wrote:
>
> I wonder if we anyway need some sort of design like this because we
> shouldn't allow spawning as many workers as the number of databases.
> There has to be some existing or new GUC like max_sync_slot_workers
> which decides the number of workers.

It seems reasonable to not have one slot sync worker for each
database. IMV, the slot sync workers must be generic and independently
manageable - generic in the sense that, given a database and primary
conninfo, each worker must sync all the slots related to the given
database; independently manageable in the sense that there is a separate
GUC for the number of sync workers, and they are launchable dynamically
by the logical replication launcher directly. The work division amongst
the sync workers can be simple: the logical replication launcher builds
a shared memory structure based on the number of slots to sync and
starts the sync workers dynamically, and each sync worker picks {dboid,
slot name, conninfo} from the shared memory, syncs it, and proceeds to
the other slots.
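
For illustration, the shared memory structure could look something like
this (all names and sizes invented, not from the patch; assumes the usual
backend headers for NameData, slock_t, and FLEXIBLE_ARRAY_MEMBER):

/* one entry per slot to synchronize; filled by the launcher */
typedef struct SlotSyncEntry
{
    Oid         dboid;              /* database the slot belongs to */
    NameData    slot_name;          /* slot to synchronize */
    char        conninfo[1024];     /* primary conninfo to use */
    bool        in_progress;        /* claimed by a sync worker? */
} SlotSyncEntry;

typedef struct SlotSyncShmem
{
    slock_t     mutex;              /* protects the array below */
    int         num_entries;
    SlotSyncEntry entries[FLEXIBLE_ARRAY_MEMBER];
} SlotSyncShmem;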

Thoughts?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Eager page freeze criteria clarification

2023-07-28 Thread Melanie Plageman
Hi,

While hacking on the pruning and page-level freezing code, I am a bit
confused by the test guarding eager freezing [1]:

/*
 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
 * freeze when pruning generated an FPI, if doing so means that we set the
 * page all-frozen afterwards (might not happen until final heap pass).
 */
if (pagefrz.freeze_required || tuples_frozen == 0 ||
(prunestate->all_visible && prunestate->all_frozen &&
 fpi_before != pgWalUsage.wal_fpi))
{

I'm trying to understand the condition fpi_before !=
pgWalUsage.wal_fpi -- don't eager freeze if pruning emitted an FPI.

Is this test meant to guard against unnecessary freezing or to avoid
freezing when the cost is too high? That is, are we trying to
determine how likely it is that the page has been recently modified
and avoid eager freezing when it would be pointless (because the page
will soon be modified again)? Or are we trying to determine how likely
the freeze record is to emit an FPI and avoid eager freezing when it
isn't worth the cost? Or something else?

The commit message says:

> Also teach VACUUM to trigger page-level freezing whenever it detects
> that heap pruning generated an FPI.  We'll have already written a large
> amount of WAL just to do that much, so it's very likely a good idea to
> get freezing out of the way for the page early.

And I found the thread where it was discussed [2]. Several possible
explanations are mentioned in the thread.

But, the final rationale is still not clear to me. Could we add a
comment above the if condition specifying both:
a) what the test is a proxy for
b) the intended outcome (when do we expect to eager freeze)
And perhaps we could even describe a scenario where this heuristic is effective?

- Melanie

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/access/heap/vacuumlazy.c#L1802
[2] 
https://www.postgresql.org/message-id/flat/CAH2-Wzm_%3DmrWO%2BkUAJbR_gM_6RzpwVA8n8e4nh3dJGHdw_urew%40mail.gmail.com#c5aae6ea65e07de92a24a35445473448




Re: Synchronizing slots from primary to standby

2023-07-28 Thread Bharath Rupireddy
On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila  wrote:
>
> > 2. All candidate standbys will start one slot sync worker per logical
> > slot which might not be scalable.
>
> Yeah, that doesn't sound like a good idea but IIRC, the proposed patch
> is using one worker per database (for all slots corresponding to a
> database).

Right. It's based on one worker for each database.

> > Is having one (or a few more - not
> > necessarily one for each logical slot) worker for all logical slots
> > enough?
>
> I guess for a large number of slots there is a possibility of a large
> gap in syncing the slots, which probably means we need to retain the
> corresponding WAL for a much longer time on the primary. If we can
> prove that the gap won't be large enough to matter, then this would
> probably be worth considering; otherwise, I think we should find a way
> to scale the number of workers to avoid the large gap.

I think the gap is largely determined by the time taken to advance
each slot and the amount of WAL that each logical slot moves ahead on
the primary. I've measured the time it takes for
pg_logical_replication_slot_advance with different amounts of WAL on my
system. It took 2595ms/5091ms/31238ms to advance the slot by
3.7GB/7.3GB/13GB respectively. To put things into perspective, imagine
there are 3 logical slots to sync for a single slot sync worker and each
of them needs to be advanced by 3.7GB/7.3GB/13GB of WAL. The slot sync
worker gets back to slot 1 only after 2595ms+5091ms+31238ms (~40sec), by
which time slot 1 has moved further ahead on the primary; it gets back
to slot 2 only after re-advancing slot 1 by the WAL it accumulated
during those 40sec, gets back to slot 3 only after re-advancing slots 1
and 2, and so on. If WAL generation on the primary is pretty fast, and
if the logical slots move ahead pretty fast on the primary, the time it
takes for a single sync worker to sync a given slot can increase.

Now, let's think about what happens if there's a large gap, IOW, a
logical slot on the standby is behind the corresponding logical slot on
the primary by X amount of WAL. The standby needs to retain more WAL for
sure. IIUC, the primary doesn't need to retain the WAL required for a
logical slot on the standby, no?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-28 Thread Anthonin Bonnefoy
> What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT
and INSTR_TIME_GET_MILLISEC
> macros for timing calculations?
If you're talking about the two instances where I'm modifying the instr_time's
ticks, it's because I can't use the macros there.
The first case is the parse span: I only have the start timestamp from
GetCurrentStatementStartTimestamp and don't have access to the start
instr_time, so I need to build the duration from two timestamps.
The second case is when building node spans from the planstate, where I
directly have the duration from Instrumentation.

I guess one fix would be to use an int64 for the span duration to directly
store nanoseconds instead of an instr_time, but I do use the instrumentation
macros outside of those two cases to get the duration of the other spans.
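
A minimal sketch of that idea (variable names invented; assumes the usual
backend headers for TimestampTz and instr_time), where both sources feed
the same int64 nanosecond field:

    int64       duration_ns;
    TimestampTz parse_start;
    TimestampTz parse_end;
    instr_time  span_start;
    instr_time  span_end;

    /* parse span: only two timestamps (microsecond resolution) are available */
    parse_start = GetCurrentStatementStartTimestamp();
    parse_end = GetCurrentTimestamp();
    duration_ns = (parse_end - parse_start) * 1000;     /* usec -> nsec */

    /* other spans: an instr_time interval is available */
    INSTR_TIME_SET_CURRENT(span_start);
    /* ... traced work ... */
    INSTR_TIME_SET_CURRENT(span_end);
    INSTR_TIME_SUBTRACT(span_end, span_start);
    duration_ns = INSTR_TIME_GET_NANOSEC(span_end);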

> Also, have you thought about a way to trace existing (running) queries
without directly instrumenting them?
That's a good point. I was focusing on leaving the sampling decision to the
caller through the sampled flag and
only recently added the pg_tracing_sample_rate parameter to give more
control. It should be straightforward to
add an option to create standalone traces based on sample rate alone. This
way, setting the sample rate to 1
would force the queries running in the session to be traced.


On Fri, Jul 28, 2023 at 3:02 PM Nikita Malakhov  wrote:

> Hi!
>
> What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT
> and INSTR_TIME_GET_MILLISEC
> macros for timing calculations?
>
> Also, have you thought about a way to trace existing (running) queries
> without directly instrumenting them? It would
> be much more usable for maintenance and support personnel, because in
> production environments you rarely could
> change query text directly. For the current state the most simple solution
> is switch tracing on and off by calling SQL
> function, and possibly switch tracing for prepared statement the same way,
> but not for any random query.
>
> I'll check the patch for the race conditions.
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/
>


Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2023-07-28 Thread Ashutosh Bapat
Hi Richard,

On Fri, Jul 28, 2023 at 2:03 PM Richard Guo  wrote:
>
>
> It doesn't seem too expensive to translate SpecialJoinInfos, so I think
> it's OK to construct and free the SpecialJoinInfo for a child join on
> the fly.  So the approach in 0002 looks reasonable to me.  But there is
> something that needs to be fixed in 0002.

Thanks for the review. I am fine with going ahead with just 0002, if that is enough.

>
> +   bms_free(child_sjinfo->commute_above_l);
> +   bms_free(child_sjinfo->commute_above_r);
> +   bms_free(child_sjinfo->commute_below_l);
> +   bms_free(child_sjinfo->commute_below_r);
>
> These four members in SpecialJoinInfo only contain outer join relids.
> They do not need to be translated.  So they would reference the same
> memory areas in child_sjinfo as in parent_sjinfo.  We should not free
> them, otherwise they would become dangling pointers in parent sjinfo.
>

Good catch. That saved me some time; I would have caught this the hard way
when running the regression tests. Thanks a lot.

I think we should 1. add an assert to make sure that commute_above_*
do not require any translations, i.e. they do not contain any parent
relids of the child rels, 2. not free those, and 3. add a comment about
keeping the build and free functions in sync. I will work on those
next.
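
For item 1, something along these lines is what I have in mind (a sketch
only; "parent_relids" stands for the parent relids being translated for
the child join, and the exact placement in the build function is still to
be worked out):

    /*
     * The commute_above_l/r and commute_below_l/r sets contain only
     * outer-join relids, so they should never overlap the parent relids
     * being translated for this child join.
     */
    Assert(!bms_overlap(parent_sjinfo->commute_above_l, parent_relids));
    Assert(!bms_overlap(parent_sjinfo->commute_above_r, parent_relids));
    Assert(!bms_overlap(parent_sjinfo->commute_below_l, parent_relids));
    Assert(!bms_overlap(parent_sjinfo->commute_below_r, parent_relids));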

-- 
Best Wishes,
Ashutosh Bapat




Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-28 Thread Nikita Malakhov
Hi!

What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT
and INSTR_TIME_GET_MILLISEC
macros for timing calculations?

Also, have you thought about a way to trace existing (running) queries
without directly instrumenting them? It would
be much more usable for maintenance and support personnel, because in
production environments you rarely could
change query text directly. For the current state the most simple solution
is switch tracing on and off by calling SQL
function, and possibly switch tracing for prepared statement the same way,
but not for any random query.

I'll check the patch for the race conditions.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Postgres v15 windows bincheck regression test failures

2023-07-28 Thread Alexander Lakhin

28.07.2023 14:42, Noah Misch wrote:

That was about a bug that appears when using TCP sockets. ...


Yes, and according to the failed test output, TCP sockets were used:

#   'pg_amcheck: error: connection to server at
"127.0.0.1", port 49393 failed: server closed the connection
unexpectedly
#   This probably means the server terminated abnormally
#   before or while processing the request.

Best regards,
Alexander




Re: logical decoding and replication of sequences, take 2

2023-07-28 Thread Tomas Vondra



On 7/28/23 14:35, Ashutosh Bapat wrote:
> On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra
>  wrote:
>>
>> On 7/24/23 14:57, Ashutosh Bapat wrote:
>>> ...
>>>


 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
 I was thinking maybe we should have it in the transaction (because we
 need to do cleanup at the end). It seem a bit inconvenient, because then
 we'd need to either search htabs in all subxacts, or transfer the
 entries to the top-level xact (otoh, we already do that with snapshots),
 and cleanup on abort.

 What do you think?
>>>
>>> Hash table per transaction seems saner design. Adding it to the top
>>> level transaction should be fine. The entry will contain an XID
>>> anyway. If we add it to every subtransaction we will need to search
>>> hash table in each of the subtransactions when deciding whether a
>>> sequence change is transactional or not. Top transaction is a
>>> reasonable trade off.
>>>
>>
>> It's not clear to me what design you're proposing, exactly.
>>
>> If we track it in top-level transactions, then we'd need copy the data
>> whenever a transaction is assigned as a child, and perhaps also remove
>> it when there's a subxact abort.
> 
> I thought, esp. with your changes to assign xid, we will always know
> the top level transaction when a sequence is assigned a relfilenode.
> So the refilenodes will always get added to the correct hash directly.
> I didn't imagine a case where we will need to copy the hash table from
> sub-transaction to top transaction. If that's true, yes it's
> inconvenient.
> 

Well, it's a matter of efficiency.

To check if a sequence change is transactional, we need to check if it's
for a relfilenode created in the current transaction (it can't be for
relfilenode created in a concurrent top-level transaction, due to MVCC).

If you don't copy the entries into the top-level xact, you have to walk
all subxacts and search all of those, for each sequence change. And
there may be quite a few of both subxacts and sequence changes ...

I wonder if we need to search the other top-level xacts too, but we
probably do, because the change might come from a subxact that has no
assignment yet, or something like that.
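
To illustrate the kind of lookup I mean (sequences_hash is a hypothetical
per-transaction hash table here; the real field will likely look a bit
different):

static bool
sequence_change_is_transactional(ReorderBuffer *rb, RelFileLocator rlocator)
{
    dlist_iter  iter;

    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
                                                node, iter.cur);
        bool        found;

        if (txn->sequences_hash == NULL)
            continue;

        /* was the relfilenode created by this (top-level) transaction? */
        (void) hash_search(txn->sequences_hash, &rlocator, HASH_FIND, &found);
        if (found)
            return true;
    }

    return false;
}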

> As to the abort, don't we already remove entries on subtxn abort?
> Having per transaction hash table doesn't seem to change anything
> much.
> 

What entries are we removing? My point is that if we copy the entries to
the top-level xact, we probably need to remove them on abort. Or we
could leave them in the top-level xact hash.

>>
>> And we'd need to still search the hashes in all toplevel transactions on
>> every sequence increment - in principle we can't have increment for a
>> sequence created in another in-progress transaction, but maybe it's just
>> not assigned yet.
> 
> We hold a strong lock on sequence when changing its relfilenode. The
> sequence whose relfilenode is being changed can not be accessed by any
> concurrent transaction. So I am not able to understand what you are
> trying to say.
> 

How do you know the subxact has already been recognized as such? It may
be treated as a top-level transaction for a while, until the assignment.

> I think per (top level) transaction hash table is cleaner design. It
> puts the hash table where it should be. But if that makes code
> difficult, current design works too.
> 


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-07-28 Thread Ashutosh Bapat
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
 wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
> The 0005 change does that, by:
>
> 1) adding pg_sequence_state, that returns both the sequence state and
>the page LSN
>
> 2) copy_sequence returns the page LSN
>
> 3) tablesync then sets this LSN as origin_startpos (which for tables is
>just the LSN of the replication slot)
>
> AFAICS this makes it work - we start decoding at the page LSN, so that
> we  skip the increments that could lead to the sequence going backwards.
>

I like this design very much. It keeps things simple rather than
complex. Thanks for doing this.

I am wondering whether we could reuse pg_sequence_last_value() instead
of adding a new function. But the name of that function doesn't leave
much space for expanding its functionality, so we are good with a new
one. Some code deduplication between the two is probably possible,
though.

-- 
Best Wishes,
Ashutosh Bapat




Re: logical decoding and replication of sequences, take 2

2023-07-28 Thread Tomas Vondra
On 7/28/23 11:42, Amit Kapila wrote:
> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
>  wrote:
>>
>> On 7/26/23 09:27, Amit Kapila wrote:
>>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila  wrote:
>>
>> Anyway, I was thinking about this a bit more, and it seems it's not as
>> difficult to use the page LSN to ensure sequences don't go backwards.
>>
> 
> While studying the changes for this proposal and related areas, I have
> a few comments:
> 1. I think you need to advance the origin if it is changed due to
> copy_sequence(), otherwise, if the sync worker restarts after
> SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
> value.
> 

True, we want to restart at the new origin_startpos.

> 2. Between the time of SYNCDONE and READY state, the patch can skip
> applying non-transactional sequence changes even if it should apply
> it. The reason is that during that state change
> should_apply_changes_for_rel() decides whether to apply change based
> on the value of remote_final_lsn which won't be set for
> non-transactional change. I think we need to send the start LSN of a
> non-transactional record and then use that as remote_final_lsn for
> such a change.

Good catch. remote_final_lsn is set in apply_handle_begin, but that
won't happen for sequences. We're already sending the LSN, but
logicalrep_read_sequence ignores it - it should be enough to add it to
LogicalRepSequence and then set it in apply_handle_sequence().
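
Something like this is what I'm thinking of (rough sketch only; the names
follow the patch, and the final code may well differ):

typedef struct LogicalRepSequence
{
    /* ... existing fields: remote OID, name, last_value, etc. ... */
    XLogRecPtr  lsn;        /* start LSN of the sequence change record */
} LogicalRepSequence;

static void
apply_handle_sequence(StringInfo s)
{
    LogicalRepSequence seq;

    logicalrep_read_sequence(s, &seq);

    /*
     * Let should_apply_changes_for_rel() compare against the record's
     * position, the same way remote_final_lsn is used for transactional
     * changes.
     */
    remote_final_lsn = seq.lsn;

    /* ... apply the change as before ... */
}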

> 
> 3. For non-transactional sequence change apply, we don't set
> replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
> we are doing in apply_handle_commit_internal() before calling
> CommitTransactionCommand(). So, that can lead to the origin moving
> backwards after restart which will lead to requesting and applying the
> same changes again and for that period of time sequence can go
> backwards. This needs some more thought as to what is the correct
> behaviour/solution for this.
> 

I think saying "origin moves backwards" is a bit misleading. AFAICS the
origin position is not actually moving backwards, it's more that we
don't (and can't) move it forwards for each non-transactional change. So
yeah, we may re-apply those, and IMHO that's expected - the sequence is
allowed to be "ahead" on the subscriber.

I don't see a way to improve this, except maybe having a separate LSN
for non-transactional changes (for each origin).

> 4. BTW, while checking this behaviour, I noticed that the initial sync
> worker for sequence mentions the table in the LOG message: "LOG:
> logical replication table synchronization worker for subscription
> "mysub", table "s" has finished". Won't it be better here to refer to
> it as a sequence?
> 

Thanks, I'll fix that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Row pattern recognition

2023-07-28 Thread Vik Fearing

On 7/28/23 13:02, Tatsuo Ishii wrote:

Attached is the v3 patch. In this patch following changes are made.


- PATTERN variables do not have to exist in the DEFINE clause.  They are
  considered TRUE if not present.


Do you think we really need this? I found a criticism regarding this.

https://link.springer.com/article/10.1007/s13222-022-00404-3
"3.2 Explicit Definition of All Row Pattern Variables"

What do you think?


I think that a large part of obeying the standard is to allow queries 
from other engines to run the same on ours.  The standard does not 
require the pattern variables to be defined and so there are certainly 
queries out there without them, and that hurts migrating to PostgreSQL.



- I am working on making window aggregates RPR aware now. The
implementation is in progress and far from completeness. An example
is below. I think row 2, 3, 4 of "count" column should be NULL
instead of 3, 2, 0, 0. Same thing can be said to other
rows. Probably this is an effect of moving aggregate but I still
studying the window aggregation code.


This tells me again that RPR is not being run in the right place.  All
windowed aggregates and frame-level window functions should Just Work
with no modification.


I am not touching each aggregate function. I am modifying
eval_windowaggregates() in nodeWindowAgg.c, which calls each aggregate
function. Do you think it's not the right place to make window
aggregates RPR aware?


Oh, okay.


SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM
stock
   WINDOW w AS (
   PARTITION BY company
   ORDER BY tdate
   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
   AFTER MATCH SKIP PAST LAST ROW
   INITIAL
   PATTERN (START UP+ DOWN+)
   DEFINE
START AS TRUE,
UP AS price > PREV(price),
DOWN AS price < PREV(price)
);
   company  |   tdate| first_value | count
--++-+---
   company1 | 2023-07-01 | 100 | 4
   company1 | 2023-07-02 | | 3
   company1 | 2023-07-03 | | 2
   company1 | 2023-07-04 | | 0
   company1 | 2023-07-05 | | 0
   company1 | 2023-07-06 |  90 | 4
   company1 | 2023-07-07 | | 3
   company1 | 2023-07-08 | | 2
   company1 | 2023-07-09 | | 0
   company1 | 2023-07-10 | | 0
   company2 | 2023-07-01 |  50 | 4
   company2 | 2023-07-02 | | 3
   company2 | 2023-07-03 | | 2
   company2 | 2023-07-04 | | 0
   company2 | 2023-07-05 | | 0
   company2 | 2023-07-06 |  60 | 4
   company2 | 2023-07-07 | | 3
   company2 | 2023-07-08 | | 2
   company2 | 2023-07-09 | | 0
   company2 | 2023-07-10 | | 0


In this scenario, row 1's frame is the first 5 rows and specified SKIP
PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are
skipped) and the result of the outer count should be 0 for all of them
because there are no rows in the frame.


Ok. Just I want to make sure. If it's other aggregates like sum or
avg, the result of the outer aggregates should be NULL.


They all behave the same way as in a normal query when they receive no 
rows as input.



When we get to adding count in the MEASURES clause, there will be a
difference between no match and empty match, but that does not apply
here.


Can you elaborate more? I understand that "no match" and "empty match"
are different things. But I do not understand how the difference
affects the result of count.


This query:

SELECT v.a, wcnt OVER w, count(*) OVER w
FROM (VALUES ('A')) AS v (a)
WINDOW w AS (
  ORDER BY v.a
  MEASURES count(*) AS wcnt
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  PATTERN (B)
  DEFINE B AS B.a = 'B'
)

produces this result:

 a | wcnt | count
---+--+---
 A |  | 0
(1 row)

Inside the window specification, *no match* was found and so all of the 
MEASURES are null.  The count(*) in the target list however, still 
exists and operates over zero rows.


This very similar query:

SELECT v.a, wcnt OVER w, count(*) OVER w
FROM (VALUES ('A')) AS v (a)
WINDOW w AS (
  ORDER BY v.a
  MEASURES count(*) AS wcnt
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  PATTERN (B?)
  DEFINE B AS B.a = 'B'
)

produces this result:

 a | wcnt | count
---+--+---
 A |0 | 0
(1 row)

In this case, the pattern is B? instead of just B, which produces an 
*empty match* for the MEASURES to be applied over.

--
Vik Fearing





Re: logical decoding and replication of sequences, take 2

2023-07-28 Thread Ashutosh Bapat
On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra
 wrote:
>
> On 7/24/23 14:57, Ashutosh Bapat wrote:
> > ...
> >
> >>
> >>
> >> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
> >> I was thinking maybe we should have it in the transaction (because we
> >> need to do cleanup at the end). It seem a bit inconvenient, because then
> >> we'd need to either search htabs in all subxacts, or transfer the
> >> entries to the top-level xact (otoh, we already do that with snapshots),
> >> and cleanup on abort.
> >>
> >> What do you think?
> >
> > Hash table per transaction seems saner design. Adding it to the top
> > level transaction should be fine. The entry will contain an XID
> > anyway. If we add it to every subtransaction we will need to search
> > hash table in each of the subtransactions when deciding whether a
> > sequence change is transactional or not. Top transaction is a
> > reasonable trade off.
> >
>
> It's not clear to me what design you're proposing, exactly.
>
> If we track it in top-level transactions, then we'd need copy the data
> whenever a transaction is assigned as a child, and perhaps also remove
> it when there's a subxact abort.

I thought, esp. with your changes to assign xid, we will always know
the top level transaction when a sequence is assigned a relfilenode.
So the relfilenodes will always get added to the correct hash directly.
I didn't imagine a case where we will need to copy the hash table from
sub-transaction to top transaction. If that's true, yes it's
inconvenient.

As to the abort, don't we already remove entries on subtxn abort?
Having per transaction hash table doesn't seem to change anything
much.

>
> And we'd need to still search the hashes in all toplevel transactions on
> every sequence increment - in principle we can't have increment for a
> sequence created in another in-progress transaction, but maybe it's just
> not assigned yet.

We hold a strong lock on the sequence when changing its relfilenode. The
sequence whose relfilenode is being changed cannot be accessed by any
concurrent transaction. So I am not able to understand what you are
trying to say.

I think per (top level) transaction hash table is cleaner design. It
puts the hash table where it should be. But if that makes code
difficult, current design works too.

-- 
Best Wishes,
Ashutosh Bapat




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-07-28 Thread vignesh C
On Fri, 21 Jul 2023 at 13:00, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear hackers,
>
> > Based on the above, we are considering that we delay the timing of shutdown 
> > for
> > logical walsenders. The preliminary workflow is:
> >
> > 1. When a logical walsender receives the signal from the checkpointer, it
> >    consumes all of the WAL records, changes its state to
> >    WALSNDSTATE_STOPPING, and stops doing anything.
> > 2. Then the checkpointer does the shutdown checkpoint
> > 3. After that postmaster sends signal to walsenders, same as current
> > implementation.
> > 4. Finally logical walsenders process the shutdown checkpoint record and 
> > update
> > the
> >   confirmed_lsn after the acknowledgement from subscriber.
> >   Note that logical walsenders don't have to send a shutdown checkpoint 
> > record
> >   to subscriber but following keep_alive will help us to increment the
> > confirmed_lsn.
> > 5. All tasks are done, they exit.
> >
> > This mechanism ensures that the confirmed_lsn of active slots is same as the
> > current
> > WAL location of old publisher, so that 0003 patch would become more simpler.
> > We would not have to calculate the acceptable difference anymore.
> >
> > One thing we must consider is that any WALs must not be generated while
> > decoding
> > the shutdown checkpoint record. It causes the PANIC. IIUC the record leads
> > SnapBuildSerializationPoint(), which just serializes snapbuild or restores 
> > from
> > it, so the change may be acceptable. Thought?
>
> I've implemented the ideas from my previous proposal, PSA another patch set.
> Patch 0001 introduces the state WALSNDSTATE_STOPPING to logical walsenders. 
> The
> workflow remains largely the same as described in my previous post, with the
> following additions:
>
> * A flag has been added to track whether all the WALs have been flushed. The
>   logical walsender can only exit after the flag is set. This ensures that all
>   WALs are flushed before the termination of the walsender.
> * Cumulative statistics are now forcibly written before changing the state.
>   While the previous involved reporting stats upon process exit, the current 
> approach
>   must report earlier due to the checkpointer's termination timing. See 
> comments
>   in CheckpointerMain() and atop pgstat_before_server_shutdown().
> * At the end of processes, slots are now saved to disk.
>
>
> Patch 0002 adds --include-logical-replication-slots option to pg_upgrade,
> not changed from previous set.
>
> Patch 0003 adds a check function, which becomes simpler.
> The previous version calculated the "acceptable" difference between 
> confirmed_lsn
> and the current WAL position. This was necessary because shutdown records 
> could
> not be sent to subscribers, creating a disparity in these values. However, 
> this
> approach had drawbacks, such as needing adjustments if record sizes changed.
>
> Now, the record can be sent to subscribers, so the hacking is not needed 
> anymore,
> at least in the context of logical replication. The consistency is now 
> maintained
> by the logical walsenders, so slots created by the backend could not be.
> We must consider what should be...
>
> How do you think?

Here is a patch set which checks that there are no WAL records other than
the CHECKPOINT_SHUTDOWN record left to be consumed, based on the discussion
from [1].
Patches 0001 and 0002 are the same as the ones posted by Kuroda-san. Patch
0003 exposes pg_get_wal_records_content to get the WAL records along with
the WAL record type between a start and end LSN. The pg_walinspect contrib
module already exposes a function for this requirement; I have moved this
functionality so that it is exposed from the backend. Patch 0004 has a
slight change in the check function, to verify that there are no records
other than CHECKPOINT_SHUTDOWN left to be consumed. The attached patches
have the changes for the same.
Thoughts?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1Kem-J5NM7GJCgyKP84pEN6RsG6JWo%3D6pSn1E%2BiexL1Fw%40mail.gmail.com

Regards,
Vignesh
From 49360788fea01756aa9578da3dad34786bd1281a Mon Sep 17 00:00:00 2001
From: Hayato Kuroda 
Date: Fri, 14 Apr 2023 08:59:03 +
Subject: [PATCH 4/4] pg_upgrade: Add check function for
 --include-logical-replication-slots option

XXX: Actually, this commit disallows supporting slots which are created by user
backends. In the checking function we ensure that all the active slots have a
confirmed_flush_lsn which is the same as the current WAL position, and for such
slots they would not be the same. For slots which are used by logical
replication, the logical walsenders guarantee that at shutdown. Individual
slots, however, cannot be handled by walsenders, so their confirmed_flush_lsn
stays behind the shutdown checkpoint record.

Author: Hayato Kuroda
Reviewed-by: Wang Wei, Vignesh C
---
 src/bin/pg_upgrade/check.c|  59 
 .../t/003_logical_replication_slots.pl| 134 +++---
 2 files changed, 140 insertions(+), 53 deletions(-)

diff --git 

Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-28 Thread Anthonin Bonnefoy
Hi,

> I'd keep Autotools build up to date
Definitely. The patch is still in a rough state and updating Autotools fell
through the cracks.

> I'm currently playing with the patch and
> reviewing sources, if you need any cooperation - please let us know.
The goal for me was to get validation on the design and to check if there
was anything that would be a blocker before committing more time to the
project.

There are already several things I was planning on improving:
- The parse span is something I've added recently. I've realised that I
could rely on the stmtStartTimestamp to get the time spent in parsing in
the post parse hook so the code around this is still rough.
- There are probably race conditions on the reads and writes of the query
file that need to be fixed
- I'm using the same Span structure for everything while Executor spans
only need a small subset which is a bit wasteful. I've tried to use two
different shared buffers (base spans and spans with counters) but it was
making things more confusing.
- Finish commenting code and start writing the doc

So any comments, missing features and reviews on the current state of the
project are welcome.

Regards,
Anthonin

On Fri, Jul 28, 2023 at 9:35 AM Nikita Malakhov  wrote:

> Hi,
>
> I'd keep Autotools build up to date, because Meson is very limited in
> terms of not very
> up-to-date OS versions. Anyway, it's OK now. I'm currently playing with
> the patch and
> reviewing sources, if you need any cooperation - please let us know.
> I'm +1 for committing
> this patch after review.
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/
>


Re: Postgres v15 windows bincheck regression test failures

2023-07-28 Thread Noah Misch
On Fri, Jul 28, 2023 at 07:00:01AM +0300, Alexander Lakhin wrote:
> 28.07.2023 05:17, Noah Misch wrote:
> >On Tue, Jun 20, 2023 at 07:49:52AM -0400, Russell Foster wrote:
> >>/*
> >>* We don't use Unix-domain sockets on Windows by default, even if the
> >>* build supports them.  (See comment at remove_temp() for a reason.)
> >>* Override at your own risk.
> >>*/
> >>
> >>Is there some sort of race condition in the SSPI code that sometimes
> >>doesn't gracefully finish/close the connection when the backend
> >>decides to exit due to error?
> >No.  remove_temp() is part of test driver "pg_regress".  Non-test usage is
> >unaffected.  Even for test usage, folks have reported no failures from the
> >cause mentioned in the remove_temp() comment.
> 
> It seems to me that it's just another manifestation of bug #16678 ([1]).
> See also commits 6051857fc and 29992a6a5.
> 
> [1] 
> https://www.postgresql.org/message-id/flat/16678-253e48d34dc0c376%40postgresql.org

That was about a bug that appears when using TCP sockets.  The remove_temp()
comment involves code that doesn't run when using TCP sockets.  I don't think
they can be manifestations of the same phenomenon.




Re: WAL Insertion Lock Improvements

2023-07-28 Thread Bharath Rupireddy
On Wed, Jul 26, 2023 at 1:27 AM Andres Freund  wrote:
>
> > 0001 has been now applied.  I have done more tests while looking at
> > this patch since yesterday and was surprised to see higher TPS numbers
> > on HEAD with the same tests as previously, and the patch was still
> > shining with more than 256 clients.
>
> Just a small heads up:
>
> I just rebased my aio tree over the commit and promptly, on the first run, saw
> a hang. I did some debugging on that. Unfortunately repeated runs haven't
> repeated that hang, despite quite a bit of trying.

Hm. Please share the workload details, test scripts, system info and
any special settings, so that I can try to reproduce this in my setup.

> The symptom I was seeing is that all running backends were stuck in
> LWLockWaitForVar(), even though the value they're waiting for had
> changed. Which obviously "shouldn't be possible".

Were the backends stuck there indefinitely? IOW, did they get into a deadlock?

> It's of course possible that this is AIO specific, but I didn't see anything
> in stacks to suggest that.
>
> I do wonder if this possibly exposed an undocumented prior dependency on the
> value update always happening under the list lock.

I'm going through the other thread mentioned by Michael Paquier. I'm
wondering if the deadlock issue illustrated here
https://www.postgresql.org/message-id/55BB50D3.9000702%40iki.fi is
showing up again, because 71e4cc6b8e reduced the contention on
waitlist lock and made things *a bit* faster.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-07-28 Thread Matthias van de Meent
On Thu, 27 Jul 2023 at 16:01, Peter Geoghegan  wrote:
>
> On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent
>  wrote:
> > > Basically, the patch that added that feature had to revise the index
> > > AM API, in order to support a mode of operation where scans return
> > > groupings rather than tuples. Whereas this patch requires none of
> > > that. It makes affected index scans as similar as possible to
> > > conventional index scans.
> >
> > Hmm, yes. I see now where my confusion started. You called it out in
> > your first paragraph of the original mail, too, but that didn't help
> > me then:
> >
> > The wiki does not distinguish "Index Skip Scans" and "Loose Index
> > Scans", but these are not the same.
>
> A lot of people (myself included) were confused on this point for
> quite a while.

I've taken the liberty to update the "Loose indexscan" wiki page [0],
adding detail that Loose indexscans are distinct from Skip scans, and
showing some high-level distinguishing properties.
I also split the TODO entry for `` "loose" or "skip" scan `` into two,
and added links to the relevant recent threads so that it's clear
these are different (and that some previous efforts may have had a
confusing name).

I hope this will reduce the chance of future confusion between the two
different approaches to improving index scan performance.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: https://wiki.postgresql.org/wiki/Loose_indexscan




Re: Row pattern recognition

2023-07-28 Thread Tatsuo Ishii
>> Attached is the v3 patch. In this patch following changes are made.
> 
> Excellent.  Thanks!

You are welcome!

> A few quick comments:
> 
> - PERMUTE is still misspelled as PREMUTE

Oops. Will fix.

> - PATTERN variables do not have to exist in the DEFINE clause.  They are
>   considered TRUE if not present.

Do you think we really need this? I found a criticism regarding this.

https://link.springer.com/article/10.1007/s13222-022-00404-3
"3.2 Explicit Definition of All Row Pattern Variables"

What do you think?

>> - I am working on making window aggregates RPR aware now. The
>>implementation is in progress and far from completeness. An example
>>is below. I think row 2, 3, 4 of "count" column should be NULL
>>instead of 3, 2, 0, 0. Same thing can be said to other
>>rows. Probably this is an effect of moving aggregate but I still
>>studying the window aggregation code.
> 
> This tells me again that RPR is not being run in the right place.  All
> windowed aggregates and frame-level window functions should Just Work
> with no modification.

I am not touching each aggregate function. I am modifying
eval_windowaggregates() in nodeWindowAgg.c, which calls each aggregate
function. Do you think it's not the right place to make window
aggregates RPR aware?

>> SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM
>> stock
>>   WINDOW w AS (
>>   PARTITION BY company
>>   ORDER BY tdate
>>   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>>   AFTER MATCH SKIP PAST LAST ROW
>>   INITIAL
>>   PATTERN (START UP+ DOWN+)
>>   DEFINE
>>START AS TRUE,
>>UP AS price > PREV(price),
>>DOWN AS price < PREV(price)
>> );
>>   company  |   tdate| first_value | count
>> --++-+---
>>   company1 | 2023-07-01 | 100 | 4
>>   company1 | 2023-07-02 | | 3
>>   company1 | 2023-07-03 | | 2
>>   company1 | 2023-07-04 | | 0
>>   company1 | 2023-07-05 | | 0
>>   company1 | 2023-07-06 |  90 | 4
>>   company1 | 2023-07-07 | | 3
>>   company1 | 2023-07-08 | | 2
>>   company1 | 2023-07-09 | | 0
>>   company1 | 2023-07-10 | | 0
>>   company2 | 2023-07-01 |  50 | 4
>>   company2 | 2023-07-02 | | 3
>>   company2 | 2023-07-03 | | 2
>>   company2 | 2023-07-04 | | 0
>>   company2 | 2023-07-05 | | 0
>>   company2 | 2023-07-06 |  60 | 4
>>   company2 | 2023-07-07 | | 3
>>   company2 | 2023-07-08 | | 2
>>   company2 | 2023-07-09 | | 0
>>   company2 | 2023-07-10 | | 0
> 
> In this scenario, row 1's frame is the first 5 rows and specified SKIP
> PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are
> skipped) and the result of the outer count should be 0 for all of them
> because there are no rows in the frame.

Ok. I just want to make sure: if it's other aggregates like sum or
avg, the result of the outer aggregates should be NULL.

> When we get to adding count in the MEASURES clause, there will be a
> difference between no match and empty match, but that does not apply
> here.

Can you elaborate more? I understand that "no match" and "empty match"
are different things. But I do not understand how the difference
affects the result of count.

>> I am going to add this thread to CommitFest and plan to add both of
>> you as reviewers. Thanks in advance.
> 
> My pleasure.  Thank you for working on this difficult feature.

Thank you for agreeing to be registered as a reviewer. Your comments
are really helpful.
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: cataloguing NOT NULL constraints

2023-07-28 Thread Alvaro Herrera
> > Given the following sequence:
> > 
> > drop table if exists p,c;
> > create table p(a int primary key);
> > create table c() inherits (p);
> > alter table p drop constraint p_pkey;

> > However, c.a remains non-nullable, with a NOT NULL constraint that
> > claims to be inherited:
> > 
> > \d+ c
> > Table "public.c"
> >  Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description
> > --------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
> >  a      | integer |           | not null |         | plain   |             |              |
> > Not null constraints:
> > "c_a_not_null" NOT NULL "a" (inherited)
> > Inherits: p
> > Access method: heap
> > 
> > That's a problem, because now the NOT NULL constraint on c cannot be
> > dropped (attempting to drop it on c errors out because it thinks it's
> > inherited, but it can't be dropped via p, because p.a is already
> > nullable).

So I implemented a fix for this (namely: fix the inhcount to be 0
initially), and it works well, but it does cause a definitional problem:
any time we create a child table that inherits from another table that
has a primary key, all the columns in the child table will get normal,
visible, droppable NOT NULL constraints.  Thus, pg_dump for example will
output that constraint exactly as if the user had specified it in the
child's CREATE TABLE command.  By itself this doesn't bother me, though
I admit it seems a little odd.

When you restore such a setup from pg_dump, things work perfectly -- I
mean, you don't get a second constraint.  But if you do drop the
constraint, then it will be reinstated by the next pg_dump as if you
hadn't dropped it, by way of it springing to life from the PK.

To avoid that, one option would be to make this NN constraint
undroppable ...  but I don't see how.  Another option might be to add a
pg_depend row that links the NOT NULL constraint to its PK constraint.
But this would be a strange case that occurs nowhere else, since other
NOT NULL constraints don't have such pg_depend rows.  Also, I won't know
how pg_dump likes this until I implement it.
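
If we went the pg_depend route, the recording part would be something like
this (sketch only; notnull_constraint_oid and pk_constraint_oid stand for
whatever OIDs we have at hand at that point):

ObjectAddress nn_addr;
ObjectAddress pk_addr;

ObjectAddressSet(nn_addr, ConstraintRelationId, notnull_constraint_oid);
ObjectAddressSet(pk_addr, ConstraintRelationId, pk_constraint_oid);

/* make the NOT NULL constraint undroppable except via the PK */
recordDependencyOn(&nn_addr, &pk_addr, DEPENDENCY_INTERNAL);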

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Support worker_spi to execute the function dynamically.

2023-07-28 Thread Alvaro Herrera
On 2023-Jul-28, Michael Paquier wrote:

> So you have faced a race condition where the commit of the transaction
> doing the schema creation for the static workers is delayed long
> enough that the dynamic workers don't see it, and bumped on a catalog
> conflict when they try to create the same schemas.
>
> Having each bgworker on its own schema would be enough to prevent
> conflicts, but I'd like to add a second thing: a check on
> pg_stat_activity.wait_event after starting the workers.  I have added
> something like that in the patch I have posted today for the custom
> wait events at [1] and it enforces the startup sequences of the
> workers in a stricter way.

Hmm, I think having all the workers doing their work in the same table is
better -- if nothing else, because it gives us the opportunity to show
how to use some other coding technique (but also because we are forced
to write the SQL code in a way that's correct for potentially multiple
concurrent workers, which sounds useful to demonstrate).  Can't we
instead solve the race condition by having some shared resource that
blocks the other workers from proceeding until the schema has been
created?  Perhaps an LWLock, or a condition variable, or an advisory
lock.
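
For instance, a minimal sketch of the advisory-lock variant, run inside
the worker's initialization transaction just before the schema is checked
and created (the key value 42 is arbitrary):

int         ret;

/* serialize schema creation across all worker_spi workers */
ret = SPI_execute("SELECT pg_advisory_xact_lock(42)", false, 0);
if (ret != SPI_OK_SELECT)
    elog(FATAL, "failed to acquire advisory lock: error code %d", ret);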

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: [PoC] Reducing planning time when tables have many partitions

2023-07-28 Thread Ashutosh Bapat
Hi Yuya, Andrey,

On Fri, Jul 28, 2023 at 9:58 AM Andrey Lepikhov
 wrote:

> >>
> > Discovering quality of partition pruning at the stage of execution
> > initialization and using your set of patches I have found some dubious
> > results with performance degradation. Look into the test case in
> > attachment.
> > Here is three queries. Execution times:
> > 1 - 8s; 2 - 30s; 3 - 131s (with your patch set).
> > 1 - 5s; 2 - 10s; 3 - 33s (current master).
> >
> > Maybe it is a false alarm, but on my laptop I see this degradation at
> > every launch.
> Sorry for this. It was definitely a false alarm. In this patch,
> assertion checking adds much overhead. After switching it off, I found
> out that this feature solves my problem with a quick pass through the
> members of an equivalence class. Planning time results for the queries
> from the previous letter:
> 1 - 0.4s, 2 - 1.3s, 3 - 1.3s; (with the patches applied)
> 1 - 5s; 2 - 8.7s; 3 - 22s; (current master).

I measured planning time using my scripts setup.sql and queries.sql
attached to [1], with assert and non-assert builds, using your patch.
The timings are recorded in the attached spreadsheet. I have the
following observations:

1. The patchset improves the planning time of queries involving
partitioned tables by an integral factor, both with and without
partitionwise join. The speedup is 5x to 21x in my experiment. That's
huge.
2. There's a slight degradation in the planning time of queries
involving unpartitioned tables, but I have usually seen that much
variance anyway.
3. An assert- and debug-enabled build shows degradation in planning
time in all the cases.
4. There is a substantial memory increase in all the cases;
percentage-wise it is most pronounced when partitionwise join is not
used.

Given that most developers run assert-enabled builds, it would be good
to bring down the degradation there while keeping the excellent speedup
in non-assert builds.
Queries on partitioned tables eat a lot of memory anyway; increasing
that further should be avoided.

I have not studied the patches. But I think the memory increase has to
do with our Bitmapset structure. It's space inefficient when there are
thousands of partitions involved. See my comment at [2]

[1] 
https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAExHW5s4EqY43oB%3Dne6B2%3D-xLgrs9ZGeTr1NXwkGFt2j-OmaQQ%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat


planning time measurement.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2023-07-28 Thread Ashutosh Bapat
Hi Tom, Richard,

On Mon, Jul 24, 2023 at 8:17 AM Richard Guo  wrote:
>
> Thanks for pushing it!

With this fix, I saw a noticeable increase in the memory consumption
of the planner. I was running the experiments mentioned in [1]. The
reason is that the Bitmapsets created by bms_union() are not freed
during planning, and when there are thousands of child joins involved,
the bitmapsets take up a lot of memory and there are a large number of
them.

The attached 0002 patch fixes the memory consumption by calculating the
appinfos only once and using them twice. The numbers look like below:

 Number of tables joined | without patch | with patch
-------------------------+---------------+------------
                       2 |      40.8 MiB |   40.3 MiB
                       3 |     151.6 MiB |  146.9 MiB
                       4 |     463.9 MiB |  445.5 MiB
                       5 |    1663.9 MiB | 1563.3 MiB

The memory consumption is more prominent at a higher number of joins,
as that exponentially increases the number of child joins.

The attached setup.sql, queries.sql and patch 0001 were used to measure
memory, similar to [1].

I don't think it's a problem with the patch itself. We should be able
to use the Bitmapset APIs similar to what the patch is doing. But
there's a problem with our Bitmapset implementation: it's not space
efficient when thousands of partitions are involved. A quick
calculation reveals this.

If the number of partitions is 1000, the matching partitions will
usually be 1000, 2000, 3000 and so on. Thus the bitmapset representing
the relids will be {b 1000, 2000, 3000, ...}. To represent a single
member around 1000, the current Bitmapset implementation will allocate
1000/8 = 125 bytes of memory. A 5-way join will require 125 * 5 = 625
bytes of memory. This is true even for lower relid numbers, since they
will be 1000 bits apart, e.g. (1, 1001, 2001, 3001, ...). So 1000 such
child joins require 625 KiB of memory. Doing this as many times as the
number of joins, we get 120 * 625 KiB = 75 MiB, which is close to the
memory difference we see above.

Even if we allocate a list to hold 5 integers it will not take 625
bytes. I think we need to improve our Bitmapset representation to be
efficient in such cases. Of course whatever representation we choose
has to be space efficient for a small number of tables as well and
should gel well with our planner logic. So I guess some kind of
dynamic representation which changes the underlying layout based on
the contents is required. I have looked up past hacker threads to see
if this has been discussed previously.

I don't think this is the thread to discuss it and also I am not
planning to work on it in the immediate future. But I thought I would
mention it where I found it.

[1] 
https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
From 0f00fc48b9ebf73b54724ba82cec4a69d5f63846 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Wed, 12 Jul 2023 14:34:14 +0530
Subject: [PATCH 1/2] Report memory used for planning a query in EXPLAIN
 ANALYZE

The memory used in the CurrentMemoryContext and its children is sampled
before and after calling pg_plan_query() from ExplainOneQuery(). The
difference in the two samples is reported as the memory consumed while
planning the query. This may not account for the memory allocated in
memory contexts which are not children of CurrentMemoryContext. These
contexts are usually other long lived contexts, e.g.
CacheMemoryContext, which are shared by all the queries run in a
session. The consumption in those can not be attributed only to a given
query and hence should not be reported any way.

The memory consumption is reported as "Planning Memory" property in
EXPLAIN ANALYZE output.

Ashutosh Bapat
---
 src/backend/commands/explain.c | 12 ++--
 src/backend/commands/prepare.c |  7 ++-
 src/backend/utils/mmgr/mcxt.c  | 19 +++
 src/include/commands/explain.h |  3 ++-
 src/include/utils/memutils.h   |  1 +
 5 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8570b14f62..9f859949f0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -397,16 +397,20 @@ ExplainOneQuery(Query *query, int cursorOptions,
 	planduration;
 		BufferUsage bufusage_start,
 	bufusage;
+		Size		mem_consumed;
 
 		if (es->buffers)
 			bufusage_start = pgBufferUsage;
 		INSTR_TIME_SET_CURRENT(planstart);
+		mem_consumed = MemoryContextMemUsed(CurrentMemoryContext);
 
 		/* plan the query */
 		plan = pg_plan_query(query, queryString, cursorOptions, params);
 
 		INSTR_TIME_SET_CURRENT(planduration);
 		INSTR_TIME_SUBTRACT(planduration, planstart);
+		mem_consumed = MemoryContextMemUsed(CurrentMemoryContext)
+		- mem_consumed;
 
 		/* calc differences of buffer counters. */
 		if (es->buffers)
@@ -417,7 +421,7 @@ ExplainOneQuery(Query *query, int cursorOptions,
 
 		/* run it (if needed) and produce 

Re: logical decoding and replication of sequences, take 2

2023-07-28 Thread Amit Kapila
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
 wrote:
>
> On 7/26/23 09:27, Amit Kapila wrote:
> > On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila  wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
>

While studying the changes for this proposal and related areas, I have
a few comments:
1. I think you need to advance the origin if it is changed due to
copy_sequence(), otherwise, if the sync worker restarts after
SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
value.

2. Between the time of SYNCDONE and READY state, the patch can skip
applying non-transactional sequence changes even if it should apply
it. The reason is that during that state change
should_apply_changes_for_rel() decides whether to apply change based
on the value of remote_final_lsn which won't be set for
non-transactional change. I think we need to send the start LSN of a
non-transactional record and then use that as remote_final_lsn for
such a change.

3. For non-transactional sequence change apply, we don't set
replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
we are doing in apply_handle_commit_internal() before calling
CommitTransactionCommand(). So, that can lead to the origin moving
backwards after restart which will lead to requesting and applying the
same changes again and for that period of time sequence can go
backwards. This needs some more thought as to what is the correct
behaviour/solution for this.

4. BTW, while checking this behaviour, I noticed that the initial sync
worker for sequence mentions the table in the LOG message: "LOG:
logical replication table synchronization worker for subscription
"mysub", table "s" has finished". Won't it be better here to refer to
it as a sequence?

-- 
With Regards,
Amit Kapila.




Re: Row pattern recognition

2023-07-28 Thread Vik Fearing

On 7/26/23 14:21, Tatsuo Ishii wrote:

Attached is the v3 patch. In this patch following changes are made.


Excellent.  Thanks!

A few quick comments:

- PERMUTE is still misspelled as PREMUTE

- PATTERN variables do not have to exist in the DEFINE clause.  They are 
considered TRUE if not present.



(1) I completely changed the pattern matching engine so that it
performs backtracking. Now the engine evaluates all pattern elements
defined in PATTERN against each row, saving matched pattern variables
in a string per row. For example if the pattern element A and B
evaluated to true, a string "AB" is created for current row.

This continues until all pattern matching fails or encounters the end
of full window frame/partition. After that, the pattern matching
engine creates all possible "pattern strings" and apply the regular
expression matching to each. For example if we have row 0 = "AB" row 1
= "C", possible pattern strings are: "AC" and "BC".

If it matches, the length of matching substring is saved. After all
possible trials are done, the longest matching substring is chosen and
it becomes the width (number of rows) in the reduced window frame.

See row_is_in_reduced_frame, search_str_set and search_str_set_recurse
in nodeWindowAggs.c for more details. For now I use a naive depth
first search and probably there is a lot of rooms for optimization
(for example rewriting it without using
recursion). Suggestions/patches are welcome.


My own reviews will only focus on correctness for now.  Once we get a 
good set of regression tests all passing, I will focus more on 
optimization.  Of course, others might want to review the performance now.



Vik Fearing wrote:

I strongly disagree with this.  Window function do not need to know
how the frame is defined, and indeed they should not.
WinGetFuncArgInFrame should answer yes or no and the window function
just works on that. Otherwise we will get extension (and possibly even
core) functions that don't handle the frame properly.


So I moved row_is_in_reduce_frame into WinGetFuncArgInFrame so that
those window functions are not needed to be changed.

(3) Window function rpr was removed. We can use first_value instead.


Excellent.


(4) Remaining tasks/issues.

- For now I disable WinSetMarkPosition because RPR often needs to
   access a row before the mark is set. We need to fix this in the
   future.


Noted, and agreed.


- I am working on making window aggregates RPR aware now. The
   implementation is in progress and far from completeness. An example
   is below. I think row 2, 3, 4 of "count" column should be NULL
   instead of 3, 2, 0, 0. Same thing can be said to other
   rows. Probably this is an effect of moving aggregate but I still
   studying the window aggregation code.


This tells me again that RPR is not being run in the right place.  All 
windowed aggregates and frame-level window functions should Just Work 
with no modification.



SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM stock
  WINDOW w AS (
  PARTITION BY company
  ORDER BY tdate
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  AFTER MATCH SKIP PAST LAST ROW
  INITIAL
  PATTERN (START UP+ DOWN+)
  DEFINE
   START AS TRUE,
   UP AS price > PREV(price),
   DOWN AS price < PREV(price)
);
  company  |   tdate| first_value | count
--++-+---
  company1 | 2023-07-01 | 100 | 4
  company1 | 2023-07-02 | | 3
  company1 | 2023-07-03 | | 2
  company1 | 2023-07-04 | | 0
  company1 | 2023-07-05 | | 0
  company1 | 2023-07-06 |  90 | 4
  company1 | 2023-07-07 | | 3
  company1 | 2023-07-08 | | 2
  company1 | 2023-07-09 | | 0
  company1 | 2023-07-10 | | 0
  company2 | 2023-07-01 |  50 | 4
  company2 | 2023-07-02 | | 3
  company2 | 2023-07-03 | | 2
  company2 | 2023-07-04 | | 0
  company2 | 2023-07-05 | | 0
  company2 | 2023-07-06 |  60 | 4
  company2 | 2023-07-07 | | 3
  company2 | 2023-07-08 | | 2
  company2 | 2023-07-09 | | 0
  company2 | 2023-07-10 | | 0


In this scenario, row 1's frame is the first 5 rows and specified SKIP 
PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are 
skipped) and the result of the outer count should be 0 for all of them 
because there are no rows in the frame.


When we get to adding count in the MEASURES clause, there will be a 
difference between no match and empty match, but that does not apply here.



I am going to add this thread to CommitFest and plan to add both of
you as reviewers. Thanks in advance.


My pleasure.  Thank you for working on this difficult feature.
--
Vik Fearing





Re: Support worker_spi to execute the function dynamically.

2023-07-28 Thread Michael Paquier
On Fri, Jul 28, 2023 at 02:11:48PM +0530, Bharath Rupireddy wrote:
> I don't think something like [1] is complex. It makes worker_spi
> foolproof. Rather, the other approach proposed, that is to provide
> non-conflicting worker IDs to worker_spi_launch in the TAP test file,
> looks complicated to me. And it's easy for someone to come, add a test
> case with conflicting IDs input to worker_spi_launch and end up in the
> same state that we're in now.

Sure, but that's not really something that worries me for a template
such as this one, for the sake of these tests.  So I'd leave things to
be as they are, slightly simpler.  That's a minor point, for sure :)
--
Michael


signature.asc
Description: PGP signature


Re: Row pattern recognition

2023-07-28 Thread Vik Fearing

On 7/28/23 09:09, Tatsuo Ishii wrote:

We already recalculate a frame each time a row is processed even
without RPR. See ExecWindowAgg.


Yes, after each row.  Not for each function.


Ok, I understand now. Closer look at the code, I realized that each
window function calls update_frameheadpos, which computes the frame
head position. But actually it checks winstate->framehead_valid and if
it's already true (probably by other window function), then it does
nothing.


Also RPR always requires a frame option ROWS BETWEEN CURRENT ROW,
which means the frame head is changed each time current row position
changes.


Off topic for now: I wonder why this restriction is in place and
whether we should respect or ignore it.  That is a discussion for
another time, though.


My guess is, it is because other than ROWS BETWEEN CURRENT ROW has
little or no meaning. Consider following example:


Yes, that makes sense.


I strongly disagree with this.  Window function do not need to know
how the frame is defined, and indeed they should not.

We already break the rule by defining *support functions. See
windowfuncs.c.

The support functions don't know anything about the frame, they just
know when a window function is monotonically increasing and execution
can either stop or be "passed through".


I see following code in window_row_number_support:

	/*
	 * The frame options can always become "ROWS BETWEEN UNBOUNDED
	 * PRECEDING AND CURRENT ROW".  row_number() always just increments by
	 * 1 with each row in the partition.  Using ROWS instead of RANGE
	 * saves effort checking peer rows during execution.
	 */
	req->frameOptions = (FRAMEOPTION_NONDEFAULT |
						 FRAMEOPTION_ROWS |
						 FRAMEOPTION_START_UNBOUNDED_PRECEDING |
						 FRAMEOPTION_END_CURRENT_ROW);

I think it not only knows about frame but it even changes the frame
options. This seems far from "don't know anything about the frame", no?


That's the planner support function.  The row_number() function itself 
is not even allowed to *have* a frame, per spec.  We allow it, but as 
you can see from that support function, we completely replace it.


So all of the partition-level window functions are not affected by RPR 
anyway.



I have two comments about this:

It isn't just for convenience, it is for correctness.  The window
functions do not need to know which rows they are *not* operating on.

There is no such thing as a "full" or "reduced" frame.  The standard
uses those terms to explain the difference between before and after
RPR is applied, but window functions do not get to choose which frame
they apply over.  They only ever apply over the reduced window frame.


I agree that "full window frame" and "reduced window frame" do not
exist at the same time, and in the end (after computation of reduced
frame), only "reduced" frame is visible to window
functions/aggregates. But I still do think that "full window frame"
and "reduced window frame" are important concepts to explain/understand
how RPR works.


If we are just using those terms for documentation, then okay.
--
Vik Fearing





Re: postgres_fdw: wrong results with self join + enable_nestloop off

2023-07-28 Thread Etsuro Fujita
Hi Richard,

On Mon, Jul 24, 2023 at 11:45 AM Richard Guo  wrote:
> On Fri, Jul 21, 2023 at 8:51 PM Etsuro Fujita  wrote:
>> * In this bit I changed the last argument to NIL, which would be
>> nitpicking, though.
>>
>> @@ -1038,7 +1038,7 @@ postgresGetForeignPaths(PlannerInfo *root,
>>   add_path(baserel, (Path *) path);
>>
>>   /* Add paths with pathkeys */
>> - add_paths_with_pathkeys_for_rel(root, baserel, NULL);
>> + add_paths_with_pathkeys_for_rel(root, baserel, NULL, NULL);

> This was my oversight.

No.  IIUC, I think that that would work well as-proposed, but I
changed it as such, for readability.

> So the two patches both look good to me now.

Cool!  I pushed the first patch after polishing it a little bit, so
here is a rebased version of the second patch, in which I modified the
ForeignPath and CustomPath cases in reparameterize_path_by_child() to
reflect the new members fdw_restrictinfo and custom_restrictinfo, for
safety, and tweaked a comment a bit.

Thanks for looking!

Best regards,
Etsuro Fujita


0002-Allow-join-pushdown-even-if-pseudoconstant-quals-v3.patch
Description: Binary data


Re: Avoid unused value (src/fe_utils/print.c)

2023-07-28 Thread Karina Litskevich
On Tue, Jul 25, 2023 at 1:04 AM Ranier Vilela  wrote:

> Checked today, Coverity does not complain about need_recordsep.
>

Great! Thanks.
So v2 patch makes Coverity happy, and as for me doesn't make the code
less readable. Does anyone have any objections?

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/


Re: [PoC] Reducing planning time when tables have many partitions

2023-07-28 Thread Yuya Watari
Hello,

On Fri, Jul 28, 2023 at 1:27 PM Andrey Lepikhov
 wrote:
> Sorry for this. It was definitely a false alarm. In this patch,
> assertion checking adds much overhead. After switching it off, I found
> out that this feature solves my problem with a quick pass through the
> members of an equivalence class. Planning time results for the queries
> from the previous letter:
> 1 - 0.4s, 2 - 1.3s, 3 - 1.3s; (with the patches applied)
> 1 - 5s; 2 - 8.7s; 3 - 22s; (current master).
>
> I have attached flamegraph that shows query 2 planning process after
> applying this set of patches. As you can see, overhead at the
> equivalence class routines has gone.

I really appreciate testing the patches and sharing your results. The
results are interesting because they show that our optimization
effectively reduces planning time for your workload containing
different queries than I have used in my benchmarks.

Thank you again for reviewing this.

-- 
Best regards,
Yuya Watari




Re: Support worker_spi to execute the function dynamically.

2023-07-28 Thread Bharath Rupireddy
On Fri, Jul 28, 2023 at 1:26 PM Michael Paquier  wrote:
>
> On Fri, Jul 28, 2023 at 10:47:39AM +0530, Bharath Rupireddy wrote:
> > +# check their existence.  Use IDs that do not overlap with the schemas 
> > created
> > +# by the previous workers.
> >
> > While using different IDs in tests is a simple fix, -1 for it. I'd
> > prefer if worker_spi uses different schema prefixes for static and
> > dynamic bg workers to avoid conflicts. We can either look at
> > MyBgworkerEntry->bgw_type in worker_spi_main and have schema name as
> > '{static, dyamic}_worker_schema_%d', id or pass schema name in
> > bgw_extra.
>
> For the sake of a test module, I am not really convinced that there is
> any need to go down to such complexity with the names of the schemas
> created.

I don't think something like [1] is complex. It makes worker_spi
foolproof. Rather, the other approach proposed, that is to provide
non-conflicting worker IDs to worker_spi_launch in the TAP test file,
looks complicated to me. And it's easy for someone to come, add a test
case with conflicting IDs input to worker_spi_launch and end up in the
same state that we're in now.

[1]
diff --git a/src/test/modules/worker_spi/t/001_worker_spi.pl
b/src/test/modules/worker_spi/t/001_worker_spi.pl
index c293871313..700530afc7 100644
--- a/src/test/modules/worker_spi/t/001_worker_spi.pl
+++ b/src/test/modules/worker_spi/t/001_worker_spi.pl
@@ -27,16 +27,16 @@ is($result, 't', "dynamic bgworker launched");
 $node->poll_query_until(
'postgres',
qq[SELECT count(*) > 0 FROM information_schema.tables
-   WHERE table_schema = 'schema4' AND table_name = 'counted';]);
+   WHERE table_schema = 'dynamic_worker_schema4' AND
table_name = 'counted';]);
 $node->safe_psql('postgres',
-   "INSERT INTO schema4.counted VALUES ('total', 0), ('delta', 1);");
+   "INSERT INTO dynamic_worker_schema4.counted VALUES ('total',
0), ('delta', 1);");
 # Issue a SIGHUP on the node to force the worker to loop once, accelerating
 # this test.
 $node->reload;
 # Wait until the worker has processed the tuple that has just been inserted.
 $node->poll_query_until('postgres',
-   qq[SELECT count(*) FROM schema4.counted WHERE type = 'delta';], '0');
-$result = $node->safe_psql('postgres', 'SELECT * FROM schema4.counted;');
+   qq[SELECT count(*) FROM dynamic_worker_schema4.counted WHERE
type = 'delta';], '0');
+$result = $node->safe_psql('postgres', 'SELECT * FROM
dynamic_worker_schema4.counted;');
 is($result, qq(total|1), 'dynamic bgworker correctly consumed tuple data');

 note "testing bgworkers loaded with shared_preload_libraries";
diff --git a/src/test/modules/worker_spi/worker_spi.c
b/src/test/modules/worker_spi/worker_spi.c
index 903dcddef9..02b4204aa2 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -135,10 +135,19 @@ worker_spi_main(Datum main_arg)
int index = DatumGetInt32(main_arg);
worktable  *table;
StringInfoData buf;
-   charname[20];
+   charname[NAMEDATALEN];

table = palloc(sizeof(worktable));
-   sprintf(name, "schema%d", index);
+
+   /*
+* Use different schema names for static and dynamic bg workers to avoid
+* name conflicts.
+*/
+   if (strcmp(MyBgworkerEntry->bgw_type, "worker_spi") == 0)
+   sprintf(name, "worker_schema%d", index);
+   else if (strcmp(MyBgworkerEntry->bgw_type, "worker_spi dynamic") == 0)
+   sprintf(name, "dynamic_worker_schema%d", index);
+
table->schema = pstrdup(name);
table->name = pstrdup("counted");

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2023-07-28 Thread Richard Guo
On Thu, Jul 27, 2023 at 10:10 PM Ashutosh Bapat <
ashutosh.bapat@gmail.com> wrote:

> The attached patch (0002) fixes this issue by
> 1. declaring child SpecialJoinInfo as a local variable, thus
> allocating memory on the stack instead of CurrentMemoryContext. The
> memory on the stack is freed automatically.
> 2. Freeing the Relids in SpecialJoinInfo explicitly after
> SpecialJoinInfo has been used.


It doesn't seem too expensive to translate SpecialJoinInfos, so I think
it's OK to construct and free the SpecialJoinInfo for a child join on
the fly.  So the approach in 0002 looks reasonable to me.  But there is
something that needs to be fixed in 0002.

+   bms_free(child_sjinfo->commute_above_l);
+   bms_free(child_sjinfo->commute_above_r);
+   bms_free(child_sjinfo->commute_below_l);
+   bms_free(child_sjinfo->commute_below_r);

These four members in SpecialJoinInfo only contain outer join relids.
They do not need to be translated.  So they would reference the same
memory areas in child_sjinfo as in parent_sjinfo.  We should not free
them, otherwise they would become dangling pointers in parent sjinfo.
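
Roughly, the pattern could look like this (just a sketch to illustrate the
point, not the actual 0002 patch; only the translated Relids get freed):

static void
free_child_join_sjinfo(SpecialJoinInfo *child_sjinfo)
{
	/* These fields were translated for the child join, so free them. */
	bms_free(child_sjinfo->min_lefthand);
	bms_free(child_sjinfo->min_righthand);
	bms_free(child_sjinfo->syn_lefthand);
	bms_free(child_sjinfo->syn_righthand);

	/*
	 * commute_above_l/r and commute_below_l/r contain only outer-join
	 * relids and are not translated, so they still point into the parent
	 * SpecialJoinInfo's sets -- leave them alone.
	 */
}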

Thanks
Richard


Re: Support worker_spi to execute the function dynamically.

2023-07-28 Thread Michael Paquier
On Fri, Jul 28, 2023 at 10:47:39AM +0530, Bharath Rupireddy wrote:
> +# check their existence.  Use IDs that do not overlap with the schemas 
> created
> +# by the previous workers.
> 
> While using different IDs in tests is a simple fix, -1 for it. I'd
> prefer if worker_spi uses different schema prefixes for static and
> dynamic bg workers to avoid conflicts. We can either look at
> MyBgworkerEntry->bgw_type in worker_spi_main and have schema name as
> '{static, dynamic}_worker_schema_%d', id or pass schema name in
> bgw_extra.

For the sake of a test module, I am not really convinced that there is
any need to go down to such complexity with the names of the schemas
created.
--
Michael




Re: pg_rewind fails with in-place tablespace

2023-07-28 Thread Michael Paquier
On Tue, Jul 25, 2023 at 04:36:42PM +0900, Michael Paquier wrote:
> On Wed, Jul 19, 2023 at 09:31:35PM +0800, 赵锐(惜元) wrote:
>>  To help reproduce the failure, I have attached a tap test. And I am
>>  pleased to say that I have also identified a solution for this
>>  problem, which I have included in the patch. 
>>  Thank you for your attention to this matter.
> 
> Issue reproduced here, and agreed that we'd better do something about
> that.  I am not sure if your patch is right for the job though, but
> I'll try to study that a bit more.

It took me some time to remember that for the case of a local source
we'd finish by using recurse_dir() and consider the in-place
tablespace as a regular directory, so a fix located in
libpq_traverse_files() sounds good to me.

+   if (strncmp(link_target, "pg_tblspc/", strlen("pg_tblspc/")) == 0)
+   type = FILE_TYPE_DIRECTORY;
+   else
+   type = FILE_TYPE_SYMLINK;

However, this is not consistent with the other places where we detect
whether an in-place tablespace is used, like pg_basebackup.c, where we
rely on the fact that the tablespace path is a relative path, using
is_absolute_path() to tell the difference between a normal and an
in-place tablespace.  I would choose consistency and do the same here,
checking if we have an absolute or relative path, depending on the
result of pg_tablespace_location().

Testing only for the creation of the tablespace is fine for the sake
of the report, but I would do slightly more here: create a table on
this tablespace with some data, and add a check_query() once pg_rewind
is done.

I am finishing with the attached.  Thoughts?
--
Michael
From 728bd093959f0d51c234f88729798165176fc0f6 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 28 Jul 2023 16:53:21 +0900
Subject: [PATCH] Fix pg_rewind with in-place tablespaces when source is remote

---
 src/bin/pg_rewind/libpq_source.c  | 11 ++-
 src/bin/pg_rewind/t/001_basic.pl  | 20 
 src/bin/pg_rewind/t/RewindTest.pm |  1 +
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 0d8e9ee2d1..417c74cfef 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -298,7 +298,16 @@ libpq_traverse_files(rewind_source *source, process_file_callback_t callback)
 		link_target = PQgetvalue(res, i, 3);
 
 		if (link_target[0])
-			type = FILE_TYPE_SYMLINK;
+		{
+			/*
+			 * In-place tablespaces are directories located in pg_tblspc/ with
+			 * relative paths.
+			 */
+			if (is_absolute_path(link_target))
+				type = FILE_TYPE_SYMLINK;
+			else
+				type = FILE_TYPE_DIRECTORY;
+		}
 		else if (isdir)
 			type = FILE_TYPE_DIRECTORY;
 		else
diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index 031594e14e..c7b48255a7 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -18,6 +18,12 @@ sub run_test
 	RewindTest::setup_cluster($test_mode);
 	RewindTest::start_primary();
 
+	# Create an in-place tablespace with some data on it.
+	primary_psql("CREATE TABLESPACE space_test LOCATION ''");
+	primary_psql("CREATE TABLE space_tbl (d text) TABLESPACE space_test");
+	primary_psql(
+		"INSERT INTO space_tbl VALUES ('in primary, before promotion')");
+
 	# Create a test table and insert a row in primary.
 	primary_psql("CREATE TABLE tbl1 (d text)");
 	primary_psql("INSERT INTO tbl1 VALUES ('in primary')");
@@ -78,6 +84,13 @@ sub run_test
 		"insert into drop_tbl values ('in primary, after promotion')");
 	primary_psql("DROP TABLE drop_tbl");
 
+	# Insert some data in the in-place tablespace for the old primary and
+	# the standby.
+	primary_psql(
+		"INSERT INTO space_tbl VALUES ('in primary, after promotion')");
+	standby_psql(
+		"INSERT INTO space_tbl VALUES ('in standby, after promotion')");
+
 	# Before running pg_rewind, do a couple of extra tests with several
 	# option combinations.  As the code paths taken by those tests
 	# do not change for the "local" and "remote" modes, just run them
@@ -145,6 +158,13 @@ sub run_test
 
 	RewindTest::run_pg_rewind($test_mode);
 
+	check_query(
+		'SELECT * FROM space_tbl ORDER BY d',
+		qq(in primary, before promotion
+in standby, after promotion
+),
+		'table content');
+
 	check_query(
 		'SELECT * FROM tbl1',
 		qq(in primary
diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm
index 4957791e94..8fbbd521cb 100644
--- a/src/bin/pg_rewind/t/RewindTest.pm
+++ b/src/bin/pg_rewind/t/RewindTest.pm
@@ -131,6 +131,7 @@ sub setup_cluster
 	$node_primary->append_conf(
 		'postgresql.conf', qq(
 wal_keep_size = 320MB
+allow_in_place_tablespaces = on
 ));
 	return;
 }
-- 
2.40.1





Re: add timing information to pg_upgrade

2023-07-28 Thread Bharath Rupireddy
On Fri, Jul 28, 2023 at 5:21 AM Nathan Bossart  wrote:
>
> This information can be used to better understand where the time is going
> and to validate future improvements.  I'm open to suggestions on formatting
> the timing information, assuming folks are interested in this idea.
>
> Thoughts?

+1 for adding time taken.

Some comments on the patch:
1.
+gettimeofday(&step_start, NULL);
+gettimeofday(&step_end, NULL);
+start_ms = (step_start.tv_sec * 1000L) + (step_start.tv_usec / 1000L);
+end_ms = (step_end.tv_sec * 1000L) + (step_end.tv_usec / 1000L);
+elapsed_ms = end_ms - start_ms;

How about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT and
INSTR_TIME_GET_MILLISEC macros instead of gettimeofday and explicit
calculations?
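
Something along these lines (a rough sketch; the variable names are mine,
not from the patch):

instr_time	start_time,
			end_time;

INSTR_TIME_SET_CURRENT(start_time);
/* ... run the check ... */
INSTR_TIME_SET_CURRENT(end_time);
INSTR_TIME_SUBTRACT(end_time, start_time);
report_status(PG_REPORT, "ok (took %.0f ms)",
			  INSTR_TIME_GET_MILLISEC(end_time));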

2.
>   Checking database user is the install user  ok (took 3 ms)
>   Checking database connection settings   ok (took 4 ms)

+report_status(PG_REPORT, "ok (took %ld ms)", elapsed_ms);

I think it's okay to just leave it with "ok  \t %ld ms", elapsed_ms);
without "took". FWIW, pg_regress does that way, see below:

ok 2 + boolean50 ms
ok 3 + char   32 ms
ok 4 + name   33 ms


>   Performing Consistency Checks
>   -
>   Checking cluster versions   ok (took 0 ms)
>   Checking database user is the install user  ok (took 3 ms)
>   Checking database connection settings   ok (took 4 ms)
>   Checking for prepared transactions  ok (took 2 ms)
>   Checking for system-defined composite types in user tables  ok (took 82 ms)
>   Checking for reg* data types in user tables ok (took 55 ms)

Just curious, out of all the reported pg_upgrade prep_status()-es,
which ones take the most time?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-28 Thread Nikita Malakhov
Hi,

I'd keep the Autotools build up to date, because Meson is of limited use
on OS versions that are not very up to date. Anyway, it's OK now. I'm
currently playing with the patch and reviewing the sources; if you need
any cooperation, please let us know. I'm +1 for committing this patch
after review.

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Support to define custom wait events for extensions

2023-07-28 Thread Bharath Rupireddy
On Fri, Jul 28, 2023 at 6:36 AM Michael Paquier  wrote:
>
> I have spent more time polishing the docs and the comments.  This v9
> looks in a rather committable shape now with docs, tests and core
> routines in place.

Thanks. Here are some comments on v9 patch:

1.
- so an LWLock wait event might be reported as
- just extension rather than the
- extension-assigned name.
+ if the extension's library is not loaded; so a custom wait event might
+ be reported as just ???
+ rather than the custom name assigned.

Trying to understand why '???' is any better than 'extension' for a
registered custom wait event of an unloaded extension?

PS: Looked at other instances where '???' is being used for
representing an unknown "thing".

2. Show how a custom wait event is displayed in the docs example "Here
is an example of how wait events can be viewed:". We can use the
worker_spi wait event type there.

3.
- so an LWLock wait event might be reported as
- just extension rather than the
- extension-assigned name.

+ . In some cases, the name
+ assigned by an extension will not be available in all server processes
+ if the extension's library is not loaded; so a custom wait event might
+ be reported as just ???

Are we missing an explicit statement of what wait event will be reported
for an LWLock when the extension library is not loaded?

4.
+ Add-ins can define custom wait events under the wait event type

I see a few instances of Add-ins/add-in in xfunc.sgml. Isn't it better
to use the word extension, given that the glossary defines what an
extension is?
https://www.postgresql.org/docs/current/glossary.html#GLOSSARY-EXTENSION

5.
+} WaitEventExtensionCounter;
+
+/* pointer to the shared memory */
+static WaitEventExtensionCounter *waitEventExtensionCounter;

How about naming the structure WaitEventExtensionCounterData and the
pointer WaitEventExtensionCounter? This keeps all the static variable
names consistent: WaitEventExtensionNames,
WaitEventExtensionNamesAllocated and WaitEventExtensionCounter.
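
Something like this (the struct members here are placeholders, not taken
from the patch):

typedef struct WaitEventExtensionCounterData
{
	int			nextId;			/* next free wait event ID */
	slock_t		mutex;			/* protects nextId */
} WaitEventExtensionCounterData;

/* pointer to the shared memory */
static WaitEventExtensionCounterData *WaitEventExtensionCounter = NULL;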

6.
+/* Check the wait event class. */
+Assert((wait_event_info & 0xFF00) == PG_WAIT_EXTENSION);
+
+/* This should only be called for user-defined wait event. */
+Assert(eventId >= NUM_BUILTIN_WAIT_EVENT_EXTENSION);

Maybe we should turn the above asserts into ereport(ERROR) to protect
against an extension sending in an unregistered wait_event_info?
Especially the first one, Assert((wait_event_info & 0xFF000000) ==
PG_WAIT_EXTENSION);, which checks that the passed-in wait_event_info
was previously returned by WaitEventExtensionNew. IMO, these assertions
fit better as errors.
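
For example, something like this (a sketch only; the exact errcode and
wording are up for debate):

if ((wait_event_info & 0xFF000000) != PG_WAIT_EXTENSION)
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("invalid wait event class %u", wait_event_info)));

if (eventId < NUM_BUILTIN_WAIT_EVENT_EXTENSION)
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("wait event ID %u is not a custom wait event",
					eventId)));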

7.
+ * Extensions can define their own wait events in this categiry.  First,
Typo - s/categiry/category

8.
+ First,
+ * they should call WaitEventExtensionNew() to get one or more wait event
+ * IDs that are allocated from a shared counter.

Can WaitEventExtensionNew() be WaitEventExtensionNew(int num_ids, int
*result) to get the required number of wait event IDs in one call,
similar to RequestNamedLWLockTranche? Currently, an extension needs to
call WaitEventExtensionNew() N times to get N wait event IDs. Maybe
the existing WaitEventExtensionNew() is good, but just a thought.
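
To sketch the idea (purely illustrative; WaitEventExtensionNewBatch does
not exist today):

uint32		wait_event_ids[2];

/* hypothetical batched variant, analogous to RequestNamedLWLockTranche() */
WaitEventExtensionNewBatch(lengthof(wait_event_ids), wait_event_ids);

/* versus today's one-at-a-time allocation */
wait_event_ids[0] = WaitEventExtensionNew();
wait_event_ids[1] = WaitEventExtensionNew();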

9.
# The expected result is a special pattern here with a newline coming from the
# first query where the shared memory state is set.
$result = $node->poll_query_until(
'postgres',
qq[SELECT worker_spi_init(); SELECT wait_event FROM
pg_stat_activity WHERE backend_type ~ 'worker_spi';],
qq[
worker_spi_main]);

This test doesn't have to be this complex, with the result being a
special pattern; SELECT worker_spi_init(); can just be run in a
separate safe_psql.

10.
+wsstate = ShmemInitStruct("custom_wait_event",

Name the shared memory just "worker_spi" to make it generic and
extensible. Essentially, it is a woker_spi shared memory area part of
it is for custom wait event id.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: remaining sql/json patches

2023-07-28 Thread Amit Langote
Hello,

On Thu, Jul 27, 2023 at 6:36 PM Shinoda, Noriyoshi (HPE Services Japan
- FSIP)  wrote:
> Hi,
> Thank you for developing such a great feature. The attached patch formats the 
> documentation like any other function definition:
> - Added right parenthesis to json function calls.
> - Added  to json functions.
> - Added a space to the 'expression' part of the json_scalar function.
> - Added a space to the 'expression' part of the json_serialize function.

Thanks for checking and the patch.  Will push shortly.

> It seems that the three functions added this time do not have tuples in the 
> pg_proc catalog. Is it unnecessary?

Yes.  These are not functions that get pg_proc entries, but SQL
constructs that *look like* functions, similar to XMLEXISTS(), etc.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-28 Thread Ashutosh Bapat
On Tue, Jul 11, 2023 at 4:30 PM Tom Lane  wrote:
>
> Jeevan Chalke  writes:
> > Attached patch.
>
> I would be astonished if this fixes anything.  The code still doesn't
> know which paths are referenced by which other ones, and so the place
> where we free a previously-added path can't know what to do.
>
> I've speculated about adding some form of reference counting to paths
> (maybe just a "pin" flag rather than a full refcount) so that we could
> be smarter about this.  The existing kluge for "don't free IndexPaths"
> could be replaced by setting the pin mark on any IndexPath that we
> make a bitmap path from.  Up to now it hasn't seemed necessary to
> generalize that hack, but maybe it's time.  Can you show a concrete
> case where we are freeing a still-referenced path?

The set of patches in [1] adds infrastructure to reference, link and
unlink paths. The patches are raw and still have some TODOs, but I
think that infrastructure will solve this problem as a side effect.
Please take a look and let me know whether this matches what you had
in mind. It's more than just pinning, though.

The patch set uses references to free memory consumed by paths which
remain unused. The memory consumed is substantial when partitionwise
join is used and there are thousands of partitions.

[1] 
https://www.postgresql.org/message-id/CAExHW5tUcVsBkq9qT%3DL5vYz4e-cwQNw%3DKAGJrtSyzOp3F%3DXacA%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat




Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-28 Thread Anthonin Bonnefoy
> Agree, something goes wrong when using Autotools (but not Meson) on
> both Linux and MacOS. I didn't investigate the issue though.
I was only using meson and forgot to keep the Automake files up to date
when I split pg_tracing.c into multiple files (span.c, explain.c...).

On Fri, Jul 28, 2023 at 8:10 AM Nikita Malakhov  wrote:

> Hi,
>
> I've fixed the Autotools build, please check patch below (v2).
>
> On Thu, Jul 27, 2023 at 6:39 PM Aleksander Alekseev <
> aleksan...@timescale.com> wrote:
>
>> Hi,
>>
>> > Also FYI, there are build warnings because functions
>> > const char * get_span_name(const Span * span, const char *qbuffer)
>> > and
>> > const char * get_operation_name(const Span * span, const char *qbuffer)
>> > do not have default inside switch and no return outside of switch.
>>
>> You are right, there are a few warnings:
>>
>> ```
>> [1566/1887] Compiling C object contrib/pg_tracing/pg_tracing.so.p/span.c.o
>> ../contrib/pg_tracing/span.c: In function ‘get_span_name’:
>> ../contrib/pg_tracing/span.c:210:1: warning: control reaches end of
>> non-void function [-Wreturn-type]
>>   210 | }
>>   | ^
>> ../contrib/pg_tracing/span.c: In function ‘get_operation_name’:
>> ../contrib/pg_tracing/span.c:249:1: warning: control reaches end of
>> non-void function [-Wreturn-type]
>>   249 | }
>>   | ^
>> ```
>>
>> Here is the patch v2 with a quick fix.
>>
>> > but got errors calling make check and cannot install the extension
>>
>> Agree, something goes wrong when using Autotools (but not Meson) on
>> both Linux and MacOS. I didn't investigate the issue though.
>>
>> --
>> Best regards,
>> Aleksander Alekseev
>>
>
>
> --
> Regards,
>
> --
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/
>


Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-28 Thread Nikita Malakhov
Hi,

I've fixed the Autotools build, please check patch below (v2).

On Thu, Jul 27, 2023 at 6:39 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi,
>
> > Also FYI, there are build warnings because functions
> > const char * get_span_name(const Span * span, const char *qbuffer)
> > and
> > const char * get_operation_name(const Span * span, const char *qbuffer)
> > do not have default inside switch and no return outside of switch.
>
> You are right, there are a few warnings:
>
> ```
> [1566/1887] Compiling C object contrib/pg_tracing/pg_tracing.so.p/span.c.o
> ../contrib/pg_tracing/span.c: In function ‘get_span_name’:
> ../contrib/pg_tracing/span.c:210:1: warning: control reaches end of
> non-void function [-Wreturn-type]
>   210 | }
>   | ^
> ../contrib/pg_tracing/span.c: In function ‘get_operation_name’:
> ../contrib/pg_tracing/span.c:249:1: warning: control reaches end of
> non-void function [-Wreturn-type]
>   249 | }
>   | ^
> ```
>
> Here is the patch v2 with a quick fix.
>
> > but got errors calling make check and cannot install the extension
>
> Agree, something goes wrong when using Autotools (but not Meson) on
> both Linux and MacOS. I didn't investigate the issue though.
>
> --
> Best regards,
> Aleksander Alekseev
>


-- 
Regards,

--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


pg-tracing-v2.patch
Description: Binary data