WIP: parallel GiST index builds

2024-06-07 Thread Tomas Vondra
a problem, the fix is IMHO quite simple - it
should be enough to force gistGetFakeLSN() to produce a new fake LSN
every time it is called when is_parallel=true.
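
To illustrate the idea, here's a minimal sketch (not the actual patch - the
shared counter, the is_parallel flag and the function name are assumptions
made for this example). During a parallel build every call hands out a
fresh, increasing fake LSN instead of the constant GistBuildLSN, so workers
can notice page splits done by other workers through page-LSN comparisons:

    #include "postgres.h"
    #include "access/gist.h"      /* GistBuildLSN */
    #include "port/atomics.h"

    /* counter shared by the leader and all parallel workers (assumption) */
    static pg_atomic_uint64 *parallel_fake_lsn = NULL;

    static XLogRecPtr
    gist_get_fake_lsn_sketch(bool is_parallel)
    {
        if (is_parallel)
            return (XLogRecPtr) pg_atomic_add_fetch_u64(parallel_fake_lsn, 1);

        /* serial build: only this backend touches the index */
        return GistBuildLSN;
    }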


performance
---

Obviously, the primary goal of the patch is to speed up the builds, so
does it actually do that? For indexes of different sizes I got these
timings in seconds (the columns show the number of parallel workers,
with 0 being a serial build):

   scale   type        0      1      3      5      7
  ----------------------------------------------------
   small   inet       13      7      4      4      2
           numeric   239    122     67     46     36
           oid        15      8      5      3      2
           text       71     35     19     13     10
   medium  inet      207    111     59     42     32
           numeric  3409   1714    885    618    490
           oid       214    114     60     43     33
           text      940    479    247    180    134
   large   inet     2167   1459    865    632    468
           numeric 38125  20256  10454   7487   5846
           oid      2647   1490    808    594    475
           text    10987   6298   3376   2462   1961

Here small is ~100-200MB index, medium is 1-2GB and large 10-20GB index,
depending on the data type.

The raw duration is not particularly easy to interpret, so let's look at
the "actual speedup" which is calculated as

   (serial duration) / (parallel duration)

and the table looks like this:

   scale   type        1      3      5      7
  ---------------------------------------------
   small   inet      1.9    3.3    3.3    6.5
           numeric   2.0    3.6    5.2    6.6
           oid       1.9    3.0    5.0    7.5
           text      2.0    3.7    5.5    7.1
   medium  inet      1.9    3.5    4.9    6.5
           numeric   2.0    3.9    5.5    7.0
           oid       1.9    3.6    5.0    6.5
           text      2.0    3.8    5.2    7.0
   large   inet      1.5    2.5    3.4    4.6
           numeric   1.9    3.6    5.1    6.5
           oid       1.8    3.3    4.5    5.6
           text      1.7    3.3    4.5    5.6

Ideally (if the build scaled linearly with the number of workers), we'd
get the number of workers + 1 (because the leader participates too).
Obviously, it's not that great - for example, for the large text index we
get 3.3 instead of 4.0 with 3 workers, and 5.6 instead of 8.0 with 7 workers.

But I think those numbers are actually pretty good - I'd definitely not
complain if my index builds got 5x faster.

But those are synthetic tests on random data, using the btree_gist
opclasses. It'd be interesting if people could do their own testing on
real-world data sets.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From f684556a910f566a8c6a6ea0dd588173ab94a245 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 26 May 2024 21:44:27 +0200
Subject: [PATCH v20240607] WIP parallel GiST build

Implements parallel GiST index build for the unsorted case. The build
simply starts parallel workers that insert values into the index the
usual way (as if there were multiple clients doing INSERT).

The basic infrastructure is copied from parallel BRIN builds (and also
from the nearby parallel GIN build). There's nothing particularly
special or interesting, except for the gistBuildParallelCallback()
callback. The two significant changes in the callback are:

1) disabling buffering

Buffered builds assume the worker is the only backend that can split
index pages etc. With a serial build that is trivially true, but with
parallel workers this leads to confusion.

In principle this is solvable by moving the buffers into shared memory
and coordinating the workers (locking etc.). But the patch does not do
that yet - it's clearly non-trivial, and I'm not really convinced it's
worth it.

2) generating "proper" fake LSNs

Serial builds disable all WAL-logging for the index until the very end,
when the whole index is WAL-logged. This however also means we don't
set page LSNs on the index pages - but page LSNs are used to detect
concurrent changes to the index structure (e.g. page splits). For serial
builds this does not matter, because only the build worker can modify
the index, so it just sets the same LSN "1" for all pages. Both of these
(disabling WAL-logging and using bogus page LSNs) are controlled by the
same is_build flag.

Having the same page LSN does not work for parallel builds, as it would
mean workers won't notice splits done by other workers, etc.

One option would be to set is_build=false, which enables WAL-logging, as
if during

Re: Compress ReorderBuffer spill files using LZ4

2024-06-07 Thread Tomas Vondra
On 6/6/24 12:58, Julien Tachoires wrote:
> ...
>
> When compiled with LZ4 support (--with-lz4), this patch enables data
> compression/decompression of these temporary files. Each transaction
> change that must be written on disk (ReorderBufferDiskChange) is now
> compressed and encapsulated in a new structure.
> 

I'm a bit confused, but why tie this to having lz4? Why shouldn't this
be supported even for pglz, or whatever algorithms we add in the future?
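
To illustrate (a minimal sketch, not the proposed patch - the enum, the
function name and the callers are made up for this example): the spill-file
writer could go through a small algorithm-agnostic wrapper, so pglz works in
every build and lz4/zstd are just extra branches behind their configure
guards, which is roughly what making this independent of --with-lz4 would
look like:

    #include "postgres.h"
    #include "common/pg_lzcompress.h"
    #ifdef USE_LZ4
    #include <lz4.h>
    #endif

    typedef enum SpillCompressionMethod
    {
        SPILL_COMPRESSION_NONE,
        SPILL_COMPRESSION_PGLZ,
        SPILL_COMPRESSION_LZ4       /* zstd etc. would be added here */
    } SpillCompressionMethod;

    /*
     * Compress srclen bytes from src into dst (sized for the worst case,
     * e.g. PGLZ_MAX_OUTPUT(srclen)).  Returns the compressed size, or -1
     * to tell the caller to spill the raw (uncompressed) change.
     */
    static int
    spill_compress(SpillCompressionMethod method,
                   const char *src, int srclen, char *dst, int dstlen)
    {
        switch (method)
        {
            case SPILL_COMPRESSION_NONE:
                return -1;
            case SPILL_COMPRESSION_PGLZ:
                /* always available, no configure option required */
                return pglz_compress(src, srclen, dst, PGLZ_strategy_default);
            case SPILL_COMPRESSION_LZ4:
    #ifdef USE_LZ4
                {
                    int     len = LZ4_compress_default(src, dst, srclen, dstlen);

                    return (len > 0) ? len : -1;    /* 0 means lz4 gave up */
                }
    #else
                elog(ERROR, "lz4 compression is not supported by this build");
    #endif
        }
        return -1;                  /* keep compiler quiet */
    }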


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Compress ReorderBuffer spill files using LZ4

2024-06-07 Thread Tomas Vondra
On 6/6/24 16:24, Alvaro Herrera wrote:
> On 2024-Jun-06, Amit Kapila wrote:
> 
>> On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires  wrote:
>>>
>>> When the content of a large transaction (size exceeding
>>> logical_decoding_work_mem) and its sub-transactions has to be
>>> reordered during logical decoding, then, all the changes are written
>>> on disk in temporary files located in pg_replslot/.
>>> Decoding very large transactions by multiple replication slots can
>>> lead to disk space saturation and high I/O utilization.
> 
> I like the general idea of compressing the output of logical decoding.
> It's not so clear to me that we only want to do so for spilling to disk;
> for instance, if the two nodes communicate over a slow network, it may
> even be beneficial to compress when streaming, so to this question:
> 
>> Why can't one use 'streaming' option to send changes to the client
>> once it reaches the configured limit of 'logical_decoding_work_mem'?
> 
> I would say that streaming doesn't necessarily have to mean we don't
> want compression, because for some users it might be beneficial.
> 
> I think a GUC would be a good idea.  Also, what if for whatever reason
> you want a different compression algorithm or different compression
> parameters?  Looking at the existing compression UI we offer in
> pg_basebackup, perhaps you could add something like this:
> 
> compress_logical_decoding = none
> compress_logical_decoding = lz4:42
> compress_logical_decoding = spill-zstd:99
> 
> "none" says to never use compression (perhaps should be the default),
> "lz4:42" says to use lz4 with parameters 42 on both spilling and
> streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but
> only for spilling to disk.
> 
> (I don't mean to say that you should implement Zstd compression with
> this patch, only that you should choose the implementation so that
> adding Zstd support (or whatever) later is just a matter of adding some
> branches here and there.  With the current #ifdef you propose, it's hard
> to do that.  Maybe separate the parts that depend on the specific
> algorithm to algorithm-agnostic functions.)
> 

I haven't been following the "libpq compression" thread, but wouldn't
that also do compression for the streaming case? That was my assumption,
at least, and it seems like the right way - we probably don't want to
patch every place that sends data over the network independently, right?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Conflict Detection and Resolution

2024-06-07 Thread Tomas Vondra



On 6/3/24 09:30, Amit Kapila wrote:
> On Sat, May 25, 2024 at 2:39 AM Tomas Vondra
>  wrote:
>>
>> On 5/23/24 08:36, shveta malik wrote:
>>>
>>> Conflict Resolution
>>> 
>>> a) latest_timestamp_wins:The change with later commit timestamp wins.
>>> b) earliest_timestamp_wins:   The change with earlier commit timestamp wins.
>>> c) apply:   Always apply the remote change.
>>> d) skip:Remote change is skipped.
>>> e) error:   Error out on conflict. Replication is stopped, manual
>>> action is needed.
>>>
>>
>> Why not to have some support for user-defined conflict resolution
>> methods, allowing to do more complex stuff (e.g. merging the rows in
>> some way, perhaps even with datatype-specific behavior)?
>>
>>> The change will be converted to 'UPDATE' and applied if the decision
>>> is in favor of applying remote change.
>>>
>>> It is important to have commit timestamp info available on subscriber
>>> when latest_timestamp_wins or earliest_timestamp_wins method is chosen
>>> as resolution method.  Thus ‘track_commit_timestamp’ must be enabled
>>> on subscriber, in absence of which, configuring the said
>>> timestamp-based resolution methods will result in error.
>>>
>>> Note: If the user has chosen the latest or earliest_timestamp_wins,
>>> and the remote and local timestamps are the same, then it will go by
>>> system identifier. The change with a higher system identifier will
>>> win. This will ensure that the same change is picked on all the nodes.
>>
>> How is this going to deal with the fact that commit LSN and timestamps
>> may not correlate perfectly? That is, commits may happen with LSN1 <
>> LSN2 but with T1 > T2.
>>
> 
> One of the possible scenarios discussed at pgconf.dev with Tomas for
> this was as follows:
> 
> Say there are two publisher nodes PN1, PN2, and subscriber node SN3.
> The logical replication is configured such that a subscription on SN3
> has publications from both PN1 and PN2. For example, SN3 (sub) -> PN1,
> PN2 (p1, p2)
> 
> Now, on PN1, we have the following operations that update the same row:
> 
> T1
> Update-1 on table t1 at LSN1 (1000) on time (200)
> 
> T2
> Update-2 on table t1 at LSN2 (2000) on time (100)
> 
> Then in parallel, we have the following operation on node PN2 that
> updates the same row as Update-1, and Update-2 on node PN1.
> 
> T3
> Update-3 on table t1 at LSN(1500) on time (150)
> 
> By theory, we can have a different state on subscribers depending on
> the order of updates arriving at SN3 which shouldn't happen. Say, the
> order in which they reach SN3 is: Update-1, Update-2, Update-3 then
> the final row we have is by Update-3 considering we have configured
> last_update_wins as a conflict resolution method. Now, consider the
> other order:  Update-1, Update-3, Update-2, in this case, the final
> row will be by Update-2 because when we try to apply Update-3, it will
> generate a conflict and as per the resolution method
> (last_update_wins) we need to retain Update-1.
> 
> On further thinking, the operations on node-1 PN-1 as defined above
> seem impossible because one of the Updates needs to wait for the other
> to write a commit record. So the commits may happen with LSN1 < LSN2
> but with T1 > T2 but they can't be on the same row due to locks. So,
> the order of apply should still be consistent. Am, I missing
> something?
> 

Sorry, I should have read your message before responding a couple
minutes ago. I think you're right this exact example can't happen, due
to the dependency between transactions.

But as I wrote, I'm not quite convinced this means there are not other
issues with this way of resolving conflicts. It's more likely a more
complex scenario is required.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Conflict Detection and Resolution

2024-06-07 Thread Tomas Vondra
On 5/28/24 11:17, Nisha Moond wrote:
> On Mon, May 27, 2024 at 11:19 AM shveta malik  wrote:
>>
>> On Sat, May 25, 2024 at 2:39 AM Tomas Vondra
>>  wrote:
>>>
>>> ...
>>>
>>> I don't understand why update_missing or update_deleted should be
>>> different, especially considering it's not detected reliably. And also
>>> that even if we happen to find the row the associated TOAST data may
>>> have already been removed. So why would this matter?
>>
>> Here, we are trying to tackle the case where the row is 'recently'
>> deleted i.e. concurrent UPDATE and DELETE on pub and sub. User may
>> want to opt for a different resolution in such a case as against the
>> one where the corresponding row was not even present in the first
>> place. The case where the row was deleted long back may not fall into
>> this category as there are higher chances that they have been removed
>> by vacuum and can be considered equivalent to the update_ missing
>> case.
>>
>> Regarding "TOAST column" for deleted row cases, we may need to dig
>> more. Thanks for bringing this case. Let me analyze more here.
>>
> I tested a simple case with a table with one TOAST column and found
> that when a tuple with a TOAST column is deleted, both the tuple and
> corresponding pg_toast entries are marked as ‘deleted’ (dead) but not
> removed immediately. The main tuple and respective pg_toast entry are
> permanently deleted only during vacuum. First, the main table’s dead
> tuples are vacuumed, followed by the secondary TOAST relation ones (if
> available).
> Please let us know if you have a specific scenario in mind where the
> TOAST column data is deleted immediately upon ‘delete’ operation,
> rather than during vacuum, which we are missing.
> 

I'm pretty sure you can vacuum the TOAST table directly, which means
you'll end up with a deleted tuple with TOAST pointers, but with the
TOAST entries already gone.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Conflict Detection and Resolution

2024-06-07 Thread Tomas Vondra
On 5/27/24 07:48, shveta malik wrote:
> On Sat, May 25, 2024 at 2:39 AM Tomas Vondra
>  wrote:
>>
>> On 5/23/24 08:36, shveta malik wrote:
>>> Hello hackers,
>>>
>>> Please find the proposal for Conflict Detection and Resolution (CDR)
>>> for Logical replication.
>>> >> below details.>
>>>
>>> Introduction
>>> 
>>> In case the node is subscribed to multiple providers, or when local
>>> writes happen on a subscriber, conflicts can arise for the incoming
>>> changes.  CDR is the mechanism to automatically detect and resolve
>>> these conflicts depending on the application and configurations.
>>> CDR is not applicable for the initial table sync. If locally, there
>>> exists conflicting data on the table, the table sync worker will fail.
>>> Please find the details on CDR in apply worker for INSERT, UPDATE and
>>> DELETE operations:
>>>
>>
>> Which architecture are you aiming for? Here you talk about multiple
>> providers, but the wiki page mentions active-active. I'm not sure how
>> much this matters, but it might.
> 
> Currently, we are working for multi providers case but ideally it
> should work for active-active also. During further discussion and
> implementation phase, if we find that, there are cases which will not
> work in straight-forward way for active-active, then our primary focus
> will remain to first implement it for multiple providers architecture.
> 
>>
>> Also, what kind of consistency you expect from this? Because none of
>> these simple conflict resolution methods can give you the regular
>> consistency models we're used to, AFAICS.
> 
> Can you please explain a little bit more on this.
> 

I was referring to the well established consistency models / isolation
levels, e.g. READ COMMITTED or SNAPSHOT ISOLATION. This determines what
guarantees the application developer can expect, what anomalies can
happen, etc.

I don't think any such isolation level can be implemented with a simple
conflict resolution methods like last-update-wins etc. For example,
consider an active-active where both nodes do

  UPDATE accounts SET balance=balance+1000 WHERE id=1

This will inevitably lead to a conflict, and while the last-update-wins
resolves this "consistently" on both nodes (e.g. ending with the same
result), it's essentially a lost update.

This is a very simplistic example of course, I recall there are various
more complex examples involving foreign keys, multi-table transactions,
constraints, etc. But in principle it's a manifestation of the same
inherent limitation of conflict detection and resolution etc.

Similarly, I believe this affects not just active-active, but also the
case where one node aggregates data from multiple publishers. Maybe not
to the same extent / it might be fine for that use case, but you said
the end goal is to use this for active-active. So I'm wondering what's
the plan, there.

If I'm writing an application for active-active using this conflict
handling, what assumptions can I make? Will Can I just do stuff as if on
a single node, or do I need to be super conscious about the zillion ways
things can misbehave in a distributed system?

My personal opinion is that the closer this will be to the regular
consistency levels, the better. If past experience taught me anything,
it's very hard to predict how distributed systems with eventual
consistency behave, and even harder to actually test the application in
such environment.

In any case, if there are any differences compared to the usual
behavior, it needs to be very clearly explained in the docs.

>>
>>> INSERT
>>> 
>>> To resolve INSERT conflict on subscriber, it is important to find out
>>> the conflicting row (if any) before we attempt an insertion. The
>>> indexes or search preference for the same will be:
>>> First check for replica identity (RI) index.
>>>   - if not found, check for the primary key (PK) index.
>>> - if not found, then check for unique indexes (individual ones or
>>> added by unique constraints)
>>>  - if unique index also not found, skip CDR
>>>
>>> Note: if no RI index, PK, or unique index is found but
>>> REPLICA_IDENTITY_FULL is defined, CDR will still be skipped.
>>> The reason being that even though a row can be identified with
>>> REPLICAT_IDENTITY_FULL, such tables are allowed to have duplicate
>>> rows. Hence, we should not go for conflict detection in such a case.
>>>
>>
>> It's not clear to me why would REPLICA_IDENTITY_FULL mean the table is
>> allowed to have duplicate values? It just means the

Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-24 Thread Tomas Vondra



On 5/24/24 22:44, Tom Lane wrote:
> Joe Conway  writes:
>> On 5/24/24 15:45, Tom Lane wrote:
>>> I was *not* proposing doing a regular review, unless of course
>>> somebody really wants to do that.  What I am thinking about is
>>> suggesting how to make progress on patches that are stuck, or in some
>>> cases delivering the bad news that this patch seems unlikely to ever
>>> get accepted and it's time to cut our losses.  (Patches that seem to
>>> be moving along in good order probably don't need any attention in
>>> this process, beyond determining that that's the case.)  That's why
>>> I think we need some senior people doing this, as their opinions are
>>> more likely to be taken seriously.
> 
>> Maybe do a FOSDEM-style dev meeting with triage review at PG.EU would at 
>> least move us forward? Granted it is less early and perhaps less often 
>> than the thread seems to indicate, but has been tossed around before and 
>> seems doable.
> 
> Perhaps.  The throughput of an N-person meeting is (at least) a factor
> of N less than the same N people looking at patches individually.
> On the other hand, the consensus of a meeting is more likely to be
> taken seriously than a single person's opinion, senior or not.
> So it could work, but I think we'd need some prefiltering so that
> the meeting only spends time on those patches already identified as
> needing help.
> 

I personally don't think the FOSDEM triage is a very productive use of
our time - we go through patches top to bottom, often with little idea
what the current state of the patch is. We always ran out of time after
looking at maybe 1/10 of the list.

Having an in-person discussion about patches would be good, but I think
we should split the meeting into much smaller groups for that, each
looking at a different subset. And maybe it should be determined in
advance, so that people can look at those patches in advance ...


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Conflict Detection and Resolution

2024-05-24 Thread Tomas Vondra
key row differs
> from the local row i.e.; the row has already been updated locally or
> by different nodes.
> b) update_missing: The row with the same value as that incoming
> update's key does not exist. Remote is trying to update a row which
> does not exist locally.
> c) update_deleted: The row with the same value as that incoming
> update's key does not exist. The row is already deleted. This conflict
> type is generated only if the deleted row is still detectable i.e., it
> is not removed by VACUUM yet. If the row is removed by VACUUM already,
> it cannot detect this conflict. It will detect it as update_missing
> and will follow the default or configured resolver of update_missing
> itself.
> 

I don't understand why update_missing or update_deleted should be
different, especially considering it's not detected reliably. And also
that even if we happen to find the row the associated TOAST data may
have already been removed. So why would this matter?


>  Conflict Resolutions:
> 
> a)   latest_timestamp_wins:The change with later commit timestamp
> wins. Can be used for ‘update_differ’.
> b)   earliest_timestamp_wins:   The change with earlier commit
> timestamp wins. Can be used for ‘update_differ’.
> c)   apply:   The remote change is always applied.  Can be used for
> ‘update_differ’.
> d)   apply_or_skip: Remote change is converted to INSERT and is
> applied. If the complete row cannot be constructed from the info
> provided by the publisher, then the change is skipped. Can be used for
> ‘update_missing’ or ‘update_deleted’.
> e)   apply_or_error: Remote change is converted to INSERT and is
> applied. If the complete row cannot be constructed from the info
> provided by the publisher, then error is raised. Can be used for
> ‘update_missing’ or ‘update_deleted’.
> f) skip: Remote change is skipped and local one is retained. Can be
> used for any conflict type.
> g)   error: Error out of conflict. Replication is stopped, manual
> action is needed.  Can be used for any conflict type.
> 
>  To support UPDATE CDR, the presence of either replica identity Index
> or primary key is required on target node.  Update CDR will not be
> supported in absence of replica identity index or primary key even
> though REPLICA IDENTITY FULL is set.  Please refer to "UPDATE" in
> "Noteworthey Scenarios" section in [1] for further details.
> 
> DELETE
> 
> Conflict Type:
> 
> delete_missing: An incoming delete is trying to delete a row on a
> target node which does not exist.
> 
> Conflict Resolutions:
> 
> a)   error : Error out on conflict. Replication is stopped, manual
> action is needed.
> b)   skip  : The remote change is skipped.
> 
> Configuring Conflict Resolution:
> 
> There are two parts when it comes to configuring CDR:
> 
> a) Enabling/Disabling conflict detection.
> b) Configuring conflict resolvers for different conflict types.
> 
>  Users can sometimes create multiple subscriptions on the same node,
> subscribing to different tables to improve replication performance by
> starting multiple apply workers. If the tables in one subscription are
> less likely to cause conflict, then it is possible that user may want
> conflict detection disabled for that subscription to avoid detection
> latency while enabling it for other subscriptions.  This generates a
> requirement to make ‘conflict detection’ configurable per
> subscription. While the conflict resolver configuration can remain
> global. All the subscriptions which opt for ‘conflict detection’ will
> follow global conflict resolver configuration.
> 
> To implement the above, subscription commands will be changed to have
> one more parameter 'conflict_resolution=on/off', default will be OFF.
> 
> To configure global resolvers, new DDL command will be introduced:
> 
> CONFLICT RESOLVER ON  IS 
> 

I very much doubt we want a single global conflict resolver, or even one
resolver per subscription. It seems like a very table-specific thing.

Also, doesn't this whole design ignore the concurrency between
publishers? Isn't this problematic considering the commit timestamps may
go backwards (for a given publisher), which means the conflict
resolution is not deterministic (as it depends on how exactly it
interleaves)?


> -
> 
> Apart from the above three main operations and resolver configuration,
> there are more conflict types like primary-key updates, multiple
> unique constraints etc and some special scenarios to be considered.
> Complete design details can be found in [1].
> 
> [1]: https://wiki.postgresql.org/wiki/Conflict_Detection_and_Resolution
> 

Hmmm, not sure it's good to have a "complete" design on wiki, and only
some subset posted to the mailing list. I haven't compared what the
differences are, though.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-24 Thread Tomas Vondra



On 5/24/24 19:09, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 5/20/24 16:16, Robert Haas wrote:
>>> On Sun, May 19, 2024 at 3:18 PM Tom Lane  wrote:
>>>> * Before starting this thread, Robert did a lot of very valuable
>>>> review of some individual patches.  I think what prompted him to
>>>> start the thread was the realization that he'd only made a small
>>>> dent in the problem.  Maybe we could divide and conquer: get a
>>>> dozen-or-so senior contributors to split up the list of pending
>>>> patches and then look at each one with an eye to what needs to
>>>> happen to move it along (*not* to commit it right away, although
>>>> in some cases maybe that's the thing to do).  It'd be great if
>>>> that could happen just before each commitfest, but that's probably
>>>> not practical with the current patch volume.  What I'm thinking
>>>> for the moment is to try to make that happen once a year or so.
> 
>>> I like this idea.
> 
>> Me too. Do you think it'd happen throughout the whole year, or at some
>> particular moment?
> 
> I was imagining a focused community effort spanning a few days to
> a week.  It seems more likely to actually happen if we attack it
> that way than if it's spread out as something people will do when
> they get around to it.  Of course the downside is finding a week
> when everybody is available.
> 
>> We used to do a "triage" at the FOSDEM PGDay meeting, but that used to
>> be more of an ad hoc thing to use the remaining time, with only a small
>> subset of people. But that seems pretty late in the dev cycle, I guess
>> we'd want it to happen early, perhaps during the first CF?
> 
> Yeah, early in the cycle seems more useful, although the summer might
> be the worst time for peoples' availability.
> 

I think meeting all these conditions - a week early in the cycle, but
not in the summer, when everyone can focus on this - will be difficult.

If we give up on everyone doing it at the same time, summer would be a
good time to do this - it's early in the cycle, and it tends to be a
quieter period (customers are on vacation too, so fewer incidents).

So maybe it'd be better to just set some deadline by which this needs to
be done, and make sure every pending patch has someone expected to look
at it? IMHO we're not in a position to assign stuff to people, so I guess
people would just volunteer anyway - the CF app might track this.

It's not entirely clear to me if this would effectively mean doing a
regular review of those patches, or something less time consuming.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-24 Thread Tomas Vondra



On 5/20/24 16:16, Robert Haas wrote:
> On Sun, May 19, 2024 at 3:18 PM Tom Lane  wrote:
>
>...
> 
>> * Another reason for things sitting a long time is that they're too
>> big to review without an unreasonable amount of effort.  We should
>> encourage authors to break large patches into smaller stepwise
>> refinements.  It may seem like that will result in taking forever
>> to reach the end goal, but dropping a huge patchset on the community
>> isn't going to give speedy results either.
> 
> Especially if it has a high rate of subtle defects, which is common.
> 

I think breaking large patches into reasonably small parts is a very
important thing. Not only is it really difficult (or perhaps even
practically impossible) to review patches over a certain size, because
you have to grok and review everything at once; it's also not great
from the cost/benefit POV - the improvement may be very beneficial, but
if it's one huge lump of code that never becomes committable as a whole,
there's no benefit in practice. Which makes it less likely I'll even
start looking at the patch very closely, because there's a risk it'd be
just a waste of time in the end.

So I think this is another important reason to advise people to split
patches into smaller parts - not only does it make it easier to review, it
makes it possible to review and commit the parts incrementally, getting
at least some benefits early.

>> * Before starting this thread, Robert did a lot of very valuable
>> review of some individual patches.  I think what prompted him to
>> start the thread was the realization that he'd only made a small
>> dent in the problem.  Maybe we could divide and conquer: get a
>> dozen-or-so senior contributors to split up the list of pending
>> patches and then look at each one with an eye to what needs to
>> happen to move it along (*not* to commit it right away, although
>> in some cases maybe that's the thing to do).  It'd be great if
>> that could happen just before each commitfest, but that's probably
>> not practical with the current patch volume.  What I'm thinking
>> for the moment is to try to make that happen once a year or so.
> 
> I like this idea.
> 

Me too. Do you think it'd happen throughout the whole year, or at some
particular moment?

We used to do a "triage" at the FOSDEM PGDay meeting, but that used to
be more of an ad hoc thing to use the remaining time, with only a small
subset of people. But that seems pretty late in the dev cycle, I guess
we'd want it to happen early, perhaps during the first CF?

FWIW this reminds me of the "CAN reports" tracking the current "conditions,
actions and needs" of a ticket. I do that for internal stuff, and I find
that quite helpful (as long as it's kept up to date).

>> Yeah, all this stuff ultimately gets done "for the good of the
>> project", which isn't the most reliable motivation perhaps.
>> I don't see a better one...
> 
> This is the really hard part.
> 

I think we have plenty of motivated people with good intentions. Some
are motivated by personal interest, some by trying to achieve stuff
related to their work/research/... I don't think the exact reasons
matter too much, and it's often a combination.

IMHO we should look at this from the other end - people are motivated to
get a patch reviewed & committed, and if we introduce a process that's
more likely to lead to that result, people will mostly adopt that.

And if we could make that process more convenient by improving the CF
app to support it, that'd be even better ... I'm mostly used to the
mailing list idea, but with the volume it's a constant struggle to keep
track of new patch versions that I wanted/promised to review, etc. The
CF app helps with that a little bit, because I can "become a reviewer"
but I still don't get notifications or anything like that :-(


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Reading timestamp values from Datums gives garbage values

2024-05-20 Thread Tomas Vondra
On 5/20/24 16:37, Sushrut Shivaswamy wrote:
> Hey,
> 
> I'm trying to read a timestamp column as EPOCH.
> My query is as follows.
> ```
> SELECT EXTRACT(EPOCH FROM timestamp_column) FROM table;
> 
> column
> --
> 
> 1716213097.86486
> ```
> When running in the console this query gives valid epoch output which
> appears to be of type double.
> 
> When trying to read the query response from the Datum, I get garbage values.
> I've tried various types and none of them read the correct value.
> ```
> 
> Datum current_timestamp = SPI_getbinval(SPI_tuptable->vals[i],
> SPI_tuptable->tupdesc, 5, );
> 
> double current_time = DatumGetFloat8(current_timestamp); // prints 0
> 
> int64 time = DatumGetUint64(current_timestamp); // prints 5293917674
> ```
> 

TimestampTz is int64, so using DatumGetInt64 is probably the simplest
solution. And it's the number of microseconds, so X/1e6 should give you
the epoch.
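
A minimal sketch of that (assuming the query selects the raw timestamptz
column instead of EXTRACT(EPOCH ...); the helper name and the row/column
arguments are made up). Note that TimestampTz counts microseconds from
2000-01-01, so getting a Unix epoch needs the extra offset added below,
a detail not spelled out above:

    #include "postgres.h"
    #include "executor/spi.h"
    #include "utils/timestamp.h"

    /* hypothetical helper: log the Unix-epoch value of a timestamptz column */
    static void
    print_epoch_from_column(int row, int col)
    {
        bool        isnull;
        Datum       d = SPI_getbinval(SPI_tuptable->vals[row],
                                      SPI_tuptable->tupdesc, col, &isnull);

        if (!isnull)
        {
            /* TimestampTz is int64 microseconds since 2000-01-01 */
            TimestampTz ts = DatumGetTimestampTz(d);
            double      unix_epoch = (double) ts / USECS_PER_SEC +
                (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY;

            elog(INFO, "epoch: %f", unix_epoch);
        }
    }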

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: allow sorted builds for btree_gist

2024-05-18 Thread Tomas Vondra
On 5/18/24 02:00, Paul A Jungwirth wrote:
> On Fri, May 17, 2024 at 12:41 PM Tomas Vondra
>  wrote:
>> I've been looking at GiST to see if there could be a good way to do
>> parallel builds - and there might be, if the opclass supports sorted
>> builds, because then we could parallelize the sort.
>>
>> But then I noticed we support this mode only for point_ops, because
>> that's the only opclass with sortsupport procedure. Which mostly makes
>> sense, because types like geometries, polygons, ... do not have a well
>> defined order.
> 
> Oh, I'm excited about this for temporal tables. It seems to me that
> ranges and multiranges should have a well-defined order (assuming
> their base types do), since you can do dictionary-style ordering
> (compare the first value, then the next, then the next, etc.). Is
> there any reason not to support those? No reason not to commit these
> btree_gist functions first though. If you aren't interested in adding
> GiST sortsupport for ranges & multiranges I'm willing to do it myself;
> just let me know.
> 

I believe that's pretty much what the existing patch [1] linked by
Andrey (and apparently running for a number of CFs) does.

[1] https://commitfest.postgresql.org/48/3686/

> Do note that the 1.7 -> 1.8 changes have been reverted though (as part
> of my temporal work), and it looks like your patch is a bit messed up
> from that. You'll want to take 1.8 for yourself, and also your 1.9
> upgrade script is trying to add the reverted stratnum functions.
> 

Yeah, I happened to branch from a slightly older master, not noticing
this is affected by the revert.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup does not detect missing files

2024-05-18 Thread Tomas Vondra



On 5/17/24 14:20, Robert Haas wrote:
> On Fri, May 17, 2024 at 1:18 AM David Steele  wrote:
>> However, I think allowing the user to optionally validate the input
>> would be a good feature. Running pg_verifybackup as a separate step is
>> going to be a more expensive then verifying/copying at the same time.
>> Even with storage tricks to copy ranges of data, pg_combinebackup is
>> going to aware of files that do not need to be verified for the current
>> operation, e.g. old copies of free space maps.
> 
> In cases where pg_combinebackup reuses a checksums from the input
> manifest rather than recomputing it, this could accomplish something.
> However, for any file that's actually reconstructed, pg_combinebackup
> computes the checksum as it's writing the output file. I don't see how
> it's sensible to then turn around and verify that the checksum that we
> just computed is the same one that we now get. It makes sense to run
> pg_verifybackup on the output of pg_combinebackup at a later time,
> because that can catch bits that have been flipped on disk in the
> meanwhile. But running the equivalent of pg_verifybackup during
> pg_combinebackup would amount to doing the exact same checksum
> calculation twice and checking that it gets the same answer both
> times.
> 
>> One more thing occurs to me -- if data checksums are enabled then a
>> rough and ready output verification would be to test the checksums
>> during combine. Data checksums aren't very good but something should be
>> triggered if a bunch of pages go wrong, especially since the block
>> offset is part of the checksum. This would be helpful for catching
>> combine bugs.
> 
> I don't know, I'm not very enthused about this. I bet pg_combinebackup
> has some bugs, and it's possible that one of them could involve
> putting blocks in the wrong places, but it doesn't seem especially
> likely. Even if it happens, it's more likely to be that
> pg_combinebackup thinks it's putting them in the right places but is
> actually writing them to the wrong offset in the file, in which case a
> block-checksum calculation inside pg_combinebackup is going to think
> everything's fine, but a standalone tool that isn't confused will be
> able to spot the damage.
> 

Perhaps more importantly, can you even verify data checksums before the
recovery is completed? I don't think you can (pg_checksums certainly
does not allow doing that), because who knows in what shape you copied
the block?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: allow sorted builds for btree_gist

2024-05-18 Thread Tomas Vondra



On 5/18/24 08:51, Andrey M. Borodin wrote:
> 
> 
>> On 18 May 2024, at 00:41, Tomas Vondra
>>  wrote:
>> 
>> if the opclass supports sorted builds, because then we could
>> parallelize the sort.
> 
> Hi Tomas!
> 
> Yup, I'd also be glad to see this feature. PostGIS folks are using
> their geometry (sortsupport was developed for this) with object id
> (this disables sort build).
> 
> It was committed once [0], but then reverted, vardata opclasses were
> implemented wrong. Now it's on CF[1], Bernd is actively responding in
> the thread, but currently patch lacks tests.
> 
> Thanks for raising this!
> 

Oh, damn! I didn't notice the CF already has a patch doing this, and
that it was committed/reverted in 2021. I was completely oblivious to
that. Apologies.

Let's continue working on that patch/thread, I'll take a look in the
next CF.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-17 Thread Tomas Vondra


On 5/17/24 21:59, Robert Haas wrote:
> On Fri, May 17, 2024 at 3:51 PM Laurenz Albe  wrote:
>> On Fri, 2024-05-17 at 13:12 -0400, Greg Sabino Mullane wrote:
>>>> Long time ago there was a "rule" that people submitting patches are 
>>>> expected
>>>> to do reviews. Perhaps we should be more strict about this.
>>>
>>> Big -1. How would we even be more strict about this? Public shaming? 
>>> Withholding a commit?
>>
>> I think it is a good rule.  I don't think that it shouldn't lead to putting
>> people on the pillory or kicking their patches, but I imagine that a 
>> committer
>> looking for somebody else's patch to work on could prefer patches by people
>> who are doing their share of reviews.
> 

Yeah, I don't have any particular idea how the rule should be "enforced"
and I certainly did not imagine public shaming or anything like that. My
thoughts were more about reminding people the reviews are part of the
deal, that's it ... maybe "more strict" was not quite what I meant.

> If you give me an automated way to find that out, I'll consider paying
> some attention to it. However, in order to sort the list of patches
> needing review by the amount of review done by the patch author, we'd
> first need to have a list of patches needing review.
> 
> And right now we don't, or at least not in any usable way.
> commitfest.postgresql.org is supposed to give us that, but it doesn't.
> 

It'd certainly help to know which patches to consider for review, but I
guess I'd still look at patches from people doing more reviews first,
even if I had to find out in what shape the patch is.

I'm far more skeptical about "automated way" to track this, though. I'm
not sure it's quite possible - reviews can have a lot of very different
forms, and deciding what is or is not a review is pretty subjective. So
it's not clear how would we quantify that. Not to mention I'm sure we'd
promptly find ways to game that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




allow sorted builds for btree_gist

2024-05-17 Thread Tomas Vondra
Hi,

I've been looking at GiST to see if there could be a good way to do
parallel builds - and there might be, if the opclass supports sorted
builds, because then we could parallelize the sort.

But then I noticed we support this mode only for point_ops, because
that's the only opclass with sortsupport procedure. Which mostly makes
sense, because types like geometries, polygons, ... do not have a well
defined order.

Still, we have btree_gist, and I don't think there's much reason to not
support sorted builds for those opclasses, where the gist opclass is
defined on top of btree ordering semantics.

So this patch does that, and the difference (compared to builds with
buffering=on) can easily be an order of magnitude - at least that's
what my tests show:


test=# create index on test_int8 using gist (a) with (buffering = on);
CREATE INDEX
Time: 578799.450 ms (09:38.799)

test=# create index on test_int8 using gist (a) with (buffering = auto);
CREATE INDEX
Time: 53022.593 ms (00:53.023)


test=# create index on test_uuid using gist (a) with (buffering = on);
CREATE INDEX
Time: 39322.799 ms (00:39.323)

test=# create index on test_uuid using gist (a) with (buffering = auto);
CREATE INDEX
Time: 6466.341 ms (00:06.466)


The WIP patch enables this for data types with a usable sortsupport
procedure, which excludes time, timetz, cash, interval, bit, vbit, bool,
enum and macaddr8. I assume time, timetz and macaddr8 could be added,
it's just that I didn't find any existing sortsupport procedure. Same
for cash, but IIRC it's mostly deprecated.

Of course, people probably don't use btree_gist with a single column,
because they could just use btree. It's useful for multi-column GiST
indexes, with data types requiring GiST. And if the other opclasses also
allow sorted builds, this patch makes that possible. Of course, most
"proper GiST opclasses" don't support that, but why not - it's easy.

FWIW this is also why sorted builds likely are not a very practical way
to do parallel builds for GiST - it would help only with a small part of
cases, I think.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 011e6eee923a6f51668f3277f470d32922923d17 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 17 May 2024 20:36:08 +0200
Subject: [PATCH v20240517] allow sorted builds for btree_gist

---
 contrib/btree_gist/Makefile |  2 +-
 contrib/btree_gist/btree_gist--1.8--1.9.sql | 82 +
 contrib/btree_gist/btree_gist.control   |  2 +-
 3 files changed, 84 insertions(+), 2 deletions(-)
 create mode 100644 contrib/btree_gist/btree_gist--1.8--1.9.sql

diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile
index 7ac2df26c10..68190ac5e46 100644
--- a/contrib/btree_gist/Makefile
+++ b/contrib/btree_gist/Makefile
@@ -34,7 +34,7 @@ DATA = btree_gist--1.0--1.1.sql \
btree_gist--1.1--1.2.sql btree_gist--1.2.sql btree_gist--1.2--1.3.sql \
btree_gist--1.3--1.4.sql btree_gist--1.4--1.5.sql \
btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql \
-   btree_gist--1.7--1.8.sql
+   btree_gist--1.7--1.8.sql btree_gist--1.8--1.9.sql
 PGFILEDESC = "btree_gist - B-tree equivalent GiST operator classes"
 
 REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \
diff --git a/contrib/btree_gist/btree_gist--1.8--1.9.sql b/contrib/btree_gist/btree_gist--1.8--1.9.sql
new file mode 100644
index 000..3fd6f33f41b
--- /dev/null
+++ b/contrib/btree_gist/btree_gist--1.8--1.9.sql
@@ -0,0 +1,82 @@
+/* contrib/btree_gist/btree_gist--1.8--1.9.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION btree_gist UPDATE TO '1.9'" to load this file. \quit
+
+ALTER OPERATOR FAMILY gist_oid_ops USING gist ADD
+	FUNCTION 11 (oid, oid) btoidsortsupport (internal) ;
+
+ALTER OPERATOR FAMILY gist_int2_ops USING gist ADD
+	FUNCTION 11 (int2, int2) btint2sortsupport (internal) ;
+
+ALTER OPERATOR FAMILY gist_int4_ops USING gist ADD
+	FUNCTION 11 (int4, int4) btint4sortsupport (internal) ;
+
+ALTER OPERATOR FAMILY gist_int8_ops USING gist ADD
+	FUNCTION 11 (int8, int8) btint8sortsupport (internal) ;
+
+ALTER OPERATOR FAMILY gist_float4_ops USING gist ADD
+	FUNCTION 11 (float4, float4) btfloat4sortsupport (internal) ;
+
+ALTER OPERATOR FAMILY gist_float8_ops USING gist ADD
+	FUNCTION 11 (float8, float8) btfloat8sortsupport (internal) ;
+
+ALTER OPERATOR FAMILY gist_timestamp_ops USING gist ADD
+	FUNCTION 11 (timestamp, timestamp) timestamp_sortsupport (internal) ;
+
+ALTER OPERATOR FAMILY gist_timestamptz_ops USING gist ADD
+	FUNCTION 11 (timestamptz, timestamptz) timestamp_sortsupport (internal) ;
+
+-- ALTER OPERATOR FAMILY gist_time_ops USING gist ADD
+-- 	FUNCTION 12 (time, time) gist_stratnum_btree (int2) ;
+
+ALTER OPERATOR FAMILY gist_date_ops USING gist ADD
+	FUNCTION 11 (date, date) date_sortsupport (internal)

Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-17 Thread Tomas Vondra



On 5/17/24 14:51, Andrey M. Borodin wrote:
> 
> 
>> On 17 May 2024, at 16:39, Robert Haas  wrote:
>>
>> I think Andrey Borodin's nearby suggestion of having a separate CfM
>> for each section of the CommitFest does a good job revealing just how
>> bad the current situation is. I agree with him: that would actually
>> work. Asking somebody, for a one-month period, to be responsible for
>> shepherding one-tenth or one-twentieth of the entries in the
>> CommitFest would be a reasonable amount of work for somebody. But we
>> will not find 10 or 20 highly motivated, well-qualified volunteers
>> every other month to do that work;
> 
> Why do you think so? Let’s just try to find more CFMs for July.
> When I felt that I’m overwhelmed, I asked for help and Alexander Alekseev 
> promptly agreed. That helped a lot.
> If I was in that position again, I would just ask 10 times on a 1st day :)
> 
>> it's a struggle to find one or two
>> highly motivated, well-qualified CommitFest managers, let alone ten or
>> twenty.
> 
> Because we are looking for one person to do a job for 10.
> 

Yes. It's probably easier to find more CF managers, each doing much less work.

>> So I think the right interpretation of his comment is that
>> managing the CommitFest has become about an order of magnitude more
>> difficult than what it needs to be for the task to be done well.
> 
> Let’s scale the process. Reduce responsibility area of a CFM, define it 
> clearer.
> And maybe even explicitly ask CFM to summarize patch status of each entry at 
> least once a CF.
> 

Should it even be up to the CFM to write the summary, or should he/she
be able to request an update from the patch author? Or at least have the
choice to do so.

I think we'll always struggle with the massive threads, because it's
really difficult to find the right balance between brevity and including
all the relevant details. Or rather impossible. I did try writing such
summaries for a couple of my long-running patches, and while it might
have helped, the challenge was to also explain why stuff was *not* done in
some alternative way, which is one of the things usually discussed. But
the summary gets very long, because there are many alternatives.

> 
> Can I do a small poll among those who is on this thread? Would you
volunteer to summarize a status of 20 patches in July’s CF? 5 each week
or so. One per day.
> 

Not sure. For many patches it'll be trivial. And for a bunch it'll be
very very time-consuming.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-17 Thread Tomas Vondra
On 5/16/24 23:43, Peter Eisentraut wrote:
> On 16.05.24 23:06, Joe Conway wrote:
>> On 5/16/24 16:57, Jacob Champion wrote:
>>> On Thu, May 16, 2024 at 1:31 PM Joe Conway  wrote:
>>>> Maybe we should just make it a policy that *nothing* gets moved forward
>>>> from commitfest-to-commitfest and therefore the author needs to care
>>>> enough to register for the next one?
>>>
>>> I think that's going to severely disadvantage anyone who doesn't do
>>> this as their day job. Maybe I'm bristling a bit too much at the
>>> wording, but not having time to shepherd a patch is not the same as
>>> not caring.
>>
>> Maybe the word "care" was a poor choice, but forcing authors to think
>> about and decide if they have the "time to shepherd a patch" for the
>> *next CF* is exactly the point. If they don't, why clutter the CF with
>> it.
> 
> Objectively, I think this could be quite effective.  You need to prove
> your continued interest in your project by pressing a button every two
> months.
> 
> But there is a high risk that this will double the annoyance for
> contributors whose patches aren't getting reviews.  Now, not only are
> you being ignored, but you need to prove that you're still there every
> two months.
> 

Yeah, I 100% agree with this. If a patch bitrots and no one cares enough
to rebase it once in a while, then sure - it's probably fine to mark it
RwF. But forcing all contributors to do a dance every 2 months just to
have a chance someone might take a look, seems ... not great.

I try to see this from the contributors' PoV, and with this process I'm
sure I'd start questioning if I even want to submit patches.

That is not to say we don't have a problem with patches that just move
to the next CF, and that we don't need to do something about that ...

Incidentally, I've been preparing some git+CF stats because of a talk
I'm expected to do, and it's very obvious the number of committed (and
rejected) CF entries is more or less stable over time, while the number
of patches that move to the next CF just snowballs.

My impression is a lot of these contributions/patches just never get the
review & attention that would allow them to move forward. Sure, some do
bitrot and/or get abandoned, and let's RwF those. But forcing everyone
to re-register the patches over and over seems like "reject by default".
I'd expect a lot of people to stop bothering and give up, and in a way
that would "solve" the bottleneck. But I'm not sure it's the solution
we'd like ...

It does seem to me a part of the solution needs to be helping to get
those patches reviewed. I don't know how to do that, but perhaps there's
a way to encourage people to review more stuff, or review stuff from a
wider range of contributors. Say by treating reviews more like proper
contributions.

Long time ago there was a "rule" that people submitting patches are
expected to do reviews. Perhaps we should be more strict about this.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-05-14 Thread Tomas Vondra
On 5/14/24 20:40, Melanie Plageman wrote:
> On Tue, May 14, 2024 at 2:33 PM Tomas Vondra
>  wrote:
>>
>> On 5/14/24 19:42, Melanie Plageman wrote:
>>> I've fixed this. I've also set enable_material off as I mentioned I
>>> might in my earlier mail.
>>>
>>
>> I'm not sure this (setting more and more GUCs to prevent hypothetical
>> plan changes) is a good practice. Because how do you know the plan does
>> not change for some other unexpected reason, possibly in the future?
> 
> Sure. So, you think it is better not to have enable_material = false?
> 

Right. Unless it's actually needed to force the necessary plan.

>> IMHO if the test requires a specific plan, it's better to do an actual
>> "explain (rows off, costs off)" to check that.
> 
> When you say "rows off", do you mean do something to ensure that it
> doesn't return tuples? Because I don't see a ROWS option for EXPLAIN
> in the docs.
> 

Sorry, I meant to hide the cardinality estimates in the explain, but I
got confused. "COSTS OFF" is enough.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-05-14 Thread Tomas Vondra
On 5/14/24 19:42, Melanie Plageman wrote:
> On Tue, May 14, 2024 at 2:18 AM Michael Paquier  wrote:
>>
>> On Mon, May 13, 2024 at 10:05:03AM -0400, Melanie Plageman wrote:
>>> Remove the assert and reset the field on which it previously asserted to
>>> avoid incorrectly emitting NULL-filled tuples from a previous scan on
>>> rescan.
>>
>>> - Assert(scan->rs_empty_tuples_pending == 0);
>>> + scan->rs_empty_tuples_pending = 0;
>>
>> Perhaps this should document the reason why the reset is done in these
>> two paths rather than let the reader guess it?  And this is about
>> avoiding emitting some tuples from a previous scan.
> 
> I've added a comment to heap_rescan() in the attached v5. Doing so
> made me realize that we shouldn't bother resetting it in
> heap_endscan(). Doing so is perhaps more confusing, because it implies
> that field may somehow be used later. I've removed the reset of
> rs_empty_tuples_pending from heap_endscan().
> 

+1

>>> +SET enable_indexonlyscan = off;
>>> +set enable_indexscan = off;
>>> +SET enable_seqscan = off;
>>
>> Nit: adjusting the casing of the second SET here.
> 
> I've fixed this. I've also set enable_material off as I mentioned I
> might in my earlier mail.
> 

I'm not sure this (setting more and more GUCs to prevent hypothetical
plan changes) is a good practice. Because how do you know the plan does
not change for some other unexpected reason, possibly in the future?

IMHO if the test requires a specific plan, it's better to do an actual
"explain (rows off, costs off)" to check that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for GIN indexes

2024-05-13 Thread Tomas Vondra
On 5/13/24 10:19, Andy Fan wrote:
> 
> Tomas Vondra  writes:
> 
>> ...
>>
>> I don't understand the question. The blocks are distributed to workers
>> by the parallel table scan, and it certainly does not do that block by
>> block. But even it it did, that's not a problem for this code.
> 
> OK, I get ParallelBlockTableScanWorkerData.phsw_chunk_size is designed
> for this.
> 
>> The problem is that if the scan wraps around, then one of the TID lists
>> for a given worker will have the min TID and max TID, so it will overlap
>> with every other TID list for the same key in that worker. And when the
>> worker does the merging, this list will force a "full" merge sort for
>> all TID lists (for that key), which is very expensive.
> 
> OK.
> 
> Thanks for all the answers, they are pretty instructive!
> 

Thanks for the questions, they force me to articulate the arguments more
clearly. I guess it'd be good to put some of this into a README or at
least a comment at the beginning of gininsert.c or somewhere close.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-05-11 Thread Tomas Vondra
On 5/10/24 21:48, Melanie Plageman wrote:
> On Thu, May 2, 2024 at 5:37 PM Tomas Vondra
>  wrote:
>>
>>
>>
>> On 4/24/24 22:46, Melanie Plageman wrote:
>>> On Tue, Apr 23, 2024 at 6:43 PM Tomas Vondra
>>>  wrote:
>>>>
>>>> On 4/23/24 18:05, Melanie Plageman wrote:
>>>>> The patch with a fix is attached. I put the test in
>>>>> src/test/regress/sql/join.sql. It isn't the perfect location because
>>>>> it is testing something exercisable with a join but not directly
>>>>> related to the fact that it is a join. I also considered
>>>>> src/test/regress/sql/select.sql, but it also isn't directly related to
>>>>> the query being a SELECT query. If there is a better place for a test
>>>>> of a bitmap heap scan edge case, let me know.
>>>>
>>>> I don't see a problem with adding this to join.sql - why wouldn't this
>>>> count as something related to a join? Sure, it's not like this code
>>>> matters only for joins, but if you look at join.sql that applies to a
>>>> number of other tests (e.g. there are a couple btree tests).
>>>
>>> I suppose it's true that other tests in this file use joins to test
>>> other code. I guess if we limited join.sql to containing tests of join
>>> implementation, it would be rather small. I just imagined it would be
>>> nice if tests were grouped by what they were testing -- not how they
>>> were testing it.
>>>
>>>> That being said, it'd be good to explain in the comment why we're
>>>> testing this particular plan, not just what the plan looks like.
>>>
>>> You mean I should explain in the test comment why I included the
>>> EXPLAIN plan output? (because it needs to contain a bitmapheapscan to
>>> actually be testing anything)
>>>
>>
>> No, I meant that the comment before the test describes a couple
>> requirements the plan needs to meet (no need to process all inner
>> tuples, bitmapscan eligible for skip_fetch on outer side, ...), but it
>> does not explain why we're testing that plan.
>>
>> I could get to that by doing git-blame to see what commit added this
>> code, and then read the linked discussion. Perhaps that's enough, but
>> maybe the comment could say something like "verify we properly discard
>> tuples on rescans" or something like that?
> 
> Attached is v3. I didn't use your exact language because the test
> wouldn't actually verify that we properly discard the tuples. Whether
> or not the empty tuples are all emitted, it just resets the counter to
> 0. I decided to go with "exercise" instead.
> 

I did go over the v3 patch, did a bunch of tests, and I think it's fine
and ready to go. The one thing that might need some minor tweaks is the
commit message.

1) Isn't the subject "Remove incorrect assert" a bit misleading, as the
patch does not simply remove an assert, but replaces it with a reset of
the field the assert used to check? (The commit message does not mention
this either, at least not explicitly.)

2) The "heap AM-specific bitmap heap scan code" sounds a bit strange to
me, isn't the first "heap" unnecessary?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for GIN indexes

2024-05-10 Thread Tomas Vondra
On 5/10/24 07:53, Andy Fan wrote:
> 
> Tomas Vondra  writes:
> 
>>> I guess both of you are talking about worker process, if here are
>>> something in my mind:
>>>
>>> *btbuild* also lets the WORKER dump the tuples into the Sharedsort struct
>>> and lets the LEADER merge them directly. I think the aim of this design
>>> is that it can potentially save a merge run. In the current patch, the
>>> worker dumps to a local tuplesort and merges the runs, and then the leader
>>> runs the merges again. I admit the goal of this patch is reasonable, but I
>>> feel we need to adopt this approach conditionally somehow, and if we find
>>> the way, we can apply it to btbuild as well. 
>>>
>>
>> I'm a bit confused about what you're proposing here, or how is that
>> related to what this patch is doing and/or to the what Matthias
>> mentioned in his e-mail from last week.
>>
>> Let me explain the relevant part of the patch, and how I understand the
>> improvement suggested by Matthias. The patch does the work in three phases:
> 
> What's in my mind is:
> 
> 1. WORKER-1
> 
> Tempfile 1:
> 
> key1:  1
> key3:  2
> ...
> 
> Tempfile 2:
> 
> key5:  3
> key7:  4
> ...
> 
> 2. WORKER-2
> 
> Tempfile 1:
> 
> Key2: 1
> Key6: 2
> ...
> 
> Tempfile 2:
> Key3: 3
> Key6: 4
> ..
> 
> In the above example: if we do the merge in the LEADER, only 1 mergerun
> is needed - reading 4 tempfiles (8 tuples in total) and writing 8 tuples.
> 
> If we add another mergerun in the WORKER, the result will be:
> 
> WORKER1:  reads 2 tempfiles (4 tuples) and writes 1 tempfile (called X)
> with 4 tuples.
> WORKER2:  reads 2 tempfiles (4 tuples) and writes 1 tempfile (called Y)
> with 4 tuples.
> 
> LEADER:  reads 2 tempfiles (X & Y) with 8 tuples in total and writes them
> into the final tempfile.
> 
> So the intermediate results X & Y require some extra effort. So I think
> the "extra mergerun in worker" is *not always* a win, and my proposal is
> that we need to distinguish the cases in which we should add the
> "extra mergerun in worker" step.
> 

The thing you're forgetting is that the mergesort in the worker is
*always* a simple append, because the lists are guaranteed to be
non-overlapping, so it's very cheap. The lists from different workers
are however very likely to overlap, and hence a "full" mergesort is
needed, which is way more expensive.

And not only that - without the intermediate merge, there will be very
many of those lists the leader would have to merge.

If we do the append-only merges in the workers first, we still need to
merge them in the leader, of course, but we have few lists to merge
(only about one per worker).
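
To illustrate, the "can we just concatenate?" check for two sorted TID
lists boils down to something like this (a rough sketch, not code from
the patch):

  #include "postgres.h"
  #include "storage/itemptr.h"

  /*
   * Rough sketch (not from the patch): two sorted TID lists can be merged
   * by simply appending one after the other iff they don't overlap, i.e.
   * the last TID of the first list sorts before the first TID of the
   * second one. That's guaranteed for the lists a single worker produces
   * for a key, but lists from different workers usually overlap and thus
   * need the full merge sort.
   */
  static bool
  tidlists_can_concat(ItemPointer a, int na, ItemPointer b, int nb)
  {
      return (na == 0) || (nb == 0) ||
          (ItemPointerCompare(&a[na - 1], &b[0]) < 0);
  }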

Of course, this means extra I/O on the intermediate tuplesort, and it's
not difficult to imagine cases with no benefit, or perhaps even a
regression. For example, if the keys are unique, the in-worker merge
step can't really do anything. But that seems quite unlikely IMHO.

Also, if this overhead was really significant, we would not see the nice
speedups I measured during testing.

>> The trouble with (2) is that it "just copies" data from one tuplesort
>> into another, increasing the disk space requirements. In an extreme
>> case, when nothing can be combined, it pretty much doubles the amount of
>> disk space, and makes the build longer.
> 
> This sounds like the same question as I raised above. However, my proposal
> is to determine which cost is bigger - "the cost saving from
> merging TIDs in WORKERS" or the "cost paid because of the extra copy" -
> and then do it only when we are sure we can benefit from it, but I
> know it is hard and I'm not sure if it is doable. 
> 

Yeah. I'm not against picking the right execution strategy during the
index build, but it's going to be difficult, because we really don't
have the information to make a reliable decision.

We can't even use the per-column stats, because it does not say much
about the keys extracted by GIN, I think. And we need to do the decision
at the very beginning, before we write the first batch of data either to
the local or shared tuplesort.

But maybe we could wait until we need to flush the first batch of data
(in the callback), and make the decision then? In principle, if we only
flush once at the end, the intermediate sort is not needed at all (fairly
unlikely for large data sets, though).

Well, in principle, maybe we could even start writing into the local
tuplesort, and then "rethink" after a while and switch to the shared
one. We'd still need to copy data we've already written to the local
tuplesort, but hopefully that'd be just a fraction compared to doing
that for the whole table.
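
To make that a bit more concrete, the "decide at the first flush"
heuristic might look something like this (purely a sketch - the struct,
the fields and the threshold are made up, there's no such code in the
patch):

  #include "postgres.h"

  /* hypothetical bookkeeping, not present in the patch */
  typedef struct SortDecision
  {
      bool        decided;         /* strategy picked already? */
      bool        use_local_sort;  /* keep the worker-local tuplesort? */
      int64       keys_buffered;   /* distinct keys in the deform buffer */
      int64       tids_buffered;   /* TIDs accumulated for those keys */
  } SortDecision;

  static void
  decide_sort_strategy(SortDecision *state)
  {
      if (state->decided)
          return;

      /*
       * If the deform buffer combines only very few TIDs per key, the
       * local sort would mostly just copy data to disk, so write straight
       * into the shared tuplesort instead. The factor 2 is arbitrary.
       */
      state->use_local_sort =
          (state->tids_buffered > 2 * state->keys_buffered);
      state->decided = true;
  }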

Re: Parallel CREATE INDEX for GIN indexes

2024-05-09 Thread Tomas Vondra



On 5/9/24 17:51, Matthias van de Meent wrote:
> On Thu, 9 May 2024 at 15:13, Tomas Vondra  
> wrote:
>> Let me explain the relevant part of the patch, and how I understand the
>> improvement suggested by Matthias. The patch does the work in three phases:
>>
>>
>> 1) Worker gets data from table, split that into index items and add
>> those into a "private" tuplesort, and finally sorts that. So a worker
>> may see a key many times, with different TIDs, so the tuplesort may
>> contain many items for the same key, with distinct TID lists:
>>
>>key1: 1, 2, 3, 4
>>key1: 5, 6, 7
>>key1: 8, 9, 10
>>key2: 1, 2, 3
>>...
> 
> This step is actually split in several components/phases, too.
> As opposed to btree, which directly puts each tuple's data into the
> tuplesort, this GIN approach actually buffers the tuples in memory to
> generate these TID lists for data keys, and flushes these pairs of Key
> + TID list into the tuplesort when its own memory limit is exceeded.
> That means we essentially double the memory used for this data: One
> GIN deform buffer, and one in-memory sort buffer in the tuplesort.
> This is fine for now, but feels duplicative, hence my "let's allow
> tuplesort to merge the key+TID pairs into pairs of key+TID list"
> comment.
> 

True, although the "GIN deform buffer" (flushed by the callback if using
too much memory) likely does most of the merging already. If it only
happened in the tuplesort merge, we'd likely have far more tuples and
overhead associated with that. So we certainly won't get rid of either
of these things.

You're right the memory limits are a bit unclear, and need more thought.
I certainly have not thought very much about not using more than the
specified maintenance_work_mem amount. This includes the planner code
determining the number of workers to use - right now it simply does the
same thing as for btree/brin, i.e. assumes each worker uses 32MB of
memory and checks how many workers fit into maintenance_work_mem.

That was a bit bogus even for BRIN, because BRIN sorts only summaries,
which are typically tiny - perhaps a couple of kB, much less than 32MB. But
it's still just one sort, and some opclasses may be much larger (like
bloom, for example). So I just went with it.

But for GIN it's more complicated, because we have two tuplesorts (not
sure if both can use the memory at the same time) and the GIN deform
buffer. Which probably means we need to have a per-worker allowance
considering all these buffers.
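
Just to illustrate the direction (a strawman, not what the patch does
now), the per-worker allowance might be derived like this:

  #include "postgres.h"
  #include "miscadmin.h"          /* maintenance_work_mem */

  /*
   * Strawman sketch, not the patch's code: split a per-participant share
   * of maintenance_work_mem between the GIN deform buffer and the two
   * tuplesorts, instead of assuming a flat 32MB per worker. The
   * 1/2 : 1/4 : 1/4 split is arbitrary, just to show the idea.
   */
  static void
  gin_worker_mem_budget(int nparticipants,    /* workers + leader */
                        int *deform_kb, int *local_sort_kb,
                        int *shared_sort_kb)
  {
      int     per_worker_kb = maintenance_work_mem / Max(nparticipants, 1);

      *deform_kb = per_worker_kb / 2;
      *local_sort_kb = per_worker_kb / 4;
      *shared_sort_kb = per_worker_kb - *deform_kb - *local_sort_kb;
  }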

>> The trouble with (2) is that it "just copies" data from one tuplesort
>> into another, increasing the disk space requirements. In an extreme
>> case, when nothing can be combined, it pretty much doubles the amount of
>> disk space, and makes the build longer.
>>
>> What I think Matthias is suggesting, is that this "TID list merging"
>> could be done directly as part of the tuplesort in step (1). So instead
>> of just moving the "sort tuples" from the appropriate runs, it could
>> also do an optional step of combining the tuples and writing this
>> combined tuple into the tuplesort result (for that worker).
> 
> Yes, but with a slightly more extensive approach than that even, see above.
> 
>> Matthias also mentioned this might be useful when building btree indexes
>> with key deduplication.
>>
>> AFAICS this might work, although it probably requires for the "combined"
>> tuple to be smaller than the sum of the combined tuples (in order to fit
>> into the space).
> 
> *must not be larger than the sum; not "must be smaller than the sum" [^0].

Yeah, I wrote that wrong.

> For btree tuples with posting lists this is guaranteed to be true: The
> added size of a btree tuple with a posting list (containing at least 2
> values) vs one without is the maxaligned size of 2 TIDs, or 16 bytes
> (12 on 32-bit systems). The smallest btree tuple with data is also 16
> bytes (or 12 bytes on 32-bit systems), so this works out nicely.
> 
>> But at least in the GIN build in the workers this is
>> likely true, because the TID lists do not overlap (and thus not hurting
>> the compressibility).
>>
>> That being said, I still see this more as an optimization than something
>> required for the patch,
> 
> Agreed.
> 

OK

>> and I don't think I'll have time to work on this
>> anytime soon. The patch is not extremely complex, but it's not trivial
>> either. But if someone wants to take a stab at extending tuplesort to
>> allow this, I won't object ...
> 
> Same here: While I do have some ideas on where and how to implement
> this, I'm not planning on working

Re: Parallel CREATE INDEX for GIN indexes

2024-05-09 Thread Tomas Vondra
On 5/2/24 20:22, Tomas Vondra wrote:
>> 
>>> For some of the opclasses it can regress (like the jsonb_path_ops). I
>>> don't think that's a major issue. Or more precisely, I'm not surprised
>>> by it. It'd be nice to be able to disable the parallel builds in these
>>> cases somehow, but I haven't thought about that.
>>
>> Do you know why it regresses?
>>
> 
> No, but one thing that stands out is that the index is much smaller than
> the other columns/opclasses, and the compression does not save much
> (only about 5% for both phases). So I assume it's the overhead of
> writing and reading a bunch of GB of data without really gaining
> much from doing that.
> 

I finally got to look into this regression, but I think I must have done
something wrong before because I can't reproduce it. This is the timings
I get now, if I rerun the benchmark:

 workers   trgm   tsvector   jsonb   jsonb (hash)
 -------------------------------------------------
    0      1225      404      104        56
    1       772      180       57        60
    2       549      143       47        52
    3       426      127       43        50
    4       364      116       40        48
    5       323      111       38        46
    6       292      111       37        45

and the speedup, relative to serial build:


 workers   trgm   tsvector   jsonb   jsonb (hash)
 -------------------------------------------------
    1       63%       45%      54%      108%
    2       45%       35%      45%       94%
    3       35%       31%      41%       89%
    4       30%       29%      38%       86%
    5       26%       28%      37%       83%
    6       24%       28%      35%       81%

So there's a small regression for the jsonb_path_ops opclass, but only
with one worker. After that, it gets a bit faster than serial build.
While not a great speedup, it's far better than the earlier results that
showed maybe 40% regression.

I don't know what I did wrong before - maybe I had a build with extra
debug info or something like that? No idea why that would affect only
one of the opclasses. But this time I made doubly sure the results are
correct etc.

Anyway, I'm fairly happy with these results. I don't think it's
surprising there are cases where parallel build does not help much.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for GIN indexes

2024-05-09 Thread Tomas Vondra
ler than the sum of the combined tuples (in order to fit
into the space). But at least in the GIN build in the workers this is
likely true, because the TID lists do not overlap (and thus not hurting
the compressibility).

That being said, I still see this more as an optimization than something
required for the patch, and I don't think I'll have time to work on this
anytime soon. The patch is not extremely complex, but it's not trivial
either. But if someone wants to take a stab at extending tuplesort to
allow this, I won't object ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for GIN indexes

2024-05-09 Thread Tomas Vondra
On 5/9/24 12:14, Andy Fan wrote:
> 
> Tomas Vondra  writes:
> 
>> 3) v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch
>>
>> In 0002 the workers still do an explicit qsort() on the TID list before
>> writing the data into the shared tuplesort. But we can do better - the
>> workers can do a merge sort too. To help with this, we add the first TID
>> to the tuplesort tuple, and sort by that too - it helps the workers to
>> process the data in an order that allows simple concatenation instead of
>> the full mergesort.
>>
>> Note: There's a non-obvious issue due to parallel scans always being
>> "sync scans", which may lead to very "wide" TID ranges when the scan
>> wraps around. More about that later.
> 
> This is really amazing.
> 
>> 7) v20240502-0007-Detect-wrap-around-in-parallel-callback.patch
>>
>> There's one more efficiency problem - the parallel scans are required to
>> be synchronized, i.e. the scan may start half-way through the table, and
>> then wrap around. Which however means the TID list will have a very wide
>> range of TID values, essentially the min and max TID for the key.
>>
>> Without 0006 this would cause frequent failures of the index build, with
>> the error I already mentioned:
>>
>>   ERROR: could not split GIN page; all old items didn't fit
> 
> I have two questions here and both of them are general GIN index questions
> rather than the patch here.
> 
> 1. What does the "wrap around" mean in the "the scan may start half-way
> through the table, and then wrap around".  Searching "wrap" in
> gin/README gets nothing. 
> 

The "wrap around" is about the scan used to read data from the table
when building the index. A "sync scan" may start e.g. at TID (1000,0)
and read till the end of the table, and then wraps and returns the
remaining part at the beginning of the table for blocks 0-999.

This means the callback would not see a monotonically increasing
sequence of TIDs.

Which is why the serial build disables sync scans, allowing simply
appending values to the sorted list, and even with regular flushes of
data into the index we can simply append data to the posting lists.
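
For illustration, the wrap-around detection the parallel callback needs
(the 0007 patch) is really just noticing the block number going
backwards - roughly this (the struct and field names are made up, it's
not the patch's actual code):

  #include "postgres.h"
  #include "storage/block.h"
  #include "storage/itemptr.h"

  /* hypothetical per-worker state, not the patch's actual struct */
  typedef struct WrapDetect
  {
      BlockNumber prev_blkno;     /* initialize to InvalidBlockNumber */
      bool        scan_wrapped;
  } WrapDetect;

  /*
   * Rough sketch: a synchronized scan hands out blocks in increasing
   * order until it wraps back to the start of the table, so the first
   * time the block number decreases we know the scan wrapped around.
   */
  static void
  detect_wraparound(WrapDetect *state, ItemPointer tid)
  {
      BlockNumber blkno = ItemPointerGetBlockNumber(tid);

      if (BlockNumberIsValid(state->prev_blkno) && blkno < state->prev_blkno)
          state->scan_wrapped = true;

      state->prev_blkno = blkno;
  }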

> 2. I can't understand the below error.
> 
>>   ERROR: could not split GIN page; all old items didn't fit
> 
> When the posting list is too long, we have the posting tree strategy. So in
> which situation could we get this ERROR? 
> 

AFAICS the index build relies on the assumption that we only append data
to the TID list on a leaf page, and when the page gets split, the "old"
part will always fit. Which may not be true, if there was a wrap around
and we're adding low TID values to the list on the leaf page.

FWIW the error in dataBeginPlaceToPageLeaf looks like this:

  if (!append || ItemPointerCompare(&maxOldItem, &remaining) >= 0)
elog(ERROR, "could not split GIN page; all old items didn't fit");

It can fail simply because of the !append part.

I'm not sure why dataBeginPlaceToPageLeaf() relies on this assumption -
I'm not that familiar with the GIN internals in general - and I haven't
found any explanation. But AFAIK this is why the serial build disables
sync scans.

>> issue with efficiency - having such a wide TID list forces the mergesort
>> to actually walk the lists, because this wide list overlaps with every
>> other list produced by the worker.
> 
> If we split the blocks among workers 1 block at a time, we will have a
> serious issue like the one here.  If we can distribute them N blocks at a
> time, and the N blocks roughly fill the work_mem that backs the dedicated
> temp file, we can make things much better, can't we? 
> 

I don't understand the question. The blocks are distributed to workers
by the parallel table scan, and it certainly does not do that block by
block. But even if it did, that's not a problem for this code.

The problem is that if the scan wraps around, then one of the TID lists
for a given worker will have the min TID and max TID, so it will overlap
with every other TID list for the same key in that worker. And when the
worker does the merging, this list will force a "full" merge sort for
all TID lists (for that key), which is very expensive.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-05-02 Thread Tomas Vondra



On 4/24/24 22:46, Melanie Plageman wrote:
> On Tue, Apr 23, 2024 at 6:43 PM Tomas Vondra
>  wrote:
>>
>> On 4/23/24 18:05, Melanie Plageman wrote:
>>> The patch with a fix is attached. I put the test in
>>> src/test/regress/sql/join.sql. It isn't the perfect location because
>>> it is testing something exercisable with a join but not directly
>>> related to the fact that it is a join. I also considered
>>> src/test/regress/sql/select.sql, but it also isn't directly related to
>>> the query being a SELECT query. If there is a better place for a test
>>> of a bitmap heap scan edge case, let me know.
>>
>> I don't see a problem with adding this to join.sql - why wouldn't this
>> count as something related to a join? Sure, it's not like this code
>> matters only for joins, but if you look at join.sql that applies to a
>> number of other tests (e.g. there are a couple btree tests).
> 
> I suppose it's true that other tests in this file use joins to test
> other code. I guess if we limited join.sql to containing tests of join
> implementation, it would be rather small. I just imagined it would be
> nice if tests were grouped by what they were testing -- not how they
> were testing it.
> 
>> That being said, it'd be good to explain in the comment why we're
>> testing this particular plan, not just what the plan looks like.
> 
> You mean I should explain in the test comment why I included the
> EXPLAIN plan output? (because it needs to contain a bitmapheapscan to
> actually be testing anything)
> 

No, I meant that the comment before the test describes a couple
requirements the plan needs to meet (no need to process all inner
tuples, bitmapscan eligible for skip_fetch on outer side, ...), but it
does not explain why we're testing that plan.

I could get to that by doing git-blame to see what commit added this
code, and then read the linked discussion. Perhaps that's enough, but
maybe the comment could say something like "verify we properly discard
tuples on rescans" or something like that?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-05-02 Thread Tomas Vondra
On 4/30/24 14:07, Daniel Gustafsson wrote:
>> On 26 Apr 2024, at 15:04, Melanie Plageman  wrote:
> 
>> If this seems correct to you, are you okay with the rest of the fix
>> and test? We could close this open item once the patch is acceptable.
> 
> From reading the discussion and the patch this seems like the right fix to me.

I agree.

> Does the test added here aptly cover 04e72ed617be in terms its functionality?
> 

AFAIK the test fails without the fix and works with it, so I believe it
does cover the relevant functionality.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for GIN indexes

2024-05-02 Thread Tomas Vondra



On 5/2/24 19:12, Matthias van de Meent wrote:
> On Thu, 2 May 2024 at 17:19, Tomas Vondra  
> wrote:
>>
>> Hi,
>>
>> In PG17 we shall have parallel CREATE INDEX for BRIN indexes, and back
>> when working on that I was thinking how difficult would it be to do
>> something similar to do that for other index types, like GIN. I even had
>> that on my list of ideas to pitch to potential contributors, as I was
>> fairly sure it's doable and reasonably isolated / well-defined.
>>
>> However, I was not aware of any takers, so a couple days ago on a slow
>> weekend I took a stab at it. And yes, it's doable - attached is a fairly
>> complete, tested and polished version of the feature, I think. It turned
>> out to be a bit more complex than I expected, for reasons that I'll get
>> into when discussing the patches.
> 
> This is great. I've been thinking about approximately the same issue
> recently, too, but haven't had time to discuss/implement any of this
> yet. I think some solutions may even be portable to the btree parallel
> build: it also has key deduplication (though to a much smaller degree)
> and could benefit from deduplication during the scan/ssup load phase,
> rather than only during insertion.
> 

Perhaps, although I'm not that familiar with the details of btree
builds, and I haven't thought about it when working on this over the
past couple days.

>> First, let's talk about the benefits - how much faster is that than the
>> single-process build we have for GIN indexes? I do have a table with the
>> archive of all our mailing lists - it's ~1.5M messages, table is ~21GB
>> (raw dump is about 28GB). This does include simple text data (message
>> body), JSONB (headers) and tsvector (full-text on message body).
> 
> Sidenote: Did you include the tsvector in the table to reduce time
> spent during index creation? I would have used an expression in the
> index definition, rather than a direct column.
> 

Yes, it's a materialized column, not computed during index creation.

>> If I do CREATE index with different number of workers (0 means serial
>> build), I get this timings (in seconds):
> 
> [...]
> 
>> This shows the benefits are pretty nice, depending on the opclass. For
>> most indexes it's maybe ~3-4x faster, which is nice, and I don't think
>> it's possible to do much better - the actual index inserts can happen
>> from a single process only, which is the main limit.
> 
> Can we really not insert with multiple processes? It seems to me that
> GIN could be very suitable for that purpose, with its clear double
> tree structure distinction that should result in few buffer conflicts
> if different backends work on known-to-be-very-different keys.
> We'd probably need multiple read heads on the shared tuplesort, and a
> way to join the generated top-level subtrees, but I don't think that
> is impossible. Maybe it's work for later effort though.
> 

Maybe, but I took it as a restriction and it seemed too difficult to
relax (or at least I assume that).

> Have you tested and/or benchmarked this with multi-column GIN indexes?
> 

I did test that, and I'm not aware of any bugs/issues. Performance-wise
it depends on which opclasses are used by the columns - if you take the
speedup for each of them independently, the speedup for the whole index
is roughly the average of that.

>> For some of the opclasses it can regress (like the jsonb_path_ops). I
>> don't think that's a major issue. Or more precisely, I'm not surprised
>> by it. It'd be nice to be able to disable the parallel builds in these
>> cases somehow, but I haven't thought about that.
> 
> Do you know why it regresses?
> 

No, but one thing that stands out is that the index is much smaller than
the other columns/opclasses, and the compression does not save much
(only about 5% for both phases). So I assume it's the overhead of
> writing and reading a bunch of GB of data without really gaining
much from doing that.

>> I do plan to do some tests with btree_gin, but I don't expect that to
>> behave significantly differently.
>>
>> There are small variations in the index size, when built in the serial
>> way and the parallel way. It's generally within ~5-10%, and I believe
>> it's due to the serial build adding the TIDs incrementally, while the
>> build adds them in much larger chunks (possibly even in one chunk with
>> all the TIDs for the key).
> 
> I assume that was '[...] while the [parallel] build adds them [...]', right?
> 

Right. The parallel build adds them in larger chunks.

>> I believe the same size variation can happen
>> if the index gets built in a different way, e.g. by inserting the data

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-23 Thread Tomas Vondra



On 4/23/24 18:05, Melanie Plageman wrote:
> On Mon, Apr 22, 2024 at 1:01 PM Melanie Plageman
>  wrote:
>>
>> On Thu, Apr 18, 2024 at 5:39 AM Tomas Vondra
>>  wrote:
>>>
>>> On 4/18/24 09:10, Michael Paquier wrote:
>>>> On Sun, Apr 07, 2024 at 10:54:56AM -0400, Melanie Plageman wrote:
>>>>> I've added an open item [1], because what's one open item when you can
>>>>> have two? (me)
>>>>
>>>> And this is still an open item as of today.  What's the plan to move
>>>> forward here?
>>>
>>> AFAIK the plan is to replace the asserts with actually resetting the
>>> rs_empty_tuples_pending field to 0, as suggested by Melanie a week ago.
>>> I assume she was busy with the post-freeze AM reworks last week, so this
>>> was on a back burner.
>>
>> yep, sorry. Also I took a few days off. I'm just catching up today. I
>> want to pop in one of Richard or Tomas' examples as a test, since it
>> seems like it would add some coverage. I will have a patch soon.
> 
> The patch with a fix is attached. I put the test in
> src/test/regress/sql/join.sql. It isn't the perfect location because
> it is testing something exercisable with a join but not directly
> related to the fact that it is a join. I also considered
> src/test/regress/sql/select.sql, but it also isn't directly related to
> the query being a SELECT query. If there is a better place for a test
> of a bitmap heap scan edge case, let me know.
> 

I don't see a problem with adding this to join.sql - why wouldn't this
count as something related to a join? Sure, it's not like this code
matters only for joins, but if you look at join.sql that applies to a
number of other tests (e.g. there are a couple btree tests).

That being said, it'd be good to explain in the comment why we're
testing this particular plan, not just what the plan looks like.

> One other note: there is some concurrency effect in the parallel
> schedule group containing "join" where you won't trip the assert if
> all the tests in that group in the parallel schedule are run. But, if
> you would like to verify that the test exercises the correct code,
> just reduce the group containing "join".
> 

That is ... interesting. Doesn't that mean that most test runs won't
actually detect the problem? That would make the test a bit useless.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: createdb compares strategy as case-sensitive

2024-04-21 Thread Tomas Vondra
On 4/21/24 17:10, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 4/21/24 00:19, Tom Lane wrote:
>>> I'm not suggesting that this is an interesting security vulnerability,
>>> because if you can control the arguments to createdb it's probably
>>> game over long since.  But wrapping the arguments is good for
>>> delivering on-point error messages.  So I'd add a fmtId() call to
>>> LOCALE_PROVIDER too.
> 
>> OK, the attached 0001 patch does these three things - adds the fmtId()
>> for locale_provider, make the comparison case-insensitive for strategy
>> and also removes the comma from the hint.
> 
> LGTM.
> 

Pushed, after tweaking the commit message a bit.

>> The createdb vs. CREATE DATABASE difference made me look if we have any
>> regression tests for CREATE DATABASE, and we don't. I guess it would be
>> good to have some, so I added a couple, for some of the parameters, see
>> 0002. But there's a problem with the locale stuff - this seems to work
>> in plain "make check", but pg_upgrade creates the clusters with
>> different providers etc. which changes the expected output. I'm not sure
>> there's a good way to deal with this ...
> 
> Probably not worth the trouble, really.
> 

Agreed.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: createdb compares strategy as case-sensitive

2024-04-21 Thread Tomas Vondra


On 4/21/24 00:19, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 4/20/24 22:40, Tom Lane wrote:
>>> Seems reasonable.  The alternative could be to remove createdb.c's use
>>> of fmtId() here, but I don't think that's actually better.
> 
>> Why? It seems to me this is quite close to e.g. LOCALE_PROVIDER, and we
>> don't do fmtId() for that. So why should we do that for STRATEGY?
> 
> Hah, nothing like being randomly inconsistent with adjacent code.
> Every other argument handled by createdb gets wrapped by either
> fmtId or appendStringLiteralConn.
> 
> I think the argument for this is it ensures that the switch value as
> accepted by createdb is the argument that CREATE DATABASE will see.
> Compare
> 
> $ createdb --strategy="foo bar" mydb
> createdb: error: database creation failed: ERROR:  invalid create database 
> strategy "foo bar"
> HINT:  Valid strategies are "wal_log", and "file_copy".
> 
> $ createdb --locale-provider="foo bar" mydb
> createdb: error: database creation failed: ERROR:  syntax error at or near ";"
> LINE 1: CREATE DATABASE mydb LOCALE_PROVIDER foo bar;
> ^
> 
> I'm not suggesting that this is an interesting security vulnerability,
> because if you can control the arguments to createdb it's probably
> game over long since.  But wrapping the arguments is good for
> delivering on-point error messages.  So I'd add a fmtId() call to
> LOCALE_PROVIDER too.
> 
> BTW, someone's taken the Oxford comma too far in that HINT.
> Nobody writes a comma when there are only two alternatives.
> 

OK, the attached 0001 patch does these three things - adds the fmtId()
for locale_provider, make the comparison case-insensitive for strategy
and also removes the comma from the hint.

The createdb vs. CREATE DATABASE difference made me look if we have any
regression tests for CREATE DATABASE, and we don't. I guess it would be
good to have some, so I added a couple, for some of the parameters, see
0002. But there's a problem with the locale stuff - this seems to work
in plain "make check", but pg_upgrade creates the clusters with
different providers etc. which changes the expected output. I'm not sure
there's a good way to deal with this ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom e7a18c780591673592b2a5eb1e129fcceb17f0fe Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 21 Apr 2024 14:19:32 +0200
Subject: [PATCH 1/2] createdb: compare strategy case-insensitive

When specifying the createdb strategy, the documentation suggests valid
options are FILE_COPY and WAL_LOG, but the code does case-sensitive
comparison and accepts only "file_copy" and "wal_log" as valid.

Fixed by doing a case-insensitive comparison using pg_strcasecmp(), same
as for other string parameters.

While at it, apply fmtId() to a nearby "locale_provider". This already
did the comparison in case-insensitive way, but the value would not be
double-quoted, confusing the parser and the error message.

Backpatch to 15, where the strategy was introduced.

Backpatch-through: 15
---
 src/backend/commands/dbcommands.c |  6 +++---
 src/bin/scripts/createdb.c|  2 +-
 src/bin/scripts/t/020_createdb.pl | 10 ++
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 8229dfa1f22..cd06d1270c5 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1018,15 +1018,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 		char	   *strategy;
 
 		strategy = defGetString(dstrategy);
-		if (strcmp(strategy, "wal_log") == 0)
+		if (pg_strcasecmp(strategy, "wal_log") == 0)
 			dbstrategy = CREATEDB_WAL_LOG;
-		else if (strcmp(strategy, "file_copy") == 0)
+		else if (pg_strcasecmp(strategy, "file_copy") == 0)
 			dbstrategy = CREATEDB_FILE_COPY;
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 	 errmsg("invalid create database strategy \"%s\"", strategy),
-	 errhint("Valid strategies are \"wal_log\", and \"file_copy\".")));
+	 errhint("Valid strategies are \"wal_log\" and \"file_copy\".")));
 	}
 
 	/* If encoding or locales are defaulted, use source's setting */
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 007061e756f..30426860efa 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -237,7 +237,7 @@ main(int argc, char *argv[])
 		appendStringLiteralConn(&sql, lc_ctype, conn);
 	}
 	if (locale_provider)
-		appendPQExpBuffer(&sql, " LOCALE_PROVIDER %s", locale_provider);

Re: createdb compares strategy as case-sensitive

2024-04-20 Thread Tomas Vondra



On 4/20/24 22:40, Tom Lane wrote:
> Tomas Vondra  writes:
>> While doing some testing with createdb, I noticed it only accepts
>> file_copy/wal_log as valid strategies, not FILE_COPY/WAL_LOG (which is
>> what the docs say). The same thing applies to CREATE DATABASE.
> 
> Hmm, actually it does work in CREATE DATABASE:
> 
> regression=# create database foo STRATEGY = FILE_COPY;
> CREATE DATABASE
> 
> but it fails in createdb because that does
> 
>   if (strategy)
>   appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
> 
> and fmtId will double-quote the strategy if it's upper-case, and then
> the backend grammar doesn't case-fold it, and kaboom.
> 

Oh, right. I should have tested CREATE DATABASE instead of just assuming
it has the same issue ...

>> The problem is that createdb() does the check using strcmp() which is
>> case-sensitive. IMHO this should do pg_strcasecmp() which is what we do
>> for other string parameters nearby.
> 
> Seems reasonable.  The alternative could be to remove createdb.c's use
> of fmtId() here, but I don't think that's actually better.
> 

Why? It seems to me this is quite close to e.g. LOCALE_PROVIDER, and we
don't do fmtId() for that. So why should we do that for STRATEGY?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




createdb compares strategy as case-sensitive

2024-04-20 Thread Tomas Vondra
Hi,

While doing some testing with createdb, I noticed it only accepts
file_copy/wal_log as valid strategies, not FILE_COPY/WAL_LOG (which is
what the docs say). The same thing applies to CREATE DATABASE.

The problem is that createdb() does the check using strcmp() which is
case-sensitive. IMHO this should do pg_strcasecmp() which is what we do
for other string parameters nearby.

Patch attached. This should be backpatched to 15, I think.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 4f4eac5ed7547d27a789624a237bab1ab51b2e4b Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sat, 20 Apr 2024 21:45:30 +0200
Subject: [PATCH] createdb: compare strategy case-insensitive

When specifying the createdb strategy, the documentation suggests valid
options are FILE_COPY and WAL_LOG, but the code does case-sensitive
comparison and accepts only "file_copy" and "wal_log" as valid.

Fixed by doing a case-insensitive comparison using pg_strcasecmp(), same
as for other string parameters.

Backpatch to 15, where the strategy was introduced.

Backpatch-through: 15
---
 src/backend/commands/dbcommands.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 8229dfa1f22..c19f091c696 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1018,9 +1018,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 		char	   *strategy;
 
 		strategy = defGetString(dstrategy);
-		if (strcmp(strategy, "wal_log") == 0)
+		if (pg_strcasecmp(strategy, "wal_log") == 0)
 			dbstrategy = CREATEDB_WAL_LOG;
-		else if (strcmp(strategy, "file_copy") == 0)
+		else if (pg_strcasecmp(strategy, "file_copy") == 0)
 			dbstrategy = CREATEDB_FILE_COPY;
 		else
 			ereport(ERROR,
-- 
2.44.0



Re: brininsert optimization opportunity

2024-04-19 Thread Tomas Vondra
On 4/19/24 00:13, Tomas Vondra wrote:
> ...
> 
> It's a bit too late for me to push this now, I'll do so early tomorrow.
> 

FWIW I've pushed both patches, which resolves the open item, so I've
moved it to the "resolved" part on wiki.


thanks

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: brininsert optimization opportunity

2024-04-18 Thread Tomas Vondra
Hi,

Here's two patched to deal with this open item. 0001 is a trivial fix of
typos and wording, I moved it into a separate commit for clarity. 0002
does the actual fix - adding the index_insert_cleanup(). It's 99% the
patch Alvaro shared some time ago, with only some minor formatting
tweaks by me.

I've also returned to this Alvaro's comment:

> Lastly, I kinda disagree with the notion that only some of the callers
> of aminsert should call aminsertcleanup, even though btree doesn't
> have an aminsertcleanup and thus it can't affect TOAST or catalogs.

which was a reaction to my earlier statement about places calling
index_insert():

> There's a call in toast_internals.c, but that seems OK because that
> only deals with btree indexes (and those don't need any cleanup). The
> same logic applies to unique_key_recheck(). The rest goes through
> execIndexing.c, which should do the cleanup in ExecCloseIndices().

I think Alvaro is right, so I went through all index_insert() callers
and checked which need the cleanup. Luckily there's not that many of
them, only 5 in total call index_insert() directly:

1) toast_save_datum (src/backend/access/common/toast_internals.c)

  This is safe, because the index_insert() passes indexInfo=NULL, so
  there can't possibly be any cache. If we ever decide to pass a valid
  indexInfo, we can add the cleanup, now it seems pointless.

  Note: If someone created a BRIN index on a TOAST table, that'd already
  crash, because BRIN blindly dereferences the indexInfo. Maybe that
  should be fixed, but we don't support CREATE INDEX on TOAST tables.

2) heapam_index_validate_scan (src/backend/access/heap/heapam_handler.c)

  Covered by the committed fix, adding cleanup to validate_index.

3) CatalogIndexInsert (src/backend/catalog/indexing.c)

  Covered by all callers also calling CatalogCloseIndexes, which in turn
  calls ExecCloseIndices and cleanup.

4) unique_key_recheck (src/backend/commands/constraint.c)

  This seems like the only place missing the cleanup call.

5) ExecInsertIndexTuples (src/backend/executor/execIndexing.c)

  Should be covered by ExecCloseIndices, called after the insertions.


So it seems only (4) unique_key_recheck needs the extra call (it can't
really happen higher because the indexInfo is a local variable). So the
0002 patch adds the call.
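
For clarity, the pattern the fix needs is simply this (a sketch of the
calling convention, not the actual 0002 patch - the variable names are
illustrative):

  indexInfo = BuildIndexInfo(indexRel);

  index_insert(indexRel, values, isnull, &tupleid, heapRel,
               UNIQUE_CHECK_EXISTING,
               false,           /* indexUnchanged */
               indexInfo);

  /* let the AM free anything it cached in indexInfo->ii_AmCache */
  index_insert_cleanup(indexRel, indexInfo);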

The patch also adds a test for this (or rather tweaks an existing one).


It's a bit too late for me to push this now, I'll do so early tomorrow.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 0f89677b99b4b70ddfcc6c2cd08f433584bf65aa Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Thu, 18 Apr 2024 22:22:37 +0200
Subject: [PATCH 1/2] Fix a couple typos in BRIN code

Typos introduced by commits c1ec02be1d79, b43757171470 and dae761a87eda.

Author: Alvaro Herrera
Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/202401091043.e3nrqiad6gb7@alvherre.pgsql
---
 doc/src/sgml/indexam.sgml  |  2 +-
 src/backend/access/brin/brin.c | 10 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 76ac0fcddd7..18cf23296f2 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -370,7 +370,7 @@ aminsert (Relation indexRelation,
initially). After the index insertions complete, the memory will be freed
automatically. If additional cleanup is required (e.g. if the cache contains
buffers and tuple descriptors), the AM may define an optional function
-   indexinsertcleanup, called before the memory is released.
+   aminsertcleanup, called before the memory is released.
   
 
   
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 32722f0961b..4f708bba658 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -222,7 +222,7 @@ static bool add_values_to_range(Relation idxRel, BrinDesc *bdesc,
 BrinMemTuple *dtup, const Datum *values, const bool *nulls);
 static bool check_null_keys(BrinValues *bval, ScanKey *nullkeys, int nnullkeys);
 static void brin_fill_empty_ranges(BrinBuildState *state,
-   BlockNumber prevRange, BlockNumber maxRange);
+   BlockNumber prevRange, BlockNumber nextRange);
 
 /* parallel index builds */
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
@@ -1151,8 +1151,8 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	 *
 	 * XXX plan_create_index_workers makes the number of workers dependent on
 	 * maintenance_work_mem, requiring 32MB for each worker. That makes sense
-	 * for btree, but not for BRIN, which can do away with much less memory.
-	 * So maybe make that somehow less strict, optionally?
+	 * for btree, but not for BRIN, which can do with much less memory. So
+	 * maybe make that somehow less strict, optionally?
 	 */
 	if (indexInfo->ii_ParallelWorkers > 0)
 	

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-18 Thread Tomas Vondra
On 4/18/24 09:10, Michael Paquier wrote:
> On Sun, Apr 07, 2024 at 10:54:56AM -0400, Melanie Plageman wrote:
>> I've added an open item [1], because what's one open item when you can
>> have two? (me)
> 
> And this is still an open item as of today.  What's the plan to move
> forward here?

AFAIK the plan is to replace the asserts with actually resetting the
rs_empty_tuples_pending field to 0, as suggested by Melanie a week ago.
I assume she was busy with the post-freeze AM reworks last week, so this
was on a back burner.
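
In other words, something along these lines (just sketching the
direction, not the committed fix):

  /*
   * On rescan/endscan, don't insist the counter is zero - unused "empty
   * tuple" credits are legitimate when the scan stops early, so simply
   * reset the field.
   */
  /* was: Assert(scan->rs_empty_tuples_pending == 0); */
  scan->rs_empty_tuples_pending = 0;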


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: brininsert optimization opportunity

2024-04-18 Thread Tomas Vondra
On 4/18/24 09:07, Michael Paquier wrote:
> On Thu, Feb 29, 2024 at 01:20:39PM +0100, Alvaro Herrera wrote:
>> I think it's not an issue, or rather that we should not try to guess.
>> Instead make it a simple rule: if aminsert is called, then
>> aminsertcleanup must be called afterwards, period.
>>
>> I agree it would be nice to have a way to verify, but it doesn't seem
>> 100% essential.  After all, it's not very common to add new calls to
>> aminsert.
> 
> This thread is listed as an open item.  What's the follow-up plan?
> The last email of this thread is dated as of the 29th of February,
> which was 6 weeks ago.

Apologies, I got distracted by the other patches. The bug is still
there, I believe the patch shared by Alvaro in [1] is the right way to
deal with it. I'll take care of that today/tomorrow.


[1]
https://www.postgresql.org/message-id/202401091043.e3nrqiad6gb7@alvherre.pgsql

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for BRIN indexes

2024-04-17 Thread Tomas Vondra
On 4/15/24 20:35, Tomas Vondra wrote:
> ...
>
> Attached is the cleanup I thought about doing earlier in this patch [1]
> to make the code more like btree. The diff might make it seem like a big
> change, but it really just moves the merge code into a separate function
> and makes it use using the conditional variable. I still believe the old
> code is correct, but this seems like an improvement so plan to push this
> soon and resolve the open item.
>

I've now pushed this cleanup patch, after rewording the commit message a
little bit, etc. I believe this resolves the open item tracking this, so
I've moved it to the "resolved" part.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for BRIN indexes

2024-04-17 Thread Tomas Vondra
On 4/16/24 22:33, Tomas Vondra wrote:
> On 4/15/24 20:35, Tomas Vondra wrote:
>> On 4/15/24 10:18, Tomas Vondra wrote:
>
> ...
> 
> That is, no parallel index builds on temporary tables. Which means the
> test is not actually testing anything :-( Much more stable, but not very
> useful for finding issues.
> 
> I think the best way to stabilize the test is to just not delete the
> rows. That means we won't have any "empty" ranges (runs of pages without
> any tuples).
> 

I just pushed a revert and a patch to stabilize the test in a different
way - Matthias mentioned to me off-list that DELETE is not the only way
to generate empty ranges in a BRIN index, because partial indexes have
the same effect. After playing with that a bit, that seems to work fine
(works with parallel builds, not affected by cleanup), so done that way.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for BRIN indexes

2024-04-16 Thread Tomas Vondra
On 4/15/24 20:35, Tomas Vondra wrote:
> On 4/15/24 10:18, Tomas Vondra wrote:
>> ...
>>
>> I'll try a bit more to make this work without the temp table.
>>
> 
> Considering the earlier discussion in e2933a6e1, I think making the
> table TEMP is the best fix, so I'll do that. Thanks for remembering that
> change, Alexander!
> 

D'oh! I pushed this fix to stabilize the test earlier today, but I just
realized it unfortunately makes the test useless. The idea of the test
was to build BRIN indexes with/without parallelism, and check that the
indexes are exactly the same.

The instability comes from deletes, which I added to get "empty" ranges
in the table, which may not be cleaned up in time for the CREATE INDEX
commands, depending on what else is happening. A TEMPORARY table does
not have this issue (as observed in e2933a6e1), but there's the minor
problem that plan_create_index_workers() does this:

  /*
   * Determine if it's safe to proceed.
   *
   * Currently, parallel workers can't access the leader's temporary
   * tables. Furthermore, any index predicate or index expressions must
   * be parallel safe.
   */
  if (heap->rd_rel->relpersistence == RELPERSISTENCE_TEMP ||
!is_parallel_safe(root, (Node *) RelationGetIndexExpressions(index)) ||
!is_parallel_safe(root, (Node *) RelationGetIndexPredicate(index)))
  {
parallel_workers = 0;
goto done;
  }

That is, no parallel index builds on temporary tables. Which means the
test is not actually testing anything :-( Much more stable, but not very
useful for finding issues.

I think the best way to stabilize the test is to just not delete the
rows. That means we won't have any "empty" ranges (runs of pages without
any tuples).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for BRIN indexes

2024-04-15 Thread Tomas Vondra
On 4/15/24 10:18, Tomas Vondra wrote:
> ...
>
> I'll try a bit more to make this work without the temp table.
> 

Considering the earlier discussion in e2933a6e1, I think making the
table TEMP is the best fix, so I'll do that. Thanks for remembering that
change, Alexander!

Attached is the cleanup I thought about doing earlier in this patch [1]
to make the code more like btree. The diff might make it seem like a big
change, but it really just moves the merge code into a separate function
and makes it use the condition variable. I still believe the old
code is correct, but this seems like an improvement so plan to push this
soon and resolve the open item.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom c37d96c4d3a151b3a30f28b971e6058d20978b76 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 15 Apr 2024 17:59:31 +0200
Subject: [PATCH] Cleanup parallel BRIN index build code

Commit b43757171470 introduced parallel builds for BRIN indexes, using
code fairly similar to BTREE. But there happened to be a couple minor
unnecessary differences, particularly in when the leader waits for the
workers and merges the results.

Unlike the BTREE code, the leader never waited on the workersdonecv
condition variable, but simply called WaitForParallelWorkersToFinish()
in _brin_end_parallel() before merging the per-worker results. While
this works correctly, it's probably better to do the merging earlier,
after waiting on the condition variable. This way _brin_end_parallel()
is responsible only for exiting the parallel mode and accumulating WAL
usage data, same as in BTREE.

The merging of per-worker results now happens in _brin_parallel_merge(),
while _brin_parallel_heapscan() is responsible for waiting for the
workers to finish scanning the heap.

Discussion: https://postgr.es/m/3733d042-71e1-6ae6-5fac-00c12db62...@enterprisedb.com
---
 src/backend/access/brin/brin.c | 130 -
 1 file changed, 97 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 041415a40e7..32722f0961b 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -229,6 +229,8 @@ static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Rela
  bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
 static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _brin_parallel_heapscan(BrinBuildState *buildstate);
+static double _brin_parallel_merge(BrinBuildState *buildstate);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
 			   Relation heap, Relation index);
 static void _brin_parallel_scan_and_build(BrinBuildState *buildstate,
@@ -1201,6 +1203,9 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 	   TUPLESORT_NONE);
 
+		/* scan the relation and merge per-worker results */
+		reltuples = _brin_parallel_merge(state);
+
 		_brin_end_parallel(state->bs_leader, state);
 	}
 	else		/* no parallel index build */
@@ -1233,14 +1238,10 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 			   state->bs_currRangeStart,
 			   state->bs_maxRangeStart);
-
-		/* track the number of relation tuples */
-		state->bs_reltuples = reltuples;
 	}
 
 	/* release resources */
 	idxtuples = state->bs_numtuples;
-	reltuples = state->bs_reltuples;
 	brinRevmapTerminate(state->bs_rmAccess);
 	terminate_brin_buildstate(state);
 
@@ -2329,6 +2330,22 @@ check_null_keys(BrinValues *bval, ScanKey *nullkeys, int nnullkeys)
 	return true;
 }
 
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's BrinLeader, which caller must use to shut down parallel
+ * mode by passing it to _brin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
 static void
 _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 bool isconcurrent, int request)
@@ -2517,27 +2534,87 @@ static void
 _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 {
 	int			i;
-	BrinTuple  *btup;
-	BrinMemTuple *memtuple = NULL;
-	Size		tuplen;
-	BrinShared *brinshared = brinleader->brinshared;
-	BlockNumber prevblkno = InvalidBlockNumber;
-	MemoryContext rangeCxt,
-oldC

Re: Parallel CREATE INDEX for BRIN indexes

2024-04-15 Thread Tomas Vondra



On 4/15/24 08:00, Alexander LAW wrote:
> Hello Tomas,
> 
> 14.04.2024 20:09, Tomas Vondra wrote:
>> I've pushed this, including backpatching the two fixes. I've reduced the
>> amount of data needed by the test, and made sure it works on 32-bits too
>> (I was a bit worried it might be sensitive to that, but that seems not
>> to be the case).
> 
> I've discovered that that test addition brings some instability to the
> test.
> With the following pageinspect/Makefile modification:
> -REGRESS = page btree brin gin gist hash checksum oldextversions
> +REGRESS = page btree brin $(shell printf 'brin %.0s' `seq 99`) gin gist
> hash checksum oldextversions
> 
> echo "autovacuum_naptime = 1" > /tmp/temp.config
> TEMP_CONFIG=/tmp/temp.config make -s check -C contrib/pageinspect
> fails for me as below:
> ...
> ok 17    - brin  127 ms
> not ok 18    - brin  140 ms
> ok 19    - brin  125 ms
> ...
> # 4 of 107 tests failed.
> 
> The following change
> -CREATE TABLE brin_parallel_test (a int, b text, c bigint) WITH
> (fillfactor=40);
> +CREATE TEMP TABLE brin_parallel_test (a int, b text, c bigint) WITH
> (fillfactor=40);
> (similar to e2933a6e1) makes the test pass reliably for me.
> 

Thanks! This reproduces the issue for me.

I believe this happens because the test does "DELETE + VACUUM" to
generate "gaps" in the table, to get empty ranges in the BRIN. I guess
what's happening is that something (autovacuum or likely something else)
blocks the explicit VACUUM from cleaning some of the pages with deleted
tuples, but then the cleanup happens shortly after between building the
the serial/parallel indexes. That would explain the differences reported
by the regression test.

When I thought about this while writing the test, my reasoning was that
even if the explicit vacuum occasionally fails to clean something, it
should affect all the indexes equally. Which is why I wrote the test to
compare the results using EXCEPT, not checking the exact output.

I'm not a huge fan of temporary tables in regression tests, because it
disappears at the end, making it impossible to inspect the data after a
failure. But the only other option I could think of is disabling
autovacuum on the table, but that does not seem to prevent the failures.

I'll try a bit more to make this work without the temp table.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for BRIN indexes

2024-04-14 Thread Tomas Vondra
I've pushed this, including backpatching the two fixes. I've reduced the
amount of data needed by the test, and made sure it works on 32-bits too
(I was a bit worried it might be sensitive to that, but that seems not
to be the case).

There's still the question of maybe removing the differences between the
BTREE and BRIN code for parallel builds, I mentioned in [1]. That's more
of a cosmetic issue, but I'll add it as an open item for myself.


regards


[1]
https://www.postgresql.org/message-id/3733d042-71e1-6ae6-5fac-00c12db62db6%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for BRIN indexes

2024-04-13 Thread Tomas Vondra
On 4/13/24 11:19, Tomas Vondra wrote:
> On 4/13/24 10:36, Andres Freund wrote:
>> Hi,
>>
>> While preparing a differential code coverage report between 16 and HEAD, one
>> thing that stands out is the parallel brin build code. Neither on
>> coverage.postgresql.org nor locally is that code reached during our tests.
>>
> 
> Thanks for pointing this out, it's definitely something that I need to
> improve (admittedly, should have been part of the patch). I'll also look
> into eliminating the difference between BTREE and BRIN parallel builds,
> mentioned in my last message in this thread.
> 

Here's a couple patches adding a test for the parallel CREATE INDEX with
BRIN. The actual test is in 0003/0004 - I added the test to pageinspect,
because that allows cross-checking the index against one built without
parallelism, which I think is better than just doing CREATE INDEX
without properly verifying that it produces correct results.

It's not entirely trivial because for some opclasses (e.g. minmax-multi)
the results depend on the order in which values are added, and on the
order in which summaries from different workers are merged.

Funnily enough, while adding the test, I ran into two pre-existing bugs.
One is that brin_bloom_union forgot to update the number of bits set in
the bitmap, another one is that 6bcda4a721 changes PG_DETOAST_DATUM to
the _PACKED version, which however does the wrong thing. Both bugs are
mostly harmless - they only affect the output functions, which are unused
outside pageinspect. No impact on query correctness etc.
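
To make the first one a bit more concrete, here is a tiny standalone
sketch (not the actual PostgreSQL code - the real fix is in the attached
0001 patch; the function name and signature here are made up for the
illustration): after OR-ing the two bitmaps, the cached "number of bits
set" counter has to be recomputed, otherwise it still describes only the
first filter.

#include <stddef.h>
#include <stdint.h>

/* merge bitmap "b" into "a", return the number of bits set afterwards */
int
bitmap_union_nbits_set(uint8_t *a, const uint8_t *b, size_t nbytes)
{
	int		nbits_set = 0;

	for (size_t i = 0; i < nbytes; i++)
	{
		a[i] |= b[i];			/* the union itself */

		/* recount the set bits, byte by byte */
		for (int bit = 0; bit < 8; bit++)
		{
			if (a[i] & (0x01 << bit))
				nbits_set++;
		}
	}

	return nbits_set;
}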

The test needs a bit more work to make sure it works on 32-bit machines
etc., which I think may affect the available space on a page and in turn
the minmax-multi summaries. But I'll take care of this early next week.



-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From 450f992c12d43c14aa1b4b72cfd9d3ce0251eed1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sat, 13 Apr 2024 22:15:26 +0200
Subject: [PATCH v20240413 1/4] Update nbits_set in brin_bloom_union

When merging BRIN bloom summaries, it's not enough to merge the bitmaps,
we need to update the nbits_set counter too.

This is mostly harmless, as it does not affect correctness - the counter
is used only in the output function, and that's used only in pageinspect
to show basic info about the summary.

Backpatch-through: 14
---
 src/backend/access/brin/brin_bloom.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/access/brin/brin_bloom.c b/src/backend/access/brin/brin_bloom.c
index ebf33016279..9e3bc567303 100644
--- a/src/backend/access/brin/brin_bloom.c
+++ b/src/backend/access/brin/brin_bloom.c
@@ -687,10 +687,20 @@ brin_bloom_union(PG_FUNCTION_ARGS)
 
 	nbytes = (filter_a->nbits) / 8;
 
+	filter_a->nbits_set = 0;
+
 	/* simply OR the bitmaps */
 	for (i = 0; i < nbytes; i++)
+	{
 		filter_a->data[i] |= filter_b->data[i];
 
+		for (int bit = 0; bit < 8; bit++)
+		{
+			if (filter_a->data[i] & (0x01 << bit))
+				filter_a->nbits_set++;
+		}
+	}
+
 	PG_RETURN_VOID();
 }
 
-- 
2.44.0

From 07739713893bb0f6be5853324e8897159cf2ad07 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sat, 13 Apr 2024 22:18:37 +0200
Subject: [PATCH v20240413 2/4] Use correct PG_DETOAST_DATUM macro in BRIN

Commit 6bcda4a721 replaced PG_DETOAST_DATUM with PG_DETOAST_DATUM_PACKED
in two places, in the output functions for minmax-multi and bloom. But this
is incorrect - the structs backing both summaries include a 4B header, so
we need to detoast them fully. With _PACKED we may end up with a 1B
header, so the fields will be shifted and have incorrect values.

Backpatch-through: 16
---
 src/backend/access/brin/brin_bloom.c| 2 +-
 src/backend/access/brin/brin_minmax_multi.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/brin/brin_bloom.c b/src/backend/access/brin/brin_bloom.c
index 9e3bc567303..1c2437cef74 100644
--- a/src/backend/access/brin/brin_bloom.c
+++ b/src/backend/access/brin/brin_bloom.c
@@ -807,7 +807,7 @@ brin_bloom_summary_out(PG_FUNCTION_ARGS)
 	StringInfoData str;
 
 	/* detoast the data to get value with a full 4B header */
-	filter = (BloomFilter *) PG_DETOAST_DATUM_PACKED(PG_GETARG_DATUM(0));
+	filter = (BloomFilter *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
 
	initStringInfo(&str);
	appendStringInfoChar(&str, '{');
diff --git a/src/backend/access/brin/brin_minmax_multi.c b/src/backend/access/brin/brin_minmax_multi.c
index c5962c00d64..e5d95de5d84 100644
--- a/src/backend/access/brin/brin_minmax_multi.c
+++ b/src/backend/access/brin/brin_minmax_multi.c
@@ -3023,7 +3023,7 @@ brin_minmax_multi_summary_out(PG_FUNCTION_ARGS)
 	 * Detoast to get value with full 4B header (can't be stored in a toast
 	 * table, but can use 1B header).
 	 */
-	ranges = (SerializedRanges *) PG_DETOAST_DATUM_PACKED(PG_GETARG_DATUM(0));
+	ranges = (SerializedRanges *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));

Re: Why is parula failing?

2024-04-13 Thread Tomas Vondra
On 4/13/24 15:02, Robins Tharakan wrote:
> On Wed, 10 Apr 2024 at 10:24, David Rowley  wrote:
>>
>> Master failed today for the first time since the compiler upgrade.
>> Again reltuples == 48.
> 
> Here's what I can add over the past few days:
> - Almost all failures are either reltuples=48 or SIGABRTs
> - Almost all SIGABRTs are DDLs - CREATE INDEX / CREATE AGGREGATEs / CTAS
>   - A little too coincidental? Recent crashes have stack-trace if
> interested.
> 
> Barring the initial failures (during move to gcc 13.2), in the past week:
> - v15 somehow hasn't had a failure yet
> - v14 / v16 have got only 1 failure each
> - but v12 / v13 are lit up - failed multiple times.
> 

I happened to have an unused rpi5, so I installed Armbian aarch64 with
gcc 13.2.0, built with exactly the same configure options as parula, and
did ~300 loops of "make check" without a single failure.

So either parula has its packages set up in a different way, or maybe
it's more of a timing issue and the rpi5 is way slower than graviton3.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Why is parula failing?

2024-04-13 Thread Tomas Vondra



On 4/9/24 05:48, David Rowley wrote:
> On Mon, 8 Apr 2024 at 23:56, Robins Tharakan  wrote:
>> #3  0x0083ed84 in WaitLatch (latch=, 
>> wakeEvents=wakeEvents@entry=41, timeout=60, 
>> wait_event_info=wait_event_info@entry=150994946) at latch.c:538
>> #4  0x00907404 in pg_sleep (fcinfo=) at misc.c:406
> 
>> #17 0x0086a944 in exec_simple_query 
>> (query_string=query_string@entry=0x28171c90 "SELECT pg_sleep(0.1);") at 
>> postgres.c:1274
> 
> I have no idea why WaitLatch has timeout=60.  That should be no
> higher than timeout=100 for "SELECT pg_sleep(0.1);".  I have no
> theories aside from a failing RAM module, cosmic ray or a well-timed
> clock change between the first call to gettimeofday() in pg_sleep()
> and the next one.
> 
> I know this animal is running debug_parallel_query = regress, so that
> 0.1 Const did have to get serialized and copied to the worker, so
> there's another opportunity for the sleep duration to be stomped on,
> but that seems pretty unlikely.
> 

AFAIK that GUC is set only for HEAD, so it would not explain the
failures on the other branches.

> I can't think of a reason why the erroneous  reltuples=48 would be
> consistent over 2 failing runs if it were failing RAM or a cosmic ray.
> 

Yeah, that seems very unlikely.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: post-freeze damage control

2024-04-13 Thread Tomas Vondra



On 4/13/24 01:23, David Steele wrote:
> On 4/12/24 22:27, Tomas Vondra wrote:
>>
>>
>> On 4/12/24 08:42, David Steele wrote:
>>> On 4/11/24 20:26, Tomas Vondra wrote:
>>>> On 4/11/24 03:52, David Steele wrote:
>>>>> On 4/11/24 10:23, Tom Kincaid wrote:
>>>>>>
>>>>>> The extensive Beta process we have can be used to build confidence we
>>>>>> need in a feature that has extensive review and currently has no
>>>>>> known
>>>>>> issues or outstanding objections.
>>>>>
>>>>> I did have objections, here [1] and here [2]. I think the complexity,
>>>>> space requirements, and likely performance issues involved in restores
>>>>> are going to be a real problem for users. Some of these can be
>>>>> addressed
>>>>> in future releases, but I can't escape the feeling that what we are
>>>>> releasing here is half-baked.
>>>>>
>>>> I do not think it's half-baked. I certainly agree there are
>>>> limitations,
>>>> and there's all kinds of bells and whistles we could add, but I think
>>>> the fundamental infrastructure is correct and a meaningful step
>>>> forward.
>>>> Would I wish it to handle .tar for example? Sure I would. But I think
>>>> it's something we can add in the future - if we require all of this to
>>>> happen in a single release, it'll never happen.
>>>
>>> I'm not sure that I really buy this argument, anyway. It is not uncommon
>>> for significant features to spend years in development before they are
>>> committed. This feature went from first introduction to commit in just
>>> over six months. Obviously Robert had been working on it for a while,
>>> but for a feature this large six months is a sprint.
>>>
>>
>> Sure, but it's also not uncommon for significant features to be
>> developed incrementally, over multiple releases, introducing the basic
>> infrastructure first, and then expanding the capabilities later. I'd
>> cite logical decoding/replication and parallel query as examples of this
>> approach.
>>
>> It's possible there's some fundamental flaw in the WAL summarization?
>> Sure, I can't rule that out, although I find it unlikely. Could there be
>> bugs? Sure, that's possible, but that applies to all code.
>>
>> But it seems to me all the comments are about the client side, not about
>> the infrastructure. Which is fair, I certainly agree it'd be nice to
>> handle more use cases with less effort, but I still think the patch is a
>> meaningful step forward.
> 
> Yes, my comments are all about the client code. I like the
> implementation of the WAL summarizer a lot. I don't think there is a
> fundamental flaw in the design, either, but I wouldn't be surprised if
> there are bugs. That's life in software development biz.
> 

Agreed.

> Even for the summarizer, though, I do worry about the complexity of
> maintaining it over time. It seems like it would be very easy to
> introduce a bug and have it go unnoticed until it causes problems in the
> field. A lot of testing was done outside of the test suite for this
> feature and I'm not sure if we can rely on that focus with every release.
> 

I'm not sure there's a simpler way to implement this. I haven't really
worked on that part (not until the CoW changes a couple weeks ago), but
I think Robert was very conscious of the complexity.

I don't expect this code to change very often, but I agree it's
not great to rely on testing outside the regular regression test suite.
But I'm not sure how much more we can do, really - for example my
testing was very much "randomized stress testing" with a lot of data and
long runs, looking for unexpected stuff. That's not something we could
do in the usual regression tests, I think.

But if you have suggestions how to extend the testing ...

> For me an incremental approach would be to introduce the WAL summarizer
> first. There are already plenty of projects that do page-level
> incremental (WAL-G, pg_probackup, pgBackRest) and could help shake out
> the bugs. Then introduce the client tools later when they are more
> robust. Or, release the client tools now but mark them as experimental
> or something so people know that changes are coming and they don't get
> blindsided by that in the next release. Or, at the very least, make the
> caveats very clear so users can make an informed choice.
> 

I don't think introducing just the summarizer, without any client tools,
would really work. How would we even test the summarizer, for 

Re: post-freeze damage control

2024-04-13 Thread Tomas Vondra



On 4/13/24 01:03, David Steele wrote:
> On 4/12/24 22:12, Tomas Vondra wrote:
>> On 4/11/24 23:48, David Steele wrote:
>>> On 4/11/24 20:26, Tomas Vondra wrote:
>>>
>>>> FWIW that discussion also mentions stuff that I think the feature
>>>> should
>>>> not do. In particular, I don't think the ambition was (or should be) to
>>>> make pg_basebackup into a stand-alone tool. I always saw pg_basebackup
>>>> more as an interface to "backup steps" correctly rather than a complete
>>>> backup solution that'd manage backup registry, retention, etc.
>>>
>>> Right -- this is exactly my issue. pg_basebackup was never easy to use
>>> as a backup solution and this feature makes it significantly more
>>> complicated. Complicated enough that it would be extremely difficult for
>>> most users to utilize in a meaningful way.
>>
>> Perhaps, I agree we could/should try to do a better job with backups, no
>> argument there. But I still don't quite see why introducing such
>> infrastructure to "manage backups" should be up to the patch adding
>> incremental backups. I see it as something to build on top of
>> pg_basebackup/pg_combinebackup, not into those tools.
> 
> I'm not saying that managing backups needs to be part of pg_basebackup,
> but I am saying without that it is not a complete backup solution.
> Therefore introducing advanced features that the user then has to figure
> out how to manage puts a large burden on them. Implementing
> pg_combinebackup inefficiently out of the gate just makes their life
> harder.
> 

I agree with this in general, but I fail to see how it'd be the fault of
this patch. It merely extends what pg_basebackup did before, so if it's
not a complete solution now, it wasn't a complete solution before.

Sure, I 100% agree it'd be great to have a more efficient
pg_combinebackup with fewer restrictions. But if we make these
limitations clear, I think it's better to have this than having nothing.

>>> But they'll try because it is a new pg_basebackup feature and they'll
>>> assume it is there to be used. Maybe it would be a good idea to make it
>>> clear in the documentation that significant tooling will be required to
>>> make it work.
>>
>> Sure, I'm not against making it clearer pg_combinebackup is not a
>> complete backup solution, and documenting the existing restrictions.
> 
> Let's do that then. I think it would make sense to add caveats to the
> pg_combinebackup docs including space requirements, being explicit about
> the format required (e.g. plain), and also possible mitigation with COW
> filesystems.
> 

OK. I'll add this as an open item, for me and Robert.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel CREATE INDEX for BRIN indexes

2024-04-13 Thread Tomas Vondra
On 4/13/24 10:36, Andres Freund wrote:
> Hi,
> 
> While preparing a differential code coverage report between 16 and HEAD, one
> thing that stands out is the parallel brin build code. Neither on
> coverage.postgresql.org nor locally is that code reached during our tests.
>

Thanks for pointing this out, it's definitely something that I need to
improve (admittedly, should have been part of the patch). I'll also look
into eliminating the difference between BTREE and BRIN parallel builds,
mentioned in my last message in this thread.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add notes to pg_combinebackup docs

2024-04-12 Thread Tomas Vondra



On 4/12/24 11:12, Magnus Hagander wrote:
> On Tue, Apr 9, 2024 at 11:46 AM Tomas Vondra 
> wrote:
> 
>>
>>
>> On 4/9/24 09:59, Martín Marqués wrote:
>>> Hello,
>>>
>>> While doing some work/research on the new incremental backup feature
>>> some limitations were not listed in the docs. Mainly the fact that
>>> pg_combinebackup works with plain format and not tar.
>>>
>>
>> Right. The docs mostly imply this by talking about output directory and
>> backup directories, but making it more explicit would not hurt.
>>
>> FWIW it'd be great if we could make incremental backups work with tar
>> format in the future too. People probably don't want to keep around the
>> expanded data directory, and extracting everything before combining the
>> backups is not very convenient. Reading and writing the tar would make
>> this simpler.
>>
>>> Around the same time, Tomas Vondra tested incremental backups with a
>>> cluster where he enabled checksums after taking the previous full
>>> backup. After combining the backups the synthetic backup had pages
>>> with checksums and other pages without checksums which ended in
>>> checksum errors.
>>>
>>
>> I'm not sure just documenting this limitation is sufficient. We can't
>> make the incremental backups work in this case (it's as if someone
>> messes with cluster without writing stuff into WAL), but I think we
>> should do better than silently producing (seemingly) corrupted backups
> 
> 
> +1. I think that should be an open item that needs to get sorted.
> 
> 
> I say seemingly, because the backup is actually fine, the only problem
>> is it has checksums enabled in the controlfile, but the pages from the
>> full backup (and the early incremental backups) have no checksums.
>>
>> What we could do is detect this in pg_combinebackup, and either just
>> disable checksums with a warning and hint to maybe enable them again. Or
>> maybe just print that the user needs to disable them.
>>
> 
> I don't think either of these should be done automatically. Something like
> pg_combinebackup simply failing and requiring you to say
> "--disable-checksums" to have it do that would be the way to go, IMO.
> (once we can reliably detect it of course)
> 

You mean pg_combinebackup would have a "--disable-checksums" switch? Yeah,
that'd work, I think. It's probably better than producing a backup that
would seem broken when the user tries to start the instance.

Also, I realized the user probably can't disable the checksums without
starting the instance to finish recovery. And if there are invalid
checksums, I'm not sure that would actually work.

> 
> I was thinking maybe we could detect this while taking the backups, and
>> force taking a full backup if checksums got enabled since the last
>> backup. But we can't do that because we only have the manifest from the
>> last backup, and the manifest does not include info about checksums.
>>
> 
> Can we forcibly read and parse it out of pg_control?
> 

You mean when taking the backup, or during pg_combinebackup?

During backup, it depends. For the instance we should be able to just
get that from the instance, no need to get it from pg_control. But for
the backup (used as a baseline for the increment) we can't read the
pg_control - the only thing we have is the manifest.

During pg_combinebackup we obviously can read pg_control for all the
backups to combine, but at that point it feels a bit too late - it does
not seem great to take backups and only tell the user at recovery time
that the backups are actually not valid.

I think it'd be better to detect this while taking the basebackup.

> 
> It's a bit unfortunate we don't have a way to enable checksums online.
>> That'd not have this problem IIRC, because it writes proper WAL. Maybe
>> it's time to revive that idea ... I recall there were some concerns
>> about tracking progress to allow resuming the work, but giving up on the
>> whole idea just because in some (rare?) cases it'd need to do more work
>> does not seem like a great trade-off.
>>
>>
> For that one I still think it would be perfectly acceptable to have no
> resume at all, but that's a whole different topic :)
> 

I very much agree.

It seems a bit strange that given three options to enable checksums

1) offline
2) online without resume
3) online with resume

the initial argument was that we need to allow resuming the process
because on large systems it might take a lot of time, and we'd lose all
the work if the system restarts. But then we concluded that it's too
complex and it's better if the large systems have to do an extended
outage to enable checksums ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add notes to pg_combinebackup docs

2024-04-12 Thread Tomas Vondra
On 4/12/24 11:50, David Steele wrote:
> On 4/12/24 19:09, Magnus Hagander wrote:
>> On Fri, Apr 12, 2024 at 12:14 AM David Steele >
>> ...>>
>>  > But yeah, having to keep the backups as expanded directories is
>> not
>>  > great, I'd love to have .tar. Not necessarily because of the disk
>>     space
>>  > (in my experience the compression in filesystems works quite
>> well for
>>  > this purpose), but mostly because it's more compact and allows
>>     working
>>  > with backups as a single piece of data (e.g. it's much clearer
>>     what the
>>  > checksum of a single .tar is, compared to a directory).
>>
>>     But again, object stores are commonly used for backup these days and
>>     billing is based on data stored rather than any compression that
>> can be
>>     done on the data. Of course, you'd want to store the compressed
>> tars in
>>     the object store, but that does mean storing an expanded copy
>> somewhere
>>     to do pg_combinebackup.
>>
>> Object stores are definitely getting more common. I wish they were
>> getting a lot more common than they actually are, because they
>> simplify a lot.  But they're in my experience still very far from
>> being a majority.
> 
> I see it the other way, especially the last few years. The majority seem
> to be object stores followed up closely by NFS. Directly mounted storage
> on the backup host appears to be rarer.
> 

One thing I'd mention is that not having built-in support for .tar and
.tgz backups does not mean it's impossible to use pg_combinebackup with
archives. You can mount them using e.g. "ratarmount" and then use those
as source directories for pg_combinebackup.

It's not entirely friction-less, because AFAICS it's necessary to do the
backup in plain format and then create the .tar, to get the expected "flat"
directory structure (and not manifest + 2x tar). But other than that it
seems to work fine (based on my limited testing).


FWIW the "archivemount" performs terribly, so adding this capability
into pg_combinebackup is clearly far from trivial.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: post-freeze damage control

2024-04-12 Thread Tomas Vondra



On 4/12/24 08:42, David Steele wrote:
> On 4/11/24 20:26, Tomas Vondra wrote:
>> On 4/11/24 03:52, David Steele wrote:
>>> On 4/11/24 10:23, Tom Kincaid wrote:
>>>>
>>>> The extensive Beta process we have can be used to build confidence we
>>>> need in a feature that has extensive review and currently has no known
>>>> issues or outstanding objections.
>>>
>>> I did have objections, here [1] and here [2]. I think the complexity,
>>> space requirements, and likely performance issues involved in restores
>>> are going to be a real problem for users. Some of these can be addressed
>>> in future releases, but I can't escape the feeling that what we are
>>> releasing here is half-baked.
>>>
>> I do not think it's half-baked. I certainly agree there are limitations,
>> and there's all kinds of bells and whistles we could add, but I think
>> the fundamental infrastructure is correct and a meaningful step forward.
>> Would I wish it to handle .tar for example? Sure I would. But I think
>> it's something we can add in the future - if we require all of this to
>> happen in a single release, it'll never happen.
> 
> I'm not sure that I really buy this argument, anyway. It is not uncommon
> for significant features to spend years in development before they are
> committed. This feature went from first introduction to commit in just
> over six months. Obviously Robert had been working on it for a while,
> but for a feature this large six months is a sprint.
> 

Sure, but it's also not uncommon for significant features to be
developed incrementally, over multiple releases, introducing the basic
infrastructure first, and then expanding the capabilities later. I'd
cite logical decoding/replication and parallel query as examples of this
approach.

It's possible there's some fundamental flaw in the WAL summarization?
Sure, I can't rule that out, although I find it unlikely. Could there be
bugs? Sure, that's possible, but that applies to all code.

But it seems to me all the comments are about the client side, not about
the infrastructure. Which is fair, I certainly agree it'd be nice to
handle more use cases with less effort, but I still think the patch is a
meaningful step forward.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: post-freeze damage control

2024-04-12 Thread Tomas Vondra



On 4/11/24 23:48, David Steele wrote:
> On 4/11/24 20:26, Tomas Vondra wrote:
>>
>> On 4/11/24 03:52, David Steele wrote:
>>> On 4/11/24 10:23, Tom Kincaid wrote:
>>>>
>>>> The extensive Beta process we have can be used to build confidence we
>>>> need in a feature that has extensive review and currently has no known
>>>> issues or outstanding objections.
>>>
>>> I did have objections, here [1] and here [2]. I think the complexity,
>>> space requirements, and likely performance issues involved in restores
>>> are going to be a real problem for users. Some of these can be addressed
>>> in future releases, but I can't escape the feeling that what we are
>>> releasing here is half-baked.
>>
>> I haven't been part of those discussions, and that part of the thread is
>> a couple months old already, so I'll share my view here instead.
>>
>> I do not think it's half-baked. I certainly agree there are limitations,
>> and there's all kinds of bells and whistles we could add, but I think
>> the fundamental infrastructure is correct and a meaningful step forward.
>> Would I wish it to handle .tar for example? Sure I would. But I think
>> it's something we can add in the future - if we require all of this to
>> happen in a single release, it'll never happen.
> 
> Fair enough, but the current release is extremely limited and it would
> be best if that was well understood by users.
> 
>> FWIW that discussion also mentions stuff that I think the feature should
>> not do. In particular, I don't think the ambition was (or should be) to
>> make pg_basebackup into a stand-alone tool. I always saw pg_basebackup
>> more as an interface to "backup steps" correctly rather than a complete
>> backup solution that'd manage backup registry, retention, etc.
> 
> Right -- this is exactly my issue. pg_basebackup was never easy to use
> as a backup solution and this feature makes it significantly more
> complicated. Complicated enough that it would be extremely difficult for
> most users to utilize in a meaningful way.
> 

Perhaps, I agree we could/should try to do a better job with backups, no
argument there. But I still don't quite see why introducing such
infrastructure to "manage backups" should be up to the patch adding
incremental backups. I see it as something to build on top of
pg_basebackup/pg_combinebackup, not into those tools.

> But they'll try because it is a new pg_basebackup feature and they'll
> assume it is there to be used. Maybe it would be a good idea to make it
> clear in the documentation that significant tooling will be required to
> make it work.
> 

Sure, I'm not against making it clearer pg_combinebackup is not a
complete backup solution, and documenting the existing restrictions.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add notes to pg_combinebackup docs

2024-04-11 Thread Tomas Vondra
On 4/11/24 02:01, David Steele wrote:
> On 4/9/24 19:44, Tomas Vondra wrote:
>>
>> On 4/9/24 09:59, Martín Marqués wrote:
>>> Hello,
>>>
>>> While doing some work/research on the new incremental backup feature
>>> some limitations were not listed in the docs. Mainly the fact that
>>> pg_combinebackup works with plain format and not tar.
>>>
>>
>> Right. The docs mostly imply this by talking about output directory and
>> backup directories, but making it more explicit would not hurt.
>>
>> FWIW it'd be great if we could make incremental backups work with tar
>> format in the future too. People probably don't want to keep around the
>> expanded data directory, and extracting everything before combining the
>> backups is not very convenient. Reading and writing the tar would make
>> this simpler.
> 
> I have a hard time seeing this feature as being very useful, especially
> for large databases, until pg_combinebackup works on tar (and compressed
> tar). Right now restoring an incremental requires at least twice the
> space of the original cluster, which is going to take a lot of users by
> surprise.
> 

I do agree it'd be nice if pg_combinebackup worked with .tar directly,
without having to extract the directories first. No argument there, but
as I said in the other thread, I believe that's something we can add
later. That's simply how incremental development works.

I can certainly imagine other ways to do pg_combinebackup, e.g. by
"merging" the increments into the data directory, instead of creating a
copy. But again, I don't think that has to be in v1.

> I know you have made some improvements here for COW filesystems, but my
> experience is that Postgres is generally not run on such filesystems,
> though that is changing a bit.
> 

I'd say XFS is a pretty common choice, for example. And it's one of the
filesystems that work great with pg_combinebackup.

However, who says this has to be the filesystem the Postgres instance
runs on? Who in their right mind puts backups on the same volume as the
instance anyway? At which point it can be a different filesystem, even
if it's not ideal for running the database.

FWIW I think it's fine to tell users that to minimize the disk space
requirements, they should use a CoW filesystem and --copy-file-range.
The docs don't say that currently, that's true.

All of this also depends on how people do the restore. With the CoW
stuff they can do a quick (and small) copy on the backup server, and
then copy the result to the actual instance. Or they can do restore on
the target directly (e.g. by mounting a r/o volume with backups), in
which case the CoW won't really help.

But yeah, having to keep the backups as expanded directories is not
great, I'd love to have .tar. Not necessarily because of the disk space
(in my experience the compression in filesystems works quite well for
this purpose), but mostly because it's more compact and allows working
with backups as a single piece of data (e.g. it's much clearer what the
checksum of a single .tar is, compared to a directory).

>>> Around the same time, Tomas Vondra tested incremental backups with a
>>> cluster where he enabled checksums after taking the previous full
>>> backup. After combining the backups the synthetic backup had pages
>>> with checksums and other pages without checksums which ended in
>>> checksum errors.
>>
>> I'm not sure just documenting this limitation is sufficient. We can't
>> make the incremental backups work in this case (it's as if someone
>> messes with cluster without writing stuff into WAL), but I think we
>> should do better than silently producing (seemingly) corrupted backups.
>>
>> I say seemingly, because the backup is actually fine, the only problem
>> is it has checksums enabled in the controlfile, but the pages from the
>> full backup (and the early incremental backups) have no checksums.
>>
>> What we could do is detect this in pg_combinebackup, and either just
>> disable checksums with a warning and hint to maybe enable them again. Or
>> maybe just print that the user needs to disable them.
>>
>> I was thinking maybe we could detect this while taking the backups, and
>> force taking a full backup if checksums got enabled since the last
>> backup. But we can't do that because we only have the manifest from the
>> last backup, and the manifest does not include info about checksums.
> 
> I'd say making a new full backup is the right thing to do in this case.
> It should be easy enough to store the checksum state of the cluster in
> the manifest.
> 

Agreed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: post-freeze damage control

2024-04-11 Thread Tomas Vondra



On 4/11/24 03:52, David Steele wrote:
> On 4/11/24 10:23, Tom Kincaid wrote:
>>
>> The extensive Beta process we have can be used to build confidence we
>> need in a feature that has extensive review and currently has no known
>> issues or outstanding objections.
> 
> I did have objections, here [1] and here [2]. I think the complexity,
> space requirements, and likely performance issues involved in restores
> are going to be a real problem for users. Some of these can be addressed
> in future releases, but I can't escape the feeling that what we are
> releasing here is half-baked.
> 

I haven't been part of those discussions, and that part of the thread is
a couple months old already, so I'll share my view here instead.

I do not think it's half-baked. I certainly agree there are limitations,
and there's all kinds of bells and whistles we could add, but I think
the fundamental infrastructure is correct and a meaningful step forward.
Would I wish it to handle .tar for example? Sure I would. But I think
it's something we can add in the future - if we require all of this to
happen in a single release, it'll never happen.

FWIW that discussion also mentions stuff that I think the feature should
not do. In particular, I don't think the ambition was (or should be) to
make pg_basebackup into a stand-alone tool. I always saw pg_basebackup
more as an interface for doing the "backup steps" correctly rather than a complete
backup solution that'd manage backup registry, retention, etc.

> Also, there are outstanding issues here [3] and now here [4].
> 

I agree with some of this, I'll respond in the threads.


regards
Tomas

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: post-freeze damage control

2024-04-09 Thread Tomas Vondra
On 4/9/24 01:33, Michael Paquier wrote:
> On Tue, Apr 09, 2024 at 01:16:02AM +0200, Tomas Vondra wrote:
>> I don't feel too particularly worried about this. Yes, backups are super
>> important because it's often the only thing you have left when things go
>> wrong, and the incremental aspect is all new. The code I've seen while
>> doing the CoW-related patches seemed very precise and careful, and the
>> one bug we found & fixed does not make it bad.
>>
>> Sure, I can't rule out there being more bugs, but I've been doing some
>> pretty extensive stress testing of this (doing incremental backups +
>> combinebackup, and comparing the results against the source, and that
>> sort of stuff). And so far only that single bug this way. I'm still
>> doing this randomized stress testing, with more and more complex
>> workloads etc. and I'll keep doing that for a while.
>>
>> Maybe I'm a bit too happy-go-lucky, but IMO the risk here is limited.
> 
> Even if there's a critical bug, there are still other ways to take
> backups, so there is an exit route even if a problem is found and even
> if this problem requires a complex solution to be able to work
> correctly.
> 

I think it's a bit more nuanced, because it's about backups/restore. The
bug might be subtle, and you won't learn about it until the moment when
you need to restore (or perhaps even long after that). At which point
"You might have taken the backup in some other way." is not really a
viable exit route.

Anyway, I'm still not worried about this particular feature, and I'll
keep doing the stress testing.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-09 Thread Tomas Vondra
On 4/9/24 11:25, Matthias van de Meent wrote:
> On Mon, 8 Apr 2024 at 20:15, Tomas Vondra  
> wrote:
>>
>>
>> On 4/8/24 17:48, Matthias van de Meent wrote:
>>> On Mon, 8 Apr 2024 at 17:21, Tomas Vondra  
>>> wrote:
>>>>
>>>> Maybe it'd be better to start by expanding the existing rule about not
>>>> committing patches introduced for the first time in the last CF.
>>>
>>> I don't think adding more hurdles about what to include into the next
>>> release is a good solution. Why the March CF, and not earlier? Or
>>> later? How about unregistered patches? Changes to the docs? Bug fixes?
>>> The March CF already has a submission deadline of "before march", so
>>> that already puts a soft limit on the patches considered for the april
>>> deadline.
>>>
>>>> What if
>>>> we said that patches in the last CF must not go through significant
>>>> changes, and if they do it'd mean the patch is moved to the next CF?
>>>
>>> I also think there is already a big issue with a lack of interest in
>>> getting existing patches from non-committers committed, reducing the
>>> set of patches that could be considered is just cheating the numbers
>>> and discouraging people from contributing. For one, I know I have
>>> motivation issues keeping up with reviewing other people's patches
>>> when none (well, few, as of this CF) of my patches get reviewed
>>> materially and committed. I don't see how shrinking the window of
>>> opportunity for significant review from 9 to 7 months is going to help
>>> there.
>>>
>>
>> I 100% understand how frustrating the lack of progress can be, and I
>> agree we need to do better. I tried to move a number of stuck patches
>> this CF, and I hope (and plan) to do more of this in the future.
> 
> Thanks in advance.
> 
>> But I don't quite see how this rule modification would change the
>> situation for non-committers. AFAIK we already have the rule that
>> (complex) patches should not appear in the last CF for the first time,
>> and I'd argue that a substantial rework of a complex patch is not that
>> far from a completely new patch. Sure, the reworks may be based on a
>> thorough review, so there's a lot of nuance. But still, is it possible
>> to properly review if it gets reworked at the very end of the CF?
> 
> Possible? Probably, if there is a good shared understanding of why the
> previous patch version's approach didn't work well, and if the next
> approach is well understood as well. But it's not likely, that I'll
> agree with.
> 
> But the main issue I have with your suggestion is that the March
> commitfest effectively contains all new patches starting from January,
> and the leftovers not committed by February. If we start banning all
> new patches and those with significant reworks from the March
> commitfest, then combined with the lack of CF in May there wouldn't be
> any chance for new patches in the first half of the year, and an
> effective block on rewrites for 6 months- the next CF is only in July.
> Sure, there is a bit of leeway there, as some patches get committed
> before the commitfest they're registered in is marked active, but our
> development workflow is already quite hostile to incidental
> contributor's patches [^0], and increasing the periods in which
> authors shouldn't expect their patches to be reviewed due to a major
> release that's planned in >5 months is probably not going to help the
> case.
> 

But I don't think I suggested to ban such patches from the March CF
entirely. Surely those patches can still be submitted, reviewed, and
even reworked in the last CF. All I said is it might be better to treat
those patches as not committable by default. Sorry if that wasn't clear
enough ...

Would this be an additional restriction? I'm not quite sure. Surely if
the intent is to only commit patches that we agree are in "sufficiently
good" shape, and getting them into such shape requires time & reviews,
this would not be a significant change.

FWIW I'm not a huge fan of hard unbreakable rules, so there should be
some leeway when justified, but the bar would be somewhat higher (clear
consensus, RMT having a chance to say no, ...).

>>> So, I think that'd be counter-productive, as this would get the
>>> perverse incentive to band-aid over (architectural) issues to limit
>>> churn inside the patch, rather than fix those issues in a way that's
>>> appropriate for the project as a whole.
>>>
>>
>> Surely those architectural shortcomings should be identified in a review
>> - whic

Re: Add notes to pg_combinebackup docs

2024-04-09 Thread Tomas Vondra



On 4/9/24 09:59, Martín Marqués wrote:
> Hello,
> 
> While doing some work/research on the new incremental backup feature
> some limitations were not listed in the docs. Mainly the fact that
> pg_combinebackup works with plain format and not tar.
> 

Right. The docs mostly imply this by talking about output directory and
backup directories, but making it more explicit would not hurt.

FWIW it'd be great if we could make incremental backups work with tar
format in the future too. People probably don't want to keep around the
expanded data directory, and extracting everything before combining the
backups is not very convenient. Reading and writing the tar would make
this simpler.

> Around the same time, Tomas Vondra tested incremental backups with a
> cluster where he enabled checksums after taking the previous full
> backup. After combining the backups the synthetic backup had pages
> with checksums and other pages without checksums which ended in
> checksum errors.
> 

I'm not sure just documenting this limitation is sufficient. We can't
make the incremental backups work in this case (it's as if someone
messes with cluster without writing stuff into WAL), but I think we
should do better than silently producing (seemingly) corrupted backups.

I say seemingly, because the backup is actually fine, the only problem
is it has checksums enabled in the controlfile, but the pages from the
full backup (and the early incremental backups) have no checksums.

What we could do is detect this in pg_combinebackup, and either just
disable checksums with a warning and hint to maybe enable them again. Or
maybe just print that the user needs to disable them.
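
Just to sketch what the detection might look like (this is not an actual
patch, merely a rough illustration of the check; read_backup_controlfile()
is a hypothetical helper - real code would presumably use the routines in
src/common/controldata_utils.c - and the error message is made up):

#include "postgres_fe.h"

#include "catalog/pg_control.h"
#include "common/logging.h"

/* hypothetical helper: parse pg_control from a backup directory */
extern ControlFileData *read_backup_controlfile(const char *backup_dir);

void
check_checksum_state(const char *full_dir, const char *incr_dir)
{
	ControlFileData *full = read_backup_controlfile(full_dir);
	ControlFileData *incr = read_backup_controlfile(incr_dir);

	/*
	 * If checksums got enabled (or disabled) between the full backup and
	 * the final increment, the combined directory would mix pages with and
	 * without checksums. Refuse to do that silently; an explicit switch
	 * could override this and adjust/ignore the checksum state instead.
	 */
	if (full->data_checksum_version != incr->data_checksum_version)
		pg_fatal("data checksum setting changed between the backups");
}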

I was thinking maybe we could detect this while taking the backups, and
force taking a full backup if checksums got enabled since the last
backup. But we can't do that because we only have the manifest from the
last backup, and the manifest does not include info about checksums.

It's a bit unfortunate we don't have a way to enable checksums online.
That'd not have this problem IIRC, because it writes proper WAL. Maybe
it's time to revive that idea ... I recall there were some concerns
about tracking progress to allow resuming the work, but giving up on the
whole idea just because in some (rare?) cases it'd need to do more work
does not seem like a great trade-off.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: post-freeze damage control

2024-04-08 Thread Tomas Vondra
On 4/8/24 21:47, Robert Haas wrote:
> On Mon, Apr 8, 2024 at 10:42 AM Heikki Linnakangas  wrote:
>> Can you elaborate, which patches you think were not ready? Let's make
>> sure to capture any concrete concerns in the Open Items list.
> 
> ...
> 
> - Incremental backup could have all sorts of terrible data-corrupting
> bugs. 55a5ee30cd65886ff0a2e7ffef4ec2816fbec273 was a full brown-paper
> bag level fix, so the chances of there being further problems are
> probably high.
> 

I don't feel too particularly worried about this. Yes, backups are super
important because it's often the only thing you have left when things go
wrong, and the incremental aspect is all new. The code I've seen while
doing the CoW-related patches seemed very precise and careful, and the
one bug we found & fixed does not make it bad.

Sure, I can't rule out there being more bugs, but I've been doing some
pretty extensive stress testing of this (doing incremental backups +
combinebackup, and comparing the results against the source, and that
sort of stuff). And so far only that single bug this way. I'm still
doing this randomized stress testing, with more and more complex
workloads etc. and I'll keep doing that for a while.

Maybe I'm a bit too happy-go-lucky, but IMO the risk here is limited.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-08 Thread Tomas Vondra



On 4/8/24 21:32, Jelte Fennema-Nio wrote:
> On Mon, 8 Apr 2024 at 20:15, Tomas Vondra  
> wrote:
>> I 100% understand how frustrating the lack of progress can be, and I
>> agree we need to do better. I tried to move a number of stuck patches
>> this CF, and I hope (and plan) to do more of this in the future.
>>
>> But I don't quite see how this rule modification would change the
>> situation for non-committers.
> 
> The problem that I feel I'm seeing is that committers mostly seem to
> materially review large patchsets in the last two commit fests. This
> might be not based in reality, but it is definitely how it feels to
> me. If someone has stats on this, feel free to share.
> 
> I'll sketch a situation: There's a big patch that some non-committer
> submitted that has been sitting on the mailinglist for 6 months or
> more, only being reviewed by other non-committers, which the submitter
> quickly addresses. Then in the final commit fest it is finally
> reviewed by a committer, and they say it requires significant changes.
> Right now, the submitter can decide to quickly address those changes,
> and hope to keep the attention of this committer, to hopefully get it
> merged before the deadline after probably a few more back and forths.
> But this new rule would mean that those significant changes would be a
> reason not to put it in the upcoming release. Which I expect would
> first of all really feel like a slap in the face to the submitter,
> because it's not their fault that those changes were only requested in
> the last commit fest. This would then likely result in the submitter
> not following up quickly (why make time right at this moment, if
> you're suddenly going to have another full year), which would then
> cause the committer to lose context of the patch and thus interest in
> the patch. And then finally getting into the exact same situation next
> year in the final commit fest, when some other committer didn't agree
> with the redesign of the first one and request a new one pushing it
> another year.
> 

FWIW I have no doubt this problem is very real. It has never been easy
to get stuff reviewed/committed, and I doubt it improved in the last couple
years, considering how the traffic on pgsql-hackers exploded :-(

> So yeah, I definitely agree with Matthias. I definitely feel like his
> rule would seriously impact contributions made by non-committers.
> 
> Maybe a better solution to this problem would be to spread impactful
> reviews by committers more evenly throughout the year. Then there
> wouldn't be such a rush to address them in the last commit fest.

Right. I think that's mostly what I was aiming for, although I haven't
made it very clear/explicit. But yeah, if the consequence of the "rule"
was that some of the patches are neglected entirely, that'd be pretty
terrible - both for the project and for the contributors.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-08 Thread Tomas Vondra


On 4/8/24 17:48, Matthias van de Meent wrote:
> On Mon, 8 Apr 2024 at 17:21, Tomas Vondra  
> wrote:
>>
>> ...
>>
>> For me the main problem with the pre-freeze crush is that it leaves
>> pretty much no practical chance to do meaningful review/testing, and
>> some of the patches likely went through significant changes (at least
>> judging by the number of messages and patch versions in the associated
>> threads). That has to have a cost later ...
>>
>> That being said, I'm not sure the "progressive deadline" proposed by
>> Heikki would improve that materially, and it seems like a lot of effort
>> to maintain etc. And even if someone updates the CF app with all the
>> details, does it even give others sufficient opportunity to review the
>> new patch versions? I don't think so. (If anything, it does not seem
>> fair to expect others to do last-minute reviews under pressure.)
>>
>> Maybe it'd be better to start by expanding the existing rule about not
>> committing patches introduced for the first time in the last CF.
> 
> I don't think adding more hurdles about what to include into the next
> release is a good solution. Why the March CF, and not earlier? Or
> later? How about unregistered patches? Changes to the docs? Bug fixes?
> The March CF already has a submission deadline of "before march", so
> that already puts a soft limit on the patches considered for the april
> deadline.
> 
>> What if
>> we said that patches in the last CF must not go through significant
>> changes, and if they do it'd mean the patch is moved to the next CF?
> 
> I also think there is already a big issue with a lack of interest in
> getting existing patches from non-committers committed, reducing the
> set of patches that could be considered is just cheating the numbers
> and discouraging people from contributing. For one, I know I have
> motivation issues keeping up with reviewing other people's patches
> when none (well, few, as of this CF) of my patches get reviewed
> materially and committed. I don't see how shrinking the window of
> opportunity for significant review from 9 to 7 months is going to help
> there.
> 

I 100% understand how frustrating the lack of progress can be, and I
agree we need to do better. I tried to move a number of stuck patches
this CF, and I hope (and plan) to do more of this in the future.

But I don't quite see how this rule modification would change the
situation for non-committers. AFAIK we already have the rule that
(complex) patches should not appear in the last CF for the first time,
and I'd argue that a substantial rework of a complex patch is not that
far from a completely new patch. Sure, the reworks may be based on a
thorough review, so there's a lot of nuance. But still, is it possible
to properly review if it gets reworked at the very end of the CF?

> So, I think that'd be counter-productive, as this would get the
> perverse incentive to band-aid over (architectural) issues to limit
> churn inside the patch, rather than fix those issues in a way that's
> appropriate for the project as a whole.
> 

Surely those architectural shortcomings should be identified in a review
- which however requires time to do properly, and thus is an argument
for ensuring there's enough time for such review (which is in direct
conflict with the last-minute crush, IMO).

Once again, I 100% agree we need to do better in handling patches from
non-committers, absolutely no argument there. But does it require this
last-minute crush? I don't think so, it can't be at the expense of
proper review and getting it right. A complex patch needs to be
submitted early in the cycle, not in the last CF. If it's submitted
early, but does not receive enough interest/reviews, I think we need to
fix & improve that - not to rework/push it at the very end.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-08 Thread Tomas Vondra



On 4/8/24 16:59, Tom Lane wrote:
> Heikki Linnakangas  writes:
>> On 08/04/2024 16:43, Tom Lane wrote:
>>> I was just about to pen an angry screed along the same lines.
>>> The commit flux over the past couple days, and even the last
>>> twelve hours, was flat-out ridiculous.  These patches weren't
>>> ready a week ago, and I doubt they were ready now.
> 
>> Can you elaborate, which patches you think were not ready? Let's make 
>> sure to capture any concrete concerns in the Open Items list.
> 
> [ shrug... ]  There were fifty-some commits on the last day,
> some of them quite large, and you want me to finger particular ones?
> I can't even have read them all yet.
> 
>> Yeah, I should have done that sooner, but realistically, there's nothing 
>> like a looming deadline as a motivator. One idea to avoid the mad rush 
>> in the future would be to make the feature freeze deadline more 
>> progressive. For example:
>> April 1: If you are still working on a feature that you still want to 
>> commit, you must explicitly flag it in the commitfest as "I plan to 
>> commit this very soon".
>> April 4: You must give a short status update about anything that you 
>> haven't committed yet, with an explanation of how you plan to proceed 
>> with it.
>> April 5-8: Mandatory daily status updates, explicit approval by the 
>> commitfest manager needed each day to continue working on it.
>> April 8: Hard feature freeze deadline
> 
>> This would also give everyone more visibility, so that we're not all 
>> surprised by the last minute flood of commits.
> 
> Perhaps something like that could help, but it seems like a lot
> of mechanism.  I think the real problem is just that committers
> need to re-orient their thinking a little.  We must get *less*
> willing to commit marginal patches, not more so, as the deadline
> approaches.
> 

For me the main problem with the pre-freeze crush is that it leaves
pretty much no practical chance to do meaningful review/testing, and
some of the patches likely went through significant changes (at least
judging by the number of messages and patch versions in the associated
threads). That has to have a cost later ...

That being said, I'm not sure the "progressive deadline" proposed by
Heikki would improve that materially, and it seems like a lot of effort
to maintain etc. And even if someone updates the CF app with all the
details, does it even give others sufficient opportunity to review the
new patch versions? I don't think so. (If anything, it does not seem
fair to expect others to do last-minute reviews under pressure.)

Maybe it'd be better to start by expanding the existing rule about not
committing patches introduced for the first time in the last CF. What if
we said that patches in the last CF must not go through significant
changes, and if they do it'd mean the patch is moved to the next CF?
Perhaps with exceptions to be granted by the RMT when appropriate?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread Tomas Vondra



On 4/7/24 23:09, Andres Freund wrote:
> Hi,
> 
> On 2024-04-07 22:35:47 +0200, Tomas Vondra wrote:
>> I haven't investigated, but considering it works on 64-bit, I guess
>> it's not considering alignment somewhere. I can dig more if needed.
> 
> I think I may the problem:
> 
> 
> #define KeeperBlock(set) ((BumpBlock *) ((char *) (set) + 
> sizeof(BumpContext)))
> #define IsKeeperBlock(set, blk) (KeeperBlock(set) == (blk))
> 
> BumpContextCreate():
> ...
>   /* Fill in the initial block's block header */
>   block = (BumpBlock *) (((char *) set) + MAXALIGN(sizeof(BumpContext)));
>   /* determine the block size and initialize it */
>   firstBlockSize = allocSize - MAXALIGN(sizeof(BumpContext));
>   BumpBlockInit(set, block, firstBlockSize);
> ...
>   ((MemoryContext) set)->mem_allocated = allocSize;
> 
> void
> BumpCheck(MemoryContext context)
> ...
>   if (IsKeeperBlock(bump, block))
>   total_allocated += block->endptr - (char *) bump;
> ...
> 
> I suspect that KeeperBlock() isn't returning true, because IsKeeperBlock 
> misses
> the MAXALIGN(). I think that about fits with:
> 
>> #4  0x008f0088 in BumpCheck (context=0x131e330) at bump.c:808
>> 808 Assert(total_allocated == context->mem_allocated);
>> (gdb) p total_allocated
>> $1 = 8120
>> (gdb) p context->mem_allocated
>> $2 = 8192
> 

Yup, changing it to this:

#define KeeperBlock(set) ((BumpBlock *) ((char *) (set) +
MAXALIGN(sizeof(BumpContext))))

fixes the issue for me.
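
To illustrate why the missing MAXALIGN() only bites some builds, here's a
tiny standalone program - nothing to do with the actual bump.c code; the
struct, the sizes and the MAXALIGN definition are local stand-ins. The
keeper block gets laid out at a MAXALIGN'ed offset past the context
header, so computing the expected address without MAXALIGN() only happens
to work when the header size is already a multiple of MAXIMUM_ALIGNOF
(which is presumably the case on 64-bit, but not on the 32-bit animals).

#include <stdio.h>
#include <stdint.h>

/* local stand-ins for PostgreSQL's alignment macros */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
	(((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

/* stand-in context header; 20 bytes is deliberately not maxaligned */
typedef struct FakeContext
{
	char		pad[20];
} FakeContext;

int
main(void)
{
	char		buffer[64];
	FakeContext *set = (FakeContext *) buffer;

	/* where the keeper block actually starts (as in BumpContextCreate) */
	char	   *actual = (char *) set + MAXALIGN(sizeof(FakeContext));

	/* what the unpatched KeeperBlock() macro computed */
	char	   *wrong = (char *) set + sizeof(FakeContext);

	printf("actual = %p, wrong = %p, equal = %d\n",
		   (void *) actual, (void *) wrong, actual == wrong);
	return 0;
}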


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread Tomas Vondra




On 4/7/24 22:35, Tomas Vondra wrote:
> On 4/7/24 14:37, David Rowley wrote:
>> On Sun, 7 Apr 2024 at 22:05, John Naylor  wrote:
>>>
>>> On Sat, Apr 6, 2024 at 7:37 PM David Rowley  wrote:
>>>>
>>>  I'm planning on pushing these, pending a final look at 0002 and 0003
>>>> on Sunday morning NZ time (UTC+12), likely in about 10 hours time.
>>>
>>> +1
>>
>> I've now pushed all 3 patches.   Thank you for all the reviews on
>> these and for the extra MemoryContextMethodID bit, Matthias.
>>
>>> I haven't looked at v6, but I've tried using it in situ, and it seems
>>> to work as well as hoped:
>>>
>>> https://www.postgresql.org/message-id/CANWCAZZQFfxvzO8yZHFWtQV%2BZ2gAMv1ku16Vu7KWmb5kZQyd1w%40mail.gmail.com
>>
>> I'm already impressed with the radix tree work.  Nice to see bump
>> allowing a little more memory to be saved for TID storage.
>>
>> David
> 
> There seems to be some issue with this on 32-bit machines. A couple of
> animals (grison, mamba) already complained about an assert in
> BumpCheck() during initdb, and I get the same crash on my rpi5 running
> 32-bit Debian - see the backtrace attached.
> 
> I haven't investigated, but considering it works on 64-bit, I guess
> it's not considering alignment somewhere. I can dig more if needed.
> 

I did try running it under valgrind, and there doesn't seem to be
anything particularly wrong - just a bit of noise about calculating CRC
on uninitialized bytes.

regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread Tomas Vondra
On 4/7/24 14:37, David Rowley wrote:
> On Sun, 7 Apr 2024 at 22:05, John Naylor  wrote:
>>
>> On Sat, Apr 6, 2024 at 7:37 PM David Rowley  wrote:
>>>
>>> I'm planning on pushing these, pending a final look at 0002 and 0003
>>> on Sunday morning NZ time (UTC+12), likely in about 10 hours time.
>>
>> +1
> 
> I've now pushed all 3 patches.   Thank you for all the reviews on
> these and for the extra MemoryContextMethodID bit, Matthias.
> 
>> I haven't looked at v6, but I've tried using it in situ, and it seems
>> to work as well as hoped:
>>
>> https://www.postgresql.org/message-id/CANWCAZZQFfxvzO8yZHFWtQV%2BZ2gAMv1ku16Vu7KWmb5kZQyd1w%40mail.gmail.com
> 
> I'm already impressed with the radix tree work.  Nice to see bump
> allowing a little more memory to be saved for TID storage.
> 
> David

There seems to be some issue with this on 32-bit machines. A couple of
animals (grison, mamba) already complained about an assert in
BumpCheck() during initdb, and I get the same crash on my rpi5 running
32-bit Debian - see the backtrace attached.

I haven't investigated, but considering it works on 64-bit, I guess
it's not considering alignment somewhere. I can dig more if needed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

warning: Can't open file /SYSV003414ed (deleted) during file-backed mapping note processing
[New LWP 20363]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by 
`/mnt/data/postgres/tmp_install/home/debian/pg-master/bin/postgres --boot -F 
-c'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=4152716800, signo=6, 
no_tid=) at pthread_kill.c:44
44  pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=4152716800, signo=6, 
no_tid=) at pthread_kill.c:44
#1  0xf721bcfc in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#2  0xf72050a0 in __GI_abort () at abort.c:79
#3  0x008b29fc in ExceptionalCondition (conditionName=0xb1c3cc "total_allocated 
== context->mem_allocated", fileName=0xb1bf98 "bump.c", lineNumber=808) at 
assert.c:66
#4  0x008f0088 in BumpCheck (context=0x131e330) at bump.c:808
#5  0x008ef1bc in BumpReset (context=0x131e330) at bump.c:258
#6  0x008ef310 in BumpDelete (context=0x131e330) at bump.c:287
#7  0x008fab58 in MemoryContextDeleteOnly (context=0x131e330) at mcxt.c:528
#8  0x008faa00 in MemoryContextDelete (context=0x131e330) at mcxt.c:482
#9  0x008fac1c in MemoryContextDeleteChildren (context=0x131c328) at mcxt.c:548
#10 0x008fa75c in MemoryContextReset (context=0x131c328) at mcxt.c:389
#11 0x0090a1a8 in tuplesort_free (state=0x131a3b8) at tuplesort.c:958
#12 0x0090a1fc in tuplesort_end (state=0x131a3b8) at tuplesort.c:973
#13 0x00149250 in _bt_spooldestroy (btspool=0x1222b20) at nbtsort.c:517
#14 0x00149200 in _bt_spools_heapscan (heap=0xebf9f250, index=0xebf426e0, 
buildstate=0xffde9ed0, indexInfo=0x12e8f90) at nbtsort.c:504
#15 0x00148e30 in btbuild (heap=0xebf9f250, index=0xebf426e0, 
indexInfo=0x12e8f90) at nbtsort.c:321
#16 0x001fd164 in index_build (heapRelation=0xebf9f250, 
indexRelation=0xebf426e0, indexInfo=0x12e8f90, isreindex=false, parallel=false) 
at index.c:3021
#17 0x001e0404 in build_indices () at bootstrap.c:962
#18 0x001da92c in boot_yyparse () at bootparse.y:388
#19 0x001de988 in BootstrapModeMain (argc=6, argv=0x119c114, check_only=false) 
at bootstrap.c:357
#20 0x004389e4 in main (argc=7, argv=0x119c110) at main.c:186
(gdb) up
#1  0xf721bcfc in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
26  ../sysdeps/posix/raise.c: No such file or directory.
(gdb) up
#2  0xf72050a0 in __GI_abort () at abort.c:79
79  abort.c: No such file or directory.
(gdb) up
#3  0x008b29fc in ExceptionalCondition (conditionName=0xb1c3cc "total_allocated 
== context->mem_allocated", fileName=0xb1bf98 "bump.c", lineNumber=808) at 
assert.c:66
66  abort();
(gdb) up
#4  0x008f0088 in BumpCheck (context=0x131e330) at bump.c:808
808 Assert(total_allocated == context->mem_allocated);
(gdb) p total_allocated
$1 = 8120
(gdb) p context->mem_allocated
$2 = 8192


Re: pg_combinebackup --copy-file-range

2024-04-07 Thread Tomas Vondra
On 4/7/24 19:46, Tomas Vondra wrote:
> On 4/5/24 21:43, Tomas Vondra wrote:
>> Hi,
>>
>> ...
>>
>> 2) The prefetching is not a huge improvement, at least not for these
>> three filesystems (btrfs, ext4, xfs). From the color scale it might seem
>> like it helps, but those values are relative to the baseline, so when
>> the non-prefetching value is 5% and with prefetching 10%, that means the
>> prefetching makes it slower. And that's very often true.
>>
>> This is visible more clearly in prefetching.pdf, comparing the
>> non-prefetching and prefetching results for each patch, not to baseline.
>> That makes it quite clear there's a lot of "red" where prefetching
>> makes it slower. It certainly does help for larger increments (which
>> makes sense, because the modified blocks are distributed randomly, and
>> thus come from random files, making long streaks unlikely).
>>
>> I've imagined the prefetching could be made a bit smarter to ignore the
>> streaks (=sequential patterns), but once again - this only matters with
>> the batching, which we don't have. And without the batching it looks
>> like a net loss (that's the first column in the prefetching PDF).
>>
>> I did start thinking about prefetching because of ZFS, where it was
>> necessary to get decent performance. And that's still true. But (a) even
>> with the explicit prefetching it's still 2-3x slower than any of these
>> filesystems, so I assume performance-sensitive use cases won't use it.
>> And (b) the prefetching seems necessary in all cases, no matter how
>> large the increment is. Which goes directly against the idea of looking
>> at how random the blocks are and prefetching only the sufficiently
>> random patterns. That doesn't seem like a great thing.
>>
> 
> I finally got a more complete ZFS results, and I also decided to get
> some numbers without the ZFS tuning I did. And boy oh boy ...
> 
> All the tests I did with ZFS were tuned the way I've seen recommended
> when using ZFS for PostgreSQL, that is
> 
>   zfs set recordsize=8K logbias=throughput compression=none
> 
> and this performed quite poorly - pg_combinebackup took 4-8x longer than
> with the traditional filesystems (btrfs, xfs, ext4) and the only thing
> that improved that measurably was prefetching.
> 
> But once I reverted back to the default recordsize of 128kB, the
> performance is way better - entirely comparable to ext4/xfs, while
> btrfs remains faster with --copy-file-range --no-manifest (by a factor
> of 2-3x).
> 

I forgot to explicitly say that I think this confirms the decision not to
push the patch adding the explicit prefetching to pg_combinebackup. It's
not needed/beneficial even for ZFS, when using a suitable configuration.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-07 Thread Tomas Vondra



On 4/7/24 16:24, Melanie Plageman wrote:
> On Sun, Apr 7, 2024 at 7:38 AM Tomas Vondra
>  wrote:
>>
>>
>>
>> On 4/7/24 06:17, Melanie Plageman wrote:
>>> On Sun, Apr 07, 2024 at 02:27:43AM +0200, Tomas Vondra wrote:
>>>> On 4/6/24 23:34, Melanie Plageman wrote:
>>>>> ...
>>>>>>
>>>>>> I realized it makes more sense to add a FIXME (I used XXX; I'm not sure when
>>>>>> to use what) with a link to the message where Andres describes why he
>>>>>> thinks it is a bug. If we plan on fixing it, it is good to have a record
>>>>>> of that. And it makes it easier to put a clear and accurate comment.
>>>>>> Done in 0009.
>>>>>>
>>>>>>> OK, thanks. I think 0001-0008 are ready to go, with some minor tweaks
>>>>>>> per above (tuple vs. tuples etc.), and the question about the recheck
>>>>>>> flag. If you can do these tweaks, I'll get that committed today and we
>>>>>>> can try to get a couple more patches in tomorrow.
>>>>>
>>>>> Attached v19 rebases the rest of the commits from v17 over the first
>>>>> nine patches from v18. All patches 0001-0009 are unchanged from v18. I
>>>>> have made updates and done cleanup on 0010-0021.
>>>>>
>>>>
>>>> I've pushed 0001-0005, I'll get back to this tomorrow and see how much
>>>> more we can get in for v17.
>>>
>>> Thanks! I thought about it a bit more, and I got worried about the
>>>
>>>   Assert(scan->rs_empty_tuples_pending == 0);
>>>
>>> in heap_rescan() and heap_endscan().
>>>
>>> I was worried if we don't complete the scan it could end up tripping
>>> incorrectly.
>>>
>>> I tried to come up with a query which didn't end up emitting all of the
>>> tuples on the page (using a LIMIT clause), but I struggled to come up
>>> with an example that qualified for the skip fetch optimization and also
>>> returned before completing the scan.
>>>
>>> I could work a bit harder tomorrow to try and come up with something.
>>> However, I think it might be safer to just change these to:
>>>
>>>   scan->rs_empty_tuples_pending = 0
>>>
>>
>> Hmmm, good point. I haven't tried, but wouldn't something like "SELECT 1
>> FROM t WHERE column = X LIMIT 1" do the trick? Probably in a join, as a
>> correlated subquery?
> 
> Unfortunately (or fortunately, I guess) that exact thing won't work
> because even constant values in the target list disqualify it for the
> skip fetch optimization.
> 
> Being a bit too lazy to look at planner code this morning, I removed
> the target list requirement like this:
> 
> -   need_tuples = (node->ss.ps.plan->qual != NIL ||
> -  node->ss.ps.plan->targetlist != NIL);
> +   need_tuples = (node->ss.ps.plan->qual != NIL);
> 
> And can easily trip the assert with this:
> 
> create table foo (a int);
> insert into foo select i from generate_series(1,10)i;
> create index on foo(a);
> vacuum foo;
> select 1 from (select 2 from foo limit 3);
> 
> Anyway, I don't know if we could find a query that does actually hit
> this. The only bitmap heap scan queries in the regress suite that meet
> the
>BitmapHeapScanState->ss.ps.plan->targetlist == NIL
> condition are aggregates (all are count(*)).
> 
> I'll dig a bit more later, but do you think this is worth adding an
> open item for? Even though I don't have a repro yet?
> 

Try this:

create table t (a int, b int) with (fillfactor=10);
insert into t select mod((i/22),2), (i/22) from generate_series(0,1000)
S(i);
create index on t(a);
vacuum analyze t;

set enable_indexonlyscan = off;
set enable_seqscan = off;
explain (analyze, verbose) select 1 from (values (1)) s(x) where exists
(select * from t where a = x);

KABOOM!

#2  0x78a16ac5fafe in __GI_raise (sig=sig@entry=6) at
../sysdeps/posix/raise.c:26
#3  0x78a16ac4887f in __GI_abort () at abort.c:79
#4  0x00bb2c5a in ExceptionalCondition (conditionName=0xc42ba8
"scan->rs_empty_tuples_pending == 0", fileName=0xc429c8 "heapam.c",
lineNumber=1090) at assert.c:66
#5  0x004f68bb in heap_endscan (sscan=0x19af3a0) at heapam.c:1090
#6  0x0077a94c in table_endscan (scan=0x19af3a0) at
../../../src/include/access/tableam.h:1001

So yeah, this assert is not quite correct. It's not breaking anything at
the moment, so we can fix it now or add it as an open item.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-07 Thread Tomas Vondra
On 4/7/24 15:11, Melanie Plageman wrote:
> On Sun, Apr 7, 2024 at 7:38 AM Tomas Vondra
>  wrote:
>>
>> On 4/7/24 06:17, Melanie Plageman wrote:
>>>> What bothers me on 0006-0008 is that the justification in the commit
>>>> messages is "future commit will do something". I think it's fine to have
>>>> separate "preparatory" patches (I really like how you structured the
>>>> patches this way), but it'd be good to have them right before that
>>>> "future" commit - I'd like not to have one in v17 and then the "future
>>>> commit" in v18, because that's an annoying complication for backpatching
>>>> (and probably also when implementing the AM?) etc.
>>>
>>> Yes, I was thinking about this also.
>>>
>>
>> Good we're on the same page.
> 
> Having thought about this some more I think we need to stop here for
> 17. v20-0001 and v20-0002 both make changes to the table AM API that
> seem bizarre and unjustifiable without the other changes. Like, here
> we changed all your parameters because someday we are going to do
> something! You're welcome!
> 

OK, I think that's essentially the "temporary breakage" that should not
span multiple releases, which I mentioned ~yesterday. I appreciate that
you're careful about this.

> Also, the iterators in the TableScanDescData might be something I
> could live with in the source code for a couple months before we make
> the rest of the changes in July+. But, adding them does push the
> TableScanDescData->rs_parallel member into the second cacheline, which
> will be true in versions of Postgres people are using for years. I
> didn't perf test, but seems bad.
> 

I haven't thought about how it affects cachelines, TBH. I'd expect it to
have minimal impact, because while it makes this struct larger, it should
make some other struct (used in essentially the same places) smaller. So
I'd guess this to be a zero-sum game, but perhaps I'm wrong.

For me the main question was "Is this the right place for this, even if
it's only temporary?"

> So, yes, unfortunately, I think we should pick up on the BHS saga in a
> few months. Or, actually, we should start focusing on that parallel
> BHS + 0 readahead bug and whether or not we are going to fix it.
>

Yes, the July CF is a good time to focus on this, early in the cycle.

> Sorry for the about-face.
> 

No problem. I very much prefer this over something that may not be quite
ready yet.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-07 Thread Tomas Vondra



On 4/7/24 06:17, Melanie Plageman wrote:
> On Sun, Apr 07, 2024 at 02:27:43AM +0200, Tomas Vondra wrote:
>> On 4/6/24 23:34, Melanie Plageman wrote:
>>> ...
>>>>
>>>> I realized it makes more sense to add a FIXME (I used XXX; I'm not sure when
>>>> to use what) with a link to the message where Andres describes why he
>>>> thinks it is a bug. If we plan on fixing it, it is good to have a record
>>>> of that. And it makes it easier to put a clear and accurate comment.
>>>> Done in 0009.
>>>>
>>>>> OK, thanks. I think 0001-0008 are ready to go, with some minor tweaks
>>>>> per above (tuple vs. tuples etc.), and the question about the recheck
>>>>> flag. If you can do these tweaks, I'll get that committed today and we
>>>>> can try to get a couple more patches in tomorrow.
>>>
>>> Attached v19 rebases the rest of the commits from v17 over the first
>>> nine patches from v18. All patches 0001-0009 are unchanged from v18. I
>>> have made updates and done cleanup on 0010-0021.
>>>
>>
>> I've pushed 0001-0005, I'll get back to this tomorrow and see how much
>> more we can get in for v17.
> 
> Thanks! I thought about it a bit more, and I got worried about the
> 
>   Assert(scan->rs_empty_tuples_pending == 0);
> 
> in heap_rescan() and heap_endscan().
> 
> I was worried if we don't complete the scan it could end up tripping
> incorrectly.
> 
> I tried to come up with a query which didn't end up emitting all of the
> tuples on the page (using a LIMIT clause), but I struggled to come up
> with an example that qualified for the skip fetch optimization and also
> returned before completing the scan.
> 
> I could work a bit harder tomorrow to try and come up with something.
> However, I think it might be safer to just change these to:
> 
>   scan->rs_empty_tuples_pending = 0
> 

Hmmm, good point. I haven't tried, but wouldn't something like "SELECT 1
FROM t WHERE column = X LIMIT 1" do the trick? Probably in a join, as a
correlated subquery?

It seemed OK to me and the buildfarm did not turn red, so I'd leave this
until after the code freeze.

>> What bothers me on 0006-0008 is that the justification in the commit
>> messages is "future commit will do something". I think it's fine to have
>> separate "preparatory" patches (I really like how you structured the
>> patches this way), but it'd be good to have them right before that
>> "future" commit - I'd like not to have one in v17 and then the "future
>> commit" in v18, because that's an annoying complication for backpatching
>> (and probably also when implementing the AM?) etc.
> 
> Yes, I was thinking about this also.
> 

Good we're on the same page.

>> AFAICS for v19, the "future commit" for all three patches (0006-0008) is
>> 0012, which introduces the unified iterator. Is that correct?
> 
> Actually, those patches (v19 0006-0008) were required for v19 0009,
> which is why I put them directly before it. 0009 eliminates use of the
> TBMIterateResult for control flow in BitmapHeapNext().
> 

Ah, OK. Thanks for the clarification.

> I've rephrased the commit messages to not mention future commits and
> instead focus on what the changes in the commit are enabling.
> 
> v19-0006 actually squashed very easily with v19-0009 and is actually
> probably better that way. It is still easy to understand IMO.
> 
> In v20, I've attached just the functionality from v19 0006-0009 but in
> three patches instead of four.
> 

Good. I'll take a look today.

>> Also, for 0008 I'm not sure we could even split it between v17 and v18,
>> because even if heapam did not use the iterator, what if some other AM
>> uses it? Without 0012 it'd be a problem for the AM, no?
> 
> The iterators in the TableScanDescData were introduced in v19-0009. It
> is okay for other AMs to use it. In fact, they will want to use it. It
> is still initialized and set up in BitmapHeapNext(). They would just
> need to call tbm_iterate()/tbm_shared_iterate() on it.
> 
> As for how table AMs will cope without the TBMIterateResult passed to
> table_scan_bitmap_next_tuple() (which is what v19 0008 did): they can
> save the location of the tuples to be scanned somewhere in their scan
> descriptor. Heap AM already did this and actually didn't use the
> TBMIterateResult at all.
> 

The reason I feel a bit uneasy about putting this in TableScanDescData
is that I see that struct as a "description of the scan" (~input parameters
defining what the scan should do), while for runtime state we have
ScanState in exec

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-06 Thread Tomas Vondra
On 4/6/24 23:34, Melanie Plageman wrote:
> ...
>>
>> I realized it makes more sense to add a FIXME (I used XXX; I'm not sure when
>> to use what) with a link to the message where Andres describes why he
>> thinks it is a bug. If we plan on fixing it, it is good to have a record
>> of that. And it makes it easier to put a clear and accurate comment.
>> Done in 0009.
>>
>>> OK, thanks. I think 0001-0008 are ready to go, with some minor tweaks
>>> per above (tuple vs. tuples etc.), and the question about the recheck
>>> flag. If you can do these tweaks, I'll get that committed today and we
>>> can try to get a couple more patches in tomorrow.
> 
> Attached v19 rebases the rest of the commits from v17 over the first
> nine patches from v18. All patches 0001-0009 are unchanged from v18. I
> have made updates and done cleanup on 0010-0021.
> 

I've pushed 0001-0005, I'll get back to this tomorrow and see how much
more we can get in for v17.

What bothers me on 0006-0008 is that the justification in the commit
messages is "future commit will do something". I think it's fine to have
separate "preparatory" patches (I really like how you structured the
patches this way), but it'd be good to have them right before that
"future" commit - I'd like not to have one in v17 and then the "future
commit" in v18, because that's an annoying complication for backpatching
(and probably also when implementing the AM?) etc.

AFAICS for v19, the "future commit" for all three patches (0006-0008) is
0012, which introduces the unified iterator. Is that correct?

Also, for 0008 I'm not sure we could even split it between v17 and v18,
because even if heapam did not use the iterator, what if some other AM
uses it? Without 0012 it'd be a problem for the AM, no?

Would it make sense to move 0009 before these three patches? That seems
like a meaningful change on its own, right?

FWIW I don't think it's very likely I'll commit the UnifiedTBMIterator
stuff. I do agree with the idea in general, but I think I'd need more
time to think about the details. Sorry about that ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-06 Thread Tomas Vondra
On 4/6/24 15:40, Melanie Plageman wrote:
> On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote:
>>
>>
>> On 4/6/24 01:53, Melanie Plageman wrote:
>>> On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote:
>>>> On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote:
>>>>> On 4/4/24 00:57, Melanie Plageman wrote:
>>>>>> On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote:
>>>>> I'd focus on the first ~8-9 commits or so for now, we can commit more if
>>>>> things go reasonably well.
>>>>
>>>> Sounds good. I will spend cleanup time on 0010-0013 tomorrow but would
>>>> love to know if you agree with the direction before I spend more time.
>>>
>>> In attached v16, I've split out 0010-0013 into 0011-0017. I think it is
>>> much easier to understand.
>>>
>>
>> Anyway, I've attached it as .tgz in order to not confuse cfbot. All the
>> review comments are marked with XXX, so grep for that in the patches.
>> There's two separate patches - the first one suggests a code change, so
>> it was better to not merge that with your code. The second has just a
>> couple XXX comments, I'm not sure why I kept it separate.
>>
>> A couple review comments:
>>
>> * I think 0001-0009 are 99% ready to go. I reworded some of the commit
>> messages a bit - I realize it's a bit bold, considering you're a native
>> speaker and I'm not. If you could check I didn't make it worse, that
>> would be great.
> 
> Attached v17 has *only* patches 0001-0009 with these changes. I will
> work on applying the remaining patches, addressing feedback, and adding
> comments next.
> 
> I have reviewed and incorporated all of your feedback on these patches.
> Attached v17 is your exact patches with 1 or 2 *very* slight tweaks to
> commit messages (comma splice removal and single word adjustments) as
> well as the changes listed below:
> 
> I have changed the following:
> 
> - 0003 added an assert that rs_empty_tuples_pending is 0 on rescan and
>   endscan
> 

OK

> - 0004 (your 0005)-- I followed up with Tom, but for now I have just
>   removed the XXX and also reworded the message a bit
> 

After the exercise I described a couple minutes ago, I think I'm
convinced the assumption is unnecessary and we should use the correct
recheck. Not that it'd make any difference in practice, considering none
of the opclasses ever changes the recheck.

Maybe the most prudent thing would be to skip this commit and maybe
leave this for later, but I'm not forcing you to do that if it would
mean a lot of disruption for the following patches.

> - 0006 (your 0007) fixed up the variable name (you changed valid ->
>   valid_block but it had gotten changed back)
> 

OK

> I have open questions on the following:
> 
> - 0003: should it be SO_NEED_TUPLES and need_tuples (instead of
>   SO_NEED_TUPLE and need_tuple)?
> 

I think SO_NEED_TUPLES is more accurate, as we need all tuples from the
block. But either would work.

> - 0009 (your 0010)
>   - Should I mention in the commit message that we added blockno and
>     pfblockno in the BitmapHeapScanState only for validation, or is that
>     too specific?
> 

For the commit message I'd say it's too specific, I'd put it in the
comment before the struct.

>   - Should I mention that a future (imminent) commit will remove the
>     iterators from TableScanDescData and put them in HeapScanDescData? I
>     imagine folks don't want those there, but it is easier for the
>     progression of commits to put them there first and then move them
> 

I'd try not to mention future commits as justification too often, if we
don't know that the future commit lands shortly after.

>   - I'm worried this comment is vague and/or potentially not totally
>     correct. Should we remove it? I don't think we have conclusive proof
>     that this is true.
>   /*
>    * Adjusting the prefetch iterator before invoking
>    * table_scan_bitmap_next_block() keeps prefetch distance higher across
>    * the parallel workers.
>    */
> 

TBH it's not clear to me what "higher across parallel workers" means.
But it sure shouldn't claim things that we think may not be correct. I
don't have a good idea how to reword it, though.

> 
>> * I'm not sure extra_flags is the right way to pass the flag in 0003.
>> The "extra_" name is a bit weird, and no other table AM functions do it
>> this way and pass explicit bool flags instead. So my first "review"
>> commit does it

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-06 Thread Tomas Vondra
On 4/6/24 02:51, Tomas Vondra wrote:
> 
> * The one question I'm somewhat unsure about is why Tom chose to use the
> "wrong" recheck flag in the 2017 commit, when the correct recheck flag
> is readily available. Surely that had a reason, right? But I can't think
> of one ...
> 

I've been wondering about this a bit more, so I decided to experiment
and try to construct a case for which the current code prefetches the
wrong blocks, and the patch fixes that. But I haven't been very
successful so far :-(

My understanding was that the current code should do the wrong thing if
I alternate all-visible and not-all-visible pages. This understanding is
not correct, as I learned, because the thing that needs to change is the
recheck flag, not visibility :-( I'm still posting what I tried, perhaps
you will have an idea how to alter it to demonstrate the incorrect
behavior with current master.

The test was very simple:

  create table t (a int, b int) with (fillfactor=10);
  insert into t select mod((i/22),2), (i/22)
from generate_series(0,1000) s(i);
  create index on t (a);

which creates a table with 46 pages, 22 rows per page, column "a"
alternates between 0/1 on pages, column "b" increments on each page (so
"b" identifies page).

and then

  delete from t where mod(b,8) = 0;

which deletes tuples on pages 0, 8, 16, 24, 32, 40, so these pages will
need to be prefetched as not-all-visible by this query

  explain analyze select count(1) from t where a = 0

when forced to do bitmap heap scan. The other even-numbered pages remain
all-visible. I added a bit of logging into BitmapPrefetch(), but even
with master I get this:

LOG:  prefetching block 8 0 current block 6 0
LOG:  prefetching block 16 0 current block 14 0
LOG:  prefetching block 24 0 current block 22 0
LOG:  prefetching block 32 0 current block 30 0
LOG:  prefetching block 40 0 current block 38 0

So it prefetches the correct pages (the other value is the recheck flag
for that block from the iterator result).

Turns out (and I realize the comment about the assumption actually
states that, I just failed to understand it) the thing that would have
to differ for the blocks is the recheck flag.

But that can't actually happen because that's set by the AM/opclass and
the built-in ones do essentially this:

.../hash.c:  scan->xs_recheck = true;
.../nbtree.c:  scan->xs_recheck = false;


gist opclasses (e.g. btree_gist):

/* All cases served by this function are exact */
*recheck = false;

spgist opclasses (e.g. geo_spgist):

/* All tests are exact. */
out->recheck = false;

If there's an opclass that alters the recheck flag, it's well hidden and
I missed it.

Anyway, after this exercise and learning more about the recheck flag, I
think I agree the assumption is unnecessary. It's pretty harmless
because none of the built-in opclasses alters the recheck flag, but the
correct recheck flag is readily available. I'm still a bit puzzled why
the 2017 commit even relied on this assumption, though.

regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-05 Thread Tomas Vondra


On 4/6/24 01:53, Melanie Plageman wrote:
> On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote:
>> On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote:
>>>
>>>
>>> On 4/4/24 00:57, Melanie Plageman wrote:
>>>> On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote:
>>>>> On Fri, Mar 29, 2024 at 12:05:15PM +0100, Tomas Vondra wrote:
>>>>>>
>>>>>>
>>>>>> On 3/29/24 02:12, Thomas Munro wrote:
>>>>>>> On Fri, Mar 29, 2024 at 10:43 AM Tomas Vondra
>>>>>>>  wrote:
>>>>>>>> I think there's some sort of bug, triggering this assert in heapam
>>>>>>>>
>>>>>>>>   Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
>>>>>>>
>>>>>>> Thanks for the repro.  I can't seem to reproduce it (still trying) but
>>>>>>> I assume this is with Melanie's v11 patch set which had
>>>>>>> v11-0016-v10-Read-Stream-API.patch.
>>>>>>>
>>>>>>> Would you mind removing that commit and instead applying the v13
>>>>>>> stream_read.c patches[1]?  v10 stream_read.c was a little confused
>>>>>>> about random I/O combining, which I fixed with a small adjustment to
>>>>>>> the conditions for the "if" statement right at the end of
>>>>>>> read_stream_look_ahead().  Sorry about that.  The fixed version, with
>>>>>>> eic=4, with your test query using WHERE a < a, ends its scan with:
>>>>>>>
>>>>>>
>>>>>> I'll give that a try. Unfortunately the v11 still has the
>>>>>> problem I reported about a week ago:
>>>>>>
>>>>>>   ERROR:  prefetch and main iterators are out of sync
>>>>>>
>>>>>> So I can't run the full benchmarks :-( but master vs. streaming read API
>>>>>> should work, I think.
>>>>>
>>>>> Odd, I didn't notice you reporting this ERROR popping up. Now that I
>>>>> take a look, v11 (at least, maybe also v10) had this very silly mistake:
>>>>>
>>>>>   if (scan->bm_parallel == NULL &&
>>>>>       scan->rs_pf_bhs_iterator &&
>>>>>       hscan->pfblockno > hscan->rs_base.blockno)
>>>>>        elog(ERROR, "prefetch and main iterators are out of sync");
>>>>>
>>>>> It errors out if the prefetch block is ahead of the current block --
>>>>> which is the opposite of what we want. I've fixed this in attached v12.
>>>>>
>>>>> This version also has v13 of the streaming read API. I noticed one
>>>>> mistake in my bitmapheap scan streaming read user -- it freed the
>>>>> streaming read object at the wrong time. I don't know if this was
>>>>> causing any other issues, but it at least is fixed in this version.
>>>>
>>>> Attached v13 is rebased over master (which includes the streaming read
>>>> API now). I also reset the streaming read object on rescan instead of
>>>> creating a new one each time.
>>>>
>>>> I don't know how much chance any of this has of going in to 17 now, but
>>>> I thought I would start looking into the regression repro Tomas provided
>>>> in [1].
>>>>
>>>
>>> My personal opinion is that we should try to get in as many of the
>>> refactoring patches as possible, but I think it's probably too late for
>>> the actual switch to the streaming API.
>>
>> Cool. In the attached v15, I have dropped all commits that are related
>> to the streaming read API and included *only* commits that are
>> beneficial to master. A few of the commits are merged or reordered as
>> well.
>>
>> While going through the commits with this new goal in mind (forget about
>> the streaming read API for now), I realized that it doesn't make much
>> sense to just eliminate the layering violation for the current block and
>> leave it there for the prefetch block. I had de-prioritized solving this
>> when I thought we would just delete the prefetch code and replace it
>> with the streaming read.
>>
>> Now that we aren't doing that, I've spent the day trying to resolve the
>> issues with pushing the prefetch code into heapam.c that I cited in [1].
>> 0010 - 0013 are the

Re: incremental backup breakage in BlockRefTableEntryGetBlocks

2024-04-04 Thread Tomas Vondra
On 4/4/24 19:38, Robert Haas wrote:
> Hi,
> 
> Yesterday, Tomas Vondra reported to me off-list that he was seeing
> what appeared to be data corruption after taking and restoring an
> incremental backup. Overnight, Jakub Wartak further experimented with
> Tomas's test case, did some initial analysis, and made it very easy to
> reproduce. I spent this morning tracking down the problem, for which I
> attach a patch.
> 

Thanks, I can confirm this fixes the issue I've observed/reported. On
master 10 out of 10 runs failed, with the patch no failures.

The test is very simple:

1) init pgbench
2) full backup
3) run short pgbench
4) incremental backup
5) compare pg_dumpall on the instance vs. restored backup


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup --copy-file-range

2024-04-04 Thread Tomas Vondra
On 4/4/24 12:25, Jakub Wartak wrote:
> On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra
>  wrote:
>>
>> Hi,
>>
>> Here's a much more polished and cleaned up version of the patches,
>> fixing all the issues I've been aware of, and with various parts merged
>> into much more cohesive parts (instead of keeping them separate to make
>> the changes/evolution more obvious).
> 
> OK, so three runs of incrementalbackupstests - as stated earlier -
> also passed with OK for v20240403 (this time even with
> --enable-cassert)
> 
> pg_combinebackup flags tested were:
> 1) --copy-file-range --manifest-checksums=CRC32C
> 2) --copy-file-range --manifest-checksums=NONE
> 3) default, no flags (no copy-file-range)
> 

Thanks!

>> I changed how I think about this a bit - I don't really see the CoW copy
>> methods as necessarily faster than the regular copy (even though it can be
>> with (5)). I think the main benefits are the space savings, enabled by
>> patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without
>> that I don't think the performance is an issue - everything has a cost.
> 
> I take it differently: incremental backups without a CoW fs would clearly be:
> - inefficient in terms of RTO (SANs are often a key thing due to
> having the ability to quickly "restore" the clone rather than copying the
> data from somewhere else)
> - pg_basebackup without that would be unusable without space savings
> (e.g. imagine daily backups @ 10+TB DWHs)
> 

Right, although this very much depends on the backup scheme. If you only
take incremental backups, and then also a full backup once in a while,
the CoW stuff probably does not help much. The alignment (the only thing
affecting basebackups) may allow deduplication, but that's all I think.

If the scheme is more complex, and involves "merging" the increments
into the full backup, then this does help a lot. It'd even be possible
to cheaply clone instances this way, I think. But I'm not sure how often
people would do that on the same volume, to benefit from the CoW.

>> On 4/3/24 15:39, Jakub Wartak wrote:
>>> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
>>>  wrote:
> [..]
>> Thanks. Would be great if you could run this on the attached version of
>> the patches, ideally for each of them independently, so make sure it
>> doesn't get broken+fixed somewhere on the way.
> 
> Those are semi-manual test runs (~30 min? per run), and the above results
> are for all of them applied at once. So my take is that since all of them
> work together, each one works individually too.
> 

Cool, thanks.

> FWIW, I'm also testing your other offlist incremental backup
> corruption issue, but that doesn't seem to be related in any way to
> the copy_file_range() patches here.
> 

Yes, that's entirely independent, happens with master too.

>>>> 2) prefetch
>>>> ---
>>> [..]
>>>> I think this means we may need a "--prefetch" option, that'd force
>>>> prefetching, probably both before pread and copy_file_range. Otherwise
>>>> people on ZFS are doomed and will have poor performance.
>>>
>>> Right, we could optionally cover in the docs later-on various options
>>> to get the performance (on XFS use $this, but without $that and so
>>> on). It's kind of madness dealing with all those performance
>>> variations.
>>>
>>
>> Yeah, something like that. I'm not sure we want to talk too much about
>> individual filesystems in our docs, because those things evolve over
>> time too.
> 
> Sounds like Wiki then.
> 
> BTW, after a quick review: could we in 05 have something like a common
> value (to keep those together via some .h?)
> 
> #define BATCH_SIZE PREFETCH_TARGET ?
> 

Yes, that's one of the things I'd like to refine a bit more. Making it
more consistent / clearer that these things are interdependent.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup --copy-file-range

2024-04-03 Thread Tomas Vondra
Hi,

Here's a much more polished and cleaned up version of the patches,
fixing all the issues I've been aware of, and with various parts merged
into much more cohesive parts (instead of keeping them separate to make
the changes/evolution more obvious).

I decided to reorder the changes like this:

1) block alignment - As I said earlier, I think we should definitely do
this, even if only to make future improvements possible. After chatting
about this with Robert off-list a bit, he pointed out I actually forgot
to not align the headers for files without any blocks, so this version
fixes that.

2) add clone/copy_file_range for the case that copies whole files. This
is pretty simple, but it also adds the --clone/copy-file-range options,
and similar infrastructure. The one slightly annoying bit is that we now
have the ifdef stuff in two places - when parsing the options, and then
in the copy_file_XXX methods, and the pg_fatal() calls should not be
reachable in practice. But that doesn't seem harmful, and might be a
useful protection against someone calling a function that does nothing.

This also merges the original two parts, where the first one only did
this cloning/CoW stuff when checksum did not need to be calculated, and
the second extended it to those cases too (by also reading the data, but
still doing the copy the old way).

I think this is the right way - if that's not desirable, it's easy to
either add --no-manifest or not use the CoW options. Or just not change
the checksum type. There's no other way.

3) add copy_file_range to write_reconstructed_file, by using roughly the
same logic/reasoning as (2). If --copy-file-range is specified and a
checksum should be calculated, the data is read for the checksum, but
the copy is done using copy_file_range (see the sketch after this list).

I did rework the write_reconstructed_file() flow a bit, because
tracking what exactly needs to be read/written in each of the cases
(which copy method, zeroed block, checksum calculated) made the flow
really difficult to follow. Instead I introduced a function to
read/write a block, and call them from different places.

I think this is much better, and it also makes the following part
dealing with batches of blocks much easier / smaller change.

4) prefetching - This is mostly unrelated to the CoW stuff, but it has
tremendous benefits, especially for ZFS. I haven't been able to tune ZFS
to get decent performance, and ISTM it'd be rather unusable for backup
purposes without this.

5) batching in write_reconstructed_file - This benefits especially the
copy_file_range case, where I've seen it to yield massive speedups (see
the message from Monday for better data).

6) batching for prefetch - Similar logic to (5), but for fadvise. This
is the only part where I'm not entirely sure whether it actually helps
or not. Needs some more analysis, but I'm including it for completeness.
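
Just to make (3) a bit more concrete, here's a standalone sketch of the idea
(this is not the pg_combinebackup code - the checksum is a dummy placeholder,
and error handling and partial copy_file_range() results are ignored): the
data is read only to feed the checksum, while the actual copy is left to
copy_file_range() so the filesystem can share blocks.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/* dummy stand-in for the real manifest checksum update */
static void
checksum_update(uint64_t *sum, const char *buf, ssize_t len)
{
    for (ssize_t i = 0; i < len; i++)
        *sum += (unsigned char) buf[i];
}

int
main(int argc, char **argv)
{
    char        buf[BLCKSZ];
    uint64_t    sum = 0;
    off_t       off = 0;
    ssize_t     nread;
    int         src,
                dst;

    if (argc < 3)
        return 1;

    src = open(argv[1], O_RDONLY);
    dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (src < 0 || dst < 0)
        return 1;

    while ((nread = pread(src, buf, BLCKSZ, off)) > 0)
    {
        off_t       in_off = off;
        off_t       out_off = off;

        /* the read feeds the checksum only ... */
        checksum_update(&sum, buf, nread);

        /* ... the copy is done by the kernel, sharing blocks if it can */
        (void) copy_file_range(src, &in_off, dst, &out_off, nread, 0);

        off += nread;
    }

    printf("checksum: %llu\n", (unsigned long long) sum);
    close(src);
    close(dst);
    return 0;
}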


I do think the parts (1)-(4) are in pretty good shape, good enough to
make them committable in a day or two. I see it mostly as a matter of
testing and comment improvements rather than code changes.

(5) is in pretty good shape too, but I'd like to maybe simplify and
refine the write_reconstructed_file changes a bit more. I don't think
it's broken, but it feels a bit too cumbersome.

Not sure about (6) yet.

I changed how I think about this a bit - I don't really see the CoW copy
methods as necessarily faster than the regular copy (even though it can be
with (5)). I think the main benefits are the space savings, enabled by
patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without
that I don't think the performance is an issue - everything has a cost.


On 4/3/24 15:39, Jakub Wartak wrote:
> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
>  wrote:
>>
>> Hi,
>>
>> I've been running some benchmarks and experimenting with various stuff,
>> trying to improve the poor performance on ZFS, and the regression on XFS
>> when using copy_file_range. And oh boy, did I find interesting stuff ...
> 
> [..]
> 
> Congratulations on great results!
> 
>> 4) after each pg_combinebackup run, run pg_verifybackup, start the cluster
>> to finish recovery, and run pg_checksums --check (to check the patches don't
>> produce something broken)
> 
> I've performed some follow-up small testing on all patches mentioned
> here  (1..7), with the earlier developed nano-incremental-backup-tests
> that helped detect some issues for Robert earlier during original
> development. They all went fine in both cases:
> - no special options when using pg_combinebackup
> - using pg_combinebackup --copy-file-range --manifest-checksums=NONE
> 
> Those were:
> test_across_wallevelminimal.sh
> test_full_pri__incr_stby__restore_on_pri.sh
> test_full_pri__incr_stby__restore_on_stby.sh
> test_full_stby__incr_stby__restore_on_pri.sh
> test_full_stby__incr_stby__restore_on_stby.sh
> test_

Re: using extended statistics to improve join estimates

2024-04-02 Thread Tomas Vondra
On 4/2/24 10:23, Andy Fan wrote:
> 
>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>>> Rebased over 269b532ae and muted compiler warnings.
> 
> Thank you Justin for the rebase!
> 
> Hello Tomas,
> 
> Thanks for the patch! Before I review the patch at the code level, I want
> to explain my understanding about this patch first.
> 

If you want to work on this patch, that'd be cool. A review would be
great, but if you want to maybe take over and try moving it forward,
that'd be even better. I don't know when I'll have time to work on it
again, but I'd promise to help you with working on it.

> Before this patch, we already use MCV information for eqjoinsel; it
> works by combining the MCVs on both sides to figure out the mcv_freq
> and then treating the rest equally, but this doesn't work for MCVs in
> extended statistics, and this patch fills that gap. Besides that, since
> extended statistics means more than 1 column is involved, if 1+
> columns are Const based on RestrictInfo, we can use that information to
> filter the MCVs we are interested in, which is really cool.
> 

Yes, I think that's an accurate description of what the patch does.

> I did some more testing, all of them inner joins so far; all of them
> work amazingly well and I am surprised this patch didn't draw more
> attention. I will test more after I go through the code.
> 

I think it didn't go forward for a bunch of reasons:

1) I got distracted by something else requiring immediate attention, and
forgot about this patch.

2) I got stuck on some detail of the patch, unsure which of the possible
solutions to try first.

3) Uncertainty about how applicable the patch is in practice.

I suppose it was some combination of these reasons, not sure.


As for the "practicality" mentioned in (3), it's been a while since I
worked on the patch so I don't recall the details, but I think I've been
thinking mostly about "start join" queries, where a big "fact" table
joins to small dimensions. And in that case the fact table may have a
MCV, but the dimensions certainly don't have any (because the join
happens on a PK).

But maybe that's the wrong way to think about it - it was clearly useful
to consider the case with (per-attribute) MCVs on both sides as worth
special handling. So why not do that for multi-column MCVs, right?

> As for the code level, I reviewed them in a top-down manner and am almost
> 40% complete. Here are some findings, just FYI. For efficiency purposes,
> I provide each piece of feedback as an individual commit; after all, I want to
> make sure my comments are practical, and coding and testing is a good way
> to achieve that. I tried to make each of them as small as possible so
> that you can reject or accept them conveniently.
> 
> 0001 is your patch, which I just rebased against current master. 0006
> is not very relevant to the current patch, and I think it can be committed
> individually if you are OK with that.
> 
> Hope this kind of review is helpful.
> 

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup --copy-file-range

2024-04-02 Thread Tomas Vondra
it out soonish...)
> 

It's entirely possible I'm just too stupid and it works just fine for
everyone else. But maybe not, and I'd say an implementation that is this
difficult to configure is almost as if it didn't exist at all. The Linux
read-ahead works great by default.

So I don't see how to make this work without explicit prefetch ... Of
course, we could also do no prefetch and tell users it's up to ZFS to
make this work, but I don't think it does them any service.
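
For context, the explicit prefetch discussed here boils down to issuing
POSIX_FADV_WILLNEED hints some distance ahead of the reads. A standalone
sketch of the technique (not the actual patch - the prefetch distance and
the purely sequential access pattern are made up for illustration):

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ          8192
#define PREFETCH_TARGET 32      /* hypothetical prefetch distance, in blocks */

int
main(int argc, char **argv)
{
    char        buf[BLCKSZ];
    off_t       blockno = 0;
    ssize_t     nread;
    int         fd;

    if (argc < 2)
        return 1;

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;

    for (;;)
    {
        /* tell the kernel about a block we'll need shortly */
        (void) posix_fadvise(fd, (blockno + PREFETCH_TARGET) * BLCKSZ,
                             BLCKSZ, POSIX_FADV_WILLNEED);

        /* the actual read - ideally the data is already in the page cache */
        nread = pread(fd, buf, BLCKSZ, blockno * BLCKSZ);
        if (nread <= 0)
            break;

        blockno++;
    }

    close(fd);
    return 0;
}

The not-pushed patch presumably issued the same kind of hint for the blocks
about to be read during reconstruction, rather than for a plain sequential
range like this sketch does.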


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup --copy-file-range

2024-03-30 Thread Tomas Vondra



On 3/31/24 06:46, Thomas Munro wrote:
> On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra
>  wrote:
>> I'm on 2.2.2 (on Linux). But there's something wrong, because the
>> pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS.
>>
>> I'm not sure it's a ZFS config issue, though, because it's not CPU or
>> I/O bound, and I see this on both machines. And some simple dd tests
>> show the zpool can do 10x the throughput. Could this be due to the file
>> header / pool alignment?
> 
> Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with
> the same 128kB record as you copy_file_range 16 x 8kB blocks?
> (Guessing you might be using the default recordsize?)
> 

No, I reduced the record size to 8kB. And the pgbench init takes about
the same as on other filesystems on this hardware, I think. ~10 minutes
for scale 5000.

>> I admit I'm not very familiar with the format, but you're probably right
>> there's a header, and header_length does not seem to consider alignment.
>> make_incremental_rfile simply does this:
>>
>> /* Remember length of header. */
>> rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
>> sizeof(rf->truncation_block_length) +
>> sizeof(BlockNumber) * rf->num_blocks;
>>
>> and sendFile() does the same thing when creating incremental basebackup.
>> I guess it wouldn't be too difficult to make sure to align this to
>> BLCKSZ or something like this. I wonder if the file format is documented
>> somewhere ... It'd certainly be nicer to tweak before v18, if necessary.
>>
>> Anyway, is that really a problem? I mean, in my tests the CoW stuff
>> seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe
>> that's why it took longer on XFS ...
> 
> Yeah I'm not sure, I assume it did more allocating and copying because
> of that.  It doesn't matter and it would be fine if a first version
> weren't as good as possible, and fine if we tune the format later once
> we know more, ie leaving improvements on the table.  I just wanted to
> share the observation.  I wouldn't be surprised if the block-at-a-time
> coding makes it slower and maybe makes the on disk data structures
> worse, but I dunno I'm just guessing.
> 
> It's also interesting but not required to figure out how to tune ZFS
> well for this purpose right now...

No idea. Any idea if there are some good ZFS statistics to check?


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup --copy-file-range

2024-03-30 Thread Tomas Vondra
On 3/31/24 03:03, Thomas Munro wrote:
> On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra
>  wrote:
>> So I decided to take a stab at Thomas' idea, i.e. reading the data to
>> ...
>> I'll see how this works on EXT4/ZFS next ...
> 
> Wow, very cool!  A couple of very quick thoughts/notes:
> 
> ZFS: the open source version only gained per-file block cloning in
> 2.2, so if you're on an older release I expect copy_file_range() to
> work but not really do the magic thing.  On the FreeBSD version you
> also have to turn cloning on with a sysctl because people were worried
> about bugs in early versions so by default you still get actual
> copying, not sure if you need something like that on the Linux
> version...  (Obviously ZFS is always completely COW-based, but before
> the new block cloning stuff it could only share blocks by its own
> magic deduplication if enabled, or by cloning/snapshotting a whole
> dataset/mountpoint; there wasn't a way to control it explicitly like
> this.)
> 

I'm on 2.2.2 (on Linux). But there's something wrong, because the
pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS.

I'm not sure it's a ZFS config issue, though, because it's not CPU or
I/O bound, and I see this on both machines. And some simple dd tests
show the zpool can do 10x the throughput. Could this be due to the file
header / pool alignment?

> Alignment: block sharing on any fs requires it.  I haven't re-checked
> recently but IIRC the incremental file format might have a
> non-block-sized header?  That means that if you copy_file_range() from
> both the older backup and also the incremental backup, only the former
> will share blocks, and the latter will silently be done by copying to
> newly allocated blocks.  If that's still true, I'm not sure how hard
> it would be to tweak the format to inject some padding and to make
> sure that there isn't any extra header before each block.

I admit I'm not very familiar with the format, but you're probably right
there's a header, and header_length does not seem to consider alignment.
make_incremental_rfile simply does this:

/* Remember length of header. */
rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
sizeof(rf->truncation_block_length) +
sizeof(BlockNumber) * rf->num_blocks;

and sendFile() does the same thing when creating incremental basebackup.
I guess it wouldn't be too difficult to make sure to align this to
BLCKSZ or something like this. I wonder if the file format is documented
somewhere ... It'd certainly be nicer to tweak before v18, if necessary.
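
Just to illustrate what "align this to BLCKSZ" could look like, here's a
hedged sketch (not an actual patch - it only shows the rounding, and the
backend writing the file and pg_combinebackup would of course have to agree
on writing/skipping the padding; field types and values are made up):

#include <stdio.h>
#include <stdint.h>

#define BLCKSZ 8192
typedef uint32_t BlockNumber;

int
main(void)
{
    uint32_t    magic = 0;      /* example values only */
    uint32_t    num_blocks = 100;
    uint32_t    truncation_block_length = 0;
    size_t      header_length;

    header_length = sizeof(magic) + sizeof(num_blocks) +
        sizeof(truncation_block_length) +
        sizeof(BlockNumber) * num_blocks;

    /*
     * Round the header up to the next BLCKSZ boundary, so the per-block
     * data starts block-aligned and block sharing can actually happen.
     */
    header_length = ((header_length + BLCKSZ - 1) / BLCKSZ) * BLCKSZ;

    printf("padded header length: %zu\n", header_length);
    return 0;
}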

Anyway, is that really a problem? I mean, in my tests the CoW stuff
seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe
that's why it took longer on XFS ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-29 Thread Tomas Vondra
On 3/29/24 22:39, Thomas Munro wrote:
> ...
> 
>> I don't recall seeing a system with disabled readahead, but I'm
>> sure there are cases where it may not really work - it clearly can't
>> work with direct I/O, ...
> 
> Right, for direct I/O everything is slow right now including seq scan.
> We need to start asynchronous reads in the background (imagine
> literally just a bunch of background "I/O workers" running preadv() on
> your behalf to get your future buffers ready for you, or equivalently
> Linux io_uring).  That's the real goal of this project: restructuring
> so we have the information we need to do that, ie teach every part of
> PostgreSQL to predict the future in a standard and centralised way.
> Should work out better than RA heuristics, because we're not just
> driving in a straight line, we can turn corners too.
> 
>> ... but I've also not been very successful with
>> prefetching on ZFS.
> 
> posix_favise() did not do anything in OpenZFS before 2.2, maybe you
> have an older version?
> 

Sorry, I meant the prefetch (readahead) built into ZFS. I may be wrong,
but I don't think the regular RA (in the Linux kernel) works for ZFS, right?

I was wondering if we could use this (posix_fadvise) to improve that,
essentially by issuing fadvise even for sequential patterns. But now
that I think about it, if posix_fadvise works since 2.2, maybe RA
works too now?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-29 Thread Tomas Vondra
On 3/29/24 23:03, Thomas Munro wrote:
> On Sat, Mar 30, 2024 at 10:39 AM Thomas Munro  wrote:
>> On Sat, Mar 30, 2024 at 4:53 AM Tomas Vondra
>>  wrote:
>>> ... Maybe there should be some flag to force
>>> issuing fadvise even for sequential patterns, perhaps at the tablespace
>>> level? ...
>>
>> Yeah, I've wondered about trying harder to "second guess" the Linux
>> RA.  At the moment, read_stream.c detects *exactly* sequential reads
>> (see seq_blocknum) to suppress advice, but if we knew/guessed the RA
>> window size, we could (1) detect it with the same window that Linux
>> will use to detect it, and (2) [new realisation from yesterday's
>> testing] we could even "tickle" it to wake it up in certain cases
>> where it otherwise wouldn't, by temporarily using a smaller
>> io_combine_limit if certain patterns come along.  I think that sounds
>> like madness (I suspect that any place where the latter would help is
>> a place where you could turn RA up a bit higher for the same effect
>> without weird kludges), or another way to put it would be to call it
>> "overfitting" to the pre-existing quirks; but maybe it's a future
>> research idea...
> 

I don't know if I'd call this overfitting - yes, we certainly don't want
to tailor this code to only work with the Linux RA, but OTOH that RA is
what most systems use. And if we plan to rely on it, we probably
have to "respect" how it works ...

Moving to a "clean" approach that nevertheless triggers regressions does not
seem like a great thing for users. I'm not saying the goal has to be "no
regressions" - that would be rather impossible. At this point I'm still
trying to understand what's causing this.

BTW are you suggesting that increasing the RA distance could maybe fix
the regressions? I can give it a try, but I was assuming that 128kB
readahead would be enough for combine_limit=8kB.

> I guess I missed a step when responding that suggestion: I don't think
> we could have an "issue advice always" flag, because it doesn't seem
> to work out as well as letting the kernel do it, and a global flag
> like that would affect everything else including sequential scans
> (once the streaming seq scan patch goes in).  But suppose we could do
> that, maybe even just for BHS.  In my little test yesterday had to
> issue a lot of them, patched eic=4, to beat the kernel's RA with
> unpatched eic=0:
> 
> eic  unpatched  patched
>   0       4172     9572
>   1      30846    10376
>   2      18435     5562
>   4      18980     3503
> 
> So if we forced fadvise to be issued with a GUC, it still wouldn't be
> good enough in this case.  So we might need to try to understand what
> exactly is waking the RA up for unpatched but not patched, and try to
> tickle it by doing a little less I/O combining (for example just
> setting io_combine_limit=1 gives the same number for eic=0, a major
> clue), but that seems to be going down a weird path, and tuning such a
> copying algorithm seems too hard.

Hmmm. I admit I didn't think about the "always prefetch" flag too much,
but I did imagine it'd only affect some places (e.g. BHS, but not
sequential scans). If it could be done by lowering the combine limit,
that could work too - in fact, I was wondering if we should have the
combine limit as a tablespace parameter too.
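
Just to illustrate what I mean - effective_io_concurrency can already be
set at the tablespace level, so a (purely hypothetical) combine-limit
option could follow the same pattern:

  -- this already works today
  alter tablespace fast_ssd set (effective_io_concurrency = 32);

  -- hypothetical equivalent for the combine limit (does not exist)
  -- alter tablespace fast_ssd set (io_combine_limit = 1);

(fast_ssd is just a made-up tablespace name, of course.)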

But I think adding such knobs should be only the last resort - I myself
don't know how to set these parameters, how could we expect users to
pick good values? Better to have something that "just works".

I admit I never 100% understood when exactly the kernel RA kicks in, but
I always thought it's enough for the pattern to be "close enough" to
sequential. Isn't the problem that this only skips fadvise for 100%
sequential patterns, but keeps prefetching for cases the RA would handle
on its own? So maybe we should either relax the conditions for skipping
fadvise, or combine even pages that are not perfectly sequential (though
I'm not sure whether that's possible for fadvise alone).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-29 Thread Tomas Vondra



On 3/29/24 02:12, Thomas Munro wrote:
> On Fri, Mar 29, 2024 at 10:43 AM Tomas Vondra
>  wrote:
>> I think there's some sort of bug, triggering this assert in heapam
>>
>>   Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
> 
> Thanks for the repro.  I can't seem to reproduce it (still trying) but
> I assume this is with Melanie's v11 patch set which had
> v11-0016-v10-Read-Stream-API.patch.
> 
> Would you mind removing that commit and instead applying the v13
> stream_read.c patches[1]?  v10 stream_read.c was a little confused
> about random I/O combining, which I fixed with a small adjustment to
> the conditions for the "if" statement right at the end of
> read_stream_look_ahead().  Sorry about that.  The fixed version, with
> eic=4, with your test query using WHERE a < a, ends its scan with:
> 

I'll give that a try. Unfortunately, the v11 still has the problem I
reported about a week ago:

  ERROR:  prefetch and main iterators are out of sync

So I can't run the full benchmarks :-( but master vs. streaming read API
should work, I think.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: incorrect results and different plan with 2 very similar queries

2024-03-28 Thread Tomas Vondra



On 3/27/24 23:10, Dave Cramer wrote:
> Dave Cramer
> 
> 
> On Wed, 27 Mar 2024 at 17:57, David Rowley  wrote:
> 
>> On Thu, 28 Mar 2024 at 10:33, Dave Cramer  wrote:
>>> There is a report on the pgjdbc github JDBC Driver shows erratic
>> behavior when filtering on CURRENT_DATE · pgjdbc/pgjdbc · Discussion #3184 (
>> github.com)
>>>
>>> Here are the plans.
>>>
>>> JDBC - Nested Loop (incorrect result)
>>>
>>> Index Cond: (mutation >= ((CURRENT_DATE -
>> '1971-12-31'::date) - 28))
>>
>>> JDBC - Hash Right (correct result)
>>>
>>> Recheck Cond: (mutation >= ((CURRENT_DATE -
>> '1971-12-31'::date) - 29))
>>
>> I don't see any version details or queries, but going by the
>> conditions above, the queries don't appear to be the same, so
>> different results aren't too surprising and not a demonstration that
>> there's any sort of bug.
>>
> 
> Sorry, you are correct. Version is 12.14. Here is the query
> 
> SELECT
>   p.partseqno_i
> , p.partno
> , p.partmatch
> , pfe.average_price
> , pfe.sales_price
> , pfe.purch_price
> , pfe.average_price_2
> , pfe.avg_repair_cost
> , pfe.average_price_func
> , pfe.fsv
> , pfe.fsv_func
> , p.status
> 
> FROM part p
> LEFT JOIN part_fa_entity pfe ON (p.partno = pfe.partno)
> WHERE 1=1
> AND (p.mutation >= (CURRENT_DATE - '1971-12-31'::date)-27) ORDER BY p.partno
> 

I guess the confusing bit is that the report does not claim that those
queries are expected to produce the same result, but that the parameter
value affects which plan gets selected, and one of those plans produces
an incorrect result.

I think the simplest explanation might be that one of the indexes on
part_fa_entity is corrupted and fails to look up some rows by partno.
That would explain why the plan with the seqscan works fine, while the
nestloop with the index scan is missing those rows.

They might try a couple of things:

1) set enable_nestloop=off, and see if the results become correct

2) try bt_index_check on i_39773, which might detect some corruption

3) rebuild the index
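
Concretely - assuming i_39773 really is a btree index, which is what
bt_index_check expects - that might look like this:

  set enable_nestloop = off;               -- (1) re-run the query

  create extension if not exists amcheck;  -- (2) check the index
  select bt_index_check('i_39773');

  reindex index i_39773;                   -- (3) rebuild it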

If it's not this, they'll need to build a reproducer. It's really
difficult to deduce what's going on just from query plans for different
parameter values.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_upgrade --copy-file-range

2024-03-28 Thread Tomas Vondra



On 3/28/24 21:45, Robert Haas wrote:
> On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra
>  wrote:
>> The patch I shared a couple minutes ago should fix this, effectively
>> restoring the original debug behavior. I liked the approach with calling
>> strategy_implementation a bit more, I wonder if it'd be better to go
>> back to that for the "accelerated" copy methods, somehow.
> 
> Somehow I don't see this patch?
> 

It's here:

https://www.postgresql.org/message-id/90866c27-265a-4adb-89d0-18c8dd22bc19%40enterprisedb.com

I did change the subject to reflect that it's no longer about
pg_upgrade, maybe that breaks the threading for you somehow?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-28 Thread Tomas Vondra


On 3/27/24 20:37, Melanie Plageman wrote:
> On Mon, Mar 25, 2024 at 12:07:09PM -0400, Melanie Plageman wrote:
>> On Sun, Mar 24, 2024 at 06:37:20PM -0400, Melanie Plageman wrote:
>>> On Sun, Mar 24, 2024 at 5:59 PM Tomas Vondra
>>>  wrote:
>>>>
>>>> BTW when you say "up to 'Make table_scan_bitmap_next_block() async
>>>> friendly'" do you mean including that patch, or that this is the first
>>>> patch that is not one of the independently useful patches.
>>>
>>> I think the code is easier to understand with "Make
>>> table_scan_bitmap_next_block() async friendly". Prior to that commit,
>>> table_scan_bitmap_next_block() could return false even when the bitmap
>>> has more blocks and expects the caller to handle this and invoke it
>>> again. I think that interface is very confusing. The downside of the
>>> code in that state is that the code for prefetching is still in the
>>> BitmapHeapNext() code and the code for getting the current block is in
>>> the heap AM-specific code. I took a stab at fixing this in v9's 0013,
>>> but the outcome wasn't very attractive.
>>>
>>> What I will do tomorrow is reorder and group the commits such that all
>>> of the commits that are useful independent of streaming read are first
>>> (I think 0014 and 0015 are independently valuable but they are on top
>>> of some things that are only useful to streaming read because they are
>>> more recently requested changes). I think I can actually do a bit of
>>> simplification in terms of how many commits there are and what is in
>>> each. Just to be clear, v9 is still reviewable. I am just going to go
>>> back and change what is included in each commit.
>>
>> So, attached v10 does not include the new version of streaming read API.
>> I focused instead on the refactoring patches commit regrouping I
>> mentioned here.
> 
> Attached v11 has the updated Read Stream API Thomas sent this morning
> [1]. No other changes.
> 

I think there's some sort of bug, triggering this assert in heapam

  Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);

I haven't looked for the root cause, and it's not exactly deterministic,
but try this:

  create table t (a int, b text);

  insert into t select 1 * random(), md5(i::text)
    from generate_series(1,1000) s(i);^C

  create index on t (a);

  explain analyze select * from t where a = 200;
  explain analyze select * from t where a < 200;

and then vary the condition a bit (different values, inequalities,
etc.). For me it hits the assert in a couple tries.
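
For example, variations along these lines (the exact values don't matter
much):

  explain analyze select * from t where a <= 100;
  explain analyze select * from t where a > 500;
  explain analyze select * from t where a between 100 and 300;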


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

0x79ff48a73d37 in epoll_wait () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install 
glibc-2.37-18.fc38.x86_64 libgcc-13.2.1-4.fc38.x86_64 libicu-72.1-2.fc38.x86_64 
libstdc++-13.2.1-4.fc38.x86_64 zlib-1.2.13-3.fc38.x86_64
(gdb) c
Continuing.

Program received signal SIGABRT, Aborted.
0x79ff489ef884 in __pthread_kill_implementation () from /lib64/libc.so.6
(gdb) bt
#0  0x79ff489ef884 in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x79ff4899eafe in raise () from /lib64/libc.so.6
#2  0x79ff4898787f in abort () from /lib64/libc.so.6
#3  0x009ba563 in ExceptionalCondition 
(conditionName=conditionName@entry=0xa209e0 
"BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno", 
fileName=fileName@entry=0xa205b4 "heapam_handler.c", 
lineNumber=lineNumber@entry=2221) at assert.c:66
#4  0x0057759f in heapam_scan_bitmap_next_block (exact_pages=, lossy_pages=, recheck=, scan=0x1e73548) at 
heapam_handler.c:2221
#5  heapam_scan_bitmap_next_tuple (scan=0x1e73548, slot=, 
recheck=, lossy_pages=, exact_pages=) at heapam_handler.c:2359
#6  0x006f58bb in table_scan_bitmap_next_tuple (exact_pages=, lossy_pages=, recheck=, slot=, scan=) at ../../../src/include/access/tableam.h:2022
#7  BitmapHeapNext (node=0x1e71fc8) at nodeBitmapHeapscan.c:202
#8  0x006e46b8 in ExecProcNodeInstr (node=0x1e71fc8) at 
execProcnode.c:480
#9  0x006dd6fa in ExecProcNode (node=0x1e71fc8) at 
../../../src/include/executor/executor.h:274
#10 ExecutePlan (execute_once=, dest=0xb755e0 , 
direction=, numberTuples=0, sendTuples=true, 
operation=CMD_SELECT, use_parallel_mode=, planstate=0x1e71fc8, 
estate=0x1e71d80) at execMain.c:1644
#11 standard_ExecutorRun (queryDesc=0x1e6c500, direction=, 
count=0, execute_once=) at execMain.c:363
#12 0x0067af84 in ExplainOnePlan 
(plannedstmt=plannedstmt@entry=0x1e6c3f0, into=into@entry=0x0, 
es=es@entry=0x1db3138, queryString=queryString@entry=0x1d87f60 "explain analyze 
select * from t wh

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-28 Thread Tomas Vondra
On 3/28/24 06:20, Thomas Munro wrote:
> With the unexplained but apparently somewhat systematic regression
> patterns on certain tests and settings, I wonder if they might be due
> to read_stream.c trying to form larger reads, making it a bit lazier.
> It tries to see what the next block will be before issuing the
> fadvise.  I think that means that with small I/O concurrency settings,
> there might be contrived access patterns where it loses, and needs
> effective_io_concurrency to be set one notch higher to keep up, or
> something like that.

Yes, I think we've speculated this might be the root cause before, but
IIRC we didn't manage to verify it actually is the problem.

FWIW I don't think the tests do anything particularly contrived, even
though the data sets are synthetic.

> One way to test that idea would be to run the
> tests with io_combine_limit = 1 (meaning 1 block).  It issues advise
> eagerly when io_combine_limit is reached, so I suppose it should be
> exactly as eager as master.  The only difference then should be that
> it automatically suppresses sequential fadvise calls.

Sure, I'll give that a try. What are some good values to test? Perhaps
32 and 1, i.e. the default and "no coalescing"?
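
To be specific about what I plan to run for each combination - something
like this, reusing one of the queries from my benchmark script:

  set io_combine_limit = 1;             -- or 32
  set effective_io_concurrency = 0;     -- or 1, 2, 4, ...
  explain (analyze, buffers)
    select * from linear where a between 2013 and 8061;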

If this turns out to be the problem, does that mean we would consider
using a more conservative default value? Is there some "auto tuning" we
could do? For example, could we reduce the combine limit if we start not
finding buffers in memory, or something like that?

I recognize this may not be possible with buffered I/O, due to not
having any insight into the page cache. And maybe it's misguided anyway,
because how would we know whether the right response is to increase or
reduce the combine limit?

Anyway, doesn't the combine limit work against the idea that
effective_io_concurrency is a "prefetch distance"? With eic=32 I'd expect
we issue prefetches 32 pages ahead, i.e. if we prefetch page X, we should
then process 32 pages before we actually need X (and we expect the page
to already be in memory, thanks to the gap). But with the combine limit
set to 32, is this still true?

I've tried going through read_stream_* to determine how this will
behave, but read_stream_look_ahead/read_stream_start_pending_read do
not make this very clear. I'll have to experiment with some tracing.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-28 Thread Tomas Vondra
On 3/28/24 16:00, Alexander Lakhin wrote:
> ...
>
> Using the trick Thomas proposed in [1] (see my modification attached), I
> could reproduce the failure easily on my workstation with no specific
> conditions:
> 2024-03-28 14:05:13.792 UTC client backend[2358012]
> pg_regress/test_setup LOG:  !!!ConditionalLockBufferForCleanup()
> returning false
> 2024-03-28 14:05:13.792 UTC client backend[2358012]
> pg_regress/test_setup CONTEXT:  while scanning block 29 of relation
> "public.tenk2"
> 2024-03-28 14:05:13.792 UTC client backend[2358012]
> pg_regress/test_setup STATEMENT:  VACUUM ANALYZE tenk2;
> ...
>   relname | relpages | reltuples | autovacuum_count | autoanalyze_count
>  -+--+---+--+---
> - tenk2   |  345 | 1 |    0 | 0
> + tenk2   |  345 |  9996 |    0 | 0
>  (1 row)
> 
> So it looks to me like a possible cause of the failure, and I wonder
> whether checks for query plans should be immune to such changes or results
> of VACUUM ANALYZE should be 100% stable?
> 

Yeah. I think it's good to design the data/queries in such a way that
the behavior does not flip due to minor noise like in this case.

But I'm a bit confused - how come the estimates change at all? The
ANALYZE simply fetches 30k rows, and tenk2 only has 10k of them. So we
should have *exact* numbers, and they should be exactly the same for all
the ANALYZE runs. So how come it changes like this?
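
To be clear, the numbers I'm talking about are simply what ANALYZE
stores in pg_class, i.e. roughly this:

  select relname, relpages, reltuples
    from pg_class
   where relname = 'tenk2';

which is what the failing test output above compares (plus the
autovacuum/autoanalyze counters).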

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: speed up a logical replica setup

2024-03-26 Thread Tomas Vondra
On 3/26/24 21:17, Euler Taveira wrote:
> On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
>> Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
>> Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
>> waited long enough?
> 
> It was an attempt to decoupled a connection failure (that keeps streaming the
> WAL) from recovery timeout. The NUM_CONN_ATTEMPTS guarantees that if the 
> primary
> is gone during the standby recovery process, there is a way to bail out. The
> recovery-timeout is 0 (infinite) by default so you have an infinite wait 
> without
> this check. The idea behind this implementation is to avoid exiting in this
> critical code path. If it times out here you might have to rebuild the standby
> and start again.

- This seems like something that should definitely be documented in the
comment before wait_for_end_recovery(). At the moment it only talks
about the timeout, and says nothing about NUM_CONN_ATTEMPTS.

- The NUM_CONN_ATTEMPTS name seems rather misleading, considering it
does not really count connection attempts, but the number of times we
have not seen 1 in pg_catalog.pg_stat_wal_receiver.

- Not sure I follow the logic - it tries to avoid exiting by setting an
infinite timeout, but it still exits based on NUM_CONN_ATTEMPTS. Isn't
that somewhat contradictory?

- Isn't NUM_CONN_ATTEMPTS actually making it more fragile, i.e. more
likely to exit? For example, what if there's a short networking hiccup,
so that the standby can't connect to the primary?

- It seems a bit strange that even with the recovery timeout set, the
limit of 10 "connection attempts" effectively establishes a separate
hard-coded limit of 10 seconds. It's a bit surprising if I set the
recovery limit to 1 minute and it just dies after 10 seconds.

> Amit suggested [1] that we use a value as recovery-timeout but
> how high is a good value? I've already saw some long recovery process using
> pglogical equivalent that timeout out after hundreds of minutes. Maybe I'm too
> worried about a small percentage of cases and we should use 1h as default, for
> example. It would reduce the complexity since the recovery process lacks some
> progress indicators (LSN is not sufficient in this case and there isn't a
> function to provide the current state -- stop applying WAL, reach target, new
> timeline, etc).
> 
> If we remove the pg_stat_wal_receiver check, we should avoid infinite recovery
> by default otherwise we will have some reports saying the tool is hanging when
> in reality the primary has gone and WAL should be streamed.
> 

I don't think there's a default timeout value that would work for
everyone. Either it's going to be too short for some cases, or it'll
take too long for some other cases.

I think there are two obvious default values for the timeout - infinity,
and 60 seconds, which is the default we use for other CLI tools (like
pg_ctl and so on). Considering the negative impact of exiting, I'd say
it's better to default to infinity. It's always possible to Ctrl-C or
terminate the process in some other way, if needed.

As for people complaining about infinite recovery - perhaps it'd be
sufficient to mention this in the messages printed by the tool, to make
it clearer. Or maybe even print something in the loop, because right now
it's entirely silent so it's easy to believe it's stuck. Perhaps not on
every loop, but at least in verbose mode it should print something.

>> IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
>> pg_createsubscriber, and that should do the trick.
> 
> That's a good idea. Tests are not exercising the recovery-timeout option.
> 
>> Increasing PG_TEST_DEFAULT_TIMEOUT is what buildfarm animals doing
>> things like ubsan/valgrind already use to deal with exactly this kind of
>> timeout problem.
>>
>> Or is there a deeper problem with deciding if the system is in recovery?
> 
> As I said with some recovery progress indicators it would be easier to make 
> some
> decisions like wait a few seconds because the WAL has already been applied and
> it is creating a new timeline. The recovery timeout decision is a shot in the
> dark because we might be aborting pg_createsubscriber when the target server 
> is
> about to set RECOVERY_STATE_DONE.
> 

Isn't it enough to check data in pg_stat_replication on the primary?
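
I mean something as simple as this, run on the primary (just a sketch):

  select application_name, state, sent_lsn, replay_lsn
    from pg_stat_replication;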


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: speed up a logical replica setup

2024-03-26 Thread Tomas Vondra
der has not 
>> existed.
>>
>> ```
>> ...
>> pg_createsubscriber: publisher: current wal senders: 0
>> pg_createsubscriber: command is: SELECT 1 FROM 
>> pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
>> pg_createsubscriber: error: could not obtain replication slot information: 
>> got 0 rows, expected 1 row
>> ...
>> ```
>>
>> Currently standby must be stopped before the command and current code does 
>> not
>> block the flow to ensure the replication is started. So there is a 
>> possibility
>> that the checking is run before walsender is launched.
>>
>> One possible approach is to wait until the replication starts. Alternative 
>> one is
>> to ease the condition.
> 
> That's my suggestion too. I reused NUM_CONN_ATTEMPTS (that was renamed to
> NUM_ATTEMPTS in the first patch). See second patch.
> 

Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
waited long enough?

IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
pg_createsubscriber, and that should do the trick.

Increasing PG_TEST_DEFAULT_TIMEOUT is what buildfarm animals doing
things like ubsan/valgrind already use to deal with exactly this kind of
timeout problem.

Or is there a deeper problem with deciding if the system is in recovery?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_upgrade --copy-file-range

2024-03-26 Thread Tomas Vondra
On 3/25/24 15:31, Robert Haas wrote:
> On Sat, Mar 23, 2024 at 9:37 AM Tomas Vondra
>  wrote:
>> OK, that makes sense. Here's a patch that should work like this - in
>> copy_file we check if we need to calculate checksums, and either use the
>> requested copy method, or fall back to the block-by-block copy.
> 
> +Use efficient file cloning (also known as reflinks on
> +some systems) instead of copying files to the new cluster.  This can
> 
> new cluster -> output directory
> 

Ooops, forgot to fix this. Will do in next version.

> I think your version kind of messes up the debug logging. In my
> version, every call to copy_file() would emit either "would copy
> \"%s\" to \"%s\" using strategy %s" and "copying \"%s\" to \"%s\"
> using strategy %s". In your version, the dry_run mode emits a string
> similar to the former, but creates separate translatable strings for
> each copy method instead of using the same one with a different value
> of %s. In non-dry-run mode, I think your version loses the debug
> logging altogether.
> 

Yeah. Sorry for not being careful enough about that, I was focusing on
the actual copy logic and forgot about this.

The patch I shared a couple minutes ago should fix this, effectively
restoring the original debug behavior. I liked the approach with calling
strategy_implementation a bit more, I wonder if it'd be better to go
back to that for the "accelerated" copy methods, somehow.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_combinebackup --copy-file-range

2024-03-26 Thread Tomas Vondra


On 3/26/24 15:09, Jakub Wartak wrote:
> On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra
>  wrote:
> 
>> On 3/23/24 14:47, Tomas Vondra wrote:
>>> On 3/23/24 13:38, Robert Haas wrote:
>>>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  
>>>> wrote:
> [..]
>>> Yeah, that's in write_reconstructed_file() and the patch does not touch
>>> that at all. I agree it would be nice to use copy_file_range() in this
>>> part too, and it doesn't seem it'd be that hard to do, I think.
>>>
>>> It seems we'd just need a "fork" that either calls pread/pwrite or
>>> copy_file_range, depending on checksums and what was requested.
>>>
>>
>> Here's a patch to use copy_file_range in write_reconstructed_file too,
>> when requested/possible. One thing that I'm not sure about is whether to
>> do pg_fatal() if --copy-file-range but the platform does not support it.
> [..]
> 
> Hi Tomas, so I gave a go to the below patches today:
> - v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
> - v20240323-2-0002-write_reconstructed_file.patch
> 
> My assessment:
> 
> v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch -
> looks like more or less good to go

There are some issues with --dry-run, pointed out by Robert. They should
be fixed in the attached version.

> v20240323-2-0002-write_reconstructed_file.patch - needs work and
> without that clone/copy_file_range() good effects are unlikely
> 
> Given Debian 12, ~100GB DB, (pgbench -i -s 7000 , and some additional
> tables with GiST and GIN indexes , just to see more WAL record types)
> and with backups sizes in MB like that:
> 
> 106831  full
> 2823incr.1 # captured after some time with pgbench -R 100
> 165 incr.2 # captured after some time with pgbench -R 100
> 
> Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee
> /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o
> outtest full incr.1 incr.2
> 
> Test results of various copies on small I/O constrained XFS device:
> normal copy: 31m47.407s
> --clone copy: error: file cloning not supported on this platform (it's
> due #ifdef of having COPY_FILE_RANGE available)
> --copy-file-range: aborted, as it was taking too long , I was
> expecting it to accelerate, but it did not... obviously this is the
> transparent failover in case of calculating checksums...
> --manifest-checksums=NONE --copy-file-range: BUG, it keep on appending
> to just one file e.g. outtest/base/5/16427.29 with 200GB+ ?? and ended
> up with ENOSPC [more on this later]

That's really strange.

> --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 
> 27m23.887s
> --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> loop-fix: 5m1.986s but it creates corruption as it stands
> 

Thanks. I plan to do more similar tests, once my machines get done with
some other stuff.

> Issues:
> 
> 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> compains about win32/mingw:
> 
> [15:47:27.184] In file included from copy_file.c:22:
> [15:47:27.184] copy_file.c: In function ‘copy_file’:
> [15:47:27.184] ../../../src/include/common/logging.h:134:6: error:
> this statement may fall through [-Werror=implicit-fallthrough=]
> [15:47:27.184]   134 |   if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
> [15:47:27.184]   |  ^
> [15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
> [15:47:27.184]96 | pg_log_debug("would copy \"%s\" to \"%s\"
> (copy_file_range)",
> [15:47:27.184]   | ^~~~
> [15:47:27.184] copy_file.c:99:4: note: here
> [15:47:27.184]99 |case COPY_MODE_COPYFILE:
> [15:47:27.184]   |^~~~
> [15:47:27.184] cc1: all warnings being treated as errors
> 

Yup, missing break.

> 2. I do not know what's the consensus between --clone and
> --copy-file-range , but if we have #ifdef FICLONE clone_works() #elif
> HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> apply the same logic to the --help so that --clone is not visible
> there (for consistency?). Also the "error: file cloning not supported
> on this platform " is technically incorrect, Linux does support
> ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
> over another (so technically it is "available"). Nitpicking I know.
> 

That's a good question, I'm not sure. But whatever we do, we should do
the same thing in pg_upgrade. Maybe there's some sort of precedent?

> 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> ENOSPACE spiral-of-death-bug symptoms are like that:
> 
> 

Re: Add bump memory context type and use it for tuplesorts

2024-03-25 Thread Tomas Vondra
On 3/25/24 12:41, David Rowley wrote:
> On Tue, 5 Mar 2024 at 15:42, David Rowley  wrote:
>> The query I ran was:
>>
>> select chksz,mtype,pg_allocate_memory_test_reset(chksz,
>> 1024*1024,1024*1024*1024, mtype) from (values(8),(16),(32),(64))
>> sizes(chksz),(values('aset'),('generation'),('slab'),('bump'))
>> cxt(mtype) order by mtype,chksz;
> 
> Andres and I were discussing this patch offlist in the context of
> "should we have bump".  Andres wonders if it would be better to have a
> function such as palloc_nofree() (we didn't actually discuss the
> name), which for aset, would forego rounding up to the next power of 2
> and not bother checking the freelist and only have a chunk header for
> MEMORY_CONTEXT_CHECKING builds. For non-MEMORY_CONTEXT_CHECKING
> builds, the chunk header could be set to some other context type such
> as one of the unused ones or perhaps a dedicated new one that does
> something along the lines of BogusFree() which raises an ERROR if
> anything tries to pfree or repalloc it.
> 
> An advantage of having this instead of bump would be that it could be
> used for things such as the parser, where we make a possibly large
> series of small allocations and never free them again.
> 

I may be missing something, but I don't quite understand how this would
be simpler to use in places like the parser. Wouldn't it require all
those places to start explicitly calling palloc_nofree()? How is that
better than having a specialized memory context?

> Andres ask me to run some benchmarks to mock up AllocSetAlloc() to
> have it not check the freelist to see how the performance of it
> compares to BumpAlloc().  I did this in the attached 2 patches.  The
> 0001 patch just #ifdefs that part of AllocSetAlloc out, however
> properly implementing this is more complex as aset.c currently stores
> the freelist index in the MemoryChunk rather than the chunk_size.  I
> did this because it saved having to call AllocSetFreeIndex() in
> AllocSetFree() which made a meaningful performance improvement in
> pfree().  The 0002 patch effectively reverses that change out so that
> the chunk_size is stored.  Again, these patches are only intended to
> demonstrate the performance and check how it compares to bump.
> 
> I'm yet uncertain why, but I find that the first time I run the query
> quoted above, the aset results are quite a bit slower than on
> subsequent runs.  Other context types don't seem to suffer from this.
> The previous results I sent in [1] were of the initial run after
> starting up the database.
> 
> The attached graph shows the number of seconds it takes to allocate a
> total of 1GBs of memory in various chunk sizes, resetting the context
> after 1MBs has been allocated, so as to keep the test sized so it fits
> in CPU caches.
> 

Yeah, strange and interesting. My guess is it's some sort of caching
effect, where the first run has to initialize stuff that's not in any of
the CPU caches yet, likely something specific to AllocSet (freelist?).
Or maybe memory prefetching does not work that well for AllocSet?

I'd try perf stat, that might tell us more ... but who knows.

Alternatively, it might be some interaction with the glibc allocator.
Have you tried using jemalloc via LD_PRELOAD, or tweaking the glibc
malloc parameters via environment variables? If you feel adventurous, you
might even try the memory pool stuff, although I'm not sure that can
help with the first run.

> I'm not drawing any particular conclusion from the results aside from
> it's not quite as fast as bump.   I also have some reservations about
> how easy it would be to actually use something like palloc_nofree().
> For example heap_form_minimal_tuple() does palloc0().  What if I
> wanted to call ExecCopySlotMinimalTuple() and use palloc0_nofree().
> Would we need new versions of various functions to give us control
> over this?
> 

That's kinda the problem that I mentioned above - is this really any
simpler/better than just having a separate memory context type? I don't
see what benefits this is supposed to have.

IMHO the idea of having a general-purpose memory context and then also
specialized memory contexts for particular allocation patterns is great,
and we should embrace it. Adding more and more special cases into
AllocSet seems to go directly against that idea, makes the code more
complex, and I don't quite see how that is better or easier to use than
having a separate BumpContext ...

Having an AllocSet that mixes chunks that may be freed with chunks that
can't be freed, and has a different context type in the chunk header,
seems somewhat confusing and "not great" for debugging, for example.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-24 Thread Tomas Vondra
On 3/24/24 21:12, Melanie Plageman wrote:
> On Sun, Mar 24, 2024 at 2:22 PM Tomas Vondra
>  wrote:
>>
>> On 3/24/24 18:38, Melanie Plageman wrote:
>>> I haven't had a chance yet to reproduce the regressions you saw in the
>>> streaming read user patch or to look closely at the performance results.
>>
>> So you tried to reproduce it and didn't hit the issue? Or didn't have
>> time to look into that yet? FWIW with v7 it failed almost immediately
>> (only a couple queries until hitting one triggering the issue), but v9
>> that's not the case (hundreds of queries without an error).
> 
> I haven't started trying to reproduce it yet.
> 
>> I however wonder what the plan with these patches is - do we still plan
>> to get some of this into v17? It seems to me we're getting uncomfortably
>> close to the end of the cycle, with a fairly incomplete idea of how it
>> affects performance.
>>
>> Which is why I've been focusing more on the refactoring patches (up to
>> 0015), to make sure those don't cause regressions if committed. And I
>> think that's generally true.
> 
> Thank you for testing the refactoring patches with this in mind! Out
> of the refactoring patches, I think there is a subset of them that
> have independent value without the streaming read user. I think it is
> worth committing the first few patches because they remove a table AM
> layering violation. IMHO, all of the patches up to "Make
> table_scan_bitmap_next_block() async friendly" make the code nicer and
> better. And, if folks like the patch "Remove
> table_scan_bitmap_next_block()", then I think I could rebase that back
> in on top of "Make table_scan_bitmap_next_block() async friendly".
> This would mean table AMs would only have to implement one callback
> (table_scan_bitmap_next_tuple()) which I also think is a net
> improvement and simplification.
> 
> The other refactoring patches may not be interesting without the
> streaming read user.
> 

I admit I haven't reviewed the individual patches very closely yet, but
this matches how I understood them - that at least some of them are
likely an improvement on their own, not just a refactoring preparing for
the switch to streaming reads.

We only have ~2 weeks left, so it's probably time to focus on getting at
least those improvements committed. I see Heikki was paying way more
attention to the patches than me, though ...

BTW when you say "up to 'Make table_scan_bitmap_next_block() async
friendly'", do you mean including that patch, or that this is the first
patch that is not one of the independently useful ones?

(I took a quick look at the first couple patches and I appreciate that
you keep separate patches with small cosmetic changes to keep the actual
patch smaller and easier to understand.)

>> But for the main StreamingRead API the situation is very different.
> 
> My intent for the bitmapheapscan streaming read user was to get it
> into 17, but I'm not sure that looks likely. The main issues Thomas is
> looking into right now are related to regressions for a fully cached
> scan (noticeable with the pg_prewarm streaming read user). With all of
> these fixed, I anticipate we will still see enough behavioral
> differences with the bitmapheap scan streaming read user that it may
> not be committable in time. Though, I have yet to work on reproducing
> the regressions with the BHS streaming read user mostly because I was
> focused on getting the refactoring ready and not as much because the
> streaming read API is unstable.
> 

I don't have a very good intuition regarding the impact of the streaming
API patch on performance. I haven't been following that thread very
closely, but AFAICS there wasn't much discussion about that - perhaps it
happened off-list, not sure. So who knows, really?

Which is why I started looking at this patch instead - it seemed easier
to benchmark with a somewhat realistic workload.

But yeah, there certainly were significant behavior changes, and it's
unlikely that whatever Thomas did in v8 made them go away.

FWIW I certainly am *not* suggesting there must be no behavior changes,
that's simply not possible. I'm not even suggesting no queries may get
slower - given the dependence on storage, I think some regressions are
pretty much inevitable. But it'd still be good to know the regressions
are reasonably rare exceptions rather than the common case, and that's
not what I'm seeing ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-24 Thread Tomas Vondra



On 3/24/24 18:38, Melanie Plageman wrote:
> On Sun, Mar 24, 2024 at 01:36:19PM +0100, Tomas Vondra wrote:
>>
>>
>> On 3/23/24 01:26, Melanie Plageman wrote:
>>> On Fri, Mar 22, 2024 at 08:22:11PM -0400, Melanie Plageman wrote:
>>>> On Tue, Mar 19, 2024 at 02:33:35PM +0200, Heikki Linnakangas wrote:
>>>>> On 18/03/2024 17:19, Melanie Plageman wrote:
>>>>>> I've attached v7 rebased over this commit.
>>>>>
>>>>> If we delayed table_beginscan_bm() call further, after starting the TBM
>>>>> iterator, we could skip it altogether when the iterator is empty.
>>>>>
>>>>> That's a further improvement, doesn't need to be part of this patch set.
>>>>> Just caught my eye while reading this.
>>>>
>>>> Hmm. You mean like until after the first call to tbm_[shared]_iterate()?
>>>> AFAICT, tbm_begin_iterate() doesn't tell us anything about whether or
>>>> not the iterator is "empty". Do you mean cases when the bitmap has no
>>>> blocks in it? It seems like we should be able to tell that from the
>>>> TIDBitmap.
>>>>
>>>>>
>>>>>> v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch
>>>>>
>>>>> I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call 
>>>>> the
>>>>> flag e.g. SO_NEED_TUPLE.
>>>>
>>>> Agreed. Done in attached v8. Though I wondered if it was a bit weird
>>>> that the flag is set in the common case and not set in the uncommon
>>>> case...
>>>
>>> v8 actually attached this time
>>
>> I tried to run the benchmarks with v8, but unfortunately it crashes for
>> me very quickly (I've only seen 0015 to crash, so I guess the bug is in
>> that patch).
>>
>> The backtrace attached, this doesn't seem right:
>>
>> (gdb) p hscan->rs_cindex
>> $1 = 543516018
> 
> Thanks for reporting this! I hadn't seen it crash on my machine, so I
> didn't realize that I was no longer initializing rs_cindex and
> rs_ntuples on the first call to heapam_bitmap_next_tuple() (since
> heapam_bitmap_next_block() wasn't being called first). I've done this in
> attached v9.
> 

OK, I've restarted the tests with v9.

> I haven't had a chance yet to reproduce the regressions you saw in the
> streaming read user patch or to look closely at the performance results.

So you tried to reproduce it and didn't hit the issue? Or didn't have
time to look into that yet? FWIW with v7 it failed almost immediately
(only a couple queries until hitting one triggering the issue), but v9
that's not the case (hundreds of queries without an error).

> I don't anticipate the streaming read user will have any performance
> differences in this v9 from v6, since I haven't yet rebased in Thomas'
> latest streaming read API changes nor addressed any other potential
> regression sources.
> 

OK, understood. It'll be interesting to see the behavior with the new
version of Thomas' patch.

I however wonder what the plan with these patches is - do we still plan
to get some of this into v17? It seems to me we're getting uncomfortably
close to the end of the cycle, with a fairly incomplete idea of how it
affects performance.

Which is why I've been focusing more on the refactoring patches (up to
0015), to make sure those don't cause regressions if committed. And I
think that's generally true.

But for the main StreamingRead API the situation is very different.

> I tried rebasing in Thomas' latest version today and something is
> causing a crash that I have yet to figure out. v10 of this patchset will
> have his latest version once I get that fixed. I wanted to share this
> version with what I think is a bug fix for the crash you saw first.
> 

Understood. I'll let the tests with v9 run for now.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-24 Thread Tomas Vondra


On 3/23/24 01:26, Melanie Plageman wrote:
> On Fri, Mar 22, 2024 at 08:22:11PM -0400, Melanie Plageman wrote:
>> On Tue, Mar 19, 2024 at 02:33:35PM +0200, Heikki Linnakangas wrote:
>>> On 18/03/2024 17:19, Melanie Plageman wrote:
>>>> I've attached v7 rebased over this commit.
>>>
>>> If we delayed table_beginscan_bm() call further, after starting the TBM
>>> iterator, we could skip it altogether when the iterator is empty.
>>>
>>> That's a further improvement, doesn't need to be part of this patch set.
>>> Just caught my eye while reading this.
>>
>> Hmm. You mean like until after the first call to tbm_[shared]_iterate()?
>> AFAICT, tbm_begin_iterate() doesn't tell us anything about whether or
>> not the iterator is "empty". Do you mean cases when the bitmap has no
>> blocks in it? It seems like we should be able to tell that from the
>> TIDBitmap.
>>
>>>
>>>> v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch
>>>
>>> I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call the
>>> flag e.g. SO_NEED_TUPLE.
>>
>> Agreed. Done in attached v8. Though I wondered if it was a bit weird
>> that the flag is set in the common case and not set in the uncommon
>> case...
> 
> v8 actually attached this time

I tried to run the benchmarks with v8, but unfortunately it crashes for
me very quickly (I've only seen 0015 crash, so I guess the bug is in
that patch).

The backtrace attached, this doesn't seem right:

(gdb) p hscan->rs_cindex
$1 = 543516018


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Core was generated by `postgres: postgres test-100 [local] SELECT
  '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x556f99ca7cd0 in heapam_scan_bitmap_next_tuple (scan=0x556f9b9e6960, 
slot=0x556f9b9ef440, recheck=0x556f9b9eec10, lossy_pages=0x556f9b9eebf8, 
exact_pages=0x556f9b9eebf0)
at heapam_handler.c:2576

warning: Source file is more recent than executable.
2576targoffset = hscan->rs_vistuples[hscan->rs_cindex];
(gdb) bt
#0  0x556f99ca7cd0 in heapam_scan_bitmap_next_tuple (scan=0x556f9b9e6960, 
slot=0x556f9b9ef440, recheck=0x556f9b9eec10, lossy_pages=0x556f9b9eebf8, 
exact_pages=0x556f9b9eebf0)
at heapam_handler.c:2576
#1  0x556f99e17adb in table_scan_bitmap_next_tuple 
(exact_pages=0x556f9b9eebf0, lossy_pages=0x556f9b9eebf8, 
recheck=0x556f9b9eec10, slot=0x556f9b9ef440, scan=0x556f9b9e6960)
at ../../../src/include/access/tableam.h:2003
#2  BitmapHeapNext (node=0x556f9b9eeb00) at nodeBitmapHeapscan.c:227
#3  0x556f99e1a331 in ExecProcNode (node=0x556f9b9eeb00) at 
../../../src/include/executor/executor.h:274
#4  gather_getnext (gatherstate=0x556f9b9ee970) at nodeGather.c:287
#5  ExecGather (pstate=0x556f9b9ee970) at nodeGather.c:222
#6  0x556f99e24ac3 in ExecProcNode (node=0x556f9b9ee970) at 
../../../src/include/executor/executor.h:274
#7  ExecLimit (pstate=0x556f9b9ee698) at nodeLimit.c:95
#8  0x556f99e0174a in ExecProcNode (node=0x556f9b9ee698) at 
../../../src/include/executor/executor.h:274
#9  ExecutePlan (execute_once=, dest=0x556f9b9f2360, 
direction=, numberTuples=0, sendTuples=true, 
operation=CMD_SELECT, use_parallel_mode=, 
planstate=0x556f9b9ee698, estate=0x556f9b9ee458) at execMain.c:1644
#10 standard_ExecutorRun (queryDesc=0x556f9b944f88, direction=, 
count=0, execute_once=) at execMain.c:363
#11 0x556f99f9cc7f in PortalRunSelect (portal=portal@entry=0x556f9b997008, 
forward=forward@entry=true, count=0, count@entry=9223372036854775807, 
dest=dest@entry=0x556f9b9f2360)
at pquery.c:924
#12 0x556f99f9dffa in PortalRun (portal=portal@entry=0x556f9b997008, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, 
run_once=run_once@entry=true, 
dest=dest@entry=0x556f9b9f2360, altdest=altdest@entry=0x556f9b9f2360, 
qc=0x7ffc123a0b60) at pquery.c:768
#13 0x556f99f9a57c in exec_simple_query (query_string=0x556f9b91ab18 
"SELECT * FROM linear WHERE (a BETWEEN 2013 AND 8061) OFFSET 100;") at 
postgres.c:1274
#14 0x556f99f9c051 in PostgresMain (dbname=, 
username=) at postgres.c:4680
#15 0x556f99f96def in BackendMain (startup_data=, 
startup_data_len=) at backend_startup.c:101
#16 0x556f99f0c564 in postmaster_child_launch 
(child_type=child_type@entry=B_BACKEND, 
startup_data=startup_data@entry=0x7ffc123a0f90 "", 
startup_data_len=startup_data_len@entry=4, 
client_sock=client_sock@entry=0x7ffc123a0fb0) at launch_backend.c:265
#17 0x556f99f0fea9 in BackendStartup (client_sock=0x7ffc123a0fb0) at 
postmaster.c:3593
#18 ServerLoop () at postmaster.c:1674
#19 0x556f99f11b50 in PostmasterMain (argc=argc@entry=3, 
argv=argv@e

Re: pg_upgrade --copy-file-range

2024-03-23 Thread Tomas Vondra


On 3/23/24 14:47, Tomas Vondra wrote:
> On 3/23/24 13:38, Robert Haas wrote:
>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  wrote:
>>> Hmm, this discussion seems to assume that we only use
>>> copy_file_range() to copy/clone whole segment files, right?  That's
>>> great and may even get most of the available benefit given typical
>>> databases with many segments of old data that never changes, but... I
>>> think copy_write_range() allows us to go further than the other
>>> whole-file clone techniques: we can stitch together parts of an old
>>> backup segment file and an incremental backup to create a new file.
>>> If you're interested in minimising disk use while also removing
>>> dependencies on the preceding chain of backups, then it might make
>>> sense to do that even if you *also* have to read the data to compute
>>> the checksums, I think?  That's why I mentioned it: if
>>> copy_file_range() (ie sub-file-level block sharing) is a solution in
>>> search of a problem, has the world ever seen a better problem than
>>> pg_combinebackup?
>>
>> That makes sense; it's just a different part of the code than I
>> thought we were talking about.
>>
> 
> Yeah, that's in write_reconstructed_file() and the patch does not touch
> that at all. I agree it would be nice to use copy_file_range() in this
> part too, and it doesn't seem it'd be that hard to do, I think.
> 
> It seems we'd just need a "fork" that either calls pread/pwrite or
> copy_file_range, depending on checksums and what was requested.
> 

Here's a patch to use copy_file_range in write_reconstructed_file too,
when requested/possible. One thing that I'm not sure about is whether to
do pg_fatal() if --copy-file-range but the platform does not support it.
This is more like what pg_upgrade does, but maybe we should just ignore
what the user requested and fall back to the regular copy (a bit like
when having to do a checksum for some files). Or maybe the check should
just happen earlier ...

I've been thinking about what Thomas wrote - that maybe it'd be good to
do copy_file_range() even when calculating the checksum, and I think he
may be right. But the current patch does not do that, and while it
doesn't seem very difficult to do (at least when reconstructing the file
from incremental backups) I don't have a very good intuition whether
it'd be a win or not in typical cases.

I have a naive question about the checksumming - if we used a
merkle-tree-like scheme, i.e. hashing the blocks and then hashing the
block hashes (rather than hashing the raw file), wouldn't that allow
calculating the file hashes even without having to read the blocks,
making copy_file_range more efficient? Sure, it's more complex, but it's
a well-known scheme. (OK, now I realized it'd mean we can't use tools
like sha224sum to hash the files and compare to the manifest. I guess
that's why we don't do this ...)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom d2b717d14638632c25d0e6919f5cd40333e9ebd0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sat, 23 Mar 2024 18:26:21 +0100
Subject: [PATCH v20240323-2 2/2] write_reconstructed_file

---
 src/bin/pg_combinebackup/reconstruct.c | 32 +++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index f5c7af8a23c..4de92894bed 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -59,7 +59,8 @@ static void write_reconstructed_file(char *input_filename,
 	 off_t *offsetmap,
 	 pg_checksum_context *checksum_ctx,
 	 bool debug,
-	 bool dry_run);
+	 bool dry_run,
+	 CopyMode copy_mode);
 static void read_bytes(rfile *rf, void *buffer, unsigned length);
 
 /*
@@ -325,7 +326,8 @@ reconstruct_from_incremental_file(char *input_filename,
 	{
 		write_reconstructed_file(input_filename, output_filename,
  block_length, sourcemap, offsetmap,
- &checksum_ctx, debug, dry_run);
+ &checksum_ctx, debug, dry_run,
+ copy_method);
 		debug_reconstruction(n_prior_backups + 1, source, dry_run);
 	}
 
@@ -528,7 +530,8 @@ write_reconstructed_file(char *input_filename,
 		 off_t *offsetmap,
 		 pg_checksum_context *checksum_ctx,
 		 bool debug,
-		 bool dry_run)
+		 bool dry_run,
+		 CopyMode copy_mode)
 {
 	int			wfd = -1;
 	unsigned	i;
@@ -630,6 +633,29 @@ write_reconstructed_file(char *input_filename,
 		if (dry_run)
 			continue;
 
+		/*
+		 * If requested, copy the block using copy_file_range.
+		 *
+		 * We can't do this if the block needs to be zero-filled or when we
+		 * need to update checksum.
+		 */
+		if ((copy_mode == COPY_MODE_COPY_FILE_RANGE) &&
+			(s != NULL) && (checksum_ctx->type == CHECKS

Re: pg_upgrade --copy-file-range

2024-03-23 Thread Tomas Vondra
On 3/23/24 13:38, Robert Haas wrote:
> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  wrote:
>> Hmm, this discussion seems to assume that we only use
>> copy_file_range() to copy/clone whole segment files, right?  That's
>> great and may even get most of the available benefit given typical
>> databases with many segments of old data that never changes, but... I
>> think copy_write_range() allows us to go further than the other
>> whole-file clone techniques: we can stitch together parts of an old
>> backup segment file and an incremental backup to create a new file.
>> If you're interested in minimising disk use while also removing
>> dependencies on the preceding chain of backups, then it might make
>> sense to do that even if you *also* have to read the data to compute
>> the checksums, I think?  That's why I mentioned it: if
>> copy_file_range() (ie sub-file-level block sharing) is a solution in
>> search of a problem, has the world ever seen a better problem than
>> pg_combinebackup?
> 
> That makes sense; it's just a different part of the code than I
> thought we were talking about.
> 

Yeah, that's in write_reconstructed_file() and the patch does not touch
that at all. I agree it would be nice to use copy_file_range() in this
part too, and it doesn't seem it'd be that hard to do, I think.

It seems we'd just need a "fork" that either calls pread/pwrite or
copy_file_range, depending on checksums and what was requested.

BTW is there a reason why the code calls "write" and not "pg_pwrite"?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



