Re: index prefetching

2024-01-16 Thread Jim Nasby

On 1/16/24 2:10 PM, Konstantin Knizhnik wrote:
Amazon RDS is just vanilla Postgres with file system mounted on EBS 
(Amazon  distributed file system).

EBS provides good throughput but larger latencies comparing with local SSDs.
I am not sure if read-ahead works for EBS.


Actually, EBS only provides a block device - it's definitely not a 
filesystem itself (*EFS* is a filesystem - but it's also significantly 
different from EBS). So as long as readahead is happening somewhere 
above the block device I would expect it to JustWork on EBS.


Of course, Aurora Postgres (like Neon) is completely different. If you 
look at page 53 of [1] you'll note that there are two different terms 
used: prefetch and batch. I'm not sure how much practical difference 
there is, but batched IO (one IO request to Aurora Storage for many 
blocks) predates index prefetch; VACUUM in APG has used batched IO for a 
very long time (it also *only* reads blocks that aren't marked all 
visible/frozen; none of the "only skip if skipping at least 32 blocks" 
logic is used).


1: 
https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Deep_dive_on_Amazon_Aurora_with_PostgreSQL_compatibility_DAT328-R1.pdf

--
Jim Nasby, Data Architect, Austin TX





Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-16 Thread Jim Nasby

On 1/12/24 12:45 PM, Robert Haas wrote:

P.P.S. to everyone: Yikes, this logic is really confusing.


Having studied all this code several years ago, when it was even simpler, 
I can say it was *still* very hard to grok even back then. I *greatly 
appreciate* the effort that y'all are putting into increasing the 
clarity here.


BTW, back in the day the whole "no indexes" optimization was a really 
tiny amount of code... I think it amounted to 2 or 3 if statements. I 
haven't yet attempted to grok this patchset, but I'm definitely 
wondering how much it's worth continuing to optimize that case. Clearly 
it'd be very expensive to record dead tuples just to trawl that list a 
single time to clean the heap, but outside of that I'm not sure other 
optimizations are worth it given the amount of code 
complexity/duplication they seem to require - especially for code where 
correctness is so crucial.

--
Jim Nasby, Data Architect, Austin TX





Re: Make NUM_XLOGINSERT_LOCKS configurable

2024-01-12 Thread Jim Nasby

On 1/12/24 12:32 AM, Bharath Rupireddy wrote:

Test case:
./pgbench --initialize --scale=100 --username=ubuntu postgres
./pgbench --progress=10 --client=64 --time=300 --builtin=tpcb-like
--username=ubuntu postgres

Setup:
./configure --prefix=$PWD/inst/  CFLAGS="-ggdb3 -O3" > install.log &&
make -j 8 install > install.log 2>&1 &

shared_buffers = '8GB'
max_wal_size = '32GB'
track_wal_io_timing = on

Stats measured:
I've used the attached patch to measure WAL Insert Lock Acquire Time
(wal_insert_lock_acquire_time)  and WAL Wait for In-progress Inserts
to Finish Time (wal_wait_for_insert_to_finish_time).


Unfortunately this leaves the question of how frequently 
WaitXLogInsertionsToFinish() is being called, and by whom. One possibility 
here is that wal_buffers is too small, so backends are constantly having 
to write WAL data to free up buffers.
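
For example, pg_stat_wal (PG 14+) tracks how often WAL data had to be 
written out because the WAL buffers were full, which is a quick way to 
sanity-check that theory:

SELECT wal_buffers_full, wal_write, wal_sync
FROM pg_stat_wal;

If wal_buffers_full is climbing rapidly during the benchmark, bumping 
wal_buffers would be the first thing to try.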

--
Jim Nasby, Data Architect, Austin TX





Re: Confine vacuum skip logic to lazy_scan_skip

2024-01-12 Thread Jim Nasby

On 1/11/24 5:50 PM, Melanie Plageman wrote:

On Fri, Jan 5, 2024 at 5:51 AM Nazir Bilal Yavuz  wrote:


On Fri, 5 Jan 2024 at 02:25, Jim Nasby  wrote:


On 1/4/24 2:23 PM, Andres Freund wrote:

On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote:

Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages
Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var
  nskippable_blocks

I think these may lead to worse code - the compiler has to reload
vacrel->rel_pages/next_unskippable_block for every loop iteration, because it
can't guarantee that they're not changed within one of the external functions
called in the loop body.

Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder if the concept of 
skipping should go away in the context of vector IO? Instead of thinking about "we can skip 
this range of blocks", why not maintain a list of "here's the next X number of blocks 
that we need to vacuum"?


Sorry if I misunderstood. AFAIU, with the help of the vectored IO;
"the next X number of blocks that need to be vacuumed" will be
prefetched by calculating the unskippable blocks ( using the
lazy_scan_skip() function ) and the X will be determined by Postgres
itself. Do you have something different in your mind?


I think you are both right. As we gain more control of readahead from
within Postgres, we will likely want to revisit this heuristic as it
may not serve us anymore. But the streaming read interface/vectored
I/O is also not a drop-in replacement for it. To change anything and
ensure there is no regression, we will probably have to do
cross-platform benchmarking, though.

That being said, I would absolutely love to get rid of the skippable
ranges because I find them very error-prone and confusing. Hopefully
now that the skipping logic is isolated to a single function, it will
be easier not to trip over it when working on lazy_scan_heap().


Yeah, arguably it's just a matter of semantics, but IMO it's a lot 
clearer to simply think in terms of "here's the next blocks we know we 
want to vacuum" instead of "we vacuum everything, but sometimes we skip 
some blocks".

--
Jim Nasby, Data Architect, Austin TX





Re: Add BF member koel-like indentation checks to SanityCheck CI

2024-01-09 Thread Jim Nasby

On 1/9/24 3:20 PM, Tom Lane wrote:

In short, I don't think that putting this into CI is the answer.
Putting it into committers' standard workflow is a better idea,
if we can get all the committers on board with that.


FWIW, that's the approach that Go takes - not only for committing to Go 
itself, but it is *strongly* recommended[1] that anyone writing any code 
in Go makes running `go fmt` a standard part of their workflow. In my 
experience, it makes collaborating noticeably easier because you never 
need to worry about formatting differences. FYI, vim makes this easy via 
vim-autoformat[2] (which also supports line-by-line formatting if the 
format tool allows it); presumably any modern editor has similar support.


1: Literally 3rd item at https://go.dev/doc/effective_go
2: https://github.com/vim-autoformat/vim-autoformat
--
Jim Nasby, Data Architect, Austin TX





Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-09 Thread Jim Nasby

On 1/8/24 2:10 PM, Robert Haas wrote:

On Fri, Jan 5, 2024 at 3:57 PM Andres Freund  wrote:

I will be astonished if you can make this work well enough to avoid
huge regressions in plausible cases. There are plenty of cases where
we do a very thorough job opportunistically removing index tuples.


These days the AM is often involved with that, via
table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to
happen before physically removing the already-marked-killed index entries. We
can't rely on being able to actually prune the heap page at that point, there
might be other backends pinning it, but often we will be able to. If we were
to prune below heap_index_delete_tuples(), we wouldn't need to recheck that
index again during "individual tuple pruning", if the to-be-marked-unused heap
tuple is one of the tuples passed to heap_index_delete_tuples(). Which
presumably will be very commonly the case.

At least for nbtree, we are much more aggressive about marking index entries
as killed, than about actually removing the index entries. "individual tuple
pruning" would have to look for killed-but-still-present index entries, not
just for "live" entries.


I don't want to derail this thread, but I don't really see what you
have in mind here. The first paragraph sounds like you're imagining
that while pruning the index entries we might jump over to the heap
and clean things up there, too, but that seems like it wouldn't work
if the table has more than one index. I thought you were talking about
starting with a heap tuple and bouncing around to every index to see
if we can find index pointers to kill in every one of them. That
*could* work out, but you only need one index to have been
opportunistically cleaned up in order for it to fail to work out.
There might well be some workloads where that's often the case, but
the regressions in the workloads where it isn't the case seem like
they would be rather substantial, because doing an extra lookup in
every index for each heap tuple visited sounds pricey.


The idea of probing indexes for tuples that are now dead has come up in 
the past, and the concern has always been whether it's actually safe to 
do so. An obvious example is an index on a function where the function 
has since changed, so you can't reliably determine whether a particular 
tuple is present in the index. That's bad enough during an index scan, but 
potentially worse while doing heap cleanup. Given that operators are 
functions, this risk exists to some degree in even simple indexes.
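
To illustrate the expression-index hazard with a contrived sketch (the 
table and function names are made up):

CREATE FUNCTION normalize_sku(text) RETURNS text
    AS $$ SELECT upper($1) $$ LANGUAGE sql IMMUTABLE;
CREATE INDEX ON items (normalize_sku(sku));

-- Later someone "fixes" the function; it's still declared IMMUTABLE but now
-- produces different results:
CREATE OR REPLACE FUNCTION normalize_sku(text) RETURNS text
    AS $$ SELECT upper(trim($1)) $$ LANGUAGE sql IMMUTABLE;

-- Existing index entries were built with the old definition, so recomputing
-- normalize_sku(sku) for a given heap tuple no longer reliably finds the
-- entry that points at it.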


Depending on the gains this might still be worth doing, at least for 
some cases. It's hard to conceive of this breaking for indexes on 
integers, for example. But we'd still need to be cautious.

--
Jim Nasby, Data Architect, Austin TX





Re: Random pg_upgrade test failure on drongo

2024-01-08 Thread Jim Nasby

On 1/4/24 10:19 PM, Amit Kapila wrote:

On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin  wrote:


03.01.2024 14:42, Amit Kapila wrote:





And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, it we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...


What about checkpoints? Can't it do the same while writing the buffers?


As we deal here with pg_upgrade/pg_restore, it must not be very easy to get
the desired effect, but I think it's not impossible in principle.
More details below.
What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
-- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
  SELECT 
pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
  SELECT 
pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
  TRUNCATE pg_catalog.pg_largeobject;
--  ^^^ here we can get the error "could not create file "base/5/2683": File 
exists"
...

We get the effect discussed when the background writer process decides to
flush a file buffer for pg_largeobject during stage 1.
(Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
the result must be the same.)
And another important factor is shared_buffers = 1MB (set during the test).
With the default setting of 128MB I couldn't see the failure.

It can be reproduced easily (on old Windows versions) just by running
pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
default cluster)).
If an old cluster contains dozen of databases, this increases the failure
probability significantly (with 10 additional databases I've got failures
on iterations 4, 1, 6).



I don't have an old Windows environment to test but I agree with your
analysis and theory. The question is what should we do for these new
random BF failures? I think we should set bgwriter_lru_maxpages to 0
and checkpoint_timeout to 1hr for these new tests. Doing some invasive
fix as part of this doesn't sound reasonable because this is an
existing problem and there seems to be another patch by Thomas that
probably deals with the root cause of the existing problem [1] as
pointed out by you.

[1] - https://commitfest.postgresql.org/40/3951/


Isn't this just sweeping the problem (non-POSIX behavior on SMB and 
ReFS) under the carpet? I realize that synthetic test workloads like 
pg_upgrade in a loop aren't themselves real-world scenarios, but what 
about other cases? Even if we're certain it's not possible for these 
issues to wedge a server, it's still not a good experience for users to 
get random, unexplained IO-related errors...

--
Jim Nasby, Data Architect, Austin TX





Re: Confine vacuum skip logic to lazy_scan_skip

2024-01-04 Thread Jim Nasby


  
  
On 1/4/24 2:23 PM, Andres Freund wrote:


  On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote:
Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages
Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var
 nskippable_blocks
I think these may lead to worse code - the compiler has to reload
vacrel->rel_pages/next_unskippable_block for every loop iteration, because it
can't guarantee that they're not changed within one of the external functions
called in the loop body.

Admittedly I'm not up to speed on recent vacuum changes, but I
  have to wonder if the concept of skipping should go away in the
  context of vector IO? Instead of thinking about "we can skip this
  range of blocks", why not maintain a list of "here's the next X
  number of blocks that we need to vacuum"?

-- 
Jim Nasby, Data Architect, Austin TX
  





Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Jim Nasby


  
  
On 1/4/24 10:33 AM, Tom Lane wrote:


  Robert Haas  writes:
On Thu, Jan 4, 2024 at 10:22 AM Tom Lane  wrote:
We should be making an effort to ban coding patterns like
"return with spinlock still held", because they're just too prone
to errors similar to this one.
I agree that we don't want to add overhead, and also about how
spinlocks should be used, but I dispute "easily detectable
statically." I mean, if you or I look at some code that uses a
spinlock, we'll know whether the pattern that you mention is being
followed or not, modulo differences of opinion in debatable cases. But
you and I cannot be there to look at all the code all the time. If we
had a static checking tool that was run as part of every build, or in
the buildfarm, or by cfbot, or somewhere else that raised the alarm if
this rule was violated, then we could claim to be effectively
enforcing this rule.
I was indeed suggesting that maybe we could find a way to detect
such things automatically.  While I've not been paying close
attention, I recall there's been some discussions of using LLVM/clang
infrastructure for customized static analysis, so maybe it'd be
possible without an undue amount of effort.

FWIW, the lackey[1] tool in Valgrind is able to do some kinds of 
instruction counting, so it might be possible to measure how many 
instructions are actually being executed while holding a spinlock. 
Might be easier than code analysis.

Another possibility might be using the CPU's timestamp counter.

1: https://valgrind.org/docs/manual/lk-manual.html
    
-- 
Jim Nasby, Data Architect, Austin TX
  





Re: doing also VM cache snapshot and restore with pg_prewarm, having more information of the VM inside PostgreSQL

2024-01-04 Thread Jim Nasby


  
  
On 1/3/24 5:57 PM, Cedric Villemain wrote:


  
  for 15 years pgfincore has been sitting quietly and being used
in large setups to help in HA and resources management.
It can perfectly stay as is, to be honest I was expecting to one
day include a windows support and propose that to PostgreSQL, it
appears getting support on linux and BSD is more than enough
today.
  So I wonder if there are interest for having virtual memory
snapshot and restore operations with, for example,
pg_prewarm/autowarm ?
  

IMHO, to improve the user experience here we'd need something that 
combined the abilities of all these extensions into a cohesive interface 
that allowed users to simply say "please get this data into cache". 
Simply moving pgfincore into core Postgres wouldn't satisfy that need.
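
For context, here's a rough sketch of what users have to stitch together 
today with pg_buffercache plus pg_prewarm (both extensions must be 
installed; the snapshot table name is made up, and block ranges are 
tracked coarsely, so treat this as illustrative only):

CREATE TABLE cache_snapshot AS
    SELECT c.oid::regclass AS rel,
           min(b.relblocknumber) AS first_block,
           max(b.relblocknumber) AS last_block
    FROM pg_buffercache b
    JOIN pg_class c ON c.relfilenode = b.relfilenode
    WHERE b.reldatabase = (SELECT oid FROM pg_database
                           WHERE datname = current_database())
      AND b.relforknumber = 0    -- main fork only
    GROUP BY 1;

-- Later (e.g. after a restart), reload those block ranges; this assumes the
-- relations haven't been truncated in the meantime:
SELECT pg_prewarm(rel, 'buffer', 'main', first_block, last_block)
FROM cache_snapshot;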
So I think the real question is whether the community feels support for 
better cache (buffercache + filesystem) management is a worthwhile 
feature to add to Postgres.

Micromanaging cache contents for periodic jobs seems almost like a 
mis-feature. While it's a useful tool to have in the toolbox, it's also a 
non-trivial layer of complexity. IMHO not something we'd want to add. 
Though, there might be smaller items that would make creating tools to do 
that easier, such as some ability to see what blocks a backend is 
accessing (perhaps via a hook).

On the surface, improving RTO via cache warming sounds interesting ... 
but I'm not sure how useful it would be in reality. Users that care about 
RTO would almost always have some form of hot-standby, and generally 
those will already have a lot of data in cache. While they won't have 
read-only data in cache, I have to wonder if the answer to that problem 
is allowing writers to tell a replica what blocks are being read, so the 
replica can keep them in cache. Also, most (all?) operations that require 
a restart could be handled via a failover, so I'm not sure how much cache 
management moves the needle there.


   
  Some usecases covered: snapshot/restore cache around cronjobs,
around dumps, switchover, failover, on stop/start of postgres
(think kernel upgrade with a cold restart), ...

  
  pgfincore also provides some nice information with mincore (on
FreeBSD mincore is more interesting) or cachestat, again it can
remain as an out of tree extension but I will be happy to add to
commitfest if there are interest from the community.
An example of cachestat output:

postgres=# select * from vm_relation_cachestat('foo', range := 1024*32);
 block_start | block_count | nr_pages | nr_cache | nr_dirty | nr_writeback | nr_evicted | nr_recently_evicted
-------------+-------------+----------+----------+----------+--------------+------------+---------------------
           0 |       32768 |    65536 |    62294 |        0 |            0 |       3242 |                3242
       32768 |       32768 |    65536 |    39279 |        0 |            0 |      26257 |               26257
       65536 |       32768 |    65536 |    22516 |        0 |            0 |      43020 |               43020
       98304 |       32768 |    65536 |    24944 |        0 |            0 |      40592 |               40592
      131072 |        1672 |     3344 |      487 |        0 |            0 |       2857 |                2857

  

  Comments?

  ---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R


    
    -- 
Jim Nasby, Data Architect, Austin TX
  





Re: SET ROLE x NO RESET

2024-01-04 Thread Jim Nasby


  
  
On 1/3/24 5:47 PM, Nico Williams wrote:


  Though maybe `NO RESET` isn't really needed to build these, since after
all one could use an unprivileged role and a SECURITY DEFINER function
that does the `SET ROLE` following some user-defined authentication
method, and so what if the client can RESET the role, since that brings
it back to the otherwise unprivileged role.

An unprivileged role that has the ability to assume any other
  role ;p
Also, last I checked you can't do SET ROLE in a security definer
  function.


  
Who needs to RESET roles anyways?  Answer: connection pools, but not
every connection is used via a pool.  This brings up something: attempts
to reset a NO RESET session need to fail in such a way that a connection
pool can detect this and disconnect, or else it needs to fail by
terminating the connection altogether.


RESET ROLE is also useful for setting up a "superuser role" that
  DBAs have access to via a NO INHERIT role. It allows them to do
  things like...
SET ROLE su;
-- Do some superuserly thing
RESET ROLE;
Admittedly, the last step could be just to set their role back to
  themselves, but RESET ROLE removes the risk of typos.
    
    -- 
Jim Nasby, Data Architect, Austin TX
  





Re: add function argument names to regex* functions.

2024-01-03 Thread Jim Nasby


  
  
On 1/3/24 5:05 PM, Dian Fay wrote:


  

  Another possibility is `index`, which is relatively short and not a
reserved keyword ^1. `position` is not as precise but would avoid the
conceptual overloading of ordinary indices.



I'm not a fan of "index" since that leaves the question of
whether it's 0 or 1 based. "Position" is a bit better, but I think
Jian's suggestion of "occurance" is best.

  
  
We do have precedent for one-based `index` in Postgres: array types are
1-indexed by default! "Occurrence" removes that ambiguity but it's long
and easy to misspell (I looked it up after typing it just now and it
_still_ feels off).

How's "instance"?


Presumably someone referencing arguments by name would have just looked 
up the names via \df or whatever, so misspelling wouldn't be a big issue. 
But I think "instance" is OK as well.
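
Just to picture it, if the parameter ended up being named "instance" 
(purely hypothetical - neither the name nor named notation for these 
functions is committed), a call might read:

SELECT regexp_substr('abc abd abe', 'ab.', 1, 2);                       -- today, positional
SELECT regexp_substr('abc abd abe', 'ab.', start => 1, instance => 2);  -- with named arguments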

-- 
Jim Nasby, Data Architect, Austin TX
  





Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-01-03 Thread Jim Nasby


  
  
On 1/3/24 10:25 AM, Cédric Villemain wrote:

Hi Palak,

I did a quick review of the patch:

+CREATE FUNCTION pg_buffercache_invalidate(IN int, IN bool default true)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate'
+LANGUAGE C PARALLEL SAFE;

--> Not enforced anywhere, but you can also add a comment to the function, for end users...


The arguments should also have names...

  +    force = PG_GETARG_BOOL(1);
  
  
  I think you also need to test PG_ARGISNULL with force parameter.
  

Actually, that's true for the first argument as well. Or, just mark
the function as STRICT.
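
A hedged sketch of what that could look like (the name "buffer" for the 
first argument is just a guess; "force" comes from the C code quoted 
above) - with STRICT, a NULL input makes the function return NULL before 
the C code ever runs, so the PG_ARGISNULL checks become unnecessary:

CREATE FUNCTION pg_buffercache_invalidate(buffer int, force bool DEFAULT true)
RETURNS bool
AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate'
LANGUAGE C STRICT PARALLEL SAFE;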
-- 
Jim Nasby, Data Architect, Austin TX
  





Re: Password leakage avoidance

2024-01-03 Thread Jim Nasby


  
  
On 1/3/24 7:53 AM, Robert Haas wrote:


  Also, +1 for the general idea. I don't think this is a whole answer to
the problem of passwords appearing in log files because (1) you have
to be using libpq in order to make use of this and (2) you have to
actually use it instead of just doing something else and complaining
about the problem. But neither of those things is a reason not to have
it. There's no reason why a sophisticated user who goes through libpq
shouldn't have an easy way to do this instead of being asked to
reimplement it if they want the functionality.

ISTM the only way to really move the needle (short of removing
  all SQL support for changing passwords) would be a GUC that allows
  disabling the use of SQL for setting passwords. While that doesn't
  prevent leakage, it does at least force users to use a secure
  method of setting passwords.
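
For context, the exposure being discussed is the plain SQL path versus 
the client-side hashing psql already does (the role name is a placeholder):

-- The plaintext can end up in server logs, pg_stat_activity, etc.:
ALTER ROLE alice PASSWORD 'not-so-secret';

-- Whereas psql's \password hashes the password client-side and only ships
-- the SCRAM verifier to the server.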

-- 
Jim Nasby, Data Architect, Austin TX
  





Re: add function argument names to regex* functions.

2024-01-03 Thread Jim Nasby


  
  
On 1/1/24 12:05 PM, Dian Fay wrote:


  I agree that the parameter name `n` is not ideal. For example, in
`regexp_replace` it's easy to misinterpret it as "make up to n
replacements". This has not been a problem when `n` only lives in the
documentation which explains exactly what it does, but that context is
not readily available in code expressing `n => 3`.


Agreed; IMO it's worth diverging from what Oracle has done here.

  
Another possibility is `index`, which is relatively short and not a
reserved keyword ^1. `position` is not as precise but would avoid the
conceptual overloading of ordinary indices.


I'm not a fan of "index" since that leaves the question of
  whether it's 0 or 1 based. "Position" is a bit better, but I think
  Jian's suggestion of "occurance" is best.

--
Jim Nasby, Data Architect, Austin TX
  





Re: Things I don't like about \du's "Attributes" column

2024-01-02 Thread Jim Nasby


  
  
On 1/2/24 1:38 PM, Robert Haas wrote:


  But to try to apply that concept
here means that we suppose the user knows whether the default is
INHERIT or NOINHERIT, whether the default is BYPASSRLS or NOBYPASSRLS,
etc. And I'm just a little bit skeptical of that assumption. Perhaps
it's just that I've spent less time doing user management than table
administration and so I'm the only one who finds this fuzzier than
some other kinds of SQL objects, but I'm not sure it's just that.
Roles are pretty weird.

In my consulting experience, it's extremely rare for users to do
  anything remotely sophisticated with roles (I was always happy
  just to see apps weren't connecting as a superuser...).
Like you, I view \du and friends as more of a "helping hand" to
  seeing the state of things, without the expectation that every
  tiny nuance will always be visible, because I don't think it's
  practical to do that in psql. While that behavior might surprise
  some users, the good news is once they start exploring non-default
  options the behavior becomes self-evident.
Some attributes are arguably important enough to warrant their own 
column. The most obvious is NOLOGIN, since those roles are generally used 
for a very different purpose than LOGIN roles. SUPERUSER might be another 
candidate (though I much prefer a dedicated "sudo role" to explicit 
SUPERUSER on roles).
I'm on the fence when it comes to SQL syntax vs what we have now. What we 
currently have is more readable, but off-hand I think the other places we 
list attributes we do it in SQL syntax. It might be worth changing just 
for consistency's sake.
--
Jim Nasby, Data Architect, Austin TX

  





Re: Next step towards 64bit XIDs: Switch to FullTransactionId for PGPROC->xid and XLogRecord->xl_xid

2024-01-02 Thread Jim Nasby


  
  
On 1/2/24 1:58 PM, Robert Haas wrote:


  Maybe this analysis I've just given isn't quite right, but my point is
that we should try to think hard about where in the system 32-bit XIDs
suck and for what reason, and use that as a guide to what to change
first.

Very much this. The biggest reason 2B XIDs are such an issue is that 
it's incredibly expensive to move the window forward, which is governed 
by on-disk stuff.

  





Re: SET ROLE x NO RESET

2024-01-02 Thread Jim Nasby


  
  
On 12/31/23 1:19 PM, Joe Conway wrote:

On 12/30/23 17:19, Michał Kłeczek wrote:
  
  

On 30 Dec 2023, at 17:16, Eric Hanson  wrote:
  
  
  What do you think of adding a NO RESET option to the SET ROLE
  command?
  


What I proposed some time ago is SET ROLE … GUARDED BY
‘password’, so that you could later: RESET ROLE WITH ‘password'

  
  
  I like that too, but see it as a separate feature. FWIW that is
  also supported by the set_user extension referenced elsewhere on
  this thread.

That means you'd need to manage that password. ISTM a better
mechanism to do this in SQL would be a method of changing roles that
returns a nonce that you'd then use to reset your role. I believe
that'd also make it practical to do this server-side in the various
PLs.
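
To sketch the idea (entirely hypothetical syntax, shown only to mirror the 
GUARDED BY proposal above with a server-generated secret instead of a 
managed password):

SET ROLE app_user RETURNING NONCE;   -- hypothetical: server returns e.g. '7f3a9c...'
-- ... do work as app_user ...
RESET ROLE WITH '7f3a9c...';         -- hypothetical: only succeeds with the matching nonce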
  





Re: Should vacuum process config file reload more often

2023-03-08 Thread Jim Nasby

On 3/2/23 1:36 AM, Masahiko Sawada wrote:


For example, I guess we will need to take care of changes of
maintenance_work_mem. Currently we initialize the dead tuple space at
the beginning of lazy vacuum, but perhaps we would need to
enlarge/shrink it based on the new value?
Doesn't the dead tuple space grow as needed? Last I looked we don't 
allocate up to 1GB right off the bat.

I don't think we need to do anything about that initially, just because the
config can be changed in a more granular way, doesn't mean we have to react to
every change for the current operation.

Perhaps we can mention in the docs that a change to maintenance_work_mem
will not take effect in the middle of vacuuming a table. But, Ithink it probably
isn't needed.

Agreed.


I disagree that there's no need for this. Sure, if 
maintenance_work_mem is 10MB then it's no big deal to just abandon 
your current vacuum and start a new one, but the index vacuuming phase 
with maintenance_work_mem set to say 500MB can take quite a while. 
Forcing a user to either suck it up or throw away everything done in 
that phase isn't terribly good.


Of course, if the patch that eliminates the 1GB vacuum limit gets 
committed the situation will be even worse.


While it'd be nice to also honor maintenance_work_mem getting set lower, 
I don't see any need to go through heroics to accomplish that. Simply 
recording the change and honoring it for future attempts to grow the 
memory and on future passes through the heap would be plenty.


All that said, don't let these suggestions get in the way of committing 
this. Just having the ability to tweak cost parameters would be a win.
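
For instance, this is the sort of mid-run adjustment it would let a long 
(auto)vacuum pick up (the values here are purely illustrative):

ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
SELECT pg_reload_conf();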


Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-21 Thread Jim Nasby



On 2/21/23 3:12 PM, Andres Freund wrote:



Hi,

On 2023-02-21 15:00:15 -0600, Jim Nasby wrote:

Some food for thought: I think it's also completely fine to extend any
relation over a certain size by multiple blocks, regardless of concurrency.
E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel
for what algorithm would make sense here; maybe something along the lines of
extend = max(relpages / 2048, 128); if extend < 8 extend = 1; (presumably
extending by just a couple extra pages doesn't help much without
concurrency).

I previously implemented just that. It's not easy to get right. You can easily
end up with several backends each extending the relation by quite a bit, at
the same time (or you re-introduce contention). Which can end up with a
relation being larger by a bunch if data loading stops at some point.

We might want that as well at some point, but the approach implemented in the
patchset is precise and thus always a win, and thus should be the baseline.
Yeah, what I was suggesting would only make sense when there *wasn't* 
contention.





Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-21 Thread Jim Nasby

On 10/28/22 9:54 PM, Andres Freund wrote:

b) I found that is quite beneficial to bulk-extend the relation with
smgrextend() even without concurrency. The reason for that is the primarily
the aforementioned dirty buffers that our current extension method causes.

One bit that stumped me for quite a while is to know how much to extend the
relation by. RelationGetBufferForTuple() drives the decision whether / how
much to bulk extend purely on the contention on the extension lock, which
obviously does not work for non-concurrent workloads.

After quite a while I figured out that we actually have good information on
how much to extend by, at least for COPY /
heap_multi_insert(). heap_multi_insert() can compute how much space is
needed to store all tuples, and pass that on to
RelationGetBufferForTuple().

For that to be accurate we need to recompute that number whenever we use an
already partially filled page. That's not great, but doesn't appear to be a
measurable overhead.
Some food for thought: I think it's also completely fine to extend any 
relation over a certain size by multiple blocks, regardless of 
concurrency. E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't 
have a good feel for what algorithm would make sense here; maybe 
something along the lines of extend = max(relpages / 2048, 128); if 
extend < 8 extend = 1; (presumably extending by just a couple extra 
pages doesn't help much without concurrency).





Re: Direct I/O

2022-11-04 Thread Jim Nasby

On 11/1/22 2:36 AM, Thomas Munro wrote:


Hi,

Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the


This is exciting to see! There are two other items to add to the TODO list 
before this would be ready for production:


1) work_mem. This is a significant impediment to scaling shared buffers 
the way you'd want to.


2) Clock sweep. Specifically, currently the only thing that drives 
usage_count is individual backends running the clock hand. On large 
systems with 75% of memory going to shared_buffers, that becomes a very 
significant problem, especially when the backend running the clock sweep 
is doing so in order to perform an operation like a b-tree page split. I 
suspect it shouldn't be too hard to deal with this issue by just having 
bgwriter or another bgworker proactively ensuring some reasonable number 
of buffers with usage_count=0 exist.
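
As a rough gauge of the problem, the pg_buffercache extension makes it 
easy to see how much of shared_buffers is sitting at usage_count = 0 at 
any given moment:

SELECT usagecount, count(*) AS buffers
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;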



One other thing to be aware of: overflowing an SLRU becomes a massive 
problem if there isn't a filesystem cache backing the SLRU. Obviously only 
an issue if you try and apply DIO to SLRU files.






Re: autovacuum: change priority of the vacuumed tables

2018-03-03 Thread Jim Nasby

On 3/3/18 2:53 PM, Tomas Vondra wrote:

That largely depends on what knobs would be exposed. I'm against adding
some low-level knobs that perhaps 1% of the users will know how to tune,
and the rest will set it incorrectly. Some high-level options that would
specify the workload type might work, but I have no idea about details.


Not knowing about details is why we've been stuck here for years: it's 
not terribly obvious how to create a scheduler that is going to work in 
all situations. Current autovac is great for 80% of situations, but it 
simply doesn't handle the remaining 20% by itself. Once you're pushing 
your IO limits you *have* to start scheduling manual vacuums for any 
critical tables.


At least if we exposed some low level ability to control autovac workers 
then others could create tools to improve the situation. Currently 
that's not possible because manual vacuum lacks features that autovac has.



One fairly simple option would be to simply replace the logic that
currently builds a worker's table list with running a query via SPI.
That would allow for prioritizing important tables. It could also reduce
the problem of workers getting "stuck" on a ton of large tables by
taking into consideration the total number of pages/tuples a list contains.


I don't see why SPI would be needed to do that, i.e. why couldn't we
implement such prioritization with the current approach. Another thing


Sure, it's just a SMOC. But most of the issue here is actually a query 
problem. I suspect that the current code would actually shrink if 
converted to SPI. In any case, I'm not wed to that idea.



is I really doubt prioritizing "important tables" is an good solution,
as it does not really guarantee anything.


If by "important" you mean small tables with high update rates, 
prioritizing those actually would help as long as you have free workers. 
By itself it doesn't gain all that much though.



A more fine-grained approach would be to have workers make a new
selection after every vacuum they complete. That would provide the
ultimate in control, since you'd be able to see exactly what all the
other workers are doing.

That was proposed earlier in this thread, and the issue is it may starve
all the other tables when the "important" tables need cleanup all the time.


There's plenty of other ways to shoot yourself in the foot in that 
regard already. We can always have safeguards in place if we get too 
close to wrap-around, just like we currently do.

--
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG http://OpenSCG.com



Re: autovacuum: change priority of the vacuumed tables

2018-03-03 Thread Jim Nasby

On 2/19/18 10:00 AM, Tomas Vondra wrote:

So I don't think this is a very promising approach, unfortunately.

What I think might work is having a separate pool of autovac workers,
dedicated to these high-priority tables. That still would not guarantee
the high-priority tables are vacuumed immediately, but at least that
they are not stuck in the worker queue behind low-priority ones.

I wonder if we could detect tables with high update/delete activity and
promote them to high-priority automatically. The reasoning is that by
delaying the cleanup for those tables would result in significantly more
bloat than for those with low update/delete activity.


I've looked at this stuff in the past, and I think that the first step 
in trying to improve autovacuum needs to be allowing for a much more 
granular means of controlling worker table selection, and exposing that 
ability. There are simply too many different scenarios to try and 
account for to try and make a single policy that will satisfy everyone. 
Just as a simple example, OLTP databases (especially with small queue 
tables) have very different vacuum needs than data warehouses.


One fairly simple option would be to simply replace the logic that 
currently builds a worker's table list with running a query via SPI. 
That would allow for prioritizing important tables. It could also reduce 
the problem of workers getting "stuck" on a ton of large tables by 
taking into consideration the total number of pages/tuples a list contains.
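
As a hedged sketch of the kind of prioritization query a worker could run 
via SPI (the threshold and ordering are made up purely for illustration):

SELECT c.oid::regclass AS table_name,
       s.n_dead_tup,
       pg_relation_size(c.oid) AS table_bytes
FROM pg_class c
JOIN pg_stat_all_tables s ON s.relid = c.oid
WHERE c.relkind = 'r'
  AND s.n_dead_tup > 1000
ORDER BY s.n_dead_tup::float8 / greatest(s.n_live_tup, 1) DESC,
         pg_relation_size(c.oid);    -- small, heavily-updated tables first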


A more fine-grained approach would be to have workers make a new 
selection after every vacuum they complete. That would provide the 
ultimate in control, since you'd be able to see exactly what all the 
other workers are doing.

--
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG http://OpenSCG.com