Using operators to do query hints

2022-02-22 Thread Greg Stark
I've been playing with an idea I had a while back. Basically that it
would be useful to have some "noop" operators that are used purely to
influence the planner.

For context I've suggested in the past that there are two categories of hints:

1 Hints that override the planner's decisions with explicit planning
decisions. These we don't like for a variety of reasons, notably
because users don't have very good knowledge of all the plan types
available and new plan types come out in the future.

2 Hints that just influence the estimates that the planner uses to
make its decisions. These seem more acceptable as it amounts to
admitting the users know the distribution of their data and the
behaviour of their functions better than the statistics. Also, there
are plenty of cases where the statistics are really just wild guesses.
They still allow the planner to make decisions about what join orders
or types and so on given the updated data.

So I thought it would be useful to add some noop operators that take a
boolean on one side and a float on the other and simply return the
boolean at run-time, but use the float (which would have to be a
constant) from the lhs as the selectivity estimate.
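
To make that concrete, here's a rough sketch of the SQL-level
scaffolding -- not the actual patch. The RESTRICT estimator named here
(float_const_sel) is hypothetical; it would have to be a small C
function that pulls the constant off the lhs and returns it as the
selectivity:

-- sketch only: in practice the no-op function probably needs to be a
-- trivial C function so the planner can't inline the operator away;
-- a SQL body is shown here just for the shape
CREATE FUNCTION noop_bool(float8, bool) RETURNS bool
    AS 'SELECT $2' LANGUAGE SQL IMMUTABLE;

-- operator whose restriction estimator reads the float constant on
-- the lhs (float_const_sel is a hypothetical C-language estimator)
CREATE OPERATOR %%% (
    LEFTARG  = float8,
    RIGHTARG = bool,
    FUNCTION = noop_bool,
    RESTRICT = float_const_sel
);

-- the pg_unlikely() wrapper used in the examples below could then be
-- something like:
CREATE FUNCTION pg_unlikely(bool) RETURNS bool
    AS 'SELECT $1 AND 1e-06::float8 %%% $1' LANGUAGE SQL;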

In practice it's usable but not nearly as widely useful as I expected.

1) The boolean being passed through has to be a real clause with real
fields. Otherwise the planner is too clever and turns it into a
one-time filter which doesn't affect the plan :/

2) If it's a clause that is potentially going to be an index
condition then you need to repeat the clause outside the selectivity
noop operator. Otherwise the selectivity operator hides it from the
planner.

3) It doesn't work on joins at all. Joins are done using the join
selectivity function on the join clause's operator, and there's no way
that I can see to construct a simple noop wrapper that would still
work with that.

Nonetheless what I have does seem to be somewhat handy for simple cases.

I added some simple SQL wrapper functions such as pg_unlikely():

postgres=# explain select * from test where i<100;
 QUERY PLAN
----------------------------------------------------------------------
 Index Only Scan using i on test  (cost=0.28..10.01 rows=99 width=4)
   Index Cond: (i < 100)
(2 rows)

postgres=# explain select * from test where pg_unlikely(i<100);
 QUERY PLAN
----------------------------------------------------------------------
 Index Only Scan using i on test  (cost=0.28..10.50 rows=1 width=4)
   Index Cond: (i < 100)
   Filter: ('1e-06'::double precision %%% (i < 100))
(3 rows)

However this doesn't really do what you might be hoping.
Specifically, it doesn't actually affect the planner's choice of the
index scan in that query. It affects the estimate of the result of the
scan, which might be useful for subsequent nodes of the plan, but not
for the index scan itself:

postgres=# explain select * from test where pg_unlikely(i<500);
 QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22.50 rows=1 width=4)
   Filter: ((i < 500) AND ('1e-06'::double precision %%% (i < 500)))
(2 rows)


So. I dunno. This feels like it's something that could be quite
handy but it's not as good as I had hoped and I think I'm out of ideas
for making it more powerful.


-- 
greg




Re: Design of pg_stat_subscription_workers vs pgstats

2022-02-25 Thread Greg Stark
On Tue, 25 Jan 2022 at 01:32, Andres Freund  wrote:
>
> Hi,
>
> I was looking the shared memory stats patch again.

Can you point me to this thread? I looked for it but couldn't find it.


-- 
greg




Re: Commitfest manager for 2022-03

2022-02-25 Thread Greg Stark
I would like to volunteer.

On Fri, 25 Feb 2022 at 05:31, Julien Rouhaud  wrote:
>
> Hi,
>
> The final commitfest for pg15 will start in a few days, and I didn't see any
> discussion on it or anyone volunteering to be a CFM.
>
> I thought it would be a good idea to send this reminder now and avoid the same
> situation as the last commitfest, to avoid unnecessary pain for the CFM(s).
>
> Is there any volunteer?
>
> For the record there are already 246 active patches registered.
>
>


-- 
greg




Re: Commitfest manager for 2022-03

2022-02-26 Thread Greg Stark
On Sat, 26 Feb 2022 at 01:33, Julien Rouhaud  wrote:
>
> On Sat, Feb 26, 2022 at 02:42:33PM +0900, Michael Paquier wrote:
> > On Fri, Feb 25, 2022 at 01:58:55PM -0600, David Steele wrote:
> > > On 2/25/22 12:39, Greg Stark wrote:
> > >> I would like to volunteer.
> >
> > > I've been hoping somebody would volunteer, so I'm all in favor of you 
> > > being
> > > CF.
> >
> > Greg as CFM would be fine.  It's nice to see someone volunteer.
>
> +1, thanks a lot Greg!
>
> Note that the last CF of a major version is probably even more work than
> "regular" CF, so if other people are interested there's probably room for
> multiple CFMs.

I do have the time available. What I don't necessarily have is insight
into everything that needs to be done, especially behind the scenes.
So if I could have someone give me tips about things I'm missing, or
review the things I am doing to see that I've covered everything, that
would be great.


-- 
greg




CommitFest begins tomorrow... Get your patches in

2022-02-28 Thread Greg Stark
28 days has February ... So the March commitfest begins tomorrow. Meet
your friendly neighbourhood Commitfest Manager for March.  Greetings!

If you have a patch you're hoping to get feedback on or you're
expecting to get committed this month make sure it's in the commitfest
at https://commitfest.postgresql.org/37/

Currently there are 199 patches marked Needs Review and 25 Ready for Committer.

This is the last commitfest of the cycle so we'll be concentrating on
patches that are likely to be committed this commitfest. I'll move any
patches that are unlikely to be committed to target Postgres 16 and
the July commitfest shortly.

If you think your patch is really viable for Postgres 15 despite
needing feedback it can be helpful to post describing exactly what
you're blocked on and what remains to be done. It also sometimes helps
get attention if you make clear that you have the time available this
month to work on it in response to feedback.

The "round robin reviewers" system is no longer in use. Generally
people can just grab any unassigned patch from the commitfest and
review it, but if for some reason you're reluctant, feel free to email
me -- please put "commitfest" in the subject line.

If you've submitted a patch and are feeling blocked waiting on
feedback please update the thread on the list. If you're still not
getting responses please let me know -- again, please put "commitfest"
in the subject.

-- 
greg




Re: Removing unneeded self joins

2022-02-28 Thread Greg Stark
On Thu, 1 Jul 2021 at 02:38, Ronan Dunklau  wrote:
>
> Well in some cases they can't, when the query is not emitting redundant
> predicates by itself but they are added by something else like a view or a RLS
> policy.
> Maybe it would be worth it to allow spending a bit more time planning for
> those cases ?

Yeah, I'm generally in favour of doing more work in the optimizer to
save query authors work writing queries.

My question is whether it handles cases like:

select b.x,c.y
  from t
   join t2 as b on (b.id = t.id)
  join t2 as c on (c.id = t.id)

That is, if you join against the same table twice on the same qual.
Does the EC mechanism turn this into a qual on b.id = c.id and then
turn this into a self-join that can be removed?
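
Assuming t2.id is unique, the reduction I'd hope for (just a sketch) is
that the derived b.id = c.id lets the planner treat b and c as the same
relation and drop one of the joins entirely:

-- hoped-for equivalent after self-join removal
select b.x, b.y
  from t
   join t2 as b on (b.id = t.id)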

That's the usual pattern where I've seen this arise. It's not so much
that people write self joins explicitly, but that they add a join to
check some column, and that happens in some isolated piece of code
that doesn't know the join is already in the query. You can easily
end up with a lot of joins against the same table this way.

It's not far different from the old chestnut

select (select x from t2 where id = t.id) as x,
   (select y from t2 where id = t.id) as y
  from t

which is actually pretty hard to avoid sometimes.

-- 
greg




Re: Removing unneeded self joins

2022-02-28 Thread Greg Stark
I don't think the benchmarking that's needed is to check whether
pruning unnecessary joins is helpful. Obviously it's going to be hard
to measure on simple queries and small tables. But the resulting plan
is unambiguously superior and in more complex cases could avoid extra i/o.

The benchmarking people were looking for in the past was testing the
impact of the extra planning work in cases where it doesn't end up
being applied. I'm not sure what the worst case is, perhaps a many-way
self-join where the join clauses are not suitable for pruning?




Commitfest 2022-03 Patch Triage Part 1a.i

2022-03-01 Thread Greg Stark
Last November Daniel Gustafsson  did a patch triage. It took him three
emails to get through the patches in the commitfest back then. Since
then we've had the November and the January commitfests so I was
interested to see how many of these patches had advanced

I'm only part way through the first email but so far only two patches
have changed status -- and both to "Returned with feedback" :(

So I'm going to post updates but I'm going to break it up into smaller
batches because otherwise it'll take me a month before I post
anything.



> 1608: schema variables, LET command
> ===
> After 18 CF's and two very long threads it seems to be nearing completion
> judging by Tomas' review.  There is an ask in that review for a second pass
> over the docs by a native speaker, any takers?

Patch has a new name, "session variables, LET command"

There's been a *lot* of work on this patch so I'm loath to bump it.
The last review, from Julien Rouhaud, had mostly code style comments,
but it raised one particular concern about using an xact callback in
core and about EOX cleanup.

Pavel, do you have a plan to improve this or are you looking for
suggestions from someone about how you should solve this problem?

> 1741: Index Skip Scan
> =
> An often requested feature which has proven hard to reach consensus on an
> implementation for.  The thread(s) have stalled since May, is there any hope 
> of
> taking this further?  Where do we go from here with this patch?

Often requested indeed! I would love to be able to stop explaining to
people that Postgres can't handle this case well.
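
For anyone who hasn't hit it, the canonical case I keep having to
explain (table and column names made up here) is a DISTINCT over the
leading column of an index:

-- with an index on readings (sensor_id, recorded_at), people expect
-- this to hop from one sensor_id to the next, but today it reads
-- every index entry (or falls back to a seqscan plus aggregate):
SELECT DISTINCT sensor_id FROM readings;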

It seems there are multiple patch variants around and suggestions from
Heikki and Peter G about fundamental interface choices. It would be
nice to have a good summary from someone who is involved about what's
actually left unresolved.


> 1712: Remove self join on a unique column
> =
> This has moved from "don't write bad queries" to "maybe we should do something
> about this".  It seems like there is concensus that it can be worth paying the
> added planner cost for this, especially if gated by a GUC to keep it out of
> installations where it doesn't apply.  The regression on large join queries
> hasn't been benchmarked it seems (or I missed it), but the patch seems to have
> matured and be quite ready.  Any takers on getting it across the finish line?

There hasn't been any review since the v29 patch was posted in July.
That said, to my eye it seemed like pretty basic functionality errors
were being corrected quite late. All the bugs got patch updates
quickly but does this have enough tests to be sure it's working right?

The only real objection was about whether the planning time justified
the gains since the gains are small. But I think they were resolved by
making the optimization optional. Do we have consensus that that
resolved the issue or do we still need benchmarks showing the planning
time hit is reasonable?

> 2161: standby recovery fails when re-replaying due to missing directory which
> was removed in previous replay.
> =
> Tom and Robert seem to be in agreement that parts of this patchset are good,
> and that some parts are not.  The thread has stalled and the patch no longer
> apply, so unless someone feels like picking up the good pieces this seems like
> a contender to close for now.

Tom's feedback seems to have been acted on last November. And the last
update in January was that it was passing CI now. Is this ready to
commit now?


> 2138: Incremental Materialized View Maintenance
> ===
> The size of the
> patchset and the length of the thread make it hard to gauge just far away it
> is, maybe the author or a reviewer can summarize the current state and outline
> what is left for it to be committable.

There is an updated patch set as of February but I have the same
difficulty wrapping my head around the amount of info here.

Is this one really likely to be commitable in 15? If not I think we
should move this to 16 now and concentrate on patches that will be
commitable in this release.


> 2218: Implement INSERT SET syntax
> =
> The author has kept this patch updated, and has seemingly updated according to
> the review comments.  Tom: do you have any opinions on whether the updated
> version addresses your concerns wrt the SELECT rewrite?

I don't see any discussion implying that Tom's concerns were met. I'm
not exactly clear why Tom's concerns are real problems though --
wouldn't it be a *good* thing if we had a more expressive syntax? But
that's definitely what needs to be resolved before it can move ahead.

So unless there's objections I'm going to update this to "waiting on author".

-- 
greg




Commitfest 2022-03 Starts Now

2022-03-01 Thread Greg Stark
The final commitfest of this release begins now.

Whereas most commitfests are about getting feedback to authors so they
can advance the patch -- this one is about actually committing patches
to wrap up the release.

Please when reviewing patches try to push yourself to make the
difficult call about whether a patch will actually be feasible to
commit in this commitfest. If not be up front about it and say so
because we need to focus energy in this commitfest on patches that are
committable in Postgres 15.

There are a lot of great patches in the commitfest that would really
be great to get committed!

--
greg




Re: Commitfest 2022-03 Patch Triage Part 1a.i

2022-03-01 Thread Greg Stark
As Justin suggested, I'm CCing the authors of these patches by adding
them here. Some of the patches have multiple "Authors" listed in the
commitfest, which may just be people who posted updated patches, so I
may have added more people than necessary.
[If you received two copies of this do not reply to the first one or
you'll get bounces. Hopefully I've cleaned the list now but we'll
see...]




Re: Commitfest 2022-03 Patch Triage Part 1a.i

2022-03-01 Thread Greg Stark
On Tue, 1 Mar 2022 at 14:59, Greg Stark  wrote:
>
> As Justin suggested I CC the authors from these patches I'm adding
> them here. Some of the patches have multiple "Authors" listed in the
> commitfest which may just be people who posted updated patches so I
> may have added more people than necessary.

> [If you received two copies of this do not reply to the first one or
> you'll get bounces. Hopefully I've cleaned the list now but we'll
> see...]

Nope. One more address to remove from the CC.


-- 
greg




Commitfest 2022-03 Patch Triage Part 1b

2022-03-01 Thread Greg Stark
> 2096: psql - add SHOW_ALL_RESULTS option
> 
> Peter posted an updated version of Fabiens patch about a month ago (which at
> this point no longer applies) fixing a few issues, but also point at old 
> review
> comments still unaddressed.  Since this was pushed, but had to be reverted, I
> assume there is a desire for the feature but it seems to need more work still.

It looks like Peter and Fabien were debating the merits of a libpq
change and probably that won't happen this release cycle. Is there a
kernel of psql functionality that can be extracted from this without
the libpq change in this release cycle or should it wait until we add
the functionality to libpq?

If it's the latter then perhaps we should move this to 16?


> 1651: GROUP BY optimization
> ===
> This is IMO a desired optimization, and after all the heavy lifting by Tomas
> the patch seems to be in pretty good shape.

This is two patches and it sounds like the first one is ready for
committer whereas the second one is less clear. Or is the second one
meant to be an alternative for the first one?

>
> 2377: pg_ls_* functions for showing metadata and recurse (pg_ls_tmpdir to show
> shared filesets)
> ==
> The question of what to do with lstat() on Windows is still left unanswered,
> but the patchset has been split to up to be able to avoid it.  Stephen and 
> Tom,
> having done prior reviews do you have any thoughts on this?

Is this still blocked on lstat for Windows? I couldn't tell; is there
consensus on a behaviour for Windows, even if that just means failing
or returning partial results there?

Other than that it seems like there's a lot of this patch that has
positive reviews and is ready for committing.


> 2349: global temporary table
> 
> GTT has been up for discussion numerous times in tbe past, and I can't judge
> whether this proposal has a better chance than previous ones.  I do note the
> patch has a number of crashes reported lately, and no reviews from senior
> contributors in a while, making it seem unlikely to be committed in this CF.
> Since the patch is very big, can it be teased apart and committed separately
> for easier review?

I think Andres's review decisively makes it clear this is in an
uncommittable state.

https://www.postgresql.org/message-id/20220225074500.sfizxbmlrj2s6hp5%40alap3.anarazel.de
https://www.postgresql.org/message-id/20220227041304.mnimeqkhwktrjyht%40alap3.anarazel.de

It's definitely not going to make it this release and will probably
need a significant amount of time next release cycle. IMHO dividing it
up into smaller features does seem like it would be more effective at
getting things committed.

Should we mark this returned with feedback or just move it to the next
commitfest as waiting on author?


> 2433: Erase the distinctClause if the result is unique by definition
> 
> (parts of) The approach taken in this patch has been objected against in favor
> of work that Tom has proposed.  Until that work materialize this patch is
> blocked, and thus I think we are better of closing it and re-opening it when 
> it
> gets unstuck.  Unless Tom has plans to hack on this shortly?

Ugh. This is a problematic dynamic. Tom has a different idea of what
direction to take this but hasn't had a chance to work on it. So
what's Andy Fan supposed to do here? He can't read Tom's mind and
nobody else can really help him. Ultimately we all have limited time
so this is a thing that will happen but is there anything we can do to
resolve it in this case?

We definitely shouldn't spend lots of time on this patch unless we're
going to be ok going ahead without Tom's version of it. Is this
something we can do using Andy's data structure for now and change
in the future?

It looks like the Skip Scan patch was related to this work in some
way? Is it blocked on it?

-- 
greg




Re: Commitfest 2022-03 Patch Triage Part 1a.i

2022-03-02 Thread Greg Stark
On Wed, 2 Mar 2022 at 07:12, Daniel Gustafsson  wrote:

> Thanks for picking it up and continuing with recent developments.  Let me know
> if you want a hand in triaging patchsets.

While I have the time, there may be patches where I need help coming
to the right conclusion about what action to take.

I think the main thing I can do to help is to take patches that have
no chance of making this release and take them off our collective
plates -- hopefully after they've received feedback, but as this is
the last commitfest of the release that's a secondary concern.

But I'm unclear exactly what the consequences in the commitfest app
are of specific state changes. As I understand it there are basically
two alternatives:

1) Returned with feedback -- does this make it harder for an author to
resume work next release? Can they simply reactivate the CF entry or do
they need to start a new one and then lose history in the app?

2) Moved to next commitfest -- this seems to just drag the pain on.
Then it has to get triaged again next commitfest and if it's actually
stalled (or progressing fine without needing feedback) that's just
extra work for nothing.

Do I have this right? What is the right state to put a patch in that
means "this patch doesn't need to be triaged again unless the author
actually feels progress has been made and needs new feedback or thinks
it's committable"?

-- 
greg




Re: Commitfest 2022-03 Patch Triage Part 1b

2022-03-03 Thread Greg Stark
Just FYI: it's better to follow up to the thread for the patch that's
already in the CF. Otherwise your patch will be missed by someone who
looks at the CF entry to see the latest patch.




Re: [Proposal] Global temporary tables

2022-03-03 Thread Greg Stark
It doesn't look like this is going to get committed this release
cycle. I understand more feedback could be valuable, especially on the
overall design, but as this is the last commitfest of the release we
should focus on other patches for now and spend that time in the next
release cycle.

I'm going to bump this one now as Waiting on Author for the design
documentation Robert asked for and probably a plan for how to separate
that design into multiple separable features as Andres suggested.

I'm still hopeful we get to advance this early in 16 because I think
everyone agrees the feature would be great.




Re: [PATCH] minor reloption regression tests improvement

2022-03-07 Thread Greg Stark
I don't think this is worth spending time adding tests for. I get what
you're saying: this is at least semi-intentional and desirable
behaviour, so it should have tests ensuring that it continues to work.
But it's not documented behaviour, and the test is basically testing
that the implementation is this specific implementation.

I don't think the test is really a bad idea, but neither is it really
very useful, and I think it's not worth having people spend time
reviewing and discussing it. I'm leaning toward saying this patch is
rejected.




Re: Commitfest 2022-03 One Week in. 3 Commits 213 Patches Remaining

2022-03-09 Thread Greg Stark
So it's 8 days into the commitfest. So far 3 patches have been committed:

* parse/analyze API refactoring by Peter Eisentraut
* FUNCAPI tuplestore helper function by Melanie Plageman committed by
Michael Paquier
* Typo in pgbench messages by Kawamoto Masaya committed by Tatsuo Ishii

(There was also a flurry of five commits from Tom on Feb 28 including
one that needed to be backpatched subsequently)

In addition 5 patches have been moved to a future commitfest,
withdrawn, or rejected.

So that has removed 8 patches from the commitfest. There are now 189
"Needs Review" and 24 "Ready for Committer" patches.

Of those 24 "Ready for Committer" patches only 3 actually have
committers assigned and only 6 have had any emails since the beginning
of the commitfest.

Is there anything I can do to get committers assigned to these
patches? Should I do a round-robin style assignment for any of these?




Re: Commitfest 2022-03 One Week in. 3 Commits 213 Patches Remaining

2022-03-09 Thread Greg Stark
On Wed, 9 Mar 2022 at 15:44, David Steele  wrote:
>
> On 3/9/22 13:38, Greg Stark wrote:
> Should I do a round-robin style assignment for any of these?
>
> I don't think this is a good idea. Committers pick the patches they are
> going to commit.
>
> What prefer to do is bump any committers that have been involved in a
> patch thread to see if they are willing to commit it.

Well yes, I suppose that's probably what I had in mind despite calling it that.

But I've been skimming the set of "Ready for Committer" patches and
I'm a bit down on them. Many of them seem to mostly have gotten
feedback from committers already and the type of feedback that leads
me to think it's ready for commit.

I suppose people mark patches "Ready for Committer" when the feedback
they require is more in-depth, or is design feedback that they think
requires a committer, even if the patch isn't ready for commit.

So I'm going to go through the patches and ask the committers who have
already commented if they think the patch is on track to be committed
this release or should be pushed to the next commitfest.

-- 
greg




Re: Commitfest 2022-03 One Week in. 3 Commits 213 Patches Remaining

2022-03-09 Thread Greg Stark
On Wed, 9 Mar 2022 at 16:46, Greg Stark  wrote:
> Many of them seem to mostly have gotten
> feedback from committers already and the type of feedback that leads
> me to think it's ready for commit.

Er. I meant *not* the type of feedback that leads me to think it's
ready for commit. I mostly see patches with early design feedback or
even feedback questioning whether they're good ideas at all.

-- 
greg




Re: Tablesync early exit

2022-03-15 Thread Greg Stark
This patch has been through five CFs without any review. It's a very
short patch and I'm assuming the only issue is that it's not clear
whether it's a good idea or not and there are few developers familiar
with this area of code?

Amit, it looks like you were the one who asked for it to be split off
from the logical decoding of 2PC patch in [1]. Can you summarize what
questions remain here? Should we just commit this or is there any
issue that needs to be debated?

[1] 
https://www.postgresql.org/message-id/CAA4eK1Jxu-3qxtkfA_dKoquQgGZVcB+k9_-yT5=9gdew84t...@mail.gmail.com


On Sat, 6 Mar 2021 at 20:56, Peter Smith  wrote:
>
> Hi hackers.
>
> I propose a small optimization can be added to the tablesync replication code.
>
> This proposal (and simple patch) was first discussed here [1].
>
> Basic idea is the tablesync could/should detect if there is anything
> to do *before* it enters the apply main loop. Calling
> process_sync_tables() before the apply main loop offers a quick way
> out, so the message handling will not be unnecessarily between
> workers. This will be a small optimization.
>
> But also, IMO this is a more natural separation of work. E.g tablesync
> worker will finish when the table is synced - not go one extra step...
>
> ~~
>
> This patch was already successfully used for several versions
> (v43-v50) of another 2PC patch [2], but it was eventually removed from
> there because, although it has its own independent value, it was not
> required for that patch series [3].
>
> 
> [1] 
> https://www.postgresql.org/message-id/CAHut%2BPtjk-Qgd3R1a1_tr62CmiswcYphuv0pLmVA-%2B2s8r0Bkw%40mail.gmail.com
> [2] 
> https://www.postgresql.org/message-id/flat/CAHut%2BPsd5nyg-HG6rGO2_5jzXuSA1Eq5%2BB5J2VJo0Q2QWi-1HQ%40mail.gmail.com#1c2683756b32e267d96b7177ba95
> [3] 
> https://www.postgresql.org/message-id/CAA4eK1Jxu-3qxtkfA_dKoquQgGZVcB%2Bk9_-yT5%3D9GDEW84TF%2BA%40mail.gmail.com
>
> Kind Regards,
> Peter Smith.
> Fujitsu Australia
>
>


--
greg




Re: [PATCH] pgbench: add multiconnect option

2022-03-15 Thread Greg Stark
Hi guys,

It looks like David sent a patch and Fabien sent a followup patch. But
there hasn't been a whole lot of discussion or further patches.

It sounds like there are some basic questions about what the right
interface should be. Are there specific questions that would be
helpful for moving forward?




Re: Window Function "Run Conditions"

2022-03-15 Thread Greg Stark
This looks like an awesome addition.

I have one technical question...

Is it possible to actually transform the row_number case into a LIMIT
clause, or to make the planner's support for this case equivalent to it
(in which case we could replace the LIMIT clause planning with a
transform into a window function)?

The reason I ask is because the Limit plan node is actually quite a
bit more optimized than the general window function plan node. It
calculates cost estimates based on the limit and can support Top-N
sort nodes.

But the bigger question is whether this patch is ready for a committer
to look at? Were you able to resolve Andy Fan's bug report? Did you
resolve the two questions in the original email?




Re: Skip partition tuple routing with constant partition key

2022-03-15 Thread Greg Stark
There are a whole lot of different patches in this thread.

However this last one https://commitfest.postgresql.org/37/3270/
created by Amit seems like a fairly straightforward optimization that
can be evaluated on its own separately from the others and seems quite
mature. I'm actually inclined to set it to "Ready for Committer".

Incidentally, I did a quick read-through of the patch myself, and the
only question I have is how the parameters of the adaptive algorithm
were chosen. They seem ludicrously conservative to me, and a bit of
simple argument about how expensive an extra check is versus the time
saved in the binary search should be easy enough to come up with to
justify whatever values make sense.




Commitfest Update

2022-03-17 Thread Greg Stark
So far this commitfest these 10 patches have been marked committed.
That leaves us with 175 "Needs Review" and 28 "Ready for Committer",
so quite a ways to go ...

* FUNCAPI tuplestore helper function
* parse/analyze API refactoring
* Add wal_compression=zstd
* Add id's to various elements in protocol.sgml
* Fix alter data type of clustered/replica identity columns
* Fix flaky tests when synchronous_commit = off
* Support of time zone patterns - of, tzh and tzm
* Optionally automatically disable subscription on error
* fix race condition between DROP TABLESPACE and checkpointing
* ICU for global collation


In case anyone's looking for inspiration... Here are the list of
patches marked Ready for Committer:

* document the need to analyze partitioned tables
* ExecTypeSetColNames is fundamentally broken
* pg_dump - read data for some options from external file
* Add comment about startup process getting a free procState array slot always
* Doc patch for retryable xacts
* Function to log backtrace of postgres processes
* Allow multiple recursive self-references
* Expose get_query_def()
* Add checkpoint and redo LSN to LogCheckpointEnd log message
* Fast COPY FROM command for the foreign tables
* Consider parallel for LATERAL subqueries having LIMIT/OFFSET
* Faster pglz compression
* Full support for index LP_DEAD hint bits on standby
* KnownAssignedXidsGetAndSetXmin performance
* Parameter for planner estimates of recursive queries
* enhancing plpgsql API for debugging and tracing
* Make message at end-of-recovery less scary
* Identify missing publications from publisher while create/alter subscription.
* Allow providing restore_command as a command line option to pg_rewind
* Avoid erroring out when unable to remove or parse logical rewrite
files to save checkpoint work
* Allow batched insert during cross-partition updates
* Add callback table access method to reset filenode when dropping relation
* use has_privs_for_role for predefined roles
* range_agg with multirange inputs
* Add new reloption to views for enabling row level security
* Fix pg_rewind race condition just after promotion
* Mitigate pg_rewind race condition, if config file is enlarged concurrently.
* remove exclusive backup mode

-- 
greg




Re: Proposal for internal Numeric to Uint64 conversion function.

2022-03-17 Thread Greg Stark
On Fri, 11 Mar 2022 at 15:17, Tom Lane  wrote:
>
> Amul Sul  writes:

> >  Yeah, that's true, I am too not sure if we really need to refactor
> >  all those; If we want, I can give it a try.
>
> The patch as-presented isn't very compelling for
> lack of callers of the new function

Tom, are you saying you think we're not interested in just adding this
function unless it's part of this refactoring?

Amul, do you think if we did numeric_to_int64/numeric_to_uint64 as a
refactored API, plus a second patch that made numeric_pg_lsn and other
consumers use it, it would clean up the code significantly?

-- 
greg




Re: proposal: enhancing plpgsql debug API - returns text value of variable content

2022-03-17 Thread Greg Stark
It looks like this is -- like a lot of plpgsql patches -- having
difficulty catching the attention of reviewers and committers.
Aleksander asked for a test and Pavel put quite a bit of work into
adding a good test case. I actually like that there's a test because
it shows the API can be used effectively.

From my quick skim of this code it does indeed look like it's ready to
commit. It's mostly pretty mechanical code to expose a couple of fields
so that a debugger can see them.

Pavel, are you planning to add a debugger to contrib using this? The
test example code looks like it would already be kind of useful even
in this form.




Re: jsonpath syntax extensions

2022-03-21 Thread Greg Stark
This patch seems to be getting ignored. Like David I'm a bit puzzled
because it doesn't seem like an especially obscure or difficult patch
to review. Yet it's been multiple years without even a superficial
"does it meet the coding requirements" review let alone a design
review.

Can we get a volunteer to at least give it a quick once-over? I don't
think it's ideal to be doing this in the last CF but neither is it
very appetizing to just shift it to the next CF without a review after
two years...

On Thu, 27 Feb 2020 at 10:58, Nikita Glukhov  wrote:
>
> Hi, hackers!
>
> Attached patches implement several useful jsonpath syntax extensions.
> I already published them two years ago in the original SQL/JSON thread,
> but then after creation of separate threads for SQL/JSON functions and
> JSON_TABLE I forgot about them.
>
> A brief description of the patches:
>
> 1. Introduced new jsonpath modifier 'pg' which is used for enabling
> PostgreSQL-specific extensions.  This feature was already proposed in the
> discussion of jsonpath's like_regex implementation.
>
> 2. Added support for raw jbvObject and jbvArray JsonbValues inside jsonpath
> engine.  Now, jsonpath can operate with JSON arrays and objects only in
> jbvBinary form.  But with introduction of array and object constructors in
> patches #4 and #5 raw in-memory jsonb containers can appear in jsonpath 
> engine.
> In some places we can iterate through jbvArrays, in others we need to encode
> jbvArrays and jbvObjects into jbvBinay.
>
> 3. SQL/JSON sequence construction syntax. A simple comma-separated list can be
> used to concatenate single values or sequences into a single resulting 
> sequence.
>
>  SELECT jsonb_path_query('[1, 2, 3]', 'pg $[*], 4, 2 + 3');
>   jsonb_path_query
>  --
>   1
>   2
>   3
>   4
>   5
>
>  SELECT jsonb_path_query('{ "a": [1, 2, 3], "b": [4, 5] }',
> 'pg ($.a[*], $.b[*]) ? (@ % 2 == 1)');
>jsonb_path_query
>  --
>   1
>   3
>   5
>
>
> Patches #4-#6 implement ECMAScript-like syntax constructors and accessors:
>
> 4. Array construction syntax.
> This can also be considered as enclosing a sequence constructor into brackets.
>
>  SELECT jsonb_path_query('[1, 2, 3]', 'pg [$[*], 4, 2 + 3]');
>   jsonb_path_query
>  --
>   [1, 2, 3, 4, 5]
>
> Having this feature, jsonb_path_query_array() becomes somewhat redundant.
>
>
> 5. Object construction syntax.  It is useful for constructing derived objects
> from the interesting parts of the original object.  (But this is not 
> sufficient
> to "project" each object in array, item method like '.map()' is needed here.)
>
>  SELECT jsonb_path_query('{"b": 2}', 'pg { a : 1, b : $.b, "x y" : $.b + 3 
> }');
>  jsonb_path_query
>  ---
>   { "a" : 1, "b": 3, "x y": 5 }
>
> Fields with empty values are simply skipped regardless of lax/strict mode:
>
>  SELECT jsonb_path_query('{"a": 1}', 'pg { b : $.b, a : $.a ? (@ > 1) }');
>   jsonb_path_query
>  --
>   {}
>
>
> 6. Object subscription syntax.  This gives us ability to specify what key to
> extract on runtime.  The syntax is the same as ordinary array subscription
> syntax.
>
>  -- non-existent $.x is simply skipped in lax mode
>  SELECT jsonb_path_query('{"a": "b", "b": "c"}', 'pg $[$.a, "x", "a"]');
>   jsonb_path_query
>  --
>   "c"
>   "b"
>
>  SELECT jsonb_path_query('{"a": "b", "b": "c"}', 'pg $[$fld]', '{"fld": 
> "b"}');
>   jsonb_path_query
>  --
>   "c"
>
> --
> Nikita Glukhov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company



-- 
greg




Re: jsonpath syntax extensions

2022-03-21 Thread Greg Stark
Hm. Actually... These changes were split off from the JSON_TABLE
patches? Are they still separate or have they been merged into those
other patches since? I see the JSON_TABLE thread is getting more
comments -- do those reviews include these patches?

On Mon, 21 Mar 2022 at 16:09, Greg Stark  wrote:
>
> This patch seems to be getting ignored. Like David I'm a bit puzzled
> because it doesn't seem like an especially obscure or difficult patch
> to review. Yet it's been multiple years without even a superficial
> "does it meet the coding requirements" review let alone a design
> review.
>
> Can we get a volunteer to at least give it a quick once-over? I don't
> think it's ideal to be doing this in the last CF but neither is it
> very appetizing to just shift it to the next CF without a review after
> two years...
>
> On Thu, 27 Feb 2020 at 10:58, Nikita Glukhov  wrote:
> >
> > Hi, hackers!
> >
> > Attached patches implement several useful jsonpath syntax extensions.
> > I already published them two years ago in the original SQL/JSON thread,
> > but then after creation of separate threads for SQL/JSON functions and
> > JSON_TABLE I forgot about them.
> >
> > A brief description of the patches:
> >
> > 1. Introduced new jsonpath modifier 'pg' which is used for enabling
> > PostgreSQL-specific extensions.  This feature was already proposed in the
> > discussion of jsonpath's like_regex implementation.
> >
> > 2. Added support for raw jbvObject and jbvArray JsonbValues inside jsonpath
> > engine.  Now, jsonpath can operate with JSON arrays and objects only in
> > jbvBinary form.  But with introduction of array and object constructors in
> > patches #4 and #5 raw in-memory jsonb containers can appear in jsonpath 
> > engine.
> > In some places we can iterate through jbvArrays, in others we need to encode
> > jbvArrays and jbvObjects into jbvBinay.
> >
> > 3. SQL/JSON sequence construction syntax. A simple comma-separated list can 
> > be
> > used to concatenate single values or sequences into a single resulting 
> > sequence.
> >
> >  SELECT jsonb_path_query('[1, 2, 3]', 'pg $[*], 4, 2 + 3');
> >   jsonb_path_query
> >  --
> >   1
> >   2
> >   3
> >   4
> >   5
> >
> >  SELECT jsonb_path_query('{ "a": [1, 2, 3], "b": [4, 5] }',
> > 'pg ($.a[*], $.b[*]) ? (@ % 2 == 1)');
> >jsonb_path_query
> >  --
> >   1
> >   3
> >   5
> >
> >
> > Patches #4-#6 implement ECMAScript-like syntax constructors and accessors:
> >
> > 4. Array construction syntax.
> > This can also be considered as enclosing a sequence constructor into 
> > brackets.
> >
> >  SELECT jsonb_path_query('[1, 2, 3]', 'pg [$[*], 4, 2 + 3]');
> >   jsonb_path_query
> >  --
> >   [1, 2, 3, 4, 5]
> >
> > Having this feature, jsonb_path_query_array() becomes somewhat redundant.
> >
> >
> > 5. Object construction syntax.  It is useful for constructing derived 
> > objects
> > from the interesting parts of the original object.  (But this is not 
> > sufficient
> > to "project" each object in array, item method like '.map()' is needed 
> > here.)
> >
> >  SELECT jsonb_path_query('{"b": 2}', 'pg { a : 1, b : $.b, "x y" : $.b + 3 
> > }');
> >  jsonb_path_query
> >  ---
> >   { "a" : 1, "b": 3, "x y": 5 }
> >
> > Fields with empty values are simply skipped regardless of lax/strict mode:
> >
> >  SELECT jsonb_path_query('{"a": 1}', 'pg { b : $.b, a : $.a ? (@ > 1) }');
> >   jsonb_path_query
> >  --
> >   {}
> >
> >
> > 6. Object subscription syntax.  This gives us ability to specify what key to
> > extract on runtime.  The syntax is the same as ordinary array subscription
> > syntax.
> >
> >  -- non-existent $.x is simply skipped in lax mode
> >  SELECT jsonb_path_query('{"a": "b", "b": "c"}', 'pg $[$.a, "x", "a"]');
> >   jsonb_path_query
> >  --
> >   "c"
> >   "b"
> >
> >  SELECT jsonb_path_query('{"a": "b", "b": "c"}', 'pg $[$fld]', '{"fld": 
> > "b"}');
> >   jsonb_path_query
> >  --
> >   "c"
> >
> > --
> > Nikita Glukhov
> > Postgres Professional: http://www.postgrespro.com
> > The Russian Postgres Company
>
>
>
> --
> greg



-- 
greg




Re: Temporary tables versus wraparound... again

2022-03-21 Thread Greg Stark
No problem, I can update the patch and check on the fuzz.

But the actual conflict is just in the test and I'm not sure it's
really worth having a test at all. It's testing a pretty low level
detail. So I'm leaning toward fixing the conflict by just ripping the
test out.

Nathan also pointed out there was a simpler way to get the pid. I
don't think the way I was doing it was wrong but I'll double check
that.




Re: Commitfest manager for 2022-03

2022-03-21 Thread Greg Stark
On Mon, 21 Mar 2022 at 21:48, Andres Freund  wrote:
>
> Hi,
>
> On 2022-02-26 16:12:27 -0500, Greg Stark wrote:
> > I do have the time available. What I don't necessarily have is insight
> > into everything that needs to be done, especially behind the scenes.
>
> One thing that desperately needs doing, particularly during the last
> commitfest, is looking through CF entries and pruning stuff that shouldn't be
> there anymore or that are in the wrong state.

Thanks

> One can't just automatically mark all failing runs as "waiting on author"
> because sometimes there are issues with somebody else posting an incremental
> diff confusing cfbot or spurious test failures...

What I'm seeing is patches that are failing with either the
027_stream_regress.pl failure -- which I see is being actively worked
on in another thread -- or simply a timeout, which I'm not sure about
but may be the same issue.

But I'll do a pass and then do another pass later in the week when
those problems may have been ironed out.


--
greg




Re: Temporary tables versus wraparound... again

2022-03-22 Thread Greg Stark
Here's a rebased patch. I split the test into a separate patch that I
would lean to dropping. But at least it applies now.

I did look into pg_stat_get_backend_pid() and I guess it would work,
but going through the stats mechanism does seem like going the long
way around since we're already looking at the backendId info directly
here; we just weren't grabbing the pid.

I did make a small change, I renamed the checkTempNamespaceStatus()
function to checkTempNamespaceStatusAndPid(). It seems unlikely there
are any external consumers of this function (the only internal
consumer is autovacuum.c). But just in case I renamed it to protect
against any external modules failing from the added parameter.
From eb6ec2edfcb10aafc3874262276638932a97add7 Mon Sep 17 00:00:00 2001
From: Greg Stark 
Date: Tue, 22 Mar 2022 15:54:59 -0400
Subject: [PATCH v3 2/2] Add test for truncating temp tables advancing
 relfrozenxid

This test depends on other transactions not running at the same time
so that relfrozenxid can advance so it has to be moved to its own
schedule.
---
 src/test/regress/expected/temp.out | 21 +
 src/test/regress/parallel_schedule | 10 +++---
 src/test/regress/sql/temp.sql  | 19 +++
 3 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/src/test/regress/expected/temp.out b/src/test/regress/expected/temp.out
index a5b3ed34a3..1fee5521af 100644
--- a/src/test/regress/expected/temp.out
+++ b/src/test/regress/expected/temp.out
@@ -82,6 +82,27 @@ SELECT * FROM temptest;
 -
 (0 rows)
 
+DROP TABLE temptest;
+-- Test that ON COMMIT DELETE ROWS resets the relfrozenxid when the
+-- table is truncated. This requires this test not be run in parallel
+-- with other tests as concurrent transactions will hold back the
+-- globalxmin
+CREATE TEMP TABLE temptest(col int) ON COMMIT DELETE ROWS;
+SELECT relpages, reltuples, relallvisible, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset old_
+BEGIN;
+INSERT INTO temptest (select generate_series(1,1000));
+ANALYZE temptest; -- update relpages, reltuples
+COMMIT;
+SELECT relpages, reltuples, relallvisible, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset new_
+SELECT :old_relpages  = :new_relpages  AS pages_reset,
+   :old_reltuples = :new_reltuples AS tuples_reset,
+   :old_relallvisible = :new_relallvisible AS allvisible_reset,
+   :old_relfrozenxid <> :new_relfrozenxid  AS frozenxid_advanced;
+ pages_reset | tuples_reset | allvisible_reset | frozenxid_advanced 
+-+--+--+
+ t   | t| t| t
+(1 row)
+
 DROP TABLE temptest;
 -- Test ON COMMIT DROP
 BEGIN;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 6d8f524ae9..f919c2f978 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -116,10 +116,14 @@ test: json jsonb json_encoding jsonpath jsonpath_encoding jsonb_jsonpath
 # --
 # Another group of parallel tests
 # with depends on create_misc
-# NB: temp.sql does a reconnect which transiently uses 2 connections,
-# so keep this parallel group to at most 19 tests
 # --
-test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion truncate alter_table sequence polymorphism rowtypes returning largeobject with xml
+test: plancache limit plpgsql copy2 domain rangefuncs prepare conversion truncate alter_table sequence polymorphism rowtypes returning largeobject with xml
+
+# --
+# Run this alone because it transiently uses 2 connections and also
+# tests relfrozenxid advances when truncating temp tables
+# --
+test: temp
 
 # --
 # Another group of parallel tests
diff --git a/src/test/regress/sql/temp.sql b/src/test/regress/sql/temp.sql
index 424d12b283..5f0c39b5e7 100644
--- a/src/test/regress/sql/temp.sql
+++ b/src/test/regress/sql/temp.sql
@@ -79,6 +79,25 @@ SELECT * FROM temptest;
 
 DROP TABLE temptest;
 
+-- Test that ON COMMIT DELETE ROWS resets the relfrozenxid when the
+-- table is truncated. This requires this test not be run in parallel
+-- with other tests as concurrent transactions will hold back the
+-- globalxmin
+CREATE TEMP TABLE temptest(col int) ON COMMIT DELETE ROWS;
+
+SELECT relpages, reltuples, relallvisible, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset old_
+BEGIN;
+INSERT INTO temptest (select generate_series(1,1000));
+ANALYZE temptest; -- update relpages, reltuples
+COMMIT;
+SELECT relpages, reltuples, relallvisible, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset new_
+SELECT :old_relpages  = :new_relpages  AS pages_reset,
+   :old_reltuples = :new_reltuples AS tuples_reset,
+   :old_relallvisible = :new_relallvisible AS allvisible_reset,
+   :old_relfrozenxid <> :new_relfrozenxid  

Re: pg13.2: invalid memory alloc request size NNNN

2021-02-13 Thread Greg Stark
I think to get a size of -4 you would be trying to read a varlena
pointer pointing to four nul bytes. I bet if you run dd on the
corresponding block you'll find a chunk of nuls in the page. That
perhaps makes sense with ZFS where if a new page was linked to the
tree but never written it would be an uninitialized page rather than
the old data.

I'm becoming increasingly convinced that there are a lot of storage
systems out there that just lose data whenever they crash or lose
power. Systems that are supposed to be better than that.




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-23 Thread Greg Stark
On Tue, 23 Feb 2021 at 05:04, Andres Freund  wrote:
>
> ## Callbacks
>
> In the core AIO pieces there are two different types of callbacks at the
> moment:
>
> Shared callbacks, which can be invoked by any backend (normally the issuing
> backend / the AIO workers, but can be other backends if they are waiting for
> the IO to complete). For operations on shared resources (e.g. shared buffer
> reads/writes, or WAL writes) these shared callback needs to transition the
> state of the object the IO is being done for to completion. E.g. for a shared
> buffer read that means setting BM_VALID / unsetting BM_IO_IN_PROGRESS.
>
> The main reason these callbacks exist is that they make it safe for a backend
> to issue non-blocking IO on buffers (see the deadlock section above). As any
> blocked backend can cause the IO to complete, the deadlock danger is gone.

So firstly this is all just awesome work and I have questions but I
don't want them to come across in any way as criticism or as a demand
for more work. This is really great stuff, thank you so much!

The callbacks make me curious about two questions:

1) Is there a chance that a backend issues i/o, the i/o completes in
some other backend, and by the time this backend gets around to looking
at the buffer it's already been overwritten again? Do we have to
initiate I/O again, or have you found a way to arrange that this
backend has the buffer pinned from the time the i/o starts even though
it doesn't handle the completion?

2) Have you made (or considered making) things like sequential scans
(or more likely bitmap index scans) asynchronous at a higher level?
That is, issue a bunch of asynchronous i/o and then handle the pages
and return the tuples as the pages arrive. Since sequential scans and
bitmap scans don't guarantee to read the pages in order, they're
generally free to return tuples from any page in any order. I'm not
sure how much of a win that would actually be, since all the same i/o
would be getting executed and the savings in shared buffers would be
small, but if there are mostly hot pages you could imagine interleaving
a lot of in-memory pages with the few i/os instead of sitting idle
waiting for the async i/o to return.



> ## Stats
>
> There are two new views: pg_stat_aios showing AIOs that are currently
> in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.

This is impressive. How easy is it to correlate with system aio stats?


-- 
greg




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-24 Thread Greg Stark
I guess what I would be looking for in stats would be a way to tell
what the bandwidth, latency, and queue depth is. Ideally one day
broken down by relation/index and pg_stat_statement record.

I think seeing the actual in flight async requests in a connection is
probably not going to be very useful in production. It's very low
level and in production the user is just going to find that
overwhelming detail. It is kind of cool to see the progress in
sequential operations but I think that should be solved in a higher
level way than this anyways.

What we need to calculate these values would be the kinds of per-op
stats nfsiostat uses from /proc/self/mountstats:
https://utcc.utoronto.ca/~cks/space/blog/linux/NFSMountstatsNFSOps

So: the number of async reads we've initiated, how many callbacks have
been called, total cumulative elapsed time between i/o issued and i/o
completed, total bytes of i/o initiated, and total bytes of i/o
completed. As well, a counter of requests which returned errors (eof?
i/o error?), and if there are other locks or queues internal to
postgres, the total time spent in those states.
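
Just to illustrate the arithmetic (all of these column names are
hypothetical -- I'm not suggesting they exist in the patch), with
cumulative per-backend counters like that the interesting numbers fall
out trivially:

  -- hypothetical columns, for illustration only
  SELECT pid,
         reads_issued - reads_completed                AS queue_depth,
         read_elapsed_us / NULLIF(reads_completed, 0)  AS avg_latency_us,
         read_bytes_completed                          AS bytes_so_far
    FROM pg_stat_aio_backends;

Bandwidth would then come from sampling bytes_so_far twice and dividing
the delta by the wall-clock interval, the same way nfsiostat does with
the per-op counters.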

I have some vague idea that we should have a generic infrastructure
for stats that automatically counts things associated with plan nodes
and automatically bubbles that data up to the per-transaction,
per-backend, per-relation, and pg_stat_statements stats. But that's a
whole other ball of wax :)




Re: SSL SNI

2021-02-25 Thread Greg Stark
Hate to be that guy but

This still doesn't seem like it is IPv6-ready. Is there any harm in
having SNI with an IPv6 address there if it gets through?




Re: SSL SNI

2021-02-26 Thread Greg Stark
> Do you mean the IPv6 detection code is not correct?  What is the problem?

This bit will recognize ipv4 addresses but not ipv6 addresses:

+ /*
+ * Set Server Name Indication (SNI), but not if it's a literal IP address.
+ * (RFC 6066)
+ */
+ if (!(strspn(conn->pghost, "0123456789.") == strlen(conn->pghost) ||
+   strchr(conn->pghost, ':')))
+ {




Temporary tables versus wraparound... again

2020-11-08 Thread Greg Stark
We had an outage caused by transaction wraparound. And yes, one of the
first things I did on this site was check that we didn't have any
databases that were in danger of wraparound.

However since then we added a monitoring job that used a temporary
table with ON COMMIT DELETE ROWS. Since it was a simple monitoring job
it stayed connected to the database and used this small temporary
table for a very long period of time.

The temporary table never got vacuumed by autovacuum and never by the
monitoring job (since it was being truncated on every transaction why
would it need to be vacuumed...).
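
For illustration, the pattern looked roughly like this (names made up),
from a connection that stays up for months:

  -- created once at connection start
  CREATE TEMP TABLE monitor_snapshot(metric text, value numeric)
      ON COMMIT DELETE ROWS;

  -- then repeated forever, without ever disconnecting
  BEGIN;
  INSERT INTO monitor_snapshot
      SELECT 'connections', count(*) FROM pg_stat_activity;
  -- ... read the rows back out ...
  COMMIT;  -- empties the table, but its relfrozenxid never advances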

We've been around this bush before. Tom added orphaned table
protection to autovacuum precisely because temporary tables can cause
the datfrozenxid to get held back indefinitely. Then Michael Paquier
and Tsunakawa Takayuki both found it worth making this more
aggressive.

But none of that helped as the temporary schema was still in use so
they were not considered "orphaned" temp tables at all.

I think we need to add some warnings to autovacuum when it detects
*non* orphaned temporary tables that are older than the freeze
threshold.

However in the case of ON COMMIT DELETE ROWS we can do better. Why not
just reset the relfrozenxid and other stats as if the table was
freshly created when it's truncated?

I put together this quick patch to check the idea and it seems to
integrate fine in the code. I'm not sure about a few points but I
don't think they're showstoppers.

1) Should we update relpages and reltuples? I think so, but an argument
could be made that people might be depending on running analyze once
when the data is loaded and then not having to run analyze on every
data load.

2) adding the dependency on heapam.h to heap.c makes sense because of
heap_inplace_update, but it may be a bit annoying because I suspect
that's a useful sanity check that the tableam stuff hasn't been
bypassed

3) I added a check to the regression tests but I'm not sure it's a
good idea to actually commit it. It could fail if there's a parallel
transaction going on, and even moving the test to the serial schedule
might not guarantee that never happens due to autovacuum running
analyze?

I didn't actually add the warning to autovacuum yet.

-- 
greg
From 76eb00c43fba2a293dc4a079307e675e0eeaff06 Mon Sep 17 00:00:00 2001
From: Greg Stark 
Date: Sun, 8 Nov 2020 11:54:50 -0500
Subject: [PATCH] update relfrozenxmin when truncating temp tables

---
 src/backend/catalog/heap.c | 45 ++
 src/test/regress/expected/temp.out | 21 ++
 src/test/regress/parallel_schedule |  9 +---
 src/test/regress/sql/temp.sql  | 19 
 4 files changed, 91 insertions(+), 3 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 4cd7d76..ffe36bb 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -30,6 +30,7 @@
 #include "postgres.h"
 
 #include "access/genam.h"
+#include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/relation.h"
@@ -3277,6 +3278,48 @@ RelationTruncateIndexes(Relation heapRelation)
 }
 
 /*
+ * Reset the relfrozenxid and other stats to the same values used when
+ * creating tables. This is used after non-transactional truncation.
+ *
+ * This reduces the need for long-running programs to vacuum their own
+ * temporary tables (since they're not covered by autovacuum) at least in the
+ * case where they're ON COMMIT DELETE ROWS.
+ *
+ * see also src/backend/commands/vacuum.c vac_update_relstats()
+ * also see AddNewRelationTuple() above
+ */
+
+static void
+ResetVacStats(Relation rel)
+{
+	HeapTuple	ctup;
+	Form_pg_class pgcform;
+	Relation classRel;
+
+	/* Fetch a copy of the tuple to scribble on */
+	classRel = table_open(RelationRelationId, RowExclusiveLock);
+	ctup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(RelationGetRelid(rel)));
+	if (!HeapTupleIsValid(ctup))
+		elog(ERROR, "pg_class entry for relid %u vanished during truncation",
+			 RelationGetRelid(rel));
+	pgcform = (Form_pg_class) GETSTRUCT(ctup);
+
+	/*
+	 * Update relfrozenxid
+	 */
+
+	pgcform->relpages = 0;
+	pgcform->reltuples = 0;
+	pgcform->relallvisible = 0;
+	pgcform->relfrozenxid = RecentXmin;
+	pgcform->relminmxid = GetOldestMultiXactId();
+
+	heap_inplace_update(classRel, ctup);
+
+	table_close(classRel, RowExclusiveLock);
+}
+
+/*
  *	 heap_truncate
  *
  *	 This routine deletes all data within all the specified relations.
@@ -3340,6 +3383,7 @@ heap_truncate_one_rel(Relation rel)
 
 	/* Truncate the underlying relation */
 	table_relation_nontransactional_truncate(rel);
+	ResetVacStats(rel);
 
 	/* If the relation has indexes, truncate the indexes too */
 	RelationTruncateIndexes(rel);
@@ -3351,6 +3395,7 @@ heap_truncate_on

Re: Temporary tables versus wraparound... again

2020-11-08 Thread Greg Stark
On Mon, 9 Nov 2020 at 00:17, Noah Misch  wrote:
>
> > 2) adding the dependency on heapam.h to heap.c makes sense because of
> > heap_inplace_update bt it may be a bit annoying because I suspect
> > that's a useful sanity check that the tableam stuff hasn't been
> > bypassed
>
> That is not terrible.  How plausible would it be to call vac_update_relstats()
> for this, instead of reimplementing part of it?

It didn't seem worth it to change its API to add boolean flags to skip
setting some of the variables (I was originally only doing
relfrozenxid and relminmxid). Now that I'm doing most of the variables
maybe it makes a bit more sense.

> > @@ -3340,6 +3383,7 @@ heap_truncate_one_rel(Relation rel)
> >
> >   /* Truncate the underlying relation */
> >   table_relation_nontransactional_truncate(rel);
> > + ResetVacStats(rel);
>
> I didn't test, but I expect this will cause a stats reset for the second
> TRUNCATE here:
>
> CREATE TABLE t ();
> ...
> BEGIN;
> TRUNCATE t;
> TRUNCATE t;  -- inplace relfrozenxid reset
> ROLLBACK;  -- inplace reset survives
>
> Does that indeed happen?

Apparently no, see below. I have to say I was pretty puzzled by the
actual behaviour, which is that the rollback actually does roll back
the inplace update. But I *think* what is happening is that the first
truncate does an MVCC update, so the inplace update happens only to the
newly created tuple, which is never committed.

Thinking about it, this does worry me a bit. I wonder if inplace
update is really safe outside of vacuum, where we know we're not in a
transaction that can be rolled back. But IIRC doing a non-inplace
update on pg_class for these columns breaks other things. I don't know
if that's still true.

Also, in checking this question I realized I had missed 3d351d91. I
should be initializing reltuples to -1 not 0.


postgres=# vacuum t;
VACUUM
postgres=# select
relname,relpages,reltuples,relallvisible,relfrozenxid from pg_class
where oid='t'::regclass;
 relname | relpages | reltuples | relallvisible | relfrozenxid
-+--+---+---+--
 t   |9 |  2000 | 9 |15557
(1 row)

postgres=# begin;
BEGIN
postgres=*# truncate t;
TRUNCATE TABLE
postgres=*# truncate t;
TRUNCATE TABLE
postgres=*# select
relname,relpages,reltuples,relallvisible,relfrozenxid from pg_class
where oid='t'::regclass;
 relname | relpages | reltuples | relallvisible | relfrozenxid
-+--+---+---+--
 t   |0 | 0 | 0 |15562
(1 row)

postgres=*# abort;
ROLLBACK
postgres=# select
relname,relpages,reltuples,relallvisible,relfrozenxid from pg_class
where oid='t'::regclass;
 relname | relpages | reltuples | relallvisible | relfrozenxid
-+--+---+---+--
 t   |9 |  2000 | 9 |15557
(1 row)




-- 
greg




Re: don't allocate HashAgg hash tables when running explain only

2020-11-18 Thread Greg Stark
On Wed, 18 Nov 2020 at 05:40, Heikki Linnakangas  wrote:
>
> On 13/11/2020 18:10, Alexey Bashtanov wrote:
>> > I would appreciate if someone could have a look at the patch attached,
> > which makes executor skip initializing hash tables when doing explain only.
>
> Makes sense. Committed, thanks for the patch!


Egads. That seems like a backpatchable bug fix to me. Have we been
doing this all along?!

-- 
greg




Re: Is postgres ready for 2038?

2020-11-18 Thread Greg Stark
On Wed, 18 Nov 2020 at 12:22, Tom Lane  wrote:
>
> Andrew Dunstan  writes:
> > On 11/18/20 9:44 AM, Tom Lane wrote:
> >> Hmm.  Digging around, I see that Mkvcbuild.pm intentionally absorbs
> >> _USE_32BIT_TIME_T when building with a Perl that defines that.
> >> I don't know what the state of play is in terms of Windows Perl
> >> distributions getting off of that, but maybe we should press people
> >> to not be using such Perl builds.
>
> > I think there's a good argument to ban it if we're doing a 64 bit build
> > (and why would we do anything else?)
>
> I'm not really eager to ban it.  If somebody is building with an old
> Perl distribution, and doesn't particularly care that the installation
> will break in 2038, why should we force them to care?

Wait, is configuring with a Perl that has 32-bit time_t driving the
rest of Postgres to use 32-bit timestamps? That seems like the tail
wagging the dog.

It seems like a sensible compromise would be to have Postgres's
configure default to 64-bit time_t and have a flag to choose 32-bit
time_t and then have a configure check that errors out if the time_t
in Perl doesn't match with a hint to either find a newer Perl
distribution or configure with the flag to choose 32-bit. Ie, don't
silently assume users want 32-bit time_t but leave them the option to
choose it explicitly.


-- 
greg




Re: Proposal: Save user's original authenticated identity for logging

2021-01-31 Thread Greg Stark
On Fri, 29 Jan 2021 at 18:41, Tom Lane  wrote:
>
> Ah.  So basically, this comes into play when you consider that some
> outside-the-database entity is your "real" authenticated identity.
> That seems reasonable when using Kerberos or the like, though it's
> not real meaningful for traditional password-type authentication.
> I'd misunderstood your point before.

I wonder if there isn't room to handle this the other way around. To
configure Postgres to not need a CREATE ROLE for every role but
delegate the user management to the external authentication service.

So Postgres would consider the actual role to be the one kerberos said
it was even if that role didn't exist in pg_role. Presumably you would
want to delegate to a corresponding authorization system as well so if
the role was absent from pg_role (or more likely fit some pattern)
Postgres would ignore pg_role and consult the authorization system
configured like AD or whatever people use with Kerberos these days.


-- 
greg




Re: [PATCH] Document heuristics for parameterized paths

2021-12-15 Thread Greg Stark
On Mon, 6 Dec 2021 at 13:01, Steinar H. Gunderson
 wrote:
>
> +one that must cannot be delayed right away (because of outer join

must cannot?

-- 
greg




Re: [PATCH] Document heuristics for parameterized paths

2021-12-15 Thread Greg Stark
On Wed, 15 Dec 2021 at 23:22, Greg Stark  wrote:
>
> On Mon, 6 Dec 2021 at 13:01, Steinar H. Gunderson
>  wrote:
> >
> > +one that must cannot be delayed right away (because of outer join
>
> must cannot?

Actually on further reading... "delayed right away"?

-- 
greg




Re: Transparent column encryption

2021-12-15 Thread Greg Stark
> In the server, the encrypted datums are stored in types called
> encryptedr and encryptedd (for randomized and deterministic
> encryption).  These are essentially cousins of bytea.

Does that mean someone could go in with psql and select out the data
without any keys and just get a raw bytea-like representation? That
seems like a natural and useful thing to be able to do. For example to
allow dumping a table and loading it elsewhere and transferring keys
through some other channel (perhaps only as needed).




Re: Documenting when to retry on serialization failure

2021-12-15 Thread Greg Stark
Fwiw I think the real problem with automatic retries is that the SQL
interface doesn't lend itself to it because the server never really
knows if the command is going to be followed by a commit or more
commands.

I actually think if that problem were tackled it would very likely be
a highly appreciated option. Because I think there's a big overlap
between the set of users interested in higher isolation levels and the
set of users writing stored procedures defining their business logic.
They're both kind of "traditional" SQL engine approaches and both lend
themselves to the environment where you have a lot of programmers
working on a system and you're not able to do things like define
strict locking and update orderings.

So a lot of users are probably looking at something like "BEGIN;
SELECT create_customer_order(); COMMIT" and wondering why the
server can't handle automatically retrying the query if they get an
isolation failure.

There are actually other reasons why providing the whole logic for the
transaction up front with a promise that it'll be the whole
transaction is attractive. E.g. vacuum could ignore a transaction if
it knows the transaction will never look at the table it's
processing... Or automatic deadlock testing tools could extract the
list of tables being accessed and suggest "lock table" commands to put
at the head of the transaction sorted in a canonical order.

These things may not be easy but they're currently impossible for the
same reasons automatically retrying is. The executor doesn't know what
subsequent commands will be coming after the current one and doesn't
know whether it has the whole transaction.




Re: Add index scan progress to pg_stat_progress_vacuum

2021-12-16 Thread Greg Stark
I had a similar question. And I'm still not clear from the response
what exactly index_blks_total is and whether it addresses it.

I think I agree that a user is likely to want to see the progress in a
way they can understand which means for a single index at a time.

I think what you're describing is that index_blks_total and
index_blks_scanned are the totals across all the indexes? That isn't
clear from the definitions but if that's what you intend then I think
that would work.

(For what it's worth what I was imagining was having a pair of
counters for blocks scanned and max blocks in this index and a second
counter for number of indexes processed and max number of indexes. But
I don't think that's necessarily any better than what you have)




Re: WIP: WAL prefetch (another approach)

2021-12-16 Thread Greg Stark
On Fri, 26 Nov 2021 at 21:47, Tom Lane  wrote:
>
> Yeah ... on the one hand, that machine has shown signs of
> hard-to-reproduce flakiness, so it's easy to write off the failures
> I saw as hardware issues.  On the other hand, the flakiness I've
> seen has otherwise manifested as kernel crashes, which is nothing
> like the consistent test failures I was seeing with the patch.

Hm. I asked around and found a machine I can use that can run PPC
binaries, but it's actually, well, confusing. I think this is an x86
machine running Leopard which uses JIT to transparently run PPC
binaries. I'm not sure this is really a good test.

But if you're interested and can explain the tests to run I can try to
get the tests running on this machine:

IBUILD:~ gsstark$ uname -a
Darwin IBUILD.MIT.EDU 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15
16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386

IBUILD:~ gsstark$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.5.8
BuildVersion: 9L31a




Re: WIP: WAL prefetch (another approach)

2021-12-16 Thread Greg Stark
The actual hardware of this machine is a Mac Mini Core 2 Duo. I'm not
really clear how the emulation is done and whether it makes a
reasonable test environment or not.

Hardware Overview:

  Model Name: Mac mini
  Model Identifier: Macmini2,1
  Processor Name: Intel Core 2 Duo
  Processor Speed: 2 GHz
  Number Of Processors: 1
  Total Number Of Cores: 2
  L2 Cache: 4 MB
  Memory: 2 GB
  Bus Speed: 667 MHz
  Boot ROM Version: MM21.009A.B00




Re: Allow DELETE to use ORDER BY and LIMIT/OFFSET

2021-12-16 Thread Greg Stark
On Thu, 16 Dec 2021 at 22:18, Tom Lane  wrote:
>
> * If the sort order is underspecified, or you omit ORDER BY
> entirely, then it's not clear which rows will be operated on.
> The LIMIT might stop after just some of the rows in a peer
> group, and you can't predict which ones.

Meh, that never seemed very compelling to me. I think that's on the
user and there are *lots* of cases where the user can easily know
enough extra context to know that's what she wants. In particular I've
often wanted to delete one of two identical records and it would have
been really easy to just do a DELETE LIMIT 1. I know there are ways to
do it but it's always seemed like unnecessary hassle when there was a
natural way to express it.
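
For example, one existing way to do it is to go through ctid -- a
sketch, with made-up table and column names:

DELETE FROM t
 WHERE ctid = (SELECT ctid FROM t WHERE c = 42 LIMIT 1);

It works, but it's exactly the kind of hassle a DELETE ... LIMIT 1
would avoid.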

> * UPDATE/DELETE necessarily involve the equivalent of SELECT
> FOR UPDATE, which may cause the rows to be ordered more
> surprisingly than you expected, ie the sort happens *before*
> rows are replaced by their latest versions, which might have
> different sort keys.

This... is a real issue. Or is it? Just to be clear I think what
you're referring to is a case like:

INSERT INTO t values (1),(2)

In another session: BEGIN; UPDATE t set c=0 where c=2

DELETE FROM t ORDER BY c ASC LIMIT 1


In other session: COMMIT

Which row was deleted? In this case it was the row where c=1. The
UPDATE reported success (i.e. 1 row updated), so presumably it
happened "before" the delete. But the delete saw the ordering from
before it was blocked, i.e. c=1, c=2, so apparently it happened
"before" the update.

There are plenty of other ways to see the same surprising
serialization failure. If instead of a DELETE you did an UPDATE there
are any number of functions or other features that have been added
which can expose the order in which the updates happened.

The way to solve those cases would be to use serializable isolation
(or even just repeatable read in this case). Wouldn't that also solve
the DELETE serialization failure too?

-- 
greg




Re: pg_dump versus ancient server versions

2021-12-17 Thread Greg Stark
On Fri, 22 Oct 2021 at 19:27, Robert Haas  wrote:
>
> Another thing to think about in that regard: how likely is that
> PostgreSQL 7.4 and PostgreSQL 15 both compile and run on the same
> operating system? I suspect the answer is "not very." I seem to recall
> Greg Stark trying to compile really old versions of PostgreSQL for a
> conference talk some years ago, and he got back to a point where it
> just became impossible to make work on modern toolchains even with a
> decent amount of hackery.

That was when I compared sorting performance over time. I was able to
get Postgres to build back to the point where 64-bit architecture
support was added. From Andrew Dunstan's comment later in this thread
I'm guessing that was 7.2.

It looks like the earliest date on the graphs in the talk is
2002-11-27, which matches the 7.3 release date. I think building
earlier versions would have been doable if I had built them in 32-bit
mode.

-- 
greg




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Greg Stark
What tools and tool versions are you using to build? Is it just GCC for PPC?

There aren't any special build processes to make a fat binary involved?

On Thu, 16 Dec 2021 at 23:11, Tom Lane  wrote:
>
> Greg Stark  writes:
> > But if you're interested and can explain the tests to run I can try to
> > get the tests running on this machine:
>
> I'm not sure that machine is close enough to prove much, but by all
> means give it a go if you wish.  My test setup was explained in [1]:
>
> >> To recap, the test lashup is:
> >> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
> >> * Standard debug build (--enable-debug --enable-cassert)
> >> * Out-of-the-box configuration, except add wal_consistency_checking = all
> >> and configure a wal-streaming standby on the same machine
> >> * Repeatedly run "make installcheck-parallel", but skip the tablespace
> >> test to avoid issues with the standby trying to use the same directory
> >> * Delay long enough after each installcheck-parallel to let the
> >> standby catch up (the run proper is ~24 min, plus 2 min for catchup)
>
> Remember also that the code in question is not in HEAD; you'd
> need to apply Munro's patches, or check out some commit from
> around 2021-04-22.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us



-- 
greg




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Greg Stark
I have

IBUILD:postgresql gsstark$ ls /usr/bin/*gcc*
/usr/bin/gcc
/usr/bin/gcc-4.0
/usr/bin/gcc-4.2
/usr/bin/i686-apple-darwin9-gcc-4.0.1
/usr/bin/i686-apple-darwin9-gcc-4.2.1
/usr/bin/powerpc-apple-darwin9-gcc-4.0.1
/usr/bin/powerpc-apple-darwin9-gcc-4.2.1

I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1
or maybe 4.0.1. What version is on your G4?




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Greg Stark
Hm. I seem to have picked a bad checkout. I took the last one before
the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some
incompatibility with the emulation and the IPC stuff parallel workers
use.


2021-12-17 17:51:51.688 EST [50955] LOG:  background worker "parallel
worker" (PID 54073) was terminated by signal 10: Bus error
2021-12-17 17:51:51.688 EST [50955] DETAIL:  Failed process was
running: SELECT variance(unique1::int4), sum(unique1::int8),
regr_count(unique1::float8, unique1::float8)
FROM (SELECT * FROM tenk1
  UNION ALL SELECT * FROM tenk1
  UNION ALL SELECT * FROM tenk1
  UNION ALL SELECT * FROM tenk1) u;
2021-12-17 17:51:51.690 EST [50955] LOG:  terminating any other active
server processes
2021-12-17 17:51:51.748 EST [54078] FATAL:  the database system is in
recovery mode
2021-12-17 17:51:51.761 EST [50955] LOG:  all server processes
terminated; reinitializing




Re: WIP: WAL prefetch (another approach)

2021-12-17 Thread Greg Stark
On Fri, 17 Dec 2021 at 18:40, Tom Lane  wrote:
>
> Greg Stark  writes:
> > Hm. I seem to have picked a bad checkout. I took the last one before
> > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).
>
> FWIW, I think that's the first one *after* the revert.

Doh

But the bigger question is: are we really concerned about this flaky
problem? Is it worth investing time and money in? I can get money to
go buy a G4 or G5 and spend some time on it. It just seems a bit...
niche. But if it's a real bug that represents something broken on
other architectures that just happens to be easier to trigger here, it
might be worthwhile.

-- 
greg




Re: Getting rid of regression test input/ and output/ files

2021-12-20 Thread Greg Stark
On Sun, 19 Dec 2021 at 18:41, Corey Huinker  wrote:
>
> Which brings up a tangential question, is there value in having something 
> that brings in one or more env vars as psql vars directly. I'm thinking 
> something like:
>
> \importenv pattern [prefix]

Oof. That gives me the security heebie jeebies. Off the top of my head
PHP, CGI, SSH have all dealt with vulnerabilities caused by
accidentally importing variables they didn't intend to.

-- 
greg




Re: Unifying VACUUM VERBOSE and log_autovacuum_min_duration output

2021-12-21 Thread Greg Stark
I haven't read the patch yet. But some thoughts based on the posted output:

1) At first I was quite skeptical about losing the progress reporting.
I've often found it quite useful. But looking at the examples I'm
convinced.

Or rather, I think a better way to look at it is that the progress
output for the operator should be separated from the metrics logged.
As an operator what I want to see is some progress indicator --
"starting table scan", "overflow at x% of table scanned, starting
index scan", "processing index 1", "index 2", ... -- so I can have
some idea of how much longer the vacuum will take and see whether I
need to raise maintenance_work_mem and by how much. I don't need to
see all the metrics while it's running (see the example query at the
end of this mail for the kind of thing an operator can already poll).

2) I don't much like the format. I want to be able to parse the output
with awk or mtail or even just grep for relevant lines. Things like
"index scan not needed" make it hard to parse since you don't know
what it will look like if they are needed. I would have expected
something like "index scans: 0" which is actually already there up
above. I'm not clear how this line is meant to be read. Is it just
explaining *why* the index scan was skipped? It would just be missing
entirely if it wasn't skipped?

Fwiw, having it be parsable is why I wouldn't want it to be multiple
ereports. That would mean it could get interleaved with other errors
from other backends. That would be a disaster.
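
For reference, the kind of thing an operator can poll today while a
vacuum is running is pg_stat_progress_vacuum -- a rough sketch,
joining in the query text:

SELECT p.pid, a.query, p.phase,
       p.heap_blks_scanned, p.heap_blks_total,
       p.index_vacuum_count
  FROM pg_stat_progress_vacuum p
  JOIN pg_stat_activity a USING (pid);

But that's polling a view, not something that shows up in the log
output being discussed here.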




Re: new release pspg

2021-03-20 Thread Greg Stark
This is really cool. Now I just need to figure out how to integrate
it with using Emacs for my terminal. I still want to use Emacs to
enter and edit my queries, but it would be cool to be able to hit a
key, launch an xterm, and send the query output to pspg.




Re: fdatasync performance problem with large number of DB files

2021-03-21 Thread Greg Stark
On Wed, 10 Mar 2021 at 20:25, Tom Lane  wrote:
>
> So this means that in less-than-bleeding-edge kernels, syncfs can
> only be regarded as a dangerous toy.  If we expose an option to use
> it, there had better be large blinking warnings in the docs.

Isn't that true for fsync and everything else related on
less-than-bleeding-edge kernels anyways?

-- 
greg




Re: New IndexAM API controlling index vacuum strategies

2021-03-21 Thread Greg Stark
On Thu, 18 Mar 2021 at 14:37, Peter Geoghegan  wrote:

> They usually involve some *combination* of Postgres problems,
> application code problems, and DBA error. Not any one thing. I've seen
> problems with application code that runs DDL at scheduled intervals,
> which interacts badly with vacuum -- but only really on the rare
> occasions when freezing is required!

What I've seen is an application that regularly ran ANALYZE on a
table. This worked fine as long as vacuums took less than the interval
between analyzes (in this case 1h), but once vacuum started taking
longer than that interval the autovacuum worker would cancel its
vacuum every time because of the conflicting lock request.

That would have just continued until the wraparound vacuum, which
wouldn't self-cancel, except that there was also a daemon running
which would look for sessions stuck on a lock and kill the blocker --
which included killing the wraparound vacuum.

And yes, this daemon is obviously a terrible idea, but of course it
was meant for killing buggy user queries. It wasn't expecting to find
autovacuum jobs blocking things. The real surprise for that user was
that VACUUM could be blocked by things that someone would reasonably
want to run regularly, like ANALYZE.
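
For context, both VACUUM and ANALYZE take ShareUpdateExclusiveLock on
the table, and that lock mode conflicts with itself, so the manual
ANALYZE queues up behind the running autovacuum -- and a
non-wraparound autovacuum cancels itself when it notices it's blocking
someone. A query like this shows the queue (made-up table name):

SELECT l.pid, l.mode, l.granted, a.query
  FROM pg_locks l
  JOIN pg_stat_activity a USING (pid)
 WHERE l.relation = 'some_table'::regclass;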




-- 
greg




Re: shared memory stats: high level design decisions: consistency, dropping

2021-03-22 Thread Greg Stark
On Sun, 21 Mar 2021 at 18:16, Stephen Frost  wrote:
>
> Greetings,
>
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > I also believe that the snapshotting behavior has advantages in terms
> > of being able to perform multiple successive queries and get consistent
> > results from them.  Only the most trivial sorts of analysis don't need
> > that.
> >
> > In short, what you are proposing sounds absolutely disastrous for
> > usability of the stats views, and I for one will not sign off on it
> > being acceptable.
> >
> > I do think we could relax the consistency guarantees a little bit,
> > perhaps along the lines of only caching view rows that have already
> > been read, rather than grabbing everything up front.  But we can't
> > just toss the snapshot concept out the window.  It'd be like deciding
> > that nobody needs MVCC, or even any sort of repeatable read.
>
> This isn't the same use-case as traditional tables or relational
> concepts in general- there aren't any foreign keys for the fields that
> would actually be changing across these accesses to the shared memory
> stats- we're talking about gross stats numbers like the number of
> inserts into a table, not an employee_id column.  In short, I don't
> agree that this is a fair comparison.

I use these stats quite a bit and do lots of slicing and dicing with
them. I don't think it's as bad as Tom says, but I also don't think we
can be quite as loosey-goosey as I think Andres or Stephen might be
proposing either (though I note they haven't said they don't want any
consistency at all).

The cases where the consistency really matter for me is when I'm doing
math involving more than one statistic.

Typically that's ratios. E.g. with pg_stat_*_tables I routinely divide
seq_tup_read by seq_scan or idx_tup_* by idx_scans. I also often look
at the ratio between n_tup_upd and n_tup_hot_upd.
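
Concretely, the sort of thing I mean is a rough sketch like this
against pg_stat_user_tables:

SELECT relname,
       seq_tup_read::float8 / NULLIF(seq_scan, 0)   AS tup_per_seq_scan,
       idx_tup_fetch::float8 / NULLIF(idx_scan, 0)  AS tup_per_idx_scan,
       n_tup_hot_upd::float8 / NULLIF(n_tup_upd, 0) AS hot_update_ratio
  FROM pg_stat_user_tables;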

And no, it doesn't help that these are often large numbers after a
long time because I'm actually working with the first derivative of
these numbers using snapshots or a time series database. So if you
have the seq_tup_read incremented but not seq_scan incremented you
could get a wildly incorrect calculation of "tup read per seq scan"
which actually matters.

I don't think I've ever done math across stats for different objects.
I mean, I've plotted them together and looked at which was higher but
I don't think that's affected by some plots having peaks slightly out
of sync with the other. I suppose you could look at the ratio of
access patterns between two tables and know that they're only ever
accessed by a single code path at the same time and therefore the
ratios would be meaningful. But I don't think users would be surprised
to find they're not consistent that way either.

So I think we need to ensure that at least all the values for a single
row representing a single object are consistent. Or at least that
there's *some* correct way to retrieve a consistent row and that the
standard views use that. I don't think we need to guarantee that every
possible plan will always be consistent even if you access the row
multiple times in a self-join or use the lookup function on individual
columns separately.

-- 
greg




Re: Temporary tables versus wraparound... again

2021-10-12 Thread Greg Stark
Here's an updated patch. I added some warning messages to autovacuum.

One thing I learned trying to debug this situation in production is
that it's nigh impossible to find the pid of the session using a
temporary schema. The number in the schema refers to the backendId in
the sinval stuff for which there's no external way to look up the
corresponding pid. It would have been very helpful if autovacuum had
just told me which backend pid to kill.

I still have the regression test in the patch and as before I think
it's probably not worth committing. I'm leaning toward reverting that
section of the patch before committing.

Incidentally there's still a hole here where a new session can attach
to an existing temporary schema where a table was never truncated due
to a session dying abnormally. That new session could be a long-lived
session but never use the temporary schema causing the old table to
just sit there. Autovacuum has no way to tell it's not actually in
use. I tend to think the optimization to defer cleaning the temporary
schema until it's used might not really be an optimization. It still
needs to be cleaned someday so it's just moving the work around. Just
removing that optimization might be the easiest way to close this
hole. The only alternative I see is adding a flag to PROC or somewhere
where autovacuum can see if the backend has actually initialized the
temporary schema yet or not.
From 1315daf80e668b03cf2aab04106fe53ad606c9d0 Mon Sep 17 00:00:00 2001
From: Greg Stark 
Date: Tue, 12 Oct 2021 17:17:51 -0400
Subject: [PATCH v2] Update relfrozenxmin when truncating temp tables

Make ON COMMIT DELETE ROWS reset relfrozenxmin and other table stats
like normal truncate. Otherwise even typical short-lived transactions
using temporary tables can easily cause them to reach relfrozenxid.

Also add warnings when old temporary tables are found to still be in
use during autovacuum. Long lived sessions using temporary tables are
required to vacuum them themselves.

For the warning to be useful modify checkTempNamespaceStatus to return
the backend pid using it so that we can inform super-user which pid to
terminate. Otherwise it's quite tricky to determine as a user.
---
 src/backend/access/transam/varsup.c | 12 ++---
 src/backend/catalog/heap.c  | 53 +
 src/backend/catalog/namespace.c |  9 +--
 src/backend/postmaster/autovacuum.c | 48 -
 src/include/catalog/namespace.h |  2 +-
 src/test/regress/expected/temp.out  | 21 +++
 src/test/regress/parallel_schedule  |  9 ---
 src/test/regress/sql/temp.sql   | 19 +
 8 files changed, 151 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a6e98e7..84ccd2f 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -129,14 +129,16 @@ GetNewTransactionId(bool isSubXact)
 		 errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
 oldest_datname),
 		 errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
- "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+ "You might also need to commit or roll back old prepared transactions,\n"
+ "drop temporary tables, or drop stale replication slots.")));
 			else
 ereport(ERROR,
 		(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 		 errmsg("database is not accepting commands to avoid wraparound data loss in database with OID %u",
 oldest_datoid),
 		 errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
- "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+ "You might also need to commit or roll back old prepared transactions,\n"
+ "drop temporary tables, or drop stale replication slots.")));
 		}
 		else if (TransactionIdFollowsOrEquals(xid, xidWarnLimit))
 		{
@@ -149,14 +151,16 @@ GetNewTransactionId(bool isSubXact)
 oldest_datname,
 xidWrapLimit - xid),
 		 errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
- "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+ "You might also need to commit or roll back old prepared transactions,\n"
+ "drop temporary tables, or drop stale replication slots.")));
 			else
 ereport(WARNING,
 		(errmsg("database with OID %u must be vacuumed within %u transactions",
 oldest_datoid,
 xidWrapLimit - xid),
 		 errhint("

Re: Extending amcheck to check toast size and compression

2021-10-19 Thread Greg Stark
Right so here's a review.

I think the patch is committable as is. It's an improvement and it
does the job as promised. I do have some comments but I don't think
they're serious issues and would actually be pretty happy committing
it as is. Fwiw I didn't realize how short the patch was at first and
it probably doesn't need yet another review.


/* Toasted attributes too large to be untoasted should never be stored */
if (toast_pointer.va_rawsize > VARLENA_SIZE_LIMIT)

1) I know this used to say MaxAlloc -- personally I would probably go
with that but either is fine. But if you're going to have a separate
constant there could be more of a comment explaining why that's the
maximum -- probably with a pointer to MaxAlloc and postgres.h's
VARSIZE macros.

The switch statement at line 1443 seems a bit ... baroque. Is it
clearer than a simple "if (cmid != TOAST_PGLZ_COMPRESSION_ID && cmid !=
TOAST_LZ4_COMPRESSION_ID)"? I mean, I see this is easier to add more
cases to, but a case that falls through with no default is a lot of
cognitive overhead for what is, in the end, effectively a simple
branch.

Fwiw compilers aren't always the best at optimizing switch statements.
It's entirely possible a compiler may end up building a whole lookup
table of jumps for this thing. Not that it's performance critical but
...

But all that's more words than necessary for a minor style comment.


Fwiw I spent a few minutes thinking about and writing up this
suggestion and then only afterwards realized the code in question
wasn't from this patch. I'll mention it anyways but it's not relevant
to this patch review I guess :)

I found the whole expected_chunk_seq parameter thing a bit confusing
and less useful than it could be. I would instead suggest:

Allocate an array of the expected number of chunk numbers before
calling check_toast_tuple and then just gather the chunk_seq that are
returned. When it's finished you can do things like: a) Check if
they're all ascending and report index corruption if not. b) Check if
any numbers are missing and report which ones are missing and/or how
many. c) Check if there are duplicates and report that. These would
all be easier for a user to interpret than "index scan returned chunk
5 when expecting chunk 9".



On Tue, 4 May 2021 at 12:20, Mark Dilger  wrote:
>
> Hackers,
>
> During the version 14 development period, a few checks of toasted attributes 
> were written but never committed.  For the version 15 development cycle, I'd 
> like to consider extending the checks of toasted attributes.  First, no 
> toasted attribute should ever have a rawsize larger than the 1GB varlena 
> limit.  Second, no compressed toasted attribute should have an extsize 
> indicating that the toast expanded during toasting.  Such a extsize could 
> mean the compression code is malfunctioning, or that the extsize or rawsize 
> fields are corrupt.  Third, any compressed attribute should have a valid 
> compression method ID.
>
> These checks are cheap.  Actually retrieving the compressed toasted data and 
> checking that it uncompresses correctly would have very different performance 
> implications, but that is not included in this patch.
>
>
>
> —
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>


-- 
greg

On Wed, 14 Jul 2021 at 10:58, Mark Dilger  wrote:
>
>
>
> > On Jul 14, 2021, at 3:33 AM, Heikki Linnakangas  wrote:
> >
> >> +/* The largest valid toast va_rawsize */
> >> +#define VARLENA_SIZE_LIMIT 0x3FFF
> >> +
> >
> > Hmm, a toasted datum cannot be larger than MaxAllocSize, because it's 
> > reconstituted in a palloc'd datum, right?
>
> No datum size exceeds MaxAllocSize, and no datum expands when compressed 
> (because for those that do expand under any particular compression algorithm, 
> we opt to instead store the datum uncompressed), so no valid toast pointer 
> should contain a va_rawsize field greater than MaxAllocSize.  Any toast 
> pointers that have larger va_rawsize fields are therefore corrupt.
>
> VARLENA_SIZE_LIMIT is defined here equal to MaxAllocSize:
>
> src/include/utils/memutils.h:#define MaxAllocSize   ((Size) 
> 0x3fff) /* 1 gigabyte - 1 */
>
> Earlier versions of the patch used MaxAllocSize rather than defining 
> VARLENA_SIZE_LIMIT, but review comments suggested that was less clear.
>
> —
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>
>
>


-- 
greg




Re: Extending amcheck to check toast size and compression

2021-10-20 Thread Greg Stark
On Wed., Oct. 20, 2021, 12:41 Mark Dilger, 
wrote:

>
> I used a switch statement to trigger a compiler warning in such an event.
>

Catching better compiler diagnostics is an excellent reason to choose this
structure. I guess all I could ask is that the comment explaining the lack
of a default branch mention this motivation.

>


Thinking about ANALYZE stats and autovacuum and large non-uniform tables

2021-10-21 Thread Greg Stark
One problem I've seen in multiple databases and is when a table has a
mixture of data sets within it. E.g. A queue table where 99% of the
entries are "done" but most queries are working with the 1% that are
"new" or in other states. Often the statistics are skewed by the
"done" entries and give bad estimates for query planning when the
query is actually looking at the other rows.

We've always talked about this as a "skewed distribution" or
"intercolumn correlation" problem. And we've developed some tools for
dealing with those issues. But I've been thinking that's not the only
problem with these cases.

The problem I'm finding is that the distribution of these small
subsets can swing quickly. And understanding intercolumn correlations
even if we could do it perfectly would be no help at all.

Consider a table with millions of rows that are "done" but none that
are "pending". Inserting just a few hundred or thousand new pending
rows makes any estimates based on the existing statistics entirely
incorrect. Even if we had perfect statistics capable of making perfect
estimates they would be entirely wrong once a few inserts of pending
rows are done.
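
To make that concrete, a minimal sketch (made-up table and values):

CREATE TABLE queue (id bigserial PRIMARY KEY,
                    status text NOT NULL,
                    payload text);
-- millions of rows, virtually all with status = 'done', then:
INSERT INTO queue (status, payload)
SELECT 'pending', 'x' FROM generate_series(1, 1000);
-- the planner still estimates 'pending' rows as (nearly) nonexistent:
EXPLAIN SELECT * FROM queue WHERE status = 'pending';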

Worse, even n_dead_tup, n_mod_since_analyze, etc. are affected by
this. It's easy (at least on older versions; maybe Peter's work has
fixed this for btree) to get severe index bloat because vacuum doesn't
run for a long time relative to the size of the busy portion of the
table.

I'm imagining that to really tackle this we should be doing something
like noticing when inserts, updates, and deletes are affecting key
values that are "rare" according to the statistics, and triggering
autovacuum ANALYZE runs that use indexes to update the statistics only
for the relevant key ranges.

Obviously this could get complex quickly. Perhaps it should be
something users could declare. Some kind of "partitioned statistics"
where you declare a where clause and we generate statistics for the
table where that where clause is true. Then we could fairly easily
also count things like n_mod_since_analyze for that where clause too.
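
As a straw man, the declaration might look something like this --
purely hypothetical syntax, nothing like it exists today:

-- hypothetical: extended statistics restricted to a predicate
CREATE STATISTICS queue_pending_stats ON status, id
    FROM queue WHERE status <> 'done';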

And yes, partitioning the table could be a solution to this in some
cases. I think there are reasons why it isn't always going to work for
these issues though, not least that users will likely have other ways
they want to partition the data already.


-- 
greg




Re: How to retain lesser paths at add_path()?

2021-10-23 Thread Greg Stark
On Wed, 31 Jul 2019 at 11:45, Robert Haas  wrote:
>
> On Wed, Jul 31, 2019 at 11:07 AM Tom Lane  wrote:
> > What you'd want to do for something like the above, I think, is to
> > have some kind of figure of merit or other special marking for paths
> > that will have some possible special advantage in later planning
> > steps.  Then you can teach add_path that that's another dimension it
> > should consider, in the same way that paths with different sort orders
> > or parallizability attributes don't dominate each other.
>
> Yeah, but I have to admit that this whole design makes me kinda
> uncomfortable.  Every time somebody comes up with a new figure of
> merit, it increases not only the number of paths retained but also the
> cost of comparing two paths to possibly reject one of them.

But this is a fundamental problem with having lots of possible reasons
a path might be good. Not a problem with the algorithm.

I'm imagining that you're both right. If we had some sort of way to
look at the shape of the query and make decisions early on about what
figures of merit might be relevant then we might be able to pick just
a few. Sort of like how we currently only check paths that match some
join or other query feature.


-- 
greg




hot_standby_feedback vs excludeVacuum and snapshots

2018-03-29 Thread Greg Stark
I'm poking around to see debug a vacuuming problem and wondering if
I've found something more serious.

As far as I can tell the snapshots on HOT standby are built using a
list of running xids that the primary builds and puts in the WAL and
that seems to include all xids from transactions running in all
databases. The HOT standby would then build a snapshot and eventually
send the xmin of that snapshot back to the primary in the hot standby
feedback and that would block vacuuming tuples that might be visible
to the standby.

Many ages ago Alvaro sweated blood to ensure vacuums could run for
long periods of time without holding back the xmin horizon and
blocking other vacuums from cleaning up tuples. That's the purpose of
the excludeVacuum flag in GetCurrentVirtualXIDs(). That's possible
because we know vacuums won't insert any tuples that queries might try
to view and also vacuums won't try to perform any sql queries on other
tables.

I can't find anywhere that the standby snapshot building mechanism
gets this same information about which xids are actually vacuums that
can be ignored when building a snapshot. So I'm concerned that the hot
standby sending back its xmin would be effectively undermining this
mechanism, forcing vacuum xids to be included in the xmin horizon
and preventing tuples from being vacuumed.

Am I missing something obvious? Is this a known problem?

-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Greg Stark
On 3 April 2018 at 11:35, Anthony Iliopoulos  wrote:
> Hi Robert,
>
> Fully agree, and the errseq_t fixes have dealt exactly with the issue
> of making sure that the error is reported to all file descriptors that
> *happen to be open at the time of error*. But I think one would have a
> hard time defending a modification to the kernel where this is further
> extended to cover cases where:
>
> process A does write() on some file offset which fails writeback,
> fsync() gets EIO and exit()s.
>
> process B does write() on some other offset which succeeds writeback,
> but fsync() gets EIO due to (uncleared) failures of earlier process.


Surely that's exactly what process B would want? If it calls fsync and
gets a success and later finds out that the file is corrupt and didn't
match what was in memory it's not going to be happy.

This seems like an attempt to co-opt fsync for a new and different
purpose for which it's poorly designed. It's not an async error
reporting mechanism for writes. It would be useless as one, since any
process could come along, open your file, and eat the errors for
writes you performed. An async error reporting mechanism would have to
document which writes it was giving errors for and give you ways to
control that.

The semantics described here are useless for everyone. For a program
needing to know the error status of the writes it executed, it doesn't
know which writes are included in which fsync call. For a program
using fsync for its original intended purpose of guaranteeing that
all writes are synced to disk, it no longer has any guarantee at all.


> This would be a highly user-visible change of semantics from edge-
> triggered to level-triggered behavior.

It was always documented as level-triggered. This edge-triggered
concept is a complete surprise to application writers.

-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-03 Thread Greg Stark
On 3 April 2018 at 14:36, Anthony Iliopoulos  wrote:

> If EIO persists between invocations until explicitly cleared, a process
> cannot possibly make any decision as to if it should clear the error

I still don't understand what "clear the error" means here. The writes
still haven't been written out. We don't care about tracking errors,
we just care whether all the writes to the file have been flushed to
disk. By "clear the error" you mean throw away the dirty pages and
revert part of the file to some old data? Why would anyone ever want
that?

> But instead of deconstructing and debating the semantics of the
> current mechanism, why not come up with the ideal desired form of
> error reporting/tracking granularity etc., and see how this may be
> fitted into kernels as a new interface.

Because Postgres is portable software that won't be able to use some
Linux-specific interface. And doesn't really need any granular error
reporting system anyways. It just needs to know when all writes have
been synced to disk.

-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-08 Thread Greg Stark
On 8 April 2018 at 04:27, Craig Ringer  wrote:
> On 8 April 2018 at 10:16, Thomas Munro 
> wrote:
>
> If the kernel does writeback in the middle, how on earth is it supposed to
> know we expect to reopen the file and check back later?
>
> Should it just remember "this file had an error" forever, and tell every
> caller? In that case how could we recover? We'd need some new API to say
> "yeah, ok already, I'm redoing all my work since the last good fsync() so
> you can clear the error flag now". Otherwise it'd keep reporting an error
> after we did redo to recover, too.

There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep
track of any errors. We just need fsync to accurately report whether
all the buffers in the file have been written out. When you call fsync
again the kernel needs to initiate i/o on all the dirty buffers and
block until they complete successfully. If they complete successfully
then nobody cares whether they had some failure in the past when i/o
was initiated at some point in the past.

The problem is not that errors aren't being tracked correctly. The
problem is that dirty buffers are being marked clean when they haven't
been written out. They consider dirty filesystem buffers that can't be
written out due to hardware failure to be "a memory leak".

As long as any error means the kernel has discarded writes then
there's no real hope of any reliable operation through that interface.

Going to DIRECTIO is basically recognizing this. That the kernel
filesystem buffer provides no reliable interface so we need to
reimplement it ourselves in user space.

It's rather disheartening. Aside from having to do all that work we
have the added barrier that we don't have as much information about
the hardware as the kernel has. We don't know where raid stripes begin
and end, how big the memory controller buffers are or how to tell when
they're full or empty or how to flush them. etc etc. We also don't
know what else is going on on the machine.

-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Greg Stark
On 8 April 2018 at 22:47, Anthony Iliopoulos  wrote:
> On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
>> On 8 April 2018 at 04:27, Craig Ringer  wrote:
>> > On 8 April 2018 at 10:16, Thomas Munro 
>
> The question is, what should the kernel and application do in cases
> where this is simply not possible (according to freebsd that keeps
> dirty pages around after failure, for example, -EIO from the block
> layer is a contract for unrecoverable errors so it is pointless to
> keep them dirty). You'd need a specialized interface to clear-out
> the errors (and drop the dirty pages), or potentially just remount
> the filesystem.

Well firstly that's not necessarily the question. ENOSPC is not an
unrecoverable error. And even an unrecoverable error for a single
write doesn't mean the write will never be able to succeed in the
future.
But secondly doesn't such an interface already exist? When the device
is dropped any dirty pages already get dropped with it. What's the
point in dropping them but keeping the failing device?

But just to underline the point. "pointless to keep them dirty" is
exactly backwards from the application's point of view. If the error
writing to persistent media really is unrecoverable then it's all the
more critical that the pages be kept so the data can be copied to some
other device. The last thing user space expects to happen is if the
data can't be written to persistent storage then also immediately
delete it from RAM. (And the *really* last thing user space expects is
for this to happen and return no error.)

-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-09 Thread Greg Stark
On 9 April 2018 at 15:22, Anthony Iliopoulos  wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner.

Surely this is exactly what the kernel is there to manage. It has to
control how much memory is allowed to be full of dirty buffers in the
first place to ensure that the system won't get memory starved if it
can't clean them fast enough. That isn't even about persistent
hardware errors. Even when the hardware is working perfectly it can
only flush buffers so fast.  The whole point of the kernel is to
abstract away shared resources. It's not like user space has any
better view of the situation here. If Postgres implemented all this in
DIRECT_IO it would have exactly the same problem only with less
visibility into what the rest of the system is doing. If every
application implemented its own buffer cache we would be back in the
same boat only with a fragmented memory allocation.

> This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where for example original application crashes and never clears
> the error.

I still think we're speaking two different languages. There's no
application anywhere that's going to "clear the error". The
application has done the writes and if it's calling fsync it wants to
wait until the filesystem can arrange for the write to be persisted.
If the application could manage without the persistence then it
wouldn't have called fsync.

The only way to "clear out" the error would be by having the writes
succeed. There's no reason to think that wouldn't be possible
sometime. The filesystem could remap blocks or an administrator could
replace degraded raid device components. The only thing Postgres could
do to recover would be create a new file and move the data (reading
from the dirty buffer in memory!) to a new file anyways so we would
"clear the error" by just no longer calling fsync on the old file.

We always read fsync as a simple write barrier. That's what the
documentation promised and it's what Postgres always expected. It
sounds like the kernel implementors looked at it as some kind of
communication channel for reporting the status of specific writes
back to user-space. That's a much more complex problem and would have
an entirely different interface. I think this is why we're having so
much difficulty communicating.



> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does provide though indeed a persistent error indication that
> would allow Pg to simply reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO while their own writes succeed to be persisted.

Well if they're writing to the same file that had a previous error I
doubt there are many applications that would be happy to consider
their writes "persisted" when the file was corrupt. Ironically the
earlier discussion quoted talked about how applications that wanted
more granular communication would be using O_DIRECT -- but what we
have is fsync trying to be *too* granular such that it's impossible to
get any strong guarantees about anything with it.

>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
>
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO).

Honestly I don't think there's *any* way to use the current interface
to implement reliable operation. Even that embedded database using a
single process and keeping every file open all the time (which means
file descriptor limits constrain its scalability) can get silent
corruption whenever some other process like a backup program comes
along and calls fsync (or even sync?).



-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Greg Stark
On 9 April 2018 at 11:50, Anthony Iliopoulos  wrote:
> On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:
>> On 8 April 2018 at 22:47, Anthony Iliopoulos  wrote:

> To make things a bit simpler, let us focus on EIO for the moment.
> The contract between the block layer and the filesystem layer is
> assumed to be that of, when an EIO is propagated up to the fs,
> then you may assume that all possibilities for recovering have
> been exhausted in lower layers of the stack.

Well Postgres is using the filesystem. The interface between the block
layer and the filesystem may indeed need to be more complex, I
wouldn't know.

But I don't think "all possibilities" is a very useful concept.
Neither layer here is going to be perfect. They can only promise that
all possibilities that have actually been implemented have been
exhausted. And even among those only to the degree they can be done
automatically within the engineering tradeoffs and constraints. There
will always be cases like thin provisioned devices that an operator
can expand, or degraded raid arrays that can be repaired after a long
operation and so on. A network device can't be sure whether a remote
server may eventually come back or not and have to be reconfigured by
a human or system automation tool to point to the new server or new
network configuration.

> Right. This implies though that apart from the kernel having
> to keep around the dirtied-but-unrecoverable pages for an
> unbounded time, that there's further an interface for obtaining
> the exact failed pages so that you can read them back.

No, the interface we have is fsync which gives us that information
with the granularity of a single file. The database could in theory
recognize that fsync is not completing on a file and read that file
back and write it to a new file. More likely we would implement a
feature Oracle has of writing key files to multiple devices. But
currently in practice that's not what would happen, what would happen
would be a human would recognize that the database has stopped being
able to commit and there are hardware errors in the log and would stop
the database, take a backup, and restore onto a new working device.
The current interface is that there's one error and then Postgres
would pretty much have to say, "sorry, your database is corrupt and
the data is gone, restore from your backups". Which is pretty dismal.

> There is a clear responsibility of the application to keep
> its buffers around until a successful fsync(). The kernels
> do report the error (albeit with all the complexities of
> dealing with the interface), at which point the application
> may not assume that the write()s where ever even buffered
> in the kernel page cache in the first place.

Postgres cannot just store the entire database in RAM. It writes
things to the filesystem all the time. It calls fsync only when it
needs a write barrier to ensure consistency. That's only frequent on
the transaction log to ensure it's flushed before data modifications
and then periodically to checkpoint the data files. The amount of data
written between checkpoints can be arbitrarily large and Postgres has
no idea how much memory is available as filesystem buffers or how much
i/o bandwidth is available or other memory pressure there is. What
you're suggesting is that the application should have to babysit the
filesystem buffer cache and reimplement all of it in user-space
because the filesystem is free to throw away any data any time it
chooses?

The current interface to throw away filesystem buffer cache is
unmount. It sounds like the kernel would like a more granular way to
discard just part of a device which makes a lot of sense in the age of
large network block devices. But I don't think just saying that the
filesystem buffer cache is now something every application needs to
re-implement in user-space really helps with that, they're going to
have the same problems to solve.

-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Greg Stark
On 10 April 2018 at 02:59, Craig Ringer  wrote:

> Nitpick: In most cases the kernel reserves disk space immediately,
> before returning from write(). NFS seems to be the main exception
> here.

I'm kind of puzzled by this. Surely NFS servers store the data in the
filesystem using write(2) or the in-kernel equivalent? So if the
server is backed by a filesystem where write(2) preallocates space
surely the NFS server must behave as if it's preallocating as well? I
would expect NFS to provide basically the same set of possible
failures as the underlying filesystem (as long as you don't enable
nosync of course).

-- 
greg



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-11 Thread Greg Stark
On 10 April 2018 at 19:58, Joshua D. Drake  wrote:
> You can't unmount the file system --- that requires writing out all of the 
> pages
> such that the dirty bit is turned off.

I always wondered why Linux didn't implement umount -f. It's been in
BSD since forever and it's a major annoyance that it's missing in
Linux. Even without leaking memory it still leaks other resources,
causes confusion and awkward workarounds in UI and automation
software.

-- 
greg



Partitioning versus autovacuum

2019-09-30 Thread Greg Stark
So we now support `ANALYZE partitioned_table` which will gather statistics
for the main table by gathering stats from all the partitions.

However, as far as I can tell autovacuum will never actually trigger this
analyze, because we never generate any update records for the parent table
in the statistics. Have I missed something?

I didn't find any discussion of this in the threads from when partitioning
was committed but there were a lot of discussions and I could easily have
missed it.

Is there a story for this? Some way to configure things so that autovacuum
will analyze partitioned tables?

Or should we look at doing something? Maybe whenever we analyze a child we
should also update the parent -- and if there are no stats yet, run analyze on
it?

This may be a serious enough problem for users that it may warrant
backpatching. Not having any stats is resulting in some pretty weird plans
for us.
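
For what it's worth, the stopgap for now is just to run ANALYZE on the
parent yourself (e.g. from cron). A minimal sketch, using the placeholder
table name from above:

ANALYZE partitioned_table;  -- collects inheritance statistics across all partitions

-- the parent's per-column statistics then show up here
SELECT tablename, attname, inherited, n_distinct
  FROM pg_stats
 WHERE tablename = 'partitioned_table';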


Re: Partitioning versus autovacuum

2019-09-30 Thread Greg Stark
Actually I did just find it in the To-do wiki:

Have autoanalyze of parent tables occur when child tables are modified


   - http://archives.postgresql.org/pgsql-performance/2010-06/msg00137.php


On Mon., Sep. 30, 2019, 1:48 p.m. Greg Stark,  wrote:

> So we now support `ANALYZE partitioned_table` which will gather statistics
> for the main table by gathering stats from all the partitions.
>
> However as far as I can tell autovacuum will never actually trigger this
> analyze. Because we never generate any update records for the parent table
> in the statistics. Have I missed something?
>
> I didn't find any discussion of this in the threads from when partitioning
> was committed but there were a lot of discussions and I could easily have
> missed it.
>
> Is there a story for this? Some way to configure things so that autovacuum
> will analyze partitioned tables?
>
> Or should we look at doing something? Maybe whether we analyze a child we
> should also update the parent -- and if there's no stats yet run analyze on
> it?
>
> This may be a serious enough problem for users that it may warrant
> backpatching. Not having any stats is resulting in some pretty weird plans
> for us.
>


Re: Partitioning versus autovacuum

2019-09-30 Thread Greg Stark
Actually -- I'm sorry to follow up to myself (twice) -- but that's
wrong. That Todo item predates the modern partitioning code. It came
from when the partitioned statistics were added for inheritance trees.
The resulting comment almost doesn't make sense any more since it
talks about updates to the parent table and treats them as distinct
from updates to the children.

In any case it's actually not true any more as updates to the parent
table aren't even tracked any more -- see below. My modest proposal is
that we should count any updates that arrive through the parent table
as mods for both the parent and child.

A more ambitious proposal would have updates to the children also
count against the parent somehow but I'm not sure exactly how. And I'm
not sure we shouldn't be updating the parent statistics whenever we
run analyze on a child anyways but again I'm not sure how.

postgres=# create table p (i integer primary key, t text)
partition by range (i) ;
CREATE TABLE
postgres=# create table p0 partition of p for values from (0) to (10);
CREATE TABLE
postgres=# analyze p;
ANALYZE
postgres=# analyze p0;
ANALYZE
postgres=# select pg_stat_get_mod_since_analyze('p'::regclass) as p,
pg_stat_get_mod_since_analyze('p0'::regclass) as p0;
 p | p0
---+----
 0 |  0
(1 row)

postgres=# insert into p values (2);
INSERT 0 1
postgres=# select pg_stat_get_mod_since_analyze('p'::regclass) as p,
pg_stat_get_mod_since_analyze('p0'::regclass) as p0;
 p | p0
---+----
 0 |  1
(1 row)




Re: maintenance_work_mem used by Vacuum

2019-10-16 Thread Greg Stark
It's a bit unfortunate that we're doing the pending list flush while the
vacuum memory is allocated at all. Is there any reason other than the way
the callbacks are defined that gin doesn't do the pending list flush before
vacuum does the heap scan and before it allocates any memory using
maintenance_work_mem?

(I'm guessing doing it after vacuum is finished would have different
problems with tuples in the pending queue not getting vacuumed?)
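
(One way to sidestep that today, as a sketch rather than a fix for the
ordering, is to flush the pending list explicitly just before vacuuming, so
the flush doesn't run while vacuum's maintenance_work_mem allocation is
live. The index and table names here are made up:)

SELECT gin_clean_pending_list('my_gin_index');  -- in core since 9.6
VACUUM my_table;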


Re: Partitioning versus autovacuum

2019-10-18 Thread Greg Stark
At the risk of forking this thread... I think there's actually a
planner estimation bug here too.

Consider this test case of a simple partitioned table and a simple
join. The cardinality estimates for each partition and the Append node
are all perfectly accurate. But the estimate for the join is way off.
The corresponding test case without partitioning produces a perfect
cardinality estimate for the join.

I've never completely wrapped my head around the planner selectivity
estimations. IIRC join restrictions are treated differently from
single-relation restrictions. Perhaps what's happening here is that
the single-relation restrictions are being correctly estimated based
on the child partitions but the join restriction code hasn't been
taught the same tricks?



stark=# create table p (i integer, j integer) partition by list  (i);
CREATE TABLE

stark=# create table p0 partition of p for values in (0);
CREATE TABLE
stark=# create table p1 partition of p for values in (1);
CREATE TABLE

stark=# insert into p select 0,generate_series(1,1000);
INSERT 0 1000
stark=# insert into p select 1,generate_series(1,1000);
INSERT 0 1000

stark=# analyze p0;
ANALYZE
stark=# analyze p1;
ANALYZE

stark=# create table q (i integer);
CREATE TABLE
stark=# insert into q values (0);
INSERT 0 1
stark=# analyze q;
ANALYZE

-- Query partitioned table, get wildly off row estimates for join

stark=# explain analyze select * from q join p using (i) where j between 1 and 500;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.02..44.82 rows=5 width=8) (actual time=0.060..1.614 rows=500 loops=1)
   Hash Cond: (p0.i = q.i)
   ->  Append  (cost=0.00..40.00 rows=1000 width=8) (actual time=0.030..1.127 rows=1000 loops=1)
         ->  Seq Scan on p0  (cost=0.00..20.00 rows=500 width=8) (actual time=0.029..0.440 rows=500 loops=1)
               Filter: ((j >= 1) AND (j <= 500))
               Rows Removed by Filter: 500
         ->  Seq Scan on p1  (cost=0.00..20.00 rows=500 width=8) (actual time=0.018..0.461 rows=500 loops=1)
               Filter: ((j >= 1) AND (j <= 500))
               Rows Removed by Filter: 500
   ->  Hash  (cost=1.01..1.01 rows=1 width=4) (actual time=0.011..0.012 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4) (actual time=0.005..0.006 rows=1 loops=1)
 Planning time: 0.713 ms
 Execution time: 1.743 ms
(14 rows)


-- Query non-partitioned table get accurate row estimates for join

stark=# create table pp as (Select * from p);
SELECT 2000
stark=# analyze pp;
ANALYZE

stark=# explain analyze select * from q join pp using (i) where j between 1 and 500;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.02..48.77 rows=500 width=8) (actual time=0.027..0.412 rows=500 loops=1)
   Hash Cond: (pp.i = q.i)
   ->  Seq Scan on pp  (cost=0.00..39.00 rows=1000 width=8) (actual time=0.014..0.243 rows=1000 loops=1)
         Filter: ((j >= 1) AND (j <= 500))
         Rows Removed by Filter: 1000
   ->  Hash  (cost=1.01..1.01 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=1)
 Planning time: 0.160 ms
 Execution time: 0.456 ms
(10 rows)


Can't we give better table bloat stats easily?

2019-08-16 Thread Greg Stark
Everywhere I've worked I've seen people struggle with table bloat. It's
hard to even measure how much of it you have or where, let alone actually
fix it.

If you search online you'll find dozens of different queries estimating how
much empty space is in your tables and indexes based on pg_stats
statistics and suppositions about header lengths and padding and plugging
them into formulas of varying credibility.

But isn't this all just silliness these days? We could actually sum up the
space recorded in the FSM and get a much more trustworthy number in
milliseconds.

I rigged up a quick proof of concept and the code seems super simple and
quick. There are one or two tables where the number is a bit suspect, and
there's no FSM if vacuum hasn't run, but that seems pretty small potatoes
for such a huge help in reducing user pain.
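
(The same idea can be sketched today with the pg_freespacemap extension.
The table name below is a placeholder, and pages that have never been
vacuumed have no FSM entry, so treat the result as a lower bound:)

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

SELECT pg_size_pretty(sum(avail)::bigint) AS free_space,
       pg_size_pretty(pg_relation_size('some_table')) AS table_size
  FROM pg_freespace('some_table');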


LIKE foo% optimization easily defeated by OR?

2018-01-03 Thread Greg Stark
Our database has a query that looks like this -- note the OR between a
simple equality qual and a LIKE qual:

=> explain SELECT  1 AS one FROM "redirect_routes" WHERE
redirect_routes.path =  'foobar' OR redirect_routes.path LIKE
'foobar/%';
QUERY PLAN
---
 Seq Scan on redirect_routes  (cost=0.00..1776.23 rows=5 width=4)
   Filter: (((path)::text = 'foobar'::text) OR ((path)::text ~~
'foobar/%'::text))
(2 rows)


The database uses a sequential scan even though both of the sides of
that OR have valid indexes that can satisfy them (and for much lower
costs):

=> explain SELECT  1 AS one FROM "redirect_routes" WHERE
redirect_routes.path =  'foobar' ;
QUERY PLAN
---
 Index Only Scan using index_redirect_routes_on_path_text_pattern_ops
on redirect_routes  (cost=0.41..4.43 rows=1 width=4)
   Index Cond: (path = 'foobar'::text)
(2 rows)

=> explain SELECT  1 AS one FROM "redirect_routes" WHERE
redirect_routes.path LIKE 'foobar/%';
QUERY PLAN
---
 Index Only Scan using index_redirect_routes_on_path_text_pattern_ops
on redirect_routes  (cost=0.41..4.44 rows=4 width=4)
   Index Cond: ((path ~>=~ 'foobar/'::text) AND (path ~<~ 'foobar0'::text))
   Filter: ((path)::text ~~ 'foobar/%'::text)
(3 rows)


I'm guessing the LIKE optimization isn't clever enough to kick in when
it's buried under an OR? Does it only kick in at the top level of the
quals?

-- 
greg



Re: LIKE foo% optimization easily defeated by OR?

2018-01-03 Thread Greg Stark
On 3 January 2018 at 22:34, Alexander Korotkov
 wrote:

> I've checked similar case on database with PostgreSQL mailing lists.  It
> works for me.

Wow that's fascinating. I wonder why it's not kicking in for me. I
have checked with enable_seqscan=off but I'll have to do some more
investigations. I'll try it on different instances of the database.

-- 
greg



Re: LIKE foo% optimization easily defeated by OR?

2018-01-03 Thread Greg Stark
I think I found the bug... 18" from the monitor. I'll just be over
here with the paper bag over my head mumbling about running RESET ALL
before running tests...



Re: [HACKERS] Restricting maximum keep segments by repslots

2018-01-11 Thread Greg Stark
On 11 January 2018 at 09:55, Sergei Kornilov  wrote:
> if (active_pid != 0)
> status = "streaming";
> else
> status = "keeping";

Perhaps "idle" by analogy to a pg_stat_activity entry for a backend
that's connected but not doing anything.

> status = "may_lost";

Perhaps "stale" or "expired"?

Is this patch in bike-shed territory? Are there any questions about
whether we want the basic shape to look like this?

Fwiw I think there's a real need for this feature so I would like to
get it in for Postgres 11.

-- 
greg



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2018-01-11 Thread Greg Stark
On 11 January 2018 at 19:41, Peter Eisentraut
 wrote:

> Two, what to do when the memory limit is reached.  With the old
> accounting, this was easy, because we'd decide for each subtransaction
> independently whether to spill it to disk, when it has reached its 4096
> limit.  Now, we are looking at a global limit, so we have to find a
> transaction to spill in some other way.  The proposed patch searches
> through the entire list of transactions to find the largest one.  But as
> the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."

AIUI spilling to disk doesn't affect absorbing future updates; we
would just keep accumulating them in memory, right? We won't need to
unspill until it comes time to commit.

Is there any actual advantage to picking the largest transaction? It
means fewer spills and fewer unspills at commit time, but that's just a
bigger spike of i/o and more of a chance of spilling more than
necessary to get by. In the end it'll be more or less the same amount
of data read back, just all in one big spike when spilling and one big
spike when committing. If you spilled smaller transactions you would
have a small amount of i/o more frequently and have to read back small
amounts for many commits. But it would add up to the same amount of
i/o (or less if you avoid spilling more than necessary).

The real aim should be to try to pick the transaction that will be
committed furthest in the future. That gives you the most memory to
use for live transactions for the longest time and could let you
process the maximum amount of transactions without spilling them. So
either the oldest transaction (in the expectation that it's been open
a while and appears to be a long-lived batch job that will stay open
for a long time) or the youngest transaction (in the expectation that
all transactions are more or less equally long-lived) might make
sense.



-- 
greg



Re: WIP: a way forward on bootstrap data

2018-01-14 Thread Greg Stark
I'm 1000% on board with replacing oid constants with symbolic names
that get substituted programmatically.

However I wonder why we're bothering inventing a new syntax that
doesn't actually do much more than present static tabular data. If
things like magic proname->prosrc behaviour are not valuable then
we're not getting much out of this perl-friendly syntax that a simpler
more standard format wouldn't get us.

So, just as a straw man proposal: what if we just replaced the data
file with a CSV file that could be maintained in a spreadsheet? It
could easily be parsed by perl and we could even have perl scripts
that load the records into memory and modify them. You could even
imagine writing a postgres script that loaded the csv file into a
temporary table, did complex SQL updates or other DML, then wrote it
back out to a csv file.



Re: Bug in Physical Replication Slots (at least 9.5)?

2018-01-19 Thread Greg Stark
On 19 January 2017 at 09:37, Kyotaro HORIGUCHI
 wrote:
>
> Though I haven't look closer to how a modification is splitted
> into WAL records. A tuple cannot be so long. As a simple test, I
> observed rechder->xl_tot_len at the end of XLogRecordAssemble
> inserting an about 400KB not-so-compressable string into a text
> column, but I saw a series of many records with shorter than
> several thousand bytes.

I think the case to check is a commit record with many thousands of
subtransactions. I'm not sure you can fill a whole segment though.
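
(A quick way to generate such a commit record, just as a sketch: each
plpgsql EXCEPTION block opens a subtransaction, so something like the
following produces a single commit record carrying tens of thousands of
subxids. The table name is made up.)

CREATE TABLE subxact_test (i int);

DO $$
BEGIN
  FOR i IN 1..50000 LOOP
    BEGIN
      INSERT INTO subxact_test VALUES (i);
    EXCEPTION WHEN others THEN
      NULL;  -- the EXCEPTION clause is what forces a subtransaction per iteration
    END;
  END LOOP;
END;
$$;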


-- 
greg



Re: log bind parameter values on error

2019-12-09 Thread Greg Stark
On Mon, 9 Dec 2019 at 15:17, Tom Lane  wrote:
>
> Meh ... people will inevitably complain that they needed to see the
> whole value, and we'll end up having to add another configuration
> variable.  Let's not go there just yet.

I haven't been following this whole thread but my initial reaction is
that this particular configuration parameter would actually carry it's
weight.

While some people have applications where they know the expected size
of the data and can safely log the entire data to the logs, other
people deal with user-supplied data that can be very large. It would
suck to have something like Gitlab dump entire merge diffs to the log,
and those aren't even very large, probably under a megabyte. Some
users will have individual data values that are up to 1GB.

-- 
greg




Re: VACUUM memory management

2019-12-09 Thread Greg Stark
On Mon, 9 Dec 2019 at 14:03, Ibrar Ahmed  wrote:
> I'd
> actually argue that the segment size should be substantially smaller
> than 1 GB, like say 64MB; there are still some people running systems
> which are small enough that allocating 1 GB when we may need only 6
> bytes can drive the system into OOM."

I don't even see why you would allocate as much as 64MB. I would
think something around 1MB would be more sensible. So you might need
an array of segment pointers as long as a few thousand pointers, big
deal. We can handle repalloc on 8kB arrays pretty easily.

-- 
greg




Re: verbose cost estimate

2019-12-09 Thread Greg Stark
On Mon, 9 Dec 2019 at 17:14, Tomas Vondra  wrote:
>
> On Sat, Dec 07, 2019 at 11:34:12AM -0500, Tom Lane wrote:
> >Justin Pryzby  writes:
> >> Jeff said:
> >>> |What would I find very useful is a verbosity option to get the cost
> >>> |estimates expressed as a multiplier of each *_cost parameter, rather than
> >>> |just as a scalar.
> >
> >> It seems to me that's "just" a matter of redefining Cost and fixing 
> >> everything that breaks:
> >
> >> struct Cost {
> >> double seq, rand;
> >> double cpu_tuple, cpu_index_tuple, cpu_oper;
> >> double parallel_setup; // This is probably always in startup_cost 
> >> and never in run_cost
> >>  double parallel_tuple; // This is probably always in run_cost and 
> >> never in startup_cost
> >> double disable;
> >> };
> >
> >> I'm perhaps 50% done with that - is there some agreement that's a desirable
> >> goal and a good way to do it ?
> >
> >No, I think this will get rejected out of hand.  The implications for
> >the planner's speed and memory consumption seem quite unacceptable
> >for the size of the benefit.  What you're showing above probably
> >doubles the size of most Paths, and the added cycles in hot-spots
> >like add_path seem pretty daunting.
> >
>
> Yeah, that's an issue. But I have to admit my main issue with this
> proposal is that I have no idea how I'd interpret this Cost. I mean,
> what do the fields express for different types of paths? How do they
> contribute to the actual cost of that path?

What I think users would be able to do with this info is understand
which parameter to tweak to raise the estimated cost of the node.

Everyone knows if you see a index scan is being used but is taking
longer than a sequential scan then you might try raising
random_page_cost. But I rarely see people tweaking the more "exotic"
parameters like operator_tuple_cost or index_tuple_cost and when they
do they aren't really sure what nodes they're affecting...

I remember planning to do a very similar thing back in the 8.3 era and
never getting around to it. You could even imagine storing these for
the overall plan in the logs and building a large matrix of actual
execution values versus these broken out individual costs. Then it
becomes a standard linear optimization problem to find the optimal
values for each parameter to minimize inaccurate plan estimates (and
to identify cases where there are outliers).

-- 
greg




Re: On disable_cost

2019-12-12 Thread Greg Stark
On Wed, 11 Dec 2019 at 01:24, Laurenz Albe  wrote:
>
> On Tue, 2019-12-10 at 15:50 -0700, Jim Finnerty wrote:
> > As a proof of concept, I hacked around a bit today to re-purpose one of the
> > bits of the Cost structure to mean "is_disabled" so that we can distinguish
> > 'disabled' from 'non-disabled' paths without making the Cost structure any
> > bigger.  In fact, it's still a valid double.  The obvious choice would have
> > been to re-purpose the sign bit, but I've had occasion to exploit negative
> > costs before so for this POC I used the high-order bit of the fractional
> > bits of the double.  (see Wikipedia for double precision floating point for
> > the layout).
> >
> > The idea is to set a special bit when disable_cost is added to a cost.
> > Dedicating multiple bits instead of just 1 would be easily done, but as it
> > is we can accumulate many disable_costs without overflowing, so just
> > comparing the cost suffices.
>
> Doesn't that rely on a specific implementation of double precision (IEEE)?
> I thought that we don't want to limit ourselves to platforms with IEEE floats.

We could always implement it again in another format...

However, I wouldn't have expected to be bit twiddling. I would have
expected to use standard functions like ldexp to do this. In fact I
think if you use the high bit of the exponent you could do it entirely
using ldexp and regular double comparisons (with fabs).

I.e., to set the bit you set cost = ldexp(cost, __DBL_MAX_EXP__/2), and
to check for the bit being set you compare ilogb(cost) against
__DBL_MAX_EXP__/2. Hm, that doesn't handle the case where the cost is
already < 1, in which case I guess you would have to set it to 1
first. Or reserve the two high bits of the cost so you can represent
disabled values that had negative exponents before being disabled.

I wonder if it wouldn't be a lot cleaner and more flexible to just go
with a plain float for Cost and use the other 32 bits for counters and
bitmasks and still be ahead of the game. A double can store 2^1024 but
a float 2^128 which still feels like it should be more than enough to
store the kinds of costs plans have without the disabled costs. 2^128
milliseconds is still 10^28 years which is an awfully expensive
query...

-- 
greg




Re: On disable_cost

2019-12-15 Thread Greg Stark
I think this would be ready to abstract away behind a few functions that
could always be replaced by something else later...


However on further thought I really think just using a 32-bit float and 32
bits of other bitmaps or counters would be a better approach.


On Sun., Dec. 15, 2019, 14:54 Tom Lane,  wrote:

> Thomas Munro  writes:
> > On Wed, Dec 11, 2019 at 7:24 PM Laurenz Albe 
> wrote:
> >> Doesn't that rely on a specific implementation of double precision
> (IEEE)?
> >> I thought that we don't want to limit ourselves to platforms with IEEE
> floats.
>
> > Just by the way, you might want to read the second last paragraph of
> > the commit message for 02ddd499.  The dream is over, we're never going
> > to run on Vax.
>
> Still, the proposed hack is doubling down on IEEE dependency in a way
> that I quite dislike, in that (a) it doesn't just read float values
> but generates new ones (and assumes that the hardware/libc will react in
> a predictable way to them), (b) in a part of the code that has no damn
> business having close dependencies on float format, and (c) for a gain
> far smaller than what we got from the Ryu code.
>
> We have had prior discussions about whether 02ddd499 justifies adding
> more IEEE dependencies elsewhere.  I don't think it does.  IEEE 754
> is not the last word that will ever be said on floating-point arithmetic,
> any more than x86_64 is the last CPU architecture that anyone will ever
> care about.  We should keep our dependencies on it well circumscribed.
>
> regards, tom lane
>
>
>


Re: Implementing Incremental View Maintenance

2019-04-03 Thread Greg Stark
On Sun, 31 Mar 2019 at 23:22, Yugo Nagata  wrote:
>
> Firstly, this will handle simple definition views which includes only
> selection, projection, and join.  Standard aggregations (count, sum, avg,
> min, max) are not planned to be implemented in the first patch, but these
> are commonly used in materialized views, so I'll implement them later on.

It's fine to not have all the features from day 1 of course. But I
just picked up this comment and the followup talking about splitting
AVG into SUM and COUNT and I had a comment. When you do look at
tackling aggregates I don't think you should restrict yourself to
these specific standard aggregations. We have all the necessary
abstractions to handle all aggregations that are feasible, see
https://www.postgresql.org/docs/devel/xaggr.html#XAGGR-MOVING-AGGREGATES

What you need to do -- I think -- is store the "moving aggregate
state" before the final function. Then whenever a row is inserted or
deleted or updated (or whenever another column is updated which causes
the row to enter or leave the aggregation) apply either
aggmtransfn or aggminvtransfn to the state. I'm not sure if you want to
apply the final function on every update or only lazily; either may be
better depending on the usage.
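
(The pieces needed to drive this are already recorded in the catalog; a
quick sketch of where to look, e.g. for avg(numeric):)

SELECT aggfnoid, aggmtransfn, aggminvtransfn, aggmfinalfn,
       aggmtranstype::regtype
  FROM pg_aggregate
 WHERE aggfnoid = 'avg(numeric)'::regprocedure;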




Re: Statistical aggregate functions are not working with PARTIAL aggregation

2019-05-08 Thread Greg Stark
Don't we have a build farm animal that runs under valgrind that would
have caught this?




Re: Create TOAST table only if AM needs

2019-05-19 Thread Greg Stark
Just throwing this out there... Perhaps we should just disable toasting
for non-heap tables entirely for now?

That way at least people can use it, and storage plugins just have to be
able to deal with large datums on their own (or throw errors).

On Fri., May 17, 2019, 5:56 p.m. Ashwin Agrawal, 
wrote:

> On Fri, May 17, 2019 at 2:42 PM Robert Haas  wrote:
>
>> In any case, it bears saying that
>> tableam is a remarkable accomplishment regardless of whatever
>> shortcomings it has in this area or elsewhere.
>>
>
> Big +1 to this.
>


Re: New vacuum option to do only freezing

2018-11-05 Thread Greg Stark
On Mon 5 Nov 2018, 12:49 Robert Haas  wrote:
> That seems non-orthogonal.  We have an existing flag to force freezing
> (FREEZE); we don't need a second option that does the same thing.
> Skipping the index scans (and thus necessarily the second heap pass)
> is a separate behavior which we don't currently have a way to control.
>

It sounds like it might be better to name this "VACUUM (FAST)" and document
that it skips some of the normal (and necessary) work that vacuum does and
is only suitable for avoiding wraparounds and not sufficient for avoiding
bloat.

>


Query pattern that Postgres doesn't handle well

2018-02-24 Thread Greg Stark
At my day job I've been doing a fair amount of routine query and
schema optimization and I've noticed one particular query shape that
has repeatedly caused problems, and it's one we've talked about
before.

select * from table where simple-restriction OR (complex-subquery)

For example something like:

SELECT * FROM projects WHERE ispublic OR project_id IN (SELECT
project_id FROM project_members WHERE userid = ?)

Either half of this clause can easily be executed using indexes but
the combination forces Postgres to do a full sequential table scan.

The solution has been to rewrite each of these cases into:

SELECT * FROM projects WHERE ispublic
UNION ALL
SELECT * FROM projects WHERE NOT ispublic AND project_id IN (SELECT
project_id FROM project_members WHERE userid = ?)

But there are several problems with this.

1) It's often difficult to get the conditions to be exactly disjoint
such that you can use UNION ALL, and if you can't then UNION can be
slow and possibly even incorrect.

2) The resulting query is difficult to combine with other clauses.
They must either be copied into both sides of the UNION or wrapped
around the outside and hope that Postgres pushes them down into the
union branches where necessary. More complex conditions can't be
pushed down, and in particular combining two conditions that have been
rewritten to use union is very difficult.

3) It's extremely difficult to generate this kind of query in an ORM
such as ActiveRecord without breaking the abstractions and causing
surprising interactions.

I'm not sure I have much to offer here. I definitely don't know where
to start implementing it. But I know it's come up before on the list
and just thought I would mention that I've noticed it being a
recurring problem for at least this user.

-- 
greg



Re: Online enabling of checksums

2018-02-24 Thread Greg Stark
> The change of the checksum state is WAL logged with a new xlog record. All 
> the buffers written by the background worker are forcibly enabled full page 
> writes to make sure the checksum is fully updated on the standby even if no 
> actual contents of the buffer changed.

Hm. That doesn't sound necessary to me. If you generate a checkpoint
(or just wait until a new checkpoint has started) then go through and
do a normal xlog record for every page (any xlog record, a noop record
even) then the normal logic for full page writes ought to be
sufficient. If the noop record doesn't need a full page write it's
because someone else has already come in and done one and that one
will set the checksum. In fact if any page has an lsn > the checkpoint
start lsn for the checkpoint after the flag was flipped then you
wouldn't need to issue any record at all.



jsonlog logging only some messages?

2018-02-26 Thread Greg Stark
I tried loading the jsonlog module from
https://github.com/michaelpq/pg_plugins into Postgres 9.6.

However it seems it resulted in logs only for session log messages but
not any background worker log messages. We have log_checkpoints set
but there were no log messages in the json log about checkpoints. Nor
were there any from autovacuum.

Also I have log_destination set to stderr,csvlog and as I understand
it the jsonlog module effectively overrides the "stderr" output which
goes to the file named *.log (which I find super confusing,
incidentally). But I was expecting to get the csv file as well. We
didn't, we only got the *.log file with the (partial) json logs. Is
there something I'm missing here or is this unexpected?

-- 
greg



Re: jsonlog logging only some messages?

2018-02-27 Thread Greg Stark
On 27 February 2018 at 02:04, Michael Paquier  wrote:
> On Mon, Feb 26, 2018 at 05:38:56PM +0000, Greg Stark wrote:
>> I tried loading the jsonlog module from
>> https://github.com/michaelpq/pg_plugins into Postgres 9.6.
>>
>> However it seems it resulted in logs only for session log messages but
>> not any background worker log messages. We have log_checkpoints set
>> but there were no log messages in the json log about checkpoints. Nor
>> were there any from autovacuum.
>
> Hm.  I have just loaded jsonlog on a 9.6 and 10 instance where
> log_checkpoints is enabled with this background worker which logs a
> simple string every 10s:
> https://github.com/michaelpq/pg_plugins/tree/master/hello_world
>
> Both checkpoint logs and the logs of the bgworker are showing up for me.

Weird. I guess I have some more debugging with gdb to do.

>> Also I have log_destination set to stderr,cvslog and as I understand
>> it the jsonlog module effectively overrides the "stderr" output which
>> goes to the file named *.log (which I find super confusing,
>> incidentally). But I was expecting to get the csv file as well. We
>> didn't, we only got the *.log file with the (partial) json logs. Is
>> there something I'm missing here or is this unexpected?
>
> Yeah, that's unfortunately expected...  jsonlog enforces
> edata->output_to_server to false, which has the advantage of disabling
> extra log outputs, but also has the advantage of preventing duplicate
> entries into stderr.  One way that I can see to solve things would be to
> patch the backend and get output_to_server replaced by a filter which
> allows a plugin to choose what are the extra output types allowed.  In
> this case you don't want stderr later on, but would still want csvlog to
> happen, instead of having an all-or-nothing switch.  I haven't tested,
> but it could be possible to have as well jsonlog enforce Log_destination
> to only use csvlog after it generates its entry so as stderr is not
> duplicated by csvlog gets though.  Not sure how you would reset the
> parameter though, so you may perhaps want to invoke an extra plugin
> which outputs to csvlog as jsonlog cascades through things.


I would actually lean another way. If jsonlog opened *.json and wrote
there then it could leave the output_to_server field unchanged. It
does look like this might be a bit awkward with yet more of the static
functions needing to be duplicated.




-- 
greg


