Re: Hash Right join and seq scan

2024-07-05 Thread David Rowley
On Sat, 6 Jul 2024 at 02:43, James Pang  wrote:
>for nest loop path, since the first one estimated only "8" rows , 
> and they use partitionkeyid as joinkey and all are hash partitions , is it 
> better to estimate cost to  8 (loop times) * 1600 = 12800 (each one loop  map 
> to only 1 hash partition bitmap scan ,avg one partition cost), that's much 
> less than 398917.29 of all partitions ?

I'm not really sure where you're getting the numbers from here. The
outer side of the deepest nested loop has an 8 row estimate, not the
nested loop itself.  I'm unsure where the 1600 is from. I only see
1669.

As of now, we don't do a great job of costing for partition pruning
that will happen during execution.  We won't be inventing anything to
fix that in existing releases of PostgreSQL, so you'll need to either
adjust the code yourself, or find a workaround.

You've not shown us your schema, but perhaps enable_partitionwise_join
= on might help you. Other things that might help are further lowering
random_page_cost or raising effective_cache_size artificially high.
It's hard to tell from here how much random I/O is being costed into
the index scans.  You could determine this by checking if the nested
loop plan costs change as a result of doing further increases to
effective_cache_size. You could maybe nudge it up enough for it to win
over the hash join plan. It is possible that this won't work, however.
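For example, you could experiment at the session level first (the
values below are only illustrative, not recommendations):

SET enable_partitionwise_join = on;
SET random_page_cost = 1.1;
SET effective_cache_size = '100GB';
-- re-run EXPLAIN (ANALYZE, BUFFERS) on the query here and compare
-- the nested loop and hash join costs
RESET ALL;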

>  for secondary Nest Loop Anti join could be rows 299118 rows *  15.78(avg 
> index scan cost of one partition) = 4,720,082 that still much less than  
> 132168227.57 ?
>  for Hash Right join, is it possible to estimate by  8 seq partition 
> scan instead of all 32 hash partitions since the first query estimated 8 rows 
> only ?
>  extend statistics may help estimate count(partitionkeyid) based on 
> other columns bind variables, but looks like that did not help table join 
> case.

I can't quite follow this. You'll need to better explain where you're
getting these numbers for me to be able to understand.

David




Re: Hash Right join and seq scan

2024-07-04 Thread David Rowley
On Fri, 5 Jul 2024 at 12:50, James Pang  wrote:
>we have a daily vacuumdb and analyze job, generally speaking it's done in 
> seconds, sometimes it suddenly running more than tens of minutes with same 
> bind variable values and huge temp space got used  and at that time, explain 
> show  "Hash Anti join, Hash Right join" with seq scan two tables.

There was talk about adding costing for run-time partition pruning
factors but nothing was ever agreed, so nothing was done.  It's just
not that obvious to me how we'd do that.  If the Append had 10
partitions as subnodes, with an equality join condition, you could
assume we'll only match to 1 of those 10, but we've no idea at plan
time which one that'll be and the partitions might drastically vary in
size.  The best I think we could do is take the total cost of those 10
and divide by 10 to get the average cost.  It's much harder for range
conditions as those could match anything from 0 to all partitions. The
best suggestion I saw for that was to multiply the costs by
DEFAULT_INEQ_SEL.

I think for now, you might want to lower the random_page_cost or
increase effective_cache_size to encourage the nested loop -> index
scan plan. A good range for effective_cache_size is anywhere between
50% and 75% of your server's RAM.  However, that might not be ideal if your
server is under memory pressure from other running processes. It also
depends on how large shared_buffers are as a percentage of total RAM.

David




Re: a lot of shared buffers hit when planning for a simple query with primary access path

2024-07-01 Thread David Rowley
On Mon, 1 Jul 2024 at 22:20, Pavel Stehule  wrote:
> The planners get min/max range from indexes. So some user's indexes can be 
> bloated too with similar effect

I considered that, but it doesn't apply to this query as there are no
range quals.

David




Re: a lot of shared buffers hit when planning for a simple query with primary access path

2024-07-01 Thread David Rowley
On Mon, 1 Jul 2024 at 21:45, James Pang  wrote:
>Buffers: shared hit=110246 <<< here planning need access a lot of 
> buffers
>  Planning Time: 81.850 ms
>  Execution Time: 0.034 ms
>
>could you help why planning need a lot of shared buffers access ?

Perhaps you have lots of bloat in your system catalogue tables. That
could happen if you make heavy use of temporary tables. There are many
other reasons too. It's maybe worth doing some vacuum work on the
catalogue tables.
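For example, something like the following gives a rough idea of which
catalogue tables have accumulated dead rows (pg_stat_sys_tables only
knows what the statistics collector has seen):

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_sys_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- heavy temp table churn tends to bloat these in particular
VACUUM (VERBOSE) pg_catalog.pg_attribute;
VACUUM (VERBOSE) pg_catalog.pg_class;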

David




Re: Plan selection based on worst case scenario

2024-05-29 Thread David Rowley
On Thu, 30 May 2024 at 13:03, Darwin O'Connor  wrote:
> Is there a PostgreSQL setting that can control how it judges plans?

There's nothing like that, unfortunately.

> Here is a recent example of a query that finds the last time at a stop 
> filtered for a certain route it has to look up another table to find. 
> PostgreSQL initially chose the plan that cost "37357.45..37357.45" rather 
> than the one that cost "1.15..61088.32".
>
> transsee=# explain analyze select t.time+coalesce(t.timespent,0)*interval '1 
> second' from trackstopscurr t join tracktrip r on r.a=0 and r.route='501' and 
> r.id=t.trackid and t.stopid='4514' order by t.time desc limit 1;

>  Limit  (cost=37357.45..37357.45 rows=1 width=16) (actual 
> time=2667.674..2694.047 rows=1 loops=1)

>->  Nested Loop  (cost=182.60..36357.34 rows=1 width=16) 
> (actual time=381.913..2659.412 rows=1342 loops=3)
>  ->  Parallel Bitmap Heap Scan on trackstopscurr t  
> (cost=182.03..19048.63 rows=2014 width=14) (actual time=380.467..1231.788 
> rows=8097 loops=3)
>Recheck Cond: ((stopid)::text = '4514'::text)
>Heap Blocks: exact=8103
>->  Bitmap Index Scan on trackstopscurr_2  
> (cost=0.00..180.82 rows=4833 width=0) (actual time=382.653..382.653 
> rows=24379 loops=1)
>  Index Cond: ((stopid)::text = '4514'::text)
>  ->  Index Scan using tracktrip_0 on tracktrip r  
> (cost=0.57..8.59 rows=1 width=4) (actual time=0.175..0.175 rows=0 loops=24290)
>Index Cond: (id = t.trackid)
>Filter: ((a = 0) AND ((route)::text = '501'::text))
>Rows Removed by Filter: 1
>  Planning Time: 0.228 ms
>  Execution Time: 2694.077 ms

The problem here is primarily down to the poor estimates for the scan
on tracktrip.  You can see that the Nested Loop estimates 1 row, so
therefore the LIMIT costing code thinks LIMIT 1 will require reading
all rows, all 1 of them.  In which case that's expected to cost
36357.34, which is cheaper than the other plan which costs 61088.32 to
get one row.

If you can fix the row estimate to even estimate 2 rows rather than 1,
then it'll choose the other plan.  An estimate of 2 rows would mean
the total cost of the best path after the limit would be 61088.32 / 2
= 30544.16, which is cheaper than the 36357.34 of the bad plan.

You could try ANALYZE on tracktrip, or perhaps increasing the
statistics targets on the columns being queried here.
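For example (1000 is just an illustrative target):

ALTER TABLE tracktrip ALTER COLUMN a SET STATISTICS 1000;
ALTER TABLE tracktrip ALTER COLUMN route SET STATISTICS 1000;
ANALYZE tracktrip;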

If there's a correlation between the "a" and "route" columns then you
might want to try CREATE STATISTICS:

CREATE STATISTICS ON a,route FROM tracktrip;
ANALYZE tracktrip;

David




Re: Slow query in table where many rows were deleted. VACUUM FULL fixes it

2024-01-30 Thread David Rowley
On Wed, 31 Jan 2024 at 09:09, Philip Semanchuk
 wrote:
> So in your case those 5m rows that you deleted were probably still clogging 
> up your table until you ran VACUUM FULL.

It seems more likely to me that the VACUUM removed the rows and just
left empty pages in the table.  Since there's no index on expires_at,
the only way to answer that query is to Seq Scan and Seq Scan will
need to process those empty pages.  While that processing is very fast
if the page's item pointers array is empty, it could still be slow if
the page needs to be read from disk. Laurenz's request for the explain
(analyze, buffers) output with track_io_timing on will help confirm
this.

If it is just reading empty pages that's causing this issue then
adding that missing index would improve the situation after running
just plain VACUUM each time there's a bulk delete.
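Something along these lines, say (the table name here is a placeholder
since the schema wasn't shown):

SET track_io_timing = on;
-- re-run the query with EXPLAIN (ANALYZE, BUFFERS) to see I/O times
CREATE INDEX CONCURRENTLY ON the_table (expires_at);
VACUUM the_table;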

David




Re: I don't understand that EXPLAIN PLAN timings

2024-01-25 Thread David Rowley
On Fri, 26 Jan 2024 at 17:23, Jean-Christophe Boggio
 wrote:
> Let me know if I can do anything to provide you with more useful
> benchmark. The DB is still very small so it is easy to do tests.

What I was looking to find out was if there was some enable_* GUC that
you could turn off that would make the unaccounted time that you were
complaining about go away.

Because it seems when you reduce work_mem this unaccounted for time
goes away, it makes me think that some executor node is allocating a
load of memory during executor startup.  I was hoping to find out
which node is the offending one by the process of elimination.

Are you still seeing this unaccounted for time with your notebook?
i.e. the relatively high "Execution Time" reported by EXPLAIN ANALYZE
and low total actual execution time on the plan's top-level node.

I probably didn't need to mention the planning time as it seems
unlikely that disabling an enable* GUC would result in increased
planning time. However, it does not seem impossible that that *could*
happen.

David




Re: I don't understand that EXPLAIN PLAN timings

2024-01-25 Thread David Rowley
On Fri, 26 Jan 2024 at 02:31, Jean-Christophe Boggio
 wrote:
> You are absolutely correct : the EXPLAIN without ANALYZE gives about the same 
> results. Also, minimizing the amount of workmem in postgresql.conf changes 
> drastically the timings. So that means memory allocation is eating up a lot 
> of time _PER_QUERY_ ?

We do reuse palloc'd blocks when creating memory contexts, but I
believe only for the 1k and 8k blocks.  That likely allows most small
allocations in the executor to be done without malloc.  I'm speaking
in vague terms as I don't have the exact numbers to hand, but larger
allocations will go directly to malloc.

There was a bug fixed in [1] that did cause behaviour like this, but
you seem to be on 14.10 which will have that fix.  Also, the 2nd plan
you sent has no Memoize nodes.

I do wonder now if it was a bad idea to make Memoize build the hash
table on plan startup rather than delaying that until we fetch the
first tuple. I see Hash Join only builds its table during executor
run.

> Since we have quite some RAM on our machines, I dedicated as much as possible 
> to workmem (initially I was allocating 1GB) but this looks quite 
> counterproductive (I didn't think that memory was allocated every time, I 
> thought it was "available" for the current query but not necessarily used). 
> Is this an issue specific to that version of PostgreSQL? (I guess no) Or can 
> this be hardware-related? Or OS-related (both systems on which I have done 
> tests are running Ubuntu, I will try on Debian)?

It would be good to narrow down which plan node is causing this.  Can
you try disabling various planner enable_* GUCs before running EXPLAIN
(SUMMARY ON)  with \timing on and see if you can find
which enable_* GUC causes the EXPLAIN to run more quickly?  Just watch
out for variations in the timing of "Planning Time:". You're still
looking for a large portion of time not accounted for by planning
time.

I'd start with:

SET enable_memoize=0;
EXPLAIN (SUMMARY ON) ;
RESET enable_memoize;

SET enable_hashjoin=0;
EXPLAIN (SUMMARY ON) ;
RESET enable_hashjoin;

The following will show others that you could try.
select name,setting from pg_settings where name like 'enable%';

David

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1e731ed12aa




Re: I don't understand that EXPLAIN PLAN timings

2024-01-23 Thread David Rowley
On Tue, 23 Jan 2024 at 20:45, Jean-Christophe Boggio
 wrote:
> explain says actual time between 1.093→1.388 but final execution time says 
> 132.880ms?!?

The 1.388 indicates the total time spent in that node starting from
just before the node was executed for the first time up until the node
returned no more tuples.

The 132.88 ms is the total time spent to start up the plan, which
includes doing things like obtaining locks (if not already obtained
during planning), opening files and allocating memory for various
things needed for the plan.  After the top-level node returns NULL it
then will print out the plan before shutting the plan down. This
shutdown time is also included in the 132.88 ms.

I don't know where exactly the extra time is spent with your query,
but it must be in one of the operations that I mentioned which takes
place either before the top node is first executed or after the top
node returns NULL.

If you're using psql, if you do \timing on, how long does EXPLAIN take
without ANALYZE? That also goes through executor startup and shutdown.
It just skips the running the executor part.

David




Re: [EXTERNAL] Re: Performance down with JDBC 42

2023-11-05 Thread David Rowley
On Mon, 6 Nov 2023 at 08:37, Abraham, Danny  wrote:
>
> Both plans refer to the same DB.

JDBC is making use of PREPARE statements, whereas psql, unless you're
using PREPARE explicitly, is not.

> #1 – Fast – using psql or old JDBC driver

The absence of any $1 type parameters here shows that's a custom plan
that's planned specifically using the parameter values given.

> Slow – when using JDBC 42

Because this query has $1, $2, etc, that's a generic plan. When
looking up statistics, histogram bounds and MCV slots cannot be
checked; only ndistinct is used. If you have a skewed dataset, then
this might not be very good.

You might find things run better if you adjust postgresql.conf and set
plan_cache_mode = force_custom_plan then select pg_reload_conf();
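You can also try it at the session level first. As a simplified
illustration of the custom vs generic difference (table and column
names made up):

PREPARE q(int) AS SELECT * FROM some_table WHERE some_col = $1;
EXPLAIN (ANALYZE) EXECUTE q(42);  -- repeat this; after 5 executions a
                                  -- generic plan may be chosen
SET plan_cache_mode = force_custom_plan;
EXPLAIN (ANALYZE) EXECUTE q(42);  -- planned with the actual value again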

Please also check the documentation so that you understand the full
implications for that.

David




Re: Planning time is time-consuming

2023-09-11 Thread David Rowley
On Tue, 12 Sept 2023 at 02:27, Tom Lane  wrote:
>
> David Rowley  writes:
> > I'm not sure if you're asking for help here because you need planning
> > to be faster than it currently is, or if it's because you believe that
> > planning should always be faster than execution. If you think the
> > latter, then you're mistaken.
>
> Yeah.  I don't see anything particularly troubling here.  Taking
> circa three-quarters of a millisecond (on typical current hardware)
> to plan a four-way join on large tables is not unreasonable.

I took a few minutes to reverse engineer the tables in question (with
assistance from an AI bot) and ran the query in question.
Unsurprisingly, I also see planning as slower than execution, but with
a ratio of about planning being 12x slower than execution vs the
reported ~18x.

Planning Time: 0.581 ms
Execution Time: 0.048 ms

Nothing alarming in perf top of executing the query in pgbench with -M
simple.  I think this confirms the problem is just with expectations.

   5.09%  postgres  [.] AllocSetAlloc
   2.99%  postgres  [.] SearchCatCacheInternal
   2.52%  postgres  [.] palloc
   2.38%  postgres  [.] expression_tree_walker_impl
   1.82%  postgres  [.] add_path_precheck
   1.78%  postgres  [.] add_path
   1.73%  postgres  [.] MemoryContextAllocZeroAligned
   1.63%  postgres  [.] base_yyparse
   1.61%  postgres  [.] CatalogCacheComputeHashValue
   1.38%  postgres  [.] try_nestloop_path
   1.36%  postgres  [.] stack_is_too_deep
   1.33%  postgres  [.] add_paths_to_joinrel
   1.19%  postgres  [.] core_yylex
   1.18%  postgres  [.] lappend
   1.15%  postgres  [.] initial_cost_nestloop
   1.13%  postgres  [.] hash_search_with_hash_value
   1.01%  postgres  [.] palloc0
   0.95%  postgres  [.] get_memoize_path
   0.90%  postgres  [.] equal
   0.88%  postgres  [.] get_eclass_for_sort_expr
   0.81%  postgres  [.] compare_pathkeys
   0.80%  postgres  [.] bms_is_subset
   0.77%  postgres  [.] ResourceArrayRemove
   0.77%  postgres  [.] check_stack_depth
   0.77%  libc.so.6 [.] __memmove_avx_unaligned_erms
   0.74%  libc.so.6 [.] __memset_avx2_unaligned
   0.73%  postgres  [.] AllocSetFree
   0.71%  postgres  [.] final_cost_nestloop
   0.69%  postgres  [.] compare_path_costs_fuzzily
   0.68%  postgres  [.] initial_cost_mergejoin
   0.64%  libc.so.6 [.] __memset_avx2_unaligned_erms
   0.61%  postgres  [.] create_nestloop_path
   0.61%  postgres  [.] examine_variable
   0.59%  postgres  [.] hash_bytes
   0.56%  postgres  [.] truncate_useless_pathkeys
   0.56%  postgres  [.] bms_overlap

David




Re: Planning time is time-consuming

2023-09-11 Thread David Rowley
On Mon, 11 Sept 2023 at 21:54, Mikhail Balayan  wrote:
> Could it be a regression? I'll check it on PG14 when I get a chance.

I'm not sure if you're asking for help here because you need planning
to be faster than it currently is, or if it's because you believe that
planning should always be faster than execution. If you think the
latter, then you're mistaken. It seems to me that planning this query
is much more complex than executing it.  The outer
side of the inner-most nested loop finds 0 rows, so it need not scan
the inner side, which results in that nested loop producing 0 rows,
therefore the outer side of none of the subsequent nested loops find
any rows. This is why you see "(never executed)" in the EXPLAIN
ANALYZE.

You could use perf record or perf top to dig into what's slow.

On the other hand, please report back if you find PG14 to be much faster here.

You could also experiment with a set of tables which are empty.  It's
possible getting the relation sizes are a factor to consider here.
mdnblocks() needs to do a bit more work when the relation has multiple
segments.

David




Re: Planning time is time-consuming

2023-09-11 Thread David Rowley
On Mon, 11 Sept 2023 at 18:16, Laurenz Albe  wrote:
> Also, there are quite a lot of indexes on "test_db_bench_1".  On a test 
> database, drop some
> indexes and see if that makes a difference.

Yeah, I count 3 that either have the key columns as some prefix of
another index or are just a duplicate of some other index.

Getting rid of those 3 will save some time in create_index_paths().

David




Re: Queries containing ORDER BY and LIMIT started to work slowly

2023-08-30 Thread David Rowley
On Thu, 31 Aug 2023 at 06:32, Rondat Flyag  wrote:
> I tried VACUUM ANALYZE for three tables, but without success. I also tried to 
> set enable_seqscan=off and the query took even more time. If I set 
> enable_sort=off then the query takes a lot of time and I cancel it.
>
> Please see the attached query plans.

It's a little hard to comment here as I don't see what the plan was
before when you were happy with the performance. I also see the
queries you mentioned in the initial email don't match the plans.
There's no table called "isbns" in the query. I guess this is "asins"?

Likely you could get a faster plan if there was an index on
asins_statistics (asin_id).  That would allow a query plan that scans
the isbns_value_key index and performs a parameterised nested loop on
asins_statistics using the asins_statistics (asin_id) index.  Looking
at your schema, I don't see that index, so it's pretty hard to guess
why the plan used to be faster.  Even if the books/asins merge join
used to take place first, there'd have been no efficient way to join
to the asins_statistics table and preserve the Merge Join's order (I'm
assuming non-parameterized nested loops would be inefficient in this
case). Doing that would have also required the asins_statistics
(asin_id) index.  Are you sure that index wasn't dropped?

However, likely it's a waste of time to try to figure out what the
plan used to be. Better to focus on trying to make it faster. I
suggest you create the asins_statistics (asin_id) index. However, I
can't say with any level of confidence that the planner would opt to
use that index if it did exist.   Lowering random_page_cost or
increasing effective_cache_size would increase the chances of that.
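i.e. something like (the index name is just a suggestion):

CREATE INDEX CONCURRENTLY asins_statistics_asin_id_idx
    ON asins_statistics (asin_id);
ANALYZE asins_statistics;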

David




Re: Range partitioning query performance with date_trunc (vs timescaledb)

2023-08-29 Thread David Rowley
On Tue, 29 Aug 2023 at 19:40, Philippe Pepiot  wrote:
> I'm trying to implement some range partitioning on timeseries data. But it
> looks some queries involving date_trunc() doesn't make use of partitioning.
>
> BEGIN;
> CREATE TABLE test (
> time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
> value FLOAT NOT NULL
> ) PARTITION BY RANGE (time);
> CREATE INDEX test_time_idx ON test(time DESC);
> CREATE TABLE test_y2010 PARTITION OF test FOR VALUES FROM ('2020-01-01') TO 
> ('2021-01-01');
> CREATE TABLE test_y2011 PARTITION OF test FOR VALUES FROM ('2021-01-01') TO 
> ('2022-01-01');
> CREATE VIEW vtest AS SELECT DATE_TRUNC('year', time) AS time, SUM(value) AS 
> value FROM test GROUP BY 1;
> EXPLAIN (COSTS OFF) SELECT * FROM vtest WHERE time >= TIMESTAMP '2021-01-01';
> ROLLBACK;
>
> The plan query all partitions:

> I wonder if there is a way with a reasonable amount of SQL code to achieve 
> this
> with vanilla postgres ?

The only options I see for you are

1) partition by LIST (date_trunc('year', time)), or;
2) use a set-returning function instead of a view and pass the date
range you want to select from the underlying table via parameters.

I imagine you won't want to do #1. However, it would at least also
allow the aggregation to be performed before the Append if you SET
enable_partitionwise_aggregate=1.

#2 isn't as flexible as a view as you'd have to create another
function or expand the parameters of the existing one if you want to
add items to the WHERE clause.
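A rough, untested sketch of #2 using your table from above. If the
function gets inlined the pruning can happen at plan time, otherwise
the run-time pruning code can still do it using the parameter values:

CREATE FUNCTION vtest_between(t_from timestamp, t_to timestamp)
RETURNS TABLE (bucket timestamp, total float)
LANGUAGE sql STABLE
AS $$
    SELECT date_trunc('year', t.time), sum(t.value)
    FROM test t
    WHERE t.time >= t_from AND t.time < t_to
    GROUP BY 1
$$;

SELECT * FROM vtest_between('2021-01-01', '2022-01-01');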

Unfortunately, date_trunc is just a black box to partition pruning, so
it's not able to determine that DATE_TRUNC('year', time) >=
'2021-01-01'  is the same as time >= '2021-01-01'.  It would be
possible to make PostgreSQL do that, but that's a core code change,
not something that you can do from SQL.

David




Re: Slow query, possibly not using index

2023-08-28 Thread David Rowley
On Mon, 28 Aug 2023 at 19:21, Les  wrote:
> More important question is, how can I find out why the index was not auto 
> vacuumed.

You should have a look at pg_stat_user_tables.  It'll let you know if
the table is being autovacuumed and how often.  If you're concerned
about autovacuum not running properly, then you might want to lower
log_autovacuum_min_duration.  Generally, anything that takes a
conflicting lock will cause autovacuum to cancel so that the
conflicting locker can get through.  Things like ALTER TABLE or even
an ANALYZE running will cancel most autovacuum runs on tables.

Also, this is a fairly large table and you do have the standard
autovacuum settings.  Going by pgstattuple, the table has 39652836
tuples.  Autovacuum will trigger when the statistics indicate that 20%
of tuples are dead, which is about 8 million tuples. Perhaps that's
enough for the index scan to have to skip over a large enough number
of dead tuples to make it slow. You might want to consider lowering
the autovacuum scale factor for this table.
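For example (0.02 is only an illustrative value; for this table it
would trigger autovacuum at roughly 800k dead tuples):

ALTER TABLE media.block SET (autovacuum_vacuum_scale_factor = 0.02);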

Also, ensure you're not doing anything like calling pg_stat_reset();

It might be worth showing us the output of:

select * from pg_stat_user_tables where relid = 'media.block'::regclass;

David




Re: Partitioning update-heavy queue with hash partitions vs partial indexes

2023-08-10 Thread David Rowley
On Thu, 10 Aug 2023 at 20:36, Dorian Hoxha  wrote:
> > Do Not Use Partial Indexes as a Substitute for Partitioning
> > While a search in this larger index might have to descend through a couple 
> > more tree levels than a search in a smaller index, that's almost certainly 
> > going to be cheaper than the planner effort needed to select the 
> > appropriate one of the partial indexes. The core of the problem is that the 
> > system does not understand the relationship among the partial indexes, and 
> > will laboriously test each one to see if it's applicable to the current 
> > query.
>
> Would this be true in my case too?

Yes.  The process of determining which partial indexes are valid for
the given query must consider each index one at a time and validate
the index's WHERE clause against the query's WHERE clause to see if it
can be used.  There is no shortcut that sees you have a series of
partial indexes with WHERE id % 10 = N and just picks the right index
without checking each of them.

> Is it faster for the planner to select a correct partition(hash partitioning 
> on `id` column) instead of a correct partial index like in my case? I don't 
> think I'll need more than ~32 partitions/partial-indexes in an extreme 
> scenario.

I mean, test it and find out, but probably, yes, the partition pruning
code for hash partitioning is an O(1) operation and is very fast.
Once the given Constants have been hashed, finding the partition is
just a single divide operation away.
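If you do go with partitioning, the setup is just something like the
following (names made up, 32 partitions as per your extreme case):

CREATE TABLE queue (
    id      bigint NOT NULL,
    payload text
) PARTITION BY HASH (id);

CREATE TABLE queue_p0 PARTITION OF queue
    FOR VALUES WITH (MODULUS 32, REMAINDER 0);
CREATE TABLE queue_p1 PARTITION OF queue
    FOR VALUES WITH (MODULUS 32, REMAINDER 1);
-- ... and so on up to REMAINDER 31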

David




Re: High QPS, random index writes and vacuum

2023-04-17 Thread David Rowley
On Tue, 18 Apr 2023 at 12:35, peter plachta  wrote:
> I increased work_mem to 2Gb

maintenance_work_mem is the configuration option that vacuum uses to
control how much memory it'll make available for storage of dead
tuples. I believe 1GB would allow 178,956,970 tuples to be stored
before multiple passes would be required. The chunk of memory for dead
tuple storage is capped at 1GB.
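For a manual VACUUM you can set it per session, for example (value
illustrative; autovacuum instead uses autovacuum_work_mem, falling
back to maintenance_work_mem, from postgresql.conf):

SET maintenance_work_mem = '1GB';
VACUUM (VERBOSE) your_table;  -- placeholder table name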

David




Re: speeding up grafana sensor-data query on raspberry pi 3

2023-04-16 Thread David Rowley
On Mon, 17 Apr 2023 at 05:00, Clemens Eisserer  wrote:
> Example:
> table smartmeter with non-null column ts (timestamp with time zone)
> and brinc index on ts, no pk to avoid a btree index.
> Sensor values are stored every 5s, so for 1 month there are about 370k
> rows - and in total the table currently holds about 3M rows.
> The query to display the values for 1 month takes ~3s, with the bitmap
> heap scan as well as aggregation taking up most of the time, with
> sorting in between.

I know you likely don't have much RAM to spare here, but more work_mem
might help, even just 16MBs might be enough. This would help the Sort
and to a lesser extent the Bitmap Heap Scan too.

Also, if you'd opted to use PostgreSQL 14 or 15, then you could have
performed CREATE STATISTICS on your GROUP BY clause expression and
then run ANALYZE.  That might cause the planner to flip to a Hash
Aggregate which would eliminate the Sort before aggregation.  You'd
only need to sort 236 rows after the Hash Aggregate for the ORDER BY.
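For example (I'm guessing at the GROUP BY expression here; adjust to
whatever yours actually is, and the expression statistics need PG14+):

SET work_mem = '16MB';

CREATE STATISTICS smartmeter_ts_day_stats
    ON (date_trunc('day', ts)) FROM smartmeter;
ANALYZE smartmeter;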

Plus, what Justin said.

David




Re: multicolumn partitioning help

2023-03-15 Thread David Rowley
On Thu, 16 Mar 2023 at 00:47, James Robertson  wrote:
> or do we get?
>
> TopLevelTable
> |
> |> worker-thread 1 (default catch)
> |> worker thread 2 -> sub-table 1.1
> |> worker thread 3 -> sub-table 1.2
> |> worker thread 4 -> sub-table 1.n
> |
> |> worker-thread 5 (default catch)
> |> worker thread 6 -> sub-table 2.1
> |> worker thread 7 -> sub-table 2.2
> |> worker thread 8 -> sub-table 2.n

The planner generally flattens out the scans to each partition into a
single Append or MergeAppend node.  Append nodes can be parallelised.
Assuming there's no reason that a particular partition can't support
it, the parallel workers can be distributed to assist without
restriction to which partition they help with. Multiple workers can
even help with a single partition. Workers can then move over to
helping with other partitions when they're done with the partition
they've been working on. I believe some other databases do or did at
least restrict parallelism to 1 worker per partition (which perhaps is
why you raised this).  There's no such restriction with PostgreSQL.

> Summary:
> 1) if we create nested partitions, do we create performance issues:

If you create too many partitions, it can create performance issues.
You should look at the partitioning best practices section of the
documents for details about that. I recommend a careful read of those.

> 2) if nested partitions are the solutions, what is the point of multi-column 
> partitioning?

There are many reasons.  If you have multiple levels of partitioning,
then the partition pruning done during planning is going to have more
work to do as it'll be executed once, then once again for each
partitioned table remaining after running it for the first level.
Also, it seems much easier to PARTITION BY HASH(a,b) than to first do
HASH(a) then another level to HASH(b).  However, there may be
advantages to having multiple levels here as the planner would still
be able to prune partitions if the WHERE clause didn't contain any
quals like "b = ".  The key take away here is that they're
different, so we support both.

> wish list) wouldn't it be neat if we can do mult-mode multi-column? like 
> PARTITION BY RANGE (EXTRACT(YEAR FROM dob)) LIST (SUBSTRING(hash, 1, 1));

Effectively, multi-level partitioning gives you that; it's just that
the DDL is different from how you wrote it.
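Roughly like this, for example (untested sketch; partition names and
bounds made up):

CREATE TABLE humans (
    hash  bytea,
    fname text,
    dob   date
) PARTITION BY RANGE (EXTRACT(YEAR FROM dob));

CREATE TABLE humans_2023 PARTITION OF humans
    FOR VALUES FROM (2023) TO (2024)
    PARTITION BY LIST (substr(hash, 1, 1));

-- then LIST leaf partitions of humans_2023, one per leading byte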

David




Re: multicolumn partitioning help

2023-03-15 Thread David Rowley
On Wed, 15 Mar 2023 at 10:41, Laurenz Albe  wrote:
> I think you want subpartitioning, like
>
>   CREATE TABLE humans (
>  hash bytea,
>  fname text,
>  dob date
>   ) PARTITION BY LIST (EXTRACT (YEAR FROM dob));

This may be perfectly fine, but it is also important to highlight that
partitioning in this way may hinder partition pruning.

If the first level partitioned table was to be BY RANGE (dob); then
the partitions could be defined like FOR VALUES FROM ('2023-01-01') TO
('2024-01-01'). For a query that had something like WHERE dob =
'2023-03-16', then PostgreSQL could prune away all the partitions for
the other years.  The same wouldn't occur if the table was partitioned
by LIST (EXTRACT (YEAR FROM dob)) unless you added a AND EXTRACT (YEAR
FROM dob) = 2023 to the query's WHERE clause.
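A minimal sketch of that, for example:

CREATE TABLE humans (
    hash  bytea,
    fname text,
    dob   date
) PARTITION BY RANGE (dob);

CREATE TABLE humans_2023 PARTITION OF humans
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

-- WHERE dob = '2023-03-16' can now prune to humans_2023 at plan time
EXPLAIN SELECT * FROM humans WHERE dob = '2023-03-16';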

Robert, there are a few tips about partitioning in [1] that you may
wish to review.

David

[1] https://www.postgresql.org/docs/devel/ddl-partitioning.html




Re: Window Functions & Table Partitions

2023-02-20 Thread David Rowley
On Fri, 10 Feb 2023 at 06:40, Benjamin Tingle  wrote:
> Thanks for the helpful response david! I'll have a shot at getting the patch 
> to work myself & submitting to pgsql-hackers.

I took some time today for this and fixed up a few mistakes in the
patch and added it to the March commitfest [1].  Time is ticking away
for v16, so given this is a fairly trivial patch, I thought it might
be nice to have it.

Any discussion on the patch can be directed at [2]

David

[1] https://commitfest.postgresql.org/42/4198/
[2] 
https://www.postgresql.org/message-id/flat/caaphdvojkdbr3mr59jxmacybyhb6q_5qpru+dy93en8wm+x...@mail.gmail.com

> Ben
>
> On Wed, Feb 8, 2023 at 2:36 PM David Rowley  wrote:
>>
>> On Thu, 9 Feb 2023 at 10:45, Benjamin Tingle  wrote:
>> > Basically- window partition functions don't take advantage of existing 
>> > table partitions. I use window functions as a more powerful GROUP BY 
>> > clause that preserves row-by-row information- super handy for a lot of 
>> > things.
>> >
>> > In particular, I want to use window functions on already partitioned 
>> > tables, like the below example:
>> >
>> > create table abb (a int, b int, g int) partition by hash(b)
>> > /* populate table etc... */
>> > select a, b, min(a) over (partition by b) as g from abb
>> >
>> > Ideally with a query plan like this:
>> >
>> > Window:
>> > Append:
>> > Sort on table_p0
>> > Sort on table_p1
>> > Sort on table_p2
>>
>> There was some effort [1] in version 12 to take advantage of the order
>> defined by the partitioning scheme. The release notes [2] mention:
>>
>> "Avoid sorting when partitions are already being scanned in the necessary 
>> order"
>>
>> However, it's not 100% of what you need as there'd have to be a btree
>> index on abb(b) for the planner to notice.
>>
>> Likely this could be made better so that add_paths_to_append_rel()
>> added the pathkeys defined by the partitioned table into
>> all_child_pathkeys if they didn't exist already. In fact, I've
>> attached a very quickly hacked together patch against master to do
>> this.  I've given it very little thought and it comes complete with
>> failing regression tests.
>>
>> If you're interested in pursuing this then feel free to take the patch
>> to the pgsql-hackers mailing list and propose it. It's unlikely I'll
>> get time to do that for a while, but I will keep a branch locally with
>> it to remind me in case I do at some point in the future.
>>
>> David
>>
>> [1] 
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=959d00e9dbe4cfcf4a63bb655ac2c29a5e579246
>> [2] https://www.postgresql.org/docs/release/12.0/
>
>
>
> --
>
> Ben(t).>
> Ben
>
> On Wed, Feb 8, 2023 at 2:36 PM David Rowley  wrote:
>>
>> On Thu, 9 Feb 2023 at 10:45, Benjamin Tingle  wrote:
>> > Basically- window partition functions don't take advantage of existing 
>> > table partitions. I use window functions as a more powerful GROUP BY 
>> > clause that preserves row-by-row information- super handy for a lot of 
>> > things.
>> >
>> > In particular, I want to use window functions on already partitioned 
>> > tables, like the below example:
>> >
>> > create table abb (a int, b int, g int) partition by hash(b)
>> > /* populate table etc... */
>> > select a, b, min(a) over (partition by b) as g from abb
>> >
>> > Ideally with a query plan like this:
>> >
>> > Window:
>> > Append:
>> > Sort on table_p0
>> > Sort on table_p1
>> > Sort on table_p2
>>
>> There was some effort [1] in version 12 to take advantage of the order
>> defined by the partitioning scheme. The release notes [2] mention:
>>
>> "Avoid sorting when partitions are already being scanned in the necessary 
>> order"
>>
>> However, it's not 100% of what you need as there'd have to be a btree
>> index on abb(b) for the planner to notice.
>>
>> Likely this could be made better so that add_paths_to_append_rel()
>> added the pathkeys defined by the partitioned table into
>> all_child_pathkeys if they didn't exist already. In fact, I've
>> attached a very quickly hacked together patch against master to do
>> this.  I've given it very little thought and it comes complete with
>> failing regression tests.
>>
>> If you're interested in pursuing this then feel free to take the patch
>> to the pgsql-hackers mailing list and propose it. It's unlikely I'll
>> get time to do that for a while, but I will keep a branch locally with
>> it to remind me in case I do at some point in the future.
>>
>> David
>>
>> [1] 
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=959d00e9dbe4cfcf4a63bb655ac2c29a5e579246
>> [2] https://www.postgresql.org/docs/release/12.0/
>
>
>
> --
>
> Ben(t).




Re: Window Functions & Table Partitions

2023-02-08 Thread David Rowley
On Thu, 9 Feb 2023 at 10:45, Benjamin Tingle  wrote:
> Basically- window partition functions don't take advantage of existing table 
> partitions. I use window functions as a more powerful GROUP BY clause that 
> preserves row-by-row information- super handy for a lot of things.
>
> In particular, I want to use window functions on already partitioned tables, 
> like the below example:
>
> create table abb (a int, b int, g int) partition by hash(b)
> /* populate table etc... */
> select a, b, min(a) over (partition by b) as g from abb
>
> Ideally with a query plan like this:
>
> Window:
> Append:
> Sort on table_p0
> Sort on table_p1
> Sort on table_p2

There was some effort [1] in version 12 to take advantage of the order
defined by the partitioning scheme. The release notes [2] mention:

"Avoid sorting when partitions are already being scanned in the necessary order"

However, it's not 100% of what you need as there'd have to be a btree
index on abb(b) for the planner to notice.

Likely this could be made better so that add_paths_to_append_rel()
added the pathkeys defined by the partitioned table into
all_child_pathkeys if they didn't exist already. In fact, I've
attached a very quickly hacked together patch against master to do
this.  I've given it very little thought and it comes complete with
failing regression tests.

If you're interested in pursuing this then feel free to take the patch
to the pgsql-hackers mailing list and propose it. It's unlikely I'll
get time to do that for a while, but I will keep a branch locally with
it to remind me in case I do at some point in the future.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=959d00e9dbe4cfcf4a63bb655ac2c29a5e579246
[2] https://www.postgresql.org/docs/release/12.0/
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ae0f9bdc8a..c4271c9179 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1365,6 +1365,27 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		}
 	}
 
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(rel->boundinfo, rel->live_parts))
+	{
+		List	   *partition_pathkeys;
+		bool		partial;
+
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+													  &partial);
+
+		if (!partial)
+			all_child_pathkeys = lappend(all_child_pathkeys, partition_pathkeys);
+
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  BackwardScanDirection,
+													  &partial);
+
+		if (!partial)
+			all_child_pathkeys = lappend(all_child_pathkeys, partition_pathkeys);
+	}
+
 	/*
 	 * Collect lists of all the available path orderings and
 	 * parameterizations for all the children.  We use these as a


Re: Getting an index scan to be a parallel index scan

2023-02-01 Thread David Rowley
On Thu, 2 Feb 2023 at 14:49, Thomas Munro  wrote:
> If I had more timerons myself, I'd like to try to make parallel
> function scans, or parallel CTE scans, work...

I've not really looked in detail but I thought parallel VALUES scan
might be easier than those two.

David




Re: Getting an index scan to be a parallel index scan

2023-02-01 Thread David Rowley
On Wed, 1 Feb 2023 at 18:39, Alex Kaiser  wrote:
> postgres=# set force_parallel_mode = on;

There's been a bit of debate about that GUC and I'm wondering how you
came to the conclusion that it might help you. Can you share details
of how you found out about it and what made you choose to set it to
"on"?

David




Re: Postgres12 looking for possible HashAggregate issue workarounds?

2022-12-18 Thread David Rowley
On Sun, 18 Dec 2022 at 23:44, João Paulo Luís  wrote:
> Meanwhile, as a one-time workaround I've disabled the hashagg algorithm,

The way the query planner determines if Hash Aggregate's hash table
will fit in work_mem or not is based on the n_distinct estimate of the
columns being grouped on.  You may want to review what analyze set
n_distinct to on this table. That can be done by looking at:

select attname,n_distinct from pg_Stats where tablename =
'sentencesource' and attname = 'sentence';

If what that's set to does not seem realistic, then you can overwrite this with:

ALTER TABLE sentencesource ALTER COLUMN sentence SET (n_distinct = N);

Please see the paragraph in [1] about n_distinct.  Using an absolute
value is likely not a great idea if the table is going to grow. You
could maybe give it a better estimate about how many times values are
repeated by setting some negative value, as described in the
documents. You'll need to analyze the table again after changing this
setting.
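For example, if each sentence value appears on average about twice,
something like this may be closer to reality (-0.5 is purely
illustrative):

ALTER TABLE sentencesource ALTER COLUMN sentence SET (n_distinct = -0.5);
ANALYZE sentencesource;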

David

[1] https://www.postgresql.org/docs/12/sql-altertable.html




Re: DML sql execution time slow down PGv14 compared with PGv13

2022-12-15 Thread David Rowley
On Thu, 15 Dec 2022 at 21:12, James Pang (chaolpan)  wrote:
>We had some load test ( DML inserts/deletes/updates/ on tens of hash 
> partition tables)  and found that PGV14 slow down 10-15% compared with PGV13. 
>  Same test server, same schema tables and data. From pg_stat_statements, sql 
> exec_time, we did found similar mean_exec_time increased from 5%-25% with 
> same SQL statements. Both v14 and v13 give very fast sql response time, just 
> compare the %diff from sql statements mean_exec_time.

I tried this out on the tip of the PG13 and PG14 branch with the same
scale of pgbench as you mentioned and I don't see the same slowdown as
you do.

PG13:
tps = 1711.980109 (excluding connections establishing)

PG14:
tps = 1736.466835 (without initial connection time)

As for why yours might be slower.  You might want to have a look at
the EXPLAIN ANALYZE output for the UPDATE statements. You can recreate
the -M prepared by using PREPARE and EXECUTE. You might want to
execute the statements 6 times and see if the plan changes on the 6th
execution.  It's likely not impossible that PG14 is using custom
plans, whereas PG13 might be using generic plans for these updates.
There were some quite significant changes made to the query planner in
PG14 that changed how planning works for UPDATEs and DELETEs from
partitioned tables.  Perhaps there's some reason there that the
custom/generic plan choice might differ. I see no reason why INSERT
would have become slower. Both the query planning and execution are
very different for INSERT.
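For example, in psql (table and column names are placeholders for one
of your actual UPDATE statements):

PREPARE upd(int, int) AS
    UPDATE my_hash_part_table SET val = $2 WHERE id = $1;

EXPLAIN (ANALYZE, BUFFERS) EXECUTE upd(123, 1);
-- run the EXECUTE at least 6 times; on the 6th execution the plan
-- cache may switch to a generic plan, which is where v13 and v14
-- could differ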

You might also want to have a look at what perf says. If you have the
debug symbols installed, then you could just watch "perf top --pid=<backend pid>". Maybe that will show you
something interesting.




Re: wrong rows and cost estimation when generic plan

2022-12-06 Thread David Rowley
On Tue, 6 Dec 2022 at 20:17, James Pang (chaolpan)  wrote:
>  Could you provide the function name for generic plan selectivity estimation?

If you look at eqsel_internal(), you'll see there are two functions
that it'll call var_eq_const() for Consts and otherwise
var_eq_non_const(). It'll take the non-Const path for planning generic
plans.

David




Re: wrong rows and cost estimation when generic plan

2022-12-05 Thread David Rowley
On Tue, 6 Dec 2022 at 18:28, James Pang (chaolpan)  wrote:
>->  Index Scan using idx_x_time on x  (cost=0.44..8.48 rows=1 
> width=2923) (actual time=8136.242..8136.242 rows=0 loops=1)
>  Index Cond: ((starttime = $7) AND (endtime = $8))
>  Filter: ((password IS NULL) AND ...(aid = $4) AND (bid = $5) AND 
> (btype = $6) AND...
>  Rows Removed by Filter: 5534630

I wonder if you did:

create statistics x_starttime_endtime_stats  (ndistinct) on
starttime,endtime from x;
analyze x;

if the planner would come up with a higher estimate than what it's
getting for the above and cause it to use the other index instead.

> optimzer is very complicated, could you direct me how optimizer to do 
> selectivity estimation when building generic plan, for this case? for 
> custom_plan, optimizer knows boundparams values, but when generic_plan, 
> planner() use boundparams=NULL,  it try to calculate average value based on 
> mcv list of the index attributes (starttime,endtime)  ?

IIRC, generic plan estimates become based on distinct estimations
rather than histograms or MCVs.
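If you want to see what the generic plan's estimates look like, you
can force one (PG12+). A simplified illustration:

PREPARE q(timestamp, timestamp) AS
    SELECT * FROM x WHERE starttime = $1 AND endtime = $2;

SET plan_cache_mode = force_generic_plan;
EXPLAIN EXECUTE q('2022-12-01', '2022-12-02');  -- shows $1/$2 and the
                                                -- ndistinct-based rows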

David




Re: Catching up with performance & PostgreSQL 15

2022-11-29 Thread David Rowley
On Wed, 30 Nov 2022 at 03:31, Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > IMO it was a mistake to turn JIT on in the default config, so that's one
> > thing you'll likely want to change.
>
> I wouldn't necessarily go quite that far, but I do think that the
> default cost thresholds for invoking it are enormously too low,
> or else there are serious bugs in the cost-estimation algorithms
> for deciding when to use it.  A nearby example[1] of a sub-1-sec
> partitioned query that took 30sec after JIT was enabled makes me
> wonder if we're accounting correctly for per-partition JIT costs.

I'm very grateful for JIT. However, I do agree that the costs need to work.

The problem is that the threshold to turn JIT on does not consider how
many expressions need to be compiled. It's quite different to JIT
compile a simple one-node plan with a total cost of 10 than to JIT
compile a plan that costs the same but queries 1000 partitions. I
think we should be compiling expressions based on the cost of the
individual node rather than the total cost of the plan. We need to
make some changes so we can more easily determine the number of times
a given node will be executed before we can determine how worthwhile
JITting an expression in a node will be.
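In the meantime, the thresholds can be raised (or JIT disabled) per
session or in postgresql.conf if it's misfiring for you; the values
below are only examples:

SET jit_above_cost = 1000000;           -- default 100000
SET jit_inline_above_cost = 5000000;    -- default 500000
SET jit_optimize_above_cost = 5000000;  -- default 500000
-- or simply
SET jit = off;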

David

> [1] 
> https://www.postgresql.org/message-id/B6025887-D73F-4B5B-9925-4DA4B675F7E5%40elevated-dev.com




Re: Postgresql 13 partitioning advice

2022-08-02 Thread David Rowley
On Tue, 2 Aug 2022 at 19:48, Ameya Bidwalkar
 wrote:
> We have a Postgresql 13 database where we have a single table with several 
> millions of rows  . We plan to partition it based on timestamp .
> We have  been seeking  advice for best practices for building this.
> This table will get lots of updates for the same rows during a short period 
> of time.During this time rows would be in a single partition .
> After this short time these rows would move to  another partition  .Where no 
> more updates take place on these rows.But might  have some SELECT queries  
> running.
> We plan to l have  partitions based on months and then roll them up in a year 
>  and then archive these older partitions
> One consultant we talked with told us this  row movement between the 
> partitions will have
> huge complications .But  this was an issue during the Postgres 10 version .

Define "huge complications"?

The capabilities of partitioned tables have changed quite a bit since
the feature was added. It's very easy for knowledge to get out-dated
in this area.  I did quite a bit of work on them and I struggle to
remember off the top of my head which versions saw which improvements.
PG12 saw lots. See [1], search for "partition".

One possible complication is what is mentioned in [2] about
"serialization failure error".  UPDATEs that cause a tuple to move to
another partition can cause a serialization failure even at the READ
COMMITTED transaction isolation level, not just in serializable
transactions. If it's not
already, you might want to have your application retry transactions on
SQL:40001 errors.

Apart from that, assuming there's comparatively a small number of rows
in the partition being updated compared to the partition with the
static rows, then it sounds fairly efficient. As you describe it,  the
larger static partition is effectively INSERT only and auto-vacuum
will need to touch it only for tuple freezing work.  The smaller of
the two tables will receive more churn but will be faster to vacuum.
PG13 got a new feature that makes sure auto-vacuum also does the
rounds on INSERT-only tables too, so the static partition is not going
to be neglected until anti-wrap-around-autovacuums trigger, like they
would have in PG12 and earlier.

Another thing to consider is that an UPDATE of a non-partitioned table
has a chance at being a HOT update. That's possible if the tuple can
fit on the same page and does not update any of the indexed columns. A
HOT update means no indexes need to be updated so these perform faster
and require less space in WAL than a non-HOT update.  An UPDATE that
moves a tuple to another partition can never be a HOT update.  That's
something you might want to consider. If you're updating indexed
columns already then it's not a factor to consider. There's also
overhead to postgres having to find the partition for the newly
updated version of the tuple.  That's not hugely expensive, but it's
generally measurable. RANGE partitioned tables with a large number of
partitions will have the most overhead for this. HASH partitioned
tables, the least.
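You can check how many of your current updates are HOT with something
like (table name is a placeholder):

SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'your_table';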

The best thing you can likely do is set up a scenario with pgbench and
compare the performance. pgbench is a pretty flexible tool that will
allow you to run certain queries X% of the time and even throttle the
workload at what you expect your production server to experience. You
could then run it overnight on a test server, or even for weeks and
see how auto-vacuum keeps up when compared to the non-partitioned
case. You can also check how much extra WAL is generated vs the
non-partitioned case.
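A custom pgbench script is just a file of SQL plus \set commands for
random parameters, for example (table, column and distribution are
made up):

\set id random(1, 1000000)
UPDATE events SET status = 'done', updated_at = now() WHERE id = :id;

and then something like "pgbench -n -c 8 -j 8 -T 600 -f bench.sql
yourdb" to run it for 10 minutes.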

> So we are seeking advice on the performance perspective and things we should 
> take care of along with manual vacuums on a regular schedule and indexing.
> Are there any tunables I should experiment with in particular ?

Perhaps if you want to keep a small high-churn table in check you might
want to consider if autovacuum_naptime is set low enough. You may not
care if the space being consumed in the standard 1min
autovacuum_naptime is small enough not to be of concern.

David

[1] https://www.postgresql.org/docs/release/12.0/
[2] https://www.postgresql.org/docs/13/sql-update.html




Re: Why is there a Sort after an Index Only Scan?

2022-05-04 Thread David Rowley
On Thu, 5 May 2022 at 11:15, André Hänsel  wrote:
>
> Quick(?) question... why is there a Sort node after an Index Only Scan?
> Shouldn't the index already spit out sorted tuples?
>
> CREATE INDEX ON orders_test(shipping_date, order_id);
>
> EXPLAIN ANALYZE SELECT
> FROM orders_test
> WHERE TRUE
> AND shipping_date >= '2022-05-01'
> AND shipping_date <= '2022-05-01'
> ORDER BY order_id
> LIMIT 50;

Unfortunately, the query planner is not quite smart enough to realise
that your shipping_date clauses can only match a single value.
There's quite a bit more we could do with the planner's
EquivalanceClasses. There is a patch around to help improve things in
this area but it requires some more infrastructure to make it more
practical to do from a performance standpoint in the planner.

You'll get the plan you want if you rewrite the query and replace your
date range with shipping_date = '2022-05-01'.  Your use of WHERE TRUE
indicates to me that you might be building this query in an
application already, so maybe you can just tweak that application to
test if the start and end dates are the same and use equality when
they are.
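i.e. rewriting your example with equality:

EXPLAIN ANALYZE SELECT
FROM orders_test
WHERE shipping_date = '2022-05-01'
ORDER BY order_id
LIMIT 50;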

David

[1] https://commitfest.postgresql.org/38/3524/




Re: Window partial fetch optimization

2022-05-03 Thread David Rowley
On Wed, 4 May 2022 at 06:11, Levi Aul  wrote:
> It is our expectation that this query “should” be able to be cheap-to-compute 
> and effectively instantaneous. (It’s clear to us how we would make it so, 
> given a simple LMDB-like sorted key-value store: prefix-match on 
> holder_address; take the first row you find for the contract-address you’re 
> on; build a comparator key of (holder_address, contract_address, 
> highest-possible-version) and traverse to find the lowest row that sorts 
> greater than it; repeat.)
>
> Which might, in SQL, be expressed as something like this:
>
> WITH ranked_holder_balances AS (
> SELECT
> *,
> row_number() OVER w AS balance_rank
> FROM contract_balance_updates
> WHERE holder_address = '\x'::bytea
> WINDOW w AS (
> PARTITION BY holder_address, contract_address
> ORDER BY start_block_height DESC
> )
> ORDER BY holder_address ASC, contract_address ASC, start_block_height DESC
> )
> SELECT *
> FROM ranked_holder_balances
> WHERE balance_rank = 1

Yes, PostgreSQL 14 is not very smart about realising that WHERE
balance_rank = 1 is only going to match the first row of each window
partition. PostgreSQL 15 (coming later this year) should be better in
this regard as some work was done to teach the query planner about
monotonic window functions [1].  However, that change likely does not
do all you'd like here as the WindowAgg node still must consume and
throw away all tuples until it finds the first tuple belonging to the
next window partition.  It sounds like you really want "Skip Scans" or
"Loose Index Scans" which are implemented by some other RDBMS'.  I
imagine that even with the change to PostgreSQL 15 that it still
wouldn't be as fast as your DISTINCT ON example.

> WITH bup1 AS (
> SELECT DISTINCT bup.holder_address, bup.contract_address
> FROM contract_balance_updates bup
> WHERE bup.holder_address = 
> '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
> ORDER BY bup.contract_address ASC
> )
> SELECT
>   bup1.holder_address,
>   bup1.contract_address,
>   (
> SELECT balance
> FROM contract_balance_updates bup2
> WHERE bup2.holder_address = bup1.holder_address
> AND bup2.contract_address = bup1.contract_address
> ORDER BY bup2.holder_address ASC, bup2.contract_address ASC, 
> bup2.start_block_height DESC
> LIMIT 1
>   ) AS balance
> FROM bup1

> I really don’t like this last approach; it scans twice, it’s surprising / 
> confusing for people maintaining the query, etc. I believe that, due to the 
> correlated subquery, the planning overhead is also O(N) with the number of 
> matched entities increases (though I don’t have a good test-case for this.)

No, the subquery is not replanned each time it is rescanned. It's
planned once and that same plan will be executed each time. So no O(n)
planning overhead.

> Is there any way to get PG to do what this last query is doing, purely using 
> window-functions / distinct on / etc.? Because, if there is, I can’t find it.
>
> It seems that PG can in fact do index-range-seeking (since that’s what it’s 
> doing when gathering the distinct contract_addresses in the last query.) It 
> seems intuitive to me that it should be using such an approach to filter for 
> rows in window/group-partitions, when a criteria+index that can be combined 
> to limit the size of the window/group are available to the planner. And that, 
> even when not able to be automatically inferred, it would make sense for 
> there to be control over such behaviour in SQL, using hypothetical syntax 
> like:

Unfortunately, DISTINCT can only be implemented with Hash Aggregate or
Sort / Index Scan + Unique.  We don't have anything currently which
will jump to the next highest key in an index.  There has been some
work on what we're starting to call "Skip Scans", but it's all still
work in progress.

You might find something useful in [2] which might help speed up your
DISTINCT query.
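For reference, the recursive CTE trick from that wiki page adapted to
your bup1 query would look roughly like this (untested, and it relies
on an index leading on (holder_address, contract_address)):

WITH RECURSIVE addrs AS (
    (SELECT contract_address
     FROM contract_balance_updates
     WHERE holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
     ORDER BY contract_address
     LIMIT 1)
    UNION ALL
    SELECT (SELECT c.contract_address
            FROM contract_balance_updates c
            WHERE c.holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
              AND c.contract_address > a.contract_address
            ORDER BY c.contract_address
            LIMIT 1)
    FROM addrs a
    WHERE a.contract_address IS NOT NULL
)
SELECT contract_address FROM addrs WHERE contract_address IS NOT NULL;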

> -- for windows
> row_number() OVER (PARTITION BY x ORDER BY x LIMIT 10 OFFSET 3)
>
> -- for groups
> GROUP BY x, y, z (APPLYING LIMIT 20 OFFSET 5 PER GROUP)
>
>
> Does this make sense? Or is this something PG is already doing, and I just 
> haven’t found the right magic words / built my index correctly to unlock it? 
> (I notice that the last example is an index-only scan; would I get this 
> behaviour from the previous two queries if I made the index a covering index 
> such that those could be index-only scans as well?)

Unfortunately, there is no magic words here. PostgreSQL 14 simply has
no ability to know that row_number() is monotonically increasing,
therefore has no ability to skip any processing for rows that are
never needed.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9d9c02ccd1aea8e9131d8f4edb21bf1687e40782
[2] https://wiki.postgresql.org/wiki/Loose_indexscan




Re: Useless memoize path generated for unique join on primary keys

2022-05-03 Thread David Rowley
On Wed, 4 May 2022 at 00:21, Benjamin Coutu  wrote:
> Thanks David, using extended statistics for both (and only for both) tables 
> solved this problem.

Oh, whoops. I did get that backwards.  The estimate used by the
Memoize costing code is from the outer side of the join, which is the
extdataempty in this case. I don't think the
extdataregular_field_index_stats will do anything. It'll be the ones
you added on extdataempty that are making it work.

> BTW, thank you for all your work on performance in recent releases.

Thanks for the feedback :)

David




Re: Useless memoize path generated for unique join on primary keys

2022-05-03 Thread David Rowley
On Tue, 3 May 2022 at 23:05, Benjamin Coutu  wrote:
>   ->  Memoize  (cost=0.58..0.67 rows=1 width=16) (actual time=0.002..0.002 
> rows=0 loops=4067215)
> Cache Key: e2.field, e2.index
> Cache Mode: logical
> Hits: 0  Misses: 4067215  Evictions: 3228355  Overflows: 0  Memory 
> Usage: 65537kB
> Buffers: shared hit=16268863
> ->  Index Only Scan using pk_extdataregular on extdataregular e1  
> (cost=0.57..0.66 rows=1 width=16) (actual time=0.001..0.001 rows=0 
> loops=4067215)
>   Index Cond: ((field = e2.field) AND (index = e2.index))

> Why would it ever make sense to generate a memoize plan for a unique join?

It wouldn't ever make sense.

The problem is that estimate_num_groups() is used to estimate the
number of distinct values and that function does not know about
primary keys. There's no way the costing of Memoize would allow a
Memoize plan to be used if it thought all values were unique, so the
only possibility here is that ndistinct is being underestimated by
some amount that makes Memoize look like the most favourable plan.

You could see what the planner thinks about the ndistinct estimate on
field, index by doing:

EXPLAIN SELECT field,index FROM extdataregular GROUP BY 1,2;

Whatever you see in the final row estimate for that plan is what's
being fed into the Memoize costing code.

> I think this issue might tie in with the current discussion over on the 
> hackers mailing list [1]

I'd say it's a pretty different problem. The cache hit ratio
discussion on that thread talks about underestimating the hit ratio.
That particular problem could only lead to Memoize plans *not* being
chosen when they maybe should be. Not the other way around, which is
your case.

create statistics extdataregular_field_index_stats (ndistinct) on
field, index from extdataregular;
analyze extdataregular;

would likely put that right.

David




Re: Performance differential when 0 values present vs when 1 values present. Planner return 52k rows when 0 expected.

2022-04-27 Thread David Rowley
On Wed, 27 Apr 2022 at 19:54, Emil Iggland  wrote:
>
>  > You've got the wrong column order (for this query anyway) in that
>  > index.  It'd work a lot better if dataview were the first column;

> I might be misunderstanding you, but I assume that you are suggesting an
> index on (dataview, valuetimestamp).
> We have that index, it is the primary key. For some reason it isn't
> being selected.

I don't think that index can be used for your original query. It could
only be used if "channel" is unique in "valueseries" and you'd written
the query as:

select * from datavalue
where dataview = (select id from valueseries where channel = 752433)
ORDER BY VALUETIMESTAMP DESC
FETCH FIRST ROW only;

that would allow a backwards index scan using the (dataview,
valuetimestamp) index.  Because you're using the IN clause to possibly
look for multiple "dataview" values matching the given "channel", the
index range scan does not have a single point to start at.  What
you've done with the LATERAL query allows the index to be scanned once
for each "valueseries" row with a "channel" value matching your WHERE
clause.
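
For comparison, the LATERAL form being described looks roughly like this (a
sketch only; the actual query from the earlier message may differ in detail):

select dv.*
from valueseries vs
cross join lateral (
    select d.*
    from datavalue d
    where d.dataview = vs.id          -- one backwards index scan per valueseries row
    order by d.valuetimestamp desc
    fetch first row only
) dv
where vs.channel = 752433;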

I guess "channel" must not be the primary key to "valueseries" and
that's why you use an IN().

The above query would return an error if multiple rows were returned
by the subquery.

David




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread David Rowley
On Fri, 23 Jul 2021 at 04:14, Peter Geoghegan  wrote:
>
> On Thu, Jul 22, 2021 at 8:45 AM Tom Lane  wrote:
> > That is ... weird.  Maybe you have found a bug in the spill-to-disk logic;
> > it's quite new after all.  Can you extract a self-contained test case that
> > behaves this way?
>
> I wonder if this has something to do with the way that the input data
> is clustered. I recall noticing that that could significantly alter
> the behavior of HashAggs as of Postgres 13.

Isn't it more likely to be reaching the group limit rather than the
memory limit?

if (input_groups * hashentrysize < hash_mem * 1024L)
{
    if (num_partitions != NULL)
        *num_partitions = 0;
    *mem_limit = hash_mem * 1024L;
    *ngroups_limit = *mem_limit / hashentrysize;
    return;
}

There are 55 aggregates on a varchar(255). I think hashentrysize is
pretty big. If it was 255*55 then only 765591 groups fit in the 10GB
of memory.

David




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread David Rowley
On Fri, 23 Jul 2021 at 03:56, Tom Lane  wrote:
> So basically, we now have a hard restriction that hashaggs can't use
> more than INT_MAX kilobytes, or approximately 2.5GB, and this use case
> is getting eaten alive by that restriction.  Seems like we need to
> do something about that.

Hmm, math check?

postgres=# select pg_size_pretty(power(2,31)::numeric*1024);
 pg_size_pretty
----------------
 2048 GB
(1 row)

David




Re: Big performance slowdown from 11.2 to 13.3

2021-07-21 Thread David Rowley
On Thu, 22 Jul 2021 at 16:37, l...@laurent-hasson.com
 wrote:
> Seems like no cigar ☹ See plan pasted below. I changed the conf as follows:
>   - hash_mem_multiplier = '2'
>   - work_mem = '1GB'

> Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126560kB

You might want to keep going higher with hash_mem_multiplier until you
see no "Disk Usage" there.  As mentioned, v11 didn't spill to disk and
just used all the memory it pleased.  That was a bit dangerous as it
could result in OOM, so it was fixed.
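
For example (the values here are purely illustrative, not a recommendation):

SET work_mem = '1GB';
SET hash_mem_multiplier = 4.0;  -- allows hash-based nodes up to ~4GB before spilling

then re-run EXPLAIN (ANALYZE, BUFFERS) and check whether "Disk Usage" still
shows up in the Hash Aggregate node.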

David




Re: Linear slow-down while inserting into a table with an ON INSERT trigger ?

2021-07-17 Thread David Rowley
On Sat, 17 Jul 2021 at 16:40, Justin Pryzby  wrote:
> You could run a single UPDATE rather than 30k triggers.
> Or switch to an INSERT on the table, with an index on it, and call
> max(last_parent_table_change) from whatever needs to ingest it.  And prune the
> old entries and vacuum it outside the transaction.  Maybe someone else will
> have a better suggestion.

Maybe just change the UPDATE statement to:

UPDATE data_sync SET last_parent_table_change=CURRENT_TIMESTAMP WHERE
last_parent_table_change <> CURRENT_TIMESTAMP;

That should reduce the number of actual updates to 1 per transaction.

David




Re: Partition column should be part of PK

2021-07-11 Thread David Rowley
On Mon, 12 Jul 2021 at 12:37, Nagaraj Raj  wrote:
> personally, I feel this design is very bad compared to other DB servers.

I'm not sure exactly what you're referring to here as you didn't quote
it, but my guess is you mean our lack of global index support.

Generally, there's not all that much consensus in the community that
this would be a good feature to have.  Why do people want to use
partitioning?  Many people do it so that they can quickly remove data
that's no longer required with a simple DETACH operation.  This is
metadata only and is generally very fast.  Another set of people
partition as their tables are very large and they become much easier
to manage when broken down into parts.  There's also a group of people
who do it for the improved data locality.   Unfortunately, if we had a
global index feature then that requires building a single index over
all partitions.  DETACH is no longer a metadata-only operation as we
must somehow invalidate or remove tuples that belong to the detached
partition. The group of people who partitioned to get away from very
large tables now have a very large index.  Maybe the only group to get
off lightly here are the data locality group. They'll still have the
same data locality on the heap.

So in short, many of the benefits of partitioning disappear when you
have a global index.

So, why did you partition your data in the first place?  If you feel
like you wouldn't mind having a large global index over all partitions
then maybe you're better off just using a non-partitioned table to
store this data.

David




Re: Planning performance problem (67626.278ms)

2021-07-02 Thread David Rowley
On Thu, 1 Jul 2021 at 08:56, Manuel Weitzman  wrote:
> For each of these RestrictInfos there *could* be one cache miss on
> cached_scansel() that *could* force the planner to compute
> get_actual_variable_range() for the same variable (a.a) over and over,
> as mergejoinscansel() always computes the selectivity for the
> intervals that require actual extremal values. In practice this
> re-computing of the variable range seems to happen a lot.

Recently, for some other topic, I was thinking about what would happen
if the executor were ever able to give up on a plan because something
did not play out the way the planner expected, and throw the plan back
at the planner with some hints about where it went wrong. I wondered
what exactly those hints would look like.

I guessed these would be a list of some rough guidelines saying that x
JOIN y ON x.a=y.b produces at least Z rows. Then the planner would
find that when estimating the size of the join between x and y and
take the maximum of two values.  That would need to be designed in
such a way that the planner could quickly consult the feedback, e.g. a
hashtable lookup on Oid.

Anyway, I don't really have any clearly thought-through plans for that
idea as it would be a large project that would need a lot of changes
before it could be even thought about seriously. However, it did cause
me to think about that again when reading this thread as it seems
similar. You've learned the actual variable range through actual
execution, so it does not seem too unreasonable that information might
get stored in a similar place, if that place were to exist.

I'm not saying that's what needs to happen here. It's more just food
for thought.  The caching you need does seem more specific to Oid than
relid.

David




Re: PgSQL 12 on WinSrv ~3x faster than on Linux

2021-06-04 Thread David Rowley
On Fri, 4 Jun 2021 at 23:53, Taras Savchuk  wrote:
> My real life test is to "register" 10 _same_ documents (i.e. post the documents)
> in each of 1C/PostgreSQL DBs. Both PostgreSQL DBs are identical and just 
> before test imported to PostgreSQL via application server (DT import).
> On Windows Server test procedure takes 20-30 seconds, on Linux it takes 
> 1m-1m10seconds. PostgreSQL VMs are running on same Hypervisor with same 
> resources assigned to each of them.
> Tuning PostgreSQL config and/or CentOS don't make any difference. Contrary on 
> Windows VM we have almost 3x better performance with stock PostgreSQL config.
>
> Any ideas what's wrong? For me such a big difference on identical 
> databases/queries looks strange.

It's pretty difficult to say. You've not provided any useful details
about the workload you're running.

If this "register 10 _same_ documents" thing requires running some
query, then you might want to look at EXPLAIN (ANALYZE, BUFFERS) for
that query.  You might want to consider doing SET track_io_timing =
on;  Perhaps Linux is having to read more buffers from disk than
Windows.

David.




Re: issue partition scan

2021-05-25 Thread David Rowley
On Wed, 26 May 2021 at 11:38, Nagaraj Raj  wrote:
>
> Apologies,  I didn't understand you completely.
>
> > 1. Those that have sub_soc.soc = 'NFWJYW0' and sub_soc.curr_ind = 'Y'
>
> > It can use constraint exclusion on these to only scan applicable partitions.
>
> > 2. Those that have (acc.acct = 'I' AND acc.acct_sub IN  ( '4', '5' ) )  OR 
> > sub.ban IN  ( '00','01','02','03','04','05' )
>
> > It can't use constraint exclusion on these since results can come from any 
> > partition.
>
> Why is it not using constraint exclusion on the above two conditions(1 and 2) 
> included in the where clause ?
>
> Both sets are pointing to different tables.

It's because of the OR condition.  If it was an AND condition then the
planner wouldn't have to consider the fact that records in other
partitions might be required for the join.

David




Re: Partition with check constraint with "like"

2021-05-21 Thread David Rowley
On Sat, 22 May 2021 at 10:59, Nagaraj Raj  wrote:
> ERROR: empty range bound specified for partition "mytable_z" DETAIL: 
> Specified lower bound ('Z') is greater than or equal to upper bound ('['). 
> SQL state: 42P17

It looks like '[' does not come after 'Z' in your collation.

David




Re: Partition with check constraint with "like"

2021-05-21 Thread David Rowley
On Sat, 22 May 2021 at 04:38, Nagaraj Raj  wrote:
> I am trying to create partitions on the table which have around 2BIL records 
> and users will always look for the "name", its not possible to create a 
> partition with a list, so we are trying to create a partition-based first 
> letter of the name column. name column has a combination of alpha numeric 
> values.

Going by the description of your use case, I think HASH partitioning
might be a better option for you. It'll certainly be less painful to
initially set up and maintain.

Here's an example:

create table mytable (a text) partition by hash(a);
create table mytable0 partition of mytable for values with(modulus 10,
remainder 0);
create table mytable1 partition of mytable for values with(modulus 10,
remainder 1);
create table mytable2 partition of mytable for values with(modulus 10,
remainder 2); --etc

Change the modulus to the number of partitions you want and ensure you
create a partition for each modulus. In this case, it would be 0 to 9.

David




Re: Partition with check constraint with "like"

2021-05-21 Thread David Rowley
On Fri, 21 May 2021 at 19:02, Nagaraj Raj  wrote:
> then what would be the range of Z
> FROM (Z) to (?) ;

postgres=# select chr(ascii('z')+1) ;
 chr
-----
 {
(1 row)


> same way for 9

postgres=# select chr(ascii('9')+1) ;
 chr
-----
 :
(1 row)

https://en.wikipedia.org/wiki/ASCII

You can also use MINVALUE and MAXVALUE to mean unbounded at either end
of the range.
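
For example, a final catch-all partition covering everything from 'z' upwards
could be declared like this (a sketch, assuming a text-keyed range-partitioned
parent table named mytable):

create table mytable_tail partition of mytable
    for values from ('z') to (MAXVALUE);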

But is there a particular need that you want to partition this way? It
seems like it might be a bit painful to maintain, especially if you're
not limiting yourself to ASCII or ANSI characters.

You might want to consider HASH partitioning if you're just looking
for a way to keep your tables and indexes to a more manageable size.
You've not really mentioned your use case here, so it's hard to give
any advice.

There are more details about partitioning in
https://www.postgresql.org/docs/current/ddl-partitioning.html

David




Re: Partition with check constraint with "like"

2021-05-20 Thread David Rowley
On Fri, 21 May 2021 at 12:32, Nagaraj Raj  wrote:
> I am trying to create partitions on the table based on first letter of the 
> column record  value using inherit relation & check constraint.

You'll get much better performance out of native partitioning than you
will with the old inheritance method of doing it.

>  EXECUTE 'CREATE TABLE partition_tab.' || c_table || '(check ( name like 
> '''|| chk_cond||''')) INHERITS (' ||TG_TABLE_NAME|| ');';

This is a bad idea. There's a lock upgrade hazard here that could end
up causing deadlocks on INSERT.  You should just create all the tables
you need beforehand.

I'd recommend you do this using RANGE partitioning. For example:

create table mytable (a text not null) partition by range (a);
create table mytable_a partition of mytable for values from ('a') to
('b'); -- note the upper bound of the range is non-inclusive.
create table mytable_b partition of mytable for values from ('b') to ('c');
insert into mytable values('alpha'),('bravo');

explain select * from mytable where a = 'alpha';
                            QUERY PLAN
------------------------------------------------------------------
 Seq Scan on mytable_a mytable  (cost=0.00..27.00 rows=7 width=32)
   Filter: (a = 'alpha'::text)
(2 rows)

The mytable_b is not scanned.

David




Re: Index and statistics not used

2021-05-17 Thread David Rowley
On Tue, 18 May 2021 at 08:42,  wrote:
> Running delete from table1 where id = 48938 the trigger for constraint runs 
> for 20 seconds
>
> Event when doing a simple select from table2 where table1_id = 48938 takes 
> about 8 seconds

Does EXPLAIN show it uses a seq scan for this 8-second SELECT?

If so, does it use the index if you SET enable_seqscan TO off; ? If
so, how do the costs compare to the seqscan costs?

Is random_page_cost set to something sane? Are all the indexes valid?
(psql's \d table2 would show you INVALID if they're not.)

does: SHOW enable_indexscan; show that index scanning is switched on?

David




Re: BUG #16968: Planner does not recognize optimization

2021-05-17 Thread David Rowley
On Sat, 15 May 2021 at 00:39, KES  wrote:
>
> Thank you for detailed explanation. I glad to hear that I can use aliases and 
> this will be recognized and optimization is applied.
>
> >We'd need some sort of ability to assign ressortgroupref to a particular 
> >column within a
> whole-row var
> Could it be possible to create hidden alias in same way as I did that 
> manually?
>
> Algorithm seems not complex:
> 1. User refer column from composite type/whole-row: (o).agreement_id
> 2. Create hidden column at select: _o_agreement_id
> 3. Replace other references to (o).agreement_id by _o_agreement_id
> 4. Process query as usual after replacements

Internally, PostgreSQL does use a hidden column for columns that are
required for calculations but are not in the SELECT list, e.g. ones
that are in the GROUP BY / ORDER BY, or in your case a window
function's PARTITION BY. We call these "resjunk" columns.  The problem
is you can't reference those from the parent query. If you explicitly
had listed that column in the SELECT clause, it won't cost you
anything more since the planner will add it regardless and just hide
it from you.  When you add it yourself you'll be able to use it in the
subquery and you'll be able to filter out the partitions that you
don't want.

I really think you're driving yourself down a difficult path by
expecting queries with whole-row vars to be optimised just as well as
using select * or explicitly listing the columns.

David




Re: BUG #16968: Planner does not recognize optimization

2021-05-17 Thread David Rowley
On Sun, 16 May 2021 at 02:34, Eugen Konkov  wrote:
> I found a case when `not assigning a ressortgroupref to the whole-row var` 
> cause
> wrong window function calculations.
>
> I  use  same  query.  The  difference  come  when I wrap my query into
> function. (see full queries in attachment)
>
> 1.
> SELECT *
> FROM agreement_totals( tstzrange( '2020-07-01', '2020-08-01' ) )
> WHERE agreement_id = 161::int  AND  (o).period_id = 10::int
>
> 2.
> SELECT *
>   sum(  ) over wagreement
> FROM 
> WHERE agreement_id = 161::int  AND  (o).period_id = 10::int
> WINDOW wagreement AS ( PARTITION BY agreement_id )
>
> For  first  query  window function calculates SUM over all agreements,
> then some are filtered out by (o).period_id condition.

This is unrelated to the optimisation that you were asking about before.

All that's going on here is that WHERE is evaluated before SELECT.
This means that your filtering is done before the window functions are
executed.  This is noted in the docs in [1]:

> The rows considered by a window function are those of the “virtual table” 
> produced by the query's FROM clause as filtered by its WHERE, GROUP BY, and 
> HAVING clauses if any. For example, a row removed because it does not meet 
> the WHERE condition is not seen by any window function. A query can contain 
> multiple window functions that slice up the data in different ways using 
> different OVER clauses, but they all act on the same collection of rows 
> defined by this virtual table.

If you want to filter rows after the window functions are evaluated
then you'll likely want to use a subquery.

David

[1] https://www.postgresql.org/docs/13/tutorial-window.html




Re: BUG #16968: Planner does not recognize optimization

2021-05-14 Thread David Rowley
On Fri, 14 May 2021 at 02:38, Eugen Konkov  wrote:
> Now I create minimal reproducible test case.
> https://dbfiddle.uk/?rdbms=postgres_13=761a00fb599789d3db31b120851d6341
>
> Optimization is not applyed when I filter/partition by column using composite 
> type name.

You probably already know this part, but let me explain it just in
case it's not clear.

The pushdown of the qual from the top-level query into the subquery,
or function, in this case, is only legal when the qual references a
column that's in the PARTITION BY clause of all window functions in
the subquery.  The reason for this is, if we filter rows before
calling the window function, then it could affect which rows are in
see in the window's frame. If it did filter, that could cause
incorrect results.  We can relax the restriction a bit if we can
eliminate entire partitions at once. The window function results are
independent between partitions, so we can allow qual pushdowns that
are in all PARTITION BY clauses.

As for the reason you're having trouble getting this to work, it's
down to the way you're using whole-row vars in your targetlist.

A slightly simplified case which shows this problem is:

create table ab(a int, b int);
explain select * from (select ab as wholerowvar,row_number() over
(partition by a) from ab) ab where (ab.wholerowvar).a=1;

The reason it does not work is down to how this is implemented
internally.   The details are, transformGroupClause() not assigning a
ressortgroupref to the whole-row var.  It's unable to because there is
no way to track which actual column within the whole row var is in the
partition by clause.  When it comes to the code that tries to push the
qual down into the subquery, check_output_expressions() checks if the
column in the subquery is ok to accept push downs or not. One of the
checks is to see if the query has windowing functions and to ensure
that the column is in all the PARTITION BY clauses of each windowing
function.  That check is done by checking if a ressortgroupref is
assigned and matches a tleSortGroupRef in the PARTITION BY clause.  In
this case, it does not match.  We didn't assign any ressortgroupref to
the whole-row var.

Unfortunately, whole-row vars are a bit of a 2nd-class citizen when it
comes to the query planner. Also, it would be quite a bit of effort to
make the planner push down the qual in this case. We'd need some sort
of ability to assign ressortgroupref to a particular column within a
whole-row var and we'd need to adjust the code to check for that when
doing subquery pushdowns to allow it to mention which columns within
whole-row vars can legally accept pushdowns.  I imagine that's
unlikely to be fixed any time soon.  Whole-row vars just don't seem to
be used commonly enough to warrant going to the effort of making this
stuff work.

To work around this, you should include a reference to the actual
column in the targetlist of the subquery, or your function, in this
case, and ensure you use that same column in the PARTITION BY clause.
You'll then need to write that column in your condition that you need
pushed into the subquery. I'm sorry if that messes up your design.
However, I imagine this is not the only optimisation that you'll miss
out on by doing things the way you are.
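
Continuing the simplified example above, the workaround amounts to something
like this (a sketch):

explain select *
from (select a,                                  -- plain column gets a ressortgroupref
             ab as wholerowvar,
             row_number() over (partition by a)
      from ab) ab
where ab.a = 1;                                  -- qual on the plain column can be pushed down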

David




Re: 15x slower PreparedStatement vs raw query

2021-05-05 Thread David Rowley
On Tue, 4 May 2021 at 22:05, Alex  wrote:
> Shouldn't this process be automatic based on some heuristics?

When plan_cache_mode is set to "auto", then the decision to use a
generic or custom plan is cost-based. See [1]. There's a fairly crude
method there for estimating the effort required to replan the query.
The remainder is based on the average cost of the previous custom
plans + estimated planning effort vs cost of the generic plan.  The
cheaper one wins.

Certainly, what's there is far from perfect.  There are various
problems with it.  The estimated planning cost is pretty crude and
could do with an overhaul.   There are also issues with the plan costs
not being true to the cost of the query.  One problem there is that
run-time partition pruning is not costed into the plan.  This might
cause choose_custom_plan() to pick a custom plan when a generic one
with run-time pruning might have been better.

In order to get a better idea of where things are going wrong for you,
we'd need to see the EXPLAIN ANALYZE output for both the custom and
the generic plan.
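
One way to capture both is to force each mode in turn.  A rough sketch (the
table and parameter here are hypothetical):

PREPARE q (bigint) AS SELECT * FROM mytable WHERE id = $1;

SET plan_cache_mode = force_custom_plan;
EXPLAIN (ANALYZE, BUFFERS) EXECUTE q (42);

SET plan_cache_mode = force_generic_plan;
EXPLAIN (ANALYZE, BUFFERS) EXECUTE q (42);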

David

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/utils/cache/plancache.c#L1019




Re: Planning performance problem (67626.278ms)

2021-04-21 Thread David Rowley
On Thu, 22 Apr 2021 at 00:03, Jeremy Schneider  wrote:
>
> Two years later, I still remember this. And today I just confirmed
> someone hitting this on open source PG13.

The only thing that changed about get_actual_variable_range() is that
it now uses a SnapshotNonVacuumable snapshot. Previously a
long-running transaction could have caused vacuum to be unable to
remove tuples which could have caused get_actual_variable_range() to
be slow if it had to skip the unvacuumable tuples.

That's now changed as the SnapshotNonVacuumable will see any tuples
required by that long-running transaction and use that to determine
the range instead of skipping over it.

Anyone with a large number of tuples that vacuum can remove that are
at either end of the range on a column that is indexed by a btree
index could still have issues.  Vacuuming more often might be a good
thing to consider.  With the original report on this thread there were
more dead tuples in the table than live tuples.  Disabling auto-vacuum
or tuning it so it waits that long is likely a bad idea.

FWIW, here's a simple test case that shows the problem in current master.

create table a (a int primary key) with (autovacuum_enabled = off);
insert into a select x from generate_series(1,10000000) x;
analyze a;
delete from a;
\timing on
explain select * from a where a < 10000000;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on a  (cost=0.00..169247.71 rows=9998977 width=4)
   Filter: (a < 10000000)
(2 rows)


Time: 9062.600 ms (00:09.063)

vacuum a;
explain select * from a where a < 10000000;
                     QUERY PLAN
---------------------------------------------------
 Seq Scan on a  (cost=0.00..0.00 rows=1 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 2.665 ms

Notice that it became faster again after I did a vacuum.

David




Re: Potential performance issues related to group by and covering index

2021-03-02 Thread David Rowley
On Wed, 3 Mar 2021 at 10:04, Michael Lewis  wrote:
> Are there guidelines or principles you could share about writing the group by 
> clause such that it is more efficient?

If you have the option of writing them in the same order as an
existing btree index that covers the entire GROUP BY clause (in
version < PG13) or at least prefix of the GROUP BY clause (version >=
PG13), then the planner has a chance to make use of that index to
provide pre-sorted input to do group aggregate.

Since PG13 has Incremental Sort, having an index that covers only a
prefix of the GROUP BY clause may still help.

If no indexes exist then you might get better performance by putting
the most distinct column first.  That's because sorts don't need to
compare the remaining columns once it receives two different values
for one column.  That gets more complex when the most distinct column
is wider than the others. e.g a text compare is more expensive than
comparing two ints.  For Hash Aggregate, I don't think the order will
matter much.
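
As a rough illustration (hypothetical table and index, and assuming product
is more distinct than region):

create table sales (region text, product int, amount numeric);
create index on sales (region, product);

-- index order matches the GROUP BY, so pre-sorted group aggregate is possible:
select region, product, sum(amount) from sales group by region, product;

-- without a usable index, leading with the most distinct column can make the sort cheaper:
select region, product, sum(amount) from sales group by product, region;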

David




Re: Performance issues related to left join and order by

2021-03-02 Thread David Rowley
On Tue, 2 Mar 2021 at 21:53, Liu, Xinyu  wrote:
> *Expected Behavior
>
> Since these two queries are semantically equivalent, we were hoping that 
> PostgreSQL would evaluate them in roughly the same amount of time.
> It looks to me that there is a missing optimization rule related to pushing 
> the sort operator (i.e., order and limit) through the left join.
> Given the significant query execution time difference, I was wondering if it 
> is worth adding such a rule to make the system evaluate the first query more 
> efficiently.
> It would also be helpful if you could comment on if there is a standard 
> practice to evaluate the tradeoff associated with adding such a rule in 
> Postgresql.

We currently don't attempt to push down LIMIT clauses into subqueries.
Before we did that we'd need to get much better at figuring out how
joins duplicate rows so that we could be sure that we're not limiting
the subquery more than the number of records that the outer query will
need to reach its limit.

If you want some advice, you're likely to get more people on your side
and possible support for making improvements to the query planner if
you provide examples that look remotely like real-world queries.   In
the other emails that I've read from you on this list [1], it seems
your example queries are all completely bogus.  I suspect that the
queries are generated by some fuzz testing tool.  I very much imagine
that you really don't need help with these at all.  With respect, it seems
to me that there's about zero chance that you genuinely need the
results of this query more quickly and you've come for help with that.

Because PostgreSQL does not proactively cache query plans, ad-hoc
queries are always parsed, planned then executed.  This means that
it's often not practical to spend excessive amounts of time planning a
query that gets executed just once. Adding new optimisations to the
query planner means they either have to be very cheap to detect, or
they must pay off in many cases.

If you happen to think there's a genuine case for having the query
planner do a better job of doing LIMIT pushdowns into subqueries, then
you're welcome to submit a patch to implement that.  You'll also need
to carefully document exactly which cases the LIMIT can be pushed down
and when it cannot.  That's the hard part. The actual pushing down of
the clause is dead easy.  If you're going to do that, then I'd suggest
you come up with better examples than this one. I don't think many
people will get on board with your newly proposed optimisations when
the queries are obviously not real.  It's hard to imagine the
optimisation being useful to any queries with a query that's so
obviously not a real one.

David

[1] 
https://www.postgresql.org/message-id/BN7PR07MB52024B973EAB075F4DF6C19ACD999%40BN7PR07MB5202.namprd07.prod.outlook.com




Re: Potential performance issues related to group by and covering index

2021-03-02 Thread David Rowley
On Tue, 2 Mar 2021 at 21:53, Liu, Xinyu  wrote:
> *Expected Behavior
>
> Since these two queries are semantically equivalent, we were hoping that 
> PostgreSQL would evaluate them in roughly the same amount of time.
> It looks to me that different order of group by clauses triggers different 
> plans: when the group by clauses (ps_partkey, ps_suppkey) is the same as the 
> covering index, it will trigger an index scan on associated columns;
> however,  when the group by clauses have different order than the covering 
> index (ps_suppkey, ps_partkey), the index scan will not be triggered.
> Given that the user might not pay close attention to this subtle difference, 
> I was wondering if it is worth making these two queries have the same and 
> predictable performance on Postgresql.

Unfortunately, it would take a pretty major overhaul of the query
planner to do that efficiently.

For now, we have a few smarts involved in trying to make the GROUP BY
processing more efficient:

1) We remove columns from the GROUP BY if they're functionally
dependent on the primary key, providing the primary key is present
too. (you're seeing this in your example query)
2) We also change the order of the GROUP BY columns if it's a subset
of the ORDER BY columns.  This is quite good as we'd do grouping by
{b,a} if someone wrote GROUP BY a,b ORDER BY b,a; which would save
having to re-sort the data for the ORDER BY after doing the GROUP BY.
That's especially useful for queries with a LIMIT clause.
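
As a quick illustration of 1) and 2) (hypothetical table):

create table t (id int primary key, a int, b int);

-- 1) b is functionally dependent on the primary key, so grouping is done on id alone:
select id, b, count(*) from t group by id, b;

-- 2) the GROUP BY is a subset of the ORDER BY, so it can be grouped as {b,a},
--    avoiding a re-sort for the ORDER BY:
select a, b, count(*) from t group by a, b order by b, a;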

If we want to do anything much smarter than that like trying every
combination of the GROUP BY clause, then plan times are likely going
to explode. The join order search is done based on the chosen query
pathkeys, which in many queries is the pathkeys for the GROUP BY
clause (see standard_qp_callback()). This means throughout the join
search, planner will try and form paths that provide pre-sorted input
that allows the group by to be implemented efficiently with pre-sorted
data. You might see Merge Joins rather than Hash Joins, for example.

If we want to try every combination of the GROUP BY columns then it
means repeating that join search once per combination. The join search
is often, *by far*, the most expensive part of planning a query.

While it would be nice if the planner did a better job on selecting
the best order for group by columns, unless we can come up with some
heuristics that allow us to just try a single combination that is
likely good, then I don't think anyone would thank us for slowing down
the planner by a factor of the number of possible combinations of the
group by columns.

David




Re: Postgres Optimizer ignores information about foreign key relationship, severly misestimating number of returned rows in join

2020-10-26 Thread David Rowley
On Tue, 27 Oct 2020 at 06:54, Ehrenreich, Sigrid  wrote:
>   ->  Hash Join  (cost=226.27..423.82 rows=115 width=0) (actual 
> time=3.150..7.511 rows=3344 loops=1)   <=== With the FK, the 
> estimation should be 3344, but it is 115 rows

I'd have expected this to find the foreign key and have the join
selectivity of 1.0, but I see it does not, because one of the
EquivalenceClasses has a constant due to the fact.low_card = 1 qual.

In build_join_rel() we call build_joinrel_restrictlist() to get the
join quals that need to be evaluated at the join level, but we only
get the fact.anydata1=dim.anydata1 and fact.anydata2=dim.anydata2
quals there.  The low_card qual gets pushed down to the scan level on
each side of the join, so no need for it to get evaluated at the join
level. Later in build_join_rel() we do set_joinrel_size_estimates().
The restrictlist with just the two quals is what we pass to
get_foreign_key_join_selectivity().  Only two of the foreign key
columns are matched there, therefore we don't class that as a match
and just leave it up to the normal selectivity functions.

I feel like we could probably do better there and perhaps somehow
count ECs with ec_has_const as matched, but there seems to be some
assumptions later in get_foreign_key_join_selectivity() where we
determine the selectivity based on the base rel's tuple count.  We'd
need to account for how many rows remain after filtering the ECs
with ec_has_const == true, else we'd be doing the wrong thing.  That
needs more thought than I have time for right now.

Your case would work if the foreign key had been on just anydata1 and
anydata2, but there's not much chance of that working without a unique
index on those two columns.

Extended statistics won't help you here either since they're currently
not used for join estimations.

David




Re: Query Performance / Planner estimate off

2020-10-20 Thread David Rowley
On Tue, 20 Oct 2020 at 22:38, Mats Julian Olsen  wrote:
>
> The crux of our issue is that the query planner chooses a nested loop join 
> for this query. Essentially making this query (and other queries) take a very 
> long time to complete. In contrast, by toggling `enable_nestloop` and 
> `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 
> minutes.
>
> 1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
> 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
> 3) enable_nestloop=off; enable_seqscan=off (2 min): 
> https://explain.depesz.com/s/0WXx
>
> How can I get Postgres not to loop over 12M rows?

You'll likely want to look at what random_page_cost is set to. If the
planner is preferring nested loops then it may be too low.  You'll
also want to see if effective_cache_size is set to something
realistic.  Higher values of that will prefer nested loops like this.

You may also want to reduce max_parallel_workers_per_gather.  It looks
like you're not getting your parallel workers as often as you'd like.
If the planner chooses a plan thinking it's going to get some workers
and gets none, then that plan may be inferior to the one that the planner
would have chosen if it had known the workers would be unavailable.

> Let me know if there is anything I left out here that would be useful for 
> further debugging.

select name,setting from pg_Settings where category like 'Query
Tuning%' and source <> 'default';
select version();

would be useful.

David




Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-09-13 Thread David Rowley
On Tue, 8 Sep 2020 at 06:05, Raj  wrote:
>
> > This would not exactly look like a bug, because the message says "to
> > be locked", so at least it's not allowing two workers to lock the same
> > tuple. But it seems that the skip-locked mode should not make an error
> > out of this, but treat it as the tuple was already locked. Why would
> > it want to lock the tuple (representing the job) if another worker has
> > already finished his UPDATE of the job to mark it as "done" (which is
> > what makes the tuple move to the "completed" partition.)

(It's not very clear who wrote the above text since the quote does not
mention who the author is and the original email didn't appear to have
made it to the list)

It's not a bug. I think the quoted text is expecting a bit too much
from the database. It does not know that if the tuple is updated and
moved to another partition that it can be safely ignored.   For all
the database knows, the new version of the tuple that's in the new
partition still matches the query's WHERE clause and should be locked.
If we just go and ignore moved off tuples then we could miss
processing tuples that still need to be processed.

It's perhaps not impossible to make it work slightly better if it were
somehow possible to inform heapam_tuple_lock() that it's operating on
a partition and the query queried a partitioned table and that all but
1 partition was pruned with partition pruning.  In this case we could
be certain the new version of the tuple can't match the WHERE clause
of the SELECT since partition pruning determined that all other
partitions don't match the WHERE clause. However, that's:

a) a pretty horrid thing to have to teach heapam_tuple_lock() about, and;
b) only going to work when 1 partition survives partition pruning,
which is pretty horrible since doing ATTACH PARTITION could suddenly
cause your queries to fail randomly.

If you had 3 partitions, one for "pending", "retry" and "complete",
and you wanted to lock all rows that are in a "pending" or "retry"
state, then when we encounter an updated row in the "pending"
partition, we have no knowledge if it was moved into the "retry" or
the "completed" partition.   If it's in "retry", then we do want to
find it and process it, but if it's in "completed", then it does not
match the WHERE clause of the query and we can ignore it.  Since we
don't know which, we can't make assumptions and must force the user to
try again, hence the serialisation failure error.

> > Either the SELECT for jobs to do returned a wrong tuple, which was
> > already updated, or there is some lapse in the locking.
> >
> > Either way it would seem to be a waste of time throwing all these
> > errors when the tuple should not even have been selected for update
> > and locking.
> >
> > I wonder if anybody knows anything about that issue? Of course you'll
> > want to see the DDL and SQL queries, etc. but you can't really try it
> > out unless you do some massively parallel magic.

I already mentioned why this cannot work that way [1].  If you have some
idea on how to make it work correctly, then it would be interesting to
hear. Otherwise, I'm sorry to say that we can't just ignore these
tuples because it happens to suit your use case.

The solution is just to make the application retry on serialisation failures.

David

[1] 
https://www.postgresql.org/message-id/CAApHDvrDH6TQeLxTqnnAnhjrs55ru5g2_QMG=me+wvd5mmp...@mail.gmail.com




Re: Query performance issue

2020-09-05 Thread David Rowley
On Sat, 5 Sep 2020 at 10:20, Nagaraj Raj  wrote:
> I added the index as you suggested and the planner going through the bitmap 
> index scan,heap and the new planner is,
> HaOx | explain.depesz.com

In addition to that index, you could consider moving away from
standard SQL and use DISTINCT ON, which is specific to PostgreSQL and
should give you the same result.

EXPLAIN ANALYZE
SELECT DISTINCT ON (serial_no) serial_no,receivingplant,sku,r3_eventtime
FROM receiving_item_delivered_received
WHERE eventtype='LineItemdetailsReceived'
  AND replenishmenttype = 'DC2SWARRANTY'
  AND coalesce(serial_no,'') <> ''
ORDER BY serial_no,eventtime DESC;

The more duplicate serial_nos you have the better this one should
perform.  It appears you don't have too many so I don't think this
will be significantly faster, but it should be a bit quicker.

David




Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-08-20 Thread David Rowley
On Fri, 21 Aug 2020 at 11:01, Michael Lewis  wrote:
>
> On Thu, Aug 20, 2020 at 4:40 PM Jim Jarvie  wrote:
>>
>> On 20-Aug.-2020 17:42, Michael Lewis wrote:
>>
>> Can you share an explain analyze for the query that does the select for
>> update? I wouldn't assume that partition pruning is possible at all with
>> hash, and it would be interesting to see how it is finding those rows.
>>
>> Well this got interesting  - the already moved error showed up:  Note, the 
>> actual process partitions are regular table partitions, these are not 
>> hashed.  Only the incoming and completed are hashed due to row counts at 
>> either end of the processing; in flight (where the issue shows up) is quite 
>> small:
>>
>> [queuedb] # explain analyze select queueid,txobject,objectid,state from 
>> mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and 
>> txobject = 'ticket' limit 250 for update skip locked;
>> ERROR:  40001: tuple to be locked was already moved to another partition due 
>> to concurrent update
>> LOCATION:  heapam_tuple_lock, heapam_handler.c:405
>> Time: 579.131 ms
>
> That is super curious. I hope that someone will jump in with an explanation 
> or theory on this.
>
> I still wonder why the move between partitions is needed though if the work 
> is either done (failed or successful) or not done... not started, retry 
> needed or in progress... it doesn't matter. It needs to get picked up by the 
> next process if it isn't already row locked.

I may be heading off in the wrong direction as I'm not fully sure I
understand what the complaint is about, but isn't the executor just
hitting dead rows in one of the active or failed partitions that have
been moved off to some other partition?

When updates occur in a non-partitioned table we can follow item
pointer chains to find the live row and check if the WHERE clause
still matches to determine if the row should be updated, or in this
case just locked since it's a SELECT FOR UPDATE. However, with
partitioned table, a concurrent UPDATE may have caused the row to have
been moved off to another partition, in which case the tuple's item
pointer cannot point to it since we don't have enough address space,
we only have 6 bytes for a TID. To get around the fact that we can't
follow these update chains, we just throw the serialization error,
which is what you're getting. Ideally, we'd figure out where the live
version of the tuple is and check if it matches the WHERE clause and
lock it if it does, but we've no means to do that with the current
design.

If the complaint is about the fact that you're getting the error and
you think you shouldn't be because you said "SKIP LOCKED" then I'm not
really sure the fact that you said "SKIP LOCKED" gives us the right to
ignore this case. The user only gave us the go-ahead to skip locked
tuples, not skip tuples that we just failed to follow item pointer
chains for.   It might be okay to do this for rows that have since
been marked as complete since they no longer match your WHERE clause,
however, if the row has gone from the queue_tx_active partition into
the queue_tx_fail_retryable partition then I don't see why we'd have
the right to skip the tuple. Your query says you want tuples that need
to be retried. We can't go skipping them.

So isn't the fix just to code the application to retry on 40001 errors?

David




Re: Performance issue

2020-06-14 Thread David Rowley
On Mon, 15 Jun 2020 at 10:46, Nagaraj Raj  wrote:
> CREATE TABLE test1
> (
...

> CONSTRAINT pk_i_entity_proxy_id PRIMARY KEY (individual_entity_proxy_id)
>
> );

> CREATE TABLE test2
> (
...

> CONSTRAINT pk_entity_proxy_id PRIMARY KEY (individual_entity_proxy_id)
> );
>
>
> User query:
>
> explain analyze select COUNT(*) as "DII_1"
>   from ( select distinct table0."individual_entity_proxy_id" as 
> "INDIVIDUAL_ENTITY_PROXY_ID"
> from test1 table0
> inner join test2 table1
> on table0."individual_entity_proxy_id" = 
> table1."individual_entity_proxy_id"

Why do you use "select distinct". It seems to me that you're putting a
distinct clause on the primary key of test1 and joining to another
table in a way that cannot cause duplicates.

I imagine dropping that distinct will speed up the query quite a bit.
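
In other words, something roughly like the following should give the same
count (a sketch based on the quoted fragment; the real query may carry extra
predicates that would need to be kept):

select count(*) as "DII_1"
from test1 table0
inner join test2 table1
  on table0.individual_entity_proxy_id = table1.individual_entity_proxy_id;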

David




Re: view reading information_schema is slow in PostgreSQL 12

2020-06-13 Thread David Rowley
On Sat, 13 Jun 2020 at 19:52, David Rowley  wrote:
>
> On Sat, 13 Jun 2020 at 16:07, Tom Lane  wrote:
> >
> > David Rowley  writes:
> > > I wondered if it would be more simple to add some smarts to look a bit
> > > deeper into case statements for selectivity estimation purposes. An
> > > OpExpr like:
> > > CASE c.contype WHEN 'c' THEN 'CHECK' WHEN 'f' THEN 'FOREIGN KEY' WHEN
> > > 'p' THEN 'PRIMARY KEY' WHEN 'u' THEN 'UNIQUE' END = 'CHECK';
> >
> > Hm.  Maybe we could reasonably assume that the equality operators used
> > for such constructs are error-and-side-effect-free, thus dodging the
> > semantic problem I mentioned in the other thread?
>
> I'm only really talking about selectivity estimation for now.
> I'm not really sure why we'd need to ensure that the equality operator
> is error and side effect free.  We'd surely only be executing the case
> statement's operator's oprrest function?  We'd need to ensure we don't
> invoke any casts that could error out.

Hmm, after a bit of thought I now see what you mean.  We'd need to
loop through each WHEN clause to ensure there's a Const and check if
that Const is equal to the Const on the other side of the OpExpr, then
select the first match. That, of course, must perform a comparison,
but that's not really doing anything additional to what constant
folding code already does, is it?

David




Re: view reading information_schema is slow in PostgreSQL 12

2020-06-13 Thread David Rowley
On Sat, 13 Jun 2020 at 16:07, Tom Lane  wrote:
>
> David Rowley  writes:
> > I wondered if it would be more simple to add some smarts to look a bit
> > deeper into case statements for selectivity estimation purposes. An
> > OpExpr like:
> > CASE c.contype WHEN 'c' THEN 'CHECK' WHEN 'f' THEN 'FOREIGN KEY' WHEN
> > 'p' THEN 'PRIMARY KEY' WHEN 'u' THEN 'UNIQUE' END = 'CHECK';
>
> Hm.  Maybe we could reasonably assume that the equality operators used
> for such constructs are error-and-side-effect-free, thus dodging the
> semantic problem I mentioned in the other thread?

I'm only really talking about selectivity estimation for now.
I'm not really sure why we'd need to ensure that the equality operator
is error and side effect free.  We'd surely only be executing the case
statement's operator's oprrest function?  We'd need to ensure we don't
invoke any casts that could error out.

David




Re: view reading information_schema is slow in PostgreSQL 12

2020-06-12 Thread David Rowley
On Sat, 13 Jun 2020 at 15:11, Tom Lane  wrote:
> I expect you're getting a fairly decent estimate for the "contype <>
> ALL" condition, but the planner has no idea what to make of the CASE
> construct, so it just falls back to a hard-wired default estimate.

This feels quite similar to [1].

I wondered if it would be more simple to add some smarts to look a bit
deeper into case statements for selectivity estimation purposes. An
OpExpr like:

CASE c.contype WHEN 'c' THEN 'CHECK' WHEN 'f' THEN 'FOREIGN KEY' WHEN
'p' THEN 'PRIMARY KEY' WHEN 'u' THEN 'UNIQUE' END = 'CHECK';

could be simplified to c.contype = 'c',  which we should have
statistics for. There'd certainly be case statement forms that
couldn't be simplified, but I think this one could.

David

[1] 
https://www.postgresql.org/message-id/flat/CAApHDvr%2B6%3D7SZBAtesEavgOQ0ZC03syaRQk19E%2B%2BpiWLopTRbg%40mail.gmail.com#3ec465f343f1204446941df29fc9e715




Re: view reading information_schema is slow in PostgreSQL 12

2020-06-12 Thread David Rowley
On Sat, 13 Jun 2020 at 06:26, regrog  wrote:
>
> I'm facing performance issues migrating from postgres 10 to 12 (also from 11
> to 12) even with a new DB.
> The performance difference is huge: 300ms in pg10 vs 3 minutes in pg12.

This appears to be down to bad statistics that cause pg12 to choose a
nested loop plan.  The pg12 plan has:

->  Hash Join  (cost=1281.91..2934.18 rows=68 width=192) (actual
time=0.024..21.915 rows=3538 loops=1848)

on the inner side of a nested loop. 21.915 * 1848 loops is 40498.92
ms, so most of the time.

This comes down to the difference caused by 04fe805a17, where after
that commit we don't bother looking at the NOT NULL constraints in
table_constraints.

explain select * from (select * from
information_schema.table_constraints) c where constraint_type <>
'CHECK';

If you execute the above on both instances, you'll see PG12 does not
do an Append. PG10 does. Which results in more rows being estimated
and the planner choosing something better than a nested loop join.

You could try: SET enable_nestloop TO off;

I'm not really sure there's much you could do to improve the
statistics on the catalogue tables.

Alternatively, you could write a view based directly on the base
tables, bypassing information_schema completely.
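
A minimal sketch of that last idea, going straight to pg_constraint for the
constraint types this query cares about (only a rough approximation of what
the information_schema view exposes):

select conrelid::regclass as table_name,
       conname            as constraint_name,
       case contype
            when 'c' then 'CHECK'
            when 'f' then 'FOREIGN KEY'
            when 'p' then 'PRIMARY KEY'
            when 'u' then 'UNIQUE'
       end                as constraint_type
from pg_constraint
where contype <> 'c';   -- the equivalent of constraint_type <> 'CHECK'

The planner can estimate contype <> 'c' directly from the catalogue
statistics, which sidesteps the CASE-based estimation problem entirely.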

David




Re: Windows slowness?

2020-06-10 Thread David Rowley
On Thu, 11 Jun 2020 at 07:41, Mikkel Lauritsen  wrote:
> I have a query that runs much slower in Postgres on Windows than on
> Linux

> Using explain analyze on the database running on Windows I get
>
> ->  Index Scan using event_pkey on event t1  (cost=0.56..0.95 rows=1
> width=295) (actual time=0.075..0.075 rows=0 loops=229227)

> If I copy the database to my laptop running Linux (Postgres 12 on Fedora
> 32, i7-9750H, 16 GB ram) I get the exact same execution plan. Explain
> analyze says
>
> ->  Index Scan using event_pkey on event t1  (cost=0.56..0.95 rows=1
> width=295) (actual time=0.008..0.008 rows=0 loops=229227)
>

> The table is freshly vacuumed. It has about 10M rows and takes about
> 2.6G disk space; the index is about 600M. Everything is cached; there's
> basically no disk I/O happening while the query is executing.

Can you confirm what: SELECT pg_relation_size('event_pkey'),
pg_relation_size('event'); says on each?

> The only Postgres configuration difference between the Windows and Linux
> environments is shared_buffers, which is 4G on my laptop and 512M on the
> Windows server, and effective_cache_size which are 8G on the laptop and
> 16G on the server.

There is some slight advantage to having the buffers directly in
shared buffers. Having them in the kernel's page cache does still
require getting them into shared buffers. Going by these sizes it
seems much more likely that the Linux instance could have all buffers
in shared_buffers, but it seems likely the Windows instance won't. I
can't imagine that counts for 10x, but it surely must count for
something.

It would be good to see:

SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) <your query>

David




Re: When to use PARTITION BY HASH?

2020-06-07 Thread David Rowley
On Sun, 7 Jun 2020 at 23:41, MichaelDBA  wrote:
> The article referenced below assumes a worst case scenario for bulk-loading 
> with hash partitioned tables.  It assumes that the values being inserted are 
> in strict ascending or descending order with no gaps (like a sequence number 
> incrementing by 1), thereby ensuring every partition is hit in order before 
> repeating the process.  If the values being inserted are not strictly 
> sequential with no gaps, then the performance is much better.  Obviously, 
> what part of the tables and indexes are in memory has a lot to do with it as 
> well.

In PostgreSQL 12, COPY was modified to support bulk-inserts for
partitioned tables. This did speed up many scenarios.  Internally, how
this works is that we maintain a series of multi insert buffers, one
per partition. We generally only flush those buffers to the table when
the buffer for the partition fills.  However, there is a sort of
sanity limit [1] on the number of multi insert buffers we maintain at
once and currently, that is 32.  Technically we could increase that
limit, but there would still need to be a limit.  Unfortunately, for
this particular case, since we're most likely touching between 199-799
other partitions before hitting the first one again, that will mean
that we really don't get any multi-inserts, which is likely the reason
why the performance is worse for hash partitioning.

With PG12 and for this particular case, you're likely to see COPY
performance drop quite drastically when going from 32 to 33
partitions.  The code was more designed for hitting partitions more
randomly rather than in this sort-of round-robin way that we're likely
to get from hash partitioning on a serial column.

David

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/commands/copy.c#L2569




Re: Date vs Timestamp without timezone Partition Key

2020-06-05 Thread David Rowley
On Sat, 6 Jun 2020 at 14:49, Cedric Leong  wrote:
> It's less of a complaint rather than just a warning not to do what I did.

My point was really that nobody really knew what you did or what you
did it on. So it didn't seem like a worthwhile warning as it
completely lacked detail.

> These tests are running the exact same query on two different tables with the 
> exception that they use their respective partition keys.

Are you sure?  It looks like the old one does WHERE date =
((now())::date - '7 days'::interval) and the new version does
(date(created_at) = ((now())::date - '7 days'::interval). I guess you
renamed date to "created_at" and changed the query to use date(). If
that expression is not indexed then I imagine that would be a good
reason for the planner to have moved away from using the index on that
column. Also having date(created_at) will also not allow run-time
pruning to work since your partition key is "created_at".

You might be able to change the query to query a range of value on the
new timestamp column. This will allow you to get rid of the date()
function. For example:

where created_at >= date_trunc('day', now() - '7 days'::interval) and
created_at < date_trunc('day', now() - '6 days'::interval)

David




Re: Date vs Timestamp without timezone Partition Key

2020-06-05 Thread David Rowley
On Sat, 6 Jun 2020 at 14:12, Cedric Leong  wrote:
> Somewhat unrelated but note to anyone who wants to swap out partition keys. 
> Don't create a clone of the table with the new partition key and insert data. 
> It messes up the query planner massively and makes everything much slower.

That complaint would have more meaning if you'd mentioned which
version of PostgreSQL you're using. The performance of partitioning in
PostgreSQL has changed significantly over the past 3 releases. Also
would be useful to know what you've actually done (actual commands).
I can't imagine it makes *everything* slower, so it might be good to
mention what is actually slower.

David




Re: AutoVacuum and growing transaction XID's

2020-05-07 Thread David Rowley
On Fri, 8 May 2020 at 09:18, github kran  wrote:
> 1)  We haven't changed anything related to autovacuum except a work_mem 
> parameter which was increased to 4 GB which I believe is not related to 
> autovacuum

You might want to look into increasing vacuum_cost_limit to something
well above 200 or dropping autovacuum_vacuum_cost_delay down from 20
to something much lower. However, you say you've not changed the
autovacuum settings, but you've also said:

>1)  I see there are 8 Vacuum workers ( Not sure what changed) running in 
> the background and the concern I have is all of these vacuum processes are 
> running with wrap around and while they are running

The default is 3, so if you have 8 then the settings are non-standard.

It might be good to supply the output of:

SELECT name,setting from pg_Settings where name like '%vacuum%';

You should know that the default speed that autovacuum runs at is
quite slow in 9.6. If you end up with all your autovacuum workers tied
up with anti-wraparound vacuums then other tables are likely to get
neglected and that could lead to stale stats or bloated tables. Best
to aim to get auto-vacuum running faster or aim to perform some manual
vacuums of tables that are over their max freeze age during an
off-peak period to make use of the lower load during those times.
Start with tables in pg_class with the largest age(relfrozenxid).
You'll still likely want to look at the speed autovacuum runs at
either way.
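
Something along these lines will show which tables to start with (a sketch):

select oid::regclass       as table_name,
       age(relfrozenxid)   as xid_age
from pg_class
where relkind in ('r', 't', 'm')    -- plain tables, TOAST tables, matviews
order by age(relfrozenxid) desc
limit 20;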

Please be aware that the first time a new cluster crosses the
autovacuum_freeze_max_age threshold can be a bit of a pain point as it
can mean that many tables require auto-vacuum activity all at once.
The impact of this is compounded if you have many tables that never
receive an UPDATE/DELETE as auto-vacuum, in 9.6, does not visit those
tables for any other reason. After the first time, the relfrozenxids
of tables tend to be more staggered so their vacuum freeze
requirements are also more staggered and that tends to cause fewer
problems.

David




Re: Explain plan changes - IN CLAUSE ( Passing direct values Vs INNER Query )

2020-05-07 Thread David Rowley
On Fri, 8 May 2020 at 10:00, David G. Johnston
 wrote:
>
> On Thu, May 7, 2020 at 11:07 AM Amarendra Konda  wrote:
>>
>> EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)  SELECT pa.process_activity_id AS 
>> pa_process_activity_id  FROM process_activity pa WHERE pa.app_id = 
>> '126502930200650' AND pa.created > '1970-01-01 00:00:00'  AND EXISTS ( 
>> SELECT 1 FROM process_instance pi where pi.app_id = pa.app_id AND 
>> pi.process_instance_id = pa.process_instance_id  AND pi.user_id = 
>> '137074931866340') ORDER BY pa.process_instance_id,  pa.created limit 50;
>>
>>
>>->  Index Scan using 
>> process_activity_process_instance_id_app_id_created_idx on 
>> public.process_activity pa  (cost=0.70..1061.62 rows=1436 width=32) (actual 
>> time=0.011..20.320 rows=23506 loops=2)
>
> > Index Cond: ((m.process_instance_id = pi.process_instance_id) AND (m.app_id 
> > = '126502930200650'::bigint) AND (m.created > '1970-01-01 
> > 00:00:00'::timestamp without time zone))
>
> I suppose during the nested loop the inner index scan could limit itself to 
> the first 50 entries it finds (since the first two index columns are being 
> held constant on each scan, m.created should define the traversal order...) 
> so that the output of the nested loop ends up being (max 2 x 50) 100 entries 
> which are then sorted and only the top 50 returned.
>
> Whether the executor could but isn't doing that here or isn't programmed to 
> do that (or my logic is totally off) I do not know.

I think the planner is likely not putting the process_activity table
on the outer side of the nested loop join due to the poor row
estimates.  If it knew that so many rows would match the join then it
likely would have done that to save from having to perform the sort at
all.  However, because the planner has put the process_instance on the
outer side of the nested loop join, it's the pathkeys from that path
that the nested loop node has, which is not the same as what the ORDER
BY needs, so the planner must add a sort step, which means that all
rows from the nested loop plan must be read so that they can be
sorted.

It might be worth trying: create index on process_instance
(user_id,app_id); as that might lower the cost of performing the join
in the opposite order and have the planner prefer that order instead.
If doing that, the OP could then ditch the
fki_conv_konotor_user_user_id index to save space.

If that's not enough to convince the planner that the opposite order
is better then certainly SET enable_sort TO off; would.
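
A minimal sketch of that suggestion (remember to reset enable_sort
afterwards if you try the second part):

CREATE INDEX ON process_instance (user_id, app_id);
-- only if the planner still prefers the sorted plan:
SET enable_sort TO off;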

David




Re: Please help! Query jumps from 1s -> 4m

2020-05-04 Thread David Rowley
On Mon, 4 May 2020 at 02:35, James Thompson  wrote:
> buffers do look different - but still, reading 42k doesn't seem like it would 
> cause a delay of 4m?

You could do: SET track_io_timing TO on;

then: EXPLAIN (ANALYZE, BUFFERS) your query and see if the time is
spent doing IO.

David




Re: 600 million rows of data. Bad hardware or need partitioning?

2020-05-03 Thread David Rowley
On Mon, 4 May 2020 at 15:52, Arya F  wrote:
>
> On Sun, May 3, 2020 at 11:46 PM Michael Lewis  wrote:
> >
> > What kinds of storage (ssd or old 5400 rpm)? What else is this machine 
> > running?
>
> Not an SSD, but an old 1TB 7200 RPM HDD
>
> > What configs have been customized such as work_mem or random_page_cost?
>
> work_mem = 2403kB
> random_page_cost = 1.1

How long does it take if you first do:

SET enable_nestloop TO off;

If you find it's faster then you most likely have random_page_cost set
unrealistically low. In fact, I'd say it's very unlikely that a nested
loop join will be a win in this case when random pages must be read
from a mechanical disk, but by all means, try disabling it with the
above command and see for yourself.

If you set random_page_cost so low to solve some other performance
problem, then you may wish to look at the effective_cache_size
setting. Having that set to something realistic should allow indexes
to be used more in situations where they're likely to not require as
much random I/O from the disk.
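
For example, to try it out in a session before touching
postgresql.conf (the value below is only a placeholder; something like
50-75% of the machine's RAM is typical):

SET effective_cache_size = '4GB';
-- then re-run EXPLAIN (ANALYZE) on the query and see if the plan changes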

David




Re: Duplicate WHERE condition changes performance and plan

2020-04-15 Thread David Rowley
On Thu, 16 Apr 2020 at 07:56, singh...@gmail.com  wrote:
> We have an odd issue where specifying the same where clause twice causes PG 
> to pick a much more efficent plan. We would like to know why.

> The EXPLAIN ANALYZE for both queries can be found here:-
> Query A: https://explain.depesz.com/s/lFuy
> Query B: https://explain.depesz.com/s/Jqmv

This is basically down to poor join selectivity estimation.
The duplicate not-equal clause is not removed by the planner, so its
selectivity is taken into account twice, which reduces the estimated
selectivity of the table named "problem". With that selectivity taken
into account, the query planner thinks a nested loop will be a more
optimal plan, which it seems to be.

Join selectivity estimations can use the most common values lists as
you may see if you look at the pg_stats view for the tables and
columns involved in the join condition.  Perhaps ID columns are not
good candidates to get an MCV list in the stats.  In that case, the
ndistinct estimate will be used.  If there's no MCV list in the stats
then check ndistinct is reasonably accurate. If there is an MCV list,
then you can make that bigger by increasing the statistics targets on
the join columns and running ANALYZE. Note: Planning can become slower
when you increase the statistics targets.
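
A sketch of what I mean, with a placeholder column name (use the
actual join column from your query):

ALTER TABLE problem ALTER COLUMN join_col SET STATISTICS 1000;
ANALYZE problem;

-- check what the stats contain for that column
SELECT n_distinct, most_common_vals IS NOT NULL AS has_mcv
FROM pg_stats
WHERE tablename = 'problem' AND attname = 'join_col';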

Starting with PostgreSQL 9.6, foreign keys are also used to help with
join selectivity estimations. I see you have a suitable foreign key
from the schema you posted.  You might want to add that to the list of
reasons to upgrade.

David




Re: pg12 partitions show bad performance vs pg96

2020-03-09 Thread David Rowley
On Tue, 10 Mar 2020 at 02:08, Mariel Cherkassky
 wrote:

> PG12 - 3 PARTITIONS
>
>  QUERY 
> PLAN
> 
>  Gather  (cost=1002.26..1563512.35 rows=10 width=44) (actual 
> time=22306.091..22309.209 rows=0 loops=1)
>Workers Planned: 2
>Workers Launched: 2
>->  Hash Join  (cost=2.26..1562511.35 rows=4 width=44) (actual 
> time=22299.412..22299.413 rows=0 loops=3)
>  Hash Cond: (da_2.device = de.id)
>  ->  Parallel Append  (cost=0.00..1562506.90 rows=814 width=37) 
> (actual time=22299.411..22299.411 rows=0 loops=3)
>->  Parallel Seq Scan on iot_data_2 da_2  
> (cost=0.00..584064.14 rows=305 width=37) (actual time=9076.535..9076.535 
> rows=0 loops=3)
>  Filter: ((metadata = 50) AND (lower(data) ~~ '50'::text))
>  Rows Removed by Filter: 12460009
>->  Parallel Seq Scan on iot_data_1 da_1  
> (cost=0.00..504948.69 rows=262 width=36) (actual time=10296.751..10296.751 
> rows=0 loops=2)
>  Filter: ((metadata = 50) AND (lower(data) ~~ '50'::text))
>  Rows Removed by Filter: 16158316
>->  Parallel Seq Scan on iot_data_0 da  (cost=0.00..473490.00 
> rows=247 width=37) (actual time=19075.081..19075.081 rows=0 loops=1)
>  Filter: ((metadata = 50) AND (lower(data) ~~ '50'::text))
>  Rows Removed by Filter: 30303339
>  ->  Hash  (cost=2.25..2.25 rows=1 width=7) (never executed)
>->  Seq Scan on iot_device de  (cost=0.00..2.25 rows=1 
> width=7) (never executed)
>  Filter: (name = '50a'::text)
>  Planning Time: 30.429 ms
>  Execution Time: 22309.364 ms
> (20 rows)

From what I can work out, the DDL you used here is:

-- you didn't seem to specify the DDL for iot_device, so I used:
create table iot_device (
id bigint primary key,
name text not null
);

insert into iot_device select x,x::Text || 'a' from generate_Series(1,100) x;

create table iot_data(id serial ,data text,metadata bigint,device
bigint references iot_device(id),primary key(id,device)) partition by
hash(device);
create table iot_data_0 partition of iot_data for values with (MODULUS
3, remainder 0);
create table iot_data_1 partition of iot_data for values with (MODULUS
3, remainder 1);
create table iot_data_2 partition of iot_data for values with (MODULUS
3, remainder 2);

insert into iot_data select
generate_series(1,1000),random()*10,random()*254,random()*99+1;
create index on iot_data(metadata,lower(data));
vacuum analyze iot_data;

In which case, you're getting a pretty different plan than I am. (I
admit that I only tried on current master and not PG12.2, however, I
see no reason that PG12.2 shouldn't produce the same plan)

I get:

# explain analyze select * from iot_data da,iot_device de where
de.name in ('50a') and de.id = da.device and da.metadata=50 and
lower(da.data) like '50';

  QUERY PLAN

 Nested Loop  (cost=0.56..28.04 rows=1 width=49) (actual
time=0.058..0.058 rows=0 loops=1)
   Join Filter: (da.device = de.id)
   ->  Seq Scan on iot_device de  (cost=0.00..2.25 rows=1 width=11)
(actual time=0.013..0.016 rows=1 loops=1)
 Filter: (name = '50a'::text)
 Rows Removed by Filter: 99
   ->  Append  (cost=0.56..25.76 rows=3 width=38) (actual
time=0.040..0.040 rows=0 loops=1)
 ->  Index Scan using iot_data_0_metadata_lower_idx on
iot_data_0 da_1  (cost=0.56..8.58 rows=1 width=38) (actual
time=0.020..0.020 rows=0 loops=1)
   Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
   Filter: (lower(data) ~~ '50'::text)
 ->  Index Scan using iot_data_1_metadata_lower_idx on
iot_data_1 da_2  (cost=0.56..8.58 rows=1 width=38) (actual
time=0.010..0.010 rows=0 loops=1)
   Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
   Filter: (lower(data) ~~ '50'::text)
 ->  Index Scan using iot_data_2_metadata_lower_idx on
iot_data_2 da_3  (cost=0.56..8.58 rows=1 width=38) (actual
time=0.009..0.009 rows=0 loops=1)
   Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
   Filter: (lower(data) ~~ '50'::text)
 Planning Time: 0.280 ms
 Execution Time: 0.094 ms
(17 rows)

Are you certain that you added an index on iot_data (metadata, lower(data)); ?




Re: pg12 partitions show bad performance vs pg96

2020-03-08 Thread David Rowley
On Mon, 9 Mar 2020 at 05:05, Mariel Cherkassky
 wrote:
> PG12 :
>  Planning Time: 8.157 ms
>  Execution Time: 2.920 ms
> (22 rows)
>
>
> PG96 :
>  Planning time: 0.815 ms
>  Execution time: 0.158 ms
> (12 rows)

8 ms seems pretty slow to planning that query. Does the planning time
drop if you execute this multiple times in the same session? Does the
time change if you try again without any foreign keys?

The planning time for the partitioned case is certainly going to take
longer. Partitioned tables are never going to improve the times of
query planning. It's only possible that they'll improve the
performance during query execution.

For such small fast queries as the ones you've shown, it's important
to remember that more complex query plans (ones with more nodes) do
lead to longer times for executor startup and shutdown.  EXPLAIN
(without ANALYZE) will perform query planning and executor
startup/shutdown. If you turn \timing on in psql and test the
EXPLAIN performance of these queries on each version, then you might
get an idea of where the overheads are.
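
For example, with the query from this thread:

\timing on
EXPLAIN SELECT * FROM iot_data da, iot_device de
WHERE de.name IN ('50a') AND de.id = da.device
  AND da.metadata = 50 AND lower(da.data) LIKE '50';

Comparing the times psql reports on each version should show whether
the extra cost is in planning or in executor startup/shutdown.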

Additionally, you're unlikely to see performance improvements with
table partitioning unless you're accessing many rows and partitioning
allows the data locality of the rows that you are accessing to
improve. i.e accesses fewer buffers and/or improves cache hit ratios.
In PG12, if the partition pruning can be done during query planning
then the planning and executor startup overhead is much lower since
there are fewer relations to generate access paths for and fewer nodes
in the final plan. This also improves the situation during execution
as it means fewer locks to take and fewer nodes to startup/shutdown.

> As you can see, the results are better in pg96. This example only shows the 
> results for a small data set. In bigger data sets I get a bigger diff...

Can you share the results of that?

> I tried changing many postgresql.conf parameters that were added 
> (max_workers_per_gather,enable_partitionwise_join and so on..).

The former only does anything for parallel queries. None of the plans
you've shown are parallel ones.   The latter also does not count in
this case. It only counts when joining two identically partitioned
tables.

> I dont understand why in pg12 it scans all the partitions instead of the 
> relevant one..

If you'd specified a specific "device" in the query SQL, then the
query planner would know which partition to scan for that particular
device. However, since you're looking up the device in another table
and performing a join, the device is only known during query
execution. The plan nodes for the non-matching partitions do go
through executor startup, but they're not scanned during execution, as
you've seen with the "(never executed)" appearing in the EXPLAIN
ANALYZE output.  Since executor startup occurs before execution, the
device you mean is still unknown during executor startup, so the
executor must start up the nodes for all partitions that are in the
plan. Starting up a plan node is not free, but not really very
expensive either. However, the overhead of it might be quite large
proportionally in your case since the executor is doing so little
work.

The most basic guideline for table partitioning is: don't partition
your tables unless it's a net win.   If partitioning were always
faster, we'd just have designed Postgres to implicitly partition all
of your tables for you. There are some other guidelines in [1].

[1] 
https://www.postgresql.org/docs/12/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-BEST-PRACTICES

David




Re: Slow performance with trivial self-joins

2020-02-05 Thread David Rowley
On Thu, 6 Feb 2020 at 11:12, Adam Brusselback  wrote:
>
> > You can create a library of
> > reusable views that are small, easy-to-understand and readable. Then
> > you build them up into bigger views, and finally query from them. But
> > then you end up with lots of (hidden) self-joins.
>
> I will concur with this use case being pretty common, but also something I
> have actively avoided anywhere performance is important because of the
> lack of this optimization.
>
> Even still, I have 20+ views like that in my database.

I think the best direction to move in to push that forward would be to
go and benchmark the proposed patch and see if the overhead of
detecting the self joined relations is measurable with various queries
with varying numbers of joins.

It does not sound like it would be a great deal of effort to look
through the rangetable for duplicate Oids and only do further
processing to attempt self-join removal if any are found. However, if that
effort happened to slow down all queries by say 5%, then perhaps it
would be a bad idea.  People's opinions don't really have much
traction for arguments on this. Unbiased and reproducible benchmarks
should be used as evidence to support discussion. Doing worst-case and
average-case benchmarks initially will save you time, as someone will
almost certainly ask if you don't do it.

(I've not been following the thread for the patch)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Slow planning, fast execution for particular 3-table query

2019-11-06 Thread David Rowley
On Thu, 7 Nov 2019 at 11:59, Justin Pryzby  wrote:
>
> On Mon, Nov 04, 2019 at 03:04:45AM +, David Wheeler wrote:
> > Postgres version 9.5.19
> > Each of the tables has between 3-4 indexes, and all the indexes include tid 
> > as first parameter.


> But note that index definition will be prohibited since:
>
> https://www.postgresql.org/docs/9.6/release-9-6.html
> |Disallow creation of indexes on system columns, except for OID columns 
> (David Rowley)
> |Such indexes were never considered supported, and would very possibly 
> misbehave since the system might change the system-column fields of a tuple 
> without updating indexes. However, previously there were no error checks to 
> prevent them from being created.

David will have meant the user column named "tid" rather than the
system column named "ctid".

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Query slow again after adding an `OR` operation (was: Slow PostgreSQL 10.6 query)

2019-10-09 Thread David Rowley
On Thu, 10 Oct 2019 at 01:07, Behrang Saeedzadeh  wrote:
>
> This is a follow up to 
> https://www.postgresql.org/message-id/flat/CAERAJ%2B-1buiJ%2B_JWEo0a9Ao-CVMWpgp%3DEnFx1dJtnB3WmMi2zQ%40mail.gmail.com
>
> The query (generated by Hibernate) got a bit more complex and performance 
> degraded again. I have uploaded all the details here (with changed table 
> names, etc.): https://github.com/behrangsa/slow-query
>
> In short, the new query is:

The query mostly appears slow due to the "Rows Removed By Filter" in
the OR condition. The only way to get around scanning the entire
branch_invoices table would be to somehow write the query in such a way
that allows that table to go on the inner side of the join.

You could do that if you ensure there's an index on branch_invoices
(branch_id) and format the query as:

SELECT inv.id           AS i_id,
       inv.invoice_date AS inv_d,
       inv.invoice_xid  AS inv_xid,
       inv.invoice_type AS inv_type,
       brs.branch_id    AS br_id,
       cinvs.company_id AS c_id
FROM invoices inv
     LEFT OUTER JOIN branch_invoices brs ON inv.id = brs.invoice_id
     LEFT OUTER JOIN company_invoices cinvs ON inv.id = cinvs.invoice_id
     INNER JOIN branches br ON brs.branch_id = br.id
WHERE brs.branch_id IN (SELECT br1.id
                        FROM branches br1
                             INNER JOIN access_rights ar1 ON br1.id = ar1.branch_id
                             INNER JOIN users usr1 ON ar1.user_id = usr1.id
                             INNER JOIN groups grp1 ON ar1.group_id = grp1.id
                             INNER JOIN group_permissions gpr1 ON grp1.id = gpr1.group_id
                             INNER JOIN permissions prm1 ON gpr1.permission_id = prm1.id
                        WHERE usr1.id = 1636
                          AND prm1.code = 'C2'
                          AND ar1.access_type = 'T1'
                        UNION ALL
                        SELECT br3.id
                        FROM companies cmp
                             INNER JOIN branches br3 ON cmp.id = br3.company_id
                             INNER JOIN access_rights ar2 ON cmp.id = ar2.company_id
                             INNER JOIN users usr2 ON ar2.user_id = usr2.id
                             INNER JOIN groups g2 ON ar2.group_id = g2.id
                             INNER JOIN group_permissions gpr2 ON g2.id = gpr2.group_id
                             INNER JOIN permissions prm2 ON gpr2.permission_id = prm2.id
                        WHERE usr2.id = 1636
                          AND prm2.code = 'C2'
                          AND ar2.access_type = 'T1')
ORDER BY inv.invoice_date DESC, br.name ASC
LIMIT 12;

The planner may then choose to pull up the subquery, uniquify it, and
put it on the outside of a nested loop join, then look up the
branch_invoices record using the index on branch_id.  I think this is
quite a likely plan since the planner estimates there's only going to
be 1 row from each of the subqueries.

Also note, that the LEFT JOIN you have to branch_invoices is not
really a left join since you're insisting that the branch_id must be
in the first or 2nd sub-plan. There's no room for it to be NULL. The
planner will just convert that to an INNER JOIN with the above query
since that'll give it the flexibility to put the subquery in the IN
clause on the outside of the join (after having uniquified it).
You'll need to decide what you actually want the behaviour to be here.
If you do need those NULL rows then you'd better move your WHERE quals
down into the join condition for branch_invoices table. I'd suggest
testing with some mock-up data if you're uncertain of what I mean.

If you find that is faster and you can't rewrite the query due to it
having been generated by Hibernate, then that sounds like a problem
with Hibernate.  PostgreSQL does not currently attempt to do any
rewrites which convert OR clauses to use UNION or UNION ALL. No amount
of tweaking the planner settings is going to change that fact.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Get the planner used by a query?

2019-10-09 Thread David Rowley
On Wed, 9 Oct 2019 at 19:21, Behrang Saeedzadeh  wrote:
>
> Is there a way to display the planner algorithm used by a query, either in 
> EXPLAIN or in a different way?

There's not really any simple way to know. If the number of relations
in the join search meets or exceeds geqo_threshold then it'll use the
genetic query optimizer. However, knowing exactly how many relations
are in the join search is not often simple since certain types of
subqueries can be pulled up into the main query and that can increase
the number of relations in the search.

If you don't mind writing C code, then you could write an extension
that hooks into join_search_hook and somehow outputs this information
to you before going on to call the geqo if the "enable_geqo &&
levels_needed >= geqo_threshold" condition is met.  Besides that, I
don't really know if there's any way.  You could try editing the
geqo_seed and seeing if the plan changes, but if it does not, then
that does not mean the geqo was not used, so doing it that way could
be quite error-prone.  You'd only be able to tell the geqo was being
used if you could confirm that changing geqo_seed did change the plan.
(And you could be certain the plan did not change for some other
reason like an auto-analyze).
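
A rough way to poke at it, bearing in mind the caveat above that an
unchanged plan proves nothing:

SHOW geqo_threshold;
SET geqo_seed = 0.5;  -- any value in 0..1 other than the current one
-- then re-run EXPLAIN on the query and compare the join order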



--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Planner performance in partitions

2019-08-12 Thread David Rowley
On Tue, 13 Aug 2019 at 08:03, Piotr Włodarczyk
 wrote:
> PostgreSQL11.5:
>
> select on main partition (public.book): planner: 60ms, execution: 5ms
> select on partitioned table (public.book_1-1000): planner: 2.7 ms, execution: 
> 2,4 ms
>
> PostgreSQL 12B3:
>
> select on main partition (public.book): planner: 2,5ms , execution: 1,2ms
> select on partitioned table (public.book_1-1000): planner: 2.5 ms, execution: 
> 1,2 ms
>
> So looking at above results we have two options:
>
> Wait for 12.0 stable version
> Wait for patches to 11 – PostgreSQL Team: can You do this?

You'll need to either reduce the number of partitions down to
something realistic or wait for 12.0.

The work done to speed up the planner with partitioned tables for v12
won't be going into v11.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: scans on table fail to be excluded by partition bounds

2019-06-26 Thread David Rowley
On Tue, 25 Jun 2019 at 05:31, Justin Pryzby  wrote:
> ts=# explain SELECT * FROM eric_enodeb_cell_metrics WHERE start_time BETWEEN 
> '2019-01-01 04:00' AND '2019-01-01 05:00' OR start_time BETWEEN '2019-01-02 
> 04:00' AND '2019-01-02 05:00';
>  Append  (cost=36.04..39668.56 rows=12817 width=2730)
>->  Bitmap Heap Scan on eric_enodeb_cell_20190101  (cost=36.04..19504.14 
> rows=6398 width=2730)
>  Recheck Cond: (((start_time >= '2019-01-01 04:00:00-05'::timestamp 
> with time zone) AND (start_time <= '2019-01-01 05:00:00-05'::timestamp with 
> time zone)) OR ((start_time >= '2019-01-02 04:00:00-05'::timestamp with time 
> zone) AND (start_time <= '2019-01-02 05:00:00-05'::timestamp with time zone)))
>  ->  BitmapOr  (cost=36.04..36.04 rows=6723 width=0)
>->  Bitmap Index Scan on eric_enodeb_cell_20190101_idx  
> (cost=0.00..16.81 rows=6465 width=0)
>  Index Cond: ((start_time >= '2019-01-01 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-01 
> 05:00:00-05'::timestamp with time zone))
>->  Bitmap Index Scan on eric_enodeb_cell_20190101_idx  
> (cost=0.00..16.03 rows=259 width=0)
>  Index Cond: ((start_time >= '2019-01-02 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-02 
> 05:00:00-05'::timestamp with time zone))
>->  Bitmap Heap Scan on eric_enodeb_cell_20190102  (cost=36.08..20100.34 
> rows=6419 width=2730)
>  Recheck Cond: (((start_time >= '2019-01-01 04:00:00-05'::timestamp 
> with time zone) AND (start_time <= '2019-01-01 05:00:00-05'::timestamp with 
> time zone)) OR ((start_time >= '2019-01-02 04:00:00-05'::timestamp with time 
> zone) AND (start_time <= '2019-01-02 05:00:00-05'::timestamp with time zone)))
>  ->  BitmapOr  (cost=36.08..36.08 rows=6982 width=0)
>->  Bitmap Index Scan on eric_enodeb_cell_20190102_idx  
> (cost=0.00..16.03 rows=259 width=0)
>  Index Cond: ((start_time >= '2019-01-01 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-01 
> 05:00:00-05'::timestamp with time zone))
>->  Bitmap Index Scan on eric_enodeb_cell_20190102_idx  
> (cost=0.00..16.84 rows=6723 width=0)
>  Index Cond: ((start_time >= '2019-01-02 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-02 
> 05:00:00-05'::timestamp with time zone))
>
> Is there some reason why the partition constraints aren't excluding any of the
> index scans ?

Yeah, we don't do anything to remove base quals that are redundant due
to the partition constraint.

There was a patch [1] to try and fix this but it's not seen any recent activity.

[1] https://commitfest.postgresql.org/19/1264/

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Temporarily very slow planning time after a big delete

2019-05-20 Thread David Rowley
On Tue, 21 May 2019 at 14:04, Walter Smith  wrote:
> I'm so sorry -- I meant to give the version, of course. It's 9.6.13.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ca930fc3
has been applied since then.

It would be good if you could confirm the problem is resolved after a
vacuum.  Maybe run VACUUM VERBOSE on the table and double check
there's not some large amount of tuples that are "nonremovable".

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Temporarily very slow planning time after a big delete

2019-05-20 Thread David Rowley
On Tue, 21 May 2019 at 12:44, Walter Smith  wrote:
>
> We had a mysterious (to us) slowdown today that I'm hoping someone can 
> explain just based on PG's principles of operation. It got better by itself 
> so it seems like it was "normal" behavior -- I just don't know what behavior 
> it was exhibiting.
>
> We have a table of user notifications containing about 80 million rows. It 
> gets a lot of inserts continually, and is cleaned up once a day. There are no 
> updates. In all history there have been about 330 million rows created.
>
> Today we deleted about 15 million rows in one transaction from this table. 
> Immediately afterwards, a particular SELECT started running very slowly -- 
> 500 to 3000 ms rather than the usual <1ms.
>
> We did an EXPLAIN ANALYZE on this select and it was still doing an index scan 
> as usual. The *planning time* for the query is what had gotten slow. The 
> query itself was still executing in <1ms.

It would be good to know which version you're running here.  It
basically sounds very much like get_actual_variable_range() will be
the culprit.  Basically, if a constant value that's being used by the
planner to determine row estimates with falls outside the statistic's
histogram and a btree index exists that we can use to look up the
actual bound of the data, then we do so in that function.  If you've
just deleted a bunch of rows then that index scan may have to traverse
a bunch of dead tuples before finding that first live tuple.  This
code has changed a few times in recent times, see fccebe421 and
3ca930fc3, which is why your version is of interest.

Apart from that, if you want to confirm that's the issue and you just
want it fixed, just VACUUM the table. You should likely be doing that
anyway directly after your bulk delete.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Poor man's partitioned index .... not being used?

2019-03-22 Thread David Rowley
On Fri, 22 Mar 2019 at 07:57, Gunther  wrote:
> foo=# PREPARE testplan(int) AS
> foo-#SELECT * FROM Test WHERE mod(id,2) = mod($1,2) AND id = $1;
> PREPARE
> foo=# EXPLAIN EXECUTE testplan(8934);
> QUERY PLAN
> --
>  Index Only Scan using test_pk0 on test  (cost=0.42..8.44 rows=1 width=4)
>Index Cond: (id = 8934)
> (2 rows)
>
> That's quite alright actually. Now the questions is, could we use this in a 
> nested loop query plan? That's where I think it can't work:

Not really. In that case, the parameters were replaced with the
specified values (a.k.a custom plan).  That happens for the first 5
executions of a prepared statement, and in this case likely the
planner will continue to use the custom plan since the generic plan
won't know that the partial index is okay to use and the plan costs
would likely go up enough that the custom plan would continue to be
favoured.

> foo=# SET enable_mergejoin TO off;
> SET
> foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = 
> mod(a.id,2) AND b.id = a.id) LIMIT 10;
>QUERY PLAN
> 
>  Limit  (cost=0.00..102516.78 rows=10 width=8)
>->  Nested Loop Left Join  (cost=0.00..38238760.24 rows=3730 width=8)
>  Join Filter: ((b.id = a.id) AND (mod(b.id, 2) = mod(a.id, 2)))
>  ->  Seq Scan on test2 a  (cost=0.00..54.30 rows=3730 width=4)
>  ->  Materialize  (cost=0.00..9056.93 rows=388129 width=4)
>->  Seq Scan on test b  (cost=0.00..5599.29 rows=388129 
> width=4)
> (6 rows)
>
> It looks like it doesn't want to evaluate the mod(a.id, 2) before it moves to 
> the index query for the nested loop.

Whether partial indexes can be used or not is determined using only
quals that can be applied at the scan level.  In this case your qual
is a join qual, and since no other qual exists that can be evaluated
at the scan level where the index can be used, the index is not
considered.  In any case, nothing there guarantees that one of your
indexes will match all records. For it to work, both of your indexes
would have to be scanned. It's not clear why you think that would be
any better than scanning just one index. I imagine it would only ever
be a win if you could eliminate one of the index scans with some qual
that guarantees that the index can't contain any records matching your
query.

> I wonder if there was a way of marking such expressions as safe in the query, 
> like suggesting a certain evaluation order, i.e.,
>
> SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON(mod(b.id,2) = 
> EVAL(mod(a.id,2)) AND b.id = a.id) LIMIT 10;
>
> It's OK though. It just goes to show that in a case like this, it is best to 
> just go with the partitioned table anyway.

It sounds like you might want something like partition-wise join that
exists in PG11.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Poor man's partitioned index .... not being used?

2019-03-20 Thread David Rowley
On Thu, 21 Mar 2019 at 15:51, Gunther  wrote:
> foo=# CREATE UNIQUE INDEX Test_pk0 ON Test(id) WHERE mod(id,2) = 0;
> CREATE INDEX
> foo=# CREATE UNIQUE INDEX Test_pk1 ON Test(id) WHERE mod(id,2) = 1;
> CREATE INDEX
> foo=# ANALYZE Test;
> ANALYZE
> foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934';
>   QUERY PLAN
> ---
>  Seq Scan on test  (cost=0.00..6569.61 rows=1 width=4)
>Filter: (id = 8934)
> (2 rows)
>
> Why is that index never used?

When the planner looks at partial indexes to see if they'll suit the
scan, the code that does the matching (predicate_implied_by()) simply
does not go to that much trouble to determine if it matches. If you
look at operator_predicate_proof() you'll see it requires the
expression on at least one side of the OpExpr to match your predicate.
Yours matches on neither side since "id" is wrapped up in a mod()
function call.

Certainly, predicate_implied_by() is by no means finished, new smarts
have been added to it over the years to allow it to prove more cases,
but each time something is added we still need to carefully weigh up
the additional overhead of the new code vs. possible benefits.

It may be possible to do something with immutable functions found in
the expr but someone doing so might have a hard time proving that it's
always safe to do so. For example, arg 2 of your mod() call is a
Const. If it had been another Var then it wouldn't be safe to use.
What other unsafe cases are there? Is there a way we can always
identify unsafe cases during planning? ... are the sorts of questions
someone implementing this would be faced with.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: IS NOT DISTINCT FROM statement

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 01:25, Artur Zając  wrote:
> I made some workaround. I made function:
>
> CREATE OR REPLACE FUNCTION smarteq(v1 int,v2 INT) RETURNS BOOL AS
> $BODY$
> SELECT (CASE WHEN v2 IS NULL THEN (v1 IS NULL) ELSE v1=v2 END);
> $BODY$ LANGUAGE 'sql' IMMUTABLE PARALLEL SAFE;

> explain analyze select id from sometable where smarteq(id1,NULL);
>   QUERY PLAN
> --
>  Bitmap Heap Scan on sometable  (cost=19338.59..57045.02 rows=882343 width=4) 
> (actual time=116.236..306.304 rows=881657 loops=1)
>Recheck Cond: (id1 IS NULL)
>Heap Blocks: exact=9581
>->  Bitmap Index Scan on sometable_index1  (cost=0.00..19118.00 
> rows=882343 width=0) (actual time=114.209..114.209 rows=892552 loops=1)
>  Index Cond: (id1 IS NULL)
>  Planning time: 0.135 ms
>  Execution time: 339.229 ms
>
> It looks like it works, but I must check if it will still works in plpgsql (I 
> expect some problems if query is prepared).

I think with either of those you'll just be at the mercy of whether a
generic or custom plan is chosen.  If you get a custom plan then
likely your case statement will be inlined and constant folded away,
but for a generic plan, that can't happen since those constants are
not Consts, they're Params.   Most likely, if you've got an index
on the column you'll perhaps always get a custom plan, as the generic
plan would result in a seq scan and it would have to evaluate your case
statement for each row.  By default, generic plans are only considered
from the 6th execution of a prepared statement onwards, and are only
chosen if the generic cost is cheaper than the average custom plan cost
plus a fuzz cost for planning. PG12 gives you a bit more control over
that with the plan_cache_mode GUC, but that's not out yet.
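
For reference, on PG12 and later that would look like this (not an
option on the versions that are out now):

SET plan_cache_mode = force_custom_plan;  -- or force_generic_plan / auto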

However, possibly the cost of planning each execution is cheaper than
doing the seq scan, so you might be better off with this.  There is a
risk that the planner does for some reason choose a generic plan and
ends up doing the seq scan, but for that to happen likely the table
would have to be small, in which case it wouldn't matter or the costs
would have to be off, which might cause you some pain.

The transformation mentioned earlier could only work if the arguments
of the IS NOT DISTINCT FROM were Vars or Consts. It couldn't work with
Params since the values are unknown to the planner.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: IS NOT DISTINCT FROM statement

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 00:30, Artur Zając  wrote:
> Is there any reason that „NOT DISTINCT FROM” can’t be autotransformed to „=” 
> when value on right side of expression is not NULL or is this any way to use 
> index with „IS NOT DISTINCT FROM” statement?

Probably nothing other than nobody has done it yet.  It might be
reasonable to have some transformation stage called from
distribute_restrictinfo_to_rels() when adding single rel RestrictInfos
to RTE_RELATION base rels.  It's only these you can check for NOT NULL
constraints, i.e. not so possible with rtekinds such as RTE_FUNCTION
and the like.

It becomes more complex if you consider that someone might have added
a partial index on the relation that matches the IS NOT DISTINCT FROM
clause.  In this case, they might not be happy that their index can no
longer be used. Fixing that would require some careful surgery on
predicate_implied_by() to teach it about IS NOT DISTINCT FROM clauses.
However, that seems to go a step beyond what predicate_implied_by()
does for now. Currently, it only gets to know about quals. Not the
relations they belong to, so there'd be no way to know that the NOT
NULL constraint exists from there.  I'm not sure if there's a good
reason for this or not, it might be because it's not been required
before.  It gets more complex still if you want to consider other
quals in the list to prove not nullness.

In short, probably possible, but why not just write an equality
clause, if you know NULLs are not possible?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-23 Thread David Rowley
On Sun, 24 Feb 2019 at 10:06, Gunther  wrote:
> I am using an SQL queue for distributing work to massively parallel workers. 
> Workers come in servers with 12 parallel threads. One of those worker sets 
> handles 7 transactions per second. If I add a second one, for 24 parallel 
> workers, it scales to 14 /s. Even a third, for 36 parallel workers, I can add 
> to reach 21 /s. If I try a fourth set, 48 workers, I end up in trouble. But 
> that isn't even so much my problem rather than the fact that in short time, 
> the performance will deteriorate, and it looks like that is because the queue 
> index deteriorates and needs a REINDEX.

It sounds very much like auto-vacuum is simply unable to keep up with
the rate at which the table is being updated.   Please be aware, that
by default, auto-vacuum is configured to run fairly slowly so as not
to saturate low-end machines.

vacuum_cost_limit / autovacuum_vacuum_cost_limit control how many
"points" the vacuum process can accumulate before it sleeps for
autovacuum_vacuum_cost_delay / vacuum_cost_delay.

Additionally, after an auto-vacuum run completes it will wait for
autovacuum_naptime before checking again if any tables require some
attention.

I think you should be monitoring how many auto-vacuum workers are
busy during your runs. If you find that the "queue" table is being
vacuumed almost constantly, then you'll likely want to increase
vacuum_cost_limit / autovacuum_vacuum_cost_limit. You could get an
idea of how often this table is being auto-vacuumed by setting
log_autovacuum_min_duration to 0 and checking the logs.  Another way
to check would be to sample what: SELECT query FROM pg_stat_activity
WHERE query LIKE 'autovacuum%'; returns. You may find that all of the
workers are busy most of the time.  If so, that indicates that the
cost limits need to be raised.
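
Something like this, sampled every so often, would show it (the new
cost limit below is only an example value; tune it to what your I/O
can sustain):

SELECT count(*) FROM pg_stat_activity WHERE query LIKE 'autovacuum%';

-- if the workers are busy nearly all the time:
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
SELECT pg_reload_conf();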

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why isn't an index scan being used?

2019-02-19 Thread David Rowley
On Wed, 20 Feb 2019 at 13:11, Abi Noda  wrote:
> However, when I index the closed column, a bitmap scan is used instead of an 
> index scan, with slightly slower performance. Why isn't an index scan being 
> used, given that the exact same number of rows are at play as in my query on 
> the state column?

That's down to the planner's cost estimates. Likely it thinks that
either doing a bitmap scan is cheaper, or close enough that it does
not matter.

> How do I index closed in a way where an index scan is used?

The costing does account for the size of the index. If the
"closed_index" index is larger than the "state_index", then doing an
index scan on "closed_index" is going to be costed higher.

Most of this likely boils down to random_page_cost being a guess. You
may want to check that your effective_cache_size is set to something
like 75% of the machine's memory, and/or tweak random_page_cost down
if it's still at the standard setting of 4.  Modern SSDs are pretty
fast at random reads. HDDs, not so much.
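
For example (the values are only illustrative; adjust them to the
machine's actual RAM and storage):

ALTER SYSTEM SET effective_cache_size = '12GB';  -- roughly 75% of a 16GB machine
ALTER SYSTEM SET random_page_cost = 1.1;         -- closer to reality for SSDs
SELECT pg_reload_conf();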

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread David Rowley
On Thu, 7 Feb 2019 at 02:34, Mariel Cherkassky
 wrote:
> As I said, I set the next settings for the toasted table :
>
>  alter table orig_table set (toast.autovacuum_vacuum_scale_factor = 0);
>
>   alter table orig_table set (toast.autovacuum_vacuum_threshold  =1);

These settings don't control how fast auto-vacuum runs, just when it should run.

> Can you explain a little bit more why you decided that the autovacuum spent 
> it time on sleeping ?

Yeah, if you look at the following settings.

 vacuum_cost_limit   | 200
 vacuum_cost_page_dirty  | 20
 vacuum_cost_page_hit| 1
 vacuum_cost_page_miss   | 10
 autovacuum_vacuum_cost_delay| 20ms

I've tagged on the default setting for each of these.  Both vacuum and
auto-vacuum keep score of how many points they've accumulated while
running. 20 points for dirtying a page, 10 for a read that's not found
to be in shared_buffers, 1 for reading a buffer from shared buffers.
When vacuum_cost_limit points is reached (or
autovacuum_vacuum_cost_limit if not -1) auto-vacuum sleeps for
autovacuum_vacuum_cost_delay, normal manual vacuums sleep for
vacuum_cost_delay.

In one of the log entries you saw:

> buffer usage: 4808221 hits, 6404148 misses, 6152603 dirtied
> avg read rate: 2.617 MiB/s, avg write rate: 2.514 MiB/s
> system usage: CPU 148.65s/70.06u sec elapsed 19119.55 sec

Doing a bit of maths to see how much time that vacuum should have slept for:

postgres=# select (4808221 * 1 + 6404148 * 10 + 6152603 * 20) / 200.0
* 20 / 1000;
  ?column?
------------
 19190.1761

That's remarkably close to the actual time of 19119.55 sec. If you do
the same for the other 3 vacuums then you'll see the same close match.

> I see the autovacuum statistics from the logs, how can I check that the 
> workers are busy very often ?

It would be nice if there was something better, but periodically doing:

SELECT count(*) FROM pg_stat_activity where query like 'autovacuum%';

will work.

> My vacuum limit is 200 right now, basically If vacuum runs on my toasted 
> table and reached 200 but it didnt finish to clean all the dead tuples, after 
> the nap, should it continue cleaning it or wait until the vacuum_threshold 
> hit again ?

You're confusing nap time with something else, maybe with the speed of
vacuum?  Napping is just the time auto-vacuum will wait between
checking for new tables to work on.  Having auto-vacuum run so slowly
is a probable cause of still having dead tuples after the vacuum...
likely because they became dead after the vacuum started.

I'd recommend reading the manual or Tomas Vondra's blog about vacuum
costs. It's not overly complex, once you understand what each of the
vacuum settings does.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread David Rowley
On Thu, 7 Feb 2019 at 00:17, Laurenz Albe  wrote:
>
> On Wed, 2019-02-06 at 12:29 +0200, Mariel Cherkassky wrote:
> > Now the question is how to handle or tune it ? Is there any change that I 
> > need to increase the cost_limit / cost_delay ?
>
> Maybe configuring autovacuum to run faster will help:
>
> alter table orig_table set (toast.autovacuum_vacuum_cost_limit = 2000);
>
> Or, more extreme:
>
> alter table orig_table set (toast.autovacuum_vacuum_cost_delay = 0);

Going by the block hits/misses/dirtied and the mentioned vacuum times,
it looks like auto-vacuum is set to the standard settings and if so it
spent about 100% of its time sleeping on the job.

It might be a better idea to consider changing the vacuum settings
globally rather than just for one table.

Running a vacuum_cost_limit of 200 is likely something you'd not want
to ever do with modern hardware... well maybe unless you just bought
the latest Raspberry PI, or something.  You should be tuning that
value to something that runs your vacuums to a speed you're happy with
but leaves enough IO and CPU for queries running on the database.

If you see that all auto-vacuum workers are busy more often than not,
then they're likely running too slowly and should be set to run more
quickly.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Zero throughput on a query on a very large table.

2019-01-24 Thread David Rowley
On Fri, 25 Jan 2019 at 19:24, Tom Lane  wrote:
> PS: On the third hand, you mention having created new indexes on this
> table with apparently not a lot of pain, which is a tad surprising
> if you don't have the patience to wait for a sort to finish.  How
> long did those index builds take?

It would certainly be good to look at psql's \d tmp_outpatient_rev
output to ensure that the index is not marked as INVALID.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Parallel stats in execution plans

2019-01-24 Thread David Rowley
On Thu, 17 Jan 2019 at 00:31, David Conlin  wrote:
> How the time values combine with parallelism. For example, each execution of 
> the sort node takes an average of 48.5s, over three loops. This makes a total 
> running time of 145.5s. Even if this was perfectly distributed between the 
> two workers, I would expect this to take 72.75s, which is more than the total 
> execution time, so it can't take this long.
>  How the row numbers combine with those in the "Workers" subkey. For example, 
> in the Sort node, worker #0 produces 3,277,980 rows, while worker #1 produces 
> 3,385,130 rows. The Sort node as a whole produces 3,333,337 rows per loop, 
> for a total of 10,000,010 (the value in the gather merge node). I would have 
> expected the number of rows produced by the two workers to sum to the number 
> produced by the Sort node as a whole, either per loop or in total.
> How the "Actual Loops" values combine with those in the "Workers" subkey. For 
> example, the "Sort" node has 3 loops, but each of the workers inside it have 
> 1 loop. I would have expected either:
>
> each of the workers to have done 3 loops (since the sort is executed 3 
> times), or
> the number of loops in the two workers to sum to three (if the three 
> executions of the sort are distributed across the two workers)

It's important to know that all of the actual row counts and actual
time are divided by the number of loops, which in this case is 3, one
per process working on that part of the plan.  There are two workers,
but also the main process helps out too.
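
To put rough numbers on that from your plan: worker #0 produced
3,277,980 rows and worker #1 produced 3,385,130, which leaves roughly
3,336,900 rows for the leader; those three sum to the 10,000,010 total,
and 10,000,010 / 3 loops gives the 3,333,337 rows per loop shown for
the Sort node. Likewise, the 48.5s figure is already a per-process
average of three processes running concurrently, so the wall-clock
time spent sorting is around 48.5s, not 145.5s.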

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-09 Thread David Rowley
On Thu, 10 Jan 2019 at 01:55, Abadie Lana  wrote:
> The other query suggested by D.Rowley has the same issue : still swap 
> activity is higher.
> explain analyze select 'BUIL-B36-VA-RT-RT1:CL0001-2-ABW', 
> c.smpl_time,c.nanosecs,c.float_val,c.num_val,c.str_val,c.datatype,c.array_val 
> from sample c where c.channel_id in (select channel_id from channel where 
> name ='BUIL-B36-VA-RT-RT1:CL0001-2-ABW')   order by c.smpl_time  desc limit 5;

This is not the query I suggested.  I mentioned if channel.name had a
unique index, you'd be able to do WHERE c.channel_id = (select
channel_id from channel where name = '...').  That's pretty different
to what you have above.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-03 Thread David Rowley
On Fri, 4 Jan 2019 at 02:20, Abadie Lana  wrote:
> > From: David Rowley 
> > Sent: 03 January 2019 14:01
> Right, so you need to check your indexes on sample_ctrl_year and 
> sample_buil_year. You need an index on (channel_id, smpl_time) on those.

> These indexes exist already

That's interesting. The \d output indicates that the indexes are not
INVALID, so it's not all that obvious why the planner would choose a
lesser index to provide the required rows. One thought is that the
more suitable index is very bloated.  This would increase the
estimated cost of scanning the index and reduce the chances of the
index being selected by the query planner.

If you execute:

select indrelid::regclass as table_name, indexrelid::Regclass as
index_name,pg_size_pretty(pg_relation_size(indrelid))
table_size,pg_size_pretty(pg_relation_size(indexrelid)) index_size
from pg_index
where indrelid in('sample_ctrl_year'::regclass, 'sample_buil_year'::regclass)
order by indrelid::regclass::name, indexrelid::regclass::name;

This should show you the size of the tables and indexes in question.
If the sample_time_cy_idx and sample_time_by_idx indexes are very
large when compared with the size of their table, then it is likely
worth building a new index for these then dropping the old index then
retrying the re-written version of the query.  If this is a live
system then you can build the new indexes by using the CREATE INDEX
CONCURRENTLY command.  This will allow other DML operations to work
without being blocked. The old indexes can then be dropped with DROP
INDEX CONCURRENTLY.
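
A minimal sketch, assuming sample_time_cy_idx turns out to be the
bloated (channel_id, smpl_time) index on sample_ctrl_year (repeat the
same for sample_buil_year / sample_time_by_idx):

CREATE INDEX CONCURRENTLY sample_time_cy_idx_new
    ON sample_ctrl_year (channel_id, smpl_time);
DROP INDEX CONCURRENTLY sample_time_cy_idx;
ALTER INDEX sample_time_cy_idx_new RENAME TO sample_time_cy_idx;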

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-03 Thread David Rowley
> From: David Rowley 
> Sent: 03 January 2019 14:01
> That's surprisingly slow. Can you share the EXPLAIN (ANALYZE, BUFFERS) of 
> that?
>
> explain (analyze,buffers) select 
> 'BUIL-B36-VA-RT-RT1:CL0001-2-ABW',c.smpl_time,c.nanosecs,c.float_val,c.num_val,c.str_val,c.datatype,c.array_val
>  from sample c WHERE c.channel_id = (SELECT channel_id FROM channel WHERE 
> name='BUIL-B36-VA-RT-RT1:CL0001-2-ABW') order by c.smpl_time desc limit 5;


>->  Index Scan Backward using smpl_time_bx2_idx on 
> sample_buil_year c_5  (cost=0.56..2023925.30 rows=3162364
> width=320) (actual time=15167.330..15167.330 rows=0 loops=1)
>  Filter: (channel_id = $0)
>  Rows Removed by Filter: 50597834
>  Buffers: shared hit=25913147 read=713221
>->  Index Scan Backward using sample_time_cm_idx on 
> sample_ctrl_month c_6  (cost=0.56..1862587.12 rows=537562
>  width=77) (actual time=0.048..0.048 rows=0 loops=1)
>  Index Cond: (channel_id = $0)
>  Buffers: shared read=4
>->  Index Scan Backward using smpl_time_cmx2_idx on 
> sample_ctrl_year c_7  (cost=0.57..3186305.67 rows=2094186
>  width=68) (actual time=25847.549..25847.549 rows=0 loops=1)
>  Filter: (channel_id = $0)
>  Rows Removed by Filter: 79579075
>  Buffers: shared hit=49868991 read=1121715

Right, so you need to check your indexes on sample_ctrl_year and
sample_buil_year. You need an index on (channel_id, smpl_time) on
those.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-03 Thread David Rowley
On Fri, 4 Jan 2019 at 01:57, Abadie Lana  wrote:
> 4) name is unique, constraint and index created. Right index is picked up and 
> query time is rather constant there 40sec.

That's surprisingly slow. Can you share the EXPLAIN (ANALYZE, BUFFERS) of that?


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


