Re: Hash Right join and seq scan

2024-07-05 Thread David Rowley
On Sat, 6 Jul 2024 at 02:43, James Pang  wrote:
>for nest loop path, since the first one estimated only "8" rows , 
> and they use partitionkeyid as joinkey and all are hash partitions , is it 
> better to estimate cost to  8 (loop times) * 1600 = 12800 (each one loop  map 
> to only 1 hash partition bitmap scan ,avg one partition cost), that's much 
> less than 398917.29 of all partitions ?

I'm not really sure where you're getting the numbers from here. The
outer side of the deepest nested loop has an 8 row estimate, not the
nested loop itself.  I'm unsure where the 1600 is from. I only see
1669.

As of now, we don't do a great job of costing for partition pruning
that will happen during execution.  We won't be inventing anything to
fix that in existing releases of PostgreSQL, so you'll need to either
adjust the code yourself, or find a workaround.

You've not shown us your schema, but perhaps enable_partitionwise_join
= on might help you. Other things that might help are further lowering
random_page_cost or raising effective_cache_size artificially high.
It's hard to tell from here how much random I/O is being costed into
the index scans.  You could determine this by checking if the nested
loop plan costs change as a result of doing further increases to
effective_cache_size. You could maybe nudge it up enough for it to win
over the hash join plan. It is possible that this won't work, however.
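For example, you could experiment at the session level first (the
values below are only illustrative, not recommendations):

SET enable_partitionwise_join = on;
SET random_page_cost = 1.1;
SET effective_cache_size = '100GB';
-- re-run EXPLAIN (ANALYZE, BUFFERS) on the query here and compare
-- the nested loop and hash join costs
RESET ALL;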

>  for secondary Nest Loop Anti join could be rows 299118 rows *  15.78(avg 
> index scan cost of one partition) = 4,720,082 that still much less than  
> 132168227.57 ?
>  for Hash Right join, is it possible to estimate by  8 seq partition 
> scan instead of all 32 hash partitions since the first query estimated 8 rows 
> only ?
>  extend statistics may help estimate count(partitionkeyid) based on 
> other columns bind variables, but looks like that did not help table join 
> case.

I can't quite follow this. You'll need to better explain where you're
getting these numbers for me to be able to understand.

David




Re: Hash Right join and seq scan

2024-07-04 Thread David Rowley
On Fri, 5 Jul 2024 at 12:50, James Pang  wrote:
>we have a daily vacuumdb and analyze job, generally speaking it's done in 
> seconds, sometimes it suddenly running more than tens of minutes with same 
> bind variable values and huge temp space got used  and at that time, explain 
> show  "Hash Anti join, Hash Right join" with seq scan two tables.

There was talk about adding costing for run-time partition pruning
factors but nothing was ever agreed, so nothing was done.  It's just
not that obvious to me how we'd do that.  If the Append had 10
partitions as subnodes, with an equality join condition, you could
assume we'll only match to 1 of those 10, but we've no idea at plan
time which one that'll be and the partitions might drastically vary in
size.  The best I think we could do is take the total cost of those 10
and divide by 10 to get the average cost.  It's much harder for range
conditions as those could match anything from 0 to all partitions. The
best suggestion I saw for that was to multiply the costs by
DEFAULT_INEQ_SEL.

I think for now, you might want to lower the random_page_cost or
increase effective_cache_size to encourage the nested loop -> index
scan plan. A good range for effective_cache_size is anywhere between
50% and 75% of your server's RAM.  However, that might not be ideal if your
server is under memory pressure from other running processes. It also
depends on how large shared_buffers are as a percentage of total RAM.

David




Re: a lot of shared buffers hit when planning for a simple query with primary access path

2024-07-01 Thread David Rowley
On Mon, 1 Jul 2024 at 22:20, Pavel Stehule  wrote:
> The planners get min/max range from indexes. So some user's indexes can be 
> bloated too with similar effect

I considered that, but it doesn't apply to this query as there are no
range quals.

David




Re: a lot of shared buffers hit when planning for a simple query with primary access path

2024-07-01 Thread David Rowley
On Mon, 1 Jul 2024 at 21:45, James Pang  wrote:
>Buffers: shared hit=110246 <<< here planning need access a lot of 
> buffers
>  Planning Time: 81.850 ms
>  Execution Time: 0.034 ms
>
>could you help why planning need a lot of shared buffers access ?

Perhaps you have lots of bloat in your system catalogue tables. That
could happen if you make heavy use of temporary tables. There are many
other reasons too. It's maybe worth doing some vacuum work on the
catalogue tables.
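For example, something like the following gives a rough idea of which
catalogue tables have accumulated dead rows (pg_stat_sys_tables only
knows what the statistics collector has seen):

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_sys_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- heavy temp table churn tends to bloat these in particular
VACUUM (VERBOSE) pg_catalog.pg_attribute;
VACUUM (VERBOSE) pg_catalog.pg_class;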

David




Re: Plan selection based on worst case scenario

2024-05-29 Thread David Rowley
On Thu, 30 May 2024 at 13:03, Darwin O'Connor  wrote:
> Is there a PostgreSQL setting that can control how it judges plans?

There's nothing like that, unfortunately.

> Here is a recent example of a query that finds the last time at a stop 
> filtered for a certain route it has to look up another table to find. 
> PostgreSQL initially chose the plan that cost "37357.45..37357.45" rather 
> than the one that cost "1.15..61088.32".
>
> transsee=# explain analyze select t.time+coalesce(t.timespent,0)*interval '1 
> second' from trackstopscurr t join tracktrip r on r.a=0 and r.route='501' and 
> r.id=t.trackid and t.stopid='4514' order by t.time desc limit 1;

>  Limit  (cost=37357.45..37357.45 rows=1 width=16) (actual 
> time=2667.674..2694.047 rows=1 loops=1)

>->  Nested Loop  (cost=182.60..36357.34 rows=1 width=16) 
> (actual time=381.913..2659.412 rows=1342 loops=3)
>  ->  Parallel Bitmap Heap Scan on trackstopscurr t  
> (cost=182.03..19048.63 rows=2014 width=14) (actual time=380.467..1231.788 
> rows=8097 loops=3)
>Recheck Cond: ((stopid)::text = '4514'::text)
>Heap Blocks: exact=8103
>->  Bitmap Index Scan on trackstopscurr_2  
> (cost=0.00..180.82 rows=4833 width=0) (actual time=382.653..382.653 
> rows=24379 loops=1)
>  Index Cond: ((stopid)::text = '4514'::text)
>  ->  Index Scan using tracktrip_0 on tracktrip r  
> (cost=0.57..8.59 rows=1 width=4) (actual time=0.175..0.175 rows=0 loops=24290)
>Index Cond: (id = t.trackid)
>Filter: ((a = 0) AND ((route)::text = '501'::text))
>Rows Removed by Filter: 1
>  Planning Time: 0.228 ms
>  Execution Time: 2694.077 ms

The problem here is primarily down to the poor estimates for the scan
on tracktrip.  You can see that the Nested Loop estimates 1 row, so
therefore the LIMIT costing code thinks LIMIT 1 will require reading
all rows, all 1 of them.  In which case that's expected to cost
36357.34, which is cheaper than the other plan which costs 61088.32 to
get one row.

If you can fix the row estimate to even estimate 2 rows rather than 1,
then it'll choose the other plan.  An estimate of 2 rows would mean
the total cost of the best path after the limit would be 61088.32 / 2
= 30544.16, which is cheaper than the 36357.34 of the bad plan.

You could try ANALYZE on tracktrip, or perhaps increasing the
statistics targets on the columns being queried here.
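For example (1000 is just an illustrative target):

ALTER TABLE tracktrip ALTER COLUMN a SET STATISTICS 1000;
ALTER TABLE tracktrip ALTER COLUMN route SET STATISTICS 1000;
ANALYZE tracktrip;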

If there's a correlation between the "a" and "route" columns then you
might want to try CREATE STATISTICS:

CREATE STATISTICS ON a,route FROM tracktrip;
ANALYZE tracktrip;

David




Re: Slow query in table where many rows were deleted. VACUUM FULL fixes it

2024-01-30 Thread David Rowley
On Wed, 31 Jan 2024 at 09:09, Philip Semanchuk
 wrote:
> So in your case those 5m rows that you deleted were probably still clogging 
> up your table until you ran VACUUM FULL.

It seems more likely to me that the VACUUM removed the rows and just
left empty pages in the table.  Since there's no index on expires_at,
the only way to answer that query is to Seq Scan and Seq Scan will
need to process those empty pages.  While that processing is very fast
if the page's item pointers array is empty, it could still be slow if
the page needs to be read from disk. Laurenz's request for the explain
(analyze, buffers) output with track_io_timing on will help confirm
this.

If it is just reading empty pages that's causing this issue then
adding that missing index would improve the situation after running
just plain VACUUM each time there's a bulk delete.
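Something along these lines, say (the table name here is a placeholder
since the schema wasn't shown):

SET track_io_timing = on;
-- re-run the query with EXPLAIN (ANALYZE, BUFFERS) to see I/O times
CREATE INDEX CONCURRENTLY ON the_table (expires_at);
VACUUM the_table;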

David




Re: I don't understand that EXPLAIN PLAN timings

2024-01-25 Thread David Rowley
On Fri, 26 Jan 2024 at 17:23, Jean-Christophe Boggio
 wrote:
> Let me know if I can do anything to provide you with more useful
> benchmark. The DB is still very small so it is easy to do tests.

What I was looking to find out was if there was some enable_* GUC that
you could turn off that would make the unaccounted time that you were
complaining about go away.

Because it seems when you reduce work_mem this unaccounted for time
goes away, it makes me think that some executor node is allocating a
load of memory during executor startup.  I was hoping to find out
which node is the offending one by the process of elimination.

Are you still seeing this unaccounted for time with your notebook?
i.e. the relatively high "Execution Time" reported by EXPLAIN ANALYZE
and low total actual execution time on the plan's top-level node.

I probably didn't need to mention the planning time as it seems
unlikely that disabling an enable* GUC would result in increased
planning time. However, it does not seem impossible that that *could*
happen.

David




Re: I don't understand that EXPLAIN PLAN timings

2024-01-25 Thread David Rowley
On Fri, 26 Jan 2024 at 02:31, Jean-Christophe Boggio
 wrote:
> You are absolutely correct : the EXPLAIN without ANALYZE gives about the same 
> results. Also, minimizing the amount of workmem in postgresql.conf changes 
> drastically the timings. So that means memory allocation is eating up a lot 
> of time _PER_QUERY_ ?

We do reuse palloc'd blocks when creating memory contexts, but I
believe only for the 1k and 8k blocks.  That likely allows most small
allocations in the executor to be done without malloc.  I'm speaking
in vague terms as I don't have the exact numbers to hand, but larger
allocations will go directly to malloc.

There was a bug fixed in [1] that did cause behaviour like this, but
you seem to be on 14.10 which will have that fix.  Also, the 2nd plan
you sent has no Memoize nodes.

I do wonder now if it was a bad idea to make Memoize build the hash
table on plan startup rather than delaying that until we fetch the
first tuple. I see Hash Join only builds its table during executor
run.

> Since we have quite some RAM on our machines, I dedicated as much as possible 
> to workmem (initially I was allocating 1GB) but this looks quite 
> counterproductive (I didn't think that memory was allocated every time, I 
> thought it was "available" for the current query but not necessarily used). 
> Is this an issue specific to that version of PostgreSQL? (I guess no) Or can 
> this be hardware-related? Or OS-related (both systems on which I have done 
> tests are running Ubuntu, I will try on Debian)?

It would be good to narrow down which plan node is causing this.  Can
you try disabling various planner enable_* GUCs before running EXPLAIN
(SUMMARY ON)  with \timing on and see if you can find
which enable_* GUC causes the EXPLAIN to run more quickly?  Just watch
out for variations in the timing of "Planning Time:". You're still
looking for a large portion of time not accounted for by planning
time.

I'd start with:

SET enable_memoize=0;
EXPLAIN (SUMMARY ON) ;
RESET enable_memoize;

SET enable_hashjoin=0;
EXPLAIN (SUMMARY ON) ;
RESET enable_hashjoin;

The following will show others that you could try.
select name,setting from pg_settings where name like 'enable%';

David

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1e731ed12aa




Re: I don't understand that EXPLAIN PLAN timings

2024-01-23 Thread David Rowley
On Tue, 23 Jan 2024 at 20:45, Jean-Christophe Boggio
 wrote:
> explain says actual time between 1.093→1.388 but final execution time says 
> 132.880ms?!?

The 1.388 indicates the total time spent in that node starting from
just before the node was executed for the first time up until the node
returned no more tuples.

The 132.88 ms is the total time spent to start up the plan, which
includes doing things like obtaining locks (if not already obtained
during planning), opening files and allocating memory for various
things needed for the plan.  After the top-level node returns NULL it
then will print out the plan before shutting the plan down. This
shutdown time is also included in the 132.88 ms.

I don't know where exactly the extra time is spent with your query,
but it must be in one of the operations that I mentioned which takes
place either before the top node is first executed or after the top
node returns NULL.

If you're using psql, if you do \timing on, how long does EXPLAIN take
without ANALYZE? That also goes through executor startup and shutdown.
It just skips the running the executor part.

David




Re: [EXTERNAL] Re: Performance down with JDBC 42

2023-11-05 Thread David Rowley
On Mon, 6 Nov 2023 at 08:37, Abraham, Danny  wrote:
>
> Both plans refer to the same DB.

JDBC is making use of PREPARE statements, whereas psql, unless you're
using PREPARE explicitly, is not.

> #1 – Fast – using psql or old JDBC driver

The absence of any $1 type parameters here shows that's a custom plan
that's planned specifically using the parameter values given.

> Slow – when using JDBC 42

Because this query has $1, $2, etc, that's a generic plan. When
looking up statistics, histogram bounds and MCV slots cannot be
checked; only ndistinct is used. If you have a skewed dataset, then
this might not be very good.

You might find things run better if you adjust postgresql.conf and set
plan_cache_mode = force_custom_plan then select pg_reload_conf();
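You can also try it at the session level first. As a simplified
illustration of the custom vs generic difference (table and column
names made up):

PREPARE q(int) AS SELECT * FROM some_table WHERE some_col = $1;
EXPLAIN (ANALYZE) EXECUTE q(42);  -- repeat this; after 5 executions a
                                  -- generic plan may be chosen
SET plan_cache_mode = force_custom_plan;
EXPLAIN (ANALYZE) EXECUTE q(42);  -- planned with the actual value again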

Please also check the documentation so that you understand the full
implications for that.

David




Re: Planning time is time-consuming

2023-09-11 Thread David Rowley
On Tue, 12 Sept 2023 at 02:27, Tom Lane  wrote:
>
> David Rowley  writes:
> > I'm not sure if you're asking for help here because you need planning
> > to be faster than it currently is, or if it's because you believe that
> > planning should always be faster than execution. If you think the
> > latter, then you're mistaken.
>
> Yeah.  I don't see anything particularly troubling here.  Taking
> circa three-quarters of a millisecond (on typical current hardware)
> to plan a four-way join on large tables is not unreasonable.

I took a few minutes to reverse engineer the tables in question (with
assistance from an AI bot) and ran the query in question.
Unsurprisingly, I also see planning as slower than execution, but with
a ratio of about planning being 12x slower than execution vs the
reported ~18x.

Planning Time: 0.581 ms
Execution Time: 0.048 ms

Nothing alarming in perf top of executing the query in pgbench with -M
simple.  I think this confirms the problem is just with expectations.

   5.09%  postgres  [.] AllocSetAlloc
   2.99%  postgres  [.] SearchCatCacheInternal
   2.52%  postgres  [.] palloc
   2.38%  postgres  [.] expression_tree_walker_impl
   1.82%  postgres  [.] add_path_precheck
   1.78%  postgres  [.] add_path
   1.73%  postgres  [.] MemoryContextAllocZeroAligned
   1.63%  postgres  [.] base_yyparse
   1.61%  postgres  [.] CatalogCacheComputeHashValue
   1.38%  postgres  [.] try_nestloop_path
   1.36%  postgres  [.] stack_is_too_deep
   1.33%  postgres  [.] add_paths_to_joinrel
   1.19%  postgres  [.] core_yylex
   1.18%  postgres  [.] lappend
   1.15%  postgres  [.] initial_cost_nestloop
   1.13%  postgres  [.] hash_search_with_hash_value
   1.01%  postgres  [.] palloc0
   0.95%  postgres  [.] get_memoize_path
   0.90%  postgres  [.] equal
   0.88%  postgres  [.] get_eclass_for_sort_expr
   0.81%  postgres  [.] compare_pathkeys
   0.80%  postgres  [.] bms_is_subset
   0.77%  postgres  [.] ResourceArrayRemove
   0.77%  postgres  [.] check_stack_depth
   0.77%  libc.so.6 [.] __memmove_avx_unaligned_erms
   0.74%  libc.so.6 [.] __memset_avx2_unaligned
   0.73%  postgres  [.] AllocSetFree
   0.71%  postgres  [.] final_cost_nestloop
   0.69%  postgres  [.] compare_path_costs_fuzzily
   0.68%  postgres  [.] initial_cost_mergejoin
   0.64%  libc.so.6 [.] __memset_avx2_unaligned_erms
   0.61%  postgres  [.] create_nestloop_path
   0.61%  postgres  [.] examine_variable
   0.59%  postgres  [.] hash_bytes
   0.56%  postgres  [.] truncate_useless_pathkeys
   0.56%  postgres  [.] bms_overlap

David




Re: Planning time is time-consuming

2023-09-11 Thread David Rowley
On Mon, 11 Sept 2023 at 21:54, Mikhail Balayan  wrote:
> Could it be a regression? I'll check it on PG14 when I get a chance.

I'm not sure if you're asking for help here because you need planning
to be faster than it currently is, or if it's because you believe that
planning should always be faster than execution. If you think the
latter, then you're mistaken. It seems to me that planning this query
is much more complex than executing it.  The outer
side of the inner-most nested loop finds 0 rows, so it need not scan
the inner side, which results in that nested loop producing 0 rows,
therefore the outer side of none of the subsequent nested loops find
any rows. This is why you see "(never executed)" in the EXPLAIN
ANALYZE.

You could use perf record or perf top to dig into what's slow.

On the other hand, please report back if you find PG14 to be much faster here.

You could also experiment with a set of tables which are empty.  It's
possible getting the relation sizes are a factor to consider here.
mdnblocks() needs to do a bit more work when the relation has multiple
segments.

David




Re: Planning time is time-consuming

2023-09-11 Thread David Rowley
On Mon, 11 Sept 2023 at 18:16, Laurenz Albe  wrote:
> Also, there are quite a lot of indexes on "test_db_bench_1".  On a test 
> database, drop some
> indexes and see if that makes a difference.

Yeah, I count 3 that either have the key columns as some prefix of
another index or are just a duplicate of some other index.

Getting rid of those 3 will save some time in create_index_paths().

David




Re: Queries containing ORDER BY and LIMIT started to work slowly

2023-08-30 Thread David Rowley
On Thu, 31 Aug 2023 at 06:32, Rondat Flyag  wrote:
> I tried VACUUM ANALYZE for three tables, but without success. I also tried to 
> set enable_seqscan=off and the query took even more time. If I set 
> enable_sort=off then the query takes a lot of time and I cancel it.
>
> Please see the attached query plans.

It's a little hard to comment here as I don't see what the plan was
before when you were happy with the performance. I also see the
queries you mentioned in the initial email don't match the plans.
There's no table called "isbns" in the query. I guess this is "asins"?

Likely you could get a faster plan if there was an index on
asins_statistics (asin_id).  That would allow a query plan that scans
the isbns_value_key index and performs a parameterised nested loop on
asins_statistics using the asins_statistics (asin_id) index.  Looking
at your schema, I don't see that index, so it's pretty hard to guess
why the plan used to be faster.  Even if the books/asins merge join
used to take place first, there'd have been no efficient way to join
to the asins_statistics table and preserve the Merge Join's order (I'm
assuming non-parameterized nested loops would be inefficient in this
case). Doing that would have also required the asins_statistics
(asin_id) index.  Are you sure that index wasn't dropped?

However, likely it's a waste of time to try to figure out what the
plan used to be. Better to focus on trying to make it faster. I
suggest you create the asins_statistics (asin_id) index. However, I
can't say with any level of confidence that the planner would opt to
use that index if it did exist.   Lowering random_page_cost or
increasing effective_cache_size would increase the chances of that.
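i.e. something like (the index name is just a suggestion):

CREATE INDEX CONCURRENTLY asins_statistics_asin_id_idx
    ON asins_statistics (asin_id);
ANALYZE asins_statistics;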

David




Re: Range partitioning query performance with date_trunc (vs timescaledb)

2023-08-29 Thread David Rowley
On Tue, 29 Aug 2023 at 19:40, Philippe Pepiot  wrote:
> I'm trying to implement some range partitioning on timeseries data. But it
> looks some queries involving date_trunc() doesn't make use of partitioning.
>
> BEGIN;
> CREATE TABLE test (
> time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
> value FLOAT NOT NULL
> ) PARTITION BY RANGE (time);
> CREATE INDEX test_time_idx ON test(time DESC);
> CREATE TABLE test_y2010 PARTITION OF test FOR VALUES FROM ('2020-01-01') TO 
> ('2021-01-01');
> CREATE TABLE test_y2011 PARTITION OF test FOR VALUES FROM ('2021-01-01') TO 
> ('2022-01-01');
> CREATE VIEW vtest AS SELECT DATE_TRUNC('year', time) AS time, SUM(value) AS 
> value FROM test GROUP BY 1;
> EXPLAIN (COSTS OFF) SELECT * FROM vtest WHERE time >= TIMESTAMP '2021-01-01';
> ROLLBACK;
>
> The plan query all partitions:

> I wonder if there is a way with a reasonable amount of SQL code to achieve 
> this
> with vanilla postgres ?

The only options I see for you are

1) partition by LIST (date_trunc('year', time)), or;
2) use a set-returning function instead of a view and pass the date
range you want to select from the underlying table via parameters.

I imagine you won't want to do #1. However, it would at least also
allow the aggregation to be performed before the Append if you SET
enable_partitionwise_aggregate=1.

#2 isn't as flexible as a view as you'd have to create another
function or expand the parameters of the existing one if you want to
add items to the WHERE clause.
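A rough, untested sketch of #2 using your table from above. If the
function gets inlined the pruning can happen at plan time, otherwise
the run-time pruning code can still do it using the parameter values:

CREATE FUNCTION vtest_between(t_from timestamp, t_to timestamp)
RETURNS TABLE (bucket timestamp, total float)
LANGUAGE sql STABLE
AS $$
    SELECT date_trunc('year', t.time), sum(t.value)
    FROM test t
    WHERE t.time >= t_from AND t.time < t_to
    GROUP BY 1
$$;

SELECT * FROM vtest_between('2021-01-01', '2022-01-01');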

Unfortunately, date_trunc is just a black box to partition pruning, so
it's not able to determine that DATE_TRUNC('year', time) >=
'2021-01-01'  is the same as time >= '2021-01-01'.  It would be
possible to make PostgreSQL do that, but that's a core code change,
not something that you can do from SQL.

David




Re: Slow query, possibly not using index

2023-08-28 Thread David Rowley
On Mon, 28 Aug 2023 at 19:21, Les  wrote:
> More important question is, how can I find out why the index was not auto 
> vacuumed.

You should have a look at pg_stat_user_tables.  It'll let you know if
the table is being autovacuumed and how often.  If you're concerned
about autovacuum not running properly, then you might want to lower
log_autovacuum_min_duration.  Generally, anything that takes a
conflicting lock will cause autovacuum to cancel so that the
conflicting locker can get through.  Things like ALTER TABLE or even
an ANALYZE running will cancel most autovacuum runs on tables.

Also, this is a fairly large table and you do have the standard
autovacuum settings.  Going by pgstattuple, the table has 39652836
tuples.  Autovacuum will trigger when the statistics indicate that 20%
of tuples are dead, which is about 8 million tuples. Perhaps that's
enough for the index scan to have to skip over a large enough number
of dead tuples to make it slow. You might want to consider lowering
the autovacuum scale factor for this table.
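For example (0.02 is only an illustrative value; for this table it
would trigger autovacuum at roughly 800k dead tuples):

ALTER TABLE media.block SET (autovacuum_vacuum_scale_factor = 0.02);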

Also, ensure you're not doing anything like calling pg_stat_reset();

It might be worth showing us the output of:

select * from pg_stat_user_tables where relid = 'media.block'::regclass;

David




Re: Partitioning update-heavy queue with hash partitions vs partial indexes

2023-08-10 Thread David Rowley
On Thu, 10 Aug 2023 at 20:36, Dorian Hoxha  wrote:
> > Do Not Use Partial Indexes as a Substitute for Partitioning
> > While a search in this larger index might have to descend through a couple 
> > more tree levels than a search in a smaller index, that's almost certainly 
> > going to be cheaper than the planner effort needed to select the 
> > appropriate one of the partial indexes. The core of the problem is that the 
> > system does not understand the relationship among the partial indexes, and 
> > will laboriously test each one to see if it's applicable to the current 
> > query.
>
> Would this be true in my case too?

Yes.  The process of determining which partial indexes are valid for
the given query must consider each index one at a time and validate
the index's WHERE clause against the query's WHERE clause to see if it
can be used.  There is no shortcut that sees you have a series of
partial indexes with WHERE id % 10 = N and just picks the right index
without checking each of them.

> Is it faster for the planner to select a correct partition(hash partitioning 
> on `id` column) instead of a correct partial index like in my case? I don't 
> think I'll need more than ~32 partitions/partial-indexes in an extreme 
> scenario.

I mean, test it and find out, but probably, yes, the partition pruning
code for hash partitioning is an O(1) operation and is very fast.
Once the given Constants have been hashed, finding the partition is
just a single divide operation away.
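If you do go with partitioning, the setup is just something like the
following (names made up, 32 partitions as per your extreme case):

CREATE TABLE queue (
    id      bigint NOT NULL,
    payload text
) PARTITION BY HASH (id);

CREATE TABLE queue_p0 PARTITION OF queue
    FOR VALUES WITH (MODULUS 32, REMAINDER 0);
CREATE TABLE queue_p1 PARTITION OF queue
    FOR VALUES WITH (MODULUS 32, REMAINDER 1);
-- ... and so on up to REMAINDER 31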

David




Re: High QPS, random index writes and vacuum

2023-04-17 Thread David Rowley
On Tue, 18 Apr 2023 at 12:35, peter plachta  wrote:
> I increased work_mem to 2Gb

maintenance_work_mem is the configuration option that vacuum uses to
control how much memory it'll make available for storage of dead
tuples. I believe 1GB would allow 178,956,970 tuples to be stored
before multiple passes would be required. The chunk of memory for dead
tuple storage is capped at 1GB.
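For a manual VACUUM you can set it per session, for example (value
illustrative; autovacuum instead uses autovacuum_work_mem, falling
back to maintenance_work_mem, from postgresql.conf):

SET maintenance_work_mem = '1GB';
VACUUM (VERBOSE) your_table;  -- placeholder table name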

David




Re: speeding up grafana sensor-data query on raspberry pi 3

2023-04-16 Thread David Rowley
On Mon, 17 Apr 2023 at 05:00, Clemens Eisserer  wrote:
> Example:
> table smartmeter with non-null column ts (timestamp with time zone)
> and brinc index on ts, no pk to avoid a btree index.
> Sensor values are stored every 5s, so for 1 month there are about 370k
> rows - and in total the table currently holds about 3M rows.
> The query to display the values for 1 month takes ~3s, with the bitmap
> heap scan as well as aggregation taking up most of the time, with
> sorting in between.

I know you likely don't have much RAM to spare here, but more work_mem
might help, even just 16MBs might be enough. This would help the Sort
and to a lesser extent the Bitmap Heap Scan too.

Also, if you'd opted to use PostgreSQL 14 or 15, then you could have
performed CREATE STATISTICS on your GROUP BY clause expression and
then run ANALYZE.  That might cause the planner to flip to a Hash
Aggregate which would eliminate the Sort before aggregation.  You'd
only need to sort 236 rows after the Hash Aggregate for the ORDER BY.
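For example (I'm guessing at the GROUP BY expression here; adjust to
whatever yours actually is, and the expression statistics need PG14+):

SET work_mem = '16MB';

CREATE STATISTICS smartmeter_ts_day_stats
    ON (date_trunc('day', ts)) FROM smartmeter;
ANALYZE smartmeter;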

Plus, what Justin said.

David




Re: multicolumn partitioning help

2023-03-15 Thread David Rowley
On Thu, 16 Mar 2023 at 00:47, James Robertson  wrote:
> or do we get?
>
> TopLevelTable
> |
> |> worker-thread 1 (default catch)
> |> worker thread 2 -> sub-table 1.1
> |> worker thread 3 -> sub-table 1.2
> |> worker thread 4 -> sub-table 1.n
> |
> |> worker-thread 5 (default catch)
> |> worker thread 6 -> sub-table 2.1
> |> worker thread 7 -> sub-table 2.2
> |> worker thread 8 -> sub-table 2.n

The planner generally flattens out the scans to each partition into a
single Append or MergeAppend node.  Append nodes can be parallelised.
Assuming there's no reason that a particular partition can't support
it, the parallel workers can be distributed to assist without
restriction to which partition they help with. Multiple workers can
even help with a single partition. Workers can then move over to
helping with other partitions when they're done with the partition
they've been working on. I believe some other databases do or did at
least restrict parallelism to 1 worker per partition (which perhaps is
why you raised this).  There's no such restriction with PostgreSQL.

> Summary:
> 1) if we create nested partitions, do we create performance issues:

If you create too many partitions, it can create performance issues.
You should look at the partitioning best practices section of the
documents for details about that. I recommend a careful read of those.

> 2) if nested partitions are the solutions, what is the point of multi-column 
> partitioning?

There are many reasons.  If you have multiple levels of partitioning,
then the partition pruning done during planning is going to have more
work to do as it'll be executed once, then once again for each
partitioned table remaining after running it for the first level.
Also, it seems much easier to PARTITION BY HASH(a,b) than to first do
HASH(a) then another level to HASH(b).  However, there may be
advantages to having multiple levels here as the planner would still
be able to prune partitions if the WHERE clause didn't contain any
quals like "b = ".  The key take away here is that they're
different, so we support both.

> wish list) wouldn't it be neat if we can do mult-mode multi-column? like 
> PARTITION BY RANGE (EXTRACT(YEAR FROM dob)) LIST (SUBSTRING(hash, 1, 1));

Effectively, multi-level partitioning gives you that; it's just that
the DDL is different from how you wrote it.
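Roughly like this, for example (untested sketch; partition names and
bounds made up):

CREATE TABLE humans (
    hash  bytea,
    fname text,
    dob   date
) PARTITION BY RANGE (EXTRACT(YEAR FROM dob));

CREATE TABLE humans_2023 PARTITION OF humans
    FOR VALUES FROM (2023) TO (2024)
    PARTITION BY LIST (substr(hash, 1, 1));

-- then LIST leaf partitions of humans_2023, one per leading byte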

David




Re: multicolumn partitioning help

2023-03-15 Thread David Rowley
On Wed, 15 Mar 2023 at 10:41, Laurenz Albe  wrote:
> I think you want subpartitioning, like
>
>   CREATE TABLE humans (
>  hash bytea,
>  fname text,
>  dob date
>   ) PARTITION BY LIST (EXTRACT (YEAR FROM dob));

This may be perfectly fine, but it is also important to highlight that
partitioning in this way may hinder partition pruning.

If the first level partitioned table was to be BY RANGE (dob); then
the partitions could be defined like FOR VALUES FROM ('2023-01-01') TO
('2024-01-01'). For a query that had something like WHERE dob =
'2023-03-16', then PostgreSQL could prune away all the partitions for
the other years.  The same wouldn't occur if the table was partitioned
by LIST (EXTRACT (YEAR FROM dob)) unless you added a AND EXTRACT (YEAR
FROM dob) = 2023 to the query's WHERE clause.
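A minimal sketch of that, for example:

CREATE TABLE humans (
    hash  bytea,
    fname text,
    dob   date
) PARTITION BY RANGE (dob);

CREATE TABLE humans_2023 PARTITION OF humans
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

-- WHERE dob = '2023-03-16' can now prune to humans_2023 at plan time
EXPLAIN SELECT * FROM humans WHERE dob = '2023-03-16';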

Robert, there are a few tips about partitioning in [1] that you may
wish to review.

David

[1] https://www.postgresql.org/docs/devel/ddl-partitioning.html




Re: Window Functions & Table Partitions

2023-02-20 Thread David Rowley
On Fri, 10 Feb 2023 at 06:40, Benjamin Tingle  wrote:
> Thanks for the helpful response david! I'll have a shot at getting the patch 
> to work myself & submitting to pgsql-hackers.

I took some time today for this and fixed up a few mistakes in the
patch and added it to the March commitfest [1].  Time is ticking away
for v16, so given this is a fairly trivial patch, I thought it might
be nice to have it.

Any discussion on the patch can be directed at [2]

David

[1] https://commitfest.postgresql.org/42/4198/
[2] 
https://www.postgresql.org/message-id/flat/caaphdvojkdbr3mr59jxmacybyhb6q_5qpru+dy93en8wm+x...@mail.gmail.com

> Ben
>
> On Wed, Feb 8, 2023 at 2:36 PM David Rowley  wrote:
>>
>> On Thu, 9 Feb 2023 at 10:45, Benjamin Tingle  wrote:
>> > Basically- window partition functions don't take advantage of existing 
>> > table partitions. I use window functions as a more powerful GROUP BY 
>> > clause that preserves row-by-row information- super handy for a lot of 
>> > things.
>> >
>> > In particular, I want to use window functions on already partitioned 
>> > tables, like the below example:
>> >
>> > create table abb (a int, b int, g int) partition by hash(b)
>> > /* populate table etc... */
>> > select a, b, min(a) over (partition by b) as g from abb
>> >
>> > Ideally with a query plan like this:
>> >
>> > Window:
>> > Append:
>> > Sort on table_p0
>> > Sort on table_p1
>> > Sort on table_p2
>>
>> There was some effort [1] in version 12 to take advantage of the order
>> defined by the partitioning scheme. The release notes [2] mention:
>>
>> "Avoid sorting when partitions are already being scanned in the necessary 
>> order"
>>
>> However, it's not 100% of what you need as there'd have to be a btree
>> index on abb(b) for the planner to notice.
>>
>> Likely this could be made better so that add_paths_to_append_rel()
>> added the pathkeys defined by the partitioned table into
>> all_child_pathkeys if they didn't exist already. In fact, I've
>> attached a very quickly hacked together patch against master to do
>> this.  I've given it very little thought and it comes complete with
>> failing regression tests.
>>
>> If you're interested in pursuing this then feel free to take the patch
>> to the pgsql-hackers mailing list and propose it. It's unlikely I'll
>> get time to do that for a while, but I will keep a branch locally with
>> it to remind me in case I do at some point in the future.
>>
>> David
>>
>> [1] 
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=959d00e9dbe4cfcf4a63bb655ac2c29a5e579246
>> [2] https://www.postgresql.org/docs/release/12.0/
>
>
>
> --
>
> Ben(t).>
> Ben
>
> On Wed, Feb 8, 2023 at 2:36 PM David Rowley  wrote:
>>
>> On Thu, 9 Feb 2023 at 10:45, Benjamin Tingle  wrote:
>> > Basically- window partition functions don't take advantage of existing 
>> > table partitions. I use window functions as a more powerful GROUP BY 
>> > clause that preserves row-by-row information- super handy for a lot of 
>> > things.
>> >
>> > In particular, I want to use window functions on already partitioned 
>> > tables, like the below example:
>> >
>> > create table abb (a int, b int, g int) partition by hash(b)
>> > /* populate table etc... */
>> > select a, b, min(a) over (partition by b) as g from abb
>> >
>> > Ideally with a query plan like this:
>> >
>> > Window:
>> > Append:
>> > Sort on table_p0
>> > Sort on table_p1
>> > Sort on table_p2
>>
>> There was some effort [1] in version 12 to take advantage of the order
>> defined by the partitioning scheme. The release notes [2] mention:
>>
>> "Avoid sorting when partitions are already being scanned in the necessary 
>> order"
>>
>> However, it's not 100% of what you need as there'd have to be a btree
>> index on abb(b) for the planner to notice.
>>
>> Likely this could be made better so that add_paths_to_append_rel()
>> added the pathkeys defined by the partitioned table into
>> all_child_pathkeys if they didn't exist already. In fact, I've
>> attached a very quickly hacked together patch against master to do
>> this.  I've given it very little thought and it comes complete with
>> failing regression tests.
>>
>> If you're interested in pursuing this then feel free to take the patch
>> to the pgsql-hackers mailing list and propose it. It's unlikely I'll
>> get time to do that for a while, but I will keep a branch locally with
>> it to remind me in case I do at some point in the future.
>>
>> David
>>
>> [1] 
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=959d00e9dbe4cfcf4a63bb655ac2c29a5e579246
>> [2] https://www.postgresql.org/docs/release/12.0/
>
>
>
> --
>
> Ben(t).




Re: Window Functions & Table Partitions

2023-02-08 Thread David Rowley
On Thu, 9 Feb 2023 at 10:45, Benjamin Tingle  wrote:
> Basically- window partition functions don't take advantage of existing table 
> partitions. I use window functions as a more powerful GROUP BY clause that 
> preserves row-by-row information- super handy for a lot of things.
>
> In particular, I want to use window functions on already partitioned tables, 
> like the below example:
>
> create table abb (a int, b int, g int) partition by hash(b)
> /* populate table etc... */
> select a, b, min(a) over (partition by b) as g from abb
>
> Ideally with a query plan like this:
>
> Window:
> Append:
> Sort on table_p0
> Sort on table_p1
> Sort on table_p2

There was some effort [1] in version 12 to take advantage of the order
defined by the partitioning scheme. The release notes [2] mention:

"Avoid sorting when partitions are already being scanned in the necessary order"

However, it's not 100% of what you need as there'd have to be a btree
index on abb(b) for the planner to notice.

Likely this could be made better so that add_paths_to_append_rel()
added the pathkeys defined by the partitioned table into
all_child_pathkeys if they didn't exist already. In fact, I've
attached a very quickly hacked together patch against master to do
this.  I've given it very little thought and it comes complete with
failing regression tests.

If you're interested in pursuing this then feel free to take the patch
to the pgsql-hackers mailing list and propose it. It's unlikely I'll
get time to do that for a while, but I will keep a branch locally with
it to remind me in case I do at some point in the future.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=959d00e9dbe4cfcf4a63bb655ac2c29a5e579246
[2] https://www.postgresql.org/docs/release/12.0/
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ae0f9bdc8a..c4271c9179 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1365,6 +1365,27 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		}
 	}
 
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(rel->boundinfo, rel->live_parts))
+	{
+		List	   *partition_pathkeys;
+		bool		partial;
+
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+													  &partial);
+
+		if (!partial)
+			all_child_pathkeys = lappend(all_child_pathkeys, partition_pathkeys);
+
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  BackwardScanDirection,
+													  &partial);
+
+		if (!partial)
+			all_child_pathkeys = lappend(all_child_pathkeys, partition_pathkeys);
+	}
+
 	/*
 	 * Collect lists of all the available path orderings and
 	 * parameterizations for all the children.  We use these as a


Re: Getting an index scan to be a parallel index scan

2023-02-01 Thread David Rowley
On Thu, 2 Feb 2023 at 14:49, Thomas Munro  wrote:
> If I had more timerons myself, I'd like to try to make parallel
> function scans, or parallel CTE scans, work...

I've not really looked in detail but I thought parallel VALUES scan
might be easier than those two.

David




Re: Getting an index scan to be a parallel index scan

2023-02-01 Thread David Rowley
On Wed, 1 Feb 2023 at 18:39, Alex Kaiser  wrote:
> postgres=# set force_parallel_mode = on;

There's been a bit of debate about that GUC and I'm wondering how you
came to the conclusion that it might help you. Can you share details
of how you found out about it and what made you choose to set it to
"on"?

David




Re: Postgres12 looking for possible HashAggregate issue workarounds?

2022-12-18 Thread David Rowley
On Sun, 18 Dec 2022 at 23:44, João Paulo Luís  wrote:
> Meanwhile, as a one-time workaround I've disabled the hashagg algorithm,

The way the query planner determines if Hash Aggregate's hash table
will fit in work_mem or not is based on the n_distinct estimate of the
columns being grouped on.  You may want to review what analyze set
n_distinct to on this table. That can be done by looking at:

select attname,n_distinct from pg_Stats where tablename =
'sentencesource' and attname = 'sentence';

If what that's set to does not seem realistic, then you can overwrite this with:

ALTER TABLE sentencesource ALTER COLUMN sentence SET (n_distinct = N);

Please see the paragraph in [1] about n_distinct.  Using an absolute
value is likely not a great idea if the table is going to grow. You
could maybe give it a better estimate about how many times values are
repeated by setting some negative value, as described in the
documents. You'll need to analyze the table again after changing this
setting.
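For example, if each sentence value appears on average about twice,
something like this may be closer to reality (-0.5 is purely
illustrative):

ALTER TABLE sentencesource ALTER COLUMN sentence SET (n_distinct = -0.5);
ANALYZE sentencesource;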

David

[1] https://www.postgresql.org/docs/12/sql-altertable.html




Re: DML sql execution time slow down PGv14 compared with PGv13

2022-12-15 Thread David Rowley
On Thu, 15 Dec 2022 at 21:12, James Pang (chaolpan)  wrote:
>We had some load test ( DML inserts/deletes/updates/ on tens of hash 
> partition tables)  and found that PGV14 slow down 10-15% compared with PGV13. 
>  Same test server, same schema tables and data. From pg_stat_statements, sql 
> exec_time, we did found similar mean_exec_time increased from 5%-25% with 
> same SQL statements. Both v14 and v13 give very fast sql response time, just 
> compare the %diff from sql statements mean_exec_time.

I tried this out on the tip of the PG13 and PG14 branch with the same
scale of pgbench as you mentioned and I don't see the same slowdown as
you do.

PG13:
tps = 1711.980109 (excluding connections establishing)

PG14:
tps = 1736.466835 (without initial connection time)

As for why yours might be slower.  You might want to have a look at
the EXPLAIN ANALYZE output for the UPDATE statements. You can recreate
the -M prepared by using PREPARE and EXECUTE. You might want to
execute the statements 6 times and see if the plan changes on the 6th
execution.  It's likely not impossible that PG14 is using custom
plans, whereas PG13 might be using generic plans for these updates.
There were some quite significant changes made to the query planner in
PG14 that changed how planning works for UPDATEs and DELETEs from
partitioned tables.  Perhaps there's some reason there that the
custom/generic plan choice might differ. I see no reason why INSERT
would have become slower. Both the query planning and execution are
very different for INSERT.
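For example, in psql (table and column names are placeholders for one
of your actual UPDATE statements):

PREPARE upd(int, int) AS
    UPDATE my_hash_part_table SET val = $2 WHERE id = $1;

EXPLAIN (ANALYZE, BUFFERS) EXECUTE upd(123, 1);
-- run the EXECUTE at least 6 times; on the 6th execution the plan
-- cache may switch to a generic plan, which is where v13 and v14
-- could differ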

You might also want to have a look at what perf says. If you have the
debug symbols installed, then you could just watch "perf top --pid=<backend pid>". Maybe that will show you
something interesting.




Re: wrong rows and cost estimation when generic plan

2022-12-06 Thread David Rowley
On Tue, 6 Dec 2022 at 20:17, James Pang (chaolpan)  wrote:
>  Could you provide the function name for generic plan selectivity estimation?

If you look at eqsel_internal(), you'll see there are two functions
that it'll call var_eq_const() for Consts and otherwise
var_eq_non_const(). It'll take the non-Const path for planning generic
plans.

David




Re: wrong rows and cost estimation when generic plan

2022-12-05 Thread David Rowley
On Tue, 6 Dec 2022 at 18:28, James Pang (chaolpan)  wrote:
>->  Index Scan using idx_x_time on x  (cost=0.44..8.48 rows=1 
> width=2923) (actual time=8136.242..8136.242 rows=0 loops=1)
>  Index Cond: ((starttime = $7) AND (endtime = $8))
>  Filter: ((password IS NULL) AND ...(aid = $4) AND (bid = $5) AND 
> (btype = $6) AND...
>  Rows Removed by Filter: 5534630

I wonder if you did:

create statistics x_starttime_endtime_stats  (ndistinct) on
starttime,endtime from x;
analyze x;

if the planner would come up with a higher estimate than what it's
getting for the above and cause it to use the other index instead.

> optimzer is very complicated, could you direct me how optimizer to do 
> selectivity estimation when building generic plan, for this case? for 
> custom_plan, optimizer knows boundparams values, but when generic_plan, 
> planner() use boundparams=NULL,  it try to calculate average value based on 
> mcv list of the index attributes (starttime,endtime)  ?

IIRC, generic plan estimates become based on distinct estimations
rather than histograms or MCVs.
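If you want to see what the generic plan's estimates look like, you
can force one (PG12+). A simplified illustration:

PREPARE q(timestamp, timestamp) AS
    SELECT * FROM x WHERE starttime = $1 AND endtime = $2;

SET plan_cache_mode = force_generic_plan;
EXPLAIN EXECUTE q('2022-12-01', '2022-12-02');  -- shows $1/$2 and the
                                                -- ndistinct-based rows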

David




Re: Catching up with performance & PostgreSQL 15

2022-11-29 Thread David Rowley
On Wed, 30 Nov 2022 at 03:31, Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > IMO it was a mistake to turn JIT on in the default config, so that's one
> > thing you'll likely want to change.
>
> I wouldn't necessarily go quite that far, but I do think that the
> default cost thresholds for invoking it are enormously too low,
> or else there are serious bugs in the cost-estimation algorithms
> for deciding when to use it.  A nearby example[1] of a sub-1-sec
> partitioned query that took 30sec after JIT was enabled makes me
> wonder if we're accounting correctly for per-partition JIT costs.

I'm very grateful for JIT. However, I do agree that the costs need to work.

The problem is that the threshold to turn JIT on does not consider how
many expressions need to be compiled. It's quite different to JIT
compile a simple one-node plan with a total cost of 10 than to JIT
compile a plan that costs the same but queries 1000 partitions. I
think we should be compiling expressions based on the cost of the
individual node rather than the total cost of the plan. We need to
make some changes so we can more easily determine the number of times
a given node will be executed before we can determine how worthwhile
JITting an expression in a node will be.
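In the meantime, the thresholds can be raised (or JIT disabled) per
session or in postgresql.conf if it's misfiring for you; the values
below are only examples:

SET jit_above_cost = 1000000;           -- default 100000
SET jit_inline_above_cost = 5000000;    -- default 500000
SET jit_optimize_above_cost = 5000000;  -- default 500000
-- or simply
SET jit = off;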

David

> [1] 
> https://www.postgresql.org/message-id/B6025887-D73F-4B5B-9925-4DA4B675F7E5%40elevated-dev.com




Re: Postgresql 13 partitioning advice

2022-08-02 Thread David Rowley
On Tue, 2 Aug 2022 at 19:48, Ameya Bidwalkar
 wrote:
> We have a Postgresql 13 database where we have a single table with several 
> millions of rows  . We plan to partition it based on timestamp .
> We have  been seeking  advice for best practices for building this.
> This table will get lots of updates for the same rows during a short period 
> of time.During this time rows would be in a single partition .
> After this short time these rows would move to  another partition  .Where no 
> more updates take place on these rows.But might  have some SELECT queries  
> running.
> We plan to l have  partitions based on months and then roll them up in a year 
>  and then archive these older partitions
> One consultant we talked with told us this  row movement between the 
> partitions will have
> huge complications .But  this was an issue during the Postgres 10 version .

Define "huge complications"?

The capabilities of partitioned tables have changed quite a bit since
the feature was added. It's very easy for knowledge to get out-dated
in this area.  I did quite a bit of work on them and I struggle to
remember off the top of my head which versions saw which improvements.
PG12 saw lots. See [1], search for "partition".

One possible complication is what is mentioned in [2] about
"serialization failure error".  UPDATEs that cause a tuple to move to
another partition can cause a serialization failure even at the READ
COMMITTED transaction isolation level, not just in serializable
transactions. If it's not
already, you might want to have your application retry transactions on
SQL:40001 errors.

Apart from that, assuming there's comparatively a small number of rows
in the partition being updated compared to the partition with the
static rows, then it sounds fairly efficient. As you describe it,  the
larger static partition is effectively INSERT only and auto-vacuum
will need to touch it only for tuple freezing work.  The smaller of
the two tables will receive more churn but will be faster to vacuum.
PG13 got a new feature that makes sure auto-vacuum also does the
rounds on INSERT-only tables too, so the static partition is not going
to be neglected until anti-wrap-around-autovacuums trigger, like they
would have in PG12 and earlier.

Another thing to consider is that an UPDATE of a non-partitioned table
has a chance at being a HOT update. That's possible if the tuple can
fit on the same page and does not update any of the indexed columns. A
HOT update means no indexes need to be updated so these perform faster
and require less space in WAL than a non-HOT update.  An UPDATE that
moves a tuple to another partition can never be a HOT update.  That's
something you might want to consider. If you're updating indexed
columns already then it's not a factor to consider. There's also
overhead to postgres having to find the partition for the newly
updated version of the tuple.  That's not hugely expensive, but it's
generally measurable. RANGE partitioned tables with a large number of
partitions will have the most overhead for this. HASH partitioned
tables, the least.
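You can check how many of your current updates are HOT with something
like (table name is a placeholder):

SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'your_table';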

The best thing you can likely do is set up a scenario with pgbench and
compare the performance. pgbench is a pretty flexible tool that will
allow you to run certain queries X% of the time and even throttle the
workload at what you expect your production server to experience. You
could then run it overnight on a test server, or even for weeks and
see how auto-vacuum keeps up when compared to the non-partitioned
case. You can also check how much extra WAL is generated vs the
non-partitioned case.
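A custom pgbench script is just a file of SQL plus \set commands for
random parameters, for example (table, column and distribution are
made up):

\set id random(1, 1000000)
UPDATE events SET status = 'done', updated_at = now() WHERE id = :id;

and then something like "pgbench -n -c 8 -j 8 -T 600 -f bench.sql
yourdb" to run it for 10 minutes.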

> So we are seeking advice on the performance perspective and things we should 
> take care of along with manual vacuums on a regular schedule and indexing.
> Are there any tunables I should experiment with in particular ?

Perhaps if you want to keep a small high-churn table in check you might
want to consider if autovacuum_naptime is set low enough. You may not
care if the space being consumed in the standard 1min
autovacuum_naptime is small enough not to be of concern.

David

[1] https://www.postgresql.org/docs/release/12.0/
[2] https://www.postgresql.org/docs/13/sql-update.html




Re: Why is there a Sort after an Index Only Scan?

2022-05-04 Thread David Rowley
On Thu, 5 May 2022 at 11:15, André Hänsel  wrote:
>
> Quick(?) question... why is there a Sort node after an Index Only Scan?
> Shouldn't the index already spit out sorted tuples?
>
> CREATE INDEX ON orders_test(shipping_date, order_id);
>
> EXPLAIN ANALYZE SELECT
> FROM orders_test
> WHERE TRUE
> AND shipping_date >= '2022-05-01'
> AND shipping_date <= '2022-05-01'
> ORDER BY order_id
> LIMIT 50;

Unfortunately, the query planner is not quite smart enough to realise
that your shipping_date clauses can only match a single value.
There's quite a bit more we could do with the planner's
EquivalanceClasses. There is a patch around to help improve things in
this area but it requires some more infrastructure to make it more
practical to do from a performance standpoint in the planner.

You'll get the plan you want if you rewrite the query and replace your
date range with shipping_date = '2022-05-01'.  Your use of WHERE TRUE
indicates to me that you might be building this query in an
application already, so maybe you can just tweak that application to
test if the start and end dates are the same and use equality when
they are.
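i.e. rewriting your example with equality:

EXPLAIN ANALYZE SELECT
FROM orders_test
WHERE shipping_date = '2022-05-01'
ORDER BY order_id
LIMIT 50;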

David

[1] https://commitfest.postgresql.org/38/3524/




Re: Window partial fetch optimization

2022-05-03 Thread David Rowley
On Wed, 4 May 2022 at 06:11, Levi Aul  wrote:
> It is our expectation that this query “should” be able to be cheap-to-compute 
> and effectively instantaneous. (It’s clear to us how we would make it so, 
> given a simple LMDB-like sorted key-value store: prefix-match on 
> holder_address; take the first row you find for the contract-address you’re 
> on; build a comparator key of (holder_address, contract_address, 
> highest-possible-version) and traverse to find the lowest row that sorts 
> greater than it; repeat.)
>
> Which might, in SQL, be expressed as something like this:
>
> WITH ranked_holder_balances AS (
> SELECT
> *,
> row_number() OVER w AS balance_rank
> FROM contract_balance_updates
> WHERE holder_address = '\x'::bytea
> WINDOW w AS (
> PARTITION BY holder_address, contract_address
> ORDER BY start_block_height DESC
> )
> ORDER BY holder_address ASC, contract_address ASC, start_block_height DESC
> )
> SELECT *
> FROM ranked_holder_balances
> WHERE balance_rank = 1

Yes, PostgreSQL 14 is not very smart about realising that WHERE
balance_rank = 1 is only going to match the first row of each window
partition. PostgreSQL 15 (coming later this year) should be better in
this regard as some work was done to teach the query planner about
monotonic window functions [1].  However, that change likely does not
do all you'd like here as the WindowAgg node still must consume and
throw away all tuples until it finds the first tuple belonging to the
next window partition.  It sounds like you really want "Skip Scans" or
"Loose Index Scans" which are implemented by some other RDBMS'.  I
imagine that even with the change to PostgreSQL 15 that it still
wouldn't be as fast as your DISTINCT ON example.

> WITH bup1 AS (
> SELECT DISTINCT bup.holder_address, bup.contract_address
> FROM contract_balance_updates bup
> WHERE bup.holder_address = 
> '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
> ORDER BY bup.contract_address ASC
> )
> SELECT
>   bup1.holder_address,
>   bup1.contract_address,
>   (
> SELECT balance
> FROM contract_balance_updates bup2
> WHERE bup2.holder_address = bup1.holder_address
> AND bup2.contract_address = bup1.contract_address
> ORDER BY bup2.holder_address ASC, bup2.contract_address ASC, 
> bup2.start_block_height DESC
> LIMIT 1
>   ) AS balance
> FROM bup1

> I really don’t like this last approach; it scans twice, it’s surprising / 
> confusing for people maintaining the query, etc. I believe that, due to the 
> correlated subquery, the planning overhead is also O(N) with the number of 
> matched entities increases (though I don’t have a good test-case for this.)

No, the subquery is not replanned each time it is rescanned. It's
planned once and that same plan will be executed each time. So no O(n)
planning overhead.

> Is there any way to get PG to do what this last query is doing, purely using 
> window-functions / distinct on / etc.? Because, if there is, I can’t find it.
>
> It seems that PG can in fact do index-range-seeking (since that’s what it’s 
> doing when gathering the distinct contract_addresses in the last query.) It 
> seems intuitive to me that it should be using such an approach to filter for 
> rows in window/group-partitions, when a criteria+index that can be combined 
> to limit the size of the window/group are available to the planner. And that, 
> even when not able to be automatically inferred, it would make sense for 
> there to be control over such behaviour in SQL, using hypothetical syntax 
> like:

Unfortunately, DISTINCT can only be implemented with Hash Aggregate or
Sort / Index Scan + Unique.  We don't have anything currently which
will jump to the next highest key in an index.  There has been some
work on what we're starting to call "Skip Scans", but it's all still
work in progress.

You might find something useful in [2] which might help speed up your
DISTINCT query.
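For reference, the recursive CTE trick from that wiki page adapted to
your bup1 query would look roughly like this (untested, and it relies
on an index leading on (holder_address, contract_address)):

WITH RECURSIVE addrs AS (
    (SELECT contract_address
     FROM contract_balance_updates
     WHERE holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
     ORDER BY contract_address
     LIMIT 1)
    UNION ALL
    SELECT (SELECT c.contract_address
            FROM contract_balance_updates c
            WHERE c.holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
              AND c.contract_address > a.contract_address
            ORDER BY c.contract_address
            LIMIT 1)
    FROM addrs a
    WHERE a.contract_address IS NOT NULL
)
SELECT contract_address FROM addrs WHERE contract_address IS NOT NULL;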

> -- for windows
> row_number() OVER (PARTITION BY x ORDER BY x LIMIT 10 OFFSET 3)
>
> -- for groups
> GROUP BY x, y, z (APPLYING LIMIT 20 OFFSET 5 PER GROUP)
>
>
> Does this make sense? Or is this something PG is already doing, and I just 
> haven’t found the right magic words / built my index correctly to unlock it? 
> (I notice that the last example is an index-only scan; would I get this 
> behaviour from the previous two queries if I made the index a covering index 
> such that those could be index-only scans as well?)

Unfortunately, there is no magic words here. PostgreSQL 14 simply has
no ability to know that row_number() is monotonically increasing,
therefore has no ability to skip any processing for rows that are
never needed.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9d9c02ccd1aea8e9131d8f4edb21bf1687e40782
[2] https://wiki.postgresql.org/wiki/Loose_indexscan




Re: Useless memoize path generated for unique join on primary keys

2022-05-03 Thread David Rowley
On Wed, 4 May 2022 at 00:21, Benjamin Coutu  wrote:
> Thanks David, using extended statistics for both (and only for both) tables 
> solved this problem.

Oh, whoops. I did get that backwards.  The estimate used by the
Memoize costing code is from the outer side of the join, which is the
extdataempty in this case. I don't think the
extdataregular_field_index_stats will do anything. It'll be the ones
you added on extdataempty that are making it work.

> BTW, thank you for all your work on performance in recent releases.

Thanks for the feedback :)

David




Re: Useless memoize path generated for unique join on primary keys

2022-05-03 Thread David Rowley
On Tue, 3 May 2022 at 23:05, Benjamin Coutu  wrote:
>   ->  Memoize  (cost=0.58..0.67 rows=1 width=16) (actual time=0.002..0.002 
> rows=0 loops=4067215)
> Cache Key: e2.field, e2.index
> Cache Mode: logical
> Hits: 0  Misses: 4067215  Evictions: 3228355  Overflows: 0  Memory 
> Usage: 65537kB
> Buffers: shared hit=16268863
> ->  Index Only Scan using pk_extdataregular on extdataregular e1  
> (cost=0.57..0.66 rows=1 width=16) (actual time=0.001..0.001 rows=0 
> loops=4067215)
>   Index Cond: ((field = e2.field) AND (index = e2.index))

> Why would it ever make sense to generate a memoize plan for a unique join?

It wouldn't ever make sense.

The problem is that estimate_num_groups() is used to estimate the
number of distinct values and that function does not know about
primary keys. There's no way the costing of Memoize would allow a
Memoize plan to be used if it thought all values were unique, so the
only possibility here is that ndistinct is being underestimated by
some amount that makes Memoize look like the most favourable plan.

You could see what the planner thinks about the ndistinct estimate on
field, index by doing:

EXPLAIN SELECT field,index FROM extdataregular GROUP BY 1,2;

Whatever you see in the final row estimate for that plan is what's
being fed into the Memoize costing code.

> I think this issue might tie in with the current discussion over on the 
> hackers mailing list [1]

I'd say it's a pretty different problem. The cache hit ratio
discussion on that thread talks about underestimating the hit ratio.
That particular problem could only lead to Memoize plans *not* being
chosen when they maybe should be. Not the other way around, which is
your case.

create statistics extdataregular_field_index_stats (ndistinct) on
field, index from extdataregular;
analyze extdataregular;

would likely put that right.

David




Re: Performance differential when 0 values present vs when 1 values present. Planner return 52k rows when 0 expected.

2022-04-27 Thread David Rowley
On Wed, 27 Apr 2022 at 19:54, Emil Iggland  wrote:
>
>  > You've got the wrong column order (for this query anyway) in that
>  > index.  It'd work a lot better if dataview were the first column;

> I might be misunderstanding you, but I assume that you are suggesting an
> index on (dataview, valuetimestamp).
> We have that index, it is the primary key. For some reason it isn't
> being selected.

I don't think that index can be used for your original query. It could
only be used if "channel" is unique in "valueseries" and you'd written
the query as:

select * from datavalue
where dataview = (select id from valueseries where channel = 752433)
ORDER BY VALUETIMESTAMP DESC
FETCH FIRST ROW only;

that would allow a backwards index scan using the (dataview,
valuetimestamp) index.  Because you're using the IN clause to possibly
look for multiple "dataview" values matching the given "channel", the
index range scan does not have a single point to start at.  What
you've done with the LATERAL query allows the index to be scanned once
for each "valueseries" row with a "channel" value matching your WHERE
clause.
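
For comparison, the LATERAL form being described looks roughly like this (a
sketch only; the actual query from the earlier message may differ in detail):

select dv.*
from valueseries vs
cross join lateral (
    select d.*
    from datavalue d
    where d.dataview = vs.id          -- one backwards index scan per valueseries row
    order by d.valuetimestamp desc
    fetch first row only
) dv
where vs.channel = 752433;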

I guess "channel" must not be the primary key to "valueseries" and
that's why you use an IN().

The above query would return an error if multiple rows were returned
by the subquery.

David




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread David Rowley
On Fri, 23 Jul 2021 at 04:14, Peter Geoghegan  wrote:
>
> On Thu, Jul 22, 2021 at 8:45 AM Tom Lane  wrote:
> > That is ... weird.  Maybe you have found a bug in the spill-to-disk logic;
> > it's quite new after all.  Can you extract a self-contained test case that
> > behaves this way?
>
> I wonder if this has something to do with the way that the input data
> is clustered. I recall noticing that that could significantly alter
> the behavior of HashAggs as of Postgres 13.

Isn't it more likely to be reaching the group limit rather than the
memory limit?

if (input_groups * hashentrysize < hash_mem * 1024L)
{
    if (num_partitions != NULL)
        *num_partitions = 0;
    *mem_limit = hash_mem * 1024L;
    *ngroups_limit = *mem_limit / hashentrysize;
    return;
}

There are 55 aggregates on a varchar(255). I think hashentrysize is
pretty big. If it was 255*55 then only 765591 groups fit in the 10GB
of memory.

David




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread David Rowley
On Fri, 23 Jul 2021 at 03:56, Tom Lane  wrote:
> So basically, we now have a hard restriction that hashaggs can't use
> more than INT_MAX kilobytes, or approximately 2.5GB, and this use case
> is getting eaten alive by that restriction.  Seems like we need to
> do something about that.

Hmm, math check?

postgres=# select pg_size_pretty(power(2,31)::numeric*1024);
 pg_size_pretty
----------------
 2048 GB
(1 row)

David




Re: Big performance slowdown from 11.2 to 13.3

2021-07-21 Thread David Rowley
On Thu, 22 Jul 2021 at 16:37, l...@laurent-hasson.com
 wrote:
> Seems like no cigar ☹ See plan pasted below. I changed the conf as follows:
>   - hash_mem_multiplier = '2'
>   - work_mem = '1GB'

> Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126560kB

You might want to keep going higher with hash_mem_multiplier until you
see no "Disk Usage" there.  As mentioned, v11 didn't spill to disk and
just used all the memory it pleased.  That was a bit dangerous as it
could result in OOM, so it was fixed.
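
For example (the values here are purely illustrative, not a recommendation):

SET work_mem = '1GB';
SET hash_mem_multiplier = 4.0;  -- allows hash-based nodes up to ~4GB before spilling

then re-run EXPLAIN (ANALYZE, BUFFERS) and check whether "Disk Usage" still
shows up in the Hash Aggregate node.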

David




Re: Linear slow-down while inserting into a table with an ON INSERT trigger ?

2021-07-17 Thread David Rowley
On Sat, 17 Jul 2021 at 16:40, Justin Pryzby  wrote:
> You could run a single UPDATE rather than 30k triggers.
> Or switch to an INSERT on the table, with an index on it, and call
> max(last_parent_table_change) from whatever needs to ingest it.  And prune the
> old entries and vacuum it outside the transaction.  Maybe someone else will
> have a better suggestion.

Maybe just change the UPDATE statement to:

UPDATE data_sync SET last_parent_table_change=CURRENT_TIMESTAMP WHERE
last_parent_table_change <> CURRENT_TIMESTAMP;

That should reduce the number of actual updates to 1 per transaction.

David




Re: Partition column should be part of PK

2021-07-11 Thread David Rowley
On Mon, 12 Jul 2021 at 12:37, Nagaraj Raj  wrote:
> personally, I feel this design is very bad compared to other DB servers.

I'm not sure exactly what you're referring to here as you didn't quote
it, but my guess is you mean our lack of global index support.

Generally, there's not all that much consensus in the community that
this would be a good feature to have.  Why do people want to use
partitioning?  Many people do it so that they can quickly remove data
that's no longer required with a simple DETACH operation.  This is
metadata only and is generally very fast.  Another set of people
partition as their tables are very large and they become much easier
to manage when broken down into parts.  There's also a group of people
who do it for the improved data locality.   Unfortunately, if we had a
global index feature then that requires building a single index over
all partitions.  DETACH is no longer a metadata-only operation as we
must somehow invalidate or remove tuples that belong to the detached
partition. The group of people who partitioned to get away from very
large tables now have a very large index.  Maybe the only group to get
off lightly here are the data locality group. They'll still have the
same data locality on the heap.

So in short, many of the benefits of partitioning disappear when you
have a global index.

So, why did you partition your data in the first place?  If you feel
like you wouldn't mind having a large global index over all partitions
then maybe you're better off just using a non-partitioned table to
store this data.

David




Re: Planning performance problem (67626.278ms)

2021-07-02 Thread David Rowley
On Thu, 1 Jul 2021 at 08:56, Manuel Weitzman  wrote:
> For each of these RestrictInfos there *could* be one cache miss on
> cached_scansel() that *could* force the planner to compute
> get_actual_variable_range() for the same variable (a.a) over and over,
> as mergejoinscansel() always computes the selectivity for the
> intervals that require actual extremal values. In practice this
> re-computing of the variable range seems to happen a lot.

Recently, for some other topic, I was thinking about what would happen
if the executor were ever able to give up on a plan because something
did not play out the way the planner expected, and throw the plan back
at the planner with some hints about where it went wrong. I wondered
what exactly those hints would look like.

I guessed these would be a list of some rough guidelines saying that x
JOIN y ON x.a=y.b produces at least Z rows. Then the planner would
find that when estimating the size of the join between x and y and
take the maximum of two values.  That would need to be designed in
such a way that the planner could quickly consult the feedback, e.g. a
hashtable lookup on Oid.

Anyway, I don't really have any clearly thought-through plans for that
idea as it would be a large project that would need a lot of changes
before it could be even thought about seriously. However, it did cause
me to think about that again when reading this thread as it seems
similar. You've learned the actual variable range through actual
execution, so it does not seem too unreasonable that information might
get stored in a similar place, if that place were to exist.

I'm not saying that's what needs to happen here. It's more just food
for thought.  The caching you need does seem more specific to Oid than
relid.

David




Re: PgSQL 12 on WinSrv ~3x faster than on Linux

2021-06-04 Thread David Rowley
On Fri, 4 Jun 2021 at 23:53, Taras Savchuk  wrote:
> My real life test is to "register" 10 _same_ documents (i.e. post the documents)
> in each of 1C/PostgreSQL DBs. Both PostgreSQL DBs are identical and just 
> before test imported to PostgreSQL via application server (DT import).
> On Windows Server test procedure takes 20-30 seconds, on Linux it takes 
> 1m-1m10seconds. PostgreSQL VMs are running on same Hypervisor with same 
> resources assigned to each of them.
> Tuning PostgreSQL config and/or CentOS don't make any difference. Contrary on 
> Windows VM we have almost 3x better performance with stock PostgreSQL config.
>
> Any ideas what's wrong? For me such a big difference on identical 
> databases/queries looks strange.

It's pretty difficult to say. You've not provided any useful details
about the workload you're running.

If this "register 10 _same_ documents" thing requires running some
query, then you might want to look at EXPLAIN (ANALYZE, BUFFERS) for
that query.  You might want to consider doing SET track_io_timing =
on;  Perhaps Linux is having to read more buffers from disk than
Windows.

David.




Re: issue partition scan

2021-05-25 Thread David Rowley
On Wed, 26 May 2021 at 11:38, Nagaraj Raj  wrote:
>
> Apologies,  I didn't understand you completely.
>
> > 1. Those that have sub_soc.soc = 'NFWJYW0' and sub_soc.curr_ind = 'Y'
>
> > It can use constraint exclusion on these to only scan applicable partitions.
>
> > 2. Those that have (acc.acct = 'I' AND acc.acct_sub IN  ( '4', '5' ) )  OR 
> > sub.ban IN  ( '00','01','02','03','04','05' )
>
> > It can't use constraint exclusion on these since results can come from any 
> > partition.
>
> Why is it not using constraint exclusion on the above two conditions(1 and 2) 
> included in the where clause ?
>
> Both sets are pointing to different tables.

It's because of the OR condition.  If it was an AND condition then the
planner wouldn't have to consider the fact that records in other
partitions might be required for the join.

David




Re: Partition with check constraint with "like"

2021-05-21 Thread David Rowley
On Sat, 22 May 2021 at 10:59, Nagaraj Raj  wrote:
> ERROR: empty range bound specified for partition "mytable_z" DETAIL: 
> Specified lower bound ('Z') is greater than or equal to upper bound ('['). 
> SQL state: 42P17

It looks like '[' does not come after 'Z' in your collation.

David




Re: Partition with check constraint with "like"

2021-05-21 Thread David Rowley
On Sat, 22 May 2021 at 04:38, Nagaraj Raj  wrote:
> I am trying to create partitions on the table which have around 2BIL records 
> and users will always look for the "name", its not possible to create a 
> partition with a list, so we are trying to create a partition-based first 
> letter of the name column. name column has a combination of alpha numeric 
> values.

Going by the description of your use case, I think HASH partitioning
might be a better option for you. It'll certainly be less painful to
initially set up and maintain.

Here's an example:

create table mytable (a text) partition by hash(a);
create table mytable0 partition of mytable for values with(modulus 10,
remainder 0);
create table mytable1 partition of mytable for values with(modulus 10,
remainder 1);
create table mytable2 partition of mytable for values with(modulus 10,
remainder 2); --etc

Change the modulus to the number of partitions you want and ensure you
create a partition for each modulus. In this case, it would be 0 to 9.

David




Re: Partition with check constraint with "like"

2021-05-21 Thread David Rowley
On Fri, 21 May 2021 at 19:02, Nagaraj Raj  wrote:
> then what would be the range of Z
> FROM (Z) to (?) ;

postgres=# select chr(ascii('z')+1) ;
 chr
-----
 {
(1 row)


> same way for 9

postgres=# select chr(ascii('9')+1) ;
 chr
-----
 :
(1 row)

https://en.wikipedia.org/wiki/ASCII

You can also use MINVALUE and MAXVALUE to mean unbounded at either end
of the range.
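
For example, a final catch-all partition covering everything from 'z' upwards
could be declared like this (a sketch, assuming a text-keyed range-partitioned
parent table named mytable):

create table mytable_tail partition of mytable
    for values from ('z') to (MAXVALUE);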

But is there a particular need that you want to partition this way? It
seems like it might be a bit painful to maintain, especially if you're
not limiting yourself to ASCII or ANSI characters.

You might want to consider HASH partitioning if you're just looking
for a way to keep your tables and indexes to a more manageable size.
You've not really mentioned your use case here, so it's hard to give
any advice.

There are more details about partitioning in
https://www.postgresql.org/docs/current/ddl-partitioning.html

David




Re: Partition with check constraint with "like"

2021-05-20 Thread David Rowley
On Fri, 21 May 2021 at 12:32, Nagaraj Raj  wrote:
> I am trying to create partitions on the table based on first letter of the 
> column record  value using inherit relation & check constraint.

You'll get much better performance out of native partitioning than you
will with the old inheritance method of doing it.

>  EXECUTE 'CREATE TABLE partition_tab.' || c_table || '(check ( name like 
> '''|| chk_cond||''')) INHERITS (' ||TG_TABLE_NAME|| ');';

This is a bad idea. There's a lock upgrade hazard here that could end
up causing deadlocks on INSERT.  You should just create all the tables
you need beforehand.

I'd recommend you do this using RANGE partitioning. For example:

create table mytable (a text not null) partition by range (a);
create table mytable_a partition of mytable for values from ('a') to
('b'); -- note the upper bound of the range is non-inclusive.
create table mytable_b partition of mytable for values from ('b') to ('c');
insert into mytable values('alpha'),('bravo');

explain select * from mytable where a = 'alpha';
                            QUERY PLAN
------------------------------------------------------------------
 Seq Scan on mytable_a mytable  (cost=0.00..27.00 rows=7 width=32)
   Filter: (a = 'alpha'::text)
(2 rows)

The mytable_b is not scanned.

David




Re: Index and statistics not used

2021-05-17 Thread David Rowley
On Tue, 18 May 2021 at 08:42,  wrote:
> Running delete from table1 where id = 48938 the trigger for constraint runs 
> for 20 seconds
>
> Event when doing a simple select from table2 where table1_id = 48938 takes 
> about 8 seconds

Does EXPLAIN show it uses a seq scan for this 8-second SELECT?

If so, does it use the index if you SET enable_seqscan TO off; ? If
so, how do the costs compare to the seqscan costs?

Is random_page_cost set to something sane? Are all the indexes valid?
(psql's \d table2 would show you INVALID if they're not.)

does: SHOW enable_indexscan; show that index scanning is switched on?

David




Re: BUG #16968: Planner does not recognize optimization

2021-05-17 Thread David Rowley
On Sat, 15 May 2021 at 00:39, KES  wrote:
>
> Thank you for detailed explanation. I glad to hear that I can use aliases and 
> this will be recognized and optimization is applied.
>
> >We'd need some sort of ability to assign ressortgroupref to a particular 
> >column within a
> whole-row var
> Could it be possible to create hidden alias in same way as I did that 
> manually?
>
> Algorithm seems not complex:
> 1. User refer column from composite type/whole-row: (o).agreement_id
> 2. Create hidden column at select: _o_agreement_id
> 3. Replace other references to (o).agreement_id by _o_agreement_id
> 4. Process query as usual after replacements

Internally, PostgreSQL does use a hidden column for columns that are
required for calculations but are not in the SELECT list, e.g. ones
that are in the GROUP BY / ORDER BY, or in your case a window
function's PARTITION BY. We call these "resjunk" columns.  The problem
is you can't reference those from the parent query. If you explicitly
had listed that column in the SELECT clause, it won't cost you
anything more since the planner will add it regardless and just hide
it from you.  When you add it yourself you'll be able to use it in the
subquery and you'll be able to filter out the partitions that you
don't want.

I really think you're driving yourself down a difficult path by
expecting queries with whole-row vars to be optimised just as well as
using select * or explicitly listing the columns.

David




Re: BUG #16968: Planner does not recognize optimization

2021-05-17 Thread David Rowley
On Sun, 16 May 2021 at 02:34, Eugen Konkov  wrote:
> I found a case when `not assigning a ressortgroupref to the whole-row var` 
> cause
> wrong window function calculations.
>
> I  use  same  query.  The  difference  come  when I wrap my query into
> function. (see full queries in attachment)
>
> 1.
> SELECT *
> FROM agreement_totals( tstzrange( '2020-07-01', '2020-08-01' ) )
> WHERE agreement_id = 161::int  AND  (o).period_id = 10::int
>
> 2.
> SELECT *
>   sum(  ) over wagreement
> FROM 
> WHERE agreement_id = 161::int  AND  (o).period_id = 10::int
> WINDOW wagreement AS ( PARTITION BY agreement_id )
>
> For  first  query  window function calculates SUM over all agreements,
> then some are filtered out by (o).period_id condition.

This is unrelated to the optimisation that you were asking about before.

All that's going on here is that WHERE is evaluated before SELECT.
This means that your filtering is done before the window functions are
executed.  This is noted in the docs in [1]:

> The rows considered by a window function are those of the “virtual table” 
> produced by the query's FROM clause as filtered by its WHERE, GROUP BY, and 
> HAVING clauses if any. For example, a row removed because it does not meet 
> the WHERE condition is not seen by any window function. A query can contain 
> multiple window functions that slice up the data in different ways using 
> different OVER clauses, but they all act on the same collection of rows 
> defined by this virtual table.

If you want to filter rows after the window functions are evaluated
then you'll likely want to use a subquery.

David

[1] https://www.postgresql.org/docs/13/tutorial-window.html




Re: BUG #16968: Planner does not recognize optimization

2021-05-14 Thread David Rowley
On Fri, 14 May 2021 at 02:38, Eugen Konkov  wrote:
> Now I create minimal reproducible test case.
> https://dbfiddle.uk/?rdbms=postgres_13=761a00fb599789d3db31b120851d6341
>
> Optimization is not applyed when I filter/partition by column using composite 
> type name.

You probably already know this part, but let me explain it just in
case it's not clear.

The pushdown of the qual from the top-level query into the subquery,
or function, in this case, is only legal when the qual references a
column that's in the PARTITION BY clause of all window functions in
the subquery.  The reason for this is, if we filter rows before
calling the window function, then it could affect which rows are in
see in the window's frame. If it did filter, that could cause
incorrect results.  We can relax the restriction a bit if we can
eliminate entire partitions at once. The window function results are
independent between partitions, so we can allow qual pushdowns that
are in all PARTITION BY clauses.

As for the reason you're having trouble getting this to work, it's
down to the way you're using whole-row vars in your targetlist.

A slightly simplified case which shows this problem is:

create table ab(a int, b int);
explain select * from (select ab as wholerowvar,row_number() over
(partition by a) from ab) ab where (ab.wholerowvar).a=1;

The reason it does not work is down to how this is implemented
internally.   The details are, transformGroupClause() not assigning a
ressortgroupref to the whole-row var.  It's unable to because there is
no way to track which actual column within the whole row var is in the
partition by clause.  When it comes to the code that tries to push the
qual down into the subquery, check_output_expressions() checks if the
column in the subquery is ok to accept push downs or not. One of the
checks is to see if the query has windowing functions and to ensure
that the column is in all the PARTITION BY clauses of each windowing
function.  That check is done by checking if a ressortgroupref is
assigned and matches a tleSortGroupRef in the PARTITION BY clause.  In
this case, it does not match.  We didn't assign any ressortgroupref to
the whole-row var.

Unfortunately, whole-row vars are a bit of a 2nd-class citizen when it
comes to the query planner. Also, it would be quite a bit of effort to
make the planner push down the qual in this case. We'd need some sort
of ability to assign ressortgroupref to a particular column within a
whole-row var and we'd need to adjust the code to check for that when
doing subquery pushdowns to allow it to mention which columns within
whole-row vars can legally accept pushdowns.  I imagine that's
unlikely to be fixed any time soon.  Whole-row vars just don't seem to
be used commonly enough to warrant going to the effort of making this
stuff work.

To work around this, you should include a reference to the actual
column in the targetlist of the subquery, or your function, in this
case, and ensure you use that same column in the PARTITION BY clause.
You'll then need to write that column in your condition that you need
pushed into the subquery. I'm sorry if that messes up your design.
However, I imagine this is not the only optimisation that you'll miss
out on by doing things the way you are.
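
Continuing the simplified example above, the workaround amounts to something
like this (a sketch):

explain select *
from (select a,                                  -- plain column gets a ressortgroupref
             ab as wholerowvar,
             row_number() over (partition by a)
      from ab) ab
where ab.a = 1;                                  -- qual on the plain column can be pushed down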

David




Re: 15x slower PreparedStatement vs raw query

2021-05-05 Thread David Rowley
On Tue, 4 May 2021 at 22:05, Alex  wrote:
> Shouldn't this process be automatic based on some heuristics?

When plan_cache_mode is set to "auto", then the decision to use a
generic or custom plan is cost-based. See [1]. There's a fairly crude
method there for estimating the effort required to replan the query.
The remainder is based on the average cost of the previous custom
plans + estimated planning effort vs cost of the generic plan.  The
cheaper one wins.

Certainly, what's there is far from perfect.  There are various
problems with it.  The estimated planning cost is pretty crude and
could do with an overhaul.   There are also issues with the plan costs
not being true to the cost of the query.  One problem there is that
run-time partition pruning is not costed into the plan.  This might
cause choose_custom_plan() to pick a custom plan when a generic one
with run-time pruning might have been better.

In order to get a better idea of where things are going wrong for you,
we'd need to see the EXPLAIN ANALYZE output for both the custom and
the generic plan.
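
One way to capture both is to force each mode in turn.  A rough sketch (the
table and parameter here are hypothetical):

PREPARE q (bigint) AS SELECT * FROM mytable WHERE id = $1;

SET plan_cache_mode = force_custom_plan;
EXPLAIN (ANALYZE, BUFFERS) EXECUTE q (42);

SET plan_cache_mode = force_generic_plan;
EXPLAIN (ANALYZE, BUFFERS) EXECUTE q (42);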

David

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/utils/cache/plancache.c#L1019




Re: Planning performance problem (67626.278ms)

2021-04-21 Thread David Rowley
On Thu, 22 Apr 2021 at 00:03, Jeremy Schneider  wrote:
>
> Two years later, I still remember this. And today I just confirmed
> someone hitting this on open source PG13.

The only thing that changed about get_actual_variable_range() is that
it now uses a SnapshotNonVacuumable snapshot. Previously a
long-running transaction could have caused vacuum to be unable to
remove tuples which could have caused get_actual_variable_range() to
be slow if it had to skip the unvacuumable tuples.

That's now changed as the SnapshotNonVacuumable will see any tuples
required by that long-running transaction and use that to determine
the range instead of skipping over it.

Anyone with a large number of tuples that vacuum can remove that are
at either end of the range on a column that is indexed by a btree
index could still have issues.  Vacuuming more often might be a good
thing to consider.  With the original report on this thread there were
more dead tuples in the table than live tuples.  Disabling auto-vacuum
or tuning it so it waits that long is likely a bad idea.

FWIW, here's a simple test case that shows the problem in current master.

create table a (a int primary key) with (autovacuum_enabled = off);
insert into a select x from generate_series(1,10000000) x;
analyze a;
delete from a;
\timing on
explain select * from a where a < 10000000;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on a  (cost=0.00..169247.71 rows=9998977 width=4)
   Filter: (a < 10000000)
(2 rows)


Time: 9062.600 ms (00:09.063)

vacuum a;
explain select * from a where a < 10000000;
                     QUERY PLAN
---------------------------------------------------
 Seq Scan on a  (cost=0.00..0.00 rows=1 width=4)
   Filter: (a < 10000000)
(2 rows)

Time: 2.665 ms

Notice that it became faster again after I did a vacuum.

David




Re: Potential performance issues related to group by and covering index

2021-03-02 Thread David Rowley
On Wed, 3 Mar 2021 at 10:04, Michael Lewis  wrote:
> Are there guidelines or principles you could share about writing the group by 
> clause such that it is more efficient?

If you have the option of writing them in the same order as an
existing btree index that covers the entire GROUP BY clause (in
version < PG13) or at least prefix of the GROUP BY clause (version >=
PG13), then the planner has a chance to make use of that index to
provide pre-sorted input to do group aggregate.

Since PG13 has Incremental Sort, having an index that covers only a
prefix of the GROUP BY clause may still help.

If no indexes exist then you might get better performance by putting
the most distinct column first.  That's because sorts don't need to
compare the remaining columns once it receives two different values
for one column.  That gets more complex when the most distinct column
is wider than the others. e.g a text compare is more expensive than
comparing two ints.  For Hash Aggregate, I don't think the order will
matter much.
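
As a rough illustration (hypothetical table and index, and assuming product
is more distinct than region):

create table sales (region text, product int, amount numeric);
create index on sales (region, product);

-- index order matches the GROUP BY, so pre-sorted group aggregate is possible:
select region, product, sum(amount) from sales group by region, product;

-- without a usable index, leading with the most distinct column can make the sort cheaper:
select region, product, sum(amount) from sales group by product, region;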

David




Re: Performance issues related to left join and order by

2021-03-02 Thread David Rowley
On Tue, 2 Mar 2021 at 21:53, Liu, Xinyu  wrote:
> *Expected Behavior
>
> Since these two queries are semantically equivalent, we were hoping that 
> PostgreSQL would evaluate them in roughly the same amount of time.
> It looks to me that there is a missing optimization rule related to pushing 
> the sort operator (i.e., order and limit) through the left join.
> Given the significant query execution time difference, I was wondering if it 
> is worth adding such a rule to make the system evaluate the first query more 
> efficiently.
> It would also be helpful if you could comment on if there is a standard 
> practice to evaluate the tradeoff associated with adding such a rule in 
> Postgresql.

We currently don't attempt to push down LIMIT clauses into subqueries.
Before we did that we'd need to get much better at figuring out how
joins duplicate rows so that we could be sure that we're not limiting
the subquery more than the number of records that the outer query will
need to reach its limit.

If you want some advice, you're likely to get more people on your side
and possible support for making improvements to the query planner if
you provide examples that look remotely like real-world queries.   In
the other emails that I've read from you on this list [1], it seems
your example queries are all completely bogus.  I suspect that the
queries are generated by some fuzz testing tool.  I very much imagine
that you really don't need help with these at all.  With respect, it seems
to me that there's about zero chance that you genuinely need the
results of this query more quickly and you've come for help with that.

Because PostgreSQL does not proactively cache query plans, ad-hoc
queries are always parsed, planned then executed.  This means that
it's often not practical to spend excessive amounts of time planning a
query that gets executed just once. Adding new optimisations to the
query planner means they either have to be very cheap to detect, or
they must pay off in many cases.

If you happen to think there's a genuine case for having the query
planner do a better job of doing LIMIT pushdowns into subqueries, then
you're welcome to submit a patch to implement that.  You'll also need
to carefully document exactly which cases the LIMIT can be pushed down
and when it cannot.  That's the hard part. The actual pushing down of
the clause is dead easy.  If you're going to do that, then I'd suggest
you come up with better examples than this one. I don't think many
people will get on board with your newly proposed optimisations when
the queries are obviously not real.  It's hard to imagine the
optimisation being useful to any queries with a query that's so
obviously not a real one.

David

[1] 
https://www.postgresql.org/message-id/BN7PR07MB52024B973EAB075F4DF6C19ACD999%40BN7PR07MB5202.namprd07.prod.outlook.com




Re: Potential performance issues related to group by and covering index

2021-03-02 Thread David Rowley
On Tue, 2 Mar 2021 at 21:53, Liu, Xinyu  wrote:
> *Expected Behavior
>
> Since these two queries are semantically equivalent, we were hoping that 
> PostgreSQL would evaluate them in roughly the same amount of time.
> It looks to me that different order of group by clauses triggers different 
> plans: when the group by clauses (ps_partkey, ps_suppkey) is the same as the 
> covering index, it will trigger an index scan on associated columns;
> however,  when the group by clauses have different order than the covering 
> index (ps_suppkey, ps_partkey), the index scan will not be triggered.
> Given that the user might not pay close attention to this subtle difference, 
> I was wondering if it is worth making these two queries have the same and 
> predictable performance on Postgresql.

Unfortunately, it would take a pretty major overhaul of the query
planner to do that efficiently.

For now, we have a few smarts involved in trying to make the GROUP BY
processing more efficient:

1) We remove columns from the GROUP BY if they're functionally
dependent on the primary key, providing the primary key is present
too. (you're seeing this in your example query)
2) We also change the order of the GROUP BY columns if it's a subset
of the ORDER BY columns.  This is quite good as we'd do grouping by
{b,a} if someone wrote GROUP BY a,b ORDER BY b,a; which would save
having to re-sort the data for the ORDER BY after doing the GROUP BY.
That's especially useful for queries with a LIMIT clause.
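
As a quick illustration of 1) and 2) (hypothetical table):

create table t (id int primary key, a int, b int);

-- 1) b is functionally dependent on the primary key, so grouping is done on id alone:
select id, b, count(*) from t group by id, b;

-- 2) the GROUP BY is a subset of the ORDER BY, so it can be grouped as {b,a},
--    avoiding a re-sort for the ORDER BY:
select a, b, count(*) from t group by a, b order by b, a;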

If we want to do anything much smarter than that like trying every
combination of the GROUP BY clause, then plan times are likely going
to explode. The join order search is done based on the chosen query
pathkeys, which in many queries is the pathkeys for the GROUP BY
clause (see standard_qp_callback()). This means throughout the join
search, planner will try and form paths that provide pre-sorted input
that allows the group by to be implemented efficiently with pre-sorted
data. You might see Merge Joins rather than Hash Joins, for example.

If we want to try every combination of the GROUP BY columns then it
means repeating that join search once per combination. The join search
is often, *by far*, the most expensive part of planning a query.

While it would be nice if the planner did a better job on selecting
the best order for group by columns, unless we can come up with some
heuristics that allow us to just try a single combination that is
likely good, then I don't think anyone would thank us for slowing down
the planner by a factor of the number of possible combinations of the
group by columns.

David




Re: Postgres Optimizer ignores information about foreign key relationship, severly misestimating number of returned rows in join

2020-10-26 Thread David Rowley
On Tue, 27 Oct 2020 at 06:54, Ehrenreich, Sigrid  wrote:
>   ->  Hash Join  (cost=226.27..423.82 rows=115 width=0) (actual 
> time=3.150..7.511 rows=3344 loops=1)   <=== With the FK, the 
> estimation should be 3344, but it is 115 rows

I'd have expected this to find the foreign key and have the join
selectivity of 1.0, but I see it does not, because one of the
EquivalenceClasses has a constant due to the fact.low_card = 1 qual.

In build_join_rel() we call build_joinrel_restrictlist() to get the
join quals that need to be evaluated at the join level, but we only
get the fact.anydata1=dim.anydata1 and fact.anydata2=dim.anydata2
quals there.  The low_card qual gets pushed down to the scan level on
each side of the join, so no need for it to get evaluated at the join
level. Later in build_join_rel() we do set_joinrel_size_estimates().
The restrictlist with just the two quals is what we pass to
get_foreign_key_join_selectivity().  Only two of the foreign key
columns are matched there, therefore we don't class that as a match
and just leave it up to the normal selectivity functions.

I feel like we could probably do better there and perhaps somehow
count ECs with ec_has_const as matched, but there seems to be some
assumptions later in get_foreign_key_join_selectivity() where we
determine the selectivity based on the base rel's tuple count.  We'd
need to account for how many rows remain after filtering the ECs
with ec_has_const == true, else we'd be doing the wrong thing.  That
needs more thought than I have time for right now.

Your case would work if the foreign key had been on just anydata1 and
anydata2, but there's not much chance of that working without a unique
index on those two columns.

Extended statistics won't help you here either since they're currently
not used for join estimations.

David




Re: Query Performance / Planner estimate off

2020-10-20 Thread David Rowley
On Tue, 20 Oct 2020 at 22:38, Mats Julian Olsen  wrote:
>
> The crux of our issue is that the query planner chooses a nested loop join 
> for this query. Essentially making this query (and other queries) take a very 
> long time to complete. In contrast, by toggling `enable_nestloop` and 
> `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 
> minutes.
>
> 1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
> 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
> 3) enable_nestloop=off; enable_seqscan=off (2 min): 
> https://explain.depesz.com/s/0WXx
>
> How can I get Postgres not to loop over 12M rows?

You'll likely want to look at what random_page_cost is set to. If the
planner is preferring nested loops then it may be too low.  You'll
also want to see if effective_cache_size is set to something
realistic.  Higher values of that will prefer nested loops like this.

You may also want to reduce max_parallel_workers_per_gather.  It looks
like you're not getting your parallel workers as often as you'd like.
If the planner chooses a plan thinking it's going to get some workers
and gets none, then that plan may be inferior to the one that the planner
would have chosen if it had known the workers would be unavailable.

> Let me know if there is anything I left out here that would be useful for 
> further debugging.

select name,setting from pg_Settings where category like 'Query
Tuning%' and source <> 'default';
select version();

would be useful.

David




Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-09-13 Thread David Rowley
On Tue, 8 Sep 2020 at 06:05, Raj  wrote:
>
> > This would not exactly look like a bug, because the message says "to
> > be locked", so at least it's not allowing two workers to lock the same
> > tuple. But it seems that the skip-locked mode should not make an error
> > out of this, but treat it as the tuple was already locked. Why would
> > it want to lock the tuple (representing the job) if another worker has
> > already finished his UPDATE of the job to mark it as "done" (which is
> > what makes the tuple move to the "completed" partition.)

(It's not very clear who wrote the above text since the quote does not
mention who the author is and the original email didn't appear to have
made it to the list)

It's not a bug. I think the quoted text is expecting a bit too much
from the database. It does not know that if the tuple is updated and
moved to another partition that it can be safely ignored.   For all
the database knows, the new version of the tuple that's in the new
partition still matches the query's WHERE clause and should be locked.
If we just go and ignore moved off tuples then we could miss
processing tuples that still need to be processed.

It's perhaps not impossible to make it work slightly better if it were
somehow possible to inform heapam_tuple_lock() that it's operating on
a partition and the query queried a partitioned table and that all but
1 partition was pruned with partition pruning.  In this case we could
be certain the new version of the tuple can't match the WHERE clause
of the SELECT since partition pruning determined that all other
partitions don't match the WHERE clause. However, that's:

a) a pretty horrid thing to have to teach heapam_tuple_lock() about, and;
b) only going to work when 1 partition survives partition pruning,
which is pretty horrible since doing ATTACH PARTITION could suddenly
cause your queries to fail randomly.

If you had 3 partitions, one for "pending", "retry" and "complete",
and you wanted to lock all rows that are in a "pending" or "retry"
state, then when we encounter an updated row in the "pending"
partition, we have no knowledge if it was moved into the "retry" or
the "completed" partition.   If it's in "retry", then we do want to
find it and process it, but if it's in "completed", then it does not
match the WHERE clause of the query and we can ignore it.  Since we
don't know which, we can't make assumptions and must force the user to
try again, hence the serialisation failure error.

> > Either the SELECT for jobs to do returned a wrong tuple, which was
> > already updated, or there is some lapse in the locking.
> >
> > Either way it would seem to be a waste of time throwing all these
> > errors when the tuple should not even have been selected for update
> > and locking.
> >
> > I wonder if anybody knows anything about that issue? Of course you'll
> > want to see the DDL and SQL queries, etc. but you can't really try it
> > out unless you do some massively parallel magic.

I already mentioned why this cannot work that way [1].  If you have some
idea on how to make it work correctly, then it would be interesting to
hear. Otherwise, I'm sorry to say that we can't just ignore these
tuples because it happens to suit your use case.

The solution is just to make the application retry on serialisation failures.

David

[1] 
https://www.postgresql.org/message-id/CAApHDvrDH6TQeLxTqnnAnhjrs55ru5g2_QMG=me+wvd5mmp...@mail.gmail.com




Re: Query performance issue

2020-09-05 Thread David Rowley
On Sat, 5 Sep 2020 at 10:20, Nagaraj Raj  wrote:
> I added the index as you suggested and the planner going through the bitmap 
> index scan,heap and the new planner is,
> HaOx | explain.depesz.com

In addition to that index, you could consider moving away from
standard SQL and use DISTINCT ON, which is specific to PostgreSQL and
should give you the same result.

EXPLAIN ANALYZE
SELECT DISTINCT ON (serial_no) serial_no,receivingplant,sku,r3_eventtime
FROM receiving_item_delivered_received
WHERE eventtype='LineItemdetailsReceived'
  AND replenishmenttype = 'DC2SWARRANTY'
  AND coalesce(serial_no,'') <> ''
ORDER BY serial_no,eventtime DESC;

The more duplicate serial_nos you have the better this one should
perform.  It appears you don't have too many so I don't think this
will be significantly faster, but it should be a bit quicker.

David




Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

2020-08-20 Thread David Rowley
On Fri, 21 Aug 2020 at 11:01, Michael Lewis  wrote:
>
> On Thu, Aug 20, 2020 at 4:40 PM Jim Jarvie  wrote:
>>
>> On 20-Aug.-2020 17:42, Michael Lewis wrote:
>>
>> Can you share an explain analyze for the query that does the select for
>> update? I wouldn't assume that partition pruning is possible at all with
>> hash, and it would be interesting to see how it is finding those rows.
>>
>> Well this got interesting  - the already moved error showed up:  Note, the 
>> actual process partitions are regular table partitions, these are not 
>> hashed.  Only the incoming and completed are hashed due to row counts at 
>> either end of the processing; in flight (where the issue shows up) is quite 
>> small:
>>
>> [queuedb] # explain analyze select queueid,txobject,objectid,state from 
>> mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and 
>> txobject = 'ticket' limit 250 for update skip locked;
>> ERROR:  40001: tuple to be locked was already moved to another partition due 
>> to concurrent update
>> LOCATION:  heapam_tuple_lock, heapam_handler.c:405
>> Time: 579.131 ms
>
> That is super curious. I hope that someone will jump in with an explanation 
> or theory on this.
>
> I still wonder why the move between partitions is needed though if the work 
> is either done (failed or successful) or not done... not started, retry 
> needed or in progress... it doesn't matter. It needs to get picked up by the 
> next process if it isn't already row locked.

I may be heading off in the wrong direction as I'm not fully sure I
understand what the complaint is about, but isn't the executor just
hitting dead rows in one of the active or failed partitions that have
been moved off to some other partition?

When updates occur in a non-partitioned table we can follow item
pointer chains to find the live row and check if the WHERE clause
still matches to determine if the row should be updated, or in this
case just locked since it's a SELECT FOR UPDATE. However, with
partitioned table, a concurrent UPDATE may have caused the row to have
been moved off to another partition, in which case the tuple's item
pointer cannot point to it since we don't have enough address space,
we only have 6 bytes for a TID. To get around the fact that we can't
follow these update chains, we just throw the serialization error,
which is what you're getting. Ideally, we'd figure out where the live
version of the tuple is and check if it matches the WHERE clause and
lock it if it does, but we've no means to do that with the current
design.

If the complaint is about the fact that you're getting the error and
you think you shouldn't be because you said "SKIP LOCKED" then I'm not
really sure the fact that you said "SKIP LOCKED" gives us the right to
ignore this case. The user only gave us the go-ahead to skip locked
tuples, not skip tuples that we just failed to follow item pointer
chains for.   It might be okay to do this for rows that have since
been marked as complete since they no longer match your WHERE clause,
however, if the row has gone from the queue_tx_active partition into
the queue_tx_fail_retryable partition then I don't see why we'd have
the right to skip the tuple. Your query says you want tuples that need
to be retried. We can't go skipping them.

So isn't the fix just to code the application to retry on 40001 errors?

David




Re: Performance issue

2020-06-14 Thread David Rowley
On Mon, 15 Jun 2020 at 10:46, Nagaraj Raj  wrote:
> CREATE TABLE test1
> (
...

> CONSTRAINT pk_i_entity_proxy_id PRIMARY KEY (individual_entity_proxy_id)
>
> );

> CREATE TABLE test2
> (
...

> CONSTRAINT pk_entity_proxy_id PRIMARY KEY (individual_entity_proxy_id)
> );
>
>
> User query:
>
> explain analyze select COUNT(*) as "DII_1"
>   from ( select distinct table0."individual_entity_proxy_id" as 
> "INDIVIDUAL_ENTITY_PROXY_ID"
> from test1 table0
> inner join test2 table1
> on table0."individual_entity_proxy_id" = 
> table1."individual_entity_proxy_id"

Why do you use "select distinct". It seems to me that you're putting a
distinct clause on the primary key of test1 and joining to another
table in a way that cannot cause duplicates.

I imagine dropping that distinct will speed up the query quite a bit.
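
In other words, something roughly like the following should give the same
count (a sketch based on the quoted fragment; the real query may carry extra
predicates that would need to be kept):

select count(*) as "DII_1"
from test1 table0
inner join test2 table1
  on table0.individual_entity_proxy_id = table1.individual_entity_proxy_id;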

David




Re: view reading information_schema is slow in PostgreSQL 12

2020-06-13 Thread David Rowley
On Sat, 13 Jun 2020 at 19:52, David Rowley  wrote:
>
> On Sat, 13 Jun 2020 at 16:07, Tom Lane  wrote:
> >
> > David Rowley  writes:
> > > I wondered if it would be more simple to add some smarts to look a bit
> > > deeper into case statements for selectivity estimation purposes. An
> > > OpExpr like:
> > > CASE c.contype WHEN 'c' THEN 'CHECK' WHEN 'f' THEN 'FOREIGN KEY' WHEN
> > > 'p' THEN 'PRIMARY KEY' WHEN 'u' THEN 'UNIQUE' END = 'CHECK';
> >
> > Hm.  Maybe we could reasonably assume that the equality operators used
> > for such constructs are error-and-side-effect-free, thus dodging the
> > semantic problem I mentioned in the other thread?
>
> I'm only really talking about selectivity estimation for now.
> I'm not really sure why we'd need to ensure that the equality operator
> is error and side effect free.  We'd surely only be executing the case
> statement's operator's oprrest function?  We'd need to ensure we don't
> invoke any casts that could error out.

Hmm, after a bit of thought I now see what you mean.  We'd need to
loop through each WHEN clause to ensure there's a Const and check if
that Const is equal to the Const on the other side of the OpExpr, then
select the first match. That, of course, must perform a comparison,
but that's not really doing anything additional to what constant
folding code already does, is it?

David




Re: view reading information_schema is slow in PostgreSQL 12

2020-06-13 Thread David Rowley
On Sat, 13 Jun 2020 at 16:07, Tom Lane  wrote:
>
> David Rowley  writes:
> > I wondered if it would be more simple to add some smarts to look a bit
> > deeper into case statements for selectivity estimation purposes. An
> > OpExpr like:
> > CASE c.contype WHEN 'c' THEN 'CHECK' WHEN 'f' THEN 'FOREIGN KEY' WHEN
> > 'p' THEN 'PRIMARY KEY' WHEN 'u' THEN 'UNIQUE' END = 'CHECK';
>
> Hm.  Maybe we could reasonably assume that the equality operators used
> for such constructs are error-and-side-effect-free, thus dodging the
> semantic problem I mentioned in the other thread?

I'm only really talking about selectivity estimation for now.
I'm not really sure why we'd need to ensure that the equality operator
is error and side effect free.  We'd surely only be executing the case
statement's operator's oprrest function?  We'd need to ensure we don't
invoke any casts that could error out.

David




Re: view reading information_schema is slow in PostgreSQL 12

2020-06-12 Thread David Rowley
On Sat, 13 Jun 2020 at 15:11, Tom Lane  wrote:
> I expect you're getting a fairly decent estimate for the "contype <>
> ALL" condition, but the planner has no idea what to make of the CASE
> construct, so it just falls back to a hard-wired default estimate.

This feels quite similar to [1].

I wondered if it would be more simple to add some smarts to look a bit
deeper into case statements for selectivity estimation purposes. An
OpExpr like:

CASE c.contype WHEN 'c' THEN 'CHECK' WHEN 'f' THEN 'FOREIGN KEY' WHEN
'p' THEN 'PRIMARY KEY' WHEN 'u' THEN 'UNIQUE' END = 'CHECK';

could be simplified to c.contype = 'c',  which we should have
statistics for. There'd certainly be case statement forms that
couldn't be simplified, but I think this one could.

David

[1] 
https://www.postgresql.org/message-id/flat/CAApHDvr%2B6%3D7SZBAtesEavgOQ0ZC03syaRQk19E%2B%2BpiWLopTRbg%40mail.gmail.com#3ec465f343f1204446941df29fc9e715




Re: view reading information_schema is slow in PostgreSQL 12

2020-06-12 Thread David Rowley
On Sat, 13 Jun 2020 at 06:26, regrog  wrote:
>
> I'm facing performance issues migrating from postgres 10 to 12 (also from 11
> to 12) even with a new DB.
> The performance difference is huge: 300ms in pg10 vs 3 minutes in pg12.

This appears to be down to bad statistics that cause pg12 to choose a
nested loop plan.  The pg12 plan has:

->  Hash Join  (cost=1281.91..2934.18 rows=68 width=192) (actual
time=0.024..21.915 rows=3538 loops=1848)

on the inner side of a nested loop. 21.915 * 1848 loops is 40498.92
ms, so most of the time.

This comes down to the difference caused by 04fe805a17, where after
that commit we don't bother looking at the NOT NULL constraints in
table_constraints.

explain select * from (select * from
information_schema.table_constraints) c where constraint_type <>
'CHECK';

If you execute the above on both instances, you'll see PG12 does not
do an Append. PG10 does. Which results in more rows being estimated
and the planner choosing something better than a nested loop join.

You could try: SET enable_nestloop TO off;

I'm not really sure there's much you could do to improve the
statistics on the catalogue tables.

Alternatively, you could write a view based directly on the base
tables, bypassing information_schema completely.
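
A minimal sketch of that last idea, going straight to pg_constraint for the
constraint types this query cares about (only a rough approximation of what
the information_schema view exposes):

select conrelid::regclass as table_name,
       conname            as constraint_name,
       case contype
            when 'c' then 'CHECK'
            when 'f' then 'FOREIGN KEY'
            when 'p' then 'PRIMARY KEY'
            when 'u' then 'UNIQUE'
       end                as constraint_type
from pg_constraint
where contype <> 'c';   -- the equivalent of constraint_type <> 'CHECK'

The planner can estimate contype <> 'c' directly from the catalogue
statistics, which sidesteps the CASE-based estimation problem entirely.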

David




Re: Windows slowness?

2020-06-10 Thread David Rowley
On Thu, 11 Jun 2020 at 07:41, Mikkel Lauritsen  wrote:
> I have a query that runs much slower in Postgres on Windows than on
> Linux

> Using explain analyze on the database running on Windows I get
>
> ->  Index Scan using event_pkey on event t1  (cost=0.56..0.95 rows=1
> width=295) (actual time=0.075..0.075 rows=0 loops=229227)

> If I copy the database to my laptop running Linux (Postgres 12 on Fedora
> 32, i7-9750H, 16 GB ram) I get the exact same execution plan. Explain
> analyze says
>
> ->  Index Scan using event_pkey on event t1  (cost=0.56..0.95 rows=1
> width=295) (actual time=0.008..0.008 rows=0 loops=229227)
>

> The table is freshly vacuumed. It has about 10M rows and takes about
> 2.6G disk space; the index is about 600M. Everything is cached; there's
> basically no disk I/O happening while the query is executing.

Can you confirm what: SELECT pg_relation_size('event_pkey'),
pg_relation_size('event'); says on each?

> The only Postgres configuration difference between the Windows and Linux
> environments is shared_buffers, which is 4G on my laptop and 512M on the
> Windows server, and effective_cache_size which are 8G on the laptop and
> 16G on the server.

There is some slight advantage to having the buffers directly in
shared buffers. Having them in the kernel's page cache does still
require getting them into shared buffers. Going by these sizes it
seems much more likely that the Linux instance could have all buffers
in shared_buffers, but it seems likely the Windows instance won't. I
can't imagine that counts for 10x, but it surely must count for
something.

It would be good to see:

SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) <your query>

David




Re: When to use PARTITION BY HASH?

2020-06-07 Thread David Rowley
On Sun, 7 Jun 2020 at 23:41, MichaelDBA  wrote:
> The article referenced below assumes a worst case scenario for bulk-loading 
> with hash partitioned tables.  It assumes that the values being inserted are 
> in strict ascending or descending order with no gaps (like a sequence number 
> incrementing by 1), thereby ensuring every partition is hit in order before 
> repeating the process.  If the values being inserted are not strictly 
> sequential with no gaps, then the performance is much better.  Obviously, 
> what part of the tables and indexes are in memory has a lot to do with it as 
> well.

In PostgreSQL 12, COPY was modified to support bulk-inserts for
partitioned tables. This did speed up many scenarios.  Internally, how
this works is that we maintain a series of multi insert buffers, one
per partition. We generally only flush those buffers to the table when
the buffer for the partition fills.  However, there is a sort of
sanity limit [1] on the number of multi insert buffers we maintain at
once and currently, that is 32.  Technically we could increase that
limit, but there would still need to be a limit.  Unfortunately, for
this particular case, since we're most likely touching between 199-799
other partitions before hitting the first one again, that will mean
that we really don't get any multi-inserts, which is likely the reason
why the performance is worse for hash partitioning.

With PG12 and for this particular case, you're likely to see COPY
performance drop quite drastically when going from 32 to 33
partitions.  The code was more designed for hitting partitions more
randomly rather than in this sort-of round-robin way that we're likely
to get from hash partitioning on a serial column.

David

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/commands/copy.c#L2569




Re: Date vs Timestamp without timezone Partition Key

2020-06-05 Thread David Rowley
On Sat, 6 Jun 2020 at 14:49, Cedric Leong  wrote:
> It's less of a complaint rather than just a warning not to do what I did.

My point was really that nobody really knew what you did or what you
did it on. So it didn't seem like a worthwhile warning as it
completely lacked detail.

> These tests are running the exact same query on two different tables with the 
> exception that they use their respective partition keys.

Are you sure?  It looks like the old one does WHERE date =
((now())::date - '7 days'::interval) and the new version does
(date(created_at) = ((now())::date - '7 days'::interval). I guess you
renamed date to "created_at" and changed the query to use date(). If
that expression is not indexed then I imagine that would be a good
reason for the planner to have moved away from using the index on that
column. Also having date(created_at) will also not allow run-time
pruning to work since your partition key is "created_at".

You might be able to change the query to query a range of value on the
new timestamp column. This will allow you to get rid of the date()
function. For example:

where created_at >= date_trunc('day', now() - '7 days'::interval) and
created_at < date_trunc('day', now() - '6 days'::interval)

David




Re: Date vs Timestamp without timezone Partition Key

2020-06-05 Thread David Rowley
On Sat, 6 Jun 2020 at 14:12, Cedric Leong  wrote:
> Somewhat unrelated but note to anyone who wants to swap out partition keys. 
> Don't create a clone of the table with the new partition key and insert data. 
> It messes up the query planner massively and makes everything much slower.

That complaint would have more meaning if you'd mentioned which
version of PostgreSQL you're using. The performance of partitioning in
PostgreSQL has changed significantly over the past 3 releases. Also
would be useful to know what you've actually done (actual commands).
I can't imagine it makes *everything* slower, so it might be good to
mention what is actually slower.

David




Re: AutoVacuum and growing transaction XID's

2020-05-07 Thread David Rowley
On Fri, 8 May 2020 at 09:18, github kran  wrote:
> 1)  We haven't changed anything related to autovacuum except a work_mem 
> parameter which was increased to 4 GB which I believe is not related to 
> autovacuum

You might want to look into increasing vacuum_cost_limit to something
well above 200 or dropping autovacuum_vacuum_cost_delay down from 20
to something much lower. However, you say you've not changed the
autovacuum settings, but you've also said:

>1)  I see there are 8 Vacuum workers ( Not sure what changed) running in 
> the background and the concern I have is all of these vacuum processes are 
> running with wrap around and while they are running

The default is 3, so if you have 8 then the settings are non-standard.

It might be good to supply the output of:

SELECT name,setting from pg_Settings where name like '%vacuum%';

You should know that the default speed that autovacuum runs at is
quite slow in 9.6. If you end up with all your autovacuum workers tied
up with anti-wraparound vacuums then other tables are likely to get
neglected and that could lead to stale stats or bloated tables. Best
to aim to get auto-vacuum running faster or aim to perform some manual
vacuums of tables that are over their max freeze age during an
off-peak period to make use of the lower load during those times.
Start with tables in pg_class with the largest age(relfrozenxid).
You'll still likely want to look at the speed autovacuum runs at
either way.
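
Something along these lines will show which tables to start with (a sketch):

select oid::regclass       as table_name,
       age(relfrozenxid)   as xid_age
from pg_class
where relkind in ('r', 't', 'm')    -- plain tables, TOAST tables, matviews
order by age(relfrozenxid) desc
limit 20;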

Please be aware that the first time a new cluster crosses the
autovacuum_freeze_max_age threshold can be a bit of a pain point as it
can mean that many tables require auto-vacuum activity all at once.
The impact of this is compounded if you have many tables that never
receive an UPDATE/DELETE as auto-vacuum, in 9.6, does not visit those
tables for any other reason. After the first time, the relfrozenxids
of tables tend to be more staggered so their vacuum freeze
requirements are also more staggered and that tends to cause fewer
problems.

David




Re: Explain plan changes - IN CLAUSE ( Passing direct values Vs INNER Query )

2020-05-07 Thread David Rowley
On Fri, 8 May 2020 at 10:00, David G. Johnston
 wrote:
>
> On Thu, May 7, 2020 at 11:07 AM Amarendra Konda  wrote:
>>
>> EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)  SELECT pa.process_activity_id AS 
>> pa_process_activity_id  FROM process_activity pa WHERE pa.app_id = 
>> '126502930200650' AND pa.created > '1970-01-01 00:00:00'  AND EXISTS ( 
>> SELECT 1 FROM process_instance pi where pi.app_id = pa.app_id AND 
>> pi.process_instance_id = pa.process_instance_id  AND pi.user_id = 
>> '137074931866340') ORDER BY pa.process_instance_id,  pa.created limit 50;
>>
>>
>>->  Index Scan using 
>> process_activity_process_instance_id_app_id_created_idx on 
>> public.process_activity pa  (cost=0.70..1061.62 rows=1436 width=32) (actual 
>> time=0.011..20.320 rows=23506 loops=2)
>
> > Index Cond: ((m.process_instance_id = pi.process_instance_id) AND (m.app_id 
> > = '126502930200650'::bigint) AND (m.created > '1970-01-01 
> > 00:00:00'::timestamp without time zone))
>
> I suppose during the nested loop the inner index scan could limit itself to 
> the first 50 entries it finds (since the first two index columns are being 
> held constant on each scan, m.created should define the traversal order...) 
> so that the output of the nested loop ends up being (max 2 x 50) 100 entries 
> which are then sorted and only the top 50 returned.
>
> Whether the executor could but isn't doing that here or isn't programmed to 
> do that (or my logic is totally off) I do not know.

I think the planner is likely not putting the process_activity table
on the outer side of the nested loop join due to the poor row
estimates.  If it knew that so many rows would match the join then it
likely would have done that to save from having to perform the sort at
all.  However, because the planner has put the process_instance on the
outer side of the nested loop join, it's the pathkeys from that path
that the nested loop node has, which is not the same as what the ORDER
BY needs, so the planner must add a sort step, which means that all
rows from the nested loop plan must be read so that they can be
sorted.

It might be worth trying: create index on process_instance
(user_id,app_id); as that might lower the cost of performing the join
in the opposite order and have the planner prefer that order instead.
If doing that, the OP could then ditch the
fki_conv_konotor_user_user_id index to save space.

If that's not enough to convince the planner that the opposite order
is better then certainly SET enable_sort TO off; would.
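
A minimal sketch of that suggestion (remember to reset enable_sort
afterwards if you try the second part):

CREATE INDEX ON process_instance (user_id, app_id);
-- only if the planner still prefers the sorted plan:
SET enable_sort TO off;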

David




Re: Please help! Query jumps from 1s -> 4m

2020-05-04 Thread David Rowley
On Mon, 4 May 2020 at 02:35, James Thompson  wrote:
> buffers do look different - but still, reading 42k doesn't seem like it would 
> cause a delay of 4m?

You could do: SET track_io_timing TO on;

then: EXPLAIN (ANALYZE, BUFFERS) your query and see if the time is
spent doing IO.

David




Re: 600 million rows of data. Bad hardware or need partitioning?

2020-05-03 Thread David Rowley
On Mon, 4 May 2020 at 15:52, Arya F  wrote:
>
> On Sun, May 3, 2020 at 11:46 PM Michael Lewis  wrote:
> >
> > What kinds of storage (ssd or old 5400 rpm)? What else is this machine 
> > running?
>
> Not an SSD, but an old 1TB 7200 RPM HDD
>
> > What configs have been customized such as work_mem or random_page_cost?
>
> work_mem = 2403kB
> random_page_cost = 1.1

How long does it take if you first do:

SET enable_nestloop TO off;

If you find it's faster then you most likely have random_page_cost set
unrealistically low. In fact, I'd say it's very unlikely that a nested
loop join will be a win in this case when random pages must be read
from a mechanical disk, but by all means, try disabling it with the
above command and see for yourself.

If you set random_page_cost so low to solve some other performance
problem, then you may wish to look at the effective_cache_size
setting. Having that set to something realistic should allow indexes
to be used more in situations where they're likely to not require as
much random I/O from the disk.
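
For example, to try it out in a session before touching
postgresql.conf (the value below is only a placeholder; something like
50-75% of the machine's RAM is typical):

SET effective_cache_size = '4GB';
-- then re-run EXPLAIN (ANALYZE) on the query and see if the plan changes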

David




Re: Duplicate WHERE condition changes performance and plan

2020-04-15 Thread David Rowley
On Thu, 16 Apr 2020 at 07:56, singh...@gmail.com  wrote:
> We have an odd issue where specifying the same where clause twice causes PG 
> to pick a much more efficent plan. We would like to know why.

> The EXPLAIN ANALYZE for both queries can be found here:-
> Query A: https://explain.depesz.com/s/lFuy
> Query B: https://explain.depesz.com/s/Jqmv

This is basically down to poor join selectivity estimation.
The duplicate not-equal clause is not removed by the planner, so its
selectivity is taken into account twice, which reduces the estimated
selectivity of the table named "problem". With that selectivity taken
into account, the query planner thinks a nested loop will be a more
optimal plan, which it seems to be.

Join selectivity estimations can use the most common values lists as
you may see if you look at the pg_stats view for the tables and
columns involved in the join condition.  Perhaps ID columns are not
good candidates to get an MCV list in the stats.  In that case, the
ndistinct estimate will be used.  If there's no MCV list in the stats
then check ndistinct is reasonably accurate. If there is an MCV list,
then you can make that bigger by increasing the statistics targets on
the join columns and running ANALYZE. Note: Planning can become slower
when you increase the statistics targets.
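
A sketch of what I mean, with a placeholder column name (use the
actual join column from your query):

ALTER TABLE problem ALTER COLUMN join_col SET STATISTICS 1000;
ANALYZE problem;

-- check what the stats contain for that column
SELECT n_distinct, most_common_vals IS NOT NULL AS has_mcv
FROM pg_stats
WHERE tablename = 'problem' AND attname = 'join_col';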

Starting with PostgreSQL 9.6, foreign keys are also used to help with
join selectivity estimations. I see you have a suitable foreign key
from the schema you posted.  You might want to add that to the list of
reasons to upgrade.

David




Re: pg12 partitions show bad performance vs pg96

2020-03-09 Thread David Rowley
On Tue, 10 Mar 2020 at 02:08, Mariel Cherkassky
 wrote:

> PG12 - 3 PARTITIONS
>
>  QUERY 
> PLAN
> 
>  Gather  (cost=1002.26..1563512.35 rows=10 width=44) (actual 
> time=22306.091..22309.209 rows=0 loops=1)
>Workers Planned: 2
>Workers Launched: 2
>->  Hash Join  (cost=2.26..1562511.35 rows=4 width=44) (actual 
> time=22299.412..22299.413 rows=0 loops=3)
>  Hash Cond: (da_2.device = de.id)
>  ->  Parallel Append  (cost=0.00..1562506.90 rows=814 width=37) 
> (actual time=22299.411..22299.411 rows=0 loops=3)
>->  Parallel Seq Scan on iot_data_2 da_2  
> (cost=0.00..584064.14 rows=305 width=37) (actual time=9076.535..9076.535 
> rows=0 loops=3)
>  Filter: ((metadata = 50) AND (lower(data) ~~ '50'::text))
>  Rows Removed by Filter: 12460009
>->  Parallel Seq Scan on iot_data_1 da_1  
> (cost=0.00..504948.69 rows=262 width=36) (actual time=10296.751..10296.751 
> rows=0 loops=2)
>  Filter: ((metadata = 50) AND (lower(data) ~~ '50'::text))
>  Rows Removed by Filter: 16158316
>->  Parallel Seq Scan on iot_data_0 da  (cost=0.00..473490.00 
> rows=247 width=37) (actual time=19075.081..19075.081 rows=0 loops=1)
>  Filter: ((metadata = 50) AND (lower(data) ~~ '50'::text))
>  Rows Removed by Filter: 30303339
>  ->  Hash  (cost=2.25..2.25 rows=1 width=7) (never executed)
>->  Seq Scan on iot_device de  (cost=0.00..2.25 rows=1 
> width=7) (never executed)
>  Filter: (name = '50a'::text)
>  Planning Time: 30.429 ms
>  Execution Time: 22309.364 ms
> (20 rows)

From what I can work out, the DDL you used here is:

-- you didn't seem to specify the DDL for iot_device, so I used:
create table iot_device (
id bigint primary key,
name text not null
);

insert into iot_device select x,x::Text || 'a' from generate_Series(1,100) x;

create table iot_data(id serial ,data text,metadata bigint,device
bigint references iot_device(id),primary key(id,device)) partition by
hash(device);
create table iot_data_0 partition of iot_data for values with (MODULUS
3, remainder 0);
create table iot_data_1 partition of iot_data for values with (MODULUS
3, remainder 1);
create table iot_data_2 partition of iot_data for values with (MODULUS
3, remainder 2);

insert into iot_data select
generate_series(1,1000),random()*10,random()*254,random()*99+1;
create index on iot_data(metadata,lower(data));
vacuum analyze iot_data;

In which case, you're getting a pretty different plan than I am. (I
admit that I only tried on current master and not PG12.2, however, I
see no reason that PG12.2 shouldn't produce the same plan)

I get:

# explain analyze select * from iot_data da,iot_device de where
de.name in ('50a') and de.id = da.device and da.metadata=50 and
lower(da.data) like '50';

  QUERY PLAN

 Nested Loop  (cost=0.56..28.04 rows=1 width=49) (actual
time=0.058..0.058 rows=0 loops=1)
   Join Filter: (da.device = de.id)
   ->  Seq Scan on iot_device de  (cost=0.00..2.25 rows=1 width=11)
(actual time=0.013..0.016 rows=1 loops=1)
 Filter: (name = '50a'::text)
 Rows Removed by Filter: 99
   ->  Append  (cost=0.56..25.76 rows=3 width=38) (actual
time=0.040..0.040 rows=0 loops=1)
 ->  Index Scan using iot_data_0_metadata_lower_idx on
iot_data_0 da_1  (cost=0.56..8.58 rows=1 width=38) (actual
time=0.020..0.020 rows=0 loops=1)
   Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
   Filter: (lower(data) ~~ '50'::text)
 ->  Index Scan using iot_data_1_metadata_lower_idx on
iot_data_1 da_2  (cost=0.56..8.58 rows=1 width=38) (actual
time=0.010..0.010 rows=0 loops=1)
   Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
   Filter: (lower(data) ~~ '50'::text)
 ->  Index Scan using iot_data_2_metadata_lower_idx on
iot_data_2 da_3  (cost=0.56..8.58 rows=1 width=38) (actual
time=0.009..0.009 rows=0 loops=1)
   Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
   Filter: (lower(data) ~~ '50'::text)
 Planning Time: 0.280 ms
 Execution Time: 0.094 ms
(17 rows)

Are you certain that you added an index on iot_data (metadata, lower(data)); ?




Re: pg12 partitions show bad performance vs pg96

2020-03-08 Thread David Rowley
On Mon, 9 Mar 2020 at 05:05, Mariel Cherkassky
 wrote:
> PG12 :
>  Planning Time: 8.157 ms
>  Execution Time: 2.920 ms
> (22 rows)
>
>
> PG96 :
>  Planning time: 0.815 ms
>  Execution time: 0.158 ms
> (12 rows)

8 ms seems pretty slow to planning that query. Does the planning time
drop if you execute this multiple times in the same session? Does the
time change if you try again without any foreign keys?

The planning time for the partitioned case is certainly going to take
longer. Partitioned tables are never going to improve the times of
query planning. It's only possible that they'll improve the
performance during query execution.

For such small fast queries as the ones you've shown, it's important
to remember that more complex query plans (ones with more nodes) do
lead to longer times for executor startup and shutdown.  EXPLAIN
(without ANALYZE) will perform query planning and executor
startup/shutdown. If you turn \timing on in psql and test the
EXPLAIN performance of these queries on each version, then you might
get an idea of where the overheads are.
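
For example, with the query from this thread:

\timing on
EXPLAIN SELECT * FROM iot_data da, iot_device de
WHERE de.name IN ('50a') AND de.id = da.device
  AND da.metadata = 50 AND lower(da.data) LIKE '50';

Comparing the times psql reports on each version should show whether
the extra cost is in planning or in executor startup/shutdown.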

Additionally, you're unlikely to see performance improvements with
table partitioning unless you're accessing many rows and partitioning
allows the data locality of the rows that you are accessing to
improve. i.e accesses fewer buffers and/or improves cache hit ratios.
In PG12, if the partition pruning can be done during query planning
then the planning and executor startup overhead is much lower since
there are fewer relations to generate access paths for and fewer nodes
in the final plan. This also improves the situation during execution
as it means fewer locks to take and fewer nodes to startup/shutdown.

> As you can see, the results are better in pg96. This example only shows the 
> results for a small data set. In bigger data sets I get a bigger diff...

Can you share the results of that?

> I tried changing many postgresql.conf parameters that were added 
> (max_workers_per_gather,enable_partitionwise_join and so on..).

The former only does anything for parallel queries. None of the plans
you've shown are parallel ones.   The latter also does not count in
this case. It only counts when joining two identically partitioned
tables.

> I dont understand why in pg12 it scans all the partitions instead of the 
> relevant one..

If you'd specified a specific "device" in the query SQL, then the
query planner would know which partition to scan for that particular
device. However, since you're looking up the device in another table
and performing a join, the device is only known during query
execution. The plan nodes for the non-matching partitions do go
through executor startup, but they're not scanned during execution, as
you've seen with the "(never executed)" appearing in the EXPLAIN
ANALYZE output.  Since executor startup occurs before execution, the
device you mean is still unknown during executor startup, so the
executor must start up the nodes for all partitions that are in the
plan. Starting up a plan node is not free, but not really very
expensive either. However, the overhead of it might be quite large
proportionally in your case since the executor is doing so little
work.

The most basic guideline for table partitioning is: don't partition
your tables unless it's a net win.   If partitioning were always
faster, we'd just have designed Postgres to implicitly partition all
of your tables for you. There are some other guidelines in [1].

[1] 
https://www.postgresql.org/docs/12/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-BEST-PRACTICES

David




Re: Slow performance with trivial self-joins

2020-02-05 Thread David Rowley
On Thu, 6 Feb 2020 at 11:12, Adam Brusselback  wrote:
>
> > You can create a library of
> > reusable views that are small, easy-to-understand and readable. Then
> > you build them up into bigger views, and finally query from them. But
> > then you end up with lots of (hidden) self-joins.
>
> I will concur with this use case being pretty common, but also something I
> have actively avoided anywhere performance is important because of the
> lack of this optimization.
>
> Even still, I have 20+ views like that in my database.

I think the best direction to move in to push that forward would be to
go and benchmark the proposed patch and see if the overhead of
detecting the self joined relations is measurable with various queries
with varying numbers of joins.

It does not sound like it would be a great deal of effort to look
through the rangetable for duplicate Oids and only do further
processing to attempt self-join removal if any are found. However, if that
effort happened to slow down all queries by say 5%, then perhaps it
would be a bad idea.  People's opinions don't really have much
traction for arguments on this. Unbiased and reproducible benchmarks
should be used as evidence to support discussion. Doing worst-case and
average-case benchmarks initially will save you time, as someone will
almost certainly ask if you don't do it.

(I've not been following the thread for the patch)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Slow planning, fast execution for particular 3-table query

2019-11-06 Thread David Rowley
On Thu, 7 Nov 2019 at 11:59, Justin Pryzby  wrote:
>
> On Mon, Nov 04, 2019 at 03:04:45AM +, David Wheeler wrote:
> > Postgres version 9.5.19
> > Each of the tables has between 3-4 indexes, and all the indexes include tid 
> > as first parameter.


> But note that index definition will be prohibited since:
>
> https://www.postgresql.org/docs/9.6/release-9-6.html
> |Disallow creation of indexes on system columns, except for OID columns 
> (David Rowley)
> |Such indexes were never considered supported, and would very possibly 
> misbehave since the system might change the system-column fields of a tuple 
> without updating indexes. However, previously there were no error checks to 
> prevent them from being created.

David will have meant the user column named "tid" rather than the
system column named "ctid".

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Query slow again after adding an `OR` operation (was: Slow PostgreSQL 10.6 query)

2019-10-09 Thread David Rowley
On Thu, 10 Oct 2019 at 01:07, Behrang Saeedzadeh  wrote:
>
> This is a follow up to 
> https://www.postgresql.org/message-id/flat/CAERAJ%2B-1buiJ%2B_JWEo0a9Ao-CVMWpgp%3DEnFx1dJtnB3WmMi2zQ%40mail.gmail.com
>
> The query (generated by Hibernate) got a bit more complex and performance 
> degraded again. I have uploaded all the details here (with changed table 
> names, etc.): https://github.com/behrangsa/slow-query
>
> In short, the new query is:

The query mostly appears slow due to the "Rows Removed By Filter" in
the OR condition. The only way to get around scanning the entire
branch_invoices table would be to somehow write the query in such a way
that allows that table to go on the inner side of the join.

You could do that if you ensure there's an index on branch_invoices
(branch_id) and format the query as:

SELECT inv.id           AS i_id,
       inv.invoice_date AS inv_d,
       inv.invoice_xid  AS inv_xid,
       inv.invoice_type AS inv_type,
       brs.branch_id    AS br_id,
       cinvs.company_id AS c_id
FROM invoices inv
     LEFT OUTER JOIN branch_invoices brs ON inv.id = brs.invoice_id
     LEFT OUTER JOIN company_invoices cinvs ON inv.id = cinvs.invoice_id
     INNER JOIN branches br ON brs.branch_id = br.id
WHERE brs.branch_id IN (SELECT br1.id
                        FROM branches br1
                             INNER JOIN access_rights ar1 ON br1.id = ar1.branch_id
                             INNER JOIN users usr1 ON ar1.user_id = usr1.id
                             INNER JOIN groups grp1 ON ar1.group_id = grp1.id
                             INNER JOIN group_permissions gpr1 ON grp1.id = gpr1.group_id
                             INNER JOIN permissions prm1 ON gpr1.permission_id = prm1.id
                        WHERE usr1.id = 1636
                          AND prm1.code = 'C2'
                          AND ar1.access_type = 'T1'
                        UNION ALL
                        SELECT br3.id
                        FROM companies cmp
                             INNER JOIN branches br3 ON cmp.id = br3.company_id
                             INNER JOIN access_rights ar2 ON cmp.id = ar2.company_id
                             INNER JOIN users usr2 ON ar2.user_id = usr2.id
                             INNER JOIN groups g2 ON ar2.group_id = g2.id
                             INNER JOIN group_permissions gpr2 ON g2.id = gpr2.group_id
                             INNER JOIN permissions prm2 ON gpr2.permission_id = prm2.id
                        WHERE usr2.id = 1636
                          AND prm2.code = 'C2'
                          AND ar2.access_type = 'T1')
ORDER BY inv.invoice_date DESC, br.name ASC
LIMIT 12;

The planner may then choose to pull up the subquery, uniquify it, and
put it on the outside of a nested loop join, then look up the
branch_invoices record using the index on branch_id.  I think this is
quite a likely plan since the planner estimates there's only going to
be 1 row from each of the subqueries.

Also note, that the LEFT JOIN you have to branch_invoices is not
really a left join since you're insisting that the branch_id must be
in the first or 2nd sub-plan. There's no room for it to be NULL. The
planner will just convert that to an INNER JOIN with the above query
since that'll give it the flexibility to put the subquery in the IN
clause on the outside of the join (after having uniquified it).
You'll need to decide what you actually want the behaviour to be here.
If you do need those NULL rows then you'd better move your WHERE quals
down into the join condition for branch_invoices table. I'd suggest
testing with some mock-up data if you're uncertain of what I mean.

If you find that is faster and you can't rewrite the query due to it
having been generated by Hibernate, then that sounds like a problem
with Hibernate.  PostgreSQL does not currently attempt to do any
rewrites which convert OR clauses to use UNION or UNION ALL. No amount
of tweaking the planner settings is going to change that fact.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Get the planner used by a query?

2019-10-09 Thread David Rowley
On Wed, 9 Oct 2019 at 19:21, Behrang Saeedzadeh  wrote:
>
> Is there a way to display the planner algorithm used by a query, either in 
> EXPLAIN or in a different way?

There's not really any simple way to know. If the number of relations
in the join search meets or exceeds geqo_threshold then it'll use the
genetic query optimizer. However, knowing exactly how many relations
are in the join search is not often simple since certain types of
subqueries can be pulled up into the main query and that can increase
the number of relations in the search.

If you don't mind writing C code, then you could write an extension
that hooks into join_search_hook and somehow outputs this information
to you before going on to call the geqo if the "enable_geqo &&
levels_needed >= geqo_threshold" condition is met.  Besides that, I
don't really know if there's any way.  You could try editing the
geqo_seed and seeing if the plan changes, but if it does not, then
that does not mean the geqo was not used, so doing it that way could
be quite error-prone.  You'd only be able to tell the geqo was being
used if you could confirm that changing geqo_seed did change the plan.
(And you could be certain the plan did not change for some other
reason like an auto-analyze).
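
A rough way to poke at it, bearing in mind the caveat above that an
unchanged plan proves nothing:

SHOW geqo_threshold;
SET geqo_seed = 0.5;  -- any value in 0..1 other than the current one
-- then re-run EXPLAIN on the query and compare the join order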



--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Planner performance in partitions

2019-08-12 Thread David Rowley
On Tue, 13 Aug 2019 at 08:03, Piotr Włodarczyk
 wrote:
> PostgreSQL11.5:
>
> select on main partition (public.book): planner: 60ms, execution: 5ms
> select on partitioned table (public.book_1-1000): planner: 2.7 ms, execution: 
> 2,4 ms
>
> PostgreSQL 12B3:
>
> select on main partition (public.book): planner: 2,5ms , execution: 1,2ms
> select on partitioned table (public.book_1-1000): planner: 2.5 ms, execution: 
> 1,2 ms
>
> So looking at above results we have two options:
>
> Wait for 12.0 stable version
> Wait for patches to 11 – PostgreSQL Team: can You do this?

You'll need to either reduce the number of partitions down to
something realistic or wait for 12.0.

The work done to speed up the planner with partitioned tables for v12
won't be going into v11.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: scans on table fail to be excluded by partition bounds

2019-06-26 Thread David Rowley
On Tue, 25 Jun 2019 at 05:31, Justin Pryzby  wrote:
> ts=# explain SELECT * FROM eric_enodeb_cell_metrics WHERE start_time BETWEEN 
> '2019-01-01 04:00' AND '2019-01-01 05:00' OR start_time BETWEEN '2019-01-02 
> 04:00' AND '2019-01-02 05:00';
>  Append  (cost=36.04..39668.56 rows=12817 width=2730)
>->  Bitmap Heap Scan on eric_enodeb_cell_20190101  (cost=36.04..19504.14 
> rows=6398 width=2730)
>  Recheck Cond: (((start_time >= '2019-01-01 04:00:00-05'::timestamp 
> with time zone) AND (start_time <= '2019-01-01 05:00:00-05'::timestamp with 
> time zone)) OR ((start_time >= '2019-01-02 04:00:00-05'::timestamp with time 
> zone) AND (start_time <= '2019-01-02 05:00:00-05'::timestamp with time zone)))
>  ->  BitmapOr  (cost=36.04..36.04 rows=6723 width=0)
>->  Bitmap Index Scan on eric_enodeb_cell_20190101_idx  
> (cost=0.00..16.81 rows=6465 width=0)
>  Index Cond: ((start_time >= '2019-01-01 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-01 
> 05:00:00-05'::timestamp with time zone))
>->  Bitmap Index Scan on eric_enodeb_cell_20190101_idx  
> (cost=0.00..16.03 rows=259 width=0)
>  Index Cond: ((start_time >= '2019-01-02 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-02 
> 05:00:00-05'::timestamp with time zone))
>->  Bitmap Heap Scan on eric_enodeb_cell_20190102  (cost=36.08..20100.34 
> rows=6419 width=2730)
>  Recheck Cond: (((start_time >= '2019-01-01 04:00:00-05'::timestamp 
> with time zone) AND (start_time <= '2019-01-01 05:00:00-05'::timestamp with 
> time zone)) OR ((start_time >= '2019-01-02 04:00:00-05'::timestamp with time 
> zone) AND (start_time <= '2019-01-02 05:00:00-05'::timestamp with time zone)))
>  ->  BitmapOr  (cost=36.08..36.08 rows=6982 width=0)
>->  Bitmap Index Scan on eric_enodeb_cell_20190102_idx  
> (cost=0.00..16.03 rows=259 width=0)
>  Index Cond: ((start_time >= '2019-01-01 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-01 
> 05:00:00-05'::timestamp with time zone))
>->  Bitmap Index Scan on eric_enodeb_cell_20190102_idx  
> (cost=0.00..16.84 rows=6723 width=0)
>  Index Cond: ((start_time >= '2019-01-02 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-02 
> 05:00:00-05'::timestamp with time zone))
>
> Is there some reason why the partition constraints aren't excluding any of the
> index scans ?

Yeah, we don't do anything to remove base quals that are redundant due
to the partition constraint.

There was a patch [1] to try and fix this but it's not seen any recent activity.

[1] https://commitfest.postgresql.org/19/1264/

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Temporarily very slow planning time after a big delete

2019-05-20 Thread David Rowley
On Tue, 21 May 2019 at 14:04, Walter Smith  wrote:
> I'm so sorry -- I meant to give the version, of course. It's 9.6.13.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ca930fc3
has been applied since then.

It would be good if you could confirm the problem is resolved after a
vacuum.  Maybe run VACUUM VERBOSE on the table and double check
there's not some large amount of tuples that are "nonremovable".

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Temporarily very slow planning time after a big delete

2019-05-20 Thread David Rowley
On Tue, 21 May 2019 at 12:44, Walter Smith  wrote:
>
> We had a mysterious (to us) slowdown today that I'm hoping someone can 
> explain just based on PG's principles of operation. It got better by itself 
> so it seems like it was "normal" behavior -- I just don't know what behavior 
> it was exhibiting.
>
> We have a table of user notifications containing about 80 million rows. It 
> gets a lot of inserts continually, and is cleaned up once a day. There are no 
> updates. In all history there have been about 330 million rows created.
>
> Today we deleted about 15 million rows in one transaction from this table. 
> Immediately afterwards, a particular SELECT started running very slowly -- 
> 500 to 3000 ms rather than the usual <1ms.
>
> We did an EXPLAIN ANALYZE on this select and it was still doing an index scan 
> as usual. The *planning time* for the query is what had gotten slow. The 
> query itself was still executing in <1ms.

It would be good to know which version you're running here.  It
basically sounds very much like get_actual_variable_range() will be
the culprit.  Basically, if a constant value that's being used by the
planner to determine row estimates with falls outside the statistic's
histogram and a btree index exists that we can use to look up the
actual bound of the data, then we do so in that function.  If you've
just deleted a bunch of rows then that index scan may have to traverse
a bunch of dead tuples before finding that first live tuple.  This
code has changed a few times in recent times, see fccebe421 and
3ca930fc3, which is why your version is of interest.

Apart from that, if you want to confirm that's the issue and you just
want it fixed, just VACUUM the table. You should likely be doing that
anyway directly after your bulk delete.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Poor man's partitioned index .... not being used?

2019-03-22 Thread David Rowley
On Fri, 22 Mar 2019 at 07:57, Gunther  wrote:
> foo=# PREPARE testplan(int) AS
> foo-#SELECT * FROM Test WHERE mod(id,2) = mod($1,2) AND id = $1;
> PREPARE
> foo=# EXPLAIN EXECUTE testplan(8934);
> QUERY PLAN
> --
>  Index Only Scan using test_pk0 on test  (cost=0.42..8.44 rows=1 width=4)
>Index Cond: (id = 8934)
> (2 rows)
>
> That's quite alright actually. Now the questions is, could we use this in a 
> nested loop query plan? That's where I think it can't work:

Not really. In that case, the parameters were replaced with the
specified values (a.k.a custom plan).  That happens for the first 5
executions of a prepared statement, and in this case likely the
planner will continue to use the custom plan since the generic plan
won't know that the partial index is okay to use and the plan costs
would likely go up enough that the custom plan would continue to be
favoured.

> foo=# SET enable_mergejoin TO off;
> SET
> foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = 
> mod(a.id,2) AND b.id = a.id) LIMIT 10;
>QUERY PLAN
> 
>  Limit  (cost=0.00..102516.78 rows=10 width=8)
>->  Nested Loop Left Join  (cost=0.00..38238760.24 rows=3730 width=8)
>  Join Filter: ((b.id = a.id) AND (mod(b.id, 2) = mod(a.id, 2)))
>  ->  Seq Scan on test2 a  (cost=0.00..54.30 rows=3730 width=4)
>  ->  Materialize  (cost=0.00..9056.93 rows=388129 width=4)
>->  Seq Scan on test b  (cost=0.00..5599.29 rows=388129 
> width=4)
> (6 rows)
>
> It looks like it doesn't want to evaluate the mod(a.id, 2) before it moves to 
> the index query for the nested loop.

Whether partial indexes can be used or not is determined using only
quals that can be applied at the scan level.  In this case your qual
is a join qual, and since no other qual exists that can be evaluated
at the scan level where the index can be used, the index is not
considered.  In any case, nothing there guarantees that one of your
indexes will match all records. For it to work, both of your indexes
would have to be scanned. It's not clear why you think that would be
any better than scanning just one index. I imagine it would only ever
be a win if you could eliminate one of the index scans with some qual
that guarantees that the index can't contain any records matching your
query.

> I wonder if there was a way of marking such expressions as safe in the query, 
> like suggesting a certain evaluation order, i.e.,
>
> SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON(mod(b.id,2) = 
> EVAL(mod(a.id,2)) AND b.id = a.id) LIMIT 10;
>
> It's OK though. It just goes to show that in a case like this, it is best to 
> just go with the partitioned table anyway.

It sounds like you might want something like partition-wise join that
exists in PG11.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Poor man's partitioned index .... not being used?

2019-03-20 Thread David Rowley
On Thu, 21 Mar 2019 at 15:51, Gunther  wrote:
> foo=# CREATE UNIQUE INDEX Test_pk0 ON Test(id) WHERE mod(id,2) = 0;
> CREATE INDEX
> foo=# CREATE UNIQUE INDEX Test_pk1 ON Test(id) WHERE mod(id,2) = 1;
> CREATE INDEX
> foo=# ANALYZE Test;
> ANALYZE
> foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934';
>   QUERY PLAN
> ---
>  Seq Scan on test  (cost=0.00..6569.61 rows=1 width=4)
>Filter: (id = 8934)
> (2 rows)
>
> Why is that index never used?

When the planner looks at partial indexes to see if they'll suit the
scan, the code that does the matching (predicate_implied_by()) simply
does not go to that much trouble to determine if it matches. If you
look at operator_predicate_proof() you'll see it requires the
expression on at least one side of the OpExpr to match your predicate.
Yours matches on neither side since "id" is wrapped up in a mod()
function call.

Certainly, predicate_implied_by() is by no means finished, new smarts
have been added to it over the years to allow it to prove more cases,
but each time something is added we still need to carefully weigh up
the additional overhead of the new code vs. possible benefits.

It may be possible to do something with immutable functions found in
the expr but someone doing so might have a hard time proving that it's
always safe to do so. For example, arg 2 of your mod() call is a
Const. If it had been another Var then it wouldn't be safe to use.
What other unsafe cases are there? Is there a way we can always
identify unsafe cases during planning? ... are the sorts of questions
someone implementing this would be faced with.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: IS NOT DISTINCT FROM statement

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 01:25, Artur Zając  wrote:
> I made some workaround. I made function:
>
> CREATE OR REPLACE FUNCTION smarteq(v1 int,v2 INT) RETURNS BOOL AS
> $BODY$
> SELECT (CASE WHEN v2 IS NULL THEN (v1 IS NULL) ELSE v1=v2 END);
> $BODY$ LANGUAGE 'sql' IMMUTABLE PARALLEL SAFE;

> explain analyze select id from sometable where smarteq(id1,NULL);
>   QUERY PLAN
> --
>  Bitmap Heap Scan on sometable  (cost=19338.59..57045.02 rows=882343 width=4) 
> (actual time=116.236..306.304 rows=881657 loops=1)
>Recheck Cond: (id1 IS NULL)
>Heap Blocks: exact=9581
>->  Bitmap Index Scan on sometable_index1  (cost=0.00..19118.00 
> rows=882343 width=0) (actual time=114.209..114.209 rows=892552 loops=1)
>  Index Cond: (id1 IS NULL)
>  Planning time: 0.135 ms
>  Execution time: 339.229 ms
>
> It looks like it works, but I must check if it will still works in plpgsql (I 
> expect some problems if query is prepared).

I think with either of those you'll just be at the mercy of whether a
generic or custom plan is chosen.  If you get a custom plan then
likely your case statement will be inlined and constant folded away,
but for a generic plan, that can't happen since those constants are
not Consts, they're Params.   Most likely, if you've got an index
on the column you'll perhaps always get a custom plan, as the generic
plan would result in a seq scan and it would have to evaluate your case
statement for each row.  By default, generic plans are only considered
from the 6th execution of a prepared statement onwards, and are only
chosen if the generic cost is cheaper than the average custom plan cost
plus a fuzz cost for planning. PG12 gives you a bit more control over
that with the plan_cache_mode GUC, but that's not out yet.
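
For reference, on PG12 and later that would look like this (not an
option on the versions that are out now):

SET plan_cache_mode = force_custom_plan;  -- or force_generic_plan / auto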

However, possibly the cost of planning each execution is cheaper than
doing the seq scan, so you might be better off with this.  There is a
risk that the planner does for some reason choose a generic plan and
ends up doing the seq scan, but for that to happen likely the table
would have to be small, in which case it wouldn't matter or the costs
would have to be off, which might cause you some pain.

The transformation mentioned earlier could only work if the arguments
of the IS NOT DISTINCT FROM were Vars or Consts. It couldn't work with
Params since the values are unknown to the planner.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: IS NOT DISTINCT FROM statement

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 00:30, Artur Zając  wrote:
> Is there any reason that „NOT DISTINCT FROM” can’t be autotransformed to „=” 
> when value on right side of expression is not NULL or is this any way to use 
> index with „IS NOT DISTINCT FROM” statement?

Probably nothing other than nobody has done it yet.  It might be
reasonable to have some transformation stage called from
distribute_restrictinfo_to_rels() when adding single rel RestrictInfos
to RTE_RELATION base rels.  It's only these you can check for NOT NULL
constraints, i.e. not so possible with rtekinds such as RTE_FUNCTION
and the like.

It becomes more complex if you consider that someone might have added
a partial index on the relation that matches the IS NOT DISTINCT FROM
clause.  In this case, they might not be happy that their index can no
longer be used. Fixing that would require some careful surgery on
predicate_implied_by() to teach it about IS NOT DISTINCT FROM clauses.
However, that seems to go a step beyond what predicate_implied_by()
does for now. Currently, it only gets to know about quals. Not the
relations they belong to, so there'd be no way to know that the NOT
NULL constraint exists from there.  I'm not sure if there's a good
reason for this or not, it might be because it's not been required
before.  It gets more complex still if you want to consider other
quals in the list to prove not nullness.

In short, probably possible, but why not just write an equality
clause, if you know NULLs are not possible?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-23 Thread David Rowley
On Sun, 24 Feb 2019 at 10:06, Gunther  wrote:
> I am using an SQL queue for distributing work to massively parallel workers. 
> Workers come in servers with 12 parallel threads. One of those worker sets 
> handles 7 transactions per second. If I add a second one, for 24 parallel 
> workers, it scales to 14 /s. Even a third, for 36 parallel workers, I can add 
> to reach 21 /s. If I try a fourth set, 48 workers, I end up in trouble. But 
> that isn't even so much my problem rather than the fact that in short time, 
> the performance will deteriorate, and it looks like that is because the queue 
> index deteriorates and needs a REINDEX.

It sounds very much like auto-vacuum is simply unable to keep up with
the rate at which the table is being updated.   Please be aware, that
by default, auto-vacuum is configured to run fairly slowly so as not
to saturate low-end machines.

vacuum_cost_limit / autovacuum_vacuum_cost_limit control how many
"points" the vacuum process can accumulate before it sleeps for
autovacuum_vacuum_cost_delay / vacuum_cost_delay.

Additionally, after an auto-vacuum run completes it will wait for
autovacuum_naptime before checking again if any tables require some
attention.

I think you should be monitoring how many auto-vacuum workers are
busy during your runs. If you find that the "queue" table is being
vacuumed almost constantly, then you'll likely want to increase
vacuum_cost_limit / autovacuum_vacuum_cost_limit. You could get an
idea of how often this table is being auto-vacuumed by setting
log_autovacuum_min_duration to 0 and checking the logs.  Another way
to check would be to sample what: SELECT query FROM pg_stat_activity
WHERE query LIKE 'autovacuum%'; returns. You may find that all of the
workers are busy most of the time.  If so, that indicates that the
cost limits need to be raised.
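
Something like this, sampled every so often, would show it (the new
cost limit below is only an example value; tune it to what your I/O
can sustain):

SELECT count(*) FROM pg_stat_activity WHERE query LIKE 'autovacuum%';

-- if the workers are busy nearly all the time:
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
SELECT pg_reload_conf();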

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why isn't an index scan being used?

2019-02-19 Thread David Rowley
On Wed, 20 Feb 2019 at 13:11, Abi Noda  wrote:
> However, when I index the closed column, a bitmap scan is used instead of an 
> index scan, with slightly slower performance. Why isn't an index scan being 
> used, given that the exact same number of rows are at play as in my query on 
> the state column?

That's down to the planner's cost estimates. Likely it thinks that
either doing a bitmap scan is cheaper, or close enough that it does
not matter.

> How do I index closed in a way where an index scan is used?

The costing does account for the size of the index. If the
"closed_index" index is larger than the "state_index", then doing an
index scan on "closed_index" is going to be costed higher.

Most of this likely boils down to random_page_cost being a guess. You
may want to check that your effective_cache_size is set to something
like 75% of the machine's memory, and/or tweak random_page_cost down
if it's still at the standard setting of 4.  Modern SSDs are pretty
fast at random reads. HDDs, not so much.
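
For example (the values are only illustrative; adjust them to the
machine's actual RAM and storage):

ALTER SYSTEM SET effective_cache_size = '12GB';  -- roughly 75% of a 16GB machine
ALTER SYSTEM SET random_page_cost = 1.1;         -- closer to reality for SSDs
SELECT pg_reload_conf();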

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread David Rowley
On Thu, 7 Feb 2019 at 02:34, Mariel Cherkassky
 wrote:
> As I said, I set the next settings for the toasted table :
>
>  alter table orig_table set (toast.autovacuum_vacuum_scale_factor = 0);
>
>   alter table orig_table set (toast.autovacuum_vacuum_threshold  =1);

These settings don't control how fast auto-vacuum runs, just when it should run.

> Can you explain a little bit more why you decided that the autovacuum spent 
> it time on sleeping ?

Yeah, if you look at the following settings.

 vacuum_cost_limit   | 200
 vacuum_cost_page_dirty  | 20
 vacuum_cost_page_hit| 1
 vacuum_cost_page_miss   | 10
 autovacuum_vacuum_cost_delay| 20ms

I've tagged on the default setting for each of these.  Both vacuum and
auto-vacuum keep score of how many points they've accumulated while
running. 20 points for dirtying a page, 10 for a read that's not found
to be in shared_buffers, 1 for reading a buffer from shared buffers.
When vacuum_cost_limit points is reached (or
autovacuum_vacuum_cost_limit if not -1) auto-vacuum sleeps for
autovacuum_vacuum_cost_delay, normal manual vacuums sleep for
vacuum_cost_delay.

In one of the log entries you saw:

> buffer usage: 4808221 hits, 6404148 misses, 6152603 dirtied
> avg read rate: 2.617 MiB/s, avg write rate: 2.514 MiB/s
> system usage: CPU 148.65s/70.06u sec elapsed 19119.55 sec

Doing a bit of maths to see how much time that vacuum should have slept for:

postgres=# select (4808221 * 1 + 6404148 * 10 + 6152603 * 20) / 200.0
* 20 / 1000;
  ?column?
------------
 19190.1761

That's remarkably close to the actual time of 19119.55 sec. If you do
the same for the other 3 vacuums then you'll see the same close match.

> I see the autovacuum statistics from the logs, how can I check that the 
> workers are busy very often ?

It would be nice if there was something better, but periodically doing:

SELECT count(*) FROM pg_stat_activity where query like 'autovacuum%';

will work.

> My vacuum limit is 200 right now, basically If vacuum runs on my toasted 
> table and reached 200 but it didnt finish to clean all the dead tuples, after 
> the nap, should it continue cleaning it or wait until the vacuum_threshold 
> hit again ?

You're confusing nap time with something else, maybe with the speed of
vacuum?  Napping is just the time auto-vacuum will wait between
checking for new tables to work on.  Having auto-vacuum run so slowly
is a probable cause of still having dead tuples after the vacuum...
likely because they became dead after the vacuum started.

I'd recommend reading the manual or Tomas Vondra's blog about vacuum
costs. It's not overly complex, once you understand what each of the
vacuum settings does.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread David Rowley
On Thu, 7 Feb 2019 at 00:17, Laurenz Albe  wrote:
>
> On Wed, 2019-02-06 at 12:29 +0200, Mariel Cherkassky wrote:
> > Now the question is how to handle or tune it ? Is there any change that I 
> > need to increase the cost_limit / cost_delay ?
>
> Maybe configuring autovacuum to run faster will help:
>
> alter table orig_table set (toast.autovacuum_vacuum_cost_limit = 2000);
>
> Or, more extreme:
>
> alter table orig_table set (toast.autovacuum_vacuum_cost_delay = 0);

Going by the block hits/misses/dirtied and the mentioned vacuum times,
it looks like auto-vacuum is set to the standard settings and if so it
spent about 100% of its time sleeping on the job.

It might be a better idea to consider changing the vacuum settings
globally rather than just for one table.

Running a vacuum_cost_limit of 200 is likely something you'd not want
to ever do with modern hardware... well maybe unless you just bought
the latest Raspberry PI, or something.  You should be tuning that
value to something that runs your vacuums to a speed you're happy with
but leaves enough IO and CPU for queries running on the database.

If you see that all auto-vacuum workers are busy more often than not,
then they're likely running too slowly and should be set to run more
quickly.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Zero throughput on a query on a very large table.

2019-01-24 Thread David Rowley
On Fri, 25 Jan 2019 at 19:24, Tom Lane  wrote:
> PS: On the third hand, you mention having created new indexes on this
> table with apparently not a lot of pain, which is a tad surprising
> if you don't have the patience to wait for a sort to finish.  How
> long did those index builds take?

It would certainly be good to look at psql's \d tmp_outpatient_rev
output to ensure that the index is not marked as INVALID.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Parallel stats in execution plans

2019-01-24 Thread David Rowley
On Thu, 17 Jan 2019 at 00:31, David Conlin  wrote:
> How the time values combine with parallelism. For example, each execution of 
> the sort node takes an average of 48.5s, over three loops. This makes a total 
> running time of 145.5s. Even if this was perfectly distributed between the 
> two workers, I would expect this to take 72.75s, which is more than the total 
> execution time, so it can't take this long.
>  How the row numbers combine with those in the "Workers" subkey. For example, 
> in the Sort node, worker #0 produces 3,277,980 rows, while worker #1 produces 
> 3,385,130 rows. The Sort node as a whole produces 3,333,337 rows per loop, 
> for a total of 10,000,010 (the value in the gather merge node). I would have 
> expected the number of rows produced by the two workers to sum to the number 
> produced by the Sort node as a whole, either per loop or in total.
> How the "Actual Loops" values combine with those in the "Workers" subkey. For 
> example, the "Sort" node has 3 loops, but each of the workers inside it have 
> 1 loop. I would have expected either:
>
> each of the workers to have done 3 loops (since the sort is executed 3 
> times), or
> the number of loops in the two workers to sum to three (if the three 
> executions of the sort are distributed across the two workers)

It's important to know that all of the actual row counts and actual
time are divided by the number of loops, which in this case is 3, one
per process working on that part of the plan.  There are two workers,
but also the main process helps out too.
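
To put rough numbers on that from your plan: worker #0 produced
3,277,980 rows and worker #1 produced 3,385,130, which leaves roughly
3,336,900 rows for the leader; those three sum to the 10,000,010 total,
and 10,000,010 / 3 loops gives the 3,333,337 rows per loop shown for
the Sort node. Likewise, the 48.5s figure is already a per-process
average of three processes running concurrently, so the wall-clock
time spent sorting is around 48.5s, not 145.5s.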

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-09 Thread David Rowley
On Thu, 10 Jan 2019 at 01:55, Abadie Lana  wrote:
> The other query suggested by D.Rowley has the same issue : still swap 
> activity is higher.
> explain analyze select 'BUIL-B36-VA-RT-RT1:CL0001-2-ABW', 
> c.smpl_time,c.nanosecs,c.float_val,c.num_val,c.str_val,c.datatype,c.array_val 
> from sample c where c.channel_id in (select channel_id from channel where 
> name ='BUIL-B36-VA-RT-RT1:CL0001-2-ABW')   order by c.smpl_time  desc limit 5;

This is not the query I suggested.  I mentioned if channel.name had a
unique index, you'd be able to do WHERE c.channel_id = (select
channel_id from channel where name = '...').  That's pretty different
to what you have above.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-03 Thread David Rowley
On Fri, 4 Jan 2019 at 02:20, Abadie Lana  wrote:
> > From: David Rowley 
> > Sent: 03 January 2019 14:01
> Right, so you need to check your indexes on sample_ctrl_year and 
> sample_buil_year. You need an index on (channel_id, smpl_time) on those.

> These indexes exist already

That's interesting. The \d output indicates that the indexes are not
INVALID, so it's not all that obvious why the planner would choose a
lesser index to provide the required rows. One thought is that the
more suitable index is very bloated.  This would increase the
estimated cost of scanning the index and reduce the chances of the
index being selected by the query planner.

If you execute:

select indrelid::regclass as table_name, indexrelid::Regclass as
index_name,pg_size_pretty(pg_relation_size(indrelid))
table_size,pg_size_pretty(pg_relation_size(indexrelid)) index_size
from pg_index
where indrelid in('sample_ctrl_year'::regclass, 'sample_buil_year'::regclass)
order by indrelid::regclass::name, indexrelid::regclass::name;

This should show you the size of the tables and indexes in question.
If the sample_time_cy_idx and sample_time_by_idx indexes are very
large when compared with the size of their table, then it is likely
worth building a new index for these then dropping the old index then
retrying the re-written version of the query.  If this is a live
system then you can build the new indexes by using the CREATE INDEX
CONCURRENTLY command.  This will allow other DML operations to work
without being blocked. The old indexes can then be dropped with DROP
INDEX CONCURRENTLY.
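
A minimal sketch, assuming sample_time_cy_idx turns out to be the
bloated (channel_id, smpl_time) index on sample_ctrl_year (repeat the
same for sample_buil_year / sample_time_by_idx):

CREATE INDEX CONCURRENTLY sample_time_cy_idx_new
    ON sample_ctrl_year (channel_id, smpl_time);
DROP INDEX CONCURRENTLY sample_time_cy_idx;
ALTER INDEX sample_time_cy_idx_new RENAME TO sample_time_cy_idx;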

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-03 Thread David Rowley
> From: David Rowley 
> Sent: 03 January 2019 14:01
> That's surprisingly slow. Can you share the EXPLAIN (ANALYZE, BUFFERS) of 
> that?
>
> explain (analyze,buffers) select 
> 'BUIL-B36-VA-RT-RT1:CL0001-2-ABW',c.smpl_time,c.nanosecs,c.float_val,c.num_val,c.str_val,c.datatype,c.array_val
>  from sample c WHERE c.channel_id = (SELECT channel_id FROM channel WHERE 
> name='BUIL-B36-VA-RT-RT1:CL0001-2-ABW') order by c.smpl_time desc limit 5;


>->  Index Scan Backward using smpl_time_bx2_idx on 
> sample_buil_year c_5  (cost=0.56..2023925.30 rows=3162364
> width=320) (actual time=15167.330..15167.330 rows=0 loops=1)
>  Filter: (channel_id = $0)
>  Rows Removed by Filter: 50597834
>  Buffers: shared hit=25913147 read=713221
>->  Index Scan Backward using sample_time_cm_idx on 
> sample_ctrl_month c_6  (cost=0.56..1862587.12 rows=537562
>  width=77) (actual time=0.048..0.048 rows=0 loops=1)
>  Index Cond: (channel_id = $0)
>  Buffers: shared read=4
>->  Index Scan Backward using smpl_time_cmx2_idx on 
> sample_ctrl_year c_7  (cost=0.57..3186305.67 rows=2094186
>  width=68) (actual time=25847.549..25847.549 rows=0 loops=1)
>  Filter: (channel_id = $0)
>  Rows Removed by Filter: 79579075
>  Buffers: shared hit=49868991 read=1121715

Right, so you need to check your indexes on sample_ctrl_year and
sample_buil_year. You need an index on (channel_id, smpl_time) on
those.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-03 Thread David Rowley
On Fri, 4 Jan 2019 at 01:57, Abadie Lana  wrote:
> 4) name is unique, constraint and index created. Right index is picked up and 
> query time is rather constant there 40sec.

That's surprisingly slow. Can you share the EXPLAIN (ANALYZE, BUFFERS) of that?


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


